
Python Web Scraper

  1. Mar 27, 2014 #1

    TheDemx27

    Gold Member

    I'm starting a web scraper, and so far all I'm trying to do is retrieve the HTML source from a few sites.

    Code:
    Code (Text):
    import urllib.request
    from urllib.request import urlopen

    urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

    i = 0

    while i < len(urls):

        htmlfile = urllib.request.urlopen(urls)
        htmltext = htmlfile.read()
        print(htmltext)
        i+=1

    When I run it, I get this error:

    Code (Text):

    Traceback (most recent call last):
      File "C:\Python34\WebScraper.py", line 10, in <module>
        htmlfile = urllib.request.urlopen(urls)
      File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Python34\lib\urllib\request.py", line 446, in open
        req.timeout = timeout
    AttributeError: 'list' object has no attribute 'timeout'
    [Finished in 0.2s with exit code 1]
     
    I'm using Python 3.4.
     
    Last edited by a moderator: May 6, 2017
  2. jcsd
  3. Mar 27, 2014 #2

    AlephZero

    Science Advisor
    Homework Helper

    Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

    The bad thing about languages without static type checking (like Python) is that when things go wrong, the result can be total confusion, hence the not very helpful error message about timeout.
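
To make this kind of mistake fail with a clearer message, one option is to check the argument before handing it to urlopen(). This is only a minimal sketch: fetch() is a hypothetical helper, not part of the original script, and it deliberately accepts nothing but a single URL string (urlopen() itself also accepts a Request object).

```python
import urllib.request


def fetch(url):
    """Fetch one page, rejecting anything that is not a single URL string.

    Hypothetical helper for illustration only.
    """
    if not isinstance(url, str):
        # Fail early with a message that names the actual problem,
        # instead of an AttributeError from deep inside urllib.
        raise TypeError(
            "expected one URL string, got {}".format(type(url).__name__)
        )
    with urllib.request.urlopen(url) as response:
        return response.read()
```

Passing the whole list, as in the first post, would then raise a TypeError that mentions `list` before any network request is made.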
     
  4. Mar 27, 2014 #3

    DavidSnider

    Gold Member

    Also, you might consider writing something like this:

    Code (Text):

    import urllib.request
    from urllib.request import urlopen
    urls = ["http://www.google.com", "http://www.nytimes.com", "http://www.rockpapershotgun.com"]

    # create a function that returns the result of a page scrape
    def scrape_url(url):
        htmlfile = urllib.request.urlopen(url)
        htmltext = htmlfile.read()
        return htmltext

    # apply the scrape_url function to every element in the urls list and convert the resulting iterator into a list
    scrape = list(map(scrape_url, urls))

    # print the list
    print(scrape)
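
One thing to be aware of: read() returns bytes, so printing the result shows b'...' rather than readable text. Here is a hedged sketch of decoding it, assuming the server declares its charset in the Content-Type header and falling back to UTF-8 when it does not. The names decode_body() and scrape_text() and the 10-second timeout are illustrative choices, not part of the code above.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Turn the bytes from read() into text, falling back to UTF-8."""
    # errors="replace" keeps going if a page contains bytes that do not
    # match the declared (or assumed) encoding.
    return raw.decode(charset or "utf-8", errors="replace")


def scrape_text(url, timeout=10):
    """Fetch one page and return its body as a str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # The charset declared in the Content-Type header, or None.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

With that in place, `list(map(scrape_text, urls))` would give a list of strings instead of a list of bytes objects.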

     
     
    Last edited by a moderator: May 6, 2017
  5. Mar 28, 2014 #4

    TheDemx27

    Gold Member

    Good point. Programming never fails to make me feel idiotic. :P

    Code (Text):
    import urllib.request
    from urllib.request import urlopen

    urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

    incrimentSite = 0  # index of the current URL in the urls list

    while incrimentSite < len(urls):
        htmlfile = urllib.request.urlopen(urls[incrimentSite])
        htmltext = htmlfile.read()
        print(htmltext)
        incrimentSite += 1  # move on to the next URL

    Works for me. Thank you.
     
    Last edited by a moderator: May 6, 2017
  6. Apr 17, 2014 #5
    What is all this?

    Code (Text):

    while i < len (x):
        print (x[i])
        i +=1
     
    Is that Python? It looks like a C programmer tried to write Python.

    Might I suggest this:
    Code (Text):

    things = ["alice", "bob", "carol"]
    for t in things:
        print(t)
     
    If you need the index, use this:

    Code (Text):

    things = ["alice", "bob", "carol"]
    for i, t in enumerate(things):
        print(t, "has list index", i)
     
     