Python Web Scraping with Python 3.4: Getting Started

  • Thread starter: TheDemx27
The discussion centers on a web scraping problem in Python 3.4. The original code tries to retrieve the HTML source of several URLs but fails because it passes the whole list of URLs to `urlopen()` instead of a single URL. Participants point out that one URL must be passed on each pass through the loop. An alternative structure is also proposed: a function that scrapes one URL, applied to the list with `map()`. Finally, there are comments on readability, recommending `for` loops (with `enumerate()` when the index is needed) over index-driven `while` loops.
TheDemx27
I'm starting this web scraper, and all I'm trying to do so far is to just retrieve the source code from the sites.

Code:
import urllib.request
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

i = 0

while i < len(urls):

	htmlfile = urllib.request.urlopen(urls)
	htmltext = htmlfile.read()
	print(htmltext)
	i+=1

I run it, and I get this error:

Code:
Traceback (most recent call last):
  File "C:\Python34\WebScraper.py", line 10, in <module>
    htmlfile = urllib.request.urlopen(urls)
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 446, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
[Finished in 0.2s with exit code 1]

I'm using Python 3.4.
 
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

The bad thing about languages with no static type checking (like Python) is that when things go wrong, the result can be total confusion; hence the not very helpful error message about timeout.
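
To make the point above concrete: urlopen() accepts either a URL string or a Request object, so a list is not rejected up front and only fails later, when the library tries to set a timeout attribute on it. Below is a minimal sketch of a guard that fails earlier with a clearer message. The helper name fetch is hypothetical and not from the thread, and the example never touches the network because the check runs before urlopen() is called.

```python
import urllib.request


def fetch(url):
    """Return the raw bytes of one page (hypothetical helper)."""
    # Fail early with a clear message instead of the AttributeError
    # about 'timeout' shown in the traceback.
    if not isinstance(url, str):
        raise TypeError("expected a single URL string, got " + type(url).__name__)
    with urllib.request.urlopen(url) as response:
        return response.read()


urls = ["http://google.com", "http://nytimes.com"]

try:
    fetch(urls)  # the original mistake: passing the whole list
except TypeError as err:
    message = str(err)
    print(message)
```

Calling fetch(urls[0]) instead would pass a single string and go on to download the page.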
 
Also you might consider writing something like this:

Code:
import urllib.request

urls = ["http://www.google.com", "http://www.nytimes.com", "http://www.rockpapershotgun.com"]

# Create a function that returns the result of a page scrape
def scrape_url(url):
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    return htmltext

# Apply scrape_url to every element of urls and convert the resulting iterator into a list
scrape = list(map(scrape_url, urls))

# Print the list
print(scrape)
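
One thing to be aware of with the output: read() returns bytes, so print() shows a b'...' literal rather than readable text. The sketch below decodes each page to str. It assumes the charset declared in the Content-Type header is correct and falls back to UTF-8 when none is given (the fallback is an assumption, not something from the thread). The name scrape_text is hypothetical, and the data: URLs are only there so the example runs without a network connection.

```python
import urllib.request


def scrape_text(url):
    """Fetch a URL and return its body as str instead of bytes."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Charset from the Content-Type header, if one was sent.
        charset = response.headers.get_content_charset() or "utf-8"
    # errors="replace" keeps going if the page contains bytes that do not decode.
    return raw.decode(charset, errors="replace")


# Offline stand-ins for real pages (urllib handles data: URLs since Python 3.4).
demo_urls = [
    "data:text/plain;charset=utf-8,hello",
    "data:text/plain;charset=utf-8,world",
]

pages = list(map(scrape_text, demo_urls))
print(pages)
```

Swapping demo_urls for the urls list from the post above would fetch the real sites in the same way.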
 
AlephZero said:
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

Good point. Programming never fails to make me feel idiotic. :P

Code:
import urllib.request

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

incrimentSite = 0  # index of the current entry in urls

while incrimentSite < len(urls):
    htmlfile = urllib.request.urlopen(urls[incrimentSite])
    htmltext = htmlfile.read()
    print(htmltext)
    incrimentSite += 1  # next url

Works for me. Thank you.
 
What is all this?

Code:
while i < len (x):
    print (x[i])
    i +=1

Is that Python? It looks like a C programmer tried to write Python.

Might I suggest this:
Code:
things = ["alice", "bob", "carol"]
for t in things:
    print(t)

If you need the index, use this:

Code:
things = ["alice", "bob", "carol"]
for i, t in enumerate(things):
    print(t, "has list index", i)
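
Applied to the scraper from the first post, the same idiom might look like the sketch below. The try/except is an addition that is not in the thread: it assumes you would rather skip a URL that fails than stop the whole run. The names fetch and scrape_all and the fetcher parameter are hypothetical. The fetcher is passed in so the loop can be tried with a stand-in function and no network connection.

```python
import urllib.error
import urllib.request


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def scrape_all(urls, fetcher=fetch):
    """Return {url: bytes} for every URL that could be fetched."""
    results = {}
    for i, url in enumerate(urls):
        try:
            results[url] = fetcher(url)
        except urllib.error.URLError as err:
            # HTTPError is a subclass of URLError, so HTTP failures land here too.
            print("skipping item", i, url, "->", err)
    return results


urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]
# pages = scrape_all(urls)  # uncomment to fetch the real sites
```

No index variable has to be created, compared, or incremented by hand, which also removes the chance of an infinite loop from a forgotten increment.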
 