Python Web Scraping with Python 3.4: Getting Started

  • Thread starter: TheDemx27
  • Tags: Python, Web
AI Thread Summary
The discussion centers around a web scraping issue using Python 3.4. The original code attempts to retrieve HTML source code from multiple URLs but encounters an error because it incorrectly passes a list of URLs to the `urlopen()` function instead of a single URL. Participants suggest corrections, emphasizing the need to pass one URL at a time within the loop. A more efficient approach is proposed, utilizing a function to scrape each URL and applying it to the list using `map()`. Additionally, there are comments on improving code readability and structure, recommending the use of `for` loops instead of `while` loops for better clarity. Overall, the conversation highlights common pitfalls in Python programming and offers solutions to enhance the code's functionality.
TheDemx27
I'm starting this web scraper, and all I'm trying to do so far is to just retrieve the source code from the sites.

Code:
import urllib.request
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

i = 0

while i < len(urls):

	htmlfile = urllib.request.urlopen(urls)
	htmltext = htmlfile.read()
	print(htmltext)
	i+=1

I run it, and I get this error:

Code:
Traceback (most recent call last):
  File "C:\Python34\WebScraper.py", line 10, in <module>
    htmlfile = urllib.request.urlopen(urls)
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 446, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
[Finished in 0.2s with exit code 1]

I'm using python 3.4.
 
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

The bad thing about languages with no static type checking (like Python) is that a wrong argument type is only caught at run time, often deep inside library code, and the result can be total confusion - hence the not very helpful error message about timeout.
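For what it's worth, the message mentions timeout because urlopen() accepts either a URL string or a Request object: anything that is not a string is treated as a Request, and the library then tries to set a timeout attribute on it. A minimal sketch of a wrapper that fails earlier with a clearer message might look like this. The name fetch_one is made up for illustration and is not part of urllib.

```python
import urllib.request


def fetch_one(url):
    """Fetch a single URL and return the raw bytes of the response.

    fetch_one is a hypothetical helper, not a urllib function. It rejects
    anything that is not a string before urlopen() gets a chance to fail
    with a less obvious AttributeError.
    """
    if not isinstance(url, str):
        raise TypeError(
            "expected one URL string, got %s" % type(url).__name__
        )
    with urllib.request.urlopen(url) as response:
        return response.read()
```

Calling fetch_one(urls) with the whole list now raises a TypeError that names the actual problem, while fetch_one(urls[0]) behaves like the plain urlopen() call.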
 
Also you might consider writing something like this:

Code:
import urllib.request
from urllib.request import urlopen
urls = ["http://www.google.com", "http://www.nytimes.com", "http://www.rockpapershotgun.com"]

# create a function that returns the result of a page scrape
def scrape_url(url):
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    return htmltext

# apply the scrape_url function to every element in the urls list and convert the resulting iterator into a list
scrape = list(map(scrape_url, urls))

# print the list
print(scrape)
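One thing to be aware of: read() returns bytes, so the printed list will show b'...' literals rather than text. A rough sketch of a variant that decodes each page is below. The name scrape_text and the UTF-8 fallback are assumptions for illustration; a real page may declare its encoding only inside the HTML, which this does not handle.

```python
import urllib.request


def scrape_text(url):
    """Return the page body as str instead of bytes.

    scrape_text is a hypothetical helper. It uses the charset from the
    Content-Type header when the server sends one, and otherwise falls
    back to UTF-8 (an assumption, not something the server guarantees).
    """
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        charset = response.headers.get_content_charset() or "utf-8"
    # errors="replace" keeps going if a few bytes do not decode cleanly
    return raw.decode(charset, errors="replace")
```

It can be dropped into the same map() call, e.g. list(map(scrape_text, urls)).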
 
AlephZero said:
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

Good point. Programming never fails to make me feel idiotic. :P

Code:
import urllib.request
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

incrimentSite = 0  # index used to choose a URL from the urls list

while incrimentSite < len(urls):

    htmlfile = urllib.request.urlopen(urls[incrimentSite])
    htmltext = htmlfile.read()
    print(htmltext)

    incrimentSite += 1  # next url

Works for me. Thank you.
 
What is all this?

Code:
while i < len (x):
    print (x[i])
    i +=1

Is that Python? It looks like a C programmer tried to write Python.

Might I suggest this:
Code:
things = ["alice", "bob", "carol"]
for t in things:
    print(t)

If you need the index, use this:

Code:
things = ["alice", "bob", "carol"]
for i, t in enumerate(things):
    print(t, "has list index", i)
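Applied to the scraper from this thread, the same pattern might look roughly like the sketch below. The function name scrape_all and the choice to record None for a failed fetch are assumptions for illustration, not something from the original posts.

```python
import urllib.error
import urllib.request


def scrape_all(urls):
    """Return a list of (index, url, body) tuples; body is None on failure.

    scrape_all is a hypothetical helper showing for/enumerate in place of
    a hand-maintained while-loop counter.
    """
    results = []
    for i, url in enumerate(urls):
        try:
            with urllib.request.urlopen(url) as response:
                body = response.read()
        except (urllib.error.URLError, ValueError) as exc:
            # URLError covers network and HTTP failures;
            # ValueError is raised for strings that are not valid URLs.
            print("could not fetch", url, "-", exc)
            body = None
        results.append((i, url, body))
    return results
```

With this shape, one unreachable site no longer stops the loop, and there is no counter to forget to increment.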
 