Web Scraping with Python 3.4: Getting Started

AI Thread Summary
The discussion centers around a web scraping issue using Python 3.4. The original code attempts to retrieve HTML source code from multiple URLs but encounters an error because it incorrectly passes a list of URLs to the `urlopen()` function instead of a single URL. Participants suggest corrections, emphasizing the need to pass one URL at a time within the loop. A more efficient approach is proposed, utilizing a function to scrape each URL and applying it to the list using `map()`. Additionally, there are comments on improving code readability and structure, recommending the use of `for` loops instead of `while` loops for better clarity. Overall, the conversation highlights common pitfalls in Python programming and offers solutions to enhance the code's functionality.
TheDemx27
I'm starting this web scraper, and all I'm trying to do so far is to just retrieve the source code from the sites.

Code:
import urllib.request
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

i = 0

while i < len(urls):

	htmlfile = urllib.request.urlopen(urls)
	htmltext = htmlfile.read()
	print(htmltext)
	i+=1

I run it, and I get this error:

Code:
Traceback (most recent call last):
  File "C:\Python34\WebScraper.py", line 10, in <module>
    htmlfile = urllib.request.urlopen(urls)
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 446, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
[Finished in 0.2s with exit code 1]

I'm using Python 3.4.
 
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

The bad thing about languages with no static type checking (like Python) is that when things go wrong, the result can be total confusion; hence the not-very-helpful error message about timeout.
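In other words, the loop body should index into the list so that urlopen() receives one URL per pass. A minimal sketch of that fix (the urlopen() call is left out so the snippet runs without a network connection):

```python
urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

picked = []
i = 0
while i < len(urls):
    picked.append(urls[i])  # urls[i] is ONE url, not the whole list
    i += 1
```

Each iteration now hands a single string to whatever does the fetching, which is what urlopen() expects.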
 
Also you might consider writing something like this:

Code:
import urllib.request
from urllib.request import urlopen
urls = ["http://www.google.com", "http://www.nytimes.com", "http://www.rockpapershotgun.com"]

# create a function that returns the result of a page scrape
def scrape_url(url):
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    return htmltext

# apply the scrape_url function to every element in the urls list
# and convert the resulting iterator into a list
scrape = list(map(scrape_url, urls))

#print the list
print(scrape)
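One caveat worth noting: urlopen() raises urllib.error.URLError when a site is unreachable or a URL is malformed, so scrape_url() is often wrapped in a try/except to keep one bad site from killing the whole map(). A sketch (the timeout value here is an arbitrary choice):

```python
from urllib.error import URLError
from urllib.request import urlopen

def scrape_url(url):
    try:
        # the timeout keeps a dead site from hanging the whole scrape
        return urlopen(url, timeout=10).read()
    except URLError as err:
        print("failed:", url, "->", err)
        return b""  # empty bytes, so list(map(...)) keeps its shape
```

With this version, a failed fetch produces an empty entry instead of a traceback.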
 
AlephZero said:
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

Good point. Programming never fails to make me feel idiotic. :P

Code:
import urllib.request
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

incrementSite = 0 # index to choose from the urls list

while incrementSite < len(urls):
    htmlfile = urllib.request.urlopen(urls[incrementSite])
    htmltext = htmlfile.read()
    print(htmltext)
    incrementSite += 1 # next url

Works for me. Thank you.
 
What is all this?

Code:
while i < len (x):
    print (x[i])
    i +=1

Is that Python? It looks like a C programmer tried to write Python.

Might I suggest this:
Code:
    things = ["alice", "bob", "carol"]
    for t in things:
        print (t)

If you need the index, use this:

Code:
    things = ["alice", "bob", "carol"]
    for i, t in enumerate (things):
        print (t, "has list index", i)
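Applied to the scraper in this thread, the counter variable disappears entirely. A sketch with the urlopen() call replaced by print so it runs offline:

```python
urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

for i, url in enumerate(urls):
    # urllib.request.urlopen(url).read() would go here
    print(i, url)
```

There is no index to initialize, bound to check, or increment to forget, which is exactly the class of bug the original while loop ran into.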
 
