Web Scraping with Python 3.4: Getting Started

  • Forum: Python
  • Thread starter: TheDemx27
  • Tags: Python, Web

Discussion Overview

The discussion focuses on web scraping using Python 3.4, specifically addressing issues related to retrieving source code from websites. Participants explore coding practices, error handling, and alternative approaches to web scraping.

Discussion Character

  • Technical explanation, Debate/contested, Exploratory

Main Points Raised

  • One participant shares an initial code snippet for web scraping but encounters an error due to passing a list of URLs to the `urlopen()` function instead of a single URL.
  • Another participant points out the mistake in passing the entire list to `urlopen()` and comments on the confusion that can arise from Python's lack of type checking.
  • A different participant suggests a functional approach to web scraping by creating a function that processes each URL individually and uses `map()` to apply it to the list of URLs.
  • A later reply acknowledges the previous point and provides a corrected version of the initial code, successfully retrieving the HTML content for each URL.
  • One participant critiques the initial looping structure, suggesting a more Pythonic approach using a `for` loop instead of a `while` loop, and provides examples of both standard and indexed iteration.

Areas of Agreement / Disagreement

Participants generally agree on the need to pass individual URLs to the `urlopen()` function and express varying opinions on coding style and best practices. However, there is no consensus on a single preferred method for implementing the web scraper.

Contextual Notes

Some participants highlight the potential for confusion in error messages due to Python's dynamic typing, and there are differing opinions on coding style and structure, reflecting personal preferences rather than established standards.

TheDemx27
I'm starting this web scraper, and all I'm trying to do so far is to just retrieve the source code from the sites.

Code:
import urllib.request
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

i = 0

while i < len(urls):

	htmlfile = urllib.request.urlopen(urls)
	htmltext = htmlfile.read()
	print(htmltext)
	i+=1

I run it, and I get this error:

Code:
Traceback (most recent call last):
  File "C:\Python34\WebScraper.py", line 10, in <module>
    htmlfile = urllib.request.urlopen(urls)
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 446, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
[Finished in 0.2s with exit code 1]

I'm using Python 3.4.
 
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

The bad thing about languages without static type checking (like Python) is that when things go wrong, the result can be total confusion; hence the not-very-helpful error message about timeout.
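A minimal sketch (my own illustration, not code from the thread) of how an explicit type check can turn that cryptic `'list' object has no attribute 'timeout'` into a clear failure at the call site:

```python
from urllib.request import urlopen

def fetch(url):
    # Guard against accidentally passing a list instead of a single URL;
    # this fails fast with a readable message instead of an obscure
    # AttributeError from deep inside urllib.
    if not isinstance(url, str):
        raise TypeError(f"expected a single URL string, got {type(url).__name__}")
    return urlopen(url).read()
```

Python 3.5+ also supports type hints (`def fetch(url: str) -> bytes:`), which a checker like mypy can use to catch this before the script ever runs.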
 
Also you might consider writing something like this:

Code:
import urllib.request

urls = ["http://www.google.com", "http://www.nytimes.com", "http://www.rockpapershotgun.com"]

# Create a function that returns the result of a page scrape
def scrape_url(url):
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    return htmltext

# Apply scrape_url to every element of urls and convert the resulting iterator into a list
scrape = list(map(scrape_url, urls))

# Print the list
print(scrape)
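A list comprehension does the same job as `map()` here and many consider it more readable. A small sketch using a stand-in function (so it runs without network access; the real version would call `urlopen` as above):

```python
def scrape_url(url):
    # Stand-in for the urlopen-based version, so this example runs offline.
    return f"<html>{url}</html>"

urls = ["http://google.com", "http://nytimes.com"]

# map() and a list comprehension produce identical results here:
via_map = list(map(scrape_url, urls))
via_comp = [scrape_url(u) for u in urls]
assert via_map == via_comp
```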
 
AlephZero said:
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

Good point. Programming never fails to make me feel idiotic. :P

Code:
import urllib.request

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

incrementSite = 0  # Variable to index into the urls list

while incrementSite < len(urls):
    htmlfile = urllib.request.urlopen(urls[incrementSite])
    htmltext = htmlfile.read()
    print(htmltext)
    incrementSite += 1  # next url

Works for me. Thank you.
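One more refinement worth considering (my own sketch, not from the thread): wrapping each fetch in a try/except so a single unreachable site doesn't abort the whole run. `urllib.error.URLError` is what `urlopen` raises on connection and DNS failures:

```python
import urllib.error
import urllib.request

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as htmlfile:
            print(htmlfile.read())
    except urllib.error.URLError as err:
        # One bad or unreachable URL no longer stops the loop.
        print(f"failed to fetch {url}: {err.reason}")
```

Using `with` also closes each response object promptly instead of leaving it to the garbage collector.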
 
What is all this?

Code:
while i < len(x):
    print(x[i])
    i += 1

Is that Python? It looks like a C programmer tried to write Python.

Might I suggest this:
Code:
things = ["alice", "bob", "carol"]
for t in things:
    print(t)

If you need the index, use this:

Code:
things = ["alice", "bob", "carol"]
for i, t in enumerate(things):
    print(t, "has list index", i)
 
