Web Scraping with Python 3.4: Getting Started

  • Forum: Python
  • Thread starter: TheDemx27
  • Tags: Python, Web

Discussion Overview

The discussion focuses on web scraping using Python 3.4, specifically addressing issues related to retrieving source code from websites. Participants explore coding practices, error handling, and alternative approaches to web scraping.

Discussion Character

  • Technical explanation, Debate/contested, Exploratory

Main Points Raised

  • One participant shares an initial code snippet for web scraping but encounters an error due to passing a list of URLs to the `urlopen()` function instead of a single URL.
  • Another participant points out the mistake in passing the entire list to `urlopen()` and comments on the confusion that can arise from Python's lack of type checking.
  • A different participant suggests a functional approach to web scraping by creating a function that processes each URL individually and uses `map()` to apply it to the list of URLs.
  • A later reply acknowledges the previous point and provides a corrected version of the initial code, successfully retrieving the HTML content for each URL.
  • One participant critiques the initial looping structure, suggesting a more Pythonic approach using a `for` loop instead of a `while` loop, and provides examples of both standard and indexed iteration.

Areas of Agreement / Disagreement

Participants generally agree on the need to pass individual URLs to the `urlopen()` function and express varying opinions on coding style and best practices. However, there is no consensus on a single preferred method for implementing the web scraper.

Contextual Notes

Some participants highlight the potential for confusion in error messages due to Python's dynamic typing, and there are differing opinions on coding style and structure, reflecting personal preferences rather than established standards.

TheDemx27
I'm starting this web scraper, and all I'm trying to do so far is to just retrieve the source code from the sites.

Code:
import urllib.request
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

i = 0

while i < len(urls):

	htmlfile = urllib.request.urlopen(urls)
	htmltext = htmlfile.read()
	print(htmltext)
	i+=1

I run it, and I get this error:

Code:
Traceback (most recent call last):
  File "C:\Python34\WebScraper.py", line 10, in <module>
    htmlfile = urllib.request.urlopen(urls)
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 446, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
[Finished in 0.2s with exit code 1]

I'm using Python 3.4.
 
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

The bad thing about languages without static type checking (like Python) is that when things go wrong, the result can be total confusion; hence the not-very-helpful error message about timeout.
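A minimal sketch (my own illustration, not code from the thread) of how an explicit type check can turn that cryptic `'list' object has no attribute 'timeout'` into a clear failure at the call site:

```python
from urllib.request import urlopen

def fetch(url):
    # Guard against accidentally passing a list instead of a single URL;
    # this fails fast with a readable message instead of an obscure
    # AttributeError from deep inside urllib.
    if not isinstance(url, str):
        raise TypeError(f"expected a single URL string, got {type(url).__name__}")
    return urlopen(url).read()
```

Python 3.5+ also supports type hints (`def fetch(url: str) -> bytes:`), which a checker like mypy can use to catch this before the script ever runs.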
 
Also you might consider writing something like this:

Code:
import urllib.request

urls = ["http://www.google.com", "http://www.nytimes.com", "http://www.rockpapershotgun.com"]

# Create a function that returns the result of a page scrape
def scrape_url(url):
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    return htmltext

# Apply scrape_url to every element of urls and convert the resulting iterator into a list
scrape = list(map(scrape_url, urls))

# Print the list
print(scrape)
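A list comprehension does the same job as `map()` here and many consider it more readable. A small sketch using a stand-in function (so it runs without network access; the real version would call `urlopen` as above):

```python
def scrape_url(url):
    # Stand-in for the urlopen-based version, so this example runs offline.
    return f"<html>{url}</html>"

urls = ["http://google.com", "http://nytimes.com"]

# map() and a list comprehension produce identical results here:
via_map = list(map(scrape_url, urls))
via_comp = [scrape_url(u) for u in urls]
assert via_map == via_comp
```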
 
AlephZero said:
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

Good point. Programming never fails to make me feel idiotic. :P

Code:
import urllib.request

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

incrementSite = 0  # Variable to index into the urls list

while incrementSite < len(urls):
    htmlfile = urllib.request.urlopen(urls[incrementSite])
    htmltext = htmlfile.read()
    print(htmltext)
    incrementSite += 1  # next url

Works for me. Thank you.
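One more refinement worth considering (my own sketch, not from the thread): wrapping each fetch in a try/except so a single unreachable site doesn't abort the whole run. `urllib.error.URLError` is what `urlopen` raises on connection and DNS failures:

```python
import urllib.error
import urllib.request

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as htmlfile:
            print(htmlfile.read())
    except urllib.error.URLError as err:
        # One bad or unreachable URL no longer stops the loop.
        print(f"failed to fetch {url}: {err.reason}")
```

Using `with` also closes each response object promptly instead of leaving it to the garbage collector.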
 
What is all this?

Code:
while i < len(x):
    print(x[i])
    i += 1

Is that Python? It looks like a C programmer tried to write Python.

Might I suggest this:
Code:
things = ["alice", "bob", "carol"]
for t in things:
    print(t)

If you need the index, use this:

Code:
things = ["alice", "bob", "carol"]
for i, t in enumerate(things):
    print(t, "has list index", i)
 
