Web Scraping with Python 3.4: Getting Started

In summary, the code tries to loop over a list of URLs, but it passes the entire list to urlopen() instead of a single URL on each pass.
  • #1
TheDemx27
I'm starting this web scraper, and all I'm trying to do so far is retrieve the source code from the sites.

Code:
import urllib.request
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

i = 0

while i < len(urls):

	htmlfile = urllib.request.urlopen(urls)
	htmltext = htmlfile.read()
	print(htmltext)
	i+=1

I run it, and I get this error:

Code:
Traceback (most recent call last):
  File "C:\Python34\WebScraper.py", line 10, in <module>
    htmlfile = urllib.request.urlopen(urls)
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 446, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
[Finished in 0.2s with exit code 1]

I'm using Python 3.4.
 
  • #2
AlephZero
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

The bad thing about dynamically typed languages like Python is that when things go wrong, the result can be total confusion - hence the not-very-helpful error message about timeout.
 
  • #3
Also, you might consider writing something like this:

Code:
import urllib.request

urls = ["http://www.google.com", "http://www.nytimes.com", "http://www.rockpapershotgun.com"]

# create a function that returns the result of a page scrape
def scrape_url(url):
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    return htmltext

# apply the scrape_url function to every element in the urls list
# and convert the resulting iterator into a list
scrape = list(map(scrape_url, urls))

# print the list
print(scrape)
 
  • #4
AlephZero said:
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

Good point. Programming never fails to make me feel idiotic. :P

Code:
import urllib.request

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

incrementSite = 0  # index used to choose from the urls list

while incrementSite < len(urls):
    htmlfile = urllib.request.urlopen(urls[incrementSite])
    htmltext = htmlfile.read()
    print(htmltext)
    incrementSite += 1  # next url

Works for me. Thank you.
 
  • #5
What is all this?

Code:
while i < len(x):
    print(x[i])
    i += 1

Is that Python? It looks like a C programmer tried to write Python.

Might I suggest this:
Code:
things = ["alice", "bob", "carol"]
for t in things:
    print(t)

If you need the index, use this:

Code:
things = ["alice", "bob", "carol"]
for i, t in enumerate(things):
    print(t, "has list index", i)
 

FAQ: Web Scraping with Python 3.4: Getting Started

1. What is web scraping and why is it useful?

Web scraping is the process of extracting information from websites using automated scripts or programs. It is useful for collecting large amounts of data from various sources quickly and efficiently. This data can then be analyzed and used for various purposes, such as market research, competitive analysis, or data-driven decision making.

2. What tools do I need to get started with web scraping in Python 3.4?

To get started with web scraping in Python 3.4, you will need the following tools (a short example using the two libraries is sketched after this list):
- A text editor or IDE for writing your code (e.g. Sublime Text, PyCharm)
- The Python 3.4 interpreter installed on your computer
- The BeautifulSoup library for parsing HTML
- The requests library for making HTTP requests to websites
- A basic understanding of HTML and CSS
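
For instance, here is a minimal sketch combining requests and BeautifulSoup; the target URL is only an example, and both packages would need to be installed first (pip install requests beautifulsoup4):

Code:
import requests
from bs4 import BeautifulSoup

# fetch the page and parse its HTML
response = requests.get("http://www.rockpapershotgun.com/")
soup = BeautifulSoup(response.text, "html.parser")

# print the page title (if the page defines one) and every link found
if soup.title is not None:
    print(soup.title.string)
for link in soup.find_all("a"):
    print(link.get("href"))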

3. Is web scraping legal?

Web scraping is a gray area in terms of legality. While it is not explicitly illegal, there are certain ethical and legal considerations to keep in mind. It is generally considered acceptable to scrape public data from websites, but scraping private or copyrighted information without permission may be illegal. It is important to always check the terms of service and robots.txt file of a website before scraping it.
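
As a concrete first step, Python's standard library includes urllib.robotparser for reading a site's robots.txt before you scrape; a minimal sketch, where the site and path are only examples:

Code:
from urllib.robotparser import RobotFileParser

# load and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url("http://www.nytimes.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may fetch the URL
print(rp.can_fetch("*", "http://www.nytimes.com/section/world"))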

4. Can I use web scraping for any website?

Technically, you can attempt to scrape any website. However, some websites may have measures in place to prevent scraping, such as CAPTCHAs or IP blocking. It is important to be respectful of a website's policies and to not overload their servers with too many requests. Additionally, some websites may have anti-scraping measures in their terms of service, so it is important to check before scraping.

5. How can I handle errors while web scraping in Python 3.4?

There are a few ways to handle errors while web scraping in Python 3.4, combined in the sketch after this list:
- Use try-except blocks to catch and handle specific errors
- Use the status_code attribute of the response object to check for successful requests
- Use the sleep() function from the time module to add a delay between requests
- Use proxies or rotating user agents to avoid getting blocked by websites
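
A minimal sketch combining the first three points, assuming the requests package is installed; the URLs are only examples:

Code:
import time
import requests

urls = ["http://google.com", "http://nytimes.com"]

for url in urls:
    try:
        response = requests.get(url, timeout=10)
        # status_code distinguishes success (200) from errors such as 404
        if response.status_code == 200:
            print(url, "returned", len(response.text), "characters")
        else:
            print(url, "returned HTTP status", response.status_code)
    except requests.RequestException as err:
        # catches connection failures, timeouts, and other request errors
        print(url, "failed:", err)
    # pause between requests so we don't overload the server
    time.sleep(1)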
