Web Scraping: HTML to Text (Streaming). Suggestions?


Discussion Overview

The discussion revolves around web scraping techniques, specifically converting HTML to text, with a focus on using Python and potential intermediate steps involving Excel and CSV files. Participants explore various tools, legal considerations, and ethical implications related to scraping data from websites.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant expresses a desire to scrape HTML to text using Python, mentioning a lack of expertise and seeking general input.
  • Another participant suggests using JSoup, a Java program, and questions its similarity to Python's Beautiful Soup.
  • A participant shares their experience with Beautiful Soup, noting that it often requires trial and error to achieve desired results.
  • Concerns are raised about copyright issues and the perception of scraping as hostile to website owners, with some participants discussing the legality of scraping public web pages.
  • Participants debate the implications of using automated programs to access websites, with some arguing that it is similar to human browsing, while others caution about the potential for being blocked or blacklisted.
  • One participant mentions the need for permission from website owners before scraping, while another counters that as long as requests are not excessive, permission may not be necessary.
  • Discussion includes the notion that webmasters may provide structured data for scraping, yet questions why certain companies, like Facebook, react strongly against scraping activities.

Areas of Agreement / Disagreement

Participants express a mix of views on the legality and ethics of web scraping, with no clear consensus on whether permission is required or the implications of scraping on website owners. The discussion remains unresolved regarding the best practices and legal boundaries of web scraping.

Contextual Notes

Participants highlight various assumptions about the nature of web scraping, including the differences in server responses to automated requests versus human browsing, and the potential consequences of scraping on website resources.

WWGD
Science Advisor, Homework Helper
Hi All,
I am trying to figure out how to scrape HTML to text (or use an API where one is available), maybe with an intermediate step using Excel and .csv files. EDIT: Preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
Thanks.
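As a rough starting point, here is a minimal stdlib-only sketch of the HTML-to-text step plus the CSV intermediate; the sample page and the CSV handling are purely illustrative, and real pages usually need per-site tweaking:

```python
import csv
import io
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

# Hypothetical page; a real run would fetch this with urllib or requests.
page = "<html><body><h1>Title</h1><script>var x = 1;</script><p>Hello world</p></body></html>"
text = html_to_text(page)

# The intermediate CSV step: write the text as one row (in-memory here).
buf = io.StringIO()
csv.writer(buf).writerow([text])
```

The same `html_to_text` can be mapped over many fetched pages, with one CSV row per page.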
 
Try a java program called JSoup.
 
Borg said:
Try a java program called JSoup.
Thanks Borg, is that the same as Python's Beautiful Soup? I am familiar with Python, not so much with Java.
 
And I'm not versed in Python. It's been on my to-do list way too long...
 
Borg said:
And I'm not versed in Python. Been on my todo list way too long...
Feel free to ask a question, though I am no expert; that way we can level the answers from ##\infty \rightarrow 0## to ##\infty \rightarrow 1##: your answers (to my questions) against my answers (to your questions) ;).
 
WWGD said:
Hi All,
I am trying to figure out how to scrape HTML to text (or use an API where one is available), maybe with an intermediate step using Excel and .csv files. EDIT: Preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
Thanks.

I have done that using Beautiful Soup. It works pretty well, but you often have to tweak it by trial and error to get what you want.
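For what it's worth, the trial-and-error in Beautiful Soup is mostly in the CSS selectors; a minimal sketch, where the markup and class names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; real pages rarely have class names this tidy.
html = """
<div class="item"><span class="name">Widget</span><span class="price">$9.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">$19.99</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each tweak cycle usually amounts to adjusting these selectors
# until the extracted rows look right.
rows = [
    (item.select_one(".name").get_text(strip=True),
     item.select_one(".price").get_text(strip=True))
    for item in soup.select("div.item")
]
```

Printing `rows` after each tweak is the quickest feedback loop I know of.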
 
WWGD said:
Thanks Borg, is that the same as Python's Beautiful soup? I am familiar with Python, not too much with Java.
Beautiful Soup and JSoup both appear in Wikipedia's comparison of HTML parsers.
 
Hi,
Another point: I have been hearing about both copyright and general issues regarding scraping; specifically, some forms of scraping are seen as hostile to the owner of the site, while some data held in the page is deemed private. Does anyone know more about this?
 
About copyright, I guess it depends on the type of information and what you do with it.

About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
 
  • #10
There aren't any major differences between visiting a page yourself or using a program to visit it for you. The requests look nearly identical on the server. The only issue might be if you use a program to retrieve hundreds of pages at a time. Even then, most large sites wouldn't even notice it.
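One practical consequence of this: what the server mostly sees is your request headers. A stdlib sketch of building a request that identifies itself; the User-Agent string here is illustrative, and many site owners appreciate an honest one with contact info:

```python
from urllib.request import Request

# Build (but don't send) a request; urlopen(req) would actually fetch it.
req = Request(
    "https://example.com/page.html",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyScraper/0.1)"},
)

# urllib normalizes header names to capitalized form internally.
ua = req.get_header("User-agent")
```

From the server's point of view this request is indistinguishable from a browser's, apart from whatever the User-Agent claims.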
 
  • #11
Borg said:
The only issue might be if you use a program to retrieve hundreds of pages at a time.
Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
 
  • #12
jtbell said:
Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
I have a program that makes 7500 page requests from an SEC web site, processes the data and stores it in a database. It completes in about 3 - 5 minutes. So that's somewhere around 1500 - 2500 page requests per minute. I wouldn't want to put even a one second delay on the requests because it would then take two hours to run.
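Whatever delay you settle on, it is a one-liner to bolt on; a sketch of a throttled loop, with a stand-in for the real fetch function and a token delay so the example stays quick to run:

```python
import time

def throttled_map(fetch, urls, delay=5.0):
    """Apply fetch() to each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i:                      # no sleep before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Stand-in fetch function and a tiny delay, purely for illustration;
# a real run would pass something like urllib.request.urlopen.
pages = throttled_map(lambda u: u.upper(), ["a.html", "b.html"], delay=0.01)
```

The trade-off Borg describes falls out directly: at `delay=1.0`, 7500 URLs take a bit over two hours.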
 
  • #13
Yeah, I agree a major "industrial strength" site on its own server is different from a small business or hobby site on a shared server, like mine. Also text (small files) versus images (large files). My site is oriented around images, and I sometimes get people scraping hundreds of them in one go. My hosting plan doesn't have a bandwidth limit per se, but it does have limits on memory usage, number of simultaneous processes, etc., and I've occasionally hit them.
 
  • #14
Thanks, all. Yes, I agree the copyright thing does not seem to make sense. And the other problem is the means by which the scraping is done and the effect it may have on the resources of the host.
 
  • #15
note: Do not run a scraper on Facebook.

As I recall: if they catch you using one, they will permanently deactivate your account.
 
  • #16
StoneTemplePython said:
note: Do not run a scraper on Facebook.

As I recall: if they catch you using one, they will permanently deactivate your account.
Thanks; you mean my non-existent account (I'm one of around 7 people in the world without a FB account)? Still, I will be cautious. Interesting that the company that deals with private info in questionable ways is ready to swiftly punish those who do the same with its data.
 
  • #17
WWGD said:
Thanks; you mean my non-existent account (I'm one of around 7 people in the world without a FB account)? Still, I will be cautious. Interesting that the company that deals with private info in questionable ways is ready to swiftly punish those who do the same with its data.

Yes. FB and some other giant tech firms out there take a muscular approach to outsiders using web scrapers -- basically because they can. (Not sure exactly who else does this, but the thing about FB is that you need to log in first to view any content, hence they have leverage over you via your login credentials.)
 
  • #18
jack action said:
About copyright, I guess it depends on the type of information and what you do with it.

About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
Not only can they (we) make your task harder, but they can make your general online experience less enjoyable by registering you on suspicious IP blacklists. We use a variety of IP Intelligence sources and L4-L7 traffic analytics techniques to block and report abusive activity to our 15+ public state websites.
Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
 
  • #19
stoomart said:
Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
I don't think you need special permission to visit a website in an automated fashion as long as you don't flood it with requests. Otherwise, one would expect Google to ask permission of every website it visits (i.e. the entire world wide web).

A request by a program or a user is exactly the same thing from the point of view of the server.

Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.
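That structured data often arrives as JSON-LD inside a script tag of type `application/ld+json`; a stdlib sketch of extracting it, with hypothetical sample markup:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect and parse <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_ld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_ld = True

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            # Script content may arrive in chunks, so parse on close.
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf, self._in_ld = [], False

# Hypothetical page embedding schema.org data as JSON-LD.
page = '''<html><head>
<script type="application/ld+json">{"@type": "Article", "headline": "Example"}</script>
</head><body></body></html>'''
extractor = JsonLdExtractor()
extractor.feed(page)
```

When a site publishes JSON-LD like this, parsing it is far more robust than scraping the rendered HTML around it.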
 
  • #20
jack action said:
I don't think you need special permission to visit a website in an automated fashion as long as you don't flood it with requests. Otherwise, one would expect Google to ask permission of every website it visits (i.e. the entire world wide web).

A request by a program or a user is exactly the same thing from the point of view of the server.

Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.

But why then do Facebook and others react the way they do, fighting and fending off scraping? EDIT: I am not disagreeing with you (I don't know enough to), just curious about your take.
 
  • #21
WWGD said:
But why then do Facebook and others react the way they do, fighting and fending off scraping? EDIT: I am not disagreeing with you (I don't know enough to), just curious about your take.
All I'm saying is that I don't think there are laws forbidding accessing a website with a program without the owner's permission (after all, a web browser is itself a kind of program for accessing a website). Even honoring the «robots.txt» file is a courtesy, not an obligation under any law. But there are no laws either that force a website owner to honor every request he gets, so what Facebook and others are doing is legal as well.

From my point of view, if a website owner gets offended because I accessed his website with an automated program that causes no more burden than a typical user, that's an attitude problem on his part. He can, however, legitimately object if I break a copyright law by misusing the data I recovered from his website (just as if I had gotten it «manually»).

That being said, I doubt people attempting to recover data from Facebook often have respect for the capacity of their servers, ergo their reaction.
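On the «robots.txt» point: courtesy or not, checking it is cheap with the stdlib. A sketch parsing a made-up robots.txt body directly (RobotFileParser can also fetch it from a live site via set_url() and read()):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Hypothetical robots.txt content; normally fetched from the site itself.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

allowed = rp.can_fetch("MyBot", "https://example.com/articles/1.html")
blocked = rp.can_fetch("MyBot", "https://example.com/private/1.html")
delay = rp.crawl_delay("MyBot")   # seconds the site asks crawlers to wait
```

A scraper that honors both the Disallow rules and the Crawl-delay is much less likely to end up on the blacklists mentioned above.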
 
  • #22
Information on web pages can have copyright restrictions. Reading the data should be no problem, but using it in a publication without following copyright rules may lead to legal problems.
 
