HTML/CSS Web Scraping: HTML to Text (Streaming). Suggestions?

The discussion centers on web scraping, specifically using Python and tools like Beautiful Soup, as well as the ethical and legal considerations involved. Participants share experiences with scraping HTML to text and mention the potential need for APIs. There is a comparison between Python's Beautiful Soup and Java's JSoup, with an emphasis on the trial-and-error nature of scraping.

Legal concerns are highlighted, particularly regarding copyright and the privacy of data on web pages. It is noted that while web pages are generally public, scraping can be viewed as hostile by site owners, especially if done excessively. The importance of not overwhelming servers with requests is emphasized, with suggestions for implementing delays between requests to avoid detection and potential bans.

The conversation touches on the strict policies of major platforms like Facebook against scraping, underscoring the legal gray areas surrounding automated data access. Participants agree that while scraping can be legal if done responsibly, it is crucial to respect the site owner's rights and the potential implications of copyright infringement when using scraped data.
WWGD
Hi All,
I am trying to figure out how to scrape HTML to text (or use an API where one is available), maybe with an intermediate step using Excel and .csv files, preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
Thanks.
 
Try a Java program called JSoup.
 
Borg said:
Try a Java program called JSoup.
Thanks Borg. Is that the same as Python's Beautiful Soup? I am familiar with Python, not so much with Java.
 
And I'm not versed in Python. It's been on my to-do list way too long...
 
Borg said:
And I'm not versed in Python. Been on my todo list way too long...
Feel free to ask a question, though I am no expert; that way we can balance the answers from ## \infty \rightarrow 0## to ## \infty \rightarrow 1 ##: your answers (to my questions) to my answers (to your questions) ;).
 
WWGD said:
Hi All,
I am trying to figure out how to scrape HTML to text (or use an API where one is available), maybe with an intermediate step using Excel and .csv files, preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
Thanks.

I have done that using Beautiful Soup. It works pretty well, but you often have to tweak it by trial and error to get what you want.
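For the HTML-to-text step itself, here is a minimal sketch using only Python's standard library (the sample HTML and the output column names are invented for illustration; Beautiful Soup's `get_text()` does the same job with far less code, but it is a third-party package you have to install):

```python
import csv
import io
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style and non-blank.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

# Example: turn one page into a CSV row (url, extracted text),
# which covers the "intermediate .csv step" mentioned above.
sample = ("<html><head><style>p{color:red}</style></head>"
          "<body><p>Hello</p><p>world</p></body></html>")
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["url", "text"])
writer.writerow(["https://example.com/page", html_to_text(sample)])
```

The trial-and-error part is mostly in deciding which tags to skip and how to join the fragments; real pages need more tweaking than this sketch suggests.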
 
WWGD said:
Thanks Borg. Is that the same as Python's Beautiful Soup? I am familiar with Python, not so much with Java.
Beautiful Soup and JSoup both appear on Wikipedia's comparison page of HTML parsers.
 
Hi,
Another point: I have been hearing about both copyright and more general issues regarding scraping; specifically, some forms of scraping are seen as hostile by site owners, while some data held in a page is deemed private. Does anyone know more about this?
 
About copyright, I guess it depends on the type of information and what you do with it.

About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
 
  • #10
There isn't any major difference between visiting a page yourself and using a program to visit it for you; the requests look nearly identical to the server. The only issue might be if you use a program to retrieve hundreds of pages at a time, and even then, most large sites wouldn't notice.
 
  • #11
Borg said:
The only issue might be if you use a program to retrieve hundreds of pages at a time.
Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
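A minimal sketch of that delay idea, assuming the URL list and the stand-in fetcher are hypothetical (in practice you would pass `requests.get` or a `urllib` wrapper as the `fetch` callable); the randomized jitter keeps the requests from arriving at perfectly regular intervals:

```python
import random
import time

def fetch_politely(urls, fetch, base_delay=5.0, jitter=2.0):
    """Call fetch(url) for each URL, sleeping a randomized interval
    between requests so the traffic looks less like a burst."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the first request
            time.sleep(base_delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results

# Usage with a stand-in fetcher; a tiny delay is used just for the demo.
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    base_delay=0.01, jitter=0.0,
)
```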
 
  • #12
jtbell said:
Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
I have a program that makes 7,500 page requests to an SEC website, processes the data, and stores it in a database. It completes in about 3 to 5 minutes, so that's somewhere around 1,500 to 2,500 page requests per minute. I wouldn't want to put even a one-second delay on the requests, because the run would then take over two hours.
 
  • #13
Yeah, I agree a major "industrial strength" site on its own server is different from a small business or hobby site on a shared server, like mine. Also text (small files) versus images (large files). My site is oriented around images, and I sometimes get people scraping hundreds of them in one go. My hosting plan doesn't have a bandwidth limit per se, but it does have limits on memory usage, number of simultaneous processes, etc., and I've occasionally hit them.
 
  • #14
Thanks all. Yes, I agree the copyright thing does not seem to make sense. The other problem is the means by which the scraping is done and the effect it may have on the resources of the host.
 
  • #15
note: Do not run a scraper on Facebook.

As I recall: if they catch you using one, they will permanently deactivate your account.
 
  • #16
StoneTemplePython said:
note: Do not run a scraper on Facebook.

As I recall: if they catch you using one, they will permanently deactivate your account.
Thanks; you mean my nonexistent account? (I'm one of around 7 people in the world without a FB account.) Still, I will be cautious. Interesting that a company that deals with private info in questionable ways is ready to swiftly punish those who do the same with its data.
 
  • #17
WWGD said:
Thanks; you mean my nonexistent account? (I'm one of around 7 people in the world without a FB account.) Still, I will be cautious. Interesting that a company that deals with private info in questionable ways is ready to swiftly punish those who do the same with its data.

Yes. FB and some other giant tech firms take a muscular approach to outsiders using web scrapers, basically because they can. (Not sure exactly who else does this, but the thing about FB is that you need to log in first to view any content, and hence they have leverage over you via your login credentials.)
 
  • #18
jack action said:
About copyright, I guess it depends on the type of information and what you do with it.

About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
Not only can they (we) make your task harder, they can also make your general online experience less enjoyable by registering you on suspicious-IP blacklists. We use a variety of IP intelligence sources and L4-L7 traffic analytics techniques to block and report abusive activity directed at our 15+ public state websites.
Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
 
  • #19
stoomart said:
Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
I don't think you need special permission to visit a website in an automated fashion, as long as you don't flood it with requests. Otherwise, one would expect Google to ask permission of every website it visits (i.e., the entire world wide web).

A request by a program or a user is exactly the same thing from the point of view of the server.

Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.
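To illustrate that last point, sites that publish schema.org markup usually embed it as JSON-LD in `<script type="application/ld+json">` blocks. A stdlib-only sketch of pulling it out (the sample page and its contents are invented):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects <script type="application/ld+json"> blocks as parsed JSON."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ldjson = True

    def handle_data(self, data):
        if self._in_ldjson:
            self._buf.append(data)  # script content may arrive in chunks

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ldjson:
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf = []
            self._in_ldjson = False

sample = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Scraping 101"}
</script>
</head><body>Article body here.</body></html>"""

parser = JSONLDExtractor()
parser.feed(sample)
```

When a site provides data in this form, parsing the JSON-LD is usually far more robust than scraping the visible HTML around it.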
 
  • #20
jack action said:
I don't think you need special permission to visit a website in an automated fashion, as long as you don't flood it with requests. Otherwise, one would expect Google to ask permission of every website it visits (i.e., the entire world wide web).

A request by a program or a user is exactly the same thing from the point of view of the server.

Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.

But why, then, do Facebook and others react the way they do in fighting off scraping? EDIT: I am not disagreeing with you (I don't know enough to); just curious about your take.
 
  • #21
WWGD said:
But why, then, do Facebook and others react the way they do in fighting off scraping? EDIT: I am not disagreeing with you (I don't know enough to); just curious about your take.
All I'm saying is that I don't think there are laws forbidding access to a website with a program, without the owner's permission (after all, a web browser is a kind of program for accessing a website). Even honoring the robots.txt file is a courtesy, not an obligation under any law. But there are no laws forcing a website owner to honor every request he gets either, so what Facebook and others are doing is legal as well.

From my point of view, if a website owner gets offended because I accessed his website with an automated program that causes no more burden than a typical user, that's an attitude problem on his part. He can, however, rightly object if I break a copyright law while misusing the data I recovered from his website (just as if I had gotten it "manually").

That being said, I doubt people attempting to recover data from Facebook often have respect for the capacity of its servers, hence the reaction.
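Even as a courtesy, checking robots.txt is cheap to do. A quick sketch using the standard library's `urllib.robotparser` (the robots.txt content and the user-agent string here are made up; in practice you would fetch the file from the site with `RobotFileParser.read()`):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed directly from lines for the demo.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check individual URLs before fetching them.
allowed = rp.can_fetch("MyScraper/1.0", "https://example.com/public/page.html")
blocked = rp.can_fetch("MyScraper/1.0", "https://example.com/private/data.html")

# Some sites also publish a suggested delay between requests.
delay = rp.crawl_delay("MyScraper/1.0")
```

Respecting `Crawl-delay` when it is present dovetails with the earlier suggestion of sleeping between requests.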
 
  • #22
Information on web pages can carry copyright restrictions. Reading the data should be no problem, but using it in a publication without following copyright rules may lead to legal problems.
 
