Web Scraping: HTML to Text (Streaming). Suggestions?

In summary, the conversation is about scraping HTML to text and using Python or Java programs to do so. There is discussion about copyright and privacy issues related to scraping and the potential consequences of using a scraper on certain websites. Some participants also share their experiences with scraping and offer advice on how to avoid getting blocked or punished for scraping.
  • #1
WWGD
Hi All,
I am trying to figure out how to scrape HTML to text (if necessary; otherwise use an API), maybe with an intermediate step using Excel and .csv files. EDIT: Preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
Thanks.
 
  • #2
Try a Java library called JSoup.
 
  • Like
Likes FactChecker, jedishrfu and WWGD
  • #3
Borg said:
Try a Java library called JSoup.
Thanks Borg, is that the same as Python's Beautiful Soup? I am familiar with Python, not so much with Java.
 
  • #4
And I'm not versed in Python. It's been on my to-do list way too long...
 
  • #5
Borg said:
And I'm not versed in Python. It's been on my to-do list way too long...
Feel free to ask a question, though I am no expert; that way we can level the answers from ##\infty \rightarrow 0## to ##\infty \rightarrow 1##: your answers (to my questions) to my answers (to your questions) ;).
 
  • #6
WWGD said:
Hi All,
I am trying to figure out how to scrape HTML to text (if necessary; otherwise use an API), maybe with an intermediate step using Excel and .csv files. EDIT: Preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
Thanks.

I have done that using Beautiful Soup. It works pretty well, but you often have to tweak it by trial and error to get what you want.
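For illustration, a minimal sketch of that kind of HTML-to-text (and .csv) workflow with requests and Beautiful Soup might look like the following; the URL and the CSS selectors are placeholders and would have to be adapted to the real page by trial and error, as noted above.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- substitute the page you actually want to scrape.
URL = "https://example.com/listing"

response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# These CSS selectors are hypothetical; inspect the real page to find yours.
rows = []
for item in soup.select("div.entry"):
    title = item.select_one("h2")
    body = item.select_one("p")
    rows.append([
        title.get_text(strip=True) if title else "",
        body.get_text(strip=True) if body else "",
    ])

# Write the extracted text to a .csv as an intermediate step (e.g. for Excel).
with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "body"])
    writer.writerows(rows)
```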
 
  • Like
Likes WWGD
  • #7
WWGD said:
Thanks Borg, is that the same as Python's Beautiful Soup? I am familiar with Python, not so much with Java.
Beautiful Soup and JSoup are both on the Wikipedia comparison page of HTML parsers.
 
  • Like
Likes WWGD
  • #8
Hi,
Another point: I have been hearing about both copyright and general issues regarding scraping; specifically, some forms of scraping are seen as hostile to the owner of the site, while some data held in a page is deemed private. Does anyone know more about this?
 
  • #9
About copyright, I guess it depends on the type of information and what you do with it.

About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
 
  • #10
There aren't any major differences between visiting a page yourself and using a program to visit it for you. The requests look nearly identical on the server. The only issue might be if you use a program to retrieve hundreds of pages at a time. Even then, most large sites wouldn't notice it.
 
  • #11
Borg said:
The only issue might be if you use a program to retrieve hundreds of pages at a time.
Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
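For a small, polite scrape, a sketch along those lines could be as simple as the following; the URLs and the 5-second delay are placeholders, not recommendations for any particular site.

```python
import time

import requests

# Hypothetical list of pages to fetch; nothing here is site-specific.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

pages = []
for url in urls:
    response = requests.get(url, timeout=10)
    pages.append(response.text)
    time.sleep(5)  # pause between requests so the server isn't hammered
```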
 
  • #12
jtbell said:
Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
I have a program that makes 7500 page requests to an SEC website, processes the data, and stores it in a database. It completes in about 3-5 minutes, so that's somewhere around 1500-2500 page requests per minute. I wouldn't want to put even a one-second delay on the requests because the run would then take about two hours.
 
  • #13
Yeah, I agree a major "industrial strength" site on its own server is different from a small business or hobby site on a shared server, like mine. Also text (small files) versus images (large files). My site is oriented around images, and I sometimes get people scraping hundreds of them in one go. My hosting plan doesn't have a bandwidth limit per se, but it does have limits on memory usage, number of simultaneous processes, etc., and I've occasionally hit them.
 
  • #14
Thanks, all. Yes, I agree the copyright thing does not seem to make sense. And the other problem is the means by which the scraping is done and the effect it may have on the resources of the host.
 
  • #15
note: Do not run a scraper on Facebook.

As I recall: if they catch you using one, they will permanently deactivate your account.
 
  • #16
StoneTemplePython said:
note: Do not run a scraper on Facebook.

As I recall: if they catch you using one, they will permanently deactivate your account.
Thanks; you mean my non-existent account (I'm one of around 7 people in the world without a FB account)? Still, I will be cautious. Interesting that the company that deals with private info in questionable ways is ready to swiftly punish those who do the same with their data.
 
  • #17
WWGD said:
Thanks; you mean my non-existent account (I'm one of around 7 people in the world without a FB account)? Still, I will be cautious. Interesting that the company that deals with private info in questionable ways is ready to swiftly punish those who do the same with their data.

Yes. FB and some other giant tech firms out there take a muscular approach to outsiders using web scrapers -- basically because they can. (Not sure exactly who else does this, but the thing about FB is that you need to log in first to view any content, and hence they have leverage over you via your login credentials.)
 
  • #18
jack action said:
About copyright, I guess it depends on the type of information and what you do with it.

About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
Not only can they (we) make your task harder, but they can also make your general online experience less enjoyable by registering you on suspicious-IP blacklists. We use a variety of IP intelligence sources and layer 4 to layer 7 (L4-L7) traffic analytics techniques to block and report abusive activity against our 15+ public state websites.
Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
 
  • Like
Likes WWGD
  • #19
stoomart said:
Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
I don't think you need special permission to visit a website in an automated fashion as long as you don't flood it with requests. Otherwise, one would expect Google to ask permission from every website they visit (i.e. the entire world wide web).

A request by a program or a user is exactly the same thing from the point of view of the server.

Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.
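As an illustration of that last point, here is a minimal sketch of pulling schema.org structured data (JSON-LD) out of a page with requests and Beautiful Soup; the URL is a placeholder, and not every site embeds JSON-LD.

```python
import json

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- many sites embed schema.org data as JSON-LD.
URL = "https://example.com/product"

soup = BeautifulSoup(requests.get(URL, timeout=10).text, "html.parser")

# JSON-LD blocks conventionally live in <script type="application/ld+json"> tags.
for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict):
            print(item.get("@type"), item.get("name"))
```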
 
  • #20
jack action said:
I don't think you need special permission to visit a website in an automated fashion as long as you don't flood it with requests. Otherwise, one would expect Google to ask permission from every website they visit (i.e. the entire world wide web).

A request by a program or a user is exactly the same thing from the point of view of the server.

Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.

But why then do Facebook and others react the way they do in fighting off scraping? EDIT: I am not disagreeing with you (I don't know enough to), just curious about your take.
 
  • #21
WWGD said:
But why then do Facebook and others react the way they do in fighting off scraping? EDIT: I am not disagreeing with you (I don't know enough to), just curious about your take.
All I'm saying is that I don't think there are laws forbidding accessing a website with a program and without the permission of the owner (anyway, a web browser is a kind of program for accessing a website). Even honoring the «robots.txt» file is a courtesy, not an obligation under any law. But there are no laws either that force a website owner to honor every request he gets, so what Facebook and others are doing is legal as well.
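Checking «robots.txt» from a script is easy, though, if you want to extend that courtesy; here is a minimal sketch using Python's standard library, with a placeholder site and user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and user-agent string -- adapt to your own scraper.
SITE = "https://example.com"
USER_AGENT = "my-scraper"

rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()  # fetch and parse the robots.txt file

target = SITE + "/some/page.html"
if rp.can_fetch(USER_AGENT, target):
    print("robots.txt allows fetching", target)
else:
    print("robots.txt disallows fetching", target)
```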

From my point of view, if a website owner gets offended because I accessed his website with an automated program that doesn't cause any more of a burden than a typical user, that's an attitude problem on his part. He can, however, get offended if I break copyright law by misusing the data I recovered from his website (just as if I had gotten it «manually»).

That being said, I doubt people attempting to recover data from Facebook often have respect for the capacity of their servers, ergo their reaction.
 
  • Like
Likes WWGD
  • #22
Information on web pages can have copyright restrictions. Reading the data should be no problem, but using it in a publication without following copyright rules may lead to legal problems.
 

1. What is web scraping?

Web scraping is the process of extracting data from websites using automated tools or bots. It involves parsing the HTML code of a website and extracting specific information, such as text, images, or links.

2. How does web scraping work?

Web scraping works by using a web crawler, also known as a bot or spider, to navigate through web pages and extract the desired information. The crawler follows hyperlinks on a website and downloads the HTML code, which is then parsed to extract the relevant data.
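A very stripped-down sketch of that idea in Python (requests plus Beautiful Soup) might look like the following; the starting URL is a placeholder, and a real crawler would need error handling, deduplication of query strings, and respect for robots.txt.

```python
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://example.com/"  # placeholder starting point
MAX_PAGES = 10                      # keep the crawl tiny

domain = urlparse(START_URL).netloc
seen = set()
queue = [START_URL]

while queue and len(seen) < MAX_PAGES:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)

    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    print(url, "->", soup.title.string if soup.title else "(no title)")

    # Follow hyperlinks, staying on the same domain.
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == domain:
            queue.append(link)

    time.sleep(1)  # small politeness delay between requests
```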

3. Is web scraping legal?

The legality of web scraping depends on how the data is being used. Generally, web scraping for personal use or non-commercial purposes is considered legal. However, scraping data for commercial purposes or without the website owner's permission may be a violation of their terms of use or copyright.

4. What is the difference between HTML and text?

HTML is a markup language used to create web pages, whereas text is the actual content of a web page. HTML includes tags that define the structure and formatting of a web page, while the text is the visible content that users read on the page.
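As a tiny illustration with Beautiful Soup (the HTML fragment below is made up):

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment: markup (tags) plus visible text.
html = "<h1>Scraping 101</h1><p>Turn <b>HTML</b> into plain text.</p>"

soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(separator=" ", strip=True))
# -> Scraping 101 Turn HTML into plain text.
```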

5. How can I use web scraping for my research or business?

Web scraping can be a valuable tool for research and business purposes. It can help gather large amounts of data quickly and efficiently, which can then be analyzed to gain insights and inform decision-making. However, it is important to ensure that the data is being used ethically and legally.
