Web Scraping: HTML to Text (Streaming). Suggestions?


Discussion Overview

The discussion revolves around web scraping techniques, specifically converting HTML to text, with a focus on using Python and potential intermediate steps involving Excel and CSV files. Participants explore various tools, legal considerations, and ethical implications related to scraping data from websites.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant expresses a desire to scrape HTML to text using Python, mentioning a lack of expertise and seeking general input.
  • Another participant suggests using JSoup, a Java program, and questions its similarity to Python's Beautiful Soup.
  • A participant shares their experience with Beautiful Soup, noting that it often requires trial and error to achieve desired results.
  • Concerns are raised about copyright issues and the perception of scraping as hostile to website owners, with some participants discussing the legality of scraping public web pages.
  • Participants debate the implications of using automated programs to access websites, with some arguing that it is similar to human browsing, while others caution about the potential for being blocked or blacklisted.
  • One participant mentions the need for permission from website owners before scraping, while another counters that as long as requests are not excessive, permission may not be necessary.
  • Discussion includes the notion that webmasters may provide structured data for scraping, yet questions why certain companies, like Facebook, react strongly against scraping activities.

Areas of Agreement / Disagreement

Participants express a mix of views on the legality and ethics of web scraping, with no clear consensus on whether permission is required or the implications of scraping on website owners. The discussion remains unresolved regarding the best practices and legal boundaries of web scraping.

Contextual Notes

Participants highlight various assumptions about the nature of web scraping, including the differences in server responses to automated requests versus human browsing, and the potential consequences of scraping on website resources.

WWGD
Science Advisor, Homework Helper
Hi All,
I am trying to figure out how to scrape HTML to text (or use an API where one is available), maybe with an intermediate step using Excel and .csv files. EDIT: Preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
Thanks.
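As a rough starting point, here is a minimal stdlib-only sketch of the HTML-to-text step plus the CSV intermediate; the sample page and the CSV handling are purely illustrative, and real pages usually need per-site tweaking:

```python
import csv
import io
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

# Hypothetical page; a real run would fetch this with urllib or requests.
page = "<html><body><h1>Title</h1><script>var x = 1;</script><p>Hello world</p></body></html>"
text = html_to_text(page)

# The intermediate CSV step: write the text as one row (in-memory here).
buf = io.StringIO()
csv.writer(buf).writerow([text])
```

The same `html_to_text` can be mapped over many fetched pages, with one CSV row per page.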
 
Try a java program called JSoup.
 
Borg said:
Try a java program called JSoup.
Thanks Borg, is that the same as Python's Beautiful Soup? I am familiar with Python, not so much with Java.
 
And I'm not versed in Python. It's been on my to-do list way too long...
 
Borg said:
And I'm not versed in Python. Been on my todo list way too long...
Feel free to ask a question, though I am no expert; that way we can level the answers from ##\infty \rightarrow 0## to ##\infty \rightarrow 1##: your answers (to my questions) against my answers (to your questions) ;).
 
WWGD said:
Hi All,
I am trying to figure out how to scrape HTML to text (or use an API where one is available), maybe with an intermediate step using Excel and .csv files. EDIT: Preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
Thanks.

I have done that using Beautiful Soup. It works pretty well, but you often have to tweak it by trial and error to get what you want.
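For what it's worth, the trial-and-error in Beautiful Soup is mostly in the CSS selectors; a minimal sketch, where the markup and class names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; real pages rarely have class names this tidy.
html = """
<div class="item"><span class="name">Widget</span><span class="price">$9.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">$19.99</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each tweak cycle usually amounts to adjusting these selectors
# until the extracted rows look right.
rows = [
    (item.select_one(".name").get_text(strip=True),
     item.select_one(".price").get_text(strip=True))
    for item in soup.select("div.item")
]
```

Printing `rows` after each tweak is the quickest feedback loop I know of.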
 
WWGD said:
Thanks Borg, is that the same as Python's Beautiful soup? I am familiar with Python, not too much with Java.
Beautiful Soup and JSoup both appear in Wikipedia's comparison of HTML parsers.
 
Hi,
Another point: I have been hearing about both copyright and general issues regarding scraping; specifically, some forms of scraping are seen as hostile to the owner of the site, while some data held in the page is deemed private. Does anyone know more about this?
 
About copyright, I guess it depends on the type of information and what you do with it.

About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
 
  • #10
There aren't any major differences between visiting a page yourself or using a program to visit it for you. The requests look nearly identical on the server. The only issue might be if you use a program to retrieve hundreds of pages at a time. Even then, most large sites wouldn't even notice it.
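One practical consequence of this: what the server mostly sees is your request headers. A stdlib sketch of building a request that identifies itself; the User-Agent string here is illustrative, and many site owners appreciate an honest one with contact info:

```python
from urllib.request import Request

# Build (but don't send) a request; urlopen(req) would actually fetch it.
req = Request(
    "https://example.com/page.html",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyScraper/0.1)"},
)

# urllib normalizes header names to capitalized form internally.
ua = req.get_header("User-agent")
```

From the server's point of view this request is indistinguishable from a browser's, apart from whatever the User-Agent claims.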
 
  • #11
Borg said:
The only issue might be if you use a program to retrieve hundreds of pages at a time.
Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
 
  • #12
jtbell said:
Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
I have a program that makes 7500 page requests from an SEC web site, processes the data and stores it in a database. It completes in about 3 - 5 minutes. So that's somewhere around 1500 - 2500 page requests per minute. I wouldn't want to put even a one second delay on the requests because it would then take two hours to run.
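Whatever delay you settle on, it is a one-liner to bolt on; a sketch of a throttled loop, with a stand-in for the real fetch function and a token delay so the example stays quick to run:

```python
import time

def throttled_map(fetch, urls, delay=5.0):
    """Apply fetch() to each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i:                      # no sleep before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Stand-in fetch function and a tiny delay, purely for illustration;
# a real run would pass something like urllib.request.urlopen.
pages = throttled_map(lambda u: u.upper(), ["a.html", "b.html"], delay=0.01)
```

The trade-off Borg describes falls out directly: at `delay=1.0`, 7500 URLs take a bit over two hours.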
 
  • #13
Yeah, I agree a major "industrial strength" site on its own server is different from a small business or hobby site on a shared server, like mine. Also text (small files) versus images (large files). My site is oriented around images, and I sometimes get people scraping hundreds of them in one go. My hosting plan doesn't have a bandwidth limit per se, but it does have limits on memory usage, number of simultaneous processes, etc., and I've occasionally hit them.
 
  • #14
Thanks, all. Yes, I agree the copyright thing does not seem to make sense. And the other problem is the means by which the scraping is done and the effect it may have on the resources of the host.
 
  • #15
note: Do not run a scraper on Facebook.

As I recall: if they catch you using one, they will permanently deactivate your account.
 
  • #16
StoneTemplePython said:
note: Do not run a scraper on Facebook.

As I recall: if they catch you using one, they will permanently deactivate your account.
Thanks; you mean my non-existent account (I'm one of around 7 people in the world without a FB account)? Still, I will be cautious. Interesting that the company that deals with private info in questionable ways is ready to swiftly punish those who do the same with its data.
 
  • #17
WWGD said:
Thanks; you mean my non-existent account (I'm one of around 7 people in the world without a FB account)? Still, I will be cautious. Interesting that the company that deals with private info in questionable ways is ready to swiftly punish those who do the same with its data.

Yes. FB and some other giant tech firms out there take a muscular approach to outsiders using web scrapers -- basically because they can. (Not sure exactly who else does this, but the thing about FB is that you need to log in first to view any content, hence they have leverage over you via your login credentials.)
 
  • #18
jack action said:
About copyright, I guess it depends on the type of information and what you do with it.

About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
Not only can they (we) make your task harder, but they can make your general online experience less enjoyable by registering you on suspicious IP blacklists. We use a variety of IP Intelligence sources and L4-L7 traffic analytics techniques to block and report abusive activity to our 15+ public state websites.
Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
 
  • #19
stoomart said:
Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
I don't think you need special permission to visit a website in an automated fashion as long as you don't flood it with requests. Otherwise, one would expect Google to ask permission of every website it visits (i.e. the entire world wide web).

A request by a program or a user is exactly the same thing from the point of view of the server.

Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.
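That structured data often arrives as JSON-LD inside a script tag of type `application/ld+json`; a stdlib sketch of extracting it, with hypothetical sample markup:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect and parse <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_ld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_ld = True

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            # Script content may arrive in chunks, so parse on close.
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf, self._in_ld = [], False

# Hypothetical page embedding schema.org data as JSON-LD.
page = '''<html><head>
<script type="application/ld+json">{"@type": "Article", "headline": "Example"}</script>
</head><body></body></html>'''
extractor = JsonLdExtractor()
extractor.feed(page)
```

When a site publishes JSON-LD like this, parsing it is far more robust than scraping the rendered HTML around it.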
 
  • #20
jack action said:
I don't think you need special permission to visit a website in an automated fashion as long as you don't flood it with requests. Otherwise, one would expect Google to ask permission of every website it visits (i.e. the entire world wide web).

A request by a program or a user is exactly the same thing from the point of view of the server.

Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.

But why then do Facebook and others react the way they do, fighting and fending off scraping? EDIT: I am not disagreeing with you (I don't know enough to), just curious about your take.
 
  • #21
WWGD said:
But why then do Facebook and others react the way they do, fighting and fending off scraping? EDIT: I am not disagreeing with you (I don't know enough to), just curious about your take.
All I'm saying is that I don't think there are laws forbidding accessing a website with a program without the owner's permission (after all, a web browser is itself a kind of program for accessing a website). Even honoring the «robots.txt» file is a courtesy, not an obligation under any law. But there are no laws either that force a website owner to honor every request he gets, so what Facebook and others are doing is legal as well.

From my point of view, if a website owner gets offended because I accessed his website with an automated program that causes no more burden than a typical user, that's an attitude problem on his part. He can, however, legitimately object if I break a copyright law by misusing the data I recovered from his website (just as if I had gotten it «manually»).

That being said, I doubt people attempting to recover data from Facebook often have respect for the capacity of their servers, ergo their reaction.
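On the «robots.txt» point: courtesy or not, checking it is cheap with the stdlib. A sketch parsing a made-up robots.txt body directly (RobotFileParser can also fetch it from a live site via set_url() and read()):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Hypothetical robots.txt content; normally fetched from the site itself.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

allowed = rp.can_fetch("MyBot", "https://example.com/articles/1.html")
blocked = rp.can_fetch("MyBot", "https://example.com/private/1.html")
delay = rp.crawl_delay("MyBot")   # seconds the site asks crawlers to wait
```

A scraper that honors both the Disallow rules and the Crawl-delay is much less likely to end up on the blacklists mentioned above.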
 
  • #22
Information on web pages can have copyright restrictions. Reading the data should be no problem, but using it in a publication without following copyright rules may lead to legal problems.
 
