Web Scraping: HTML to Text (Streaming). Suggestions?

  1. Jun 26, 2017 #1

    WWGD

    Science Advisor
    Gold Member

    Hi All,
    I am trying to figure out how to scrape HTML to text (if necessary; I would use an API otherwise), maybe with an intermediate step using Excel and .csv files. EDIT: Preferably using Python. I have some idea of how to go about it, but I am no expert. I would appreciate input in general.
    Thanks.
     
    Last edited: Jun 26, 2017
  3. Jun 26, 2017 #2

    Borg

    Gold Member

    Try a Java library called jsoup.
     
  4. Jun 26, 2017 #3

    WWGD

    Science Advisor
    Gold Member

    Thanks, Borg. Is that the same as Python's Beautiful Soup? I am familiar with Python, but not so much with Java.
     
  5. Jun 26, 2017 #4

    Borg

    Gold Member

    And I'm not versed in Python. Been on my todo list way too long...
     
  6. Jun 26, 2017 #5

    WWGD

    Science Advisor
    Gold Member

    Feel free to ask a question, though I am no expert; that way we can level the answers from ##\infty \rightarrow 0## to ##\infty \rightarrow 1##, your answers (to my questions) to my answers (to your questions) ;).
     
  7. Jun 26, 2017 #6

    stevendaryl

    Staff Emeritus
    Science Advisor

    I have done that using Beautiful Soup. It works pretty well, but you often have to tweak it by trial and error to get what you want.
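
    As a rough illustration of the HTML-to-text-to-CSV idea from the first post, here is a minimal sketch. It deliberately uses only the standard library's html.parser rather than Beautiful Soup, so it runs without installing anything. The class name, function names, sample HTML and CSV columns are all made up for the example, and real pages will need the trial-and-error tweaking mentioned above.

```python
import csv
import io
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text_chunks(html):
    """Return the non-empty text fragments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def chunks_to_csv(chunks):
    """Write the fragments as CSV text with hypothetical columns index,text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["index", "text"])
    for i, chunk in enumerate(chunks):
        writer.writerow([i, chunk])
    return buffer.getvalue()


# Hypothetical sample page, used only to exercise the sketch.
sample = (
    "<html><body><h1>Title</h1><p>Hello, <b>world</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)
chunks = html_to_text_chunks(sample)
csv_text = chunks_to_csv(chunks)
print(csv_text)
```

    The CSV text can be saved to a file and opened in Excel. With Beautiful Soup the extraction step is shorter, but the overall flow (parse, pull out text, write rows) would be similar.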
     
  8. Jun 27, 2017 #7

    Borg

    Gold Member

    Beautiful Soup and JSoup are both on a list of HTML parsers on a Wiki comparison page.
     
  9. Jul 13, 2017 #8

    WWGD

    Science Advisor
    Gold Member

    Hi,
    Another point: I have been hearing about both copyright and general issues regarding scraping; specifically, some forms of scraping are seen as hostile to the owner of the site, while some data held in the page is deemed private. Does anyone know more about this?
     
  10. Jul 13, 2017 #9

    jack action

    Science Advisor
    Gold Member

    About copyright, I guess it depends on the type of information and what you do with it.

    About a web page being private ... good one! By definition - unless you hack some login page - when it is on the web, it's public. But if you retrieve a web page with a bot, the website owner can make your task harder to do.
     
  11. Jul 14, 2017 #10

    Borg

    Gold Member

    There aren't any major differences between visiting a page yourself or using a program to visit it for you. The requests look nearly identical on the server. The only issue might be if you use a program to retrieve hundreds of pages at a time. Even then, most large sites wouldn't even notice it.
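
    One place where the requests can differ is the User-Agent header. As a small sketch, assuming Python's standard urllib is the client: its default opener announces itself as Python-urllib, and a script can set its own value instead. The agent string and contact address below are invented for the example, and nothing here touches the network.

```python
import urllib.request

# The default opener carries a User-agent header that identifies urllib.
default_headers = dict(urllib.request.build_opener().addheaders)
default_agent = default_headers["User-agent"]
print(default_agent)

# A request object with an explicit, descriptive agent string (hypothetical value).
request = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "my-scraper/0.1 (contact: me@example.com)"},
)
# urllib stores header names capitalized as "User-agent".
custom_agent = request.get_header("User-agent")
print(custom_agent)
```

    Whether a given site pays any attention to this header is up to that site.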
     
  12. Jul 14, 2017 #11

    jtbell


    Staff: Mentor

    Especially if you do it very quickly. A time-delay between requests is probably helpful. Maybe 5 or 10 seconds?
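
    A delay like that could be sketched roughly as follows. The function name and its parameters are hypothetical, and the fetcher is a stand-in so that the example runs without any network access; the 5 seconds is just the lower figure suggested above.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real download; returns a dummy page."""
    return "<html>" + url + "</html>"


pauses = []
pages = fetch_politely(
    ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    fetch=fake_fetch,
    delay_seconds=5.0,
    sleep=pauses.append,  # record the pauses instead of actually waiting
)
print(len(pages), pauses)
```

    In real use, fetch could be something like a small wrapper around urllib.request.urlopen, and the sleep argument would be left at its default.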
     
  13. Jul 14, 2017 #12

    Borg

    Gold Member

    I have a program that makes 7500 page requests from an SEC web site, processes the data and stores it in a database. It completes in about 3 - 5 minutes. So that's somewhere around 1500 - 2500 page requests per minute. I wouldn't want to put even a one second delay on the requests because it would then take two hours to run.
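
    The arithmetic behind those figures, as a quick check (the variable names are only for illustration):

```python
total_requests = 7500
run_minutes = (3, 5)

# Requests per minute at the fast and slow ends of the observed run time.
rates = [total_requests / m for m in run_minutes]

# Extra time added by a one-second pause per request.
added_minutes = total_requests * 1 / 60
added_hours = added_minutes / 60

print(rates, added_minutes, round(added_hours, 2))
```

    That is 1500 to 2500 requests per minute, and a one-second delay would add 125 minutes, or a little over two hours.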
     
    Last edited: Jul 14, 2017
  14. Jul 14, 2017 #13

    jtbell


    Staff: Mentor

    Yeah, I agree a major "industrial strength" site on its own server is different from a small business or hobby site on a shared server, like mine. Also text (small files) versus images (large files). My site is oriented around images, and I sometimes get people scraping hundreds of them in one go. My hosting plan doesn't have a bandwidth limit per se, but it does have limits on memory usage, number of simultaneous processes, etc., and I've occasionally hit them.
     
  15. Jul 14, 2017 #14

    WWGD

    Science Advisor
    Gold Member

    Thanks all. Yes, I agree the Copyright thing does not seem to make sense. And the other problem is the means by which the scraping is done, and the effect it may have on the resources of the host.
     
  16. Jul 14, 2017 #15
    Note: Do not run a scraper on Facebook.

    As I recall: if they catch you using one, they will permanently deactivate your account.
     
  17. Jul 14, 2017 #16

    WWGD

    Science Advisor
    Gold Member

    Thanks; you mean my non-existent account (I'm one of around 7 people in the world without a FB account)? Still, I will be cautious. Interesting that the company that deals with private info in questionable ways is ready to swiftly punish those who do the same with their data.
     
  18. Jul 14, 2017 #17
    Yes. FB and some other giant tech firms out there take a muscular approach to outsiders using web scrapers, basically because they can. (Not sure exactly who else does this, but the thing about FB is that you need to log in first to view any content, and hence they have leverage over you via your login credentials.)
     
  19. Jul 14, 2017 #18
    Not only can they (we) make your task harder, but they can make your general online experience less enjoyable by registering you on suspicious IP blacklists. We use a variety of IP Intelligence sources and L4-L7 traffic analytics techniques to block and report abusive activity to our 15+ public state websites.
    Whenever you want to scrape someone's website in any automated fashion, you should always get permission from the owner.
     
    Last edited: Jul 14, 2017
  20. Jul 14, 2017 #19

    jack action

    Science Advisor
    Gold Member

    I don't think you need special permission to visit a website in an automated fashion, as long as you don't flood the website with requests. Otherwise, one would expect Google to ask permission from every website it visits (i.e. the entire world wide web).

    A request by a program or a user is exactly the same thing from the point of view of the server.

    Heck, webmasters often include structured data (like schema.org) specifically for automated web scraping.
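
    For instance, schema.org data is often embedded as JSON-LD inside a script tag of type application/ld+json. Here is a minimal sketch of pulling it out with the standard library only; the class name and the sample page are made up for the example, and real pages may use other encodings of structured data (such as microdata) that this does not handle.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect JSON-LD objects from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


# Hypothetical page with one schema.org Article description.
sample = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Example headline"}
</script>
</head><body><p>Body text</p></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(sample)
extractor.close()
items = extractor.items
print(items)
```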
     
  21. Jul 15, 2017 #20

    WWGD

    Science Advisor
    Gold Member

    But why, then, do Facebook and others react the way they do in fighting and fending off scraping? EDIT: I am not disagreeing with you (I don't know enough to), just curious about your take.
     
    Last edited: Jul 15, 2017