Best way to get data from a website that is not obviously tabulated

  • Context: Python 
  • Thread starter: Hercuflea
  • Tags: Data

Discussion Overview

The discussion revolves around methods for extracting data from a complex webpage that does not present information in a straightforward tabulated format. Participants explore various programming approaches, particularly using JavaScript and Python, to access and analyze the data associated with counties and their values.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant describes their experience using Python's Beautiful Soup (BS4) for scraping tabulated data but finds the current webpage more complicated due to its structure.
  • Another participant suggests a JavaScript snippet to extract data from the webpage, noting the fields available in the dataset.
  • A subsequent reply indicates issues with the initial JavaScript code, leading to an infinite loop and undefined values.
  • A correction is made regarding a typo in the JavaScript code, which resolves the previous issues and allows data extraction to work correctly.
  • One participant modifies the script to include additional fields but raises a concern about the lack of state information for counties with the same name.
  • Another participant attempts to list all properties of the dataset but encounters difficulties, leading to a clarification about the correct method to access properties.
  • Several participants point out that the data source can be found in the page's source code, providing a direct link to a CSV file containing the data.
  • One participant suggests using Selenium for web scraping, while another shares their past experience using Perl for similar tasks, highlighting the challenges posed by changing HTML structures.
  • A question is raised about how services like Mint.com manage to extract data from numerous banks, considering the variability in HTML structures.

Areas of Agreement / Disagreement

Participants propose various tools and methods for web scraping: some favor running JavaScript in the browser, while others suggest alternatives such as Selenium or Perl. There is no consensus on a single best approach, and the discussion remains open-ended regarding the most effective techniques.

Contextual Notes

Limitations include the potential variability in HTML structures that may affect scraping methods, as well as the need for state information to differentiate counties with the same name.

Hercuflea
Hello,

I'm trying to download and analyze the data from this link.

I've used Python's BS4 to read tabulated data from a website before, but this webpage is more complicated than any I have seen. It's not set up as a table (at least as far as I can tell using Inspect Element). Is there a way to read each county and its associated values (the ones you see on mouse-over)?
 
Try this:
1) Open that page in your web browser (I am using Firefox).
2) Open this url on top of it: javascript:for(var i=0;i<dataset.length;i++){var ds=dataset;document.writeln("<br>"+i+":"+ds.paid+", "+ds.paidrank);}

There were 3143 items (at 8:01pm EST).
Fields in dataset include: paid, paidrank, sharerank, cfips, cname, homeval, and homevalrank. There may be others.
You should have no problem modifying that JavaScript code to generate a table in whatever format you want.
 
Hmm, it returns 3143 items, but they all say "undefined,undefined". And it seems to be stuck in an infinite loop.
 
There's a typo:

Oh, I see. The forum software interpreted the sequence "[", "i", "]" as an instruction to switch to italics and removed it from the code.
The statement ds=dataset should be ds=dataset[i], i.e. dataset indexed by i.
 
Awesome! It's working now. Thank you!
 
I modified the script to:
JavaScript:
javascript:
/* Print every known field for each county entry in the page's global "dataset" array. */
for (var i = 0; i < dataset.length; i++) {
    var ds = dataset[i];
    document.writeln("<br>" + i + ": cname: " + ds.cname + " paid: " + ds.paid +
        " paidrank: " + ds.paidrank + " cfips: " + ds.cfips +
        " homeval: " + ds.homeval + " homevalrank: " + ds.homevalrank +
        " share: " + ds.share + " sharerank: " + ds.sharerank);
}

This is great, but the "cname" field apparently contains only the county name and does not include the state. That matters because counties in different states can share a name, so when I sort the counties into states later, I won't be able to tell those same-named counties apart.

Do you know how I could get a list of all the properties of "dataset"? (looking for the one with the state name)

I tried:
JavaScript:
for (var property in dataset) {
    if (dataset.hasOwnProperty(property)) {
        document.writeln(property);
    }
}

to no avail.
 
When I tried, I got this:
cname state cfips paid paidrank homeval homevalrank share sharerank

The key is "var property in dataset[0]", not "var property in dataset".
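
For the Python side of the analysis, the same idea carries over: the field names live on each record, not on the list that holds the records. The following is a minimal sketch, assuming the page's `dataset` array has been exported as JSON text (for example with `JSON.stringify(dataset)`, which is an assumption and not something done in this thread). The sample records, their values, and the format of the `state` field are invented placeholders, and the helper names are hypothetical. It lists the fields of the first record and keys each county by (state, cname) so that same-named counties in different states stay distinct.

```python
import json

# Invented placeholder records; real values would come from the page's "dataset" array.
# Whether "state" holds an abbreviation or a full name is an assumption.
SAMPLE_JSON = """
[
  {"cname": "Washington", "state": "AL", "cfips": "0", "paid": 1.0},
  {"cname": "Washington", "state": "OR", "cfips": "1", "paid": 2.0},
  {"cname": "Autauga", "state": "AL", "cfips": "2", "paid": 3.0}
]
"""


def field_names(records):
    """Return the field names of the first record (the analogue of `for ... in dataset[0]`)."""
    return list(records[0].keys()) if records else []


def index_by_state_and_county(records):
    """Map (state, cname) -> record so same-named counties in different states do not collide."""
    index = {}
    for rec in records:
        key = (rec["state"], rec["cname"])
        if key in index:
            raise ValueError(f"duplicate county key: {key}")
        index[key] = rec
    return index


records = json.loads(SAMPLE_JSON)
fields = field_names(records)
by_county = index_by_state_and_county(records)
print(fields)
print(sorted(by_county))
```

Keying on the pair instead of on `cname` alone is a design choice for readability; a unique county code, if the data provides one, would serve the same purpose.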
 
I would use Selenium to drive the web browser and let it parse the page for me.
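
A rough sketch of what that might look like in Python, assuming (as the earlier replies suggest) that the page exposes its data as a global JavaScript array named `dataset`. The function name `fetch_dataset` and the `PAGE_URL` placeholder are hypothetical, since the thread does not give the URL. `driver.get` and `driver.execute_script` are standard Selenium WebDriver calls; the Selenium import is deferred to `main()` so the helper can also be exercised with a stand-in driver.

```python
def fetch_dataset(driver, url, variable="dataset"):
    """Load `url` in the given WebDriver and return the page's JS array as Python objects."""
    driver.get(url)
    # execute_script converts the returned JavaScript array of objects
    # into a Python list of dicts.
    return driver.execute_script(f"return {variable};")


def main():
    # Imported here so the helper above does not require Selenium to be installed.
    from selenium import webdriver

    PAGE_URL = "..."  # placeholder: the page URL is not given in the thread
    driver = webdriver.Firefox()
    try:
        rows = fetch_dataset(driver, PAGE_URL)
        print(len(rows), "records; fields:", sorted(rows[0]))
    finally:
        driver.quit()
```

Calling `main()` after filling in the URL would run it against a real browser. If the page fills `dataset` asynchronously, an explicit wait before reading the variable may be needed.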
 
In the past I've used Perl for this sort of thing. But even with Perl's nice regular expression handling, there was still the problem of web sites that would change some detail of the HTML, such as changing from <b> to <strong>, and then your search function had to be changed. Of course if it's a one-off, no worries about that. In any case, I used Perl, running from my desktop, along with SQLite, to take advantage of Perl's very nice database handling. It was all fairly easy, aside from the "moving target" issue.
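
One way to soften the "moving target" problem is to parse the HTML and match on meaning instead of on an exact tag string. Below is a minimal Python sketch using the standard library's `html.parser`; the class name `EmphasisText`, the helper `emphasized`, and the sample markup are made up for illustration. It collects the text inside either `<b>` or `<strong>`, so a switch from one tag to the other would not require a code change, though larger layout changes still would.

```python
from html.parser import HTMLParser


class EmphasisText(HTMLParser):
    """Collect the text found inside <b> or <strong> elements."""

    TAGS = {"b", "strong"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many matching tags are currently open
        self.found = []   # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.found.append(data.strip())


def emphasized(html):
    """Return the list of text fragments that appear in bold/strong markup."""
    parser = EmphasisText()
    parser.feed(html)
    parser.close()
    return parser.found


old_markup = "<p>Price: <b>42</b></p>"
new_markup = "<p>Price: <strong>42</strong></p>"
print(emphasized(old_markup), emphasized(new_markup))
```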
 
How do sites like Mint.com do this? They have to pull data from a huge number of banks, each of which stores it in a different way, and a change to even one tag in the HTML would cause the search to fail.
 
