Best way to get data from a website that is not obviously tabulated

Hercuflea · Feb 23, 2017

Hello,

I'm trying to download and analyze the data from this link.

I've used Python BS4 to read tabulated data from a website before, but this webpage is more complicated than any I have seen. It's not set up as a table (at least as far as I can tell using inspect element). Is there a way to read each county and its associated values (the ones that you mouse over to see)?

.Scott · Feb 23, 2017

Hercuflea said:

Hello,

I'm trying to download and analyze the data from this link.

I've used Python BS4 to read tabulated data from a website before, but this webpage is more complicated than any I have seen. It's not set up as a table (at least as far as I can tell using inspect element). Is there a way to read each county and its associated values (the ones that you mouse over to see)?

Try this:
1) Open that page in your web browser (I am using Firefox).
2) Open this url on top of it: javascript:for(var i=0;i<dataset.length;i++){var ds=dataset;document.writeln("<br>"+i+":"+ds.paid+", "+ds.paidrank);}

There were 3143 items (at 8:01pm EST).
Fields in dataset include: paid, paidrank, sharerank, cfips, cname, homeval, and homevalrank. There may be others.
You should have no problem modifying that javascript code to generate whatever table in whatever format you want.

Hercuflea · Feb 23, 2017

Hmm, it returns 3143 items, but they all say "undefined,undefined". And it seems to be stuck in an infinite loop.

.Scott · Feb 23, 2017

There's a typo:

Oh, I see. The forum software read the sequence [i] as an instruction to switch to italics and dropped it from the code.
The statement ds=dataset should be ds=dataset[i].
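
With the subscript restored, the loop can also be sketched as a function that builds comma-separated text instead of writing HTML lines into the page. This is only an illustration: `rowsToCsv` and the two sample records are made up, and on the real page you would pass in the page's global `dataset` array.

```javascript
// Hypothetical helper: turn an array of records into CSV text.
// `fields` lists the property names to export, in column order.
function rowsToCsv(records, fields) {
  var lines = [fields.join(",")];            // header row
  for (var i = 0; i < records.length; i++) {
    var ds = records[i];                     // note the [i] subscript
    var cells = [];
    for (var j = 0; j < fields.length; j++) {
      var text = String(ds[fields[j]]);
      // Quote cells that contain commas, quotes, or newlines.
      if (/[",\n]/.test(text)) {
        text = '"' + text.replace(/"/g, '""') + '"';
      }
      cells.push(text);
    }
    lines.push(cells.join(","));
  }
  return lines.join("\n");
}

// Made-up sample records standing in for the page's `dataset`.
var sample = [
  { paid: 1000, paidrank: 2 },
  { paid: 2500, paidrank: 1 }
];
var csvText = rowsToCsv(sample, ["paid", "paidrank"]);
```

On the page itself, something like `rowsToCsv(dataset, ["cname", "paid", "paidrank"])` should give text that can be pasted into a file, assuming `dataset` is still reachable as a global.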

Hercuflea · Feb 23, 2017

Awesome! It's working now. Thank you!

Hercuflea · Feb 23, 2017

I modified the script to:

JavaScript:

javascript:
for(var i=0; i < dataset.length; i++ ){
    var ds = dataset[i];
    document.writeln("<br>" + i + ": " + " cname: " + ds.cname + " paid: " + ds.paid + " paidrank: " + ds.paidrank + " cfips: " + ds.cfips + " homeval: " + ds.homeval + " homevalrank: " + ds.homevalrank + " share: " + ds.share + " sharerank: " + ds.sharerank);
}

This is great, but the variable "cname" apparently only contains the county names, and does not include the state. This wouldn't be a problem, but sometimes two states can have a county with the same name, so when I go to categorize them into states later, I will have a problem with those counties with the same names.

Do you know how I could get a list of all the properties of "dataset"? (looking for the one with the state name)

I tried:

JavaScript:

for (var property in dataset) {
    if (dataset.hasOwnProperty(property)) {
        document.writeln(property);
    }
}

to no avail.

.Scott · Feb 23, 2017

When I tried, I got this:
cname state cfips paid paidrank homeval homevalrank share sharerank

The key is "var property in dataset[0]", not "var property in dataset".
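
The difference shows up on a small stand-in array. Since `dataset` is an array, `for...in` over it visits the indices ("0", "1", ...), while looping over one element visits the field names. The records and values below are invented placeholders; only the field names come from the output above, and `ownKeys` is a hypothetical helper.

```javascript
// Invented stand-in for the page's `dataset` array.
var sampleDataset = [
  { cname: "Example", state: "State A", cfips: "00001" },
  { cname: "Example", state: "State B", cfips: "00002" }
];

// Collect an object's own enumerable property names.
function ownKeys(obj) {
  var keys = [];
  for (var property in obj) {
    if (Object.prototype.hasOwnProperty.call(obj, property)) {
      keys.push(property);
    }
  }
  return keys;
}

var arrayKeys = ownKeys(sampleDataset);      // array indices, as strings
var recordKeys = ownKeys(sampleDataset[0]);  // field names of one record

// Combining county and state tells same-named counties apart.
var labels = sampleDataset.map(function (ds) {
  return ds.cname + ", " + ds.state;
});
```

If `cfips` is a county FIPS code, as the name suggests, it would probably serve as a unique key on its own.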

jack action · Feb 24, 2017

If you look at the source code of the page, you can see on line 111 that the data comes from /framed/~/media/multimedia/interactives/2013/property_taxes/propcounty.csv

.Scott · Feb 24, 2017

jack action said:

If you look at the source code of the page, you can see on line 111 that the data comes from /framed/~/media/multimedia/interactives/2013/property_taxes/propcounty.csv

Very good. I had been looking for that.
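
Once the CSV file has been downloaded, it can be turned into records without going through the page at all. The sketch below is a simple parser that assumes a header row and handles double-quoted cells; the `parseCsv` name and the sample text are illustrative, and the real file's columns and quoting may differ.

```javascript
// Minimal CSV parser (illustrative): handles quoted cells and "" escapes.
function parseCsv(text) {
  var rows = [], row = [], cell = "", inQuotes = false;
  for (var i = 0; i < text.length; i++) {
    var ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { cell += '"'; i++; }
      else if (ch === '"') { inQuotes = false; }
      else { cell += ch; }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(cell); cell = "";
    } else if (ch === "\n") {
      row.push(cell); rows.push(row); row = []; cell = "";
    } else if (ch !== "\r") {
      cell += ch;
    }
  }
  // Flush a final row that has no trailing newline.
  if (cell !== "" || row.length > 0) { row.push(cell); rows.push(row); }
  // First row is the header; map every later row to an object.
  var header = rows.shift() || [];
  return rows.map(function (r) {
    var record = {};
    header.forEach(function (name, k) { record[name] = r[k]; });
    return record;
  });
}

// Made-up sample text; the real file's contents are not shown in this thread.
var sampleCsv = 'cname,state,paid\n"Example, East",State A,1000\nExample,State B,2500\n';
var records = parseCsv(sampleCsv);
```

Every value comes back as a string, so numeric columns would need converting (for example with `Number(...)`) before any analysis.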

newjerseyrunner · Feb 24, 2017

I would use selenium to drive the web browser to parse it for me.

Aufbauwerk 2045 · Feb 26, 2017

In the past I've used Perl for this sort of thing. But even with Perl's nice regular expression handling, there was still the problem of web sites that would change some detail of the HTML, such as changing from <b> to <strong>, and then your search function had to be changed. Of course if it's a one-off, no worries about that. In any case, I used Perl, running from my desktop, along with SQLite, to take advantage of Perl's very nice database handling. It was all fairly easy, aside from the "moving target" issue.
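
The "moving target" issue can be softened a little by writing patterns that accept either spelling of a tag. Below is a small sketch, in JavaScript rather than Perl, with an invented HTML snippet and a hypothetical `extractBold` helper; a real page would likely be better served by a proper HTML parser than by a regular expression.

```javascript
// Match text inside <b>...</b> or <strong>...</strong>, case-insensitively.
// \1 requires the closing tag to match whichever opening tag was found.
var boldPattern = /<(b|strong)>(.*?)<\/\1>/gi;

function extractBold(html) {
  var results = [];
  var match;
  boldPattern.lastIndex = 0;               // reset state of the global regex
  while ((match = boldPattern.exec(html)) !== null) {
    results.push(match[2]);                // captured text between the tags
  }
  return results;
}

// Invented markup before and after a hypothetical site change.
var oldMarkup = "<p><b>Total:</b> 42</p>";
var newMarkup = "<p><strong>Total:</strong> 42</p>";
var fromOld = extractBold(oldMarkup);
var fromNew = extractBold(newMarkup);
```

This only covers changes anticipated in advance; tags with attributes or nested markup would still slip through.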

Hercuflea · Feb 26, 2017

How do sites like Mint.com do stuff like this? They have to pull data from a huge number of banks, each of which stores it differently. Wouldn't a change to even one HTML tag cause the search to fail?
