Best way to get data from a website that is not obviously tabulated

  • Python
  • Thread starter Hercuflea
  • Tags
    Data
In summary: Open the page in a web browser, then run a javascript: URL on top of it that loops over the page's dataset array and prints fields such as ds.paid and ds.paidrank. In the snippet as first posted, the forum formatting swallowed the index, so the statement ds=dataset should be ds=dataset[i]. There were 3143 items (at 8:01pm EST). Fields in dataset include paid, paidrank, sharerank, cfips, cname, homeval, and homevalrank.
  • #1
Hercuflea
Hello,

I'm trying to download and analyze the data from this link.

I've used Python's BS4 (Beautiful Soup 4) to read tabulated data from a website before, but this webpage is more complicated than any I have seen. It's not set up as a table (at least as far as I can tell using Inspect Element). Is there a way to read each county and its associated values (the ones that you mouse over to see)?
 
  • #2
Try this:
1) Open that page in your web browser (I am using Firefox).
2) Open this url on top of it: javascript:for(var i=0;i<dataset.length;i++){var ds=dataset;document.writeln("<br>"+i+":"+ds.paid+", "+ds.paidrank);}

There were 3143 items (at 8:01pm EST).
Fields in dataset include: paid, paidrank, sharerank, cfips, cname, homeval, and homevalrank. There may be others.
You should have no problem modifying that javascript code to generate whatever table in whatever format you want.
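As one possible way to do that, here is a minimal sketch that turns an array of records into CSV text. The function name `toCsv` and the sample rows are made up for illustration; on the real page you would presumably pass the page's own `dataset` array and whichever field names you need.

```javascript
// Convert an array of record objects into CSV text.
// `rows` stands in for the page's `dataset` array (an assumption);
// `fields` lists the property names to export, in column order.
function toCsv(rows, fields) {
  // Quote a cell if it contains a comma, a double quote, or a newline.
  function escapeCell(value) {
    var text = String(value);
    return /[",\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
  }
  var lines = [fields.join(",")]; // header row
  for (var i = 0; i < rows.length; i++) {
    var row = rows[i];
    lines.push(fields.map(function (f) { return escapeCell(row[f]); }).join(","));
  }
  return lines.join("\n");
}

// Made-up sample rows, NOT real values from the page.
var sampleRows = [
  { cname: "Example County", paid: 1234, paidrank: 10 },
  { cname: "Sample, County", paid: 567, paidrank: 20 }
];
var csv = toCsv(sampleRows, ["cname", "paid", "paidrank"]);
console.log(csv);
```

In a browser, the resulting string could be written to the page or copied from the console and saved as a .csv file.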
 
  • #3
Hmm, it returns 3143 items, but they all say "undefined,undefined". And it seems to be stuck in an infinite loop.
 
  • #4
There's a typo:

Oh, I see. The forum software read the sequence [, i, ] as an instruction to switch to italics and removed it from the post.
The statement ds=dataset should be ds=dataset[i] (that is, dataset with subscript i).
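A small illustration of why the version without the index prints "undefined": the fields live on each element of the array, not on the array itself. The one-element array below is a made-up stand-in for the page's `dataset`.

```javascript
// Made-up stand-in for the page's `dataset` array; the values are not real.
var sample = [{ paid: 1, paidrank: 2 }];

// Without the index: the array itself has no `paid` property.
var wrong = sample.paid;

// With the index: read the field from one record.
var right = sample[0].paid;

console.log(wrong, right);
```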
 
  • #5
Awesome! It's working now. Thank you!
 
  • #6
I modified the script to:
JavaScript:
javascript:
for(var i=0; i < dataset.length; i++ ){
    var ds = dataset[i];
    document.writeln("<br>" + i + ": " + " cname: " + ds.cname + " paid: " + ds.paid + " paidrank: " + ds.paidrank + " cfips: " + ds.cfips + " homeval: " + ds.homeval + " homevalrank: " + ds.homevalrank + " share: " + ds.share + " sharerank: " + ds.sharerank);
}

This is great, but the variable "cname" apparently only contains the county names, and does not include the state. This wouldn't be a problem, but sometimes two states can have a county with the same name, so when I go to categorize them into states later, I will have a problem with those counties with the same names.

Do you know how I could get a list of all the properties of "dataset"? (looking for the one with the state name)

I tried:
JavaScript:
for (var property in dataset) {
    if (dataset.hasOwnProperty(property)) {
        document.writeln(property);
    }
}

to no avail.
 
  • #7
When I tried, I got this:
cname state cfips paid paidrank homeval homevalrank share sharerank

The key is "var property in dataset[0]", not "var property in dataset".
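To spell out the difference, here is a hedged sketch using a made-up two-record array in place of the page's `dataset`. Enumerating the array yields its indices, while enumerating one element yields the field names. The second helper shows one possible way to use the `state` field to keep same-named counties apart. The helper names and all sample values are hypothetical, and the real page may store `state` and `cfips` in a different form.

```javascript
// Made-up stand-in for the page's `dataset` array; the values are not real.
var sampleDataset = [
  { cname: "Example County", state: "AA", cfips: "00001", paid: 100 },
  { cname: "Example County", state: "BB", cfips: "00002", paid: 200 }
];

// Keys of the array itself are just its indices.
var arrayKeys = Object.keys(sampleDataset);

// Keys of one record are the field names.
function fieldNames(rows) {
  return rows.length > 0 ? Object.keys(rows[0]) : [];
}

// Group records by their `state` field so that counties with the
// same name in different states end up in separate lists.
function groupByState(rows) {
  var groups = Object.create(null); // no inherited keys to collide with
  rows.forEach(function (row) {
    if (!groups[row.state]) {
      groups[row.state] = [];
    }
    groups[row.state].push(row);
  });
  return groups;
}

var names = fieldNames(sampleDataset);
var byState = groupByState(sampleDataset);
console.log(arrayKeys, names, Object.keys(byState));
```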
 
  • #10
I would use Selenium to drive the web browser and parse the page for me.
 
  • #11
In the past I've used Perl for this sort of thing. But even with Perl's nice regular expression handling, there was still the problem of web sites that would change some detail of the HTML, such as changing from <b> to <strong>, and then your search function had to be changed. Of course if it's a one-off, no worries about that. In any case, I used Perl, running from my desktop, along with SQLite, to take advantage of Perl's very nice database handling. It was all fairly easy, aside from the "moving target" issue.
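The "moving target" issue can be illustrated with a small sketch. It is written in JavaScript rather than Perl to match the rest of the thread, and it is not the code described above. The HTML strings and pattern names are made up. A pattern tied to one tag stops matching when the site switches tags, while a slightly more tolerant pattern survives that particular change (though not every possible one).

```javascript
// Made-up HTML fragments: the same value before and after a site redesign.
var oldHtml = "<b>42</b>";
var newHtml = "<strong>42</strong>";

// Pattern tied to <b> only.
var strict = /<b>([^<]*)<\/b>/;

// Pattern that accepts either <b> or <strong>; \1 requires the
// closing tag to match whichever opening tag was found.
var tolerant = /<(b|strong)>([^<]*)<\/\1>/;

var strictOld = strict.exec(oldHtml);      // matches
var strictNew = strict.exec(newHtml);      // null: the search silently breaks
var tolerantOld = tolerant.exec(oldHtml);  // matches
var tolerantNew = tolerant.exec(newHtml);  // matches

console.log(strictOld[1], strictNew, tolerantOld[2], tolerantNew[2]);
```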
 
  • #12
How do sites like Mint.com do stuff like this? There are a huge number of banks that they have to pull data from, and each one stores its data in a different way. Wouldn't changing even one tag in the HTML cause the search to fail?
 

1. How can I extract data from a website that is not easily accessible?

The best way to get data from a website that is not obviously tabulated is to use a web scraping tool or write a custom script. These methods allow you to access the website's source code and extract the data you need.

2. Is web scraping legal?

The legality of web scraping depends on the website's terms of use and the purpose for which you are scraping the data. It is always best to check the website's terms of use and obtain permission before scraping any data.

3. Can I use web scraping to collect personal data?

No, it is not ethical to use web scraping to collect personal data without the individual's consent. It is important to respect people's privacy and only collect data that is publicly available and relevant to your research or analysis.

4. How often should I scrape data from a website?

This depends on the frequency of updates on the website and your specific needs. If the website is updated frequently, you may need to scrape the data more often to ensure you have the most recent information. It is best to consult with the website owner or refer to their API guidelines for recommended scraping frequencies.

5. What are the potential challenges of web scraping?

Some potential challenges of web scraping include changes to the website's structure, anti-scraping measures implemented by the website, and the need to continuously monitor and update the scraping code. It is also important to consider the ethical implications of scraping data and to ensure that you are not violating any laws or terms of use.
