Python 2.7 Pandas BSoup4 Scrape: Outputs Column Names but not the Data

  • Context: Python
  • Thread starter: WWGD
  • Tags: Column, Data, Python

Discussion Overview

The discussion revolves around a user's attempt to scrape data from a JSON endpoint using BeautifulSoup and pandas in Python. The user is able to retrieve column names but not the corresponding data, leading to questions about the appropriate method for accessing the data.

Discussion Character

  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant shares their code for scraping a JSON endpoint but reports that while the code compiles without errors, it does not return any data.
  • Another participant suggests using a debugger to check if the page is read correctly and proposes comparing it with a working URL.
  • A participant points out that BeautifulSoup is designed for HTML, while the resource in question is JSON, which can be handled directly using Python's json library.
  • Another participant confirms that they successfully retrieved data using pandas' read_json method on the same JSON endpoint.
  • A participant shares a different scraping example that successfully retrieves data from an HTML page, indicating that the method works in other contexts.
  • One participant clarifies that the issue arises because the user is looking for HTML elements (tr/td) in a JSON response, which does not contain such elements.

Areas of Agreement / Disagreement

Participants generally agree that the original approach using BeautifulSoup is inappropriate for the JSON data format. However, there is no consensus on the best method to address the user's issue, as multiple approaches are discussed.

Contextual Notes

The discussion highlights the limitations of using HTML parsing methods on JSON data, emphasizing the need for appropriate tools based on the data format.
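As a minimal illustration of that point, a scraper could branch on the response's Content-Type header before choosing a parser. This is a hypothetical helper sketched for this summary, not code from the thread; real code would read the header from `response.headers.get("Content-Type")`:

```python
def pick_parser(content_type):
    """Hypothetical helper: choose a parsing strategy from an
    HTTP Content-Type header value."""
    if "json" in content_type:
        return "json.loads"       # structured data: parse it directly
    if "html" in content_type:
        return "BeautifulSoup"    # markup: use an HTML parser
    return "unknown"

print(pick_parser("application/json; charset=UTF-8"))  # json.loads
print(pick_parser("text/html; charset=UTF-8"))         # BeautifulSoup
```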

WWGD
TL;DR
Code outputs wanted columns but not the data
Hi, I am trying to scrape this page:
https://www.hofstede-insights.com/wp-json/v1/country
I get the list of columns I want, but not the data assigned to those columns in the page being scraped.
Python:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.hofstede-insights.com/wp-json/v1/country"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

adjective = []
name = []
id = []
idv = []
ind = []
ivr = []
lto = []
mas = []
pdi = []
slug = []
title = []
uai = []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    adjective.append(column_1)
    column_2 = col[2].string.strip()
    name.append(column_2)
    column_3 = col[3].string.strip()
    id.append(column_3)
    column_4 = col[4].string.strip()
    idv.append(column_4)
    column_5 = col[5].string.strip()
    ind.append(column_5)
    column_6 = col[6].string.strip()
    ivr.append(column_6)
    column_7 = col[7].string.strip()
    lto.append(column_7)
    column_8 = col[8].string.strip()
    mas.append(column_8)
    column_9 = col[9].string.strip()
    pdi.append(column_9)
    column_10 = col[10].string.strip()
    slug.append(column_10)
    column_11 = col[11].string.strip()
    title.append(column_11)
    column_12 = col[12].string.strip()
    uai.append(column_12)

columns = {
    "adjective": adjective,
    "name": name,
    "id": id,
    "idv": idv,
    "ind": ind,
    "ivr": ivr,
    "lto": lto,
    "mas": mas,
    "pdi": pdi,
    "slug": slug,
    "title": title,
    "uai": uai
}
 
df = pd.DataFrame(columns)

df.to_csv("somefile.csv",index = False)
The code runs with no errors. The output is:
adjective id idv ind ivr lto mas name pdi slug title uai
______________________________________________________________________________________

And no error message. But no data.

I used the same code with no problem for another page. Where did I go wrong here?
 
I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?
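That suggestion could be sketched like this. This is a hypothetical, non-interactive variant: the `DEBUG` flag gates the actual `pdb` break, and the fallback check simply asks whether the payload looks like markup at all:

```python
import pdb

DEBUG = False  # set True to actually drop into the debugger

def inspect_page(text):
    """Hypothetical check in the spirit of the debugging suggestion above."""
    if DEBUG:
        pdb.set_trace()  # pause here and examine `text` by hand
    # Non-interactive sanity check: does the payload even look like HTML?
    return text.lstrip().startswith("<")

print(inspect_page('[{"title": "Albania"}]'))      # False -- JSON, not HTML
print(inspect_page("<html><body></body></html>"))  # True
```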
 
.Scott said:
I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?
Thanks. I will post an example that works when I get to my PC; I am on my phone now. The scraping went well with pandas' read_json too. Will post that as well.
 
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON which is already a structured format and can simply be imported using the loads (load string) method of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
 
pbuk said:
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON which is already a structured format and can simply be imported using the loads (load string) method of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?
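For what it's worth, read_json works here because pandas maps a JSON array of objects straight onto DataFrame rows, with no HTML parsing step involved. A minimal offline sketch (the sample string stands in for the endpoint's response; the field names follow the columns named above):

```python
import io
import pandas as pd

# Stand-in for the endpoint's JSON response: a list of objects.
sample = '[{"title": "Albania", "pdi": 90}, {"title": "Angola", "pdi": 83}]'

# read_json turns each object into a row and each key into a column.
df = pd.read_json(io.StringIO(sample))
print(df.shape)          # (2, 2)
print(sorted(df.columns))
```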
 
@.Scott

I get data output from the same BeautifulSoup setup (of course, adapting for the new names) with the page

"http://www.usatoday.com/sports/mlb/salaries/". The code is:
Python:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "http://www.usatoday.com/sports/mlb/salaries/"

page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

name = []
team = []
pos = []
salary = []
years = []
value = []
annual = []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    name.append(column_1)
    column_2 = col[2].string.strip()
    team.append(column_2)
    column_3 = col[3].string.strip()
    pos.append(column_3)
    column_4 = col[4].string.strip()
    salary.append(column_4)
    column_5 = col[5].string.strip()
    years.append(column_5)
    column_6 = col[6].string.strip()
    value.append(column_6)
    column_7 = col[7].string.strip()
    annual.append(column_7)

columns = {
    "name": name,
    "team": team, 
    "pos": pos, 
    "salary": salary, 
    "years": years, 
    "value": value,
    "annual": annual
    }
df = pd.DataFrame(columns)

df.to_csv("somefilename.csv",index = False)
 
WWGD said:
But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?
The specific problem is that you are looking for data in HTML table/tr/td form. But the page at that URL is not HTML, and it does not have any "tr"s or "td"s. So when you look for the "tr"s, find_all returns an empty list, the loop body never runs, and you get columns with no data and no error.
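A quick way to see this with only the standard library (the sample string is a stand-in for the endpoint's response, with field names taken from the columns in the original post):

```python
import json

# Stand-in for the JSON the endpoint returns: a list of country objects.
sample = '[{"title": "Albania", "pdi": "90", "slug": "albania"}]'

countries = json.loads(sample)
# After parsing there is no markup at all -- no <tr> or <td> to find,
# just plain Python lists and dicts.
print(type(countries).__name__)  # list
print(countries[0]["title"])     # Albania
# Searching the raw text for table rows finds nothing either:
print("<tr" in sample)           # False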
 
My bad for not paying attention; pbuk also explained it well.