Python Python 2.7 Pandas BSoup4 Scrape: Outputs Column Names but not the Data

AI Thread Summary
The issue arises from attempting to scrape data from a JSON endpoint using BeautifulSoup, which is designed for HTML parsing. The URL provided returns data in JSON format, lacking the HTML structure of "tr" and "td" elements. Instead, the data can be directly accessed using Python's built-in JSON library, allowing for easier extraction. The discussion highlights that using `pd.read_json()` can effectively retrieve the data without the need for BeautifulSoup. Therefore, switching to JSON parsing is the recommended solution for this case.
WWGD
TL;DR Summary
Code outputs wanted columns but not the data
Hi, I am trying to scrape a page:
https://www.hofstede-insights.com/wp-json/v1/country
I get the list of columns I want, but not the data assigned to those columns in the page being scraped.
Python:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.hofstede-insights.com/wp-json/v1/country"
page = requests.get(url)
soup = BeautifulSoup(page.text)

adjective= []
name = []
id= []
idv=  []
ind =  []
ivr =  []
lto =  []
mas =  []
pdi =  []
slug= []
title= []
uai= []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    adjective.append(column_1)
    column_2 = col[2].string.strip()
    name.append(column_2)
    column_3 = col[3].string.strip()
    id.append(column_3)
    column_4 = col[4].string.strip()
    idv.append(column_4)
    column_5 = col[5].string.strip()
    ind.append(column_5)
    column_6 = col[6].string.strip()
    ivr.append(column_6)
    column_7 = col[7].string.strip()
    lto.append(column_7)
    column_8 = col[8].string.strip()
    mas.append(column_8)
    column_9 = col[9].string.strip()
    pdi.append(column_9)
    column_10 = col[10].string.strip()
    slug.append(column_10)
    column_11 = col[11].string.strip()
    title.append(column_11)
    column_12 = col[12].string.strip()
    uai.append(column_12)

columns = {
"adjective":adjective,
"name" :name,
"id": id,
"idv": idv,
"ind" :ind,
"ivr" : ivr,
"lto" : lto,
"mas" : mas,
"pdi" : pdi,
"slug":slug,
"title":title,
"uai":uai
}
 
df = pd.DataFrame(columns)

df.to_csv("somefile.csv",index = False)
The code runs with no problem. The output is:
adjective id idv ind ivr lto mas name pdi slug title uai
______________________________________________________________________________________

There is no error message, but also no data.

I used the same code with no problem for another page. Where did I go wrong here?
 
I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?
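A quick way to run that kind of check without a debugger is to look at what the server actually returned before handing it to a parser. The sketch below is one possible approach, not something taken from either page: `classify_payload` is a hypothetical helper, and the two sample strings are made up for illustration, standing in for `page.text`.

```python
import json


def classify_payload(text):
    """Guess whether a response body is JSON, HTML, or something else."""
    stripped = text.lstrip()
    try:
        json.loads(stripped)
        return "json"
    except ValueError:
        # Not valid JSON; fall through to the HTML check.
        pass
    if stripped.startswith("<"):
        return "html"
    return "unknown"


# Made-up sample bodies, standing in for page.text from requests.get(url).
json_body = '[{"title": "Example", "pdi": 50}]'
html_body = "<html><body><table><tr><td>Example</td></tr></table></body></html>"

print(classify_payload(json_body))  # json
print(classify_payload(html_body))  # html
```

With requests, `page.headers.get("Content-Type")` reports the same thing from the server's side.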
 
.Scott said:
I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?
Thanks. I will post an example that works when I get to my PC; I am on my phone now. The scraping went well with pandas' read_json too. I will post that as well.
 
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON, which is already a structured format and can simply be imported using the loads (load string) function of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
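Once the JSON is loaded, getting it into a CSV file does not need any HTML-style row and cell handling. The following is a minimal sketch using only the standard library. It assumes each record is a dict keyed by the field names listed in the question; the sample records and their values are placeholders made up for illustration, not real data from the endpoint.

```python
import csv
import io
import json

# Made-up stand-in for response.text. The field names come from the
# question; the values are placeholders, not real scores.
sample_text = json.dumps([
    {"title": "Country A", "slug": "country-a", "pdi": 10, "idv": 20},
    {"title": "Country B", "slug": "country-b", "pdi": 30, "idv": 40},
])

countries = json.loads(sample_text)  # a list of dicts, one per country
fields = ["title", "slug", "pdi", "idv"]

# An in-memory buffer keeps the example self-contained; to write a real
# file, use open("somefile.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
writer.writerows(countries)

csv_text = buffer.getvalue()
print(csv_text)
```

If pandas is already in use, passing the same list of dicts to `pd.DataFrame(countries)` and calling `to_csv` should give an equivalent result.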
 
pbuk said:
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON, which is already a structured format and can simply be imported using the loads (load string) function of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?
 
@.Scott

I get data output from the same Beautiful Soup setup (adapted, of course, for the new names) with the page

"http://www.usatoday.com/sports/mlb/salaries/" The code is:
Python:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "http://www.usatoday.com/sports/mlb/salaries/"

page = requests.get(url)
soup = BeautifulSoup(page.text)

name = []
team = []
pos = []
salary = []
years = []
value = []
annual = []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    name.append(column_1)
    column_2 = col[2].string.strip()
    team.append(column_2)
    column_3 = col[3].string.strip()
    pos.append(column_3)
    column_4 = col[4].string.strip()
    salary.append(column_4)
    column_5 = col[5].string.strip()
    years.append(column_5)
    column_6 = col[6].string.strip()
    value.append(column_6)
    column_7 = col[7].string.strip()
    annual.append(column_7)

columns = {
    "name": name,
    "team": team, 
    "pos": pos, 
    "salary": salary, 
    "years": years, 
    "value": value,
    "annual": annual
    }
df = pd.DataFrame(columns)

df.to_csv("somefilename.csv",index = False)
 
WWGD said:
But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?
The specific problem is that you are looking for data in html table/tr/td form. But the page at that url is not HTML - and it does not have any "tr"s or "td"s. So when you look for the "tr"s, it doesn't find any.
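To see this concretely, one can count the `<tr>` start tags in an HTML body and in a JSON body. The sketch below uses the Python 3 standard library's `html.parser.HTMLParser` in place of BeautifulSoup so that it runs without extra installs. `TagCounter` and `count_tags` are hypothetical helpers, and both sample strings are made up for illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in the text."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(text, tag):
    counter = TagCounter(tag)
    counter.feed(text)
    counter.close()
    return counter.count


# Made-up sample bodies for illustration.
html_body = "<table><tr><th>name</th></tr><tr><td>Example</td></tr></table>"
json_body = '[{"name": "Example", "pdi": 50}]'

print(count_tags(html_body, "tr"))  # 2
print(count_tags(json_body, "tr"))  # 0
```

With zero rows found, the body of the `for` loop never runs, so every list stays empty and the resulting DataFrame has column names but no data. That is also why there is no error message.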
 
My bad for not paying attention; pbuk also explained it well.
 