Python 2.7 Pandas BSoup4 Scrape: Outputs Column Names but not the Data

WWGD · Nov 14, 2019

Hi, I am trying to scrape a page:
https://www.hofstede-insights.com/wp-json/v1/country
I get the list of columns I want, but not the data assigned to those columns in the page being scraped.

Python:

from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.hofstede-insights.com/wp-json/v1/country"
page = requests.get(url)
soup = BeautifulSoup(page.text)

adjective= []
name = []
id= []
idv=  []
ind =  []
ivr =  []
lto =  []
mas =  []
pdi =  []
slug= []
title= []
uai= []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    adjective.append(column_1)
    column_2 = col[2].string.strip()
    name.append(column_2)
    column_3 = col[3].string.strip()
    id.append(column_3)
    column_4 = col[4].string.strip()
    idv.append(column_4)
    column_5 = col[5].string.strip()
    ind.append(column_5)
    column_6 = col[6].string.strip()
    ivr.append(column_6)
    column_7 = col[7].string.strip()
    lto.append(column_7)
    column_8 = col[8].string.strip()
    mas.append(column_8)
    column_9 = col[9].string.strip()
    pdi.append(column_9)
    column_10 = col[10].string.strip()
    slug.append(column_10)
    column_11 = col[11].string.strip()
    title.append(column_11)
    column_12 = col[12].string.strip()
    uai.append(column_12)

columns = {
"adjective":adjective,
"name" :name,
"id": id,
"idv": idv,
"ind" :ind,
"ivr" : ivr,
"lto" : lto,
"mas" : mas,
"pdi" : pdi,
"slug":slug,
"title":title,
"uai":uai
}
 
df = pd.DataFrame(columns)

df.to_csv("somefile.csv",index = False)

The code runs with no problem. The output is:
adjective id idv ind ivr lto mas name pdi slug title uai
______________________________________________________________________________________

And no error message. But no data.

I used the same code with no problem for another page. Where did I go wrong here?

.Scott · Nov 15, 2019

I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?

WWGD · Nov 15, 2019

.Scott said:

I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?

Thanks. I will post an example that works when I get to my PC; I am on my phone now. The scraping went well with pandas' read_json too. I will post that as well.

pbuk · Nov 15, 2019

BeautifulSoup is used to extract structured data from HTML, but that resource is JSON, which is already a structured format and can simply be imported using the loads (load string) method of the built-in json library:

Python:

import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
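
To get from the parsed JSON to the CSV the original post was after, something like the following might work. It is only a sketch: it assumes the endpoint returns a list of flat objects, it uses just four of the column names from the question, and the two sample records are invented placeholders standing in for response.text so the snippet runs without a network connection.

```python
import json
import pandas as pd

# Placeholder payload standing in for response.text; these records are
# invented for illustration and are NOT real Hofstede scores.
sample_text = '''[
  {"title": "Exampleland", "slug": "exampleland", "pdi": 10, "idv": 20},
  {"title": "Samplestan", "slug": "samplestan", "pdi": 30, "idv": 40}
]'''

countries = json.loads(sample_text)  # a list of dicts, one per country

# A list of dicts maps directly onto rows; the dict keys become the columns.
df = pd.DataFrame(countries)

# Select the columns explicitly so the CSV layout does not depend on key order.
df = df[["title", "slug", "pdi", "idv"]]

# With no path argument, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

With the real response, sample_text would be replaced by response.text, and df.to_csv("somefile.csv", index=False) would write the file as in the original code.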

WWGD · Nov 15, 2019

pbuk said:
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON which is already a structured format and can simply be imported using the loads (load string) method of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*

But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?

WWGD · Nov 15, 2019

@.Scott

I get data output from the same Beautiful Soup setup (adapting the names, of course) with the page "http://www.usatoday.com/sports/mlb/salaries/". The code is:

Python:

from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "http://www.usatoday.com/sports/mlb/salaries/"

page = requests.get(url)
soup = BeautifulSoup(page.text)

name = []
team = []
pos = []
salary = []
years = []
value = []
annual = []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    name.append(column_1)
    column_2 = col[2].string.strip()
    team.append(column_2)
    column_3 = col[3].string.strip()
    pos.append(column_3)
    column_4 = col[4].string.strip()
    salary.append(column_4)
    column_5 = col[5].string.strip()
    years.append(column_5)
    column_6 = col[6].string.strip()
    value.append(column_6)
    column_7 = col[7].string.strip()
    annual.append(column_7)

columns = {
    "name": name,
    "team": team, 
    "pos": pos, 
    "salary": salary, 
    "years": years, 
    "value": value,
    "annual": annual
    }
df = pd.DataFrame(columns)

df.to_csv("somefilename.csv",index = False)

.Scott · Nov 16, 2019

WWGD said:

But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?

The specific problem is that you are looking for data in HTML table/tr/td form. But the page at that URL is not HTML, and it does not contain any "tr" or "td" elements. So when you search for "tr" elements, none are found, the loop body never runs, and every list stays empty.
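
A small sketch of what that looks like in practice, assuming a short made-up JSON string in place of the real response (the record is a placeholder, not actual data) and naming the parser explicitly:

```python
from bs4 import BeautifulSoup

# Placeholder JSON text standing in for page.text; the record is invented.
page_text = '[{"title": "Exampleland", "pdi": 10}]'

soup = BeautifulSoup(page_text, "html.parser")

# JSON contains no HTML tags, so there are no table rows to find.
rows = soup.find_all("tr")
print(len(rows))  # 0

titles = []
for row in rows[1:]:  # empty slice: the loop body never executes
    titles.append(row.find_all("td")[1].string.strip())

print(titles)  # []
```

Because every list stays empty, pd.DataFrame(columns) ends up with column names and zero rows, which matches the CSV shown in the question.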

WWGD · Nov 16, 2019

My bad for not paying attention. pbuk also explained it well.
