Python 2.7 Pandas BSoup4 Scrape: Outputs Column Names but not the Data

In summary: the conversation discussed an issue with scraping a webpage using BeautifulSoup. The code returned the column names but none of the data, and the original poster wondered where they went wrong. One suggestion was to import pdb and compare the contents of the problem page with a page that works. It was then pointed out that the page contains no table or other HTML tags at all: the resource is JSON, which is already structured and can be loaded directly with the built-in json library (or with pandas' read_json), so BeautifulSoup is the wrong tool for it. An example that does work with BeautifulSoup, on a real HTML table, was also discussed.
  • #1
WWGD
Science Advisor
Gold Member
7,009
10,469
TL;DR Summary
Code outputs wanted columns but not the data
Hi, trying to scrape a page :
https://www.hofstede-insights.com/wp-json/v1/country
I get the list of columns I want, but not the data assigned to the columns in the page that is being scraped.
Python:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.hofstede-insights.com/wp-json/v1/country"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

adjective = []
name = []
id = []
idv = []
ind = []
ivr = []
lto = []
mas = []
pdi = []
slug = []
title = []
uai = []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    adjective.append(column_1)
    column_2 = col[2].string.strip()
    name.append(column_2)
    column_3 = col[3].string.strip()
    id.append(column_3)
    column_4 = col[4].string.strip()
    idv.append(column_4)
    column_5 = col[5].string.strip()
    ind.append(column_5)
    column_6 = col[6].string.strip()
    ivr.append(column_6)
    column_7 = col[7].string.strip()
    lto.append(column_7)
    column_8 = col[8].string.strip()
    mas.append(column_8)
    column_9 = col[9].string.strip()
    pdi.append(column_9)
    column_10 = col[10].string.strip()
    slug.append(column_10)
    column_11 = col[11].string.strip()
    title.append(column_11)
    column_12 = col[12].string.strip()
    uai.append(column_12)

columns = {
    "adjective": adjective,
    "name": name,
    "id": id,
    "idv": idv,
    "ind": ind,
    "ivr": ivr,
    "lto": lto,
    "mas": mas,
    "pdi": pdi,
    "slug": slug,
    "title": title,
    "uai": uai
}
 
df = pd.DataFrame(columns)

df.to_csv("somefile.csv",index = False)
Code compiles with no problem. Output is:
adjective id idv ind ivr lto mas name pdi slug title uai
______________________________________________________________________________________

And no error message. But no data.

I used the same code with no problem for another page. Where did I go wrong here?
 
  • #2
I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?
 
  • #3
.Scott said:
I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?
Thanks. I will post an example that works when I get to my PC, I am on my phone now. The scraping went well with pandas' read_json too. Will post it also.
 
  • #4
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON which is already a structured format and can simply be imported using the loads (load string) method of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
 
  • #5
pbuk said:
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON which is already a structured format and can simply be imported using the loads (load string) method of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?
 
  • #6
@.Scott

I get data output from the same Beautiful Soup setup ( of course, adapting for new names ) with the page

"http://www.usatoday.com/sports/mlb/salaries/" The code is:
Python:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "http://www.usatoday.com/sports/mlb/salaries/"

page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

name = []
team = []
pos = []
salary = []
years = []
value = []
annual = []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    name.append(column_1)
    column_2 = col[2].string.strip()
    team.append(column_2)
    column_3 = col[3].string.strip()
    pos.append(column_3)
    column_4 = col[4].string.strip()
    salary.append(column_4)
    column_5 = col[5].string.strip()
    years.append(column_5)
    column_6 = col[6].string.strip()
    value.append(column_6)
    column_7 = col[7].string.strip()
    annual.append(column_7)

columns = {
    "name": name,
    "team": team, 
    "pos": pos, 
    "salary": salary, 
    "years": years, 
    "value": value,
    "annual": annual
    }
df = pd.DataFrame(columns)

df.to_csv("somefilename.csv",index = False)
 
  • #7
WWGD said:
But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?
The specific problem is that you are looking for data in HTML table/tr/td form. But the page at that URL is not HTML, and it does not have any "tr"s or "td"s. So when you look for the "tr"s, it doesn't find any.
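This can be confirmed directly: feeding JSON text to BeautifulSoup produces a document with no table rows at all, so the extraction loop body never runs and no error is raised. A minimal demonstration, using a hypothetical sample of the payload rather than a live request:

```python
from bs4 import BeautifulSoup

# A JSON payload like the one the endpoint returns (hypothetical sample)
json_text = '[{"title": "Albania", "pdi": "90"}]'

# Parsing JSON text as if it were HTML yields no table rows at all
soup = BeautifulSoup(json_text, "html.parser")
rows = soup.find_all("tr")
print(len(rows))  # 0 -- so a loop over rows[1:] never iterates, and no error is raised
```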
 
  • #8
My bad for not paying attention; pbuk also explained it well.
 

1. What is Python 2.7 Pandas BSoup4 Scrape?

"Python 2.7 Pandas BSoup4 Scrape" refers to using the Python 2.7 programming language with the Pandas and BeautifulSoup4 libraries to extract data from a website. This is commonly called web scraping: data from a website is collected and organized for further analysis.

2. How does Python 2.7 Pandas BSoup4 Scrape work?

Python 2.7 Pandas BSoup4 Scrape works by first fetching a web page, typically with the requests library, then parsing its HTML with BeautifulSoup4 to extract the specific elements of interest. The extracted values are then organized into a Pandas data frame, which can finally be stored in an output file such as a CSV.
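The workflow above can be sketched as follows. This is a minimal, self-contained illustration: the inline HTML table and its column names stand in for a real fetched page, which would normally come from requests.get as in the thread's examples.

```python
from bs4 import BeautifulSoup
import pandas as pd

# A small inline HTML table standing in for a fetched page (hypothetical data)
html = """
<table>
  <tr><th>name</th><th>team</th></tr>
  <tr><td>Alice</td><td>Red</td></tr>
  <tr><td>Bob</td><td>Blue</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract the cell text row by row, skipping the header row
records = [[td.get_text(strip=True) for td in row.find_all("td")]
           for row in soup.find_all("tr")[1:]]

# Organize the extracted values into a data frame and save them
df = pd.DataFrame(records, columns=["name", "team"])
df.to_csv("output.csv", index=False)
```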

3. Why is the output only showing column names and not the data?

This could be due to several reasons. The code used to extract the data may be incorrect or incomplete, or the website's HTML structure may have changed so the code no longer matches it. Another common cause, and the one in this thread, is that the URL does not serve HTML at all (for example, a JSON API endpoint), so BeautifulSoup finds no table rows and the extraction loop never runs.

4. How can I fix the issue of not getting any data in the output?

To fix this issue, first check that the scraping code is correct and complete. Next, inspect what the URL actually returns: if it is JSON rather than HTML, use the built-in json library or pandas.read_json instead of BeautifulSoup. If the page is HTML, check whether its structure has changed and adjust the code accordingly. You can also try a different data source or check for errors in the data itself.
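These checks can be separated into stages. Below is a hedged sketch of a small diagnostic helper, not from the thread, that distinguishes a non-HTML response from an HTML page that simply lacks the expected elements:

```python
from bs4 import BeautifulSoup

def diagnose(content_type, body):
    """Report why a tr/td scrape might come back empty (illustrative helper)."""
    # A JSON or XML payload needs a different tool than BeautifulSoup
    if "html" not in content_type:
        return "response is not HTML (Content-Type: %s)" % content_type
    soup = BeautifulSoup(body, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        return "HTML contains no <tr> elements"
    return "found %d <tr> row(s)" % len(rows)

# The Hofstede endpoint from the thread serves JSON, so the first check fires
print(diagnose("application/json", '[{"title": "Albania"}]'))
# An actual HTML table passes both checks
print(diagnose("text/html", "<table><tr><td>x</td></tr></table>"))
```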

5. What are some tips for successful data extraction using Python 2.7 Pandas BSoup4 Scrape?

Some tips for successful data extraction using this method include thoroughly understanding the website's HTML structure, using the correct libraries and methods for extracting data, and regularly checking and updating the code to account for any changes. It is also important to handle any errors or exceptions that may occur during the scraping process. Additionally, having a basic understanding of the Python programming language and data manipulation techniques can greatly improve the success of the data extraction process.
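For the error-handling tip above, one common failure mode is a row that has fewer cells than the code indexes into. A minimal sketch, with a hypothetical helper name, that returns None instead of crashing mid-scrape:

```python
from bs4 import BeautifulSoup

def cell_text(row, index):
    """Return the stripped text of the index-th <td> in a row, or None if absent."""
    cells = row.find_all("td")
    try:
        return cells[index].get_text(strip=True)
    except IndexError:
        # The row has fewer cells than expected -- handle it instead of crashing
        return None

html = "<table><tr><td>a</td><td>b</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
row = soup.find_all("tr")[0]
print(cell_text(row, 1))  # b
print(cell_text(row, 5))  # None
```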
