Python 2.7 Pandas BSoup4 Scrape: Outputs Column Names but not the Data

In summary: the conversation discussed an issue with scraping a webpage using BeautifulSoup. The code returned the column names but none of the data, and the original poster wondered where they went wrong. One suggestion was to import pdb and compare the contents of the problem page with a page that works. It was then pointed out that the page contains no table or other HTML tags at all: the resource is JSON, which is already structured and can be loaded directly with the built-in json library (or with pandas' read_json), so BeautifulSoup is the wrong tool for it. An example that does work with BeautifulSoup, on a real HTML table, was also discussed.
  • #1
WWGD
Science Advisor
Gold Member
7,009
10,469
TL;DR Summary
Code outputs wanted columns but not the data
Hi, trying to scrape a page :
https://www.hofstede-insights.com/wp-json/v1/country
I get the list of columns I want, but not the data assigned to the columns in the page that is being scraped.
Python:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.hofstede-insights.com/wp-json/v1/country"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

adjective = []
name = []
id = []
idv = []
ind = []
ivr = []
lto = []
mas = []
pdi = []
slug = []
title = []
uai = []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    adjective.append(column_1)
    column_2 = col[2].string.strip()
    name.append(column_2)
    column_3 = col[3].string.strip()
    id.append(column_3)
    column_4 = col[4].string.strip()
    idv.append(column_4)
    column_5 = col[5].string.strip()
    ind.append(column_5)
    column_6 = col[6].string.strip()
    ivr.append(column_6)
    column_7 = col[7].string.strip()
    lto.append(column_7)
    column_8 = col[8].string.strip()
    mas.append(column_8)
    column_9 = col[9].string.strip()
    pdi.append(column_9)
    column_10 = col[10].string.strip()
    slug.append(column_10)
    column_11 = col[11].string.strip()
    title.append(column_11)
    column_12 = col[12].string.strip()
    uai.append(column_12)

columns = {
    "adjective": adjective,
    "name": name,
    "id": id,
    "idv": idv,
    "ind": ind,
    "ivr": ivr,
    "lto": lto,
    "mas": mas,
    "pdi": pdi,
    "slug": slug,
    "title": title,
    "uai": uai
}
 
df = pd.DataFrame(columns)

df.to_csv("somefile.csv",index = False)
Code compiles with no problem. Output is:
adjective id idv ind ivr lto mas name pdi slug title uai
______________________________________________________________________________________

And no error message. But no data.

I used the same code with no problem for another page. Where did I go wrong here?
 
  • #2
I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?
 
  • #3
.Scott said:
I am not familiar with BeautifulSoup.
My first thought would be to import pdb, break at line 10, and check that the page was correctly read in and reformatted.
But it may be faster to compare the contents of this problem page with one that works.
What is a url that works?
Thanks. I will post an example that works when I get to my PC, I am on my phone now. The scraping went well with pandas' read_json too. Will post it also.
 
  • #4
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON which is already a structured format and can simply be imported using the loads (load string) method of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
 
  • #5
pbuk said:
BeautifulSoup is used to extract structured data from HTML, but that resource is JSON which is already a structured format and can simply be imported using the loads (load string) method of the built-in json library:
Python:
import requests
import json

response = requests.get("https://www.hofstede-insights.com/wp-json/v1/country")
countries = json.loads(response.text)
print(countries[0]["title"]) # Albania*
But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?
 
  • #6
@.Scott

I get data output from the same Beautiful Soup setup ( of course, adapting for new names ) with the page

"http://www.usatoday.com/sports/mlb/salaries/" The code is:
Python:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "http://www.usatoday.com/sports/mlb/salaries/"

page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

name = []
team = []
pos = []
salary = []
years = []
value = []
annual = []

for row in soup.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[1].string.strip()
    name.append(column_1)
    column_2 = col[2].string.strip()
    team.append(column_2)
    column_3 = col[3].string.strip()
    pos.append(column_3)
    column_4 = col[4].string.strip()
    salary.append(column_4)
    column_5 = col[5].string.strip()
    years.append(column_5)
    column_6 = col[6].string.strip()
    value.append(column_6)
    column_7 = col[7].string.strip()
    annual.append(column_7)

columns = {
    "name": name,
    "team": team, 
    "pos": pos, 
    "salary": salary, 
    "years": years, 
    "value": value,
    "annual": annual
    }
df = pd.DataFrame(columns)

df.to_csv("somefilename.csv",index = False)
 
  • #7
WWGD said:
But I do get data output in Pandas with JSON:

Hofs=pd.read_json("https://www.hofstede-insights.com/wp-json/v1/country")

What am I doing wrong?
The specific problem is that you are looking for data in HTML table/tr/td form. But the page at that URL is not HTML, and it does not have any "tr"s or "td"s. So when you look for the "tr"s, it doesn't find any.
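This can be confirmed directly: feeding JSON text to BeautifulSoup produces a document with no table rows at all, so the extraction loop body never runs and no error is raised. A minimal demonstration, using a hypothetical sample of the payload rather than a live request:

```python
from bs4 import BeautifulSoup

# A JSON payload like the one the endpoint returns (hypothetical sample)
json_text = '[{"title": "Albania", "pdi": "90"}]'

# Parsing JSON text as if it were HTML yields no table rows at all
soup = BeautifulSoup(json_text, "html.parser")
rows = soup.find_all("tr")
print(len(rows))  # 0 -- so a loop over rows[1:] never iterates, and no error is raised
```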
 
  • #8
My bad for not paying attention; pbuk also explained it well.
 

1. What is Python 2.7 Pandas BSoup4 Scrape?

"Python 2.7 Pandas BSoup4 Scrape" refers to using the Python 2.7 programming language with the Pandas and BeautifulSoup4 libraries to extract data from a website. This is commonly called web scraping: data from a website is collected and organized for further analysis.

2. How does Python 2.7 Pandas BSoup4 Scrape work?

Python 2.7 Pandas BSoup4 Scrape works by first fetching a web page, typically with the requests library, then parsing its HTML with BeautifulSoup4 to extract the specific elements of interest. The extracted values are then organized into a Pandas data frame, which can finally be stored in an output file such as a CSV.
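The workflow above can be sketched as follows. This is a minimal, self-contained illustration: the inline HTML table and its column names stand in for a real fetched page, which would normally come from requests.get as in the thread's examples.

```python
from bs4 import BeautifulSoup
import pandas as pd

# A small inline HTML table standing in for a fetched page (hypothetical data)
html = """
<table>
  <tr><th>name</th><th>team</th></tr>
  <tr><td>Alice</td><td>Red</td></tr>
  <tr><td>Bob</td><td>Blue</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract the cell text row by row, skipping the header row
records = [[td.get_text(strip=True) for td in row.find_all("td")]
           for row in soup.find_all("tr")[1:]]

# Organize the extracted values into a data frame and save them
df = pd.DataFrame(records, columns=["name", "team"])
df.to_csv("output.csv", index=False)
```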

3. Why is the output only showing column names and not the data?

This could be due to several reasons. The code used to extract the data may be incorrect or incomplete, or the website's HTML structure may have changed so the code no longer matches it. Another common cause, and the one in this thread, is that the URL does not serve HTML at all (for example, a JSON API endpoint), so BeautifulSoup finds no table rows and the extraction loop never runs.

4. How can I fix the issue of not getting any data in the output?

To fix this issue, first check that the scraping code is correct and complete. Next, inspect what the URL actually returns: if it is JSON rather than HTML, use the built-in json library or pandas.read_json instead of BeautifulSoup. If the page is HTML, check whether its structure has changed and adjust the code accordingly. You can also try a different data source or check for errors in the data itself.
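These checks can be separated into stages. Below is a hedged sketch of a small diagnostic helper, not from the thread, that distinguishes a non-HTML response from an HTML page that simply lacks the expected elements:

```python
from bs4 import BeautifulSoup

def diagnose(content_type, body):
    """Report why a tr/td scrape might come back empty (illustrative helper)."""
    # A JSON or XML payload needs a different tool than BeautifulSoup
    if "html" not in content_type:
        return "response is not HTML (Content-Type: %s)" % content_type
    soup = BeautifulSoup(body, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        return "HTML contains no <tr> elements"
    return "found %d <tr> row(s)" % len(rows)

# The Hofstede endpoint from the thread serves JSON, so the first check fires
print(diagnose("application/json", '[{"title": "Albania"}]'))
# An actual HTML table passes both checks
print(diagnose("text/html", "<table><tr><td>x</td></tr></table>"))
```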

5. What are some tips for successful data extraction using Python 2.7 Pandas BSoup4 Scrape?

Some tips for successful data extraction using this method include thoroughly understanding the website's HTML structure, using the correct libraries and methods for extracting data, and regularly checking and updating the code to account for any changes. It is also important to handle any errors or exceptions that may occur during the scraping process. Additionally, having a basic understanding of the Python programming language and data manipulation techniques can greatly improve the success of the data extraction process.
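For the error-handling tip above, one common failure mode is a row that has fewer cells than the code indexes into. A minimal sketch, with a hypothetical helper name, that returns None instead of crashing mid-scrape:

```python
from bs4 import BeautifulSoup

def cell_text(row, index):
    """Return the stripped text of the index-th <td> in a row, or None if absent."""
    cells = row.find_all("td")
    try:
        return cells[index].get_text(strip=True)
    except IndexError:
        # The row has fewer cells than expected -- handle it instead of crashing
        return None

html = "<table><tr><td>a</td><td>b</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
row = soup.find_all("tr")[0]
print(cell_text(row, 1))  # b
print(cell_text(row, 5))  # None
```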
