Why Use Python for Web Scraping Instead of Excel?

AI Thread Summary
The discussion revolves around a Python script designed to scrape school data from a website using BeautifulSoup and requests. The code extracts school names, addresses, phone numbers, emails, and fax numbers, then saves this information into an Excel file. However, users are experiencing issues with the script, particularly that the generated Excel file is blank despite the code running successfully. Suggestions for debugging include printing the DataFrame and individual school items to verify data extraction. Additionally, there are recommendations to consider alternative methods for data extraction, such as directly using Excel's capabilities if the website allows it. The conversation also touches on the importance of code formatting in Python and the need for effective debugging practices.
Leo_Chau_430
TL;DR Summary
I am trying to write a program that automatically scrapes the website https://www.goodschool.hk/ss to build an Excel file containing the phone number, address, email address, and fax number of all the secondary schools, primary schools, and kindergartens in Hong Kong. However, I have run into a problem: my code runs successfully, but the generated Excel file is blank.
My code is as follows:

Python:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

url = 'https://www.goodschool.hk/ss'

response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, 'html.parser')

school_items = soup.find_all('div', {'class': 'school-item'})

school_names_en = []
school_names_zh = []
school_addresses_en = []
school_addresses_zh = []
school_phones = []
school_emails = []
school_faxes = []

for school_item in school_items:

    name_elements = school_item.select('a.school-name')
    school_names_en.append(name_elements[0].text.strip())
    school_names_zh.append(name_elements[1].text.strip())

    address_elements = school_item.select('div.school-address')
    school_addresses_en.append(address_elements[0].text.strip())
    school_addresses_zh.append(address_elements[1].text.strip())

    contact_elements = school_item.select('div.contact-info')
    school_phones.append(contact_elements[0].text.strip())
    school_emails.append(contact_elements[1].text.strip())
    school_faxes.append(contact_elements[2].text.strip())

df = pd.DataFrame({
    'School Name (English)': school_names_en,
    'School Name (Chinese)': school_names_zh,
    'Address (English)': school_addresses_en,
    'Address (Chinese)': school_addresses_zh,
    'Phone Number': school_phones,
    'Email Address': school_emails,
    'Fax Number': school_faxes
})

desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
excel_file_path = os.path.join(desktop_path, "school_data.xlsx")
df.to_excel(excel_file_path, index=False)

if os.path.exists(excel_file_path):
    print("Excel file generated successfully!")
else:
    print("Failed to generate Excel file.")
 
Why don't you just e-mail them, asking if a non-HTML list is available? Other resources may include local school boards, etc.
 
I have found the contact list of the schools on the website https://data.gov.hk/tc/ However, the list there does not have the email addresses of the schools...
 
Leo_Chau_430 said:
My code is as follows
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
 
PeterDonis said:
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
Sorry, I am new to this forum. You said "your code is as follows", but I cannot see it. Where can I find it?
 
As a debug measure, print out the dataframe and make sure it contains data.
 
Greg Bernhardt said:
As a debug measure, print out the dataframe and make sure it contains data.
Yeah, I have tried to print out the data, but it seems that the data is not extracted properly.
 
Leo_Chau_430 said:
Yeah, I have tried to print out the data, but it seems that the data is not extracted properly.
Next debug is to print out each school_item. I suspect you're not parsing the classes right.
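A minimal sketch of that kind of check, using dummy HTML in place of the live page (on the real site, the HTML would come from `response.content` as in the original script):

```python
from bs4 import BeautifulSoup

# Dummy HTML standing in for response.content from the live page
html = """
<div class="school-item">
  <a class="school-name">Example School</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
school_items = soup.find_all("div", {"class": "school-item"})

# First check: did the selector match anything at all?
print(f"Found {len(school_items)} school-item div(s)")

# Then inspect each match to confirm the inner structure is what
# the rest of the script expects before appending to the lists
for item in school_items:
    print(item.prettify())
```

If the first print reports zero matches, the problem is the selector (or the page itself), not the DataFrame or the Excel export.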
 
Greg Bernhardt said:
Next debug is to print out each school_item. I suspect you're not parsing the classes right.
I have just checked: school_items is an empty list. However, when I print soup it has valid output.
 
Leo_Chau_430 said:
You said "your code is as follows", but I cannot see it. Where can I find it?
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
 
PeterDonis said:
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
Thank you, I can see the code now. I think the indentation is correct.
 
Leo_Chau_430 said:
My code runs successfully, but the generated Excel file is blank.
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
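This point can be demonstrated without the live site: BeautifulSoup only parses the HTML the server sends and never executes scripts. A minimal sketch, using made-up HTML shaped like a JavaScript-driven page:

```python
from bs4 import BeautifulSoup

# HTML as a JavaScript-driven page might serve it: the school list is
# built client-side by a script, so the raw HTML contains no
# .school-item divs at all.
html = """
<html><body>
  <div id="app"></div>
  <script>
    /* Runs only in a browser, e.g. filling #app with school cards */
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup does not execute the script, so the selector finds
# nothing, exactly like the empty school_items list in the original code.
print(soup.find_all("div", {"class": "school-item"}))  # []
```

Rendering such a page requires a real browser engine (e.g. Selenium or Playwright), or else finding the JSON endpoint the page's script fetches, via the browser's network tab.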
 
pbuk said:
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
Thank you very much!
 
Just curious why you chose Python when MS Excel supports this sort of task natively. If the website you are scraping is cooperative, you can sometimes even do it directly in an Excel worksheet with no code required.
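For a site that does serve its data as plain HTML tables, the Excel "From Web" workflow has a near one-line Python counterpart in pandas.read_html. A sketch on a dummy table (the real goodschool.hk pages are JavaScript-rendered, so this would not work on them directly):

```python
import io

import pandas as pd

# A cooperative site serving a plain HTML table (dummy data here)
html = """
<table>
  <tr><th>School</th><th>Phone</th></tr>
  <tr><td>Example Primary School</td><td>2345 6789</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found;
# given a URL instead of a string, it fetches the page for you,
# much like Excel's Data > From Web.
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```

From there, `df.to_excel("school_data.xlsx", index=False)` writes the same kind of file the original script was aiming for.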
 