Why Use Python for Web Scraping Instead of Excel?

In summary: the code runs but produces a blank Excel file because the page content is generated by JavaScript, which BeautifulSoup cannot see; one reply also asks why Python was chosen when MS Excel supports this sort of task natively.
  • #1
Leo_Chau_430
TL;DR Summary
I am trying to write a program that automatically scrapes the website https://www.goodschool.hk/ss to build an Excel file containing the phone number, address, email address and fax number of all the secondary schools, primary schools and kindergartens in Hong Kong. However, I have run into a problem: my code runs successfully, but the Excel file it generates is blank.
My code is as follows:

Python:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

url = 'https://www.goodschool.hk/ss'

response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, 'html.parser')

school_items = soup.find_all('div', {'class': 'school-item'})

school_names_en = []
school_names_zh = []
school_addresses_en = []
school_addresses_zh = []
school_phones = []
school_emails = []
school_faxes = []

for school_item in school_items:

    name_elements = school_item.select('a.school-name')
    school_names_en.append(name_elements[0].text.strip())
    school_names_zh.append(name_elements[1].text.strip())

    address_elements = school_item.select('div.school-address')
    school_addresses_en.append(address_elements[0].text.strip())
    school_addresses_zh.append(address_elements[1].text.strip())

    contact_elements = school_item.select('div.contact-info')
    school_phones.append(contact_elements[0].text.strip())
    school_emails.append(contact_elements[1].text.strip())
    school_faxes.append(contact_elements[2].text.strip())

df = pd.DataFrame({
    'School Name (English)': school_names_en,
    'School Name (Chinese)': school_names_zh,
    'Address (English)': school_addresses_en,
    'Address (Chinese)': school_addresses_zh,
    'Phone Number': school_phones,
    'Email Address': school_emails,
    'Fax Number': school_faxes
})

desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
excel_file_path = os.path.join(desktop_path, "school_data.xlsx")
df.to_excel(excel_file_path, index=False)

if os.path.exists(excel_file_path):
    print("Excel file generated successfully!")
else:
    print("Failed to generate Excel file.")
 
  • #2
Why don't you just e-mail them, asking if a non-html list is available ? Other resources may include local school boards, etc.
 
  • #3
I have found a contact list of the schools on the website https://data.gov.hk/tc/. However, the list there does not have the email addresses of the schools...
 
  • Like
Likes hmmm27
  • #4
Leo_Chau_430 said:
My code is as follows
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
 
  • Haha
Likes jedishrfu
  • #5
PeterDonis said:
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
Sorry, I am new to this forum. You said the code is in a code block, but I cannot see it; where can I find it?
 
  • #6
As a debug measure, print out the dataframe and make sure it contains data.
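The debug step above can be sketched like this; the column names are just stand-ins for the ones in the original script:

```python
import pandas as pd

# Stand-in for the scraped DataFrame; in the real script this is the df
# built from the scraped lists, just before df.to_excel(...).
df = pd.DataFrame({'School Name (English)': [], 'Phone Number': []})

print(df.head())        # first few rows -- empty here
print(len(df), "rows")  # 0 rows means the selectors matched nothing
```

If `len(df)` is 0, the problem is upstream in the scraping, not in the Excel export.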
 
  • Like
Likes pbuk
  • #7
Greg Bernhardt said:
As a debug measure, print out the dataframe and make sure it contains data.
Yeah, I have tried printing the data, but it seems the data is not being extracted properly.
 
  • #8
Leo_Chau_430 said:
Yeah, I have tried printing the data, but it seems the data is not being extracted properly.
Next debug is to print out each school_item. I suspect you're not parsing the classes right.
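A minimal sketch of what this debug step reveals, using a tiny stand-in page (the class name `sch-card` is invented for illustration; it is not from the real site):

```python
from bs4 import BeautifulSoup

# Tiny stand-in page: note there is no element with class="school-item".
html = '<div class="sch-card"><a class="name">Example School</a></div>'
soup = BeautifulSoup(html, 'html.parser')

items = soup.find_all('div', {'class': 'school-item'})
print(items)  # [] -- empty, so the for loop never appends anything

# List the classes that actually exist before writing selectors:
for div in soup.find_all('div'):
    print(div.get('class'))  # ['sch-card']
```

An empty `find_all` result fails silently: the loop body simply never runs, which matches the blank spreadsheet.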
 
  • #9
Greg Bernhardt said:
Next debug is to print out each school_item. I suspect you're not parsing the classes right.
I have just checked: school_items is an empty list. However, when I print soup, it has valid output.
 
  • #10
Leo_Chau_430 said:
you said your code is as follow, but I cannot see them, where can I find it
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
 
  • #11
PeterDonis said:
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
Thank you, I can see the code now. I think the indentation is correct.
 
  • #12
Leo_Chau_430 said:
My code runs successfully, but the Excel file it generates is blank.
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
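This can be demonstrated with a toy version of a JavaScript-rendered page (the markup below is a made-up illustration, not the real site's source):

```python
from bs4 import BeautifulSoup

# The server sends an empty container plus a script; the school items only
# exist after a browser executes that script. BeautifulSoup never runs
# JavaScript, so it sees only the raw HTML below -- the script contents are
# just text to the parser, not elements.
html = '''
<div id="app"></div>
<script>
  document.getElementById("app").innerHTML =
    '<div class="school-item">Example School</div>';
</script>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('div', {'class': 'school-item'}))  # [] -- nothing to find
```

The usual workarounds are driving a real browser (e.g. Selenium or Playwright) so the JavaScript runs before parsing, or finding the JSON endpoint the page itself fetches its data from.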
 
  • Like
  • Love
Likes harborsparrow and Leo_Chau_430
  • #13
pbuk said:
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
Thank you very much!
 
  • #14
Just curious why you chose Python when MS Excel supports this sort of task natively. If the website you are scraping is cooperative, you can sometimes even do it directly in an Excel worksheet with no code required.
 
  • Like
Likes Vanadium 50 and Greg Bernhardt

What is an Excel Web Scraping Program?

An Excel Web Scraping Program is a tool that allows users to extract data from websites and import it into a Microsoft Excel spreadsheet for further analysis and manipulation.

How does an Excel Web Scraping Program work?

An Excel Web Scraping Program typically works by using a web scraping library or tool to access and retrieve data from websites. The data is then parsed and structured in a way that can be imported into Excel. Some programs may also allow users to specify which data to extract and how to format it.
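The parse-then-structure pipeline described above can be sketched in a few lines of Python; the page below is a static stand-in, and the class names (`entry`, `name`, `phone`) are invented for illustration:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Static stand-in page; on a real site the class names would come from
# inspecting the page source.
html = '''
<div class="entry"><span class="name">School A</span><span class="phone">1234 5678</span></div>
<div class="entry"><span class="name">School B</span><span class="phone">8765 4321</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Parse: pull one row per entry.
rows = [
    {'Name': e.find('span', {'class': 'name'}).text,
     'Phone': e.find('span', {'class': 'phone'}).text}
    for e in soup.find_all('div', {'class': 'entry'})
]

# Structure: load into a DataFrame, then export to Excel.
df = pd.DataFrame(rows)
print(df)
# df.to_excel('schools.xlsx', index=False)  # requires the openpyxl package
```

The final `to_excel` call is where the structured data lands in a spreadsheet; everything before it is ordinary HTML parsing.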

What are the benefits of using an Excel Web Scraping Program?

Using an Excel Web Scraping Program can save time and effort by automating the process of collecting and organizing data from multiple websites. It can also help with data management and analysis by providing structured and reliable data in a familiar format like Excel.

Are there any limitations to using an Excel Web Scraping Program?

While an Excel Web Scraping Program can be a useful tool, it may have limitations in terms of the amount and complexity of data it can extract. Some websites may also have measures in place to prevent web scraping, making it difficult to retrieve data. Additionally, the accuracy and reliability of the extracted data may vary.

How can I choose the right Excel Web Scraping Program for my needs?

There are many Excel Web Scraping Programs available, each with its own features and capabilities. It is important to consider your specific data scraping requirements, the complexity of the websites you need to extract data from, and the reliability and support of the program before making a decision. Reading reviews and trying out demos can also help in choosing the right program for your needs.
