Why Use Python for Web Scraping Instead of Excel?


Discussion Overview

The discussion revolves around the use of Python for web scraping compared to using Excel for similar tasks. Participants explore the technical aspects of web scraping, debugging code, and the potential advantages of using Python over Excel.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Homework-related

Main Points Raised

  • One participant shares a Python code snippet for scraping school data from a website and saving it to an Excel file.
  • Another participant suggests contacting schools directly for data instead of scraping.
  • A participant mentions finding a contact list on a government website but notes the absence of email addresses.
  • Several participants emphasize the importance of formatting code correctly, particularly in Python, and suggest using BBCode for clarity.
  • Debugging suggestions include printing the dataframe and checking the contents of the scraped data, with one participant noting that the data extraction appears to be failing.
  • Concerns are raised about the effectiveness of the scraping code, with participants questioning the parsing of HTML elements.
  • One participant asks why Python was chosen for this task when Excel can handle similar tasks without coding.

Areas of Agreement / Disagreement

Participants express differing views on the effectiveness of Python versus Excel for web scraping, with no consensus reached on the best approach. There are also varying opinions on the debugging process and the adequacy of the provided code.

Contextual Notes

Some participants note that the scraping code may not be extracting data correctly, and there are unresolved issues regarding the HTML structure being parsed. The discussion includes various debugging strategies that have not yet led to a resolution.

Leo_Chau_430
TL;DR
I am trying to write a program that can automatically scrape the website https://www.goodschool.hk/ss to build an Excel file that contains the phone number, address, email address and fax number of all the secondary schools, primary schools and kindergartens in Hong Kong. However, I have faced some problems... My code runs successfully, but the Excel file generated is blank.
My code is as follows:

Python:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

url = 'https://www.goodschool.hk/ss'

response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, 'html.parser')

school_items = soup.find_all('div', {'class': 'school-item'})

school_names_en = []
school_names_zh = []
school_addresses_en = []
school_addresses_zh = []
school_phones = []
school_emails = []
school_faxes = []

for school_item in school_items:

    name_elements = school_item.select('a.school-name')
    school_names_en.append(name_elements[0].text.strip())
    school_names_zh.append(name_elements[1].text.strip())

    address_elements = school_item.select('div.school-address')
    school_addresses_en.append(address_elements[0].text.strip())
    school_addresses_zh.append(address_elements[1].text.strip())

    contact_elements = school_item.select('div.contact-info')
    school_phones.append(contact_elements[0].text.strip())
    school_emails.append(contact_elements[1].text.strip())
    school_faxes.append(contact_elements[2].text.strip())

df = pd.DataFrame({
    'School Name (English)': school_names_en,
    'School Name (Chinese)': school_names_zh,
    'Address (English)': school_addresses_en,
    'Address (Chinese)': school_addresses_zh,
    'Phone Number': school_phones,
    'Email Address': school_emails,
    'Fax Number': school_faxes
})

desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
excel_file_path = os.path.join(desktop_path, "school_data.xlsx")
df.to_excel(excel_file_path, index=False)

if os.path.exists(excel_file_path):
    print("Excel file generated successfully!")
else:
    print("Failed to generate Excel file.")
 
Why don't you just e-mail them, asking if a non-HTML list is available? Other resources may include local school boards, etc.
 
I have found the contact list of the schools on the website https://data.gov.hk/tc/ However, the list there does not include the email addresses of the schools...
 
Leo_Chau_430 said:
My code is as follows
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
 
PeterDonis said:
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
Sorry, I am new to this forum. You said your code is as follows, but I cannot see it. Where can I find it?
 
As a debug measure, print out the dataframe and make sure it contains data.
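That check can be sketched as follows. The two-column frame here is a hypothetical stand-in for what the script in the OP builds when every scraped list stays empty:

```python
import pandas as pd

# Stand-in for the dataframe the script builds when scraping returns nothing:
# the column lists are empty, so the frame has headers but no rows.
df = pd.DataFrame({'School Name (English)': [], 'Phone Number': []})

print(df)        # pandas prints "Empty DataFrame" plus the column names
print(df.empty)  # True: no rows were collected, so the Excel file will be blank
```

If `df.empty` is `True` here, `to_excel` will faithfully write a spreadsheet containing only the header row, which matches the blank file described in the OP.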
 
Greg Bernhardt said:
As a debug measure, print out the dataframe and make sure it contains data.
Yes, I have tried to print out the data, but it seems that the data is not extracted properly.
 
Leo_Chau_430 said:
Yes, I have tried to print out the data, but it seems that the data is not extracted properly.
The next debugging step is to print out each school_item. I suspect you're not parsing the classes right.
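A minimal sketch of that check; the HTML string here is hypothetical, standing in for `response.content`, and mirrors what was later reported in the thread (a page whose static HTML contains no `school-item` divs):

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML: note there is no <div class="school-item"> in it,
# which is what a JavaScript-driven page can look like to requests.
html = '<html><body><div id="app"></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

school_items = soup.find_all('div', {'class': 'school-item'})
print(len(school_items))   # 0: the selector matched nothing
for item in school_items:
    print(item.prettify())  # never runs when the list is empty
```

Printing the length first tells you immediately whether the loop body in the OP ever executes at all.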
 
Greg Bernhardt said:
The next debugging step is to print out each school_item. I suspect you're not parsing the classes right.
I have just checked: school_items is an empty list. However, when I print soup it has valid output.
 
  • #10
Leo_Chau_430 said:
You said your code is as follows, but I cannot see it. Where can I find it?
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
 
  • #11
PeterDonis said:
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
Thank you, I can see the code now. I think the indentation is correct.
 
  • #12
Leo_Chau_430 said:
My code runs successfully, but the Excel file generated is blank.
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see in the browser.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
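That point can be checked directly before any parsing: search the raw HTML that requests returns for the class name. The `static_html` string below is a hypothetical stand-in for what a JavaScript-rendered page typically serves (an empty mount point plus a script bundle):

```python
# Check whether a CSS class name appears anywhere in the raw HTML.
# With the real page you would pass requests.get('https://www.goodschool.hk/ss').text.
def selector_present(html: str, class_name: str) -> bool:
    return class_name in html

# Hypothetical static HTML of a JavaScript-rendered page: none of the
# content you see in the browser is present in the served document.
static_html = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'

print(selector_present(static_html, 'school-item'))  # False: nothing for BeautifulSoup to match
```

When this prints `False`, no choice of BeautifulSoup selector will help; you need either a browser-automation tool that executes the JavaScript, or the underlying data API the page calls.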
 
  • #13
pbuk said:
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see in the browser.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
Thank you very much!
 
  • #14
Just curious why you chose Python when MS Excel supports this sort of task natively. If the website you are scraping is cooperative, you can sometimes even do it directly in an Excel worksheet with no code required.
 
