Why Use Python for Web Scraping Instead of Excel?


Discussion Overview

The discussion revolves around the use of Python for web scraping compared to using Excel for similar tasks. Participants explore the technical aspects of web scraping, debugging code, and the potential advantages of using Python over Excel.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Homework-related

Main Points Raised

  • One participant shares a Python code snippet for scraping school data from a website and saving it to an Excel file.
  • Another participant suggests contacting schools directly for data instead of scraping.
  • A participant mentions finding a contact list on a government website but notes the absence of email addresses.
  • Several participants emphasize the importance of formatting code correctly, particularly in Python, and suggest using BBCode for clarity.
  • Debugging suggestions include printing the dataframe and checking the contents of the scraped data, with one participant noting that the data extraction appears to be failing.
  • Concerns are raised about the effectiveness of the scraping code, with participants questioning the parsing of HTML elements.
  • One participant asks why Python was chosen for this task when Excel can handle similar tasks without coding.

Areas of Agreement / Disagreement

Participants express differing views on the effectiveness of Python versus Excel for web scraping, with no consensus reached on the best approach. There are also varying opinions on the debugging process and the adequacy of the provided code.

Contextual Notes

Some participants note that the scraping code may not be extracting data correctly, and there are unresolved issues regarding the HTML structure being parsed. The discussion includes various debugging strategies that have not yet led to a resolution.

Leo_Chau_430
TL;DR
I am trying to write a program that can automatically scrape the website https://www.goodschool.hk/ss to build an Excel file that contains the phone number, address, email address and fax number of all the secondary schools, primary schools and kindergartens in Hong Kong. However, I have faced some problems... My code runs successfully, but the Excel file generated is blank.
My code is as follows:

Python:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

url = 'https://www.goodschool.hk/ss'

response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, 'html.parser')

school_items = soup.find_all('div', {'class': 'school-item'})

school_names_en = []
school_names_zh = []
school_addresses_en = []
school_addresses_zh = []
school_phones = []
school_emails = []
school_faxes = []

for school_item in school_items:

    name_elements = school_item.select('a.school-name')
    school_names_en.append(name_elements[0].text.strip())
    school_names_zh.append(name_elements[1].text.strip())

    address_elements = school_item.select('div.school-address')
    school_addresses_en.append(address_elements[0].text.strip())
    school_addresses_zh.append(address_elements[1].text.strip())

    contact_elements = school_item.select('div.contact-info')
    school_phones.append(contact_elements[0].text.strip())
    school_emails.append(contact_elements[1].text.strip())
    school_faxes.append(contact_elements[2].text.strip())

df = pd.DataFrame({
    'School Name (English)': school_names_en,
    'School Name (Chinese)': school_names_zh,
    'Address (English)': school_addresses_en,
    'Address (Chinese)': school_addresses_zh,
    'Phone Number': school_phones,
    'Email Address': school_emails,
    'Fax Number': school_faxes
})

desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
excel_file_path = os.path.join(desktop_path, "school_data.xlsx")
df.to_excel(excel_file_path, index=False)

if os.path.exists(excel_file_path):
    print("Excel file generated successfully!")
else:
    print("Failed to generate Excel file.")
 
Why don't you just e-mail them, asking if a non-HTML list is available? Other resources may include local school boards, etc.
 
I have found the contact list of the schools on the website https://data.gov.hk/tc/ However, the list there does not include the email addresses of the schools...
 
Leo_Chau_430 said:
My code is as follows
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
 
PeterDonis said:
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
Sorry, I am new to this forum. You said your code is as follows, but I cannot see it. Where can I find it?
 
As a debug measure, print out the dataframe and make sure it contains data.
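That check can be sketched as follows. The two-column frame here is a hypothetical stand-in for what the script in the OP builds when every scraped list stays empty:

```python
import pandas as pd

# Stand-in for the dataframe the script builds when scraping returns nothing:
# the column lists are empty, so the frame has headers but no rows.
df = pd.DataFrame({'School Name (English)': [], 'Phone Number': []})

print(df)        # pandas prints "Empty DataFrame" plus the column names
print(df.empty)  # True: no rows were collected, so the Excel file will be blank
```

If `df.empty` is `True` here, `to_excel` will faithfully write a spreadsheet containing only the header row, which matches the blank file described in the OP.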
 
Greg Bernhardt said:
As a debug measure, print out the dataframe and make sure it contains data.
Yes, I have tried to print out the data, but it seems that the data is not extracted properly.
 
Leo_Chau_430 said:
Yes, I have tried to print out the data, but it seems that the data is not extracted properly.
The next debugging step is to print out each school_item. I suspect you're not parsing the classes right.
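A minimal sketch of that check; the HTML string here is hypothetical, standing in for `response.content`, and mirrors what was later reported in the thread (a page whose static HTML contains no `school-item` divs):

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML: note there is no <div class="school-item"> in it,
# which is what a JavaScript-driven page can look like to requests.
html = '<html><body><div id="app"></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

school_items = soup.find_all('div', {'class': 'school-item'})
print(len(school_items))   # 0: the selector matched nothing
for item in school_items:
    print(item.prettify())  # never runs when the list is empty
```

Printing the length first tells you immediately whether the loop body in the OP ever executes at all.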
 
Greg Bernhardt said:
The next debugging step is to print out each school_item. I suspect you're not parsing the classes right.
I have just checked: school_items is an empty list. However, when I print soup it has valid output.
 
  • #10
Leo_Chau_430 said:
You said your code is as follows, but I cannot see it. Where can I find it?
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
 
  • #11
PeterDonis said:
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
Thank you, I can see the code now. I think the indentation is correct.
 
  • #12
Leo_Chau_430 said:
My code runs successfully, but the Excel file generated is blank.
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see in the browser.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
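That point can be checked directly before any parsing: search the raw HTML that requests returns for the class name. The `static_html` string below is a hypothetical stand-in for what a JavaScript-rendered page typically serves (an empty mount point plus a script bundle):

```python
# Check whether a CSS class name appears anywhere in the raw HTML.
# With the real page you would pass requests.get('https://www.goodschool.hk/ss').text.
def selector_present(html: str, class_name: str) -> bool:
    return class_name in html

# Hypothetical static HTML of a JavaScript-rendered page: none of the
# content you see in the browser is present in the served document.
static_html = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'

print(selector_present(static_html, 'school-item'))  # False: nothing for BeautifulSoup to match
```

When this prints `False`, no choice of BeautifulSoup selector will help; you need either a browser-automation tool that executes the JavaScript, or the underlying data API the page calls.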
 
  • #13
pbuk said:
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see in the browser.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
Thank you very much!
 
  • #14
Just curious why you chose Python when MS Excel supports this sort of task natively. If the website you are scraping is cooperative, you can sometimes even do it directly in an Excel worksheet with no code required.
 
