Web Scraping Py Script Generates Error for Old Norse Chars

  • Thread starter: deltapapazulu
  • Tags: Error, Web

Discussion Overview

The discussion revolves around a Python script for web scraping using BeautifulSoup, specifically addressing issues encountered when attempting to scrape Old Norse characters from a glossary page. Participants explore character encoding challenges and potential solutions related to writing these characters to a CSV file.

Discussion Character

  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant describes a Python script that successfully scrapes Latin characters but fails with Old Norse characters due to a UnicodeEncodeError when writing to a CSV file.
  • Another participant suggests that the issue may stem from Old Norse characters being multi-byte, which could complicate parsing and writing.
  • A participant expresses frustration with character encoding inconsistencies across the internet and seeks a standard solution.
  • Some participants advocate for using UTF-8 encoding to handle characters outside the Latin-1 codepage, asserting that it allows for safe writing of any Unicode character.
  • One participant reports success after adding encoding='utf-8' to the file open() call, sharing their revised code as a solution.

Areas of Agreement / Disagreement

Participants generally agree that character encoding is the root of the issue, with multiple suggestions for using UTF-8 encoding. However, there is no consensus on the best approach to handle the problem, as some participants express ongoing confusion.

Contextual Notes

Limitations include the lack of clarity on the specific differences between the Latin and Old Norse glossary pages that may affect scraping, as well as the unresolved nature of character encoding issues in general.

deltapapazulu
I have posted the Python code up front. See my question/inquiry below it.

Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/latol/1")
bsObj = BeautifulSoup(html)

nameList = bsObj.findAll("strong")

with open('mycsv4.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])

Hi! I am using the Python package BeautifulSoup to learn web scraping/crawling and the above code works for grabbing Latin characters but not Old Norse characters.

Example: UT Austin has a languages page featuring long glossaries and dictionaries of ancient languages. I have chosen these pages to test the above scraping code because, conveniently, each glossary word on the glossary pages is enclosed in 'strong' tags in the source code, e.g.:

<strong><span lang='la' class='Latin'>coloni</span></strong>

The desired text here is the Latin glossary word "coloni". Anyway, the above code works fine for UT Austin's Latin Glossary page: it grabbed all the Latin glossary words and put them into a CSV file in a tight, neat column, in the order in which they appear on the webpage, for a total of 1191 Latin words.

https://lrc.la.utexas.edu/eieol_master_gloss/latol/1

But when I attempt to run the same code on UTAustin's "Old Norse Glossary" page it generates the following error.

--------------------------------------------
Traceback (most recent call last):
File "C:/Users/JohnP/PycharmProjects/FirstProgram/main.py", line 39, in <module>
thewriter.writerow([name.get_text()])
File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0301' in position 1: character maps to <undefined>
--------------------------------------------
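The failure can be reproduced in isolation. This is a minimal sketch (independent of the scraping code) showing that U+0301, a combining acute accent used in composed Old Norse forms like á, has no slot in the cp1252 codepage that open() defaults to on Windows:

```python
# U+0301 is a combining acute accent; Old Norse vowels like "á" may be
# represented as a base letter plus this combining mark.
ch = "\u0301"

# cp1252 (the Windows default seen in the traceback) cannot map this
# character, so encoding raises UnicodeEncodeError.
try:
    ch.encode("cp1252")
    print("encoded fine")
except UnicodeEncodeError:
    print("UnicodeEncodeError")   # this branch runs

# UTF-8, by contrast, encodes it without complaint (as two bytes).
print(ch.encode("utf-8"))  # b'\xcc\x81'
```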

NOTE: The script DOES generate the list of Old Norse words in my PyCharm interpreter output window using the immediately following code; it only hangs up when trying to write them to a CSV file with the code above.

Python:
nameList = bsObj.findAll("strong")
for name in nameList:
     print(name.get_text())

Here is the Old Norse glossary page:

https://lrc.la.utexas.edu/eieol_master_gloss/norol/18

And here is the above code but with the Old Norse page URL

Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/norol/18")
bsObj = BeautifulSoup(html, "lxml")

nameList = bsObj.findAll("strong")

import csv
with open('mycsv5.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])

Thank you for any suggestions on how to resolve this problem.
 
My guess is that the Old Norse characters are more than one byte long and that is causing BeautifulSoup to fail in parsing the page. You should inspect the web pages to see how they differ.

Actually, the error shows that the failure happens while writing the data out: Python can't convert the character U+0301 into an equivalent one in the output encoding. This means you will have to select an output encoding that accepts that character, such as UTF-8, rather than Latin-1 or ASCII.

Stack Overflow has a couple of ways to do this:

https://stackoverflow.com/questions/6048085/writing-unicode-text-to-a-text-file#6048203
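A minimal sketch of that approach (the file name and word list here are made up for illustration): passing encoding='utf-8' to open() lets csv.writer emit any Unicode character.

```python
import csv

# Hypothetical sample of Old Norse words containing characters
# outside the Latin-1/cp1252 codepages.
words = ["v\u00e1pn", "\u00feing", "m\u01ebrgum"]

# encoding='utf-8' overrides the platform default (cp1252 on Windows),
# so writerow() no longer raises UnicodeEncodeError.
with open("norse_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for w in words:
        writer.writerow([w])

# Reading back with the same encoding round-trips the text intact.
with open("norse_sample.csv", newline="", encoding="utf-8") as f:
    rows = [row[0] for row in csv.reader(f)]
print(rows == words)  # True
```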
 
All I've got to say is what a gosh-dang mess this character encoding stuff is. The internet is all over the place on this. What I was hoping for was some standard fix for my code.

Anyway thanks Jedishrfu for your reply.

I am still trying to figure this out.
 
Just use UTF-8. Any characters outside of the standard Latin-1 codepage can then be written safely to your file.

UTF-8 is an encoding of Unicode, the codepage to end all code pages. It allows you to write any Unicode character to a file without any null bytes or unprintable characters.

There's a whole body of knowledge on how to get computers to share text in a safe way where no character is left behind :-)
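That property is easy to check. A quick sketch (the sample string is arbitrary, mixing ASCII, accented Latin, Old Norse letters, a runic character, and a character beyond the Basic Multilingual Plane):

```python
s = "coloni \u00e1 \u00fe\u01eb \u16a6 \U0001D518"

b = s.encode("utf-8")

# UTF-8 never produces a null byte except for U+0000 itself...
print(0 in b)                  # False
# ...and the encoding is lossless: decoding gives back the same string.
print(b.decode("utf-8") == s)  # True
```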
 
jedishrfu said:
Just use UTF-8. Any characters outside of the standard Latin-1 codepage are written safely to your file.

Thank you for the reply. Part of the problem is that I am still relatively new to Python and programming in general.

Ok so on this page:

https://docs.python.org/3/library/functions.html

I noticed that one of the arguments to the open() function is encoding=. So I added encoding='utf-8' to my open() call, and it worked. That one argument is the whole fix.

Here is the whole working code for anyone interested. It takes all the bold-type glossary words from UT Austin's Old Norse glossary page, creates a CSV file, and puts all the words into a neat, tight single-word column; the CSV file can then be opened in Excel with all the words in column A.

NOTE: this code has been slightly tweaked from the code in my original post, but it does the same thing. And again, I solved the problem outlined in my original post by adding encoding='utf-8' to the line beginning with open(.

Python:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://lrc.la.utexas.edu/eieol_master_gloss/norol/18"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# Each glossary word on the page is wrapped in <strong> tags.
nameList = soup.findAll("strong")

# encoding='utf-8' is the fix: the platform default (cp1252 on Windows)
# cannot represent characters like the combining accent U+0301.
with open('some.csv', 'w', newline='', encoding='utf-8') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])
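One caveat worth noting (not raised in the thread, so treat it as an aside): some versions of Excel only auto-detect UTF-8 in a CSV when the file starts with a byte-order mark. Python's 'utf-8-sig' codec writes that BOM for you; a sketch with a made-up file name and word list:

```python
import csv

words = ["\u00e1", "\u00fe\u00f3"]  # hypothetical sample rows

# 'utf-8-sig' behaves like 'utf-8' but prepends the BOM b'\xef\xbb\xbf',
# which Excel uses as a hint that the file is UTF-8.
with open("some_excel.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows([w] for w in words)

with open("some_excel.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```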