Web Scraping Py Script Generates Error for Old Norse Chars

  • Thread starter: deltapapazulu
  • Tags: Error, Web
AI Thread Summary
The discussion revolves around using Python's BeautifulSoup for web scraping, specifically to extract words from Latin and Old Norse glossaries hosted by UT Austin. The initial code successfully retrieves Latin words but encounters a UnicodeEncodeError when attempting to write Old Norse characters to a CSV file. The error arises due to the inability of the default encoding to handle certain characters. The solution involves specifying 'utf-8' as the encoding in the open() function, allowing the script to properly write the Old Norse words to a CSV file without errors. This adjustment enables the creation of a neatly formatted CSV file that can be opened in Excel, containing all the extracted glossary words. The final working code is shared, highlighting the importance of using UTF-8 encoding for handling diverse character sets in web scraping tasks.
deltapapazulu
I have posted the Python code up front. See my question/inquiry below it.

Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/latol/1")
bsObj = BeautifulSoup(html, "html.parser")  # specify a parser to avoid bs4's warning

nameList = bsObj.findAll("strong")

with open('mycsv4.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])

Hi! I am using the Python package BeautifulSoup to learn web scraping/crawling, and the above code works for grabbing Latin characters but not Old Norse characters.

For example, UT Austin has a languages page featuring long glossaries and dictionaries of ancient languages. I chose these glossary pages to test the scraping code because each glossary word is conveniently enclosed in 'strong' tags in the source code, e.g.:

<strong><span lang='la' class='Latin'>coloni</span></strong>

The desired text here is the Latin glossary word "coloni". Anyway, the above code works fine for UT Austin's Latin glossary page. It grabbed all the Latin glossary words and put them into a CSV file as a tight, neat column in the order in which they appear on the webpage, for a total of 1191 Latin words.

https://lrc.la.utexas.edu/eieol_master_gloss/latol/1

But when I run the same code on UT Austin's "Old Norse Glossary" page, it generates the following error.

--------------------------------------------
Traceback (most recent call last):
  File "C:/Users/JohnP/PycharmProjects/FirstProgram/main.py", line 39, in <module>
    thewriter.writerow([name.get_text()])
  File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0301' in position 1: character maps to <undefined>
--------------------------------------------
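The traceback can be reproduced in isolation: it is the file's encoding, not BeautifulSoup, that rejects the character. A minimal sketch (not from the thread):

```python
# U+0301 is a combining acute accent, common in Old Norse vowels.
# cp1252, the default file encoding on Western Windows, has no
# mapping for it, while UTF-8 can encode any Unicode codepoint.
ch = "\u0301"
try:
    ch.encode("cp1252")
    print("cp1252 worked")
except UnicodeEncodeError as e:
    print("cp1252 failed:", e)

print(ch.encode("utf-8"))  # b'\xcc\x81' -- two bytes, no error
```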

NOTE: The script DOES generate the list of Old Norse words in my PyCharm interpreter output window using the immediately following code; it only hangs up when trying to write them to a CSV file with the aforesaid code.

Python:
nameList = bsObj.findAll("strong")
for name in nameList:
     print(name.get_text())

Here is the Old Norse glossary page:

https://lrc.la.utexas.edu/eieol_master_gloss/norol/18

And here is the above code, but with the Old Norse page URL:

Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/norol/18")
bsObj = BeautifulSoup(html, "lxml")

nameList = bsObj.findAll("strong")

import csv
with open('mycsv5.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])

Thank you for any suggestions on how to resolve this problem.
 
Technology news on Phys.org
My guess is that the Old Norse characters are more than one byte long and that is causing BeautifulSoup to fail in parsing the page. You should inspect the web pages to see how they are different.

Actually, the error shows that it fails while trying to write the data out: it can't convert the U+0301 character into an equivalent one in the output encoding. This means you will have to select an output encoding that accepts U+0301, like UTF-8, and not the Latin-1 or ASCII encodings.

Stack Overflow has a couple of ways to do this:

https://stackoverflow.com/questions/6048085/writing-unicode-text-to-a-text-file#6048203
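The gist of the linked answer, sketched here (the filename is illustrative): pass encoding= to open() so Python writes UTF-8 instead of the platform default (cp1252 on Windows).

```python
# Write and read back an Old Norse word; with encoding='utf-8'
# the combining accent U+0301 round-trips without error.
word = "mo\u0301ðir"  # 'móðir' written with a combining acute accent
with open("test_utf8.txt", "w", encoding="utf-8") as f:
    f.write(word + "\n")
with open("test_utf8.txt", encoding="utf-8") as f:
    print(f.read().strip() == word)  # True
```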
 
All I got to say is what a gosh-dang mess this character encoding stuff is. The internet is all over the place on this. What I was hoping for was some standard fix for my code.

Anyway thanks Jedishrfu for your reply.

I am still trying to figure this out.
 
Just use UTF-8. Any characters outside of the standard Latin-1 codepage are written safely to your file.

UTF-8 is an encoding of Unicode, aka the codepage to end all code pages. UTF-8 allows you to write any Unicode character to a file without any null bytes or unprintable characters.

There’s a whole body of knowledge of how to get computers to share text in a safe way where no character is left behind :-)
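A quick illustration of that point (the sample words are my own, not from the glossary): plain ASCII text stays one byte per character under UTF-8, while characters like þ and ó take two bytes each.

```python
# UTF-8 is variable-width: ASCII letters encode to 1 byte each,
# Latin letters with diacritics (and þ) to 2 bytes each.
for word in ["coloni", "þórr"]:
    data = word.encode("utf-8")
    print(word, len(word), "chars ->", len(data), "bytes")
# "coloni": 6 chars -> 6 bytes; "þórr": 4 chars -> 6 bytes
```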
 
jedishrfu said:
Just use UTF-8. Any characters outside of the standard Latin-1 codepage are written safely to your file.

UTF-8 is an encoding of Unicode, aka the codepage to end all code pages. UTF-8 allows you to write any Unicode character to a file without any null bytes or unprintable characters.

There’s a whole body of knowledge of how to get computers to share text in a safe way where no character is left behind :-)

Thank you for the reply; part of the problem is that I am still relatively new to Python and programming in general.

Ok so on this page:

https://docs.python.org/3/library/functions.html

I noticed that one of the arguments for the open() function is encoding= . So I added encoding='utf-8' to my code, and that fixed it.

Here is the whole working code for anyone interested. It takes all the bold-type glossary words from UT Austin's Old Norse glossary page, creates a CSV file, and puts all the words into a neat single-word column; the CSV file can then be opened in Excel with all the words in column A.

NOTE: this code has been slightly tweaked from the code that appears in my original post, but it does the same thing. And again, I solved the problem outlined in my original post by adding encoding='utf-8' to the line beginning with open( .

Python:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://lrc.la.utexas.edu/eieol_master_gloss/norol/18"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
nameList = soup.findAll("strong")
with open('some.csv', 'w', newline='', encoding='utf-8') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])
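One caveat worth noting for the Excel step: some Excel versions guess the wrong encoding for a plain UTF-8 CSV and display garbled characters. If that happens, writing with 'utf-8-sig' prepends a byte-order mark that Excel recognizes. A sketch (the word list is an illustrative stand-in for nameList):

```python
import csv

# 'utf-8-sig' writes the same UTF-8 bytes, plus a leading BOM
# (EF BB BF) that tells Excel the file is UTF-8.
words = ["þórr", "mær", "coloni"]  # stand-in for the scraped nameList
with open("some_bom.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    for w in words:
        writer.writerow([w])

with open("some_bom.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf' -- the UTF-8 BOM
```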
 