Web Scraping Py Script Generates Error for Old Norse Chars

  • Thread starter: deltapapazulu
  • Tags: Error, Web

Discussion Overview

The discussion revolves around a Python script for web scraping using BeautifulSoup, specifically addressing issues encountered when attempting to scrape Old Norse characters from a glossary page. Participants explore character encoding challenges and potential solutions related to writing these characters to a CSV file.

Discussion Character

  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant describes a Python script that successfully scrapes Latin characters but fails with Old Norse characters due to a UnicodeEncodeError when writing to a CSV file.
  • Another participant suggests that the issue may stem from Old Norse characters being multi-byte, which could complicate parsing and writing.
  • A participant expresses frustration with character encoding inconsistencies across the internet and seeks a standard solution.
  • Some participants advocate for using UTF-8 encoding to handle characters outside the Latin-1 codepage, asserting that it allows for safe writing of any Unicode character.
  • One participant reports success after adding encoding='utf-8' to the file open() call, sharing their revised code as a solution.

Areas of Agreement / Disagreement

Participants generally agree that character encoding is the root of the issue, with multiple suggestions for using UTF-8 encoding. However, there is no consensus on the best approach to handle the problem, as some participants express ongoing confusion.

Contextual Notes

Limitations include the lack of clarity on the specific differences between the Latin and Old Norse glossary pages that may affect scraping, as well as the unresolved nature of character encoding issues in general.

deltapapazulu
I have posted the Python code up front. See my question/inquiry below it.

Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/latol/1")
bsObj = BeautifulSoup(html)

nameList = bsObj.findAll("strong")

with open('mycsv4.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])

Hi! I am using the Python package BeautifulSoup to learn web scraping/crawling and the above code works for grabbing Latin characters but not Old Norse characters.

Example: UT Austin has a languages page featuring long glossaries and dictionaries of ancient languages. I have chosen these pages to test the above scraping code because, conveniently, each glossary word on the glossary pages is enclosed in 'strong' tags in the source code, e.g.:

<strong><span lang='la' class='Latin'>coloni</span></strong>

The desired text here is the Latin glossary word "coloni". Anyway, the above code works fine for UT Austin's Latin Glossary page: it grabbed all the Latin glossary words and put them into a CSV file in a tight, neat column, in the order in which they appear on the webpage, for a total of 1191 Latin words.

https://lrc.la.utexas.edu/eieol_master_gloss/latol/1

But when I attempt to run the same code on UTAustin's "Old Norse Glossary" page it generates the following error.

--------------------------------------------
Traceback (most recent call last):
File "C:/Users/JohnP/PycharmProjects/FirstProgram/main.py", line 39, in <module>
thewriter.writerow([name.get_text()])
File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0301' in position 1: character maps to <undefined>
--------------------------------------------
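The failure can be reproduced in isolation. This is a minimal sketch (independent of the scraping code) showing that U+0301, a combining acute accent used in composed Old Norse forms like á, has no slot in the cp1252 codepage that open() defaults to on Windows:

```python
# U+0301 is a combining acute accent; Old Norse vowels like "á" may be
# represented as a base letter plus this combining mark.
ch = "\u0301"

# cp1252 (the Windows default seen in the traceback) cannot map this
# character, so encoding raises UnicodeEncodeError.
try:
    ch.encode("cp1252")
    print("encoded fine")
except UnicodeEncodeError:
    print("UnicodeEncodeError")   # this branch runs

# UTF-8, by contrast, encodes it without complaint (as two bytes).
print(ch.encode("utf-8"))  # b'\xcc\x81'
```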

NOTE: The script DOES generate the list of Old Norse words in my PyCharm interpreter output window using the immediately following code; it only hangs up when trying to write them to a CSV file with the code above.

Python:
nameList = bsObj.findAll("strong")
for name in nameList:
     print(name.get_text())

Here is the Old Norse glossary page:

https://lrc.la.utexas.edu/eieol_master_gloss/norol/18

And here is the above code but with the Old Norse page URL

Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/norol/18")
bsObj = BeautifulSoup(html, "lxml")

nameList = bsObj.findAll("strong")

import csv
with open('mycsv5.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])

Thank you for any suggestions on how to resolve this problem.
 
My guess is that the Old Norse characters are more than one byte long and that is causing BeautifulSoup to fail in parsing the page. You should inspect the web pages to see how they differ.

Actually, the error shows that the failure happens while writing the data out: Python can't convert the character U+0301 into an equivalent one in the output encoding. This means you will have to select an output encoding that accepts that character, such as UTF-8, rather than Latin-1 or ASCII.

Stack Overflow has a couple of ways to do this:

https://stackoverflow.com/questions/6048085/writing-unicode-text-to-a-text-file#6048203
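A minimal sketch of that approach (the file name and word list here are made up for illustration): passing encoding='utf-8' to open() lets csv.writer emit any Unicode character.

```python
import csv

# Hypothetical sample of Old Norse words containing characters
# outside the Latin-1/cp1252 codepages.
words = ["v\u00e1pn", "\u00feing", "m\u01ebrgum"]

# encoding='utf-8' overrides the platform default (cp1252 on Windows),
# so writerow() no longer raises UnicodeEncodeError.
with open("norse_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for w in words:
        writer.writerow([w])

# Reading back with the same encoding round-trips the text intact.
with open("norse_sample.csv", newline="", encoding="utf-8") as f:
    rows = [row[0] for row in csv.reader(f)]
print(rows == words)  # True
```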
 
All I've got to say is what a gosh-dang mess this character encoding stuff is. The internet is all over the place on this. What I was hoping for was some standard fix for my code.

Anyway thanks Jedishrfu for your reply.

I am still trying to figure this out.
 
Just use UTF-8. Any characters outside of the standard Latin-1 codepage can then be written safely to your file.

UTF-8 is an encoding of Unicode, the codepage to end all code pages. It allows you to write any Unicode character to a file without any null bytes or unprintable characters.

There's a whole body of knowledge on how to get computers to share text in a safe way where no character is left behind :-)
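That property is easy to check. A quick sketch (the sample string is arbitrary, mixing ASCII, accented Latin, Old Norse letters, a runic character, and a character beyond the Basic Multilingual Plane):

```python
s = "coloni \u00e1 \u00fe\u01eb \u16a6 \U0001D518"

b = s.encode("utf-8")

# UTF-8 never produces a null byte except for U+0000 itself...
print(0 in b)                  # False
# ...and the encoding is lossless: decoding gives back the same string.
print(b.decode("utf-8") == s)  # True
```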
 
jedishrfu said:
Just use UTF-8. Any characters outside of the standard Latin-1 codepage are written safely to your file.

Thank you for the reply. Part of the problem is that I am still relatively new to Python and programming in general.

Ok so on this page:

https://docs.python.org/3/library/functions.html

I noticed that one of the arguments to the open() function is encoding=. So I added encoding='utf-8' to my open() call, and it worked. That one argument is the whole fix.

Here is the whole working code for anyone interested. It takes all the bold-type glossary words from UT Austin's Old Norse glossary page, creates a CSV file, and puts all the words into a neat, tight single-word column; the CSV file can then be opened in Excel with all the words in column A.

NOTE: this code has been slightly tweaked from the code in my original post, but it does the same thing. And again, I solved the problem outlined in my original post by adding encoding='utf-8' to the line beginning with open(.

Python:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://lrc.la.utexas.edu/eieol_master_gloss/norol/18"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# Each glossary word on the page is wrapped in <strong> tags.
nameList = soup.findAll("strong")

# encoding='utf-8' is the fix: the platform default (cp1252 on Windows)
# cannot represent characters like the combining accent U+0301.
with open('some.csv', 'w', newline='', encoding='utf-8') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])
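One caveat worth noting (not raised in the thread, so treat it as an aside): some versions of Excel only auto-detect UTF-8 in a CSV when the file starts with a byte-order mark. Python's 'utf-8-sig' codec writes that BOM for you; a sketch with a made-up file name and word list:

```python
import csv

words = ["\u00e1", "\u00fe\u00f3"]  # hypothetical sample rows

# 'utf-8-sig' behaves like 'utf-8' but prepends the BOM b'\xef\xbb\xbf',
# which Excel uses as a hint that the file is UTF-8.
with open("some_excel.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows([w] for w in words)

with open("some_excel.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```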