Web Scraping Py Script Generates Error for Old Norse Chars

  • Thread starter deltapapazulu
  • Tags: Error, Web
  • #1
deltapapazulu
I have posted the Python code up front. See my question/inquiry below it.

Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/latol/1")
bsObj = BeautifulSoup(html)

nameList = bsObj.findAll("strong")

with open('mycsv4.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])

Hi! I am using the Python package BeautifulSoup to learn web scraping/crawling and the above code works for grabbing Latin characters but not Old Norse characters.

Example: UT Austin has a languages site featuring long glossaries and dictionaries of ancient languages. I chose these pages to test the above scraping code because the glossary pages conveniently have each glossary word enclosed in 'strong' tags in the source code, e.g.:

<strong><span lang='la' class='Latin'>coloni</span></strong>

The desired text here is the Latin glossary word "coloni". Anyway, the above code works fine for UT Austin's Latin Glossary page: it grabbed all the Latin glossary words and put them into a CSV file in a tight, neat column, in the order in which they appear on the webpage, for a total of 1191 Latin words.

https://lrc.la.utexas.edu/eieol_master_gloss/latol/1

But when I attempt to run the same code on UTAustin's "Old Norse Glossary" page it generates the following error.

--------------------------------------------
Traceback (most recent call last):
File "C:/Users/JohnP/PycharmProjects/FirstProgram/main.py", line 39, in <module>
thewriter.writerow([name.get_text()])
File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0301' in position 1: character maps to <undefined>

--------------------------------------------

NOTE: The script DOES work for generating the list of Old Norse words in my PyCharm interpreter output window using the immediately following code; it only hangs up when trying to write them to a CSV file using the code above.

Python:
nameList = bsObj.findAll("strong")
for name in nameList:
    print(name.get_text())

Here is the Old Norse glossary page:

https://lrc.la.utexas.edu/eieol_master_gloss/norol/18

And here is the same code, but with the Old Norse page URL:

Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/norol/18")
bsObj = BeautifulSoup(html, "lxml")

nameList = bsObj.findAll("strong")

with open('mycsv5.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])

Thank you for any suggestions on how to resolve this problem.
 
  • #2
My guess is that the Old Norse characters are more than one byte long and that is causing BeautifulSoup to fail in parsing the page. You should inspect the web pages to see how they are different.

Actually, the error shows that it fails while writing the data out: it can't convert the U+0301 character into anything the output encoding can represent. This means you will have to select an encoding for the output that can handle U+0301, such as UTF-8, rather than Latin-1 or ASCII.
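You can reproduce the failure in isolation. A quick sketch (assuming Python 3; the character and codec names come straight from your traceback):

Python:
# U+0301 is the combining acute accent from the traceback; "a" + U+0301 renders as "á".
word = "a\u0301"

print(word.encode("utf-8"))    # b'a\xcc\x81' -- UTF-8 can represent it
word.encode("cp1252")          # raises UnicodeEncodeError: 'charmap' codec can't encode ...

That would also explain why printing in the PyCharm console works: the console presumably writes UTF-8, while open() defaults to your Windows locale encoding (cp1252 here) unless you tell it otherwise.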

Stack Overflow has a couple of ways to do this:

https://stackoverflow.com/questions/6048085/writing-unicode-text-to-a-text-file#6048203
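
Applied to your script, the key change is just the encoding argument when you open the output file (a sketch, assuming the same nameList your code already builds):

Python:
import csv

# Opening the file with an explicit UTF-8 encoding instead of the cp1252 default
# lets characters like U+0301 be written out.
with open('mycsv5.csv', 'w', newline='', encoding='utf-8') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])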
 
  • #3
All I've got to say is: what a gosh-dang mess this character encoding crap is. The internet is all over the place on this. What I was hoping for was some standard fix for my code.

Anyway thanks Jedishrfu for your reply.

I am still trying to figure this out.
 
  • #4
Just use UTF-8. Any characters outside of the standard Latin-1 codepage are written safely to your file.

UTF-8 is an encoding of Unicode, aka the code page to end all code pages. UTF-8 allows you to write any Unicode character to a file without any null bytes or unprintable characters.

There’s a whole body of knowledge of how to get computers to share text in a safe way where no character is left behind :-)
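
To make that concrete, here is a tiny round-trip sketch (the example words are just illustrative):

Python:
# Any Unicode text round-trips through a UTF-8 file unchanged.
text = "Óðinn Þórr ǫnd"   # assorted Old Norse-style characters, for illustration only

with open("demo.txt", "w", encoding="utf-8") as f:
    f.write(text)

with open("demo.txt", encoding="utf-8") as f:
    print(f.read() == text)    # True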
 
  • #5
jedishrfu said:
Just use UTF-8. Any characters outside of the standard Latin-1 codepage are written safely to your file.

UTF-8 is an encoding of Unicode, aka the code page to end all code pages. UTF-8 allows you to write any Unicode character to a file without any null bytes or unprintable characters.

There’s a whole body of knowledge of how to get computers to share text in a safe way where no character is left behind :-)

Thank you for the reply. Part of the problem is that I am still relatively new to Python and programming in general.

Ok so on this page:

https://docs.python.org/3/library/functions.html

I noticed that one of the arguments to the open() function is encoding=. So I put encoding='utf-8' in my code and it worked. Here is the fix: I added that argument to the code below.

Here is the whole working code for anyone interested. It takes all the bold glossary words from UT Austin's Old Norse glossary page, creates a CSV file, and puts all the words into a neat, tight single-word column; the CSV file can then be opened in Excel with all the words in column A.

NOTE: this code has been slightly tweaked from the code in my original post, but it does the same thing. And again, I solved the problem outlined in my original post by adding encoding='utf-8' to the line beginning with open(.

Python:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://lrc.la.utexas.edu/eieol_master_gloss/norol/18"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
nameList = soup.findAll("strong")

# encoding='utf-8' is the fix: it lets characters such as U+0301 be written to the file.
with open('some.csv', 'w', newline='', encoding='utf-8') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])
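
(Side note, in case Excel shows the accented characters as garbage when the CSV is double-clicked: Excel tends to assume a legacy code page unless the file starts with a byte-order mark. Swapping in the 'utf-8-sig' codec, which writes that mark, is a common workaround; everything else stays the same.)

Python:
# 'utf-8-sig' prepends a UTF-8 byte-order mark so Excel detects the encoding correctly.
with open('some.csv', 'w', newline='', encoding='utf-8-sig') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])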
 

1. What is web scraping?

Web scraping is the process of extracting data from websites using automated software or tools. It involves accessing the HTML code of a webpage and extracting specific information from it.

2. How does web scraping work?

Web scraping works by using a web scraping tool or software to access the HTML code of a webpage. The tool then parses through the code and extracts the desired information, which can be saved in a structured format for further analysis.

3. Why is the Py script generating an error for Old Norse characters?

The script generates an error because the output file is opened with the system's default encoding (cp1252 on Windows), which cannot represent characters used in Old Norse, such as the combining acute accent U+0301. The scraping itself succeeds; the failure happens when the text is written to the CSV file.

4. How can I fix the error for Old Norse characters in the Py script?

To fix the error, open the output file with an encoding that can represent these characters, for example by passing encoding='utf-8' to Python's open() function, as shown in the thread above. Alternatively, make sure whatever tool writes the output uses UTF-8 by default.

5. Is web scraping legal?

The legality of web scraping depends on the purpose and method of scraping. Generally, it is legal to scrape publicly available data for personal use. However, scraping data for commercial use or without the website's permission may be considered illegal. It is important to check the website's terms of service and consult a legal professional before engaging in web scraping.
