- #1
deltapapazulu
I have posted the Python code up front. See my question/inquiry below it.
Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

# Fetch and parse the Latin glossary page
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/latol/1")
bsObj = BeautifulSoup(html, "lxml")
# Each glossary word is wrapped in a <strong> tag
nameList = bsObj.findAll("strong")

# Write one glossary word per row
with open('mycsv4.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        thewriter.writerow([name.get_text()])
Hi! I am using the Python package BeautifulSoup to learn web scraping/crawling. The above code works for grabbing Latin words but not Old Norse words.
For example, UT Austin has a languages page featuring long glossaries and dictionaries of ancient languages. I chose these pages to test the above scraping code because, on the glossary pages, each glossary word is conveniently enclosed in 'strong' tags in the page source, e.g.:
<strong><span lang='la' class='Latin'>coloni</span></strong>
The desired text here is the Latin glossary word "coloni". The above code works fine for UT Austin's Latin glossary page: it wrote all the Latin glossary words to a CSV file in a single neat column, in the order in which they appear on the webpage, for a total of 1191 Latin words.
https://lrc.la.utexas.edu/eieol_master_gloss/latol/1
But when I run the same code on UT Austin's "Old Norse Glossary" page, it raises the following error:
--------------------------------------------
Traceback (most recent call last):
  File "C:/Users/JohnP/PycharmProjects/FirstProgram/main.py", line 39, in <module>
    thewriter.writerow([name.get_text()])
  File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0301' in position 1: character maps to <undefined>
--------------------------------------------
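The error itself can probably be reproduced without any scraping. The following is only a minimal sketch: it assumes the CSV file is being opened with the cp1252 codec (which the `cp1252.py` frame in the traceback suggests), and the sample string `word` and the helper `can_encode` are made up for illustration.

```python
# U+0301 is COMBINING ACUTE ACCENT, the character named in the traceback.
# Hypothetical sample: the letter 'a' followed by a combining acute accent.
word = "a\u0301"


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 has no slot for the combining accent; UTF-8 covers all of Unicode.
cp1252_ok = can_encode(word, "cp1252")
utf8_ok = can_encode(word, "utf-8")
print("cp1252:", cp1252_ok, "utf-8:", utf8_ok)
```

If this is what is happening, the problem is in how the file is encoded on write, not in BeautifulSoup.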
NOTE: The script DOES work for printing the list of Old Norse words to my PyCharm output window using the code immediately below. It only fails when writing the words to a CSV file with csv.writer.
Python:
nameList = bsObj.findAll("strong")
for name in nameList:
    print(name.get_text())
Here is the Old Norse glossary page:
https://lrc.la.utexas.edu/eieol_master_gloss/norol/18
And here is the same code as at the top of this post, but with the Old Norse page URL:
Python:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

# Fetch and parse the Old Norse glossary page
html = urlopen("https://lrc.la.utexas.edu/eieol_master_gloss/norol/18")
bsObj = BeautifulSoup(html, "lxml")
nameList = bsObj.findAll("strong")

with open('mycsv5.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        # The UnicodeEncodeError in the traceback is raised on this line
        thewriter.writerow([name.get_text()])
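One possible direction, offered as a sketch rather than a confirmed fix: pass an explicit `encoding="utf-8"` to `open()` so the file does not fall back to the Windows default codec. The example below avoids the network and uses a hypothetical `words` list in place of the scraped `nameList`; the temporary directory is also just for illustration.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the scraped glossary words; the first entry
# contains U+0301, the character that cp1252 could not encode.
words = ["a\u0301", "coloni"]

# Write to a temporary directory so the sketch leaves no files behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "mycsv5.csv")

# The explicit encoding is the only change from the original write loop.
with open(path, "w", newline="", encoding="utf-8") as f:
    thewriter = csv.writer(f)
    for word in words:
        thewriter.writerow([word])

# Read the file back with the same encoding to confirm a clean round trip.
with open(path, newline="", encoding="utf-8") as f:
    rows = [row[0] for row in csv.reader(f)]

print(rows == words)
```

If the CSV is meant to be opened in Excel, `encoding="utf-8-sig"` may be worth trying instead, since it writes a byte order mark that helps Excel detect UTF-8.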
Thank you for any suggestions on how to resolve this problem.