Can Mathematical Symbols Be Extracted from Scanned Journal Pages?

  • Thread starter: qnach
  • Tags: Journal
AI Thread Summary
Many ancient journals have been digitized into PDF files, primarily as images, which raises the question of how to make these documents searchable. Optical character recognition (OCR) technology is essential for converting images into searchable text, though it often encounters challenges, especially with complex scripts or mathematical content. While OCR has improved significantly, errors can still occur, particularly in mathematical papers where inaccuracies may go unnoticed. PDF editing tools like Adobe Acrobat and ABBYY can help refine the text layer and correct OCR mistakes. Additionally, there are tools available that can extract text from PDFs and even specialized technologies aimed at recognizing mathematical symbols from handwritten or scanned documents. The effectiveness of these technologies may depend on context, and there is potential for AI to enhance recognition accuracy. However, the inherent difficulties in processing older documents suggest that human expertise may still be necessary to ensure accuracy, particularly for complex content.
qnach
TL;DR Summary
Many ancient journals were scanned into PDF files. What I don't know is how these pages can be markable and searchable.

They should only be images.
 
Optical character recognition (OCR).
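
As a rough sketch of what that looks like in practice: one open-source route (an assumption here, nobody in the thread names it) is the OCRmyPDF command-line tool, which runs the Tesseract OCR engine over each page image and writes a copy of the PDF with an invisible text layer behind the scan. That hidden layer is what makes the page searchable and markable. The helper names and file names below are made up for illustration.

```python
import shutil
import subprocess


def build_ocr_command(src_pdf, dst_pdf, language="eng"):
    """Build an OCRmyPDF command line that adds a text layer to a scanned PDF."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language pack to use
        "--skip-text",           # leave pages that already have text untouched
        src_pdf,
        dst_pdf,
    ]


def make_searchable(src_pdf, dst_pdf, language="eng"):
    """Run OCRmyPDF if it is installed; return False if the tool is missing."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not on PATH, nothing was done
    subprocess.run(build_ocr_command(src_pdf, dst_pdf, language), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the command shape.
    print(" ".join(build_ocr_command("scan.pdf", "scan_searchable.pdf")))
```

Commercial packages do the same job behind a menu item; the point is only that the searchable text is a separate layer added after scanning.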
 
In addition, PDF tools may allow for edits of the searchable text when OCR fails to recognize some scripts.
 
pbuk said:
Optical character recognition (OCR).
It works quite well these days but I read a lot of books (obviously scanned) on my Kindle and there are one or two mistakes in most of the books. Context usually digs me out of the problem but the Maths in some papers could produce undetectable errors (undetectable by my Maths brain, at least).
 
jedishrfu said:
In addition, PDF tools may allow for edits of the searchable text when OCR fails to recognize some scripts.
That sounds very intelligent. Do you mean an improved bolt-on that steps in when the context reads as garbage? It's a matter of spotting an error in the first place.
 
PDF editors like Adobe Acrobat (https://www.adobe.com/acrobat/) and ABBYY (https://pdf.abbyy.com/) can show the text layer. I'm sure there are tools for spell-checking and maybe even grammar-checking that layer.
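
To give a feel for what spell-checking the text layer could mean, here is a minimal sketch, not how Acrobat or ABBYY actually do it. It flags tokens that are either missing from a word list or mix letters with digits (the classic `l`/`1` and `O`/`0` confusions). The tiny vocabulary and the function name are placeholders; a real check would load a full dictionary.

```python
import re

# Tiny stand-in vocabulary; a real check would load a full word list.
KNOWN_WORDS = {"the", "integral", "of", "function", "is", "bounded", "a", "on"}


def suspicious_tokens(text, vocabulary=KNOWN_WORDS):
    """Return tokens that look like OCR slips: unknown words or digit/letter mixes."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if token.isdigit():
            continue  # plain numbers are fine
        has_digit = any(ch.isdigit() for ch in token)
        if has_digit or token.lower() not in vocabulary:
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    print(suspicious_tokens("The integra1 of the functi0n is bounded"))
    # prints ['integra1', 'functi0n']
```

This works for running prose because the surrounding words supply a dictionary to check against. In a formula, something like `x2` would be flagged whether or not it is a genuine misread, which is why the maths is the harder case.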

There are also tools that can extract the whole text layer into a text file (e.g. https://en.wikipedia.org/wiki/Pdftotext ).
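
A small sketch of driving pdftotext from a script, assuming the Poppler or Xpdf build of the tool is installed and on the PATH. The `-layout`, `-f` and `-l` options and the use of `-` for standard output are standard pdftotext options; the function names and the idea of using empty output as a test for an image-only PDF are illustrative assumptions.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build a pdftotext command that writes the text layer to stdout."""
    cmd = ["pdftotext", "-layout"]  # -layout keeps the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    cmd += [pdf_path, "-"]  # "-" as the output file means stdout
    return cmd


def has_text_layer(pdf_path):
    """Return True/False if pdftotext ran, or None if it is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True, text=True, check=True,
    )
    # An image-only scan yields nothing but whitespace and page breaks.
    return bool(result.stdout.strip())
```

Running `has_text_layer` on a scanned journal is a quick way to tell whether the file is "only images" or has already been through OCR.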

A new technology is trying to extract mathematical symbols from handwriting and from scanned images
(e.g.
https://mathpix.com/
https://photomath.com/en/
https://webdemo.myscript.com/
https://socratic.org/ (from Google)
http://write-math.com/
https://www.i2ocr.com/free-online-math-equation-ocr
https://www.cs.rit.edu/~dprl/software.html
https://www.xthink.com/mathjournal.html (once promising)
https://www.inftyproject.org/en/software.html InftyReader
)

Maybe AI can help learn the appropriate context to improve recognition
https://www.searchonmath.com/
https://approach0.xyz/search/
https://mathdeck.org/

I haven't tried all of these sites.
 
The weak link is in the OCR though. OCR utilities can spot text and non-text but some old papers give OCR a hard time. I should imagine any system might need to ask a human for advice. This could involve a lot of (specialist) man-hours for millions of documents.
 
robphy said:
A new technology is trying to extract mathematical symbols from handwriting and from scanned images
That could be very demanding. It could involve parsing an equation and working out its meaning from the context. Sort of thing that only an expert in the field of the paper could do. But never say never, about computing.
 