Can Mathematical Symbols Be Extracted from Scanned Journal Pages?

  • Thread starter: qnach
  • Tags: Journal

Discussion Overview

The discussion revolves around the extraction of mathematical symbols from scanned journal pages, particularly focusing on the challenges and technologies involved in making scanned documents searchable and editable. It touches on the use of optical character recognition (OCR) and the potential for AI to improve recognition accuracy.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • Some participants suggest that optical character recognition (OCR) is a primary method for making scanned pages searchable.
  • Others note that PDF tools can allow for editing of text when OCR fails, highlighting the importance of context in recognizing errors.
  • One participant mentions that while OCR works well, it can still produce mistakes, particularly with mathematical content, which may be harder to detect.
  • Several participants discuss various tools and technologies that can extract text and mathematical symbols from scanned images, including specific software and online resources.
  • There is a suggestion that AI could enhance the recognition process by learning context, although this remains speculative.
  • Concerns are raised about the limitations of OCR, especially with older documents, and the potential need for human intervention in the recognition process.
  • One participant emphasizes the complexity of extracting mathematical symbols, noting that understanding the context may require expertise in the subject matter.

Areas of Agreement / Disagreement

Participants express a range of views on the effectiveness of OCR and the challenges of recognizing mathematical symbols. There is no consensus on the best approach or the reliability of current technologies, indicating ongoing debate and uncertainty.

Contextual Notes

Limitations include the potential for OCR to misinterpret text, especially in older documents, and the dependency on context for accurate recognition of mathematical content. The discussion does not resolve these issues.

qnach
Many old journals have been scanned into PDF files. What I don't understand is how these pages can be markable and searchable.
They should only be images.
 
Optical character recognition (OCR).
 
In addition, PDF tools may allow for edits of the searchable text when OCR fails to recognize some scripts.
 
pbuk said:
Optical character recognition (OCR).
It works quite well these days but I read a lot of books (obviously scanned) on my Kindle and there are one or two mistakes in most of the books. Context usually digs me out of the problem but the Maths in some papers could produce undetectable errors (undetectable by my Maths brain, at least).
 
jedishrfu said:
In addition, PDF tools may allow for edits of the searchable text when OCR fails to recognize some scripts.
That sounds very intelligent. You mean an improved bolt-on for when the context reads as garbage? It's a matter of spotting an error in the first place.
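
To make the "spotting an error" step concrete, here is a minimal sketch in Python that flags words missing from a known vocabulary. The function name `flag_suspect_words` and the tiny vocabulary are illustrative assumptions, not part of any tool mentioned in this thread; a real check would load a full dictionary.

```python
import re

def flag_suspect_words(text, vocabulary):
    """Return (position, word) pairs for alphabetic words not in the vocabulary."""
    suspects = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group(0)
        # Compare case-insensitively so "The" matches "the".
        if word.lower() not in vocabulary:
            suspects.append((match.start(), word))
    return suspects

# Hypothetical vocabulary and OCR output ("m" misread as "rn").
vocabulary = {"the", "modern", "theory", "of", "heat"}
ocr_text = "The rnodern theory of heat"
suspects = flag_suspect_words(ocr_text, vocabulary)
print(suspects)
```

A check like this only works for ordinary prose. A misread subscript or symbol in an equation is still a perfectly valid token, so it would pass unnoticed, which is the worry about Maths raised above.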
 
PDF editors like Adobe Acrobat (https://www.adobe.com/acrobat/) and ABBYY (https://pdf.abbyy.com/) can show the text layer. I'm sure there are tools for spell-checking and maybe even grammar-checking that layer.

There are also tools that can extract the whole text layer into a text file (e.g. https://en.wikipedia.org/wiki/Pdftotext ).
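
As a rough sketch of how that extraction could be scripted, the Python below wraps the `pdftotext` command-line tool. It assumes `pdftotext` is installed and on the PATH; the helper names `build_pdftotext_command` and `extract_text_layer` and the file name in the usage note are made up for illustration.

```python
import shutil
import subprocess

def build_pdftotext_command(pdf_path, first_page=None, last_page=None, keep_layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to preserve the physical page layout
    if first_page is not None:
        command += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        command += ["-l", str(last_page)]  # last page to convert
    command += [pdf_path, "-"]  # "-" as the output file means stdout
    return command

def extract_text_layer(pdf_path, **options):
    """Run pdftotext and return the text layer as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like `extract_text_layer("scan.pdf", first_page=1, last_page=2)` would then return whatever text the OCR step stored for those pages. If the PDF holds only images with no text layer, the result will be empty or nearly so.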

A new technology is trying to extract mathematical symbols from handwriting and from scanned images
(e.g.
https://mathpix.com/
https://photomath.com/en/
https://webdemo.myscript.com/
https://socratic.org/ (from Google)
http://write-math.com/
https://www.i2ocr.com/free-online-math-equation-ocr
https://www.cs.rit.edu/~dprl/software.html
https://www.xthink.com/mathjournal.html (once promising)
https://www.inftyproject.org/en/software.html InftyReader
)

Maybe AI can help learn the appropriate context to improve recognition
https://www.searchonmath.com/
https://approach0.xyz/search/
https://mathdeck.org/

I haven't tried all of these sites.
 
The weak link is in the OCR though. OCR utilities can spot text and non-text but some old papers give OCR a hard time. I should imagine any system might need to ask a human for advice. This could involve a lot of (specialist) man-hours for millions of documents.
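
One way to limit those man-hours, assuming the OCR engine reports a confidence score for each word (many do), is to send only the low-confidence items to a human. The sketch below is in Python; the 0 to 100 scale, the threshold of 80 and the sample data are all assumptions for illustration.

```python
def split_for_review(words, threshold=80.0):
    """Split (text, confidence) pairs into accepted text and items needing review."""
    accepted, needs_review = [], []
    for text, confidence in words:
        if confidence >= threshold:
            accepted.append(text)
        else:
            # Keep the score so a reviewer can see how unsure the engine was.
            needs_review.append((text, confidence))
    return accepted, needs_review

# Hypothetical OCR output: ordinary words score high, the symbol scores low.
sample = [("Theorem", 96.0), ("2.", 91.5), ("∫", 42.0), ("dx", 77.0)]
accepted, needs_review = split_for_review(sample)
review_fraction = len(needs_review) / len(sample)
print(accepted, needs_review, review_fraction)
```

The threshold trades reviewer time against missed errors, and a confident but wrong reading would still slip through.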
 
robphy said:
A new technology is trying to extract mathematical symbols from handwriting and from scanned images
That could be very demanding. It could involve parsing an equation and working out its meaning from the context. Sort of thing that only an expert in the field of the paper could do. But never say never, about computing.
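
Well short of working out the meaning, a program can at least check that a recognized equation is structurally well-formed. The Python sketch below tests whether the brackets in a LaTeX string are balanced; the function name `delimiters_balanced` is an assumption for illustration, and the check says nothing about whether the symbols themselves were read correctly.

```python
# Map each closing delimiter to the opening delimiter it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}

def delimiters_balanced(latex):
    """Return True if (), [] and {} are properly nested in the string."""
    stack = []
    for ch in latex:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # leftover openers mean something was never closed

print(delimiters_balanced(r"\frac{a}{b} + (c + d)"))
print(delimiters_balanced(r"\frac{a}{b + (c"))
```

This simple version would wrongly reject legitimate notation such as the half-open interval [0, 1) or escaped braces, so it is only a first filter for picking out equations that deserve a second look.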
 
