Re-OCR'ing *text* (not scanned) PDFs

  • Thread starter: CRGreathouse

Discussion Overview

The discussion revolves around challenges faced when dealing with PDFs that contain text encoded in a way that makes it difficult to search or copy, particularly in the context of math and science papers. Participants explore various methods and tools for converting or extracting usable text from these PDFs, including OCR techniques and software recommendations.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Homework-related
  • Experimental/applied

Main Points Raised

  • One participant describes encountering unreadable text in PDFs, particularly from LaTeX documents, which hinders their ability to search or copy content.
  • Another participant suggests that dvi-pdf programs create high-quality PDF images and recommends adjusting export settings in Adobe to improve OCR results.
  • There is a mention of using Ghostview as a tool to read and convert PDFs, which may help in manipulating the files.
  • Some participants discuss the possibility of using InftyReader for recognizing scanned images of scientific documents, including math formulas, and converting them into various formats.
  • One participant expresses frustration with the quality of exported images for OCR and the slow process involved in obtaining usable text.
  • Another participant notes that the encoding issues may stem from the way PDFs are compressed, although this is challenged by others who suggest it may not be the case.
  • Several links to external resources and threads are shared, although some participants indicate that these may not directly address their specific issues.
  • Participants discuss the operating systems they are using, which may influence the tools and methods available for solving the problem.

Areas of Agreement / Disagreement

Participants do not reach a consensus on a single solution to the problem, as multiple competing views and methods are presented. The discussion remains unresolved regarding the best approach to handle the encoding issues in PDFs.

Contextual Notes

Participants acknowledge limitations in their current tools and methods, with some expressing uncertainty about the effectiveness of various software solutions and the quality of OCR results. The discussion highlights the dependency on specific software capabilities and the challenges posed by different PDF encodings.

Who May Find This Useful

This discussion may be useful for researchers, students, and professionals in STEM fields who frequently work with PDFs of mathematical and scientific documents and encounter similar text encoding issues.

CRGreathouse
I have a problem that I suspect many others here have had with math and science papers.

I have a number of PDFs that have real text (as opposed to images of text) in the file, but the encoding is different. Sometimes this happens with LaTeX documents, sometimes for other reasons that aren't clear to me. Regardless, this robs me of my ability to search or copy: it's not much use to grab "  !"#  $%#!% #&"  ' ($ %) *"
!" %) !%+" %)  ,- " #&"
" from a file.

Exporting as an image and OCRing is possible, but the quality degrades so severely that it's not particularly usable. (I don't actually know why the quality is so bad with an export, but it is...) You'd think there would be a way to have Acrobat look at the actual shape of the letters and OCR that...

If it helps, I have Acrobat Pro 9 here.
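Files like this can be spotted automatically by checking how much of the copied text consists of letters. What follows is a minimal sketch, not a method anyone in this thread used; the names `letter_ratio` and `looks_garbled` and the 0.5 threshold are assumptions chosen for illustration.

```python
def letter_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0  # nothing usable was copied at all
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Guess whether copied text is encoding garbage (hypothetical heuristic).

    The threshold is an assumption; formula-heavy pages may need a lower one.
    """
    return letter_ratio(text) < threshold


# Part of the garbage pasted in the post above, and an ordinary sentence.
sample = '!"#  $%#!% #&"  \' ($ %) *"'
print(looks_garbled(sample))
print(looks_garbled("I have a number of PDFs that have real text."))
```

Running a check like this over the text copied from each file would at least sort a folder of papers into those that need OCR and those that do not.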
 
A lot of dvi-to-PDF programs generate PDF image files. This guarantees that the font and layout are exactly as TeX wants them, rather than Adobe messing anything up.
There is no reason that the exported image should be poorer quality.
I don't know the export options in Adobe, but try selecting TIFF and a high resolution; you might also need to choose greyscale rather than mono.
To OCR them, there are a few free or trial versions of OCR software around.

If the docs aren't confidential and you aren't in a hurry, you can just put the original PDFs on your website, let Google spider them, and then use Google to search for the OCR'ed text!
 
If anyone's wondering, the garbage I pasted above is the first sentence of an abstract from a paper I was reading. Yes, it does actually come out that bad.

mgb_phys said:
There is no reason that the exported image should be poorer quality.
I don't know the export options in Adobe, but try selecting TIFF and a high resolution; you might also need to choose greyscale rather than mono.

I'll play around with the settings a bit, but the initial results were so bad I was fairly discouraged. The process is slow (export, wait, wait, import, join page-by-page, OCR) and the results were poor at best.
 
Download Ghostview; it can read PDF/PS and convert it to just about anything.
It can also do lots of tricks like splitting and combining files, and it's free!
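Ghostview is a front end for Ghostscript, so the same conversion can be scripted. The sketch below only builds a command line that renders every page of a PDF to a greyscale TIFF at high resolution, along the lines of the earlier export suggestion. The function name `render_command`, the 600 dpi default, and the file names are assumptions for illustration; the executable is typically `gs` on Linux and has a different name on Windows.

```python
from typing import List


def render_command(pdf_path: str, out_pattern: str, dpi: int = 600,
                   executable: str = "gs") -> List[str]:
    """Build a Ghostscript command that renders each page to a greyscale TIFF.

    out_pattern should contain a %d-style field (for example page-%03d.tif)
    so that every page gets its own file.
    """
    return [
        executable,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",   # run unattended, then exit
        "-sDEVICE=tiffgray",                 # greyscale rather than mono
        f"-r{dpi}",                          # resolution in dots per inch
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


cmd = render_command("paper.pdf", "page-%03d.tif")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The resulting TIFFs can then be handed to whichever OCR program is being tried.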

PDF can be compressed, perhaps that is the compressed data you are seeing?
 
mgb_phys said:
PDF can be compressed, perhaps that is the compressed data you are seeing?

I don't think so. I have a thought -- PM me your email address and I'll send you the PDF, then you can tell me what you think. Maybe I'm just being dumb.
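On the compression question: PDF viewers decompress page content before displaying or copying it, so compression alone would not be expected to produce garbage like this. Another possible suspect, offered here as an assumption rather than a diagnosis of this particular file, is a font with a custom encoding and no `/ToUnicode` map, which leaves the viewer with no reliable way to turn character codes back into letters. A rough sketch of a check follows; the function name `font_unicode_report` is hypothetical, and it only sees dictionaries stored uncompressed, so zero counts are inconclusive for files that pack their objects into compressed streams.

```python
import re


def font_unicode_report(pdf_bytes: bytes) -> dict:
    """Count font dictionaries and /ToUnicode entries visible in the raw bytes.

    This is a crude byte scan, not a PDF parser: objects stored inside
    compressed object streams are invisible to it.
    """
    return {
        # \b keeps /FontDescriptor and /FontFile from being counted as fonts
        "fonts": len(re.findall(rb"/Type\s*/Font\b", pdf_bytes)),
        "to_unicode": len(re.findall(rb"/ToUnicode\b", pdf_bytes)),
    }


# Synthetic fragment standing in for a real file (made up for illustration).
fragment = (b"1 0 obj << /Type /Font /Subtype /Type1 /BaseFont /CMR10 >> endobj "
            b"2 0 obj << /Type /FontDescriptor /FontName /CMR10 >> endobj")
print(font_unicode_report(fragment))
# For a real file: font_unicode_report(open("paper.pdf", "rb").read())
```

Fonts visible with no `/ToUnicode` entries at all would be consistent with the symptom, though it would not prove the cause.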
 
technutz said:
Here is a thread on the LaTeX site that might help:
http://www.latex-community.org/viewtopic.php?f=5&t=1124

That's nice, a partial solution for making pdfs from TeX source. I'll keep that in mind. But my problem here is with a pdf for which I do not have LaTeX source.
 
  • #10
technutz said:
There are a few tips and workarounds in this thread:
https://www.physicsforums.com/archive/index.php/t-225791.html

I don't see how that relates to my problem. I don't have LaTeX source for the document -- if I did, I'd just pdfTeX it and be done. Instead I have a math document that was PDF'd by someone else and which I can't search or copy.

I'm interested in a general solution because this comes up not infrequently.

technutz said:
Which operating system are you using? XP, Vista, OS X or Linux variation?

Well this computer is Vista Business, but I could use XP or, in a pinch, Xubuntu or Server 2003.
 
  • #11
The link to the other thread covered the different ways of doing a print screen on the various OSes. That would solve the copy issue: hold down the "Shift" key and press "Print Screen" to capture the Acrobat window with the current page. Then use the built-in "MS Paint" under "Accessories" to crop the image further. Saving the cropped image gives a smaller file. Copying and pasting out of MS Paint defaults to bitmap.

As for searching, I have reviewed the Adobe site as well as several university sites and come up with nothing, as the encoding in the PDF seems to differ from one LaTeX document to another. It would seem that a simple TrueType font (which is searchable) would resolve the issue, but that does not appear to be an option. Adobe was no help at all on LaTeX. Hopefully the print-screen approach will help you get by.

I will continue to look and keep an eye out for a better solution.
 
  • #12
CRGreathouse said:
I don't see how that relates to my problem. I don't have LaTeX source for the document -- if I did, I'd just pdfTeX it and be done. Instead I have a math document that was PDF'd by someone else and which I can't search or copy.

I'm interested in a general solution because this comes up not infrequently.

Did you try InftyReader linked above?
It claims
"InftyReader recognizes scanned images of printed scientific documents including Math formulae, and outputs the recognition results in various formats:
XML format for InftyEditor, LaTeX, MathML, Human-Readable TeX for the blinds, etc."
So, if it works, you could get a LaTeX-file back.

While the process of making quality images for OCR might be time-consuming and tedious, one might be able to write a set of scripts to process them (say, running in the background or overnight).
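A background or overnight run could be organized roughly as below. This is a sketch under stated assumptions: `pending_pdfs` and `run_batch` are hypothetical names, and `process` stands for whatever render-and-OCR step ends up being used (it takes a PDF path and returns the recognized text). Files that already have a `.txt` result are skipped, so an interrupted run can simply be restarted.

```python
from pathlib import Path
from typing import Callable, List


def pending_pdfs(folder: Path) -> List[Path]:
    """List PDFs in folder that do not yet have a same-named .txt result."""
    return sorted(p for p in folder.glob("*.pdf")
                  if not p.with_suffix(".txt").exists())


def run_batch(folder: Path, process: Callable[[Path], str]) -> List[Path]:
    """Run process on every pending PDF and save its text next to the PDF.

    Returns the PDFs that were processed successfully.
    """
    done = []
    for pdf in pending_pdfs(folder):
        try:
            text = process(pdf)
        except Exception as exc:
            # Keep going so that one bad file does not stop the whole run.
            print(f"skipped {pdf.name}: {exc}")
            continue
        pdf.with_suffix(".txt").write_text(text, encoding="utf-8")
        done.append(pdf)
    return done


# Example call, with a placeholder in place of a real OCR step:
# run_batch(Path("papers"), lambda pdf: "recognized text goes here")
```

Keeping the OCR step as a plain function argument means the same loop works whichever OCR program turns out to give usable results.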
 
  • #13
robphy said:
Did you try InftyReader linked above?

I'm looking into it. It looks really nice -- if it works, I'll probably buy it (though I don't really need it). Someone has to support fun toys!
 
