Re-OCR'ing *text* (not scanned) PDFs

  • Thread starter CRGreathouse
In summary, a PDF made from TeX source can use a different text encoding than PDFs made from other sources, which makes the content hard to search or copy. One proposed workaround is to export the pages as images and OCR them; according to one reply, there is no reason the exported image should be of poor quality.
  • #1
CRGreathouse
I have a problem that I suspect many others here have had with math and science papers.

I have a number of PDFs that have real text (as opposed to images of text) in the file, but the encoding is different. Sometimes this happens with LaTeX documents, sometimes for other reasons that aren't clear to me. Regardless, this robs me of my ability to search or copy: it's not much use to grab "  !"#  $%#!% #&"  ' ($ %) *"
!" %) !%+" %)  ,- " #&"
" from a file.

Exporting as an image and OCRing is possible, but the quality degrades so severely that it's not particularly usable. (I don't actually know why the quality is so bad with an export, but it is...) You'd think there would be a way to have Acrobat look at the actual shape of the letters and OCR that...

If it helps, I have Acrobat Pro 9 here.
 
  • #2
A lot of DVI-to-PDF programs generate PDF image files. This guarantees that the font and layout are exactly as TeX wants them, rather than Adobe messing anything up.
There is no reason that the exported image should be poorer quality.
I don't know the export options in Adobe, but try selecting TIFF and a high resolution; you might also need to choose greyscale rather than mono.
To OCR them, there are a few free or trial versions of OCR software around.

If the docs aren't confidential and you aren't in a hurry, you can just put the original PDFs on your website, let Google spider them, and then use Google to search for the OCR'ed text!
 
  • #3
If anyone's wondering, the garbage I pasted above is the first sentence of an abstract from a paper I was reading. Yes, it does actually come out that bad.
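
As a rough way to flag PDFs whose text layer comes out like this, one could check what fraction of the extracted characters are letters or digits. The sketch below is only a heuristic: the function name `looks_garbled` and the 0.6 threshold are assumptions made for illustration, and the text is assumed to have been extracted already by some other tool.

```python
def looks_garbled(text: str, min_alnum_ratio: float = 0.6) -> bool:
    """Return True if too few of the non-space characters are letters or digits.

    Hypothetical helper: the threshold is a guess, not a tuned value.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing extractable is treated as unusable
    alnum = sum(ch.isalnum() for ch in visible)
    return alnum / len(visible) < min_alnum_ratio


# A fragment of the garbage quoted in the first post.
garbage = "!\"#  $%#!% #&\"  ' ($ %) *\""
readable = "We prove a bound on the number of primes."

print(looks_garbled(garbage))
print(looks_garbled(readable))
```

A check like this would misfire on pages that are mostly formulae or tables, so it is better suited to sorting a pile of papers into "probably fine" and "look at this one" than to making a final decision.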

mgb_phys said:
There is no reason that the exported image should be poorer quality.
I don't know the export options in Adobe, but try selecting TIFF and a high resolution; you might also need to choose greyscale rather than mono.

I'll play around with the settings a bit, but the initial results were so bad I was fairly discouraged. The process is slow (export, wait, wait, import, join page-by-page, OCR) and the results were poor at best.
 
  • #4
Download Ghostview; it can read PDF/PS and convert it to just about anything.
It can also do lots of tricks like splitting and combining files, and it's free!
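
Ghostview is a viewer front end for Ghostscript, which does the actual conversion. A minimal sketch of rendering a PDF to high-resolution greyscale TIFF pages for OCR might look like the following. It assumes the Ghostscript console executable is on the PATH under the name `gs` (on Windows it is usually named differently, for example `gswin32c`); the helper names, the file name `paper.pdf`, and the 600 dpi default are illustrative choices, not settings taken from this thread.

```python
import shutil
import subprocess


def ghostscript_render_command(pdf_path, out_pattern="page-%03d.tif",
                               dpi=600, gs_exe="gs"):
    """Build a Ghostscript command that writes one greyscale TIFF per page."""
    return [
        gs_exe,
        "-dSAFER", "-dBATCH", "-dNOPAUSE",  # non-interactive batch run
        "-sDEVICE=tiffgray",                # 8-bit greyscale TIFF output
        f"-r{dpi}",                         # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",      # %03d expands to the page number
        pdf_path,
    ]


def render_pages(pdf_path, **kwargs):
    """Run the command; raises if Ghostscript is missing or reports an error."""
    cmd = ghostscript_render_command(pdf_path, **kwargs)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


# Only print the command here; call render_pages("paper.pdf") to execute it.
cmd = ghostscript_render_command("paper.pdf")
print(" ".join(cmd))
```

Rendering straight from the PDF avoids the export, import, and page-joining steps, and the resolution can be raised if the OCR results are poor.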

PDFs can be compressed; perhaps that is the compressed data you are seeing?
 
  • #5
mgb_phys said:
PDFs can be compressed; perhaps that is the compressed data you are seeing?

I don't think so. I have a thought -- PM me your email address and I'll send you the PDF, then you can tell me what you think. Maybe I'm just being dumb.
 
  • #7
technutz said:
Here is a thread on the LaTeX site that might help:
http://www.latex-community.org/viewtopic.php?f=5&t=1124

That's nice, a partial solution for making pdfs from TeX source. I'll keep that in mind. But my problem here is with a pdf for which I do not have LaTeX source.
 
  • #10
technutz said:
There are a few helpful tips and workarounds in this thread:
https://www.physicsforums.com/archive/index.php/t-225791.html

I don't see how that relates to my problem. I don't have LaTeX source for the document -- if I did, I'd just pdfTeX it and be done. Instead I have a math document that was PDF'd by someone else and which I can't search or copy.

I'm interested in a general solution because this comes up not infrequently.

technutz said:
Which operating system are you using? XP, Vista, OS X or Linux variation?

Well this computer is Vista Business, but I could use XP or, in a pinch, Xubuntu or Server 2003.
 
  • #11
The link to the other thread described the different ways of doing a print screen on the various OSes. That would solve the copy issue: hold down the "Shift" key and press "Print Screen" to capture the Acrobat window with the current page. Use the built-in MS Paint (under "Accessories") to crop the image further; saving the cropped image gives a smaller file. Copying and pasting out of MS Paint defaults to Bitmap format.

As for searching, I have reviewed the Adobe site as well as several university sites and come up with nothing, as the encoding in the PDF seems to differ from one LaTeX setup to another. It would seem that a simple TrueType font (which is searchable) would resolve the issue, but that does not appear to be an option. Adobe was no help at all on LaTeX. Hopefully Print Screen will help you get by.

I will continue to look and keep an eye out for a better solution.
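
To illustrate why the encoding matters, here is a toy model, not a PDF parser. It assumes a subset font that numbers its glyphs in order of first use, which is one way a file can end up with codes that mean nothing outside that font. In a real PDF, the table that maps codes back to characters would typically be the font's ToUnicode map; the function names and the starting code 0x21 are made up for this example.

```python
def subset_encode(text, first_code=0x21):
    """Assign a code to each distinct character in order of first use."""
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = first_code + len(code_for)
        codes.append(code_for[ch])
    # The reverse table plays the role of a code-to-Unicode mapping.
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode


def copy_text(codes, to_unicode=None):
    """Simulate copy/paste: without a mapping, the raw codes leak through."""
    if to_unicode is None:
        return "".join(chr(code) for code in codes)
    return "".join(to_unicode[code] for code in codes)


codes, table = subset_encode("the theorem")
print(copy_text(codes))         # no mapping: punctuation soup
print(copy_text(codes, table))  # with mapping: the original text
```

The page still renders correctly in both cases, because the viewer draws glyphs by code; only searching and copying depend on the mapping. That is why OCR on the rendered page can recover text that the file itself cannot supply.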
 
  • #12
CRGreathouse said:
I don't see how that relates to my problem. I don't have LaTeX source for the document -- if I did, I'd just pdfTeX it and be done. Instead I have a math document that was PDF'd by someone else and which I can't search or copy.

I'm interested in a general solution because this comes up not infrequently.

Did you try InftyReader linked above?
It claims
"InftyReader recognizes scanned images of printed scientific documents including Math formulae, and outputs the recognition results in various formats:
XML format for InftyEditor, LaTeX, MathML, Human-Readable TeX for the blinds, etc."
So, if it works, you could get a LaTeX-file back.

While the process of making quality images for OCR might be time-consuming and tedious, one might be able to write a set of scripts to process them (say, running in the background or overnight).
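
A minimal sketch of such a script follows, assuming Ghostscript (`gs`) for rendering and the Tesseract command-line tool for OCR. Tesseract is not mentioned in this thread; it stands in here for whatever OCR program is actually used. The function names and the `-ocr` output suffix are illustrative, and the sketch relies on Ghostscript writing a multi-page TIFF when the output name has no page-number pattern.

```python
import subprocess
import tempfile
from pathlib import Path


def plan_ocr_jobs(folder, dpi=600):
    """Return a (render_cmd, ocr_cmd) pair for every PDF in the folder."""
    jobs = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        tif = pdf.with_suffix(".tif")
        out_base = pdf.with_name(pdf.stem + "-ocr")
        render = ["gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
                  "-sDEVICE=tiffgray", f"-r{dpi}",
                  f"-sOutputFile={tif}", str(pdf)]
        # The trailing "pdf" asks Tesseract for a searchable PDF (out_base.pdf).
        ocr = ["tesseract", str(tif), str(out_base), "pdf"]
        jobs.append((render, ocr))
    return jobs


def run_jobs(jobs, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for render, ocr in jobs:
        for cmd in (render, ocr):
            print(" ".join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)


# Demonstration on empty placeholder files; nothing is executed in a dry run.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.pdf").touch()
    (Path(tmp) / "b.pdf").touch()
    demo_jobs = plan_ocr_jobs(tmp)
    run_jobs(demo_jobs, dry_run=True)
```

Calling `run_jobs(plan_ocr_jobs("papers"), dry_run=False)` on a real folder could then be left running overnight; keeping the planning step separate makes it easy to inspect the commands first.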
 
  • #13
robphy said:
Did you try InftyReader linked above?

I'm looking into it. It looks really nice -- if it works, I'll probably buy it (though I don't really need it). Someone has to support fun toys!
 

1. What is "re-OCR'ing" and why is it necessary for *text* PDFs?

"Re-OCR'ing" here means running optical character recognition (OCR) on a PDF that already contains a text layer, rather than on a scan. It is useful when that text layer is encoded in a way that makes searching or copying return garbage, even though the pages display correctly.

2. How does "re-OCR'ing" differ from regular OCR?

Regular OCR is typically applied to scanned images of physical documents, such as books or printed papers. Re-OCR'ing starts from a digitally created PDF: the pages are first exported or rendered as images, and OCR is then run on those images to obtain usable text.

3. What are the benefits of "re-OCR'ing" *text* PDFs?

The main benefit is that the content of the document becomes searchable and copyable. This can save time and effort when looking for specific information within a document or when quoting from it.

4. Is "re-OCR'ing" always accurate?

No, it is not always 100% accurate. The accuracy depends on the quality and resolution of the exported page images and on the capabilities of the OCR software; mathematical formulae are especially difficult. In some cases, manual corrections may be necessary.

5. Can any OCR software be used for "re-OCR'ing" *text* PDFs?

In principle, any OCR software can be used once the pages are available as images, but results vary, and general-purpose tools may handle mathematical notation poorly. Some popular OCR software options include Adobe Acrobat, ABBYY FineReader, and Nuance OmniPage; InftyReader, mentioned in the thread, is aimed at scientific documents containing formulae.
