Re-OCR'ing text (not scanned) PDFs

CRGreathouse · Dec 30, 2008

I have a problem that I suspect many others here have had with math and science papers.

I have a number of PDFs that have real text (as opposed to images of text) in the file, but the encoding is different. Sometimes this happens with LaTeX documents, sometimes for other reasons that aren't clear to me. Regardless, this robs me of my ability to search or copy: it's not much use to grab " !"# $%#!% #&" ' ($ %) *"
!" %) !%+" %) ,- " #&"
" from a file.

Exporting as an image and OCRing is possible, but the quality degrades so severely that it's not particularly usable. (I don't actually know why the quality is so bad with an export, but it is...) You'd think there would be a way to have Acrobat look at the actual shape of the letters and OCR that...

If it helps, I have Acrobat Pro 9 here.
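For a pile of PDFs, one rough way to flag the affected files is to check how much of the extracted text is letters or digits. This is only a sketch: the function name `looks_garbled` and the 0.5 threshold are made up for illustration, and a formula-heavy page could trip it even when the encoding is fine.

```python
def looks_garbled(text, min_alnum_ratio=0.5):
    """Heuristic: True if fewer than min_alnum_ratio of the visible
    characters are letters or digits (threshold is an arbitrary guess)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Nothing extracted at all: no evidence either way.
        return False
    alnum = sum(ch.isalnum() for ch in visible)
    return alnum / len(visible) < min_alnum_ratio


# Start of the copied "abstract" from the post above.
sample = ' !"# $%#!% #&" \' ($ %) *'
print(looks_garbled(sample))
print(looks_garbled("I have a number of PDFs that have real text"))
```

The text passed in would come from whatever copy or export step is available; the check itself does not read PDFs.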

mgb_phys · Dec 30, 2008

A lot of dvi-pdf programs generate PDF image files. This guarantees that the font and layout are exactly as TeX wants them, rather than Adobe messing anything up.
There is no reason that the exported image should be poorer quality.
I don't know the export options in Adobe, but try selecting TIFF and a high resolution; you might also need to choose greyscale rather than mono.
To OCR them, there are a few free or trial versions of OCR software around.

If the docs aren't confidential and you aren't in a hurry, you can just put the original PDFs on your website, let Google spider them, and then use Google to search for the OCR'ed text!

CRGreathouse · Dec 30, 2008

If anyone's wondering, the garbage I pasted above is the first sentence of an abstract from a paper I was reading. Yes, it does actually come out that bad.

mgb_phys said:

There is no reason that the exported image should be poorer quality.
I don't know the export options in Adobe, but try selecting TIFF and a high resolution; you might also need to choose greyscale rather than mono.

I'll play around with the settings a bit, but the initial results were so bad I was fairly discouraged. The process is slow (export, wait, wait, import, join page-by-page, OCR) and the results were poor at best.

mgb_phys · Dec 30, 2008

Download Ghostview; it can read PDF/PS and convert it to just about anything.
It can also do lots of tricks like splitting and combining files, and it's free!

PDFs can be compressed; perhaps that is the compressed data you are seeing?
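As a sketch of the conversion step, assuming the Ghostscript command-line program that Ghostview sits on top of: the helper below only builds the command that renders each page to a greyscale TIFF at a chosen resolution. The helper name, the 300 dpi default and the file names are illustrative assumptions; the executable is usually `gs` on Linux and something like `gswin32c` on Windows.

```python
def ghostscript_render_command(pdf_path, out_pattern, dpi=300, gs="gs"):
    """Build a Ghostscript command rendering every page of pdf_path to a
    greyscale TIFF. out_pattern should contain %03d (or similar) so each
    page gets its own file. Names and defaults here are illustrative."""
    return [
        gs,
        "-dNOPAUSE",                    # do not stop between pages
        "-dBATCH",                      # exit when the file is done
        "-dSAFER",                      # restrict file access
        "-sDEVICE=tiffgray",            # 8-bit greyscale TIFF output
        f"-r{dpi}",                     # resolution in dots per inch
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


# Hypothetical file names, shown only to illustrate the call.
cmd = ghostscript_render_command("paper.pdf", "paper-%03d.tif")
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would do the actual rendering, provided Ghostscript is installed and on the path.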

CRGreathouse · Dec 30, 2008

mgb_phys said:

PDFs can be compressed; perhaps that is the compressed data you are seeing?

I don't think so. I have a thought -- PM me your email address and I'll send you the PDF, then you can tell me what you think. Maybe I'm just being dumb.

technutz · Jan 4, 2009

Here is a thread on the LaTeX site that might help:
http://www.latex-community.org/viewtopic.php?f=5&t=1124

CRGreathouse · Jan 4, 2009

technutz said:

Here is a thread on the LaTeX site that might help:
http://www.latex-community.org/viewtopic.php?f=5&t=1124

That's nice, a partial solution for making pdfs from TeX source. I'll keep that in mind. But my problem here is with a pdf for which I do not have LaTeX source.

robphy · Jan 4, 2009

This might be an interesting test for InftyReader...
http://www.inftyproject.org/en/software.html

technutz · Jan 5, 2009

Which operating system are you using? XP, Vista, OS X or Linux variation?

There are a few helpful tips and workarounds in this thread:
https://www.physicsforums.com/archive/index.php/t-225791.html

CRGreathouse · Jan 5, 2009

technutz said:

There are a few helpful tips and workarounds in this thread:
https://www.physicsforums.com/archive/index.php/t-225791.html

I don't see how that relates to my problem. I don't have LaTeX source for the document -- if I did, I'd just pdfTeX it and be done. Instead I have a math document that was PDF'd by someone else and which I can't search or copy.

I'm interested in a general solution because this comes up not infrequently.

technutz said:

Which operating system are you using? XP, Vista, OS X or Linux variation?

Well this computer is Vista Business, but I could use XP or, in a pinch, Xubuntu or Server 2003.

technutz · Jan 5, 2009

The link to the other thread covered the different methods of doing a print screen on the various OSes. That would solve the copy issue: hold down the "Shift" key and press "Print Screen" to capture the Acrobat window with the current page. Then use the built-in "MS Paint" under "Accessories" to crop the image further; saving the cropped image gives a smaller file. Copying and pasting out of MS Paint defaults to bitmap.

As for searching, I have reviewed the Adobe site as well as several university sites and come up with nothing, as the encoding in the PDF seems to differ from one LaTeX setup to another. It would seem that a simple TrueType font (which is searchable) would resolve the issue, but that does not appear to be an option. Adobe was no help at all on LaTeX. Hopefully Print Screen will help you get by.

I will continue to look and keep an eye out for a better solution.

robphy · Jan 5, 2009

CRGreathouse said:

I don't see how that relates to my problem. I don't have LaTeX source for the document -- if I did, I'd just pdfTeX it and be done. Instead I have a math document that was PDF'd by someone else and which I can't search or copy.

I'm interested in a general solution because this comes up not infrequently.

Did you try InftyReader linked above?
It claims
"InftyReader recognizes scanned images of printed scientific documents including Math formulae, an outputs the recognition results in various formats:
XML format for InftyEditor, LaTeX, MathML, Human-Readable TeX for the blinds, etc."
So, if it works, you could get a LaTeX-file back.

While the process of making quality images for OCR might be time-consuming and tedious, one might be able to write a set of scripts to process them (say, running in the background or overnight).
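A minimal sketch of such a script, assuming the page images are already sitting in one folder as `.tif` files and that the free Tesseract command-line OCR tool is installed. Tesseract is an assumption here, not something tested in this thread, and it will not recover formulae the way InftyReader claims to. The function names are made up for illustration. Pages that already have a `.txt` result are skipped, so an interrupted overnight run can simply be restarted.

```python
from pathlib import Path
import subprocess


def pending_ocr_jobs(image_dir, text_dir):
    """Return (image, output_base) pairs for page images that do not yet
    have a matching .txt file in text_dir."""
    image_dir, text_dir = Path(image_dir), Path(text_dir)
    jobs = []
    for image in sorted(image_dir.glob("*.tif")):
        output_base = text_dir / image.stem
        if not (text_dir / (image.stem + ".txt")).exists():
            jobs.append((image, output_base))
    return jobs


def ocr_command(image, output_base, tesseract="tesseract"):
    """Tesseract writes its result to <output_base>.txt."""
    return [tesseract, str(image), str(output_base)]


def run_ocr(image_dir, text_dir):
    """Run OCR on every pending page, one at a time."""
    Path(text_dir).mkdir(parents=True, exist_ok=True)
    for image, output_base in pending_ocr_jobs(image_dir, text_dir):
        subprocess.run(ocr_command(image, output_base), check=True)
```

Calling `run_ocr("pages", "ocr_text")` (both folder names hypothetical) from a scheduled task would process whatever has been rendered so far.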

CRGreathouse · Jan 5, 2009

robphy said:

Did you try InftyReader linked above?

I'm looking into it. It looks really nice -- if it works, I'll probably buy it (though I don't really need it). Someone has to support fun toys!
