Re-OCR'ing *text* (not scanned) PDFs

  1. Dec 30, 2008 #1

    CRGreathouse

    Science Advisor
    Homework Helper

    I have a problem that I suspect many others here have had with math and science papers.

    I have a number of PDFs that have real text (as opposed to images of text) in the file, but the encoding is different. Sometimes this happens with LaTeX documents, sometimes for other reasons that aren't clear to me. Regardless, this robs me of my ability to search or copy: it's not much use to grab "  !"#  $%#!% #&"  ' ($ %) *"
    !" %) !%+" %)  ,- " #&"
    " from a file.

    Exporting as an image and OCRing is possible, but the quality degrades so severely that it's not particularly usable. (I don't actually know why the quality is so bad with an export, but it is...) You'd think there would be a way to have Acrobat look at the actual shape of the letters and OCR that...

    If it helps, I have Acrobat Pro 9 here.
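    For anyone wanting to sort a pile of PDFs into "copies fine" and "copies as garbage", here is a rough sketch of a check on already-extracted text. It assumes you have the copied text as a string; the function names and the 0.5 threshold are made up for illustration, and a file whose letters are remapped to other letters would slip through.

```python
def readable_fraction(text):
    """Return the share of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0  # an empty text layer counts as unreadable here
    return sum(ch.isalnum() for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Flag text whose readable share falls below an arbitrary threshold."""
    return readable_fraction(text) < threshold


# The start of the pasted fragment from this post, versus an ordinary sentence.
pasted = '!"#  $%#!% #&"  \' ($ %) *"'
print(looks_garbled(pasted))
print(looks_garbled("I have a number of PDFs that have real text."))
```

    The threshold is a guess and would need tuning on real files; math-heavy pages have plenty of legitimate symbols.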
     
  3. Dec 30, 2008 #2

    mgb_phys

    Science Advisor
    Homework Helper

    A lot of dvi-to-PDF programs generate PDF image files. This guarantees that the font and layout are exactly as TeX wants them, rather than Adobe messing anything up.
    There is no reason that the exported image should be poorer quality.
    I don't know the export options in Acrobat, but try selecting TIFF at a high resolution; you might also need to choose greyscale rather than mono.
    To OCR them, there are a few free or trial versions of OCR software around.
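    To put rough numbers on "high resolution", here is a small back-of-the-envelope sketch. It assumes 10 pt body text and a US Letter page, neither of which is known for the paper in question, and it uses the font's nominal size, so actual letter shapes come out somewhat smaller than the figures it prints.

```python
POINTS_PER_INCH = 72.0  # PostScript/PDF points per inch


def glyph_height_px(font_size_pt, dpi):
    """Approximate pixel height of the font's em box at a given export resolution."""
    return dpi * font_size_pt / POINTS_PER_INCH


def page_size_px(width_in, height_in, dpi):
    """Pixel dimensions of a page exported at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed values: 10 pt text on an 8.5 x 11 inch page.
for dpi in (72, 150, 300, 600):
    print(dpi, round(glyph_height_px(10, dpi), 1), page_size_px(8.5, 11, dpi))
```

    At a screen-like 72 dpi the text is only about 10 pixels tall, which may be part of why a default export OCRs so badly; 300 dpi or more gives the OCR program far more to work with.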

    If the docs aren't confidential and you aren't in a hurry, you can just put the original PDFs on your website, let Google spider them, and then use Google to search for the OCR'ed text!
     
  4. Dec 30, 2008 #3

    CRGreathouse

    Science Advisor
    Homework Helper

    If anyone's wondering, the garbage I pasted above is the first sentence of an abstract from a paper I was reading. Yes, it does actually come out that bad.

    I'll play around with the settings a bit, but the initial results were so bad I was fairly discouraged. The process is slow (export, wait, wait, import, join page-by-page, OCR) and the results were poor at best.
     
  5. Dec 30, 2008 #4

    mgb_phys

    Science Advisor
    Homework Helper

    Download Ghostview; it can read PDF/PS and convert it to just about anything.
    It can also do lots of tricks like splitting and combining files, and it's free!
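    As a sketch of the conversion step: this assumes the Ghostscript command-line program that Ghostview sits on top of, installed as `gs` (on Windows the executable has a different name, so pass it in). The file names and the helper functions are placeholders for illustration.

```python
import shutil
import subprocess


def rasterize_command(pdf_path, out_pattern, dpi=300, gs="gs"):
    """Build a Ghostscript call that renders every page to a greyscale TIFF."""
    return [
        gs,
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=tiffgray",            # 8-bit greyscale TIFF output
        f"-r{dpi}",                     # render resolution in dpi
        f"-sOutputFile={out_pattern}",  # e.g. page-%03d.tif, one file per page
        pdf_path,
    ]


def rasterize(pdf_path, out_pattern, dpi=300, gs="gs"):
    """Run the command if the executable is on PATH; return True on success."""
    if shutil.which(gs) is None:
        return False
    cmd = rasterize_command(pdf_path, out_pattern, dpi, gs)
    return subprocess.run(cmd).returncode == 0


# Show the command that would be run for a hypothetical paper.pdf.
print(" ".join(rasterize_command("paper.pdf", "page-%03d.tif")))
```

    Rendering straight from the PDF this way skips the export, import, and join-page-by-page steps in Acrobat.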

    PDF can be compressed, perhaps that is the compressed data you are seeing?
     
  6. Dec 30, 2008 #5

    CRGreathouse

    Science Advisor
    Homework Helper

    I don't think so. I have a thought -- PM me your email address and I'll send you the PDF, then you can tell me what you think. Maybe I'm just being dumb.
     
  7. Jan 4, 2009 #6
  8. Jan 4, 2009 #7

    CRGreathouse

    Science Advisor
    Homework Helper

    That's nice, a partial solution for making pdfs from TeX source. I'll keep that in mind. But my problem here is with a pdf for which I do not have LaTeX source.
     
  9. Jan 4, 2009 #8

    robphy

    Science Advisor
    Homework Helper
    Gold Member

  10. Jan 5, 2009 #9
  11. Jan 5, 2009 #10

    CRGreathouse

    Science Advisor
    Homework Helper

    I don't see how that relates to my problem. I don't have LaTeX source for the document -- if I did, I'd just pdfTeX it and be done. Instead I have a math document that was PDF'd by someone else and which I can't search or copy.

    I'm interested in a general solution because this comes up not infrequently.

    Well this computer is Vista Business, but I could use XP or, in a pinch, Xubuntu or Server 2003.
     
  12. Jan 5, 2009 #11
    The link to the other thread described the different ways of taking a screenshot on the various OSes. That would solve the copy issue: hold down the "Shift" key and press "Print Screen" to capture the Acrobat window with the current page. Use the built-in MS Paint (under "Accessories") to crop the image further. Saving the cropped image gives a smaller file. Anything copied and pasted out of MS Paint defaults to Bitmap format.

    As for searching, I have reviewed the Adobe site as well as several university sites and come up with nothing, as the encoding in the PDF seems to differ from one LaTeX setup to another. It would seem that a simple TrueType font (which is searchable) would resolve the issue, but that does not appear to be an option. Adobe was no help at all on LaTeX. Hopefully Print Screen will help you get by.

    I will continue to look and keep an eye out for a better solution.
     
  13. Jan 5, 2009 #12

    robphy

    Science Advisor
    Homework Helper
    Gold Member

    Did you try InftyReader linked above?
    It claims
    "InftyReader recognizes scanned images of printed scientific documents including Math formulae, and outputs the recognition results in various formats:
    XML format for InftyEditor, LaTeX, MathML, Human-Readable TeX for the blinds, etc."
    So, if it works, you could get a LaTeX-file back.

    While the process of making quality images for OCR might be time-consuming and tedious, one might be able to write a set of scripts to process them (say, running in the background or overnight).
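    A minimal sketch of such a script, assuming the page images already exist and that a command-line OCR program is installed. Tesseract is used here purely as an example of one (it is not something tested in this thread), and the function names are made up.

```python
import subprocess
from pathlib import Path


def ocr_command(image, tesseract="tesseract"):
    """`tesseract IMAGE OUTPUTBASE` writes the recognized text to OUTPUTBASE.txt."""
    image = Path(image)
    return [tesseract, str(image), str(image.with_suffix(""))]


def batch_ocr(images, run=subprocess.run):
    """OCR each image in turn; return the ones that failed so they can be retried."""
    failed = []
    for image in images:
        try:
            ok = run(ocr_command(image)).returncode == 0
        except OSError:  # e.g. the OCR program is not installed
            ok = False
        if not ok:
            failed.append(image)
    return failed
```

    Something like `batch_ocr(sorted(Path("pages").glob("*.tif")))` could then be left running overnight, with the returned list showing which pages need another look. Note that plain OCR of this kind will not recover the math as LaTeX the way InftyReader claims to.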
     
  14. Jan 5, 2009 #13

    CRGreathouse

    Science Advisor
    Homework Helper

    I'm looking into it. It looks really nice -- if it works, I'll probably buy it (though I don't really need it). Someone has to support fun toys!
     