Can't see PDF contents as well as Google does

  • Thread starter: Stephen Tashi
  • Tags: Google, PDF

Discussion Overview

The discussion revolves around the issue of searching for text within PDF files and the differences in searchability between Google’s indexing and local tools. Participants explore the implications of scanned documents and the use of OCR (Optical Character Recognition) technology.

Discussion Character

  • Technical explanation, Debate/contested

Main Points Raised

  • One participant notes that some PDFs are not searchable despite Google being able to find keywords, questioning whether Google accesses a different source than the PDF itself.
  • Another participant explains that the example PDF contains scanned images without actual text, requiring OCR software to extract searchable text, which Google is capable of performing.
  • A participant mentions their own experience with scanned PDFs that were not processed with OCR, yet Google could still find text within them, highlighting inconsistencies in searchability.
  • One participant asks about the presence of a specific word in the PDF, indicating that Google’s search results included it, but it was not visible in the document as an image.
  • A later reply confirms the location of the word within the document, suggesting that it is indeed present but may not be easily accessible without OCR.

Areas of Agreement / Disagreement

Participants do not dispute that the example PDF consists of scanned images without a text layer. The explanation that Google runs OCR on such documents is offered as an inference ("Apparently") rather than a confirmed fact, so the exact mechanism behind Google's results remains unverified in the thread.

Contextual Notes

The discussion highlights limitations related to the nature of scanned documents, the necessity of OCR for text extraction, and the variability in how different tools handle PDF content.

Stephen Tashi
There are some PDF files where Google finds keywords, but when I open the link Google gives, the PDF isn't searchable by any tool I've tried (such as the search feature of Firefox). If I save the PDF to a file, I can't search it with the Linux commands grep and pdfgrep either.

Example: http://www.rld.state.nm.us/uploads/...d5a3b8f137/Recent_Liquor_License_Sales_19.pdf

Is Google finding keywords in some source different than the PDF?
 
I opened your example file in my (paid) Acrobat Pro, and tried to search for a word in it. It contains scanned images with no actual text. In order to do a text search, you have to run OCR software on it to extract the text. Acrobat Pro can do this, and then it can search for text (e.g. "Albuquerque" which I tried just now).

Apparently Google does OCR on scanned documents in PDFs.
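
For anyone who wants to check a saved file without Acrobat, here is a rough sketch of the "scanned images with no actual text" test in Python, using only the standard library. It rests on assumptions that are mine, not anything stated above: the PDF is not encrypted, its page content streams are either uncompressed or Flate-compressed, and any real text is drawn with the `Tj` or `TJ` operators inside a `BT ... ET` text object. The names `inflate` and `has_text_layer` are made up for this example. It is a crude heuristic, not a PDF parser, and raw image bytes could in principle trigger a false positive.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A text object (BT ... ET) that contains a text-showing operator (Tj or TJ).
TEXT_OP_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def inflate(data: bytes) -> bytes:
    """Return the Flate-decompressed data, or the input unchanged if that fails."""
    try:
        # decompressobj tolerates trailing bytes (e.g. the newline before "endstream").
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return data  # not Flate data (uncompressed content, JPEG image, ...)


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to draw text, False otherwise."""
    return any(
        TEXT_OP_RE.search(inflate(match.group(1)))
        for match in STREAM_RE.finditer(pdf_bytes)
    )
```

To try it on a saved copy, something like `has_text_layer(open("saved.pdf", "rb").read())` should do, where `saved.pdf` is a placeholder file name. A result of `False` would be consistent with a scan that needs OCR before any text search can work. Separately, plain grep on the raw file tends to fail even for PDFs that do contain text, because content streams are usually stored compressed.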

[added] My website has some files that I scanned and converted to PDF without running OCR on them. I've just verified that Google can find text strings in them. For example, searching for "rapid rider special" (with quotes) returns this result:

[image: rapidriderspecial.gif]

The searched-for text is upside down (!) at the bottom right of the first page of the PDF.
 

Can you find the word "Mariah" in the PDF? In my Google search, Google quoted that word in the excerpt it gave with the link, but I don't see the word as an image.
 
It's on page 7, line 4. That's the only occurrence in the document.
 
