Very strange copy-paste bug in PDF document

  • Thread starter: DaveC426913
  • Tags: Bug, Pdf
A user is experiencing a copy-paste issue with a PDF document where the text "affordable" is pasted incorrectly as "aJordable." The problem is attributed to the PDF's use of ligatures, which replace certain letter combinations with special glyphs, causing discrepancies when copying text. The discussion highlights the need for designers to avoid using complex fonts and ligatures in documents intended for text transfer. There is a consensus that relying on PDFs for copy can lead to significant errors, making it necessary to establish better document management practices. The conversation emphasizes the importance of clear communication with designers to ensure text is delivered in a more reliable format, like Word.
  • #31
Vanadium 50 said:
The fundamental problem is that copy-and-paste provides no guarantee of accuracy.
What would be an alternative?

I can't write the copy myself merely by reading what someone wrote. That's prone to even more errors.

Copy and paste does guarantee accuracy if you establish standards. A straight up .txt document will faithfully copy almost everything. Enough that the list of exceptions is reduced to a manageable level.

Vanadium 50 said:
What goes on the clipboard is what one program chooses to write, and what comes back is what another chooses to read.
Which is why you set standards that are known to guarantee compatibility.

Vanadium 50 said:
I could, if I were inclined to be perverse, write a program that,
You could, yes. That would be dishonest.

You could also use invisible ink or pour sugar in our gas tanks, if you were of a mind.

What is the point in arguing absurdities?

It troubles me that you seem to consider sabotage a valid concern. Or are you joking around?

Vanadium 50 said:
If this is viewed as too risky or inaccurate, you need to drop cut and paste entirely.
No you don't.

First thing you do is hire people who are not saboteurs.
Second thing you do is establish standards.

I am not sure why you appear to be treating this as a joke. This is a real, consequential issue. I was just lucky to catch this error because it happened to be the last word in the copy. Frankly, I still don't know if I published any other misspellings, because I had to publish what I got and couldn't afford to stop everything to ask the chain of stakeholders to rewrite.

What if it had been a dollar figure? How would I ever know? I can't report something I don't know about. And then, before it's caught, some customer tries to sue us for fraudulent pricing. Still a joke?
 
  • #32
The clipboard does not guarantee perfect fidelity. If perfect fidelity is a requirement, you need to use something else. That something else might not even be a PC.
 
  • #33
Vanadium 50 said:
The clipboard does not guarantee perfect fidelity. If perfect fidelity is a requirement, you need to use something else. That something else might not even be a PC.
Nobody said 'perfect', except you (three times now, including post 15). I think this is carrying the argument to an impractical (ad absurdum) degree, and I'm not sure why.

What I actually said (post 31) was "...the list of exceptions is reduced to a manageable level."

One should be able to count on standard text copying as standard text. The ASCII string 'affordable' doesn't get mangled by a simple copy paste.

So we will establish a policy that uses the lowest common denominator of text representation.
 
  • #34
"Imperfect fidelity acceptable to Dave" is not a clear spec.

I will point out that I was the one who pointed out what was happening and have received nothing but grief for it. Lesson learned.
 
  • #35
Vanadium 50 said:
"Imperfect fidelity acceptable to Dave" is not a clear spec.
Clear for whom? You?
What matters is that it is sufficiently clear to our department.

Vanadium 50 said:
I will point out that I was the one who pointed out what was happening and have received nothing but grief for it. Lesson learned.
You told me I was scanning it and using OCR (you said "sweeping away the cruft").
Then you told me it was a simple ten second fix, no matter how many ways I explained that it was not.
Then you mocked me because you apparently didn't understand how processes in departments work.

You've been mansplaining. That's what you've been getting push back on. (Is it "grief" if I am telling you where your assumptions are incorrect and asking why you are making a joke of this?).

I have been trying to be diplomatic: "Sorry, I am becoming aware that lack of context is spinning this off in unexpected directions." This is me trying to not point fingers, despite my frustration that you are giving me grief at every turn.

That being said, thank you sincerely for your help.
 
  • #36
DaveC426913 said:
there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time, for every revision that comes down the pipe.
It is fairly easy to search a Word document for 'unusual' characters - I have posted instructions here:
https://www.physicsforums.com/threads/searching-for-unusual-characters-in-a-word-document.1066335/

DaveC426913 said:
(what if it occurred in a date or dollar value?).
This is less likely to happen with numbers (but not impossible, for instance the alternative digits that are used in some fonts where some 'normal' digits descend below the baseline).
 
  • #37
We have exhausted this topic and I think Dave understands the issue that he has to convey to the upstream designer.

So it's a good time to close the thread and without further ado...

Thank you all for contributing here.

PS: The ff ligature and other characters that fall outside the standard 7-bit ASCII character set will likely be encoded in the document as multibyte UTF-8 sequences, so it would be possible to scan the document looking for those multibyte characters.
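To make the scanning idea concrete, here is a minimal Python sketch. It assumes the copy has already been pasted or saved as plain text; the function name `find_non_ascii` and the sample string are made up for illustration and are not taken from any tool mentioned in this thread.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything outside 7-bit ASCII
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

# Hypothetical sample: "affordable" containing the single-character ff ligature (U+FB00).
sample = "Our plans are a\ufb00ordable.\nPrices start at $10."
for hit in find_non_ascii(sample):
    print(hit)
```

On the sample, this reports one hit on line 1 and names it as LATIN SMALL LIGATURE FF. It only tells you where to look; whether a given hit is a problem (a ligature) or legitimate (an accented name) is still a human decision.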

The Linux commands awk, sed, iconv, and tr come to mind for removing these special characters. Windows has equivalent third-party versions of these commands but you'll have to search for them.

In general, awk is the most powerful of the set since it's a programming language geared toward text search and replacement. Awk is still my favorite go-to language when I need to work with text files.

You can access some or all of these commands in a Linux environment on Windows if you install WSL, i.e., Windows Subsystem for Linux.

Alternatively, you can install the Cygwin tools and get access to them that way.

Neither is an easy install, and either may require an admin in a work environment.

Or you could install GNU awk (gawk) on Windows and either find a script online or write one that locates the multibyte UTF-8 characters and removes them or replaces them with what you want.

https://www.gnu.org/software/gawk/

https://www.gnu.org/software/software.html

Some brief info on the Linux commands:

https://confluence.cornell.edu/display/CNF/Linux+CheatSheet#LinuxCheatSheet

One more alternative came to mind: write a Python script to do it. Again, you'll need Python on your machine, but that is becoming more common as machine learning proliferates.
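A rough sketch of what such a Python script might look like, using the standard library's `unicodedata.normalize`. The choice of NFKC normalization is my assumption about a reasonable approach, not something established in this thread, and the function name `flatten_to_ascii` is hypothetical. NFKC expands compatibility characters such as the ff ligature into plain letters, but it also rewrites other things (for example, superscripts and non-breaking spaces), so treat the output as something to review rather than a guaranteed-safe fix.

```python
import unicodedata

def flatten_to_ascii(text):
    """Expand compatibility characters (e.g. ligatures) and list what is still non-ASCII."""
    # NFKC replaces U+FB00 (ff ligature) with the two letters "ff", among other changes.
    normalized = unicodedata.normalize("NFKC", text)
    # Anything still outside ASCII (curly quotes, accents, ...) needs a human decision.
    leftovers = sorted({ch for ch in normalized if ord(ch) > 127})
    return normalized, leftovers

# Hypothetical sample: a ligature plus curly quotes.
clean, leftovers = flatten_to_ascii("a\ufb00ordable \u201cquoted\u201d")
print(clean)      # the ligature is now the plain letters "ff"
print(leftovers)  # the curly quotes are reported, not silently changed
```

Returning the leftover characters instead of deleting them is deliberate: silently stripping anything non-ASCII could corrupt names or prices just as badly as the original ligature did.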
 
