Very strange copy-paste bug in PDF document

DaveC426913 · Oct 13, 2024

Vanadium 50 said:

The fundamental problem is that copy-and-paste provides no guarantee of accuracy.

What would be an alternative?

I can't write the copy myself merely by reading what someone wrote. That's prone to even more errors.

Copy and paste does guarantee accuracy if you establish standards. A straight up .txt document will faithfully copy almost everything. Enough that the list of exceptions is reduced to a manageable level.

Vanadium 50 said:

What goes on the clipboard is what one program chooses to write, and what comes back is what another chooses to read.

Which is why you set standards that are known to guarantee compatibility.

Vanadium 50 said:

I could, if I were inclined to be perverse, write a program that,

You could, yes. That would be dishonest.

You could also use invisible ink or pour sugar in our gas tanks, if you were of a mind.

What is the point in arguing absurdities?

It troubles me that you seem to consider sabotage a valid concern. Or are you joking around?

Vanadium 50 said:

If this is viewed as too risky or inaccurate, you need to drop cut and paste entirely.

No you don't.

First thing you do is hire people who are not saboteurs.
Second thing you do is establish standards.

I am not sure why you appear to be treating this as a joke. This is a real, consequential issue. I was just lucky to catch this error because it happened to be the last word in the copy. Frankly, I still don't know if I published any other misspellings, because I had to publish what I got and couldn't afford to stop everything to ask the chain of stakeholders to rewrite.

What if it had been a dollar figure? How would I ever know? I can't report something I don't know about. And then, before it's caught, some customer tries to sue us for fraudulent pricing. Still a joke?

Vanadium 50 · Oct 14, 2024

The clipboard does not guarantee perfect fidelity. If perfect fidelity is a requirement, you need to use something else. That something else might not even be a PC.

DaveC426913 · Oct 14, 2024

Vanadium 50 said:

The clipboard does not guarantee perfect fidelity. If perfect fidelity is a requirement, you need to use something else. That something else might not even be a PC.

Nobody said 'perfect', except you (three times now, including post 15). I think this is carrying the argument to an impractical (ad absurdum) degree, and I'm not sure why.

What I actually said (post 31) was "...the list of exceptions is reduced to a manageable level."

One should be able to count on standard text copying as standard text. The ASCII string 'affordable' doesn't get mangled by a simple copy paste.

So we will establish a policy that uses the lowest common denominator of text representation.

Vanadium 50 · Oct 14, 2024

"Imperfect fidelity acceptable to Dave" is not a clear spec.

I will point out that I was the one who pointed out what was happening and have received nothing but grief for it. Lesson learned.

DaveC426913 · Oct 14, 2024

Vanadium 50 said:

"Imperfect fidelity acceptable to Dave" is not a clear spec.

Clear for whom? You?
What matters is that it is sufficiently clear to our department.

Vanadium 50 said:

I will point out that I was the one who pointed out what was happening and have received nothing but grief for it. Lesson learned.

You told me I was scanning it and using OCR (you said "sweeping away the cruft").
Then you told me it was a simple ten second fix, no matter how many ways I explained that it was not.
Then you mocked me because you apparently didn't understand how processes in departments work.

You've been mansplaining. That's what you've been getting push back on. (Is it "grief" if I am telling you where your assumptions are incorrect and asking why you are making a joke of this?).

I have been trying to be diplomatic: "Sorry, I am becoming aware that lack of context is spinning this off in unexpected directions." This is me trying to not point fingers, despite my frustration that you are giving me grief at every turn.

That being said, thank you sincerely for your help.

pbuk · Oct 16, 2024

DaveC426913 said:

there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time, for every revision that comes down the pipe.

It is fairly easy to search a Word document for 'unusual' characters - I have posted instructions here:
https://www.physicsforums.com/threads/searching-for-unusual-characters-in-a-word-document.1066335/

DaveC426913 said:

(what if it occurred in a date or dollar value?).

This is less likely to happen with numbers (but not impossible, for instance the alternative digits that are used in some fonts where some 'normal' digits descend below the baseline).

jedishrfu · Oct 16, 2024

We have exhausted this topic and I think Dave understands the issue that he has to convey to the upstream designer.

So it's a good time to close the thread and without further ado...

Thank you all for contributing here.

PS: The ff ligature and other characters beyond the standard 7-bit ASCII character set will likely be encoded in the document as UTF-8, so it would be possible to scan the document looking for those multibyte characters.
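To make the scanning idea concrete, here is a minimal Python sketch. It assumes the copy has already been saved or pasted as plain text and decoded as UTF-8; the function name find_non_ascii and the sample string are made up for illustration. It only reports where non-ASCII characters sit and what Unicode calls them; it does not change anything.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line_no, column, char, unicode_name) for every non-ASCII character.

    Line and column numbers are 1-based. Hypothetical helper for illustration.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything outside 7-bit ASCII
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Made-up sample: "affordable" written with the single-character ff ligature (U+FB00).
sample = "Our plans are a\ufb00ordable.\nPrice: $10"
for hit in find_non_ascii(sample):
    print(hit)
```

A report like this would also flag legitimate characters such as curly quotes or accented letters, so someone still has to decide which hits are actually problems.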

The Linux commands awk, sed, iconv, and tr come to mind for removing these special characters. Windows has equivalent third-party versions of these commands but you'll have to search for them.

In general, awk is the most powerful of the set since it's a programming language geared for text search and replacement. Awk is still my favorite go-to language when I need to work with text files.

You can access some or all of these commands in a Linux environment on Windows if you install WSL, i.e., Windows Subsystem for Linux.

Alternatively you can install the cygwin tools and get access to them that way.

Neither is an easy install, and either may require an admin in a work environment.

Or you could install GNU awk (gawk) on Windows and find online or develop a script that locates the UTF-8 characters and removes or replaces them with what you want.

https://www.gnu.org/software/gawk/

https://www.gnu.org/software/software.html

Some brief info on the Linux commands:

https://confluence.cornell.edu/display/CNF/Linux+CheatSheet#LinuxCheatSheet

One more alternative came to mind: write a Python script to do it. Again, you'll need Python on your machine, but that is becoming more common as machine learning proliferates.
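As a rough sketch of what such a Python script might look like, the snippet below replaces the common Latin f-ligatures with their plain ASCII letters. The mapping table and the name expand_ligatures are my own assumptions for illustration, not a complete list of everything a PDF can produce, and the script assumes the text is already decoded as UTF-8.

```python
# Hypothetical mapping: Unicode ligature code points -> plain ASCII letters.
# Covers only the common Latin ligatures; extend it as other cases turn up.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with the ligatures listed above expanded to ASCII letters."""
    return text.translate(_TABLE)


# Made-up sample containing the ff and ffi ligatures.
print(expand_ligatures("a\ufb00ordable o\ufb03ce space"))
```

An explicit table is used here rather than a blanket Unicode normalization so that nothing outside the listed characters gets changed without someone deciding it should be.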
