Searching for unusual characters in a Word document

  • Thread starter: pbuk
pbuk
TL;DR Summary
Unusual characters in a Word document (e.g. ligatures like œ and ff) can be troublesome, but there is an easy way to find them when working with English text.
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.

DaveC426913 said:
"...there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time.." (and even then one can't know if there are any other ligatures lurking in there that aren't even visible - I only caught this one by sheer luck because it happened to be the very last word in the entire copy).

jedishrfu said:
The ff ligature and other characters beyond the ASCII set will likely be encoded in the document as UTF-8, so it would be possible to scan the document looking for those multibyte characters.

Awk, sed, iconv and tr come to mind for removing these special characters.

Rather than using a command line tool that may not be available in Windows (and may corrupt any non-text file content e.g. embedded images), you can easily search for non-ASCII characters within Word.

Select "Advanced Find" from the menu bar, type [! -~^13] in the "Find what" box (note the space between ! and -), and select the "Use wildcards" option. You can then select "Reading Highlight" to see all the culprits, or "Find Next" to go through them one by one.



To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...'; ! changes that to mean 'look for characters that don't match...'; ' -~' means 'any character with a code point between that of a space (32) and that of a tilde (126)', these being the first and last printable ASCII characters; and ^13 means 'match ASCII character 13', which is the end-of-paragraph marker in Word.
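
For anyone who wants to see the same logic outside Word, here is a rough Python sketch of that character class. It is only an illustration, not what Word runs internally: Python regular expressions write the negation as ^ rather than !, and \r stands in for Word's ^13. The name find_unusual is made up for this example.

```python
import re

# Rough Python equivalent of Word's [! -~^13]:
# match any single character that is NOT in the range space..tilde
# (code points 32..126) and is NOT a carriage return (code point 13).
UNUSUAL = re.compile(r"[^ -~\r]")


def find_unusual(text):
    """Return (position, character, code point) for each unusual character."""
    return [(m.start(), m.group(), ord(m.group())) for m in UNUSUAL.finditer(text)]


# Sample text containing an 'ffi' ligature (U+FB03) and an accented e (U+00E9).
sample = "The o\ufb03ce caf\u00e9 is open.\r"
for pos, ch, code in find_unusual(sample):
    print(pos, repr(ch), hex(code))
```

In this Python version a tab (code point 9) or a line feed (10) would also be reported, since neither falls in the range 32 to 126.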

If you want, you can add other characters that you want to allow inside the square brackets, such as £ or €, the 'smart' quotation marks “” and ‘’, or the dash –; just copy them from here.
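
The same idea of extending the allowed set can be sketched in Python. This assumes a Python regular expression standing in for Word's wildcard search, and build_pattern is a hypothetical helper name, not anything Word provides.

```python
import re


def build_pattern(extra_allowed=""):
    """Match characters outside space..tilde, carriage return, and extra_allowed."""
    # re.escape protects any character that is special inside [...], such as ] or -
    return re.compile("[^ -~\\r" + re.escape(extra_allowed) + "]")


# Allow pound, euro, curly double and single quotes, and the en dash.
EXTRA = "\u00a3\u20ac\u201c\u201d\u2018\u2019\u2013"

# Text with curly quotes, a pound sign, an en dash and an 'fi' ligature (U+FB01).
text = "\u201cPrice\u201d: \u00a35 \u2013 that\u2019s \ufb01ne"
print(build_pattern(EXTRA).findall(text))
```

Only the ligature is reported here, because every other non-ASCII character in the sample was added to the allowed set.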
 
Interesting.
pbuk said:
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.
The OCR that I use tends to tie 'fi' into one ligature character within words. One of those "normal looking" words will be detected by a spell check, so I select the ligature and replace it with the expansion throughout the document.
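
Outside Word, the same replace-with-expansion step could be sketched in Python on plain text. This is a minimal example that assumes only the common Latin f-ligatures matter; the table and the name expand_ligatures are illustrative, not part of any OCR tool.

```python
# Common Latin f-ligatures from Unicode's Alphabetic Presentation Forms block,
# mapped to the plain letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace each listed ligature with its expansion; leave the rest alone."""
    return text.translate(_TABLE)


print(expand_ligatures("\ufb01nd the o\ufb03ce"))
```

Python's unicodedata.normalize("NFKC", text) also expands these ligatures, but it rewrites many other characters too, so an explicit table is easier to control.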
 
pbuk said:
To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...'; ! changes that to mean 'look for characters that don't match...'; ' -~' means 'any character with a code point between that of a space (32) and that of a tilde (126)', these being the first and last printable ASCII characters; and ^13 means 'match ASCII character 13', which is the end-of-paragraph marker in Word.

Very useful, I didn't know that 'wildcards' would include regular expressions (almost...).
 