Searching for unusual characters in a Word document

  • Thread starter: pbuk
SUMMARY

This discussion focuses on identifying and removing unusual characters in Word documents, particularly those introduced through copy-pasting from PDFs. Users can utilize the "Advanced Find" feature in Microsoft Word, employing the wildcard search pattern [! -~^13] to locate non-ASCII characters. The discussion highlights the importance of UTF-8 encoding in recognizing multibyte characters and suggests tools like Awk and sed for those comfortable with command-line interfaces. Additionally, it emphasizes the utility of wildcards in enhancing search capabilities within Word.

PREREQUISITES
  • Familiarity with Microsoft Word's "Advanced Find" feature
  • Understanding of UTF-8 encoding and its implications for character representation
  • Basic knowledge of wildcard search patterns
  • Experience with command-line tools like Awk and sed (optional)
NEXT STEPS
  • Research how to effectively use Microsoft Word wildcards for advanced searches
  • Learn about UTF-8 encoding and its impact on document formatting
  • Explore command-line tools like Awk and sed for text processing
  • Investigate methods for preventing character corruption during copy-pasting from PDFs
USEFUL FOR

This discussion is beneficial for content creators, editors, and anyone working with Word documents who needs to ensure text integrity by identifying and removing unusual characters.

pbuk
TL;DR
Unusual characters in a Word document (e.g. ligatures like œ and ﬀ) can be troublesome, but there is an easy way to find them when working with English text.
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.

DaveC426913 said:
"...there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time.." (and even then one can't know if there are any other ligatures lurking in there that aren't even visible - I only caught this one by sheer luck because it happened to be the very last word in the entire copy).

jedishrfu said:
The ﬀ ligature and other characters beyond the ASCII set will likely be encoded in the document as UTF-8, so it would be possible to scan the document looking for those multibyte characters.

Awk, sed, iconv and tr come to mind for removing these special characters.
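
The scan described in the quote above can be sketched in a few lines of Python. This is only an illustration, and it assumes the text has already been exported from Word as plain text (it does not read .docx files). The function name `find_multibyte` and the sample string are made up for the example.

```python
import unicodedata

def find_multibyte(text):
    """Return (line, column, char, name) for each character whose
    UTF-8 encoding needs more than one byte, i.e. every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if len(ch.encode("utf-8")) > 1:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

# Hypothetical sample containing the ff ligature (U+FB00) and oe (U+0153)
sample = "The sta\ufb00 met at the \u0153uvre."
for hit in find_multibyte(sample):
    print(hit)
```

Printing the Unicode name alongside each hit makes invisible or look-alike characters easier to identify than the glyph alone.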

Rather than using a command line tool that may not be available in Windows (and may corrupt any non-text file content e.g. embedded images), you can easily search for non-ASCII characters within Word.

Select "Advanced Find" from the menu bar, type [! -~^13] in the "Find what" box (note the space between ! and -) and select the "Use wildcards" option. You can then select "Reading Highlight" to see all the culprits, or "Find Next" to go through them one by one.



To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...'; ! changes that to mean 'look for characters that don't match...'; the space followed by -~ means 'any character with a code point between that of a space (32) and that of a tilde (126)', which are the first and last printable ASCII characters; and ^13 means 'ASCII character 13', which is the end of paragraph marker in Word.
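
For readers more used to regular expressions, the same idea can be written as a Python character class. This is a rough counterpart for comparison, not a statement of how Word implements wildcards; the names `UNUSUAL` and `unusual_chars` are invented for the sketch.

```python
import re

# Rough counterpart of the Word wildcard pattern [! -~^13]:
#   [^...]  negated set (Word writes the negation as !)
#   " -~"   the range from space (32) to tilde (126)
#   \r      character 13 (Word writes this as ^13)
UNUSUAL = re.compile(r"[^ -~\r]")

def unusual_chars(text):
    """Return (offset, character, code point) for every match."""
    return [(m.start(), m.group(), f"U+{ord(m.group()):04X}")
            for m in UNUSUAL.finditer(text)]

# Hypothetical sample: a ligature, curly quotes and a paragraph mark
sample = "A di\ufb00erent \u201cquote\u201d\r"
print(ord(" "), ord("~"))
print(unusual_chars(sample))
```

Like the Word pattern, this also flags control characters other than 13, such as tabs and line feeds, so add them to the set if they are expected in the text.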

If you want you can add other characters within the square brackets that you want to allow such as £ or €, the 'smart' quotation marks “” and ‘’ or the emdash – just copy them from here.
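
The same allow-list idea, sketched in Python for comparison. The helper `build_pattern` and its `extra_allowed` parameter are hypothetical names; the extra characters are simply appended inside the negated set, just as they would be typed inside the square brackets in Word.

```python
import re

def build_pattern(extra_allowed=""):
    """Match any character that is not printable ASCII, not character 13,
    and not one of the explicitly allowed extras."""
    return re.compile("[^ -~\\r" + re.escape(extra_allowed) + "]")

# Allow pound, euro and the four 'smart' quotation marks
allowed = "\u00a3\u20ac\u201c\u201d\u2018\u2019"
pattern = build_pattern(allowed)

# Hypothetical sample: only the ff ligature should still be reported
sample = "Cost: \u00a35 \u201capprox\u201d for the sta\ufb00"
print(pattern.findall(sample))
```

`re.escape` is used so that characters with a special meaning inside a set, such as ] or -, can be added to the allow-list safely.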
 
Interesting.
pbuk said:
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.
The OCR that I use tends to tie 'fi' into one ligature character within words. A spell check will flag one of those "normal looking" words, so I select the ligature and replace it with its expansion throughout the document.
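
For plain text outside Word, the same replace-with-expansion step can be sketched in Python. The table `LIGATURES` and the function `expand_ligatures` are illustrative names, and the table covers only the five common Latin f-ligatures; it is an assumption that these are the ones the OCR produces.

```python
import unicodedata

# Common Latin f-ligatures and their plain-letter expansions
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

def expand_ligatures(text):
    """Replace each listed ligature with its expansion throughout the text."""
    return text.translate(str.maketrans(LIGATURES))

# Hypothetical OCR output containing three different ligatures
sample = "The \ufb01nal o\ufb03ce sta\ufb00"
print(expand_ligatures(sample))

# Unicode compatibility normalization expands these ligatures as well,
# but it also rewrites other compatibility characters, so use it with care.
print(unicodedata.normalize("NFKC", "\ufb01"))
```

An explicit table changes only the characters listed in it, which is closer to the select-and-replace approach described above than a blanket normalization would be.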
 
pbuk said:
To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...'; ! changes that to mean 'look for characters that don't match...'; the space followed by -~ means 'any character with a code point between that of a space (32) and that of a tilde (126)', which are the first and last printable ASCII characters; and ^13 means 'ASCII character 13', which is the end of paragraph marker in Word.

Very useful. I didn't know that 'wildcards' would include regular expressions (almost...).
 
