Understanding UTF-16 Encoding for Multilingual Character Sets

SUMMARY

This discussion clarifies that UTF-16 is a variable-length encoding that uses either 2 or 4 bytes per character. It corrects the misconception that every UTF-16 character is exactly 2 bytes long by distinguishing UTF-16 from UCS-2, which is a fixed-length 2-byte encoding. It also addresses how the Windows API (WINAPI), whose WCHAR type is 2 bytes long, deals with 4-byte characters: a character that needs more than one WCHAR appears in the program as a sequence of WCHARs.

PREREQUISITES
  • Understanding of UTF-16 and its variable-length encoding
  • Familiarity with WCHAR data type in C programming
  • Knowledge of Unicode character sets and their representations
  • Basic concepts of Windows API and string handling
NEXT STEPS
  • Research the differences between UTF-16 and UCS-2 encoding
  • Learn how to handle characters that span more than one WCHAR (surrogate pairs) in C
  • Explore Windows API functions that support UTF-16 encoding
  • Investigate character encoding best practices for multilingual applications
USEFUL FOR

Software developers, particularly those working with Windows applications, and anyone involved in internationalization and localization of software products.

kandelabr
Hello all,

I hope I understand this correctly:
In UTF-8, all ASCII characters are 8 bits long, and other, more sophisticated characters take 2 to 4 bytes.

Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?

I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? All my strings go to hell?

Thanks.
 
kandelabr said:
In UTF-8, all ASCII characters are 8 bits long, and other, more sophisticated characters take 2 to 4 bytes.
Yes.
kandelabr said:
Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?
No; actually, according to Wikipedia, UTF-16 is also a variable-length encoding that can use either 2 or 4 bytes per character. You may be thinking of UCS-2, which is a fixed-length encoding using 2 bytes per character; in that encoding it's impossible to represent any character with a Unicode code point greater than 65535.
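To make the 2-or-4-byte rule concrete, here is a minimal sketch in plain C of how a single code point could be turned into UTF-16 code units. The function name `utf16_encode` is made up for illustration, and `uint16_t` stands in for a 16-bit unit such as WCHAR; this is not a Windows API call.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Writes 1 or 2 16-bit units to out and returns how many were written.
 * Returns 0 if cp is not encodable (above U+10FFFF or itself a surrogate).
 * Hypothetical helper for illustration only. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        /* Code points up to 65535 fit in a single 16-bit unit (2 bytes). */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Everything above 65535 is split into a surrogate pair (4 bytes):
     * subtract 0x10000, then put the top 10 bits in the high surrogate
     * and the bottom 10 bits in the low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For example, U+20AC (the euro sign) comes out as the single unit 0x20AC, while U+1F600 comes out as the two units 0xD83D 0xDE00. UCS-2 has only the first branch, which is why it stops at 65535.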
kandelabr said:
I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? All my strings go to hell?
I'm not familiar with Windows programming, but it might be that WCHAR is really just a 16-bit unit of data that doesn't necessarily have to map one-to-one to a single character. In that case, what you actually pass to the function is the UTF-16 encoded representation of the string you want to process (or perhaps it uses a different representation). If your string includes characters that are represented by more than one 16-bit unit in UTF-16, then those characters will correspond to a sequence of more than one WCHAR in the WCHAR[] in your program.
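If that guess is right, string code that assumes "one unit = one character" would need to step over surrogate pairs. A rough sketch of what that could look like, assuming the buffer holds UTF-16 and using `uint16_t` in place of WCHAR so it compiles anywhere (the function names are hypothetical, not part of WINAPI):

```c
#include <stddef.h>
#include <stdint.h>

/* High surrogates occupy 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Count code points (not 16-bit units) in a UTF-16 buffer of len units.
 * A well-formed high+low pair counts as one character; an unpaired
 * surrogate is counted as one unit on its own rather than rejected. */
size_t utf16_count_code_points(const uint16_t *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    while (i < len) {
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1]))
            i += 2;   /* one character stored in two units */
        else
            i += 1;   /* one character stored in one unit */
        count++;
    }
    return count;
}
```

With the units { 0x0041, 0xD83D, 0xDE00, 0x0042 } the buffer length is 4 but the count is 3. So the strings themselves do not break; only code that indexes or splits by unit can land in the middle of a pair.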
 
Thanks for the reply.

By accident I found this in my offline help:

Wide characters are multilingual character codes that are always 16 bits wide. The type for character constants is char; for wide characters, the type is wchar_t. Since wide characters are always a fixed size, using wide characters simplifies programming with international character sets.

Also, if you look at the Windows Character Map in advanced view, you cannot enter more than 4 characters into the "Go to Unicode" edit box.

To me it looks like Windows has support for UTF-16, but a bit abridged.
 
