Understanding UTF-16 Encoding for Multilingual Character Sets

  • Thread starter Thread starter kandelabr
  • Start date Start date
Click For Summary
In UTF-8, ASCII characters are represented as 8 bits, while other characters can take 2 to 4 bytes. Conversely, UTF-16 primarily uses 2 bytes per character but can also accommodate 4 bytes for characters outside the Basic Multilingual Plane (BMP). This distinction is crucial for programming in environments like WINAPI, where WCHAR is a 2-byte unit. Concerns arise about handling 4-byte characters, as they may not directly map to a single WCHAR, potentially complicating string parsing and handling. The discussion clarifies that WCHAR is a 16-bit unit, and UTF-16 characters exceeding the BMP are represented by surrogate pairs, requiring multiple WCHARs. Windows supports UTF-16, but its implementation may be limited, as indicated by the character map's restrictions.
kandelabr
Messages
110
Reaction score
0
Hello all,

I hope I understand this correctly:
In UTF-8 all ASCII characters are 8 bits long and other, more sophisticated, 2 to 4 bytes.

Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?

I'm asking because I'm a bit worried - in WINAPI (plain C), every unicode function is based on WCHAR that is 2 bytes long. Also all (my) string parsing and handling is based on "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? All my strings go to hell?

Thanks.
 
Technology news on Phys.org
kandelabr said:
In UTF-8 all ASCII characters are 8 bits long and other, more sophisticated, 2 to 4 bytes.
Yes.
kandelabr said:
Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?
No; actually, according to Wikipedia, UTF-16 is also a variable-length encoding that can use either 2 or 4 bytes per character. You may be thinking of UCS-2, which is a fixed-length encoding using 2 bytes per character; in that encoding it's impossible to represent any character with a Unicode code point greater than 65535.
kandelabr said:
I'm asking because I'm a bit worried - in WINAPI (plain C), every unicode function is based on WCHAR that is 2 bytes long. Also all (my) string parsing and handling is based on "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? All my strings go to hell?
I'm not familiar with Windows programming, but it might be that WCHAR is really just a 16-bit unit of data that doesn't necessarily have to map one-to-one to a single character. In that case, what you actually pass to the function is the UTF-16 encoded representation of the string you want to process (or perhaps it uses a different representation). If your string includes characters that are represented by more than one 16-byte unit in UTF-16, then those characters will correspond to a sequence of more than one WCHAR in the WCHAR[] in your program.
 
thanks for the reply

by accident i found this in my offline help:

Wide characters are multilingual character codes that are always 16 bits wide. The type for character constants is char; for wide characters, the type is wchar_t. Since wide characters are always a fixed size, using wide characters simplifies programming with international character sets.

also, if you look at the windows character map - advanced view, you cannot enter more than 4 characters into edit box "Go to unicode".

to me it looks like windows has support for utf-16, but a bit abridged.
 
Learn If you want to write code for Python Machine learning, AI Statistics/data analysis Scientific research Web application servers Some microcontrollers JavaScript/Node JS/TypeScript Web sites Web application servers C# Games (Unity) Consumer applications (Windows) Business applications C++ Games (Unreal Engine) Operating systems, device drivers Microcontrollers/embedded systems Consumer applications (Linux) Some more tips: Do not learn C++ (or any other dialect of C) as a...

Similar threads

Replies
2
Views
2K
Replies
1
Views
3K
  • · Replies 3 ·
Replies
3
Views
3K
  • · Replies 9 ·
Replies
9
Views
2K
  • · Replies 10 ·
Replies
10
Views
2K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 7 ·
Replies
7
Views
3K
  • · Replies 2 ·
Replies
2
Views
3K