Understanding UTF-16 Encoding for Multilingual Character Sets

kandelabr · Sep 22, 2011

Hello all,

I hope I understand this correctly:
In UTF-8 all ASCII characters are 8 bits long and other, more sophisticated, 2 to 4 bytes.

Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?

I'm asking because I'm a bit worried - in WINAPI (plain C), every unicode function is based on WCHAR that is 2 bytes long. Also all (my) string parsing and handling is based on "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? All my strings go to hell?

Thanks.

diazona · Sep 22, 2011

kandelabr said:

In UTF-8 all ASCII characters are 8 bits long and other, more sophisticated, 2 to 4 bytes.

Yes.

kandelabr said:

Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?

No; actually, according to Wikipedia, UTF-16 is also a variable-length encoding that can use either 2 or 4 bytes per character. You may be thinking of UCS-2, which is a fixed-length encoding using 2 bytes per character; in that encoding it's impossible to represent any character with a Unicode code point greater than 65535.
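
To make the 2-or-4-byte point concrete, here is a minimal sketch in C of how a single code point maps to UTF-16 code units. The helper name utf16_encode is made up for illustration and is not part of any Windows or standard library API. Code points above 65535 are split into a pair of 16-bit "surrogate" units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written.
   Returns 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single 16-bit unit */
        return 1;
    }

    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+20AC (the euro sign) comes out as the single unit 0x20AC, while U+1D11E (musical G clef) comes out as the two units 0xD834 0xDD1E, i.e. 4 bytes.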

kandelabr said:

I'm asking because I'm a bit worried - in WINAPI (plain C), every unicode function is based on WCHAR that is 2 bytes long. Also all (my) string parsing and handling is based on "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? All my strings go to hell?

I'm not familiar with Windows programming, but it might be that WCHAR is really just a 16-bit unit of data that doesn't necessarily have to map one-to-one to a single character. In that case, what you actually pass to the function is the UTF-16 encoded representation of the string you want to process (or perhaps it uses a different representation). If your string includes characters that are represented by more than one 16-bit unit in UTF-16, then those characters will correspond to a sequence of more than one WCHAR in the WCHAR[] in your program.
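
If that guess is right, string code that assumes "one WCHAR = one character" would need to step over surrogate pairs instead. A rough sketch of what that might look like, in C: the function name is hypothetical, and uint16_t stands in for WCHAR so the example builds without <windows.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the code points in a buffer of n UTF-16
   code units. uint16_t is an assumed stand-in for a 16-bit WCHAR. */
size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    assert(s != NULL || n == 0);

    while (i < n) {
        int is_high = (s[i] >= 0xD800 && s[i] <= 0xDBFF);

        if (is_high && i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;     /* high + low surrogate: one character, two units */
        else
            i += 1;     /* ordinary unit, or an unpaired surrogate counted as one */

        count++;
    }
    return count;
}
```

With the four units 0x0041, 0xD834, 0xDD1E, 0x20AC this returns 3, not 4: the middle two units form one character. Code that only ever handles characters below U+10000 never sees the difference, which is probably why the fixed-width assumption usually appears to work.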

kandelabr · Sep 23, 2011

thanks for the reply

by accident i found this in my offline help:

Wide characters are multilingual character codes that are always 16 bits wide. The type for character constants is char; for wide characters, the type is wchar_t. Since wide characters are always a fixed size, using wide characters simplifies programming with international character sets.

also, if you look at the Windows Character Map in advanced view, you cannot enter more than 4 hex digits into the "Go to Unicode" edit box.

to me it looks like Windows has support for UTF-16, but a bit abridged.
