Unicode: a few explanations
kandelabr
Sep22-11, 04:38 PM
Hello all,
I hope I understand this correctly:
In UTF-8, all ASCII characters are 8 bits (1 byte) long, and the other, more sophisticated characters take 2 to 4 bytes.
Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?
I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? All my strings go to hell?
Thanks.
diazona
Sep22-11, 06:47 PM
In UTF-8, all ASCII characters are 8 bits (1 byte) long, and the other, more sophisticated characters take 2 to 4 bytes.
Yes.
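To make the byte counts concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name utf8_encode is made up for illustration and is not part of any library; it only shows how the 1, 2, 3 and 4 byte forms come about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes up to 4 bytes into out and returns the number of bytes written,
   or 0 if cp is a surrogate (U+D800..U+DFFF) or above U+10FFFF. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {                       /* ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {     /* surrogates are not characters */
        return 0;
    }
    if (cp <= 0xFFFF) {                     /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

For example, 'A' (U+0041) comes out as the single byte 0x41, while the euro sign (U+20AC) takes the three bytes E2 82 AC.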
Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?
No; actually, according to Wikipedia (http://en.wikipedia.org/wiki/UTF-16/UCS-2), UTF-16 is also a variable-length encoding that can use either 2 or 4 bytes per character. You may be thinking of UCS-2, which is a fixed-length encoding using 2 bytes per character; in that encoding it's impossible to represent any character with a Unicode code point greater than 65535.
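As a rough illustration of how UTF-16 handles code points above 65535, here is a sketch that encodes one code point into either one 16-bit unit or a pair of them (a "surrogate pair"). The name utf16_encode is hypothetical, chosen just for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Writes 1 or 2 16-bit units into out and returns how many were written,
   or 0 if cp is a surrogate value or above U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                           /* reserved for surrogate pairs */
    }
    if (cp <= 0xFFFF) {                     /* fits in one unit (2 bytes) */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                   /* needs two units (4 bytes) */
        uint32_t v = cp - 0x10000;          /* 20 bits left to store */
        out[0] = (uint16_t)(0xD800 | (v >> 10));    /* high surrogate: top 10 bits */
        out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));  /* low surrogate: bottom 10 bits */
        return 2;
    }
    return 0;                               /* outside the Unicode range */
}
```

With this, U+20AC stays a single unit 0x20AC, while U+1F600 becomes the two units 0xD83D 0xDE00. UCS-2 has no equivalent of the second branch, which is why it stops at 65535.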
I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? All my strings go to hell?
I'm not familiar with Windows programming, but it might be that WCHAR is really just a 16-bit unit of data that doesn't necessarily have to map one-to-one to a single character. In that case, what you actually pass to the function is the UTF-16 encoded representation of the string you want to process (or perhaps it uses a different representation). If your string includes characters that are represented by more than one 16-bit unit in UTF-16, then those characters will correspond to a sequence of more than one WCHAR in the WCHAR[] in your program.
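If that is how it works, string code that wants to count or step over characters has to look for surrogate pairs instead of assuming one unit per character. A minimal sketch, using uint16_t as a stand-in for WCHAR so it compiles anywhere (that substitution, and the name utf16_count_code_points, are my assumptions, not anything from the Windows API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the code points in a UTF-16 buffer of
   n_units 16-bit units. A high surrogate followed by a low surrogate
   counts as one character; an unpaired surrogate is counted as one
   (malformed) character rather than rejected. */
size_t utf16_count_code_points(const uint16_t *s, size_t n_units)
{
    assert(s != NULL || n_units == 0);

    size_t count = 0;
    size_t i = 0;
    while (i < n_units) {
        uint16_t u = s[i];
        int is_pair = u >= 0xD800 && u <= 0xDBFF &&        /* high surrogate */
                      i + 1 < n_units &&
                      s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF; /* low surrogate */
        i += is_pair ? 2 : 1;   /* skip both units of a pair together */
        count++;
    }
    return count;
}
```

For the four units 0x0041 0xD83D 0xDE00 0x0042 this returns 3, so the array length and the character count differ as soon as a character outside the first 65536 code points shows up. Code that only indexes, copies, or concatenates whole strings is unaffected; the risk is in splitting or truncating in the middle of a pair.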
kandelabr
Sep23-11, 04:09 AM
Thanks for the reply.
By accident I found this in my offline help:
Wide characters are multilingual character codes that are always 16 bits wide. The type for character constants is char; for wide characters, the type is wchar_t. Since wide characters are always a fixed size, using wide characters simplifies programming with international character sets.
Also, if you look at the Windows Character Map in advanced view, you cannot enter more than 4 characters into the "Go to Unicode" edit box.
To me it looks like Windows has support for UTF-16, but a bit abridged.