Understanding UTF-16 Encoding for Multilingual Character Sets

AI Thread Summary
In UTF-8, ASCII characters are represented as 8 bits, while other characters can take 2 to 4 bytes. Conversely, UTF-16 primarily uses 2 bytes per character but can also accommodate 4 bytes for characters outside the Basic Multilingual Plane (BMP). This distinction is crucial for programming in environments like WINAPI, where WCHAR is a 2-byte unit. Concerns arise about handling 4-byte characters, as they may not directly map to a single WCHAR, potentially complicating string parsing and handling. The discussion clarifies that WCHAR is a 16-bit unit, and UTF-16 characters exceeding the BMP are represented by surrogate pairs, requiring multiple WCHARs. Windows supports UTF-16, but its implementation may be limited, as indicated by the character map's restrictions.
kandelabr
Hello all,

I hope I understand this correctly:
In UTF-8, all ASCII characters are 8 bits (one byte) long, and other, more complex characters take 2 to 4 bytes.

Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?

I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? Do all my strings go to hell?

Thanks.
 
kandelabr said:
In UTF-8, all ASCII characters are 8 bits (one byte) long, and other, more complex characters take 2 to 4 bytes.
Yes.
kandelabr said:
Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?
No; actually, according to Wikipedia, UTF-16 is also a variable-length encoding that can use either 2 or 4 bytes per character. You may be thinking of UCS-2, which is a fixed-length encoding using 2 bytes per character; in that encoding it's impossible to represent any character with a Unicode code point greater than 65535.
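To make the variable length concrete, here is a rough sketch of the surrogate-pair arithmetic (my own illustration, not something from the thread or the WINAPI docs): a code point above U+FFFF has 0x10000 subtracted from it, and the remaining 20 bits are split across a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF), so the character occupies two 16-bit code units, i.e. 4 bytes.

#include <stdint.h>
#include <stdio.h>

/* Encode a Unicode code point as UTF-16.
 * Returns the number of 16-bit code units written (1 or 2). */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0xFFFF) {                          /* BMP: a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* leaves a 20-bit value   */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate          */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate           */
    return 2;
}

int main(void)
{
    uint16_t u[2];
    int n = utf16_encode(0x1D11E, u);  /* U+1D11E MUSICAL SYMBOL G CLEF */
    printf("%d unit(s): %04X %04X\n", n, u[0], n == 2 ? u[1] : 0);
    return 0;                          /* prints: 2 unit(s): D834 DD1E  */
}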
kandelabr said:
I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? Do all my strings go to hell?
I'm not familiar with Windows programming, but it might be that WCHAR is really just a 16-bit unit of data that doesn't necessarily map one-to-one to a single character. In that case, what you actually pass to the function is the UTF-16 encoded representation of the string you want to process (or perhaps it uses a different representation). If your string includes characters that are represented by more than one 16-bit unit in UTF-16, then those characters will correspond to a sequence of more than one WCHAR in the WCHAR[] array in your program.
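If it helps, here is a minimal sketch (my own, assuming the usual 16-bit WCHAR and the standard UTF-16 surrogate ranges) of counting characters rather than 16-bit units in such a string:

#include <stddef.h>

/* Count Unicode code points in a zero-terminated UTF-16 string,
 * treating a high surrogate (0xD800-0xDBFF) followed by a low
 * surrogate (0xDC00-0xDFFF) as one character. The element type
 * stands in for WCHAR on Windows. */
static size_t utf16_codepoint_count(const unsigned short *s)
{
    size_t count = 0;
    for (size_t i = 0; s[i] != 0; ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;               /* skip the low surrogate of the pair */
        }
        ++count;
    }
    return count;
}

So code that assumes "one WCHAR = one character" doesn't break outright when a character outside the BMP appears; it just sees that character as two units, and anything that splits or truncates a string between the two halves will corrupt it.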
 
Thanks for the reply.

By accident I found this in my offline help:

Wide characters are multilingual character codes that are always 16 bits wide. The type for character constants is char; for wide characters, the type is wchar_t. Since wide characters are always a fixed size, using wide characters simplifies programming with international character sets.

Also, if you look at the Windows Character Map (advanced view), you cannot enter more than 4 characters into the "Go to Unicode" edit box.

To me it looks like Windows has support for UTF-16, but a bit abridged.
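For what it's worth, a tiny check along those lines (my own sketch, assuming a compiler where wchar_t is 16 bits, as with MSVC; on a platform with a 32-bit wchar_t it would print 1):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP. */
    const wchar_t *s = L"\U0001D11E";

    /* With a 16-bit wchar_t this is stored as a surrogate pair,
     * so the length in code units is 2, not 1. */
    printf("code units: %u\n", (unsigned)wcslen(s));
    return 0;
}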
 
