Understanding UTF-16 Encoding for Multilingual Character Sets

AI Thread Summary
In UTF-8, ASCII characters are represented as 8 bits, while other characters can take 2 to 4 bytes. Conversely, UTF-16 primarily uses 2 bytes per character but can also accommodate 4 bytes for characters outside the Basic Multilingual Plane (BMP). This distinction is crucial for programming in environments like WINAPI, where WCHAR is a 2-byte unit. Concerns arise about handling 4-byte characters, as they may not directly map to a single WCHAR, potentially complicating string parsing and handling. The discussion clarifies that WCHAR is a 16-bit unit, and UTF-16 characters exceeding the BMP are represented by surrogate pairs, requiring multiple WCHARs. Windows supports UTF-16, but its implementation may be limited, as indicated by the character map's restrictions.
kandelabr
Hello all,

I hope I understand this correctly:
In UTF-8, all ASCII characters are 8 bits (one byte) long, and other, more complex characters take 2 to 4 bytes.

Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?

I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? Do all my strings go to hell?

Thanks.
 
kandelabr said:
In UTF-8, all ASCII characters are 8 bits (one byte) long, and other, more complex characters take 2 to 4 bytes.
Yes.
kandelabr said:
Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?
No; actually, according to Wikipedia, UTF-16 is also a variable-length encoding that can use either 2 or 4 bytes per character. You may be thinking of UCS-2, which is a fixed-length encoding using 2 bytes per character; in that encoding it's impossible to represent any character with a Unicode code point greater than 65535.
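To make the variable length concrete, here is a rough sketch of the surrogate-pair arithmetic (my own illustration, not something from the thread or the WINAPI docs): a code point above U+FFFF has 0x10000 subtracted from it, and the remaining 20 bits are split across a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF), so the character occupies two 16-bit code units, i.e. 4 bytes.

#include <stdint.h>
#include <stdio.h>

/* Encode a Unicode code point as UTF-16.
 * Returns the number of 16-bit code units written (1 or 2). */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0xFFFF) {                          /* BMP: a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* leaves a 20-bit value   */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate          */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate           */
    return 2;
}

int main(void)
{
    uint16_t u[2];
    int n = utf16_encode(0x1D11E, u);  /* U+1D11E MUSICAL SYMBOL G CLEF */
    printf("%d unit(s): %04X %04X\n", n, u[0], n == 2 ? u[1] : 0);
    return 0;                          /* prints: 2 unit(s): D834 DD1E  */
}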
kandelabr said:
I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? Do all my strings go to hell?
I'm not familiar with Windows programming, but it might be that WCHAR is really just a 16-bit unit of data that doesn't necessarily map one-to-one to a single character. In that case, what you actually pass to the function is the UTF-16 encoded representation of the string you want to process (or perhaps it uses a different representation). If your string includes characters that are represented by more than one 16-bit unit in UTF-16, then those characters will correspond to a sequence of more than one WCHAR in the WCHAR[] array in your program.
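If it helps, here is a minimal sketch (my own, assuming the usual 16-bit WCHAR and the standard UTF-16 surrogate ranges) of counting characters rather than 16-bit units in such a string:

#include <stddef.h>

/* Count Unicode code points in a zero-terminated UTF-16 string,
 * treating a high surrogate (0xD800-0xDBFF) followed by a low
 * surrogate (0xDC00-0xDFFF) as one character. The element type
 * stands in for WCHAR on Windows. */
static size_t utf16_codepoint_count(const unsigned short *s)
{
    size_t count = 0;
    for (size_t i = 0; s[i] != 0; ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;               /* skip the low surrogate of the pair */
        }
        ++count;
    }
    return count;
}

So code that assumes "one WCHAR = one character" doesn't break outright when a character outside the BMP appears; it just sees that character as two units, and anything that splits or truncates a string between the two halves will corrupt it.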
 
Thanks for the reply.

By accident I found this in my offline help:

Wide characters are multilingual character codes that are always 16 bits wide. The type for character constants is char; for wide characters, the type is wchar_t. Since wide characters are always a fixed size, using wide characters simplifies programming with international character sets.

Also, if you look at the Windows Character Map (advanced view), you cannot enter more than 4 characters into the "Go to Unicode" edit box.

To me it looks like Windows has support for UTF-16, but a bit abridged.
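For what it's worth, a tiny check along those lines (my own sketch, assuming a compiler where wchar_t is 16 bits, as with MSVC; on a platform with a 32-bit wchar_t it would print 1):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP. */
    const wchar_t *s = L"\U0001D11E";

    /* With a 16-bit wchar_t this is stored as a surrogate pair,
     * so the length in code units is 2, not 1. */
    printf("code units: %u\n", (unsigned)wcslen(s));
    return 0;
}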
 
