Understanding UTF-16 Encoding for Multilingual Character Sets

  • #1
kandelabr
Hello all,

I hope I understand this correctly:
In UTF-8, ASCII characters are encoded as a single byte (8 bits), and other, more sophisticated, characters take 2 to 4 bytes.

Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?

I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? Do all my strings go to hell?

Thanks.
 
  • #2
kandelabr said:
In UTF-8, ASCII characters are encoded as a single byte (8 bits), and other, more sophisticated, characters take 2 to 4 bytes.
Yes.
kandelabr said:
Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?
No; actually, according to Wikipedia, UTF-16 is also a variable-length encoding that can use either 2 or 4 bytes per character. You may be thinking of UCS-2, which is a fixed-length encoding using 2 bytes per character; in that encoding it's impossible to represent any character with a Unicode code point greater than 65535.
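To make the surrogate-pair mechanism concrete, here is a minimal sketch in plain C (my own illustration; utf16_encode is a made-up helper, not a WinAPI or library function) of how a code point above U+FFFF is split into two 16-bit code units:

Code:
#include <stdint.h>
#include <stdio.h>

/* Encode one Unicode code point as UTF-16 code units.
   Returns the number of 16-bit units written (1 or 2). */
static int utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp <= 0xFFFF) {          /* BMP character: one code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;               /* 20 bits remain */
    out[0] = 0xD800 | (uint16_t)(cp >> 10);    /* high surrogate */
    out[1] = 0xDC00 | (uint16_t)(cp & 0x3FF);  /* low surrogate  */
    return 2;
}

int main(void) {
    uint16_t u[2];
    int n = utf16_encode(0x1F600, u);  /* U+1F600 takes 4 bytes in UTF-8 */
    printf("%d unit(s): %04X %04X\n", n, u[0], n == 2 ? u[1] : 0);
    /* prints: 2 unit(s): D83D DE00 */
    return 0;
}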
kandelabr said:
I'm asking because I'm a bit worried - in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. Also, all (my) string parsing and handling is based on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? Do all my strings go to hell?
I'm not familiar with Windows programming, but it might be that WCHAR is really just a 16-bit unit of data that doesn't necessarily have to map one-to-one to a single character. In that case, what you actually pass to the function is the UTF-16 encoded representation of the string you want to process (or perhaps it uses a different representation). If your string includes characters that are represented by more than one 16-bit unit in UTF-16, then those characters will correspond to a sequence of more than one WCHAR in the WCHAR[] in your program.
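As an illustration of that point, here is a small sketch in plain C (my own code; uint16_t stands in for WCHAR, and utf16_strlen is a hypothetical helper, not a real API) that counts characters rather than 16-bit units by watching for high surrogates:

Code:
#include <stdint.h>
#include <stdio.h>

/* Count characters (code points) in a NUL-terminated UTF-16 string.
   A high surrogate (0xD800..0xDBFF) means the character continues
   into the next 16-bit unit, so we skip one extra unit. */
static size_t utf16_strlen(const uint16_t *s) {
    size_t count = 0;
    while (*s) {
        if (*s >= 0xD800 && *s <= 0xDBFF && s[1] != 0)
            s++;                 /* skip the low surrogate */
        s++;
        count++;
    }
    return count;
}

int main(void) {
    /* "A" followed by U+1F600 as a surrogate pair:
       3 code units, but only 2 characters */
    const uint16_t s[] = { 0x0041, 0xD83D, 0xDE00, 0x0000 };
    printf("characters: %zu\n", utf16_strlen(s));  /* prints 2 */
    return 0;
}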
 
  • #3
Thanks for the reply.

By accident I found this in my offline help:

Wide characters are multilingual character codes that are always 16 bits wide. The type for character constants is char; for wide characters, the type is wchar_t. Since wide characters are always a fixed size, using wide characters simplifies programming with international character sets.

Also, if you look at the Windows Character Map in advanced view, you cannot enter more than 4 hex digits into the "Go to Unicode" edit box.

To me it looks like Windows has support for UTF-16, but it's a bit abridged.
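One quick way to see the difference in your own code (my own sketch; it assumes a Windows toolchain, where wchar_t is 16 bits) is that wcslen counts wchar_t units, not characters:

Code:
#include <wchar.h>
#include <stdio.h>

int main(void) {
    /* U+1F600 lies outside the BMP, so on Windows the compiler
       stores it as a surrogate pair of two wchar_t units. */
    const wchar_t *s = L"\U0001F600";
    printf("wcslen: %zu\n", wcslen(s));  /* prints 2 on Windows
                                            (1 on Linux, where
                                            wchar_t is 32 bits) */
    return 0;
}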
 

1. What is Unicode and why is it important in computer science?

Unicode is a character encoding standard that allows computers to represent and process text from any language in a consistent manner. It is important in computer science because it enables the use of multiple languages and characters in a single system, making international communication and data processing more efficient and accurate.

2. How does Unicode differ from ASCII and other character encoding systems?

Unlike ASCII, which only defines 128 characters, mostly for English, Unicode includes well over a hundred thousand characters from languages and scripts around the world. This much wider repertoire makes it more versatile and inclusive than older character encoding systems.

3. How does UTF-16 handle characters that don't fit in a single 16-bit unit?

Characters with code points above U+FFFF are represented in UTF-16 by "surrogate pairs": two reserved 16-bit code units that are used together to encode a single character. For example, U+1F600 is encoded as the pair 0xD83D 0xDE00. This mechanism lets UTF-16 cover the full Unicode range up to U+10FFFF.

4. Can Unicode be used in all programming languages and operating systems?

Yes, Unicode is a universal standard and can be used in all programming languages and operating systems. However, it may require specific libraries or functions to properly handle and display different characters depending on the language or platform being used.

5. How does Unicode impact web development and online communication?

Unicode has greatly simplified web development and online communication by allowing for the use of multiple languages and characters in a single webpage or online platform. This has made it easier for people from different parts of the world to communicate and access information online without any language barriers.
