Unicode: a few explanations

  1. Sep 22, 2011 #1
    Hello all,

    I hope I understand this correctly:
    In UTF-8, every ASCII character takes 1 byte (8 bits), and every other character takes 2 to 4 bytes.

    Every character in UTF-16 is 2 bytes long. So, how does one encode a character that is, say, 4 bytes long in UTF-16?

    I'm asking because I'm a bit worried: in WINAPI (plain C), every Unicode function is based on WCHAR, which is 2 bytes long. All of my string parsing and handling also relies on the "fact" that every character takes up 2 bytes. What happens if I enter a 4-byte character? Do all my strings go to hell?

    Thanks.
     
  3. Sep 22, 2011 #2

    diazona

    Homework Helper

    Yes.
    No; actually, according to Wikipedia, UTF-16 is also a variable-length encoding that can use either 2 or 4 bytes per character. You may be thinking of UCS-2, which is a fixed-length encoding using 2 bytes per character; in that encoding it's impossible to represent any character with a Unicode code point greater than 65535.
    I'm not familiar with Windows programming, but it might be that WCHAR is really just a 16-bit unit of data that doesn't necessarily map one-to-one to a single character. In that case, what you actually pass to the function is the UTF-16 encoded representation of the string you want to process (or perhaps it uses a different representation). If your string includes characters that are represented by more than one 16-bit unit in UTF-16, then each of those characters will correspond to a sequence of more than one WCHAR in the WCHAR[] in your program.
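
    To make the "more than one unit per character" case concrete, here is a minimal sketch in plain C of how UTF-16 splits a code point above 65535 into two 16-bit units (a surrogate pair), and how to count characters in a buffer without assuming one unit per character. The function names utf16_encode and utf16_count_code_points are made up for this illustration and are not part of WINAPI; uint16_t stands in for a 16-bit WCHAR, which is an assumption about the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 16-bit units to out and returns how many were written,
 * or 0 if cp is not a valid code point to encode. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* Reject values beyond U+10FFFF and the reserved surrogate range. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* Code points up to U+FFFF fit in a single 16-bit unit. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Otherwise split the remaining 20 bits into two 10-bit halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low (trail) surrogate */
    return 2;
}

/* Hypothetical helper: count code points in a NUL-terminated UTF-16
 * string. A valid surrogate pair counts once; an unpaired surrogate
 * is counted as one code point rather than rejected. */
size_t utf16_count_code_points(const uint16_t *s)
{
    size_t n = 0;

    while (*s != 0) {
        /* s[1] is safe to read here: s[0] is not the terminator. */
        if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2; /* one character stored in two units */
        else
            s += 1; /* one character stored in one unit */
        n++;
    }
    return n;
}
```

    For example, U+1F600 comes out as the two units 0xD83D 0xDE00, so a buffer holding "A", U+1F600, "B" has four 16-bit units but only three characters. Code that indexes or splits such a buffer unit by unit still works for text inside the 2-byte range, but it can cut a pair in half for anything above it.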
     
  4. Sep 23, 2011 #3
    Thanks for the reply.

    by accident i found this in my offline help:

    Also, if you look at the Windows Character Map (advanced view), you cannot enter more than 4 characters into the "Go to Unicode" edit box.

    To me it looks like Windows has support for UTF-16, but a somewhat abridged one.
     