C# GetString method from Encoding's Unicode class

  • Context: C# 
  • Thread starter: Silicon Waffle
  • Tags: Class, Method

Discussion Overview

The discussion revolves around the behavior of the C# Encoding.Unicode.GetString method when decoding a byte array. Participants explore how different byte representations affect the resulting string, particularly the difference between UTF-16 (which Encoding.Unicode denotes in .NET) and UTF-32.

Discussion Character

  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant describes a byte array and its decoding using Encoding.Unicode, noting that the resulting string is "zýó".
  • Another participant references Unicode tables, suggesting that characters in the lowest plane require two bytes and that the byte order appears to be little-endian.
  • A different participant points out that the characters in the byte array are represented as single bytes, not four bytes, and corrects the misunderstanding about byte size.
  • One participant points out that the provided code does not compile, suggesting that the byte array be initialized with explicit casts to byte, and notes that the resulting string would include null characters.
  • Another participant proposes that if the bytes originate from a file and are in UTF-32 format, the correct method to decode them would be using Encoding.UTF32.GetString.

Areas of Agreement / Disagreement

Participants express differing views on the correct interpretation of byte sizes and encoding methods, indicating that multiple competing views remain unresolved.

Contextual Notes

There are unresolved assumptions regarding the encoding format of the byte array and the implications of using different encoding methods on the resulting string.

Silicon Waffle
I have a byte array (each character consumes 4 bytes) of size 64, for example.
Now I decode it using
Code:
byte[] bytes={'z','\0','\0','\0','ý','\0','\0','\0','ó','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0',...};
string s=Encoding.Unicode.GetString(bytes);

Amazingly, after the code is executed, s="zýó";

But if my bytes contain a string, e.g. "劉三好" (the byte array would be all numbers representing these Chinese characters), then after I execute the above line of code I get this string too. How can that be done?
 
Check out the Unicode tables:

http://en.wikipedia.org/wiki/Unicode

Unicode characters in the lowest plane need at most two bytes, so I'd expect up to two non-zero bytes followed by two zero bytes... Also, it looks like your string is in least-significant-byte-first order, i.e. little-endian.
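To make the byte layout concrete, here is a minimal sketch that encodes the thread's sample text both ways and prints the raw bytes. The variable names are only illustrative, and the string is written with \u escapes so the result does not depend on how the source file is saved.

```csharp
using System;
using System.Text;

// "zýó" written with escapes: U+007A, U+00FD, U+00F3.
string text = "z\u00FD\u00F3";

byte[] utf16 = Encoding.Unicode.GetBytes(text); // UTF-16, little-endian
byte[] utf32 = Encoding.UTF32.GetBytes(text);   // UTF-32, little-endian

// BitConverter.ToString renders each byte as two hex digits separated by '-'.
string utf16Hex = BitConverter.ToString(utf16);
string utf32Hex = BitConverter.ToString(utf32);

Console.WriteLine(utf16Hex); // 7A-00-FD-00-F3-00
Console.WriteLine(utf32Hex); // 7A-00-00-00-FD-00-00-00-F3-00-00-00
```

In both cases the low-order byte comes first, and the UTF-32 output matches the pattern of one value byte followed by three zero bytes shown in the question.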
 
Silicon Waffle said:
I have a byte array (each character consumes 4 bytes) of size 64 for example
A byte is an 8-bit unsigned integer (see https://msdn.microsoft.com/en-us/library/system.byte(v=vs.100).aspx). The characters you show in your byte array are one byte each, not 4 bytes. Also, 4 bytes isn't 64 bits, it's 32 bits.
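As a small illustration of the byte range, the sketch below compares the code point of 'ý' with that of '劉' from the question. It only inspects code point values; the variable names are made up for the example.

```csharp
using System;

// 'ý' is U+00FD, so its code point fits in a single byte.
byte yAcute = (byte)'\u00FD';

// '劉' is U+5289; its code point is larger than byte.MaxValue (255).
int liu = '\u5289';
bool fitsInByte = liu <= byte.MaxValue;

Console.WriteLine(yAcute);     // 253
Console.WriteLine(liu);        // 21129
Console.WriteLine(fitsInByte); // False
```

So a character such as '劉' can never appear as a single element of a byte[]; it has to be spread over several bytes by whichever encoding is in use.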
 
Silicon Waffle said:
I have a byte array (each character consumes 4 bytes) of size 64, for example.
Now I decode it using
Code:
byte[] bytes={'z','\0','\0','\0','ý','\0','\0','\0','ó','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0',...};
string s=Encoding.Unicode.GetString(bytes);

Amazingly, after the code is executed, s="zýó";

But if my bytes contain a string, e.g. "劉三好" (the byte array would be all numbers representing these Chinese characters), then after I execute the above line of code I get this string too. How can that be done?
This code does not compile. A correct way to initialize the byte array is
Code:
byte[] bytes={(byte)'z', (byte)'\0',(byte)'\0',(byte)'\0',(byte)'ý',...
string s=Encoding.Unicode.GetString(bytes);
Such a call will return s="z\0ý\0......" and not what you said.
I suppose the bytes come from a file. If the data is UTF-32 (not Unicode/UTF-16), you should use
Code:
string s=Encoding.UTF32.GetString(bytes);
Note that this string ends with trailing \0 characters that probably need to be trimmed...
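Putting the points above together, here is a minimal sketch that assumes a zero-padded, little-endian UTF-32 buffer. The 16-byte array is a hypothetical stand-in for the elided one in the question, and TrimEnd('\0') is just one way to do the cleanup.

```csharp
using System;
using System.Text;

// Hypothetical 16-byte buffer modeled on the thread's example:
// three UTF-32 (little-endian) characters followed by zero padding.
byte[] bytes =
{
    0x7A, 0, 0, 0,   // 'z' (U+007A)
    0xFD, 0, 0, 0,   // 'ý' (U+00FD)
    0xF3, 0, 0, 0,   // 'ó' (U+00F3)
    0,    0, 0, 0    // padding
};

// Decoding as UTF-16 reads two bytes per character, so NULs get interleaved.
string asUtf16 = Encoding.Unicode.GetString(bytes);

// Decoding as UTF-32 consumes four bytes per character.
string asUtf32 = Encoding.UTF32.GetString(bytes);

// Strip the trailing NUL characters left by the padding.
string trimmed = asUtf32.TrimEnd('\0');

Console.WriteLine(asUtf16.Length);              // 8
Console.WriteLine(asUtf32.Length);              // 4
Console.WriteLine(trimmed == "z\u00FD\u00F3");  // True

// The same calls handle the Chinese example: four bytes per character.
byte[] chinese = Encoding.UTF32.GetBytes("\u5289\u4E09\u597D"); // "劉三好"
string roundTrip = Encoding.UTF32.GetString(chinese);
Console.WriteLine(chinese.Length);                          // 12
Console.WriteLine(roundTrip == "\u5289\u4E09\u597D");       // True
```

The decoder only gives back the original text when it matches the encoding that produced the bytes, which is why the choice between Encoding.Unicode and Encoding.UTF32 matters here.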
 
