C# GetString method from Encoding's Unicode class

Silicon Waffle · Jun 11, 2015

I have a byte array (each character consumes 4 bytes) of size 64, for example.
Now I decode it using

Code:

byte[] bytes={'z','\0','\0','\0','ý','\0','\0','\0','ó','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0',...};
string s=Encoding.Unicode.GetString(bytes);

Amazingly, after the code is executed, s="zýó";

But if my bytes contain a string, e.g. "劉三好" (the byte array would then be all numbers representing these Chinese characters), then after I execute the above line of code I get this string too. How can that be done?

jedishrfu · Jun 11, 2015

Check out the Unicode tables:

http://en.wikipedia.org/wiki/Unicode

Unicode characters in the lowest plane (the Basic Multilingual Plane) need two bytes, so I'd expect two non-zero bytes followed by two zero bytes... Also, it looks like your string is in least-significant-byte-first order, i.e. little-endian order.
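To make the two-bytes-versus-four-bytes point concrete, here is a small sketch that encodes the example string both ways and prints the raw bytes. It assumes a .NET version that supports top-level statements; the variable names are only illustrative.

```csharp
using System;
using System.Text;

string text = "劉三好";

// Encoding.Unicode is UTF-16 little-endian: two bytes per BMP character.
byte[] utf16 = Encoding.Unicode.GetBytes(text);

// Encoding.UTF32 is UTF-32 little-endian: four bytes per character,
// so each BMP character is followed by two zero bytes.
byte[] utf32 = Encoding.UTF32.GetBytes(text);

Console.WriteLine(utf16.Length);                  // 6
Console.WriteLine(utf32.Length);                  // 12
Console.WriteLine(BitConverter.ToString(utf16));  // hex dump, low byte first
Console.WriteLine(BitConverter.ToString(utf32));

// Decoding with the matching encoding gives the original string back.
Console.WriteLine(Encoding.Unicode.GetString(utf16) == text);  // True
Console.WriteLine(Encoding.UTF32.GetString(utf32) == text);    // True
```

In the UTF-32 dump, every group of four bytes ends in two zero bytes, which matches the layout described above.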

Mark44 · Jun 11, 2015

Silicon Waffle said:

I have a byte array (each character consumes 4 bytes) of size 64 for example

A byte is an 8-bit unsigned integer (see https://msdn.microsoft.com/en-us/library/system.byte(v=vs.100).aspx). The characters you show in your byte array are one byte each, not 4 bytes. Also, 4 bytes isn't 64 bits, it's 32 bits.

Boing3000 · Jun 22, 2015

Silicon Waffle said:
I have a byte array (each character consumes 4 bytes) of size 64, for example.
Now I decode it using
Code:
byte[] bytes={'z','\0','\0','\0','ý','\0','\0','\0','ó','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0','\0',...};
string s=Encoding.Unicode.GetString(bytes);
Amazingly, after the code is executed, s="zýó";

But if my bytes contain a string, e.g. "劉三好" (the byte array would then be all numbers representing these Chinese characters), then after I execute the above line of code I get this string too. How can that be done?

This code does not compile. The correct way to initialize the byte array is

Code:

byte[] bytes={(byte)'z', (byte)'\0',(byte)'\0',(byte)'\0',(byte)'ý',...
string s=Encoding.Unicode.GetString(bytes);

Such a call will return s="z\0ý\0......" and not what you said.
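A short sketch of that result, using a hypothetical 16-byte sample (three characters at four bytes each, plus four bytes of zero padding) in place of the elided array. It assumes top-level statements are available.

```csharp
using System;
using System.Text;

// Hypothetical sample data: 'z', 'ý', 'ó' stored as four little-endian
// bytes each, followed by four zero bytes of padding.
byte[] bytes =
{
    (byte)'z', 0, 0, 0,
    (byte)'ý', 0, 0, 0,
    (byte)'ó', 0, 0, 0,
    0, 0, 0, 0,
};

// Encoding.Unicode reads two bytes per character, so every second
// decoded character is a NUL (U+0000).
string s = Encoding.Unicode.GetString(bytes);

Console.WriteLine(s.Length);                  // 8
Console.WriteLine(s == "z\0ý\0ó\0\0\0");      // True
```

NUL characters are often invisible when a string is printed, so the result can look like "zýó" on screen even though it contains eight characters.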
I suppose the bytes come from a file. If the data is UTF-32 (not UTF-16, which is what Encoding.Unicode means in .NET), you should use

Code:

string s=Encoding.UTF32.GetString(bytes);

Note that this string ends with trailing \0 characters that probably need to be trimmed...
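As a rough sketch of the decode-then-clean step, assuming the data really is little-endian UTF-32 with zero padding at the end (the 16-byte sample array is hypothetical):

```csharp
using System;
using System.Text;

// Hypothetical sample data: three UTF-32 little-endian characters
// followed by four zero bytes of padding.
byte[] bytes =
{
    (byte)'z', 0, 0, 0,
    (byte)'ý', 0, 0, 0,
    (byte)'ó', 0, 0, 0,
    0, 0, 0, 0,
};

// Encoding.UTF32 reads four bytes per character; the padding decodes
// to a trailing NUL character.
string raw = Encoding.UTF32.GetString(bytes);

// TrimEnd removes the trailing NUL characters left over from the padding.
string s = raw.TrimEnd('\0');

Console.WriteLine(raw.Length);     // 4
Console.WriteLine(s.Length);       // 3
Console.WriteLine(s == "zýó");     // True
```

TrimEnd only strips NULs at the end of the string; if the buffer could contain data after the first NUL, cutting at the first '\0' with IndexOf and Substring may be the better choice.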
