All about Unicode /programming with Unicode [tutorial]

rohanprabhu · Jan 6, 2008

http://www.tritium.frihost.net/tutorials/unicode/all_about_unicode.jpg

NOTE: To view the tutorial properly, your browser must be set up to display Unicode text. Please refer to this article in order to do so: http://tinyurl.com/2v5nfd


Prologue

Long after completing my exams, I now have the urge to write another tutorial, and I chose to cover Unicode. Unicode is a very useful industry standard, and its use is going to grow rapidly in the coming years. Almost all applications today support Unicode. Windows NT/2000/XP use Unicode [in the form of UTF-16] for internal text representation. Java and the .NET environment also use it for internal text representation. Despite its utility, there are very few resources on the net or anywhere else that cover the Unicode standard completely. Of course, you have 'The Unicode Standard 5.0' itself as a free download at the Unicode site, but a whole book is usually read only by enthusiasts in a particular area. As a programmer, or even as a user, you need to know what Unicode is, how it works, what character encodings are, what the transformation formats [UTF] are and what UCS means. How much you need to know depends on your role, but I'd recommend reading up to Chapter 2 even for general users. We Indians are part of the group that extensively uses non-Latin characters for communication, and we as programmers cater to a clientele from various parts of the world, many of whom extensively use non-Latin characters too. So, let's begin with 'all about Unicode'.

Introduction

Unicode is an industry standard which is used to store, retrieve and transmit text in many of the world's writing systems. Unicode's coverage of writing systems is very extensive. It is so extensive that it even covers some invented scripts. Beyond scripts, Unicode also includes commonly used symbols, like the gender symbols for male/female etc. Before going to Unicode, we'll first cover 'character encoding'. Once we learn what character encoding is, we'll go into 'The Unicode family', which is an introduction to the various standards and technologies that come under Unicode. The later chapters are targeted at programmers. In these chapters all numbers are in decimal unless prefixed by 0x or U+ [meaning hexadecimal] or suffixed by 'b' [meaning binary].

Chapter 1: Character Encoding

All data on computers and in electronics is stored in terms of numbers. At the actual hardware level, it may be stored as charge on a capacitor or as a magnetic field, or be transmitted as a particular amplitude or frequency in a waveform, but for programmers, all of these correspond to numbers. The letter 'A' is a number. The symbol it represents is stored in the form of 'glyphs' [glyphs are what make up 'fonts'], which are basically drawings, which are in turn represented using various numbers. In this tutorial, we will not be going into how Unicode text is rendered or how fonts work, but we will be talking about how text from various writing systems is represented as a sequence of numbers. The display of Unicode characters is handled well by applications today, which have become very good at using characters from various fonts to render text.

Coming back to the story of numbers, text also needs to be represented by numbers. To do this, various standards exist. These standards are collectively known as 'Character Encodings'. A popular character encoding is ASCII. What a character encoding does is pair a number and a symbol. For ex., in ASCII the number 97 is 'a'. Now, ASCII does not say anything about the shape of the symbol 'a'. It talks about the character 'a'. We can say something like: in ASCII the number 97 is the first letter of the English alphabet, lower case. So, when an application receives a text stream, concludes that it is encoded in ASCII and encounters the number 97 in the stream, the application renders the 'a' symbol on the screen. Here '97' is referred to as the 'code point' of 'a'. The information about the symbol is stored in fonts [actually, glyphs]. The fonts, on the other hand, pair a number and a glyph, so that given a particular number, a particular glyph i.e. a symbol can be displayed.
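To make the pairing of numbers and characters concrete, here is a minimal sketch in Python 3. Python is only my choice for illustration; the built-in functions ord() and chr() used here convert between a character and its code point, and nothing in it deals with glyphs or fonts.

```python
# ord() gives the code point of a character, chr() goes the other way.
code_point = ord('a')
character = chr(97)

# Encoding text as ASCII yields the raw numbers that would sit in the stream.
ascii_numbers = list('abc'.encode('ascii'))

print(code_point)     # 97
print(character)      # a
print(ascii_numbers)  # [97, 98, 99]
```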

1.1| The ASCII encoding system

ASCII is commonly treated as an 8-bit encoding system, although it really defines only 7 bits. As we all know, computers manipulate all numbers at the lowest level in the binary numeral system. Every binary digit is called a 'bit'. Each bit can have a value of either '1' or '0', and 8 bits make a byte. In ASCII, every byte represents a character. However, ASCII does not use the 8th bit. It uses only 7 bits, i.e. the range 00000000b - 01111111b, i.e. 128 characters. The unused first bit was earlier used for carrying information about the data protocol over the stream (most commonly as a parity bit). This is the reason that many UTF-8 text streams appear corrupted while using email [covered later in the chapter]. ASCII contains 95 printable characters. Though these characters are more than sufficient for communication in English, and for bashing on IRC, the need for more characters was clearly felt, because these characters weren't enough to transfer text even in some languages that use the Latin alphabet. Other writing systems, including Chinese/Japanese/Korean, Devanagari etc., were a different issue altogether.

Other than printable characters, ASCII reserved the first 32 code points for 'control characters'. These characters are not used for transmitting readable text, but rather to control devices. For ex., the number '17' in ASCII is designated as 'Device Control 1' and was often used to switch devices on, whereas '19', 'Device Control 3', was often used to switch devices off. Due to the complexity involved in today's hardware, this method of controlling devices is largely obsolete.

1.2| The ISO-8859 standard

The ISO/IEC put forward a standard named ISO/IEC 8859 to overcome the shortcomings of ASCII. The lower 128 code points were kept the same as in ASCII to maintain backward compatibility. ISO-8859 increased the number of characters that can be represented by a single byte by making use of the 8th bit, giving a total of 256 values: twice that of ASCII. However, this still did not meet the requirements, and so the standard was split into many 'parts', each encoding a different set of characters. The part numbers run up to 16. For example, the 3rd part, known as ISO-8859-3, encodes characters for the Turkish, Maltese and Esperanto writing systems. This still was not enough, as a text stream could be encoded in only one encoding at a time. So, it was not possible to have Turkish [ISO-8859-3] and Latvian [ISO-8859-4] characters in the same file. [Not completely true. The ISO 2022 standard had a solution. For more information, search for ISO 2022].
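The 'one byte, many meanings' problem is easy to see for yourself. The sketch below is only an illustration in Python 3: it decodes the same byte with two parts of ISO-8859. I have picked part 1 (Western European) and part 7 (Greek) as the examples, which are my own choice and not the parts discussed above, simply because their mappings are easy to recognise.

```python
raw = bytes([0xE1])  # a single byte that has the 8th bit set

as_part1 = raw.decode('iso8859_1')  # ISO-8859-1: Western European
as_part7 = raw.decode('iso8859_7')  # ISO-8859-7: Greek

print(as_part1, as_part7)  # two different letters from the same byte

# Bytes below 0x80 mean the same thing in every part (ASCII compatibility).
low_byte = bytes([0x61])
same_in_both = low_byte.decode('iso8859_1') == low_byte.decode('iso8859_7')
print(same_in_both)  # True
```

Without knowing which part was used to write the file, the receiving application cannot tell which of the two letters was meant.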

An encoding system that could cover all the writing systems was needed.

1.3| ISCII, JIS X 0201 and other encodings

The character sets that were not implemented in the ISO-8859 standard were then implemented using alternative encodings. Devanagari was earlier planned as ISO-8859-12, but then ISCII took over. ISCII, short for Indian Standard Code for Information Interchange, provided an encoding system for many Indian writing systems including Devanagari, Gurmukhi etc.

JIS X 0201 [Japanese Industrial Standard X 0201] was a character encoding that covered the Katakana writing system.

Such encoding systems were proving to be a problem as they could not be used together in the same file. Moreover, with so many encodings around, portability was also proving to be a problem.

1.4| Problems with typography

Many characters important for typography, for example curly quotation marks, were not a part of these standards. Because of the existing shortage of code points, all standards focused on transmitting text rather than on the design aspect of it. With the design aspect, the multilingual aspect and the internet bringing people from all segments and locations onto computers, an encoding that supported a huge number of characters had become necessary.

1.5| Understanding a text stream

A stream is just a sequence of numbers. This sequence of numbers can come in from a file on your hard disk or via a channel on your network. The application that receives this stream has to decide what to do with it. If it detects it as text, it's a text stream. If it detects it as some video file, it is a multimedia stream. From now on, you need to know that files are NOT categorized as text files and binary files. Only binary files exist. Files are nothing but a specific segment of storage which holds some numbers in a sequence. If the categorization is not there, why have you been taught so in school? This is what we'll go into in the later parts of the chapter. For now, we'll cover what Unicode is all about.


Chapter 2: The Unicode family

If you've read the first chapter and had no prior knowledge of Unicode, you might be wondering how Unicode supports so many writing systems in just one encoding scheme. Unicode does so by going beyond 1 byte: the code point of a character may need more than one byte to represent. For example, the character 'च' in Unicode is 2330 i.e. 0x91A. This value in binary is 1001 00011010b. It needs 12 bits, i.e. 1.5 bytes, of storage. The characters from the ASCII character set are maintained at the same code points. For ex., 'a' is still 97.

From now on, whenever we mention code points, we will always use the hexadecimal numeral system. Also, instead of 0x as the prefix for hexadecimal numbers, we will use U+ to specify that these are Unicode characters we are talking about.

2.1| A bit of history

The ISO/IEC set out to build a standard to overcome the shortcomings of previous character encodings like ISO-8859. The result was the 'Universal Character Set' (formally known as ISO/IEC 10646), or 'UCS' for short. UCS used a multi-byte scheme for storage of characters. It assigned code points for many characters. Since 1991, the Unicode Consortium has worked closely with the ISO/IEC 10646 team in the making of the Unicode Standard and the UCS standard. As you will learn in the following segment, character encoding in modern encoding systems doesn't take place the way it did in the older systems. There were three ways of encoding text in UCS. Those were: UCS-2, UCS-4 and UTF-1. Each had its pros and cons, which we shall not discuss here.

What is important for us to know is that the code point assignment is the same in UCS as it is in Unicode. From now on, we will deal only with Unicode and related concepts.

2.2| Character encodings in Modern encoding systems

In earlier standards, every character took one byte. So, a sequence of characters was converted directly to a sequence of bytes. The conversion to bytes and the representation of each character by a number were covered more or less by the same standard. In Unicode, the code point of a character may need 1 to 3 bytes. If every character is stored as 2 bytes (which works for the lower ranges), then the text can be stored. But there are some problems associated with this.

Firstly, consider the letters 'a' and 'च', whose code points are U+0061 and U+091A. We can store them in a text file as: 0x0061 0x091A. But here, 'a' is taking twice its required space. Most of the characters in an English document will be in the lower range, which means that they will take twice the space that is required. This is an inefficient use of space.

Another problem is that, since we have decided to treat a 'word' [two bytes make 1 word] as the primary block, it is easy for us to read and write text in that format, but older applications which don't have Unicode support will be rendered useless, as they'll still treat a byte as the building block. So even if the document contains only Latin characters, such an application will read 'a', i.e. 0x0061, as 0x00 and 0x61, i.e. it will interpret 2 characters instead of 1. Which means this method of encoding causes backward incompatibility.
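Both problems can be seen in a few lines. This is a rough sketch in Python 3, where I use the 'utf-16-be' codec as a stand-in for the naive 'two bytes per character' scheme described above (for these two characters the result is the same).

```python
text = 'a\u091a'  # 'a' followed by the Devanagari letter at U+091A

# Store every character as one big-endian 16-bit word.
two_byte_form = text.encode('utf-16-be')
print(two_byte_form.hex())  # 0061091a

# Problem 1: 'a' now costs two bytes instead of one.
size_of_a = len('a'.encode('utf-16-be'))
print(size_of_a)  # 2

# Problem 2: a byte-oriented application sees four separate values,
# starting with a 0x00 that was never meant to be a character.
seen_by_old_app = list(two_byte_form)
print(seen_by_old_app)  # [0, 97, 9, 26]
```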

There are ways to overcome these problems. Unicode calls its solution the 'Unicode Transformation Format'. The Unicode Standard defines UTF-8 and UTF-16 as its two most commonly used encoding systems [or transformation formats], along with UTF-32.

2.3| Multilingual Planes

Unicode currently defines around 100,000 characters, but there is space for 1,114,112 code points. These code points are divided into 17 "planes". These planes are numbered from 0 - 16. Plane 0 is known as the Basic Multilingual Plane (BMP). The BMP is the maximum you'll have to deal with, at least for the next 5 years. Each plane has 65,536 code points. As of today, only the first 65,536, i.e. those in the 0th plane, are in common usage.

The BMP contains characters for most of the major writing systems. Plane 15 and 16 are reserved as Private Use Area.

2.4| Unicode control characters, diacritics and Han unification

Unicode control characters are characters which are used to control the way text is displayed. These characters do not have any visual representation of their own. They only modify the way the surrounding text is shown. Examples are the code points U+2028 and U+2029, which represent a 'line separator' [new line] and a 'paragraph separator' [paragraph break] respectively.

Unicode control characters are also used for controlling the direction of text [i.e Left-to-right text or Right-to-left text (LTR/RTL)]. Due to this, we can use text running in both directions in the same file, same line, right beside each other. To do this, Unicode uses various characters to mark whether the following set of characters should be rendered LTR or RTL.

Han unification: The CJK languages (Chinese/Japanese/Korean) have writing systems which inherit many of their characters from a common set of Han characters. Han unification is an effort of the Unicode Consortium to assign a single code point to each Han character that these languages share, instead of encoding it separately for every language, so as to 'unify' the character set for CJK languages. Regional differences in the shape of a character are left to fonts. The same unified set also serves other writing systems that use Han characters, such as the historical Vietnamese script.

Diacritics: Many languages use diacritics. For example, 'i' and 'î' are both part of the Latin alphabet. But the '^' can be used over almost any letter. If every such combination were given its own code point, the number of code points required would multiply rapidly, and even the current number of code points might not be enough. So, what Unicode does is use something called 'Combining characters'. These characters are combined with the preceding character by the application. For example, 'î' can be stored as the sequence 'i' [U+0069] followed by the combining diacritic '^' [U+0302]. Note here that the diacritic has a different code point from the ordinary caret symbol '^' [U+005E]. Diacritics are in a different range of the Unicode code points. Once the application knows that it has to combine those two characters [it knows that since it has encountered a character which is in the diacritic range, and all diacritics are supposed to be combined], it will display the resultant character as 'î'.

Let us take another example, 'जी'. It is composed of two characters, 'ज' and 'ी'. The latter, if you can't see it properly, looks something like this:
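Here is a small sketch of combining characters in Python 3, using the standard unicodedata module. One thing it assumes that the text above does not mention: for compatibility with older encodings, Unicode also has a single 'precomposed' code point for 'î' [U+00EE], and the NFC normalization form converts the two-character sequence into it.

```python
import unicodedata

sequence = 'i\u0302'  # 'i' followed by COMBINING CIRCUMFLEX ACCENT
caret = '^'           # the ordinary caret symbol, U+005E

# combining() returns a non-zero class for combining marks and 0 otherwise.
mark_combines = unicodedata.combining('\u0302') != 0
caret_combines = unicodedata.combining(caret) != 0

# NFC normalization merges the pair into the single precomposed character.
composed = unicodedata.normalize('NFC', sequence)

print(len(sequence), len(composed))   # 2 1
print(mark_combines, caret_combines)  # True False
print(f'U+{ord(composed):04X}')       # U+00EE
```

So the stream holds two code points, while the reader sees only one character.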

http://www.tritium.frihost.net/tutorials/unicode/img1.gif

The dashed circle indicates that it is to be combined with another character. Although in the Devanagari system 'matras' are not considered diacritics, in Unicode it is convenient to treat them so. All diacritics follow the character they are supposed to combine with.

2.5| Input method editors

Our primary text input device is, naturally, the keyboard. Any keyboard supports text input in at least 1 writing system. However, including support for many more writing systems, symbols etc. is not practical. Unicode has defined around 100,000 characters, and there are a lot more that can be defined. Then how do people exploit the full potential of Unicode without extensive hardware support? We use something known as 'Input method editors'. 'Input method editor' is actually a Windows computing term. Under different operating systems, it is known by different names. However, we will go with Input method editors (IME).

Such a program basically allows one to enter characters which are not available on standard hardware devices, using those standard hardware devices themselves. Confused? You have a roman keyboard [a basic keyboard with the English alphabet on it]. Of course, you communicate in your language on messengers etc. even if your language does not use Latin characters. You do it by using words written in English letters that are pronounced like a word in your language. For example, in Hindi 'यार' is pronounced as 'yaar'. On SMS, YIM etc. we commonly use English letters to talk in Hindi. I'd like to apologize for the fact that I cannot furnish such an example in some other language too. Similarly, IMEs (commonly) use transliteration to convert words like 'yaar' into proper characters like 'यार'. An example of such software would be Baraha. It allows the user to type in many Indian writing systems using romanized input.

Such transliteration schemes have also been standardised. For ex., the romanized form of Japanese is commonly known as 'Romaji'.

2.6| So, what exactly does Unicode do?

Unicode, as stated earlier, provides code points for characters across many writing systems. For example, here is a chart for the Devanagari writing system.

http://www.tritium.frihost.net/tutorials/unicode/chart1.gif

Now, from looking at this chart you can easily make out that each character corresponds to a number. For example, the letter 'र' is U+0930.

Another example is the chart for the Katakana writing system (one of the 3 writing systems used for Japanese):

http://www.tritium.frihost.net/tutorials/unicode/chart2.gif

The basic work of Unicode is this assignment of code points to characters.
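If you don't have the charts at hand, you can look up the same assignments programmatically. This is just a sketch in Python 3 using the standard unicodedata module; the Katakana letter at U+30AB is my own pick for the example.

```python
import unicodedata

ra = '\u0930'  # the Devanagari letter from the first chart
ka = '\u30ab'  # a Katakana letter, picked for illustration

# Print each character's code point next to its official Unicode name.
for ch in (ra, ka):
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

ra_name = unicodedata.name(ra)
ka_name = unicodedata.name(ka)
```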

NOTE: Both these charts have been sourced from http://en.wikipedia.org


Chapter 3: The transformation formats

Now that we have learned what Unicode is, we shall now learn how Unicode text is stored in files. We shall be learning the two major transformation formats: UTF-8 and UTF-16. We shall also do a minor introduction to UTF-32. In Chapter 4, we shall build our own converter that will convert files from one encoding format to another. Beyond this point, it is necessary for you to have the following tools:

1. A Hex Editor: It is an absolute must. Check out http://en.wikipedia.org/wiki/Comparison_of_hex_editors to pick one for yourself.
2. A Text editor supporting Unicode: This is again important as you need to see the character representation of whatever you type in. On Windows, I use Notepad2. I'd recommend Notepad2 to you also. It has a great support for Unicode files and gives you a lot of control over the encoding process.
3. Sample text: You need some sample text to work on from different writing systems. Wikipedia in languages like Chinese, Japanese, Armenian etc. can give you lots of sample text.

Also, brush up your hexadecimal skills. Since you'll be working with binary files from now on, you'll always work with hexadecimal [as said earlier, even though Unicode is for text, while working at the encoding level, we shall not differentiate between text files and binary files]. You can refer to this tutorial here: http://www.thinkdigit.com/forum/showthread.php?t=17090 which I made some time back. It's pretty juvenile and silly at some points, but shall be enough for the following chapters.

3.1| Few basics on Unicode Storage

Unicode Scalar: A Unicode scalar is the absolute value of a character's code point. When Unicode text is stored in a text file using one of the Transformation Formats, a character does not necessarily have the same value as its code point. For example, the letter 'ज' in Unicode has the value U+091C. When stored as UTF-8, it is stored as 0xE0A49C. This value i.e. 0xE0A49C is its 'encoded value', while U+091C is the 'Unicode scalar' for the particular character. When stored as UTF-16, the encoded value is different again [0x091C here, which for characters in the BMP happens to match the scalar]. Therefore, whenever an application deals with files, it has to convert all the 'encoded values' to their corresponding Unicode scalars. We will be doing this too in Chapter 4, when we make our own converter. It is important to see here that we prefix the encoded values with 0x, whereas we prefix the Unicode Scalars with U+.
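The difference between a scalar and its encoded values can be checked quickly. A minimal sketch in Python 3, using only the built-in str.encode() and bytes.decode():

```python
ch = '\u091c'  # DEVANAGARI LETTER JA

scalar = ord(ch)
utf8_value = ch.encode('utf-8')
utf16_value = ch.encode('utf-16-be')

print(f'U+{scalar:04X}')  # U+091C
print(utf8_value.hex())   # e0a49c
print(utf16_value.hex())  # 091c

# Decoding converts an encoded value back to the Unicode scalar.
decoded_scalar = ord(utf8_value.decode('utf-8'))
print(decoded_scalar == scalar)  # True
```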

Byte Order Mark: A Byte-order mark [BOM] is the Unicode character U+FEFF. Its two bytes, 0xFE and 0xFF, are both invalid as ASCII characters (as ASCII characters do not use the 8th bit, there cannot be an 'F' value in the first nibble) and hence do not cause backward compatibility issues. A Byte Order Mark is a Unicode character which is placed at the beginning of a file to specify:

1. The Encoding of the file.
2. The Byte-order of the file (i.e. Endianness)

Encoding: The character U+FEFF, as you know, is a Unicode Scalar. In UTF-8, its value is EF BB BF. When the first three bytes of a stream are received as EF BB BF, the application can understand that the stream is UTF-8 encoded. In many cases, the BOM is not used, because in files like executable scripts, older interpreters may support only ASCII; to them the UTF-8 BOM 0xEF 0xBB 0xBF is three unexpected non-ASCII bytes [shown as 'ï»¿' when read as ISO-8859-1]. This input will cause the interpreter to report an error. To overcome this, in many cases the encoding is mentioned in the file. For ex., in HTML, the encoding is mentioned in a meta tag. When your web browser encounters a meta tag that says something like:

Code:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

It now knows that the stream is encoded with UTF-8. The fact that it could read the data till now without knowing the encoding can be attributed to the backward compatibility of UTF-8. This line of code if stored as ASCII, is valid UTF-8 text too. Therefore, the browser can read it as ASCII text till it is explicitly mentioned that UTF-8 is the encoding for the document.

At the same time in UTF-8, the BOM is not really a 'Byte-order' mark as UTF-8 does not have Byte-order issues. In UTF-16, the encoded value is 0xFEFF (for big-endian) and 0xFFFE (for little-endian).

Endianness: Endianness is the order in which the bytes of a multi-byte value are stored. For example, take the number 2412. In hexadecimal, its value is 0x096C. The smallest memory block is a byte. So the 2 bytes needed to store it would be 0x09 and 0x6C. In memory, or in a data file or any sequence of numbers used to store this data, it can be stored as:

Code:

+------+------+
| 0x09 | 0x6C |
+------+------+

OR

+------+------+
| 0x6C | 0x09 |
+------+------+

The first case, in which the ordering matches the hexadecimal notation is known as 'big-endian' byte ordering, while the second case is known as 'little-endian' byte ordering.
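The two byte orders can be reproduced with Python 3's built-in int.to_bytes() and int.from_bytes(). This is only an illustrative sketch:

```python
number = 2412  # 0x096C

big = number.to_bytes(2, byteorder='big')
little = number.to_bytes(2, byteorder='little')
print(big.hex(), little.hex())  # 096c 6c09

# Reading the bytes back with the wrong byte order gives a different number.
misread = int.from_bytes(big, byteorder='little')
print(hex(misread))  # 0x6c09
```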

Thereby when a BOM is 0xFEFF, it is UTF-16 (big-endian), when the BOM is 0xFFFE, it is UTF-16 (little-endian).
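An application could sniff the BOM along these lines. This is a rough sketch in Python 3, not how any particular program does it; sniff_bom is a name I made up, while the BOM_* constants come from the standard codecs module.

```python
import codecs

def sniff_bom(data):
    """Guess the encoding from a leading BOM; return None if there is none."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return 'utf-8'
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return 'utf-16-be'
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        # Simplification: a UTF-32 little-endian BOM (FF FE 00 00)
        # would also end up here.
        return 'utf-16-le'
    return None

print(sniff_bom(b'\xef\xbb\xbfhello'))  # utf-8
print(sniff_bom(b'\xfe\xff\x00a'))      # utf-16-be
print(sniff_bom(b'\xff\xfea\x00'))      # utf-16-le
print(sniff_bom(b'hello'))              # None
```

A file without a BOM gives None, in which case the application has to fall back on something else, like the meta tag mentioned earlier.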

3.2| The UTF-8 encoding system

The UTF-8 encoding system is one of the most widely used encoding systems. In UTF-8, we follow a 'Range table'. A particular Unicode scalar is first identified with the range it belongs to, and then encoded according to that. Each range has a particular way of encoding text. However, the method of encoding multi-byte characters is similar in all cases. We will get into it.

All code points in Unicode can be categorized in two categories:

1. Single-byte characters: The characters which take 1 byte to encode. In hexadecimal, these fall in the range 0x00 - 0x7F. It might surprise you that the code points from 0x80 - 0xFF also fit in one byte. Then why have they been put in the multi-byte range? We will learn about this in a short while.
2. Multi-byte characters: The characters which take more than 1 byte to encode. The UTF-8 scheme itself could encode larger values, however the Unicode standard restricts it to encode only till 0x10FFFF. So, in hexadecimal, characters in this category have code points in the range 0x80 - 0x10FFFF.

Both UTF-8 and UTF-16 have been restricted in terms of the number of characters that they can encode. No method should officially support characters outside the range 0x000000 - 0x10FFFF. Take a look at the number 0x10FFFF. In general form, this can be written as 0xBBXXXX. Here 'XXXX' is the position of the character within its plane, whereas 'BB' is the plane number (discussed earlier). 0x10 is 16 in decimal. Therefore, BB can take 17 values [from 0x00 - 0x10], which makes for 17 planes. Also, the range 0xD800 - 0xDFFF is reserved for surrogate pair codes (which we will cover in UTF-16) and hence is not permitted in standard UTF-8. The Unicode Consortium recognizes this specification, and mandates that no character can be assigned a code point in that range.
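The 0xBBXXXX form translates directly into a shift and a mask. A small sketch in Python 3; the function names are my own:

```python
def plane_of(code_point):
    """Return the plane number (the 'BB' part) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError('outside the Unicode range')
    return code_point >> 16

def position_in_plane(code_point):
    """Return the 'XXXX' part, i.e. the lower 16 bits."""
    return code_point & 0xFFFF

total_code_points = 17 * 0x10000

print(plane_of(0x091C), hex(position_in_plane(0x091C)))      # 0 0x91c
print(plane_of(0x10FFFF), hex(position_in_plane(0x10FFFF)))  # 16 0xffff
print(total_code_points)                                     # 1114112
```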

Now we will learn how character encoding goes on in UTF-8.

Single byte characters: Single byte characters are stored just as their ASCII representation would be stored. Thereby, in a file, 'a' will be stored in UTF-8 as the character 'a' would have been stored in ASCII i.e. with the number 0x61. All characters in the range 0x00 - 0x7F are stored in this way. Thereby if a file has to store the word 'hello', it will store it in UTF-8 just as it is supposed to store it in ASCII:

Code:

+--+--+--+--+--+
|68|65|6C|6C|6F| 
+--+--+--+--+--+

Now, this is both the ASCII and UTF-8 representation of the word 'hello'
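You can confirm this with a couple of lines of Python 3 (only a quick check, nothing more):

```python
word = 'hello'

ascii_bytes = word.encode('ascii')
utf8_bytes = word.encode('utf-8')

print(ascii_bytes.hex())  # 68656c6c6f
identical = ascii_bytes == utf8_bytes
print(identical)          # True
```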

Multi byte characters: We come back to the question we left unanswered. Why is it that 0x80 - 0xFF despite being single byte characters have been categorized into Multi byte characters? It is because, UTF-8 uses the MSB i.e. the 1st bit of a byte to determine whether the byte forms a part of a single byte character or a multi byte character. The range 0x00 - 0x7F in binary is the range 00000000b - 01111111b. Here, the 1st bit is always '0'. If the first bit of a byte is zero, it is a single byte character. Otherwise, it is a part of a multi byte character.
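Testing the first bit of a byte is a single AND operation with the mask 10000000b. The following Python 3 sketch applies it to every byte of a short UTF-8 stream; is_single_byte_char is a name I made up for illustration.

```python
def is_single_byte_char(byte_value):
    """True if the MSB is 0, i.e. the byte is a complete character by itself."""
    return (byte_value & 0b10000000) == 0

# 'a' takes one byte in UTF-8, the Devanagari letter U+091C takes three.
stream = 'a\u091c'.encode('utf-8')
flags = [is_single_byte_char(b) for b in stream]

print(stream.hex())  # 61e0a49c
print(flags)         # [True, False, False, False]
```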

To encode characters using UTF-8, one must know how to use a range table. First let us have a look at the range table for UTF-8:

http://www.tritium.frihost.net/tutorials/unicode/img2.jpg
Source: http://en.wikipedia.org

Let's go over the table. The first column is the 'Code Range' in hexadecimal. Each row corresponds to a particular Code range. The fields in that row give information on how to encode a given character which falls in that Code Range.

Next, we have the 'Scalar Value'. When your particular code point falls in a given range, assume a general form. For example, in the three-byte range (0x000800 - 0x00FFFF, with the surrogates excluded), we have a value, let's say 0x00D32A. In binary it is: 11010011 00101010b. The scalar is shown as xxxxyyyy yyzzzzzz. So, comparing those two, we get: xxxx refers to 1101, yyyy yy refers to 0011 00 and zzzzzz refers to 101010. Next, we have the 'UTF-8' column. Its UTF-8 encoding is given as: 1110xxxx 10yyyyyy 10zzzzzz. Clearly, xxxx, yyyyyy and zzzzzz are to be copied from our number into this new number, which gives us: 11101101 10001100 10101010b. In hexadecimal, this would be: 0xED8CAA. Therefore, the Unicode Scalar U+D32A is encoded in UTF-8 as 0xED8CAA.

There is a pattern in which multi byte characters are stored. The first byte of a multi byte sequence has its first 'n' bits as '1', where 'n' is the number of bytes that make up that one character. The next bit is always a zero. For example, U+D32A. In UTF-8, it will take 3 bytes to encode. The first byte therefore has its first '3' bits as '1' and the 4th bit as '0' i.e. 1110xxxx. Then, the next (n - 1) bytes, which are a part of the multi byte sequence, will have the sequence '10' in front of them.

You can look at the table and see that a value requiring 4 bytes of storage has its first byte as 11110xxx. Using this scheme, UTF-8 could store even larger numbers as code points. However, as we have previously mentioned, it is restricted by the Unicode standard. The 4 byte form, however, will rarely be encountered by us in regular usage, as this range is used only for characters outside the BMP.
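The whole range table fits into one small function. The following is a hand-written sketch in Python 3, meant for learning only (in real code you would simply call str.encode('utf-8')); utf8_encode is my own name for it.

```python
def utf8_encode(cp):
    """Encode one code point by hand, following the UTF-8 range table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError('not a valid Unicode scalar')
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:    # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx followed by three continuation bytes
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])

print(utf8_encode(0x61).hex())    # 61
print(utf8_encode(0xD32A).hex())  # ed8caa
```

The second result is the same 0xED8CAA that we worked out by hand for U+D32A.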

One more thing to notice here is that for a byte to have the value '0xFX', where X = any hexadecimal digit, its binary form has to be 1111xxxx. This type of arrangement is possible only as the first byte of a 4 byte sequence. Moreover, there is a '0' in the 5th bit in that case. Therefore, it can take only the values 0xF0 - 0xF7. Meaning that bytes in the range 0xF8 - 0xFF can never be encountered in a proper UTF-8 stream. Now, if you remember, 0xFF is a part of the UTF-16 BOM (Byte order mark). The BOM, as specified in Unicode, is U+FEFF. Let's convert it to UTF-8.

0xFEFF falls in the range 0xE000 - 0xFFFF.

1. Its binary form is: 11111110 11111111b.
2. Now, this range takes 3 bytes to store in UTF-8. So our first byte would be: 1110xxxx.
3. Fill the first 4 empty places (marked by 'x') with the first four bits of the binary form we have: 11101111b.
4. Now, the next byte is part of a multi byte sequence, but not the first byte in the sequence. Therefore, the byte starts out as: 10yyyyyy. Fill these 6 empty places (marked by 'y') with the next 6 bits of our binary form: 10111011b.
5. Proceeding in the same way with the 3rd byte, we start with: 10zzzzzz. Filling it out, we get: 10111111b.

Combining these three bytes, we get: 11101111 10111011 10111111. This in hexadecimal is: 0xEFBBBF i.e. 0xEF 0xBB 0xBF. As we have mentioned previously, this is the BOM for UTF-8.
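Both observations, the UTF-8 form of the BOM and the fact that bytes like 0xFE and 0xFF cannot occur in UTF-8, can be checked with Python's built-in codec. A minimal sketch in Python 3; is_valid_utf8 is a helper name of my own.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

bom_utf8 = '\ufeff'.encode('utf-8')
print(bom_utf8.hex())  # efbbbf

print(is_valid_utf8(bom_utf8))     # True
print(is_valid_utf8(b'\xfe\xff'))  # False: the raw UTF-16 BOM is not UTF-8
print(is_valid_utf8(b'\xff'))      # False
```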

You can similarly convert other characters to UTF-8 from their Unicode scalars. If you'd like to practise some of it, you can just open wordpad and type in a few Unicode characters (using programs like Baraha for example) and then check the UTF-8 encoded value using a Hex editor and compare it with the one you calculated.

So, that'd be the end of this huge section. Let's move on to UTF-16.

3.3| The UTF-16 encoding system

As we learned earlier, the ISO/IEC put forward the ISO 10646 specification. It also proposed an encoding system called UCS-2. The shortcoming of UCS-2 was that it couldn't store values in planes other than the BMP. Therefore, UTF-16 was created. UTF-16 is officially defined in the ISO/IEC 10646 standard. UTF-16 is exactly the same as UCS-2, except that it supports something known as a 'surrogate pair'. A 'surrogate pair' is what allows UTF-16 to store characters outside the BMP.

In UTF-16, characters in the range 0x0000 - 0xFFFF [i.e. characters in the BMP] are stored as 16-bit 'words' (2 bytes make up a word). So, 'a' which is 0x61, is stored as 0x0061. Meaning, if there was a text stream like 'ab', it'd be stored as:

Code:

In ASCII and UTF-8:
+--+--+
|61|62|
+--+--+

In UTF-16:

+--+--+--+--+
|00|61|00|62|
+--+--+--+--+

Byte-order considerations: Since the building block here is not a byte, but rather a word, the relative positioning inside a word can be of two types: Big-endian and little-endian. Both these terms have been discussed earlier. Considering our previous case, the string 'ab' is encoded as:

Code:

UTF-16 Big-endian:

+--+--+--+--+
|00|61|00|62|
+--+--+--+--+

UTF-16 Little-endian:

+--+--+--+--+
|61|00|62|00|
+--+--+--+--+

As we have discussed in earlier parts, the Byte-order of a stream is given by the Byte-order mark i.e. the BOM.
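The same byte layouts can be produced with Python 3's codecs. A small sketch: 'utf-16-be' and 'utf-16-le' write no BOM, whereas the plain 'utf-16' codec writes a BOM when encoding and uses it to pick the byte order when decoding.

```python
text = 'ab'

big = text.encode('utf-16-be')
little = text.encode('utf-16-le')
print(big.hex())     # 00610062
print(little.hex())  # 61006200

# With a BOM in front, the generic decoder works out the byte order itself.
from_big = (b'\xfe\xff' + big).decode('utf-16')
from_little = (b'\xff\xfe' + little).decode('utf-16')
print(from_big, from_little)  # ab ab
```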

As you can clearly see, UTF-16 is not backward compatible. At the same time, it is used over UTF-8 in Java and .NET because the decoding process is very light on the processor as compared to UTF-8. Considering a word at a time, a simple concatenation of two bytes gives you the Unicode scalar directly. The decoding process for characters outside the BMP is a bit more complex than this, but as we have been saying, very few characters outside the BMP are used regularly.

For characters outside the BMP, i.e. characters in the range 0x10000 - 0x10FFFF, the following encoding method is followed [Note that this method is exclusive to UTF-16 and is what distinguishes it from UCS-2]:

1. Consider a character u > 0xFFFF.
2. Subtract 0x10000 from u and store this value as u'. [i.e. u' = u - 0x10000].
3. u' will always be a value that can be represented by 20 bits, as the highest value possible is 0x10FFFF. 0x10FFFF - 0x10000 gives 11111111111111111111b i.e. 0xFFFFF, which is represented by 20 bits. All other values will obviously fit in 20 bits as well.
4. Split it into two parts uh and ul. uh = higher 10 bits of u' and ul = lower 10 bits of u'. Higher means the first 10 bits of u' when counted from the left.
5. Two words w1 and w2 are initialized with the values 0xD800 and 0xDC00 i.e. w1 = 0xD800 and w2 = 0xDC00. If you look at their binary equivalents, both of them have 10 trailing zeroes in their values. These bits will store our required values.
6. Use the OR operator on w1 and uh to give you the first word of the surrogate pair i.e. w1 = w1 | uh.
7. Use the OR operator on w2 and ul to give the second word of the surrogate pair i.e. w2 = w2 | ul.

Then w1 and w2 are the pair that represents the required code point.
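
The steps above can be sketched as a small function. This is a minimal sketch, not a full encoder: the name toSurrogatePair() is hypothetical, and it assumes the input really lies in 0x10000 - 0x10FFFF (no error checking):

```cpp
#include <cstdint>

// Split a code point above 0xFFFF into a UTF-16 surrogate pair.
// Assumes 0x10000 <= u <= 0x10FFFF; nothing is validated here.
void toSurrogatePair(std::uint32_t u, std::uint16_t &w1, std::uint16_t &w2) {
    std::uint32_t v  = u - 0x10000;        // u', at most 20 bits
    std::uint32_t uh = (v >> 10) & 0x3FF;  // higher 10 bits of u'
    std::uint32_t ul = v & 0x3FF;          // lower 10 bits of u'

    w1 = static_cast<std::uint16_t>(0xD800 | uh);  // first word of the pair
    w2 = static_cast<std::uint16_t>(0xDC00 | ul);  // second word of the pair
}
```

Since the last 10 bits of 0xD800 and 0xDC00 are all zero, the OR simply drops uh and ul into those empty positions.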

Let's take an example of a code point, let's say: 0x1D11E [119070 in decimal]. Don't try to see what character it represents. It's the 'musical symbol G clef'. Let's convert it to UTF-16:

1. u = 0x1D11E.
2. Since u > 0xFFFF, we have to use the method of surrogate pairing to store this value.
3. u' = u - 0x10000. Now, u' is a 20 bit value. Here, u' = 0x0D11E i.e. u' = 0000 1101 0001 0001 1110.
4. uh = higher 10 bits of u' (i.e. the first 10 bits of u'), so uh = 0000 1101 00
5. ul = lower 10 bits of u' (i.e. the next 10 bits of u'), so ul = 01 0001 1110
6. Let, w1 = 0xD800 and w2 = 0xDC00. These are arbitrary values, which have been defined in the standard. These values have been defined very cleverly as we shall soon see.

Code:

7.
w1 = w1 | uh
   = 0xD800 | 0000 1101 00
   = 1101100000000000 |
           0000110100
   = 1101100000110100
   = 0xD834

w2 = w2 | ul
   = 0xDC00 | 01 0001 1110
   = 1101110000000000 |
           0100011110
   = 1101110100011110
   = 0xDD1E

Thereby, the UTF-16 representation for 0x1D11E would be the pair: 0xD834 0xDD1E. This pair is known as a 'surrogate pair'. Now, coming back to the point where I said that these values were chosen cleverly. Firstly, both these values have 10 trailing '0's for storage. Note here that it is not the '0' itself that is important. It is the fact that the last 10 bits are known to be '0' that lets an OR operator work and produce a surrogate value. If those bits had held some other values, then well, we still could have used this algorithm, but two or more inputs would have given the same output. It could also have been trailing '1's. In that case, we would have used the XOR operator.

Secondly, the first surrogate value will always lie between 0xD800–0xDBFF and the second surrogate value will always lie between 0xDC00-0xDFFF. Thereby, whenever an application encounters a word in the surrogate pair range, it can easily know whether the word forms the first value or the second value of the pair. The 'surrogate pair' range is an explicitly defined range between 0xD800 - 0xDFFF. As we saw in UTF-8, this is the range excluded from the encoding. Also, in the Unicode Standard, no code point in this range will ever be assigned any character, to assist UTF-16.
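
Because the two halves of a pair live in separate ranges, telling them apart takes only a range check. A minimal sketch (the three function names are made up for this illustration):

```cpp
#include <cstdint>

// True for any word in the reserved surrogate range.
bool isSurrogate(std::uint16_t w)    { return w >= 0xD800 && w <= 0xDFFF; }

// True when the word can only be the first value of a pair.
bool isFirstOfPair(std::uint16_t w)  { return w >= 0xD800 && w <= 0xDBFF; }

// True when the word can only be the second value of a pair.
bool isSecondOfPair(std::uint16_t w) { return w >= 0xDC00 && w <= 0xDFFF; }
```

An application that lands in the middle of a stream can use such checks to find out whether it is looking at the start or the end of a pair.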

With this we end with UTF-16 as well. Now that we know about UTF-8 and UTF-16, we'll move on to Chapter 4, which is meant for programmers. But before that, let's have a few words on some other encoding methods for Unicode.

3.3| Other encoding methods

Other encoding methods for Unicode include:

1. UTF-7: As mentioned earlier, older network infrastructure was such that it used the 1st bit in every byte of a byte sequence to store information about the data protocol. Quite commonly, it was used to store the parity bit. Due to this, UTF-8 could not be supported over such networks. Thereby, UTF-7 was made, which uses 7 bits to store Unicode characters. UTF-7 is not an official encoding method of The Unicode standard.

2. UTF-EBCDIC: Just as there was ASCII, there was another encoding method called EBCDIC. EBCDIC is a single byte system like ASCII, though its code layout differs from that of ASCII. Just as UTF-8 maintains backward compatibility with ASCII, UTF-EBCDIC maintains compatibility with EBCDIC.

3. UTF-32: UTF-32 is a fixed-width encoding system i.e. it uses 4 bytes for each character to be stored. All values are stored as their respective Unicode scalars. UTF-32 is also an official encoding method. UTF-32 is not popular because it is not efficient with space. It uses 4 bytes for all characters, even for those that require only 1 byte.

4. UTF-x/2: UTF-x/2 is a pointless encoding method. But it still finds its way here. Why? Because I made this method as a school project. Although it sucks more or less, I've put it here for bragging rights. UTF-x/2 uses an alternative code range table and operates on 'nibbles' rather than 'bytes'. To know more about it, check:

3.4| Decoding

In the sections previous to this, you could've very well deduced how to decode a given sequence in a particular encoding back to its Unicode scalar. This might involve pure logic on your side. However, we need an algorithm for encoding and decoding a given sequence to put it into practice. This is not important at this level, and hence I will be discussing that in Chapter 4.

rohanprabhu · Jan 6, 2008

Chapter 4: Our own encoding translator

We will use the C++ programming language for this part. I'm using the gcc compilers, and the code I post over here should compile without any problems on gcc v3.4.2. We will be making a console based application using only file I/O. For file I/O we will use the <fstream> library which is a part of the C++ standard library. The necessary input from the user will be taken in the form of command line options. So, let's begin. It is assumed that the reader has knowledge of C++ and the functions in the <fstream> library.

4.1| How to go about it

We can break down the converting process into 2 simple parts:

1. Conversion to Unicode scalar from encoded stream.
2. Conversion to encoded stream from a Unicode scalar.

Basically, we will first take a file name and its encoding as input. If there is a BOM detected, it will be converted too. Using our previous knowledge, we will devise an algorithm to convert the given input file into Unicode scalars. These Unicode scalars will be stored in a Unicode Intermediate Format (UIF) [covered after this].

Then, we will use the pure scalars in the UIF and convert them to the desired encoding. Note here that our converter is made for only UTF-8 and UTF-16. Thereby, if a UTF-8 file is given, it will be converted to UTF-16 and vice versa. We will not be checking for erroneous input, which makes our application incapable of supporting real life situations. As such, this exercise is supposed to be a mere educational tool. Adding error detection and correction is left to the reader as an exercise following this tutorial.

NOTE: When we input a UTF-16 file, we shall consider ONLY big-endian files.

The Unicode Intermediate Format (UIF): The UIF is no real format, but something that we will use as an intermediate file between encodings. If there are more than 2 encodings you are working with, it is quite beneficial to convert them to an intermediate file first, and then, using that intermediate file, the text can be converted to any other encoding format. It need not necessarily be a file. You can also use an array of 1 byte datatypes [in C++, a 'char'] to store the UIF. Each character will take 3 bytes in the UIF.

Methods while working with bits: We will be using some techniques while working with bits. We'll cover them one by one.

1. Check for a sequence: We use the AND operator for this. Given 2 numbers, the AND operator compares each bit of its 1st operand with the corresponding bit of its 2nd operand. If and only if both these bits are '1' does it put a '1' in the corresponding bit of the result. By 'corresponding' we mean that those bits have the same place values. This is an interesting property, as it helps check for a particular sequence in a byte. For ex:

Code:

u = 11010111b
v = 00010100b
w = u & v
  = 11010111
   &00010100
  = 00010100

Here, if you notice, if we take the operation with reference to 'v', we
can see that w = v. This means, u contains 'v' as a sequence. It is so because,
only those bits will be '1' where u and v, both are 1. So, if 'w' has to be equal
to 'v', then 'u' must have '1's at the same positions as 'v' has 1s.

Also, you can see that w != u. Therefore, 'v' does not have 'u' as a sequence.

We shall use this method to check for multi-byte signatures.
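
In C++ the check described above is one line. A minimal sketch, using a made-up helper name hasAllBits():

```cpp
// True when every bit that is '1' in 'pattern' is also '1' in 'value',
// i.e. when (value & pattern) gives back 'pattern' unchanged.
bool hasAllBits(unsigned char value, unsigned char pattern) {
    return (value & pattern) == pattern;
}
```

With the numbers from the example, hasAllBits(0xD7, 0x14) is the 'u contains v' case (0xD7 = 11010111b, 0x14 = 00010100b), while swapping the arguments gives the 'v does not contain u' case.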

2. Extraction of particular bits: From a byte, let's say we want to extract the last 5 bits, or the first 4, or the middle 2. How can we do it? One WTF (Worse-than-failure) method is to use a function to convert the number into the binary format (in a string) and then use string functions on it. Unbelievably, I have seen this in practice. However, we shall try not to end up on http://worsethanfailure.com .

To extract a particular position of bits from a byte, we create a byte which has '1's at those positions from where we need to extract the bits. Then we use the AND operator on these 2 bytes, and the resultant byte gives us the bits from those positions. For ex:

Code:

u = 11010111b
//We need the first 4 bits from u
v = 11110000b
//We have to take care of the place values of 'u'

w = u & v
  = 11010111
   &11110000
  = 11010000

Thus 'w' now contains only the 1st 4 bits of 'u'. This happens because,
since 'v' has '1' at only the 1st four bits, all other positions will
always result in '0'. On the 1st 4 bits however, those bits which are
'0' in u will result in '0', while the bits having a value of '1' will
result in a '1'.

Now, if we needed to truncate w = 11010000 to 1101, how do we go about it? Doing so will give us the exact value of the 1st 4 bits. What we are basically doing is reducing the place value of a particular segment of bits by a power of the base i.e. every time we remove a trailing zero from a binary number, we divide it by 2, just like every time we remove a trailing 0 from a decimal number, we divide it by its base i.e. 10 (ten). So, in this case, as we have to remove 4 zeroes (in binary form), we divide it by 2^4. In hexadecimal this number is: 0xD0. Here, we have to remove only 1 zero. But, the base of the hexadecimal numeral system is 16. Therefore, we divide it by 16. As can be easily seen, both cases agree, since 2^4 = 16.
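
Masking and truncating can be combined in one small helper. This is a sketch with a hypothetical name, extractBits(); the right shift by n positions is the same as dividing by 2^n:

```cpp
// Keep only the bits selected by 'mask', then remove 'shift' trailing
// zeroes. Shifting right by 'shift' is the same as dividing by 2^shift.
unsigned int extractBits(unsigned int value, unsigned int mask, unsigned int shift) {
    return (value & mask) >> shift;
}
```

For u = 11010111b (0xD7), the mask 11110000b (0xF0) and a shift of 4 give 1101b (0xD), the value of the first 4 bits.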

3. Appending, merging i.e. concatenation: The Unicode Transformation Formats, sort of treat bits and bytes like string. Therefore, we have to 'concatenate' numbers. How do we go about doing this? Let's say we have to concatenate 2 numbers 'a' & 'b'. It is important that 'b' is specified to be padded to a specified number of bits. Here's why:

Code:

a = 1001b
b = 10b

b = 0010b is also the same value
also, b = 010b perfectly retains the value

but on concatenation with a, they result in:

100110, 10010010 and 1001010

which are 3 different values. Hence the padding of 'b'
to a particular number of digits is to be specified

To perform padding, we don't do any operations on 'b'. We rather do some operations on 'a'. Here, we will be increasing the place values of the bits of 'a' in 'a' itself to the padding of 'b'. Then, we add 'b' to it. Here's what happens:

Code:

a = 1001b
b = 10b //to be padded to 4 bits i.e. 0010b

a = 1001b * 2^4
  = 10010000

a = a + b
  = 10010000
   +      10
  = 10010010

//Here 'a' contains the merged result of 'a' and 'b', 'b' padded to
4 bits. This will be particularly useful for us in the UTF-16 encoding
process.

Here also, to increase the place value by a power of the base i.e. to add zeroes to the end, we multiply by the base once for every '0' we add.

Note that we will use the operators '|' and '+' interchangeably while merging. This is why:

Code:

If x and y are two bits,
and if either (x = 0) or (y = 0), then

x | y gives the same result as x+y

and while merging, we make sure that the addition/merging takes place only
on the '0' positions of the 1st byte.
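
The padded concatenation can be sketched as follows. The name appendBits() is made up for this example, and it assumes 'b' really fits in the number of bits it is padded to:

```cpp
// Append 'b' to 'a', treating 'b' as padded to 'padBits' bits.
// Assumes b < 2^padBits, so the OR only fills positions that are '0'.
unsigned int appendBits(unsigned int a, unsigned int b, unsigned int padBits) {
    return (a << padBits) | b;  // a << padBits is a * 2^padBits
}
```

With a = 1001b and b = 10b, padding 'b' to 2, 4 or 3 bits gives 100110b, 10010010b and 1001010b respectively, the three different results listed earlier.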

These concepts are a bit difficult to understand if it is the first time you are working at the binary level or using bitwise operators. However, with a little practice, it should become much simpler.

4.2| The Base program

Here is the source code of the basic program that we are making:

Code:

#include<iostream>
#include<fstream>
#include<conio.h>
#include<string.h>
#include<process.h>

#define UNICODE_U8 1
#define UNICODE_U16 2

using namespace std;

int enc_inp;
ifstream fin;
ofstream fout;
fstream ustream;

int cpoint = 0;

const char UIF_PATH[] = "C:\\WINDOWS\\TEMP\\uconvert_4852_temp1.uif";

void createUIF();
void convertToU8();
void convertToU16();

int main(int argc, char *argv[]) {
    if(argc != 3) {
        cout<<"uconvert requires 2 parameters to work with: "<<endl
            <<"   uconvert [file] [encoding]"<<endl<<endl
            <<"[file] = Path of the input file"<<endl
            <<"[encoding] = Encoding of the input file: "<<endl
            <<"             u8 = UTF-8"<<endl
            <<"             u16 = UTF-16"<<endl;
            
        getch();
        exit(1);
    }
    
    if(strcmpi(argv[2], "u8") == 0) {
        enc_inp = UNICODE_U8;
    } else if(strcmpi(argv[2], "u16") == 0) {
        enc_inp = UNICODE_U16;
    } else {
        cout<<"2nd parameter to uconvert must be either 'u8' or 'u16'";
        getch();
        exit(1);
    }
    
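    // argv[1] must not be extended in place: strcat() on it would write
    // past the end of the argument string. The output name is built in a
    // separate buffer instead.
    const char *fname = argv[1];
    fin.open(fname, ios::in|ios::binary);
    
    char outname[1024];
    if(strlen(fname) + 5 > sizeof(outname)) { // 4 characters of extension + '\0'
        cout<<"Input file name is too long";
        getch();
        exit(1);
    }
    strcpy(outname, fname);
    
    if(enc_inp == UNICODE_U8) {
        strcat(outname, ".u16");
    } else if(enc_inp == UNICODE_U16) {
        strcat(outname, ".u8");
    } else {
        cout<<"Fatal Error: Assertion failed.";
        getch();
        exit(1);
    }
    
    fout.open(outname, ios::out|ios::binary);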
    ustream.open(UIF_PATH, ios::out|ios::binary);
    
    if(!(fout) || !(fin) || !(ustream)) {
        cout<<"File I/O error";
        exit(1);
    }
        
    createUIF();
    ustream.close();
    
    ustream.open(UIF_PATH, ios::in|ios::binary);
    
    if(enc_inp == UNICODE_U8) {
        convertToU16();
    } else {
        convertToU8();
    }
    
    cout<<"Converted "<<cpoint<<" Unicode code points happily ;)"<<endl;    
    getch();
    fin.close();
    fout.close();
}

The global variables will be used by the converting functions. The functions and the global variables are as follows:

A. PREDEFINED CONSTANTS

1. UNICODE_U8 : Denotes a UTF-8 encoding
2. UNICODE_U16 : Denotes a UTF-16 encoding

B. GLOBAL VARIABLES

1. enc_inp (int) : Encoding type of the input file. Can be any one of these values: UNICODE_U8 / UNICODE_U16.
2. fin (ifstream) : File stream of the input file.
3. fout (ofstream) : File stream of the output file.
4. UIF_PATH (const char[]) : Pathname to a temporary UIF file.
5. ustream (fstream) : File stream of the UIF file.
6. cpoint (int) : Number of code points converted so far.

C. FUNCTIONS

1. void createUIF() : Based on the current input type, the function converts the contents of fin to the respective Unicode scalars and writes them to ustream.
2. void convertToU8() : Converts Unicode scalars from ustream to a UTF-8 stream and stores the stream to fout.
3. void convertToU16() : Converts Unicode scalars from ustream to a UTF-16 stream and stores the stream to fout.

The UIF we use will use 3 byte code points. On a 32-bit sub system, this would mean that it can easily fit inside an 'int'. Depending on your system, you may have to use other datatypes. For most of you, who would be using gcc 3.4.x [I'd recommend Dev-C++ as the IDE], this would be the case.
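
To make the 3 byte layout concrete, here is a minimal sketch of packing a scalar into 3 bytes and getting it back. The names packScalar() and unpackScalar() are hypothetical and are not the helpers used by the program; the sketch only assumes that the most significant byte is stored first:

```cpp
#include <cstdint>

// Store a Unicode scalar (at most 0x10FFFF) in 3 bytes, most significant
// byte first. Hypothetical helper, for illustration only.
void packScalar(std::uint32_t scalar, unsigned char out[3]) {
    out[0] = static_cast<unsigned char>((scalar >> 16) & 0xFF);
    out[1] = static_cast<unsigned char>((scalar >> 8) & 0xFF);
    out[2] = static_cast<unsigned char>(scalar & 0xFF);
}

// Rebuild the scalar from its 3 bytes.
std::uint32_t unpackScalar(const unsigned char in[3]) {
    return (static_cast<std::uint32_t>(in[0]) << 16) |
           (static_cast<std::uint32_t>(in[1]) << 8) |
            static_cast<std::uint32_t>(in[2]);
}
```

Three bytes are enough because the largest code point, 0x10FFFF, needs only 21 bits.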

4.3| The createUIF() function

1. Conversion of a UNICODE_U8 i.e. an UTF-8 stream:

Here's the algorithm our program is supposed to follow:

i] Read a byte. For this, you'll be using the 'char' datatype. A 'char' is defined to be exactly 1 byte in size and hence will not cause any problems on any platform. If it is 0x7F or below, it is already in the form of a Unicode scalar. Write it in the UIF, in three bytes. [For ex. if the byte is 0x61, the three bytes to be written to the file are: 0x00, 0x00 and 0x61].
ii] If the byte is above 0x7F, it falls in one of the multi byte ranges. The range can be determined by the number of '1's in the beginning of the byte. To do this, we need to AND the byte with 11000000b, 11100000b, 11110000b etc. This is how it works:

Code:

Let's say the byte we have is u = 11110110.
First we AND it with v = 11000000

v' = v & u
   =  11000000
     &11110110
   =  11000000

Now, since v' = v, we can say that the first 2 bits of u are '11'.

Now, we continue this step with increasing number of '1's in the
beginning of v until, v' != v. Once we reach this step, we know
that the number of '1's in the beginning of the byte is equal to
the number of '1's in the beginning of 'v' till the last step.

For ex., this time, let v = 11100000

v' = v & u
   = 11100000
    &11110110
   = 11100000

Here also, v' = v

Let's do it again with v = 11110000

v' = v & u
   = 11110000
    &11110110
   = 11110000

Here again, v' = v

Let's do it again with v = 11111000

v' = v & u
   = 11111000
    &11110110
   = 11110000

Here, v' != v. Therefore, the number of '1's in the beginning of the
byte are equal to 4 i.e. the number of '1's in the beginning of 'v' in
the previous step.

NOTE: We shall never be checking for v = 11111000, as the maximum number of
preceding '1's a valid UTF-8 character shall ever have is 4, since UTF-8
has been restricted to 0x10FFFF. However, this is given by me as an example
to illustrate how UTF-8 can store any given code point.

NOTE: The initial bits of the first byte, e.g. '11110', that tell us that a particular byte marks the beginning of a multi byte character will henceforth be referred to as a 'multi-byte signature'.

This is not the way we will be proceeding however. Since we know that there are only a few ranges a given code point can fall into, we will check for those ranges explicitly.
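
Checking the ranges explicitly could look something like this. It is a sketch under my own naming (utf8SequenceLength() is hypothetical), and unlike the step-by-step method above it masks the '0' that ends each signature as well, so the order of the checks does not matter:

```cpp
// Number of bytes in the UTF-8 character that starts with 'lead'.
// Returns 0 when 'lead' cannot start a character (10xxxxxx or 11111xxx).
int utf8SequenceLength(unsigned char lead) {
    if (lead <= 0x7F)           return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0)  return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0)  return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0)  return 4;  // 11110xxx
    return 0;
}
```

For the byte 11110110b used in the example it reports 4, which matches the four '1's counted by hand.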

iii] Once we know that, let the number of '1's in the beginning of the byte be x. Now, 'x' is the number of bytes that make up the character (including the current byte). So, we need to take the last (7-x) bits of the current byte (the bits left after the multi-byte signature) and the last 6 bits from each of the next (x-1) bytes, and append them into a variable. For how to go about doing this, have a look at the previous section on methods while working with bits. For ex:

Code:

UTF-8 Character sequence: 11100110 10110011 10001101
The first byte clearly shows that it is a part of a 3 byte character
sequence. Here, 'x' = 3. Therefore, 3 bytes make up this character.

Extracting the last (7-x) = 4 bits from the first byte and last 6 bits from the next
(x-1) = 2 bytes, we get: [0110] [110011] [001101]. Appending these to each other,
we have: 0110110011001101b. This is our Unicode scalar, which is U+6CCD.

When we read bytes, for example when we read the next (x-1) bytes and extract the last 6 bits from them, we have to move the file pointer (x-1) bytes ahead. This is automatically done by the fstream handlers whenever we use any of their read functions.
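
Here is a minimal sketch of this extract-and-append step working on bytes that are already in memory. The name decodeUtf8() is made up, the length is assumed to be known already (1 to 4), and no error checking is done:

```cpp
#include <cstdint>

// Decode one UTF-8 character that starts at bytes[0] and is 'length'
// bytes long (1 to 4). The input is assumed to be well formed.
std::uint32_t decodeUtf8(const unsigned char *bytes, int length) {
    if (length == 1) {
        return bytes[0];  // single byte: already the scalar
    }
    // Data bits of the first byte: everything after the signature.
    // 0x7F >> length gives 0x1F, 0x0F and 0x07 for lengths 2, 3 and 4.
    std::uint32_t scalar = bytes[0] & (0x7F >> length);
    for (int i = 1; i < length; ++i) {
        // Make room for 6 more bits and append the last 6 bits of the byte.
        scalar = (scalar << 6) | (bytes[i] & 0x3F);
    }
    return scalar;
}
```

Feeding it the bytes 0xE6 0xB3 0x8D with a length of 3 yields 0x6CCD.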

One important thing we need to answer is, what if we receive a byte which is something like 10xxxxxx, but which was not preceded by any byte having a multi-byte signature? The fact is that such an input would be an erroneous input, and when you add error checking to your program, you have to check for these kinds of errors.

Let us invent some terminology here. This will help you develop an algorithm for error checking in a UTF-8 stream. A byte, about which you have no information at all before reading it, is known as a 'free byte'. For example, in this text stream:

Code:

[All numbers in binary]

01100110 11100110 10110011 10001101 11000110 10011010

Here, the first byte is always a free byte, because we have no information about
it before processing. In this particular sequence, the second byte is also a
free byte, since we have no information about it. However, it gives us information
about the next 2 bytes. The 2nd byte says that it is the 1st byte of a 3 byte
sequence (by its multi byte signature). Thereby, we have information about the
next 2 bytes. The next 2 bytes are part of a 3 byte sequence, and we know the first
few bits of that character. Therefore, these 2 bytes are NOT free bytes.

For error checking this is useful, as a free byte can never have '10' as its first
2 bits.
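
As a starting point for error checking, the 'free byte' idea can be sketched like this. The function hasValidStructure() is hypothetical; it only checks the structure of the stream (signatures and '10' bytes), not whether the decoded values are legal code points:

```cpp
#include <cstddef>
#include <vector>

// True when no free byte starts with '10' and every first byte is
// followed by as many 10xxxxxx bytes as its signature announces.
bool hasValidStructure(const std::vector<unsigned char> &s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = s[i];  // s[i] is a free byte here
        int extra;
        if (b <= 0x7F)                extra = 0;  // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0)  extra = 1;  // 110xxxxx
        else if ((b & 0xF0) == 0xE0)  extra = 2;  // 1110xxxx
        else if ((b & 0xF8) == 0xF0)  extra = 3;  // 11110xxx
        else return false;  // free byte of the form 10xxxxxx (or 11111xxx)

        for (int k = 1; k <= extra; ++k) {
            // The announced bytes must exist and must look like 10xxxxxx.
            if (i + k >= s.size() || (s[i + k] & 0xC0) != 0x80) {
                return false;
            }
        }
        i += extra + 1;  // the next byte is a free byte again
    }
    return true;
}
```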

Once we have obtained a Unicode scalar, we have to pad it and write it. Padding is covered in the previous section. Refer to it in case you need help.

rohanprabhu · Jan 6, 2008

2. Conversion of a UNICODE_U16 i.e. an UTF-16 stream:

UTF-16 streams are relatively easy to convert to UIF as compared to UTF-8. While reading a UTF-16 stream, we will be more interested in reading sequence of 2 bytes (i.e. a 'word') rather than a sequence of single bytes. To do this, we shall use the 'short int' datatype. Again, the size of a datatype depends on your system. However, for most platforms, using gcc, 'short int' should work for you. If you use 64-bit compilers, or you are working on a DOS compiler [which runs under a 16-bit subsystem], you might have problems. All that you have to do is hunt for a datatype which has a datasize of 2 bytes.

Here's the algorithm for converting a UTF-16 stream to its Unicode scalars:

i] First get a word. Let the word be called 'u'.
ii] If the value is between 0xD800 - 0xDFFF, it forms a surrogate pair with the next word. Once you encounter such a word, it is clear that the next word is not a 'free word'. A 'free word' is analogous to a 'free byte' discussed earlier. To convert this to a Unicode scalar, we follow this algorithm:

a] Let the first value of the pair be 'u1' and the second be 'u2'. Then, we AND both the values with 1111111111b to get w1 and w2. w1 and w2 are padded to 10 bits. On appending w1 and w2, we get a value 0x10000 less than the Unicode scalar (check the UTF-16 coverage in the previous chapter for why this is so). Adding 0x10000 will give us the Unicode scalar. For ex:

Code:

u1 = 0xD834
u2 = 0xDD1E

x = 0x3FF //0x3FF = 1111111111b

w1 = u1 & x = 0x34
w2 = u2 & x = 0x11E

w1 = w1, padded to 10 bits
w2 = w2, padded to 10 bits

u = append(w1, w2) = 0xD11E
u = u + 0x10000
  = 0x1D11E

iii] If, the word does not lie in the range of 0xD800 - 0xDFFF, the word is itself the Unicode scalar, and is simply padded and written to the file.
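
The surrogate pair step can be sketched as a function on its own. The name fromSurrogatePair() is made up, and the two words are assumed to be a valid pair:

```cpp
#include <cstdint>

// Rebuild a code point from a surrogate pair (u1 first, u2 second).
// Assumes u1 is in 0xD800-0xDBFF and u2 in 0xDC00-0xDFFF.
std::uint32_t fromSurrogatePair(std::uint16_t u1, std::uint16_t u2) {
    std::uint32_t w1 = u1 & 0x3FF;        // last 10 bits of the first word
    std::uint32_t w2 = u2 & 0x3FF;        // last 10 bits of the second word
    return ((w1 << 10) | w2) + 0x10000;   // append, then add 0x10000 back
}
```

For the pair 0xD834 0xDD1E this gives back 0x1D11E.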

This completes the UIF creation part of the whole process. This is the code for the createUIF() function, with complete dissection following it:

Code:

void createUIF() {
    if(enc_inp == UNICODE_U8) {
        unsigned char u8_hold;
        unsigned char uwrite;
        unsigned char zero = 0;
        
        // The read is the loop test, so a failed read at end-of-file is never processed
        while(fin.read((char *) &u8_hold, sizeof(u8_hold))) {
            if(u8_hold <= 0x7F) {
                uwrite = u8_hold;
                padAndWrite((long)uwrite);
            } else {
                if((u8_hold & 0xF0) == 0xF0) { // 11110xxx: 4-byte character
                    unsigned char b1, b2, b3;
                    fin.read((char *) &b1, sizeof(b1));
                    fin.read((char *) &b2, sizeof(b2));
                    fin.read((char *) &b3, sizeof(b3));
                    
                    long write = ((u8_hold & 7)*0x40000L) + ((b1 & 63)*0x1000L) + ((b2 & 63)*64) + (b3 & 63);
                    padAndWrite(write);
                } else if((u8_hold & 0xE0) == 0xE0) { // 1110xxxx: 3-byte character
                    unsigned char b1, b2;
                    fin.read((char *) &b1, sizeof(b1));
                    fin.read((char *) &b2, sizeof(b2));
                    
                    long write = ((u8_hold & 15)*4096L) + ((b1 & 63)*64) + (b2 & 63);
                    padAndWrite(write);
                } else if((u8_hold & 0xC0) == 0xC0) { // 110xxxxx: 2-byte character
                    unsigned char b1;
                    fin.read((char *) &b1, sizeof(b1));
                    
                    long write = ((u8_hold & 31)*64) + (b1 & 63);
                    padAndWrite(write);
                }
            }
        }
    } else if(enc_inp == UNICODE_U16) {
        unsigned short int u16_hold;
        // u16Read() assembles one big-endian word from two bytes
        
        while(!fin.eof()) {
            u16_hold = u16Read();
            
            if(fin.eof()) {
                break;
            }
            
            if((u16_hold < 0xD800) || (u16_hold > 0xDFFF)) {
                padAndWrite(u16_hold);
            } else {
                unsigned short int b2;
                unsigned short int w1, w2;
                
                b2 = u16Read();
                w1 = u16_hold & 0x3FF;
                w2 = b2 & 0x3FF;
                
                long write = ((w1 * 0x400L) | w2) + 0x10000L;
                padAndWrite(write);
            }
        }
    }
}

This code is available, syntax highlighted at: http://knowingunicode.pastebin.com/f787db6e1

Dissection of the code:

3-5: Declaration of variables. u8_hold will hold the currently read element. Depending on u8_hold, the size of the character will be determined by checking for the multi-byte signature and further bytes will be read. uwrite will hold the byte to be written to ustream. 'zero' is a byte having the value '0'. It is useful for padding shorter bytesizes to appropriate lengths.
7-8: A while loop traverses the entire file, byte-by-byte. Each byte is stored to u8_hold.
9-12: If the currently read character forms a single-byte character, we write the character to the UIF. We use the padAndWrite() function to do this, as we need each character to be held by 3 bytes.
13-20: A multi-byte signature of 11110xxx is checked for. [11110000b == 0xF0]. If it is true, then 4 bytes form a character. The remaining 3 bytes are read and merged with the first. The same step is followed for the other ranges, where the multi-byte signatures [11100000b;0xE0] and [11000000b;0xC0] are checked for.

In our test, it is important that the larger ranges be checked for first. This is because a check for 1110xxxx using the AND operator method will just check for the presence of '111' as the first 3 bits. Hence, it will return true even in the case of a byte having 11110xxx as its structure. But, obviously, the reverse is not true. Hence, we check for larger multi-byte signatures first, and when they return false, we check for lower signatures.

Now, moving on to the UTF-16 part,

37-41: Declaration of variables, and execution of a loop, traversing the whole file, word-by-word. For traversing the file word-by-word, we use the function u16Read(). It basically reads 2 bytes, in the order they are stored in the file, and then compounds them. The word is stored to u16_hold. An obvious question is, why didn't we read a datatype of size '2' from the file? It would have been a word too. The problem is that we don't know what endianness the executable assumes the file to be in. fstream.read() and fstream.write() are used to read and write blocks of data. Once the size of the block is given, read() copies the whole block into the given datatype, and the executable then interprets those bytes in the endianness of the machine, which may or may not be the endianness the file contains. Do note that in this tutorial, for UTF-16, we cover Big-endian byte-order ONLY. Little-endian byte-order is erroneous input for our program.

48-50: If u16_hold does not lie within the range 0xD800 - 0xDFFF, it is the Unicode scalar we are looking for. It is written directly to ustream. Again, due to endianness considerations, we split it into bytes and write it to have complete control over the byte-order (handled by padAndWrite()).
51-52: If u16_hold does lie in the range of 0xD800 - 0xDFFF, variables b2, w1 and w2 are declared. The next word is read and stored into b2. This is done because, since we have the range of 0xD800 - 0xDFFF, it is obvious that the current word is a part of a surrogate pair. Hence, the next word is read, which along with u16_hold, forms a surrogate pair.
54-56: In w1 and w2, we store the last 10 bits of the two words of the surrogate pair.
58: When w1 and w2 are merged, we get 0x10000 less than the Unicode scalar we are looking for. So we add 0x10000 to it and write it to the file.

As mentioned earlier, your knowledge of C++ and fstream is assumed.

4.4| The convertToU8() and the convertToU16() functions

Both these functions use the ustream to extract a data stream of 3 bytes in size. Once we extract such a variable, we use methods to encode it to a UTF-8 stream or a UTF-16 stream and store it in fout. We'll cover 1 at a time:

1. The convertToU8() function

Once we have a scalar, let's say 'u', we have to convert it to a UTF-8 encoded character. Here's how we go about it:

i] If u <= 0x7F, it is a single-byte ASCII character, meaning the scalar in itself is a valid UTF-8 character. It can be written just as it is.
ii] If it is above that range, we check which range it falls in. Depending on the range, we prepare the first byte with the respective multi-byte signature. Using the range table, we split bits from the scalar as appropriate and then merge them with the respective 'first byte' and the bytes following it. For example:

Code:

Let u = 0x05D0 = 10111010000
//u falls in the range of 0x0080 - 0x07FF
//Therefore, the first byte will have the multi-byte signature '110xxxxx'

u1 = 11000000

//we need to extract the 'yyyyy' from 00000yyy yyzzzzzz in 'u'.

u' = u & 11111000000
   = 10111010000
    &11111000000
   = 10111000000
   = 10111 //after removing the 6 trailing zeroes

u1 = merge(u1, u')
   = 11010111

Therefore the first byte is u1 = 11010111

Now, the next byte, we initialize it with the signature
we give to all those bytes which form a byte of a multi-byte
character, but are not a 'first byte' i.e. the '10' sequence.

u2 = 10000000

//we need to extract zzzzzz from 00000yyy yyzzzzzz in 'u'

u' = u & 111111
   = 10111010000
    &00000111111
   = 00000010000
   = 010000 //padded to six bits

u2 = merge(u2, u')
   = 10010000

Therefore, the 2nd byte is u2 = 10010000
Hence the UTF-8 encoded value is: 11010111 10010000 i.e. 0xD790
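
The worked example above translates to code quite directly. A minimal sketch for the 2 byte range only, with a made-up name encodeTwoByte() and no range checking:

```cpp
// Encode a scalar from the range 0x0080 - 0x07FF as 2 UTF-8 bytes.
// The caller is assumed to have checked the range already.
void encodeTwoByte(unsigned int u, unsigned char &b1, unsigned char &b2) {
    // 'yyyyy': bits 6 to 10 of u, moved down and merged with 110xxxxx
    b1 = static_cast<unsigned char>(0xC0 | ((u & 0x7C0) >> 6));
    // 'zzzzzz': the last 6 bits of u, merged with 10xxxxxx
    b2 = static_cast<unsigned char>(0x80 | (u & 0x3F));
}
```

For u = 0x05D0 it produces 0xD7 and 0x90, the same two bytes as the calculation by hand.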

We can use the range table when we know that the code points are restricted till 0x10FFFF. However as mentioned earlier, UTF-8 can encode any given code point value. Here's how to go about it in case we generalize it to any characters:

If it is above that range, we first see how many bytes it is going to require to encode the character. Let the number of bits taken by the scalar be 'y'. Let the number of bytes be 'x'. Then, in the first byte, (x+1) bits will be used to store the multi-byte signature [x '1's and 1 '0', making it x+1]. Thereby it leaves us with 8-(x+1) = 7 - x bits for our data in the first byte. The remaining (x-1) bytes will have 6 bits each for storage, as the first 2 bits will contain the sequence '10b'. Thereby, we have 6*(x-1) = 6x - 6 bits there. In total, we have (7 - x + 6x - 6) = (5x + 1) bits for storage. These many bits should be at least the number of bits required by our character, i.e. 5x + 1 >= y, so x >= (y - 1)/5. The right-hand side won't be an integer in all cases. We require the smallest integer 'x' which satisfies this, so we apply a ceiling function to the RHS. The resulting equation is: x = ceiling((y - 1)/5). Once you have this value, you can put the first (7 - x) bits of the scalar (padded with leading zeroes to 5x + 1 bits) in the first byte, and extract the rest of the bits and store them in the subsequent bytes, 6 at a time, by using bit extraction.
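
The formula can be tried out with a few lines of code. This is a sketch with hypothetical names; values that fit in 7 bits are treated separately, since they take a single byte and carry no multi-byte signature:

```cpp
// Number of bits needed to write 'u' in binary (0 for u == 0).
int bitLength(unsigned long u) {
    int y = 0;
    while (u != 0) {
        ++y;
        u >>= 1;
    }
    return y;
}

// Number of bytes the generalized UTF-8 scheme needs for 'u'.
int utf8BytesNeeded(unsigned long u) {
    int y = bitLength(u);
    if (y <= 7) {
        return 1;          // plain single-byte character
    }
    return (y + 3) / 5;    // integer form of ceiling((y - 1) / 5)
}
```

For example, 0x05D0 takes 11 bits and needs 2 bytes, while 0x10FFFF takes 21 bits and needs 4 bytes, in line with the range table.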

Here's the convertToU8() function, dissected at every step:

Code:

void convertToU8() {
    while(!ustream.eof()) {
        unsigned char b1, b2, b3;
        ustream.read((char *) &b1, sizeof(b1));
        ustream.read((char *) &b2, sizeof(b2));
        ustream.read((char *) &b3, sizeof(b3));
        
        if(ustream.eof()) {
            break; //necessary to prevent an extra character read
        }
    
        unsigned int write = (long)((b1*0x10000) + (b2*0x100) + (b3));
    
        if(write <= 0x7F) { //includes U+0000, which must also be a single byte
            fout.write((char *) &b3, sizeof(b3)); //b3 is the low byte, i.e. the ASCII value
            cpoint++;
        } else if(write <= 0x7FF) {
            b1 = (char)(((write & 0x7C0)/0x40) + 0x06*0x20);
            b2 = (char)((write & 0x3F) + (0x02*0x40));
            
            fout.write((char *) &b1, sizeof(b1));
            fout.write((char *) &b2, sizeof(b2));
            cpoint++;
        } else if(write <= 0xFFFF) {
            b1 = (char)(((write & 0xF000)/0x1000) + (0x0E*0x10));
            b2 = (char)(((write & 0xFC0)/0x40) + (0x02*0x40));
            b3 = (char)((write & 0x3F) + (0x02*0x40));
            
            fout.write((char *) &b1, sizeof(b1));
            fout.write((char *) &b2, sizeof(b2));
            fout.write((char *) &b3, sizeof(b3));
            cpoint++;
        } else if(write <= 0x10FFFF) {
            char b0 = (char)(((write & 0x1C0000)/0x40000) + (0x1E*0x08)); //signature 11110
            b1 = (char)(((write & 0x3F000)/0x1000) + (0x02*0x40));
            b2 = (char)(((write & 0xFC0)/0x40) + (0x02*0x40));
            b3 = (char)((write & 0x3F) + (0x02*0x40));
            
            fout.write((char *) &b0, sizeof(b0));
            fout.write((char *) &b1, sizeof(b1));
            fout.write((char *) &b2, sizeof(b2));
            fout.write((char *) &b3, sizeof(b3));
            cpoint++;
        }
    }           
}

This code is available, syntax highlighted at: http://knowingunicode.pastebin.com/d179f0e2c

And.. a line by line dissection:

1-12: We now traverse ustream, 3 bytes at a time. During each iteration of the loop, 3 bytes are read, and on line 12 these 3 bytes are merged to get the Unicode scalar in a computable, integral variable, 'write'.
14-16: If 'write' is in the ASCII, i.e. single-byte, range, it is written to fout as it is.
17-22: We perform range checks to determine the proper range. Once a range is determined, a byte, 'b1' or 'b0', is designated to be the first byte. It is prepared with the required multi-byte signature and then merged with the proper segment of 'write'. For example, in lines 17-22, b1 is prepared with the signature '110' by merging '110' with the top 5 of the 11 bits of 'write'. The next byte, b2, is prepared with '10' and then merged with the remaining 6 bits of 'write'.

The same is followed for all ranges.
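
The read-and-merge step on lines 1-12 can also be written so that the loop ends by itself when the input runs out. The following is just a sketch: readScalars() is a made-up name, and it assumes, as the merge on line 12 does, that every scalar is stored as 3 bytes with the most significant byte first.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads 3-byte, big-endian scalars until the input ends.
std::vector<std::uint32_t> readScalars(std::istream& in) {
    std::vector<std::uint32_t> scalars;
    unsigned char b[3];

    // read() puts the stream in a failed state when fewer than 3 bytes remain,
    // so the loop stops without processing a half-read or stale value.
    while (in.read(reinterpret_cast<char*>(b), sizeof(b))) {
        scalars.push_back((std::uint32_t(b[0]) << 16) |
                          (std::uint32_t(b[1]) << 8) |
                           std::uint32_t(b[2]));
    }
    return scalars;
}
```

Testing the read itself in the loop condition does the same job as the explicit eof() check and break in convertToU8().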

2. The convertToU16() function

The convertToU16() function is simpler than the convertToU8() function. When we have a Unicode scalar 'u' such that u < 0x10000, it is the UTF-16 encoded value itself, and hence is written to the file just as it is. For values of 0x10000 and above, the algorithmic method has already been described in Chapter 3, so I won't go over it again here. This short section is all about the code, again dissected, which goes something like this:

Code:

void convertToU16() {    
    while(!(ustream.eof())) {
        unsigned char b1, b2, b3;
        long int cpoint;
        ustream.read((char *) &b1, sizeof(b1));
        ustream.read((char *) &b2, sizeof(b2));
        ustream.read((char *) &b3, sizeof(b3));
        if(ustream.eof()) break; //necessary to prevent an extra character being written
        cpoint = (long)((b1*0x10000) + (b2*0x100) + b3);
        
        if(cpoint < 0x10000) {
            u16Write((long)cpoint);
        } else {
            cpoint = cpoint - 0x10000;
            int cp_h, cp_l;
            int w1 = 0xD800, w2 = 0xDC00;
            
            cp_h = ((cpoint & 0xFFC00)/0x400);
            cp_l = (cpoint & 0x3FF);
            
            w1 = w1 | cp_h;
            w2 = w2 | cp_l;
            
            u16Write((long)w1);
            u16Write((long)w2);
        }    
    }
}

This code is available, syntax highlighted at: http://knowingunicode.pastebin.com/dc9b2de7

Ahh.. the simplicity of UTF-16, dissected:

1-9: As done previously, 3 bytes are read and then converted to a computable integral variable, 'cpoint'. Yeah.. keeping inconsistent names like this is not good programming practice, but I'd ask for your forgiveness here.
11-13: If 'cpoint' is less than 0x10000, it is the UTF-16 encoded value itself, and is written to fout. For endianness considerations, we define and use a function, u16Write(). It differs from padAndWrite() in that it writes to fout, and writes 2 bytes rather than 3. Its main aim is byte-order consideration.
14-25: As discussed in the algorithm, 0x10000 is first subtracted from 'cpoint'. The result is then split into 2 variables, cp_h and cp_l, containing its high and low 10 bits. The variables w1 and w2 are initialized with the values 0xD800 and 0xDC00, and the two 10-bit halves are merged into them. Once the merging is done, w1 and w2 are written using u16Write() to fout.
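
To try the surrogate arithmetic without any files, here is a small in-memory sketch. The names encodeUtf16() and appendUnit() are made up for this sketch. appendUnit() only stands in for u16Write(), whose actual code is not shown in this section, and the bigEndian flag is my own assumption about how the byte order could be selected.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for u16Write(): appends one 16-bit code unit
// to 'out' in the requested byte order.
void appendUnit(std::vector<std::uint8_t>& out, std::uint16_t unit, bool bigEndian) {
    std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
    std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
    if (bigEndian) { out.push_back(hi); out.push_back(lo); }
    else           { out.push_back(lo); out.push_back(hi); }
}

// Encodes one scalar as UTF-16 bytes. No validation is done here: the caller
// is assumed to pass a scalar <= 0x10FFFF that is not itself a surrogate.
std::vector<std::uint8_t> encodeUtf16(std::uint32_t cp, bool bigEndian) {
    std::vector<std::uint8_t> out;
    if (cp < 0x10000) {
        // The scalar is its own UTF-16 value.
        appendUnit(out, static_cast<std::uint16_t>(cp), bigEndian);
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        std::uint16_t w1 = static_cast<std::uint16_t>(0xD800 | ((cp >> 10) & 0x3FF)); // high 10 bits
        std::uint16_t w2 = static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF));         // low 10 bits
        appendUnit(out, w1, bigEndian);
        appendUnit(out, w2, bigEndian);
    }
    return out;
}
```

For example, encodeUtf16(0x1D11E, true) gives the bytes 0xD8 0x34 0xDD 0x1E, i.e. the surrogate pair 0xD834 0xDD1E. With bigEndian set to false, the two bytes inside each code unit are swapped.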

Conclusion

We have learned what Unicode is, how it is stored and how to develop your applications around it. I hope that this has been as comprehensive as I intended it to be. At around 10000 words, this is truly one of the biggest tutorials I've ever written, which, given my laziness, is quite a huge thing. I'd love your comments: drop in a reply in this forum, or send me an email at: http://www.tritium.frihost.net/tutorials/unicode/email_img.gif

I apologize for any errors in this text, because I'm pretty sure there are some :D. Hoping to see you in my next tutorial.

If you liked this tutorial, please show your support by http://digg.com/programming/All_about_Unicode_programming_in_Unicode Thank you.

Licensed under Creative Commons Attribution-Noncommercial 2.5 India License

Complete code and samples (with Windows binary): http://www.tritium.frihost.net/tutorials/unicode/uconvert.rar

http://www.tritium.frihost.net/tutorials/unicode/all_about_unicode2.jpg
