- #1
rohanprabhu
- 414
- 2
http://www.tritium.frihost.net/tutorials/unicode/all_about_unicode.jpg
NOTE: To view the tutorial properly, your browser must be setup to display Unicode text properly. Please refer to this article in order to do so: http://tinyurl.com/2v5nfd
http://digg.com/programming/All_about_Unicode_programming_in_Unicode
Prologue
Long after completing my exams, I now have this urge of writing another tutorial. And I chose to cover Unicode. Unicode is a very useful industry standard, and it's use is going to increase exponentially in the coming years. Almost all applications today have support for Unicode. Windows 2000/NT/XP use Unicode [in the form of UTF-16] for internal text representation. Java and the .NET environment also use it for internal text representation. Despite it's utility, there were very few resources on the net or anywhere else to cover the Unicode standard completely. Ofcourse, you have the 'Unicode Standard 5.0' itself as a free download at the Unicode site, a whole book is sometimes only pursued by enthusiasts in a particular area. As a programmer, or even as a user, you need to know what Unicode is, how it works, what are character encodings, what the transformation formats [UTF] are and what UCS means. Though depending on your role, the amount you need to know differs, I'd say reading till Chapter 2 would be recommended even for general users. We, Indians are a part of the group who extensively use non-Latin characters for communication and we, as programmers are a part of the group who cater to a clientele from various parts of the world, many of whom extensively use non-Latin characters for communication. So, let's begin with 'all about Unicode'.
Introduction
Unicode is an industry standard which is used to store, retrieve and transmit text in many of the world's writing systems. Unicode's coverage of writing system is exhaustive. It is so exhaustive that it even contains symbols from some fictional scripts. Not only fictional scripts, Unicode also includes some of the commonly used symbols like the gender symbols for male/female etc. Before going to Unicode, we'll first cover 'character encoding'. Once we learn what character encoding is, we'll go into 'The Unicode family' which would be an introduction to the various standards and technologies that come under Unicode. Later on chapters are targetted at programmers. In this chapters all numbers are in decimal unless prefixed by 0x or U+ [meaning hexadecimal] or suffixed by 'b' [meaning binary].
Chapter 1: Character Encoding
All data on computers and in electronics is stored in terms of numbers. At the actual hardware level, it may be stored as charge on a capacitor, a magnetic field, may be transmitted as a particular amplitude or a frequency in a waveform, but for programmers, they all correspond to numbers. The letter 'A' is a number. The symbol it represents is stored in the form of 'glyphs' [glyphs are what form 'fonts'] which are basically drawings, which are represented using using various numbers. In this tutorial, we would not be going into how Unicode text is rendered or how fonts work, but we would be talking about how text from various writing systems is represented as a sequence of numbers. The display of Unicode characters is handled well by applications today, who evolved a great ability to use character from various fonts to render the text.
Coming back to the story of numbers, text also needs to be represented by numbers. To do this, various standards exist. These standards are collectively known as 'Character Encodings'. A popular character encoding is ASCII. What a character encoding does is pair a number and a symbol. For ex., in ASCII the number 97 is 'a'. Now, ASCII does not tell anything about the shape of the symbol 'a'. It talks about the character 'a'. We can say something like in ASCII the number 97 is the first letter of the english alphabet, lower case. So, when an application receives a text stream, concludes it to be encoded in ASCII and encounters the number 97 in the stream, the application renders the 'a' symbol on the screen. Here '97' is referred to as the 'code point' of 'a'. The information of the symbol is stored in fonts [actually, glyphs]. The fonts, on the other hand, pair a number and a glyph, so that given a particular number, a particular glyph i.e. a symbol can be displayed.
1.1| The ASCII encoding system
ASCII is a 8-bit encoding system. As we all know that in computers all numbers at the lowest level are manipulated in the binary numeral system. Every digit is a called a 'bit'. Each bit can have a value of either '1' or '0'. And 8-bits make a byte. In ASCII, every byte represents a character. However, ASCII does not use the 8th bit. It uses only 7 bits i.e the range from 00000000b - 01111111b i.e 128 characters. The first bit was earlier used for carrying information about the data protocol over the stream (most commonly to carry the parity bit). This is the reason some that many UTF-8 text streams appear corrupted while using email [covered later in the chapter]. ASCII contains 95 printable characters. Though these characters are more than sufficient for communication in English, and bashing on IRC, the need for coverage of more characters was clearly felt because these characters weren't enough to transfer text in some languages that even used the Latin alphabet. Other languages including Chinese/Japanese/Korean, Devenagari etc. were a different issue altogether.
Other than printable characters, ASCII reserved the first 32 characters for 'control characters'. These characters are not used for transmitting readable text, but rather to control devices. For ex., the number '17' in ASCII is designated as 'Device Control 1' and was oftenly used to switch off devices, whereas '19', 'Device Control 3' was oftenly used to switch off devices. Due to the complexity involved in todays hardware, this method is completely deprecated.
1.2| The ISO-8859 standard
The ISO/IEC put forward a standard named ISO/IEC 8859 to overcome the shortcomings of ASCII. Most of the characters had the same code points as in ASCII to maintain backward compatibility. ISO-8859 increased the number of characters that can be represented by a single byte by making use of the 8th byte, giving in a total of 256 characters: twice that of ASCII. However, this still did not meet the requirements and many subsequent 'parts' were developed for encoding other characters. There were 16 such parts. For example, the 3rd part known as ISO-8859-3 encodes characters in the Turkish, Maltese, and Esperanto writing system. This still was not enough as a text stream could be encoded in only one encoding stream. So, it was not possible to have Turkish [ISO-8859-3] and Lativian [ISO-8859-4] characters in the same file. [Not completely true. The ISO 2022 standard had a solution. More more information, search for ISO 2022].
An encoding system that could cover all the writing systems was needed.
1.3| ISCII, JIS X 0201 and other encodings
The characters sets that were not implemented in the ISO-8859 standard were then implemented using alternative encodings. Devanagari was earlier a part of ISO-8859-12, but then ISCII took over. ISCII for Indian Standard Code for Information Interchange gave an encoding system for many Indian writing systems including Devanagari, Gurumukhi etc.
JIS X 0201 'Japanese Industrial Standard X 0201', was a character encoding for the Katakana writing system.
Such encoding systems were proving to be a problem as they could not be used in the same file. Moreover, due to more number of encodings, protability was also proving to be a problem.
1.4| Problems with typography
Many characters important for typography were not a part of these standards for example, curly quotation marks. It was because of the already existing shortage of code points that all standards focused on transmitting text rather than the design aspect of it. Including the design aspect, the multi-lingual aspect and the internet bringing people from all segments and locations on the computer, it was necessary for an encoding that supported a huge number of characters to become existent.
1.5| Understanding a text stream
A stream is just a sequence of numbers. This sequence of numbers can come in from a file on your hard disk or via a channel on your network. The application that receives this stream is to decide what to do with that stream. If it detects it as text, it's a text stream. If it detects is as some video file, it is a multimedia stream. From now on, you need to know that files are NOT categorized as text files and binary files. Only binary files exist. Files are nothing but a specific segement of the storage which stores some numbers in a sequence. If the categorization is not there, why have you been taught so in school? This is what we'll go into the later parts of the chapter. For now, we'll cover what Unicode is all about.
NOTE: To view the tutorial properly, your browser must be setup to display Unicode text properly. Please refer to this article in order to do so: http://tinyurl.com/2v5nfd
http://digg.com/programming/All_about_Unicode_programming_in_Unicode
Prologue
Long after completing my exams, I now have this urge of writing another tutorial. And I chose to cover Unicode. Unicode is a very useful industry standard, and it's use is going to increase exponentially in the coming years. Almost all applications today have support for Unicode. Windows 2000/NT/XP use Unicode [in the form of UTF-16] for internal text representation. Java and the .NET environment also use it for internal text representation. Despite it's utility, there were very few resources on the net or anywhere else to cover the Unicode standard completely. Ofcourse, you have the 'Unicode Standard 5.0' itself as a free download at the Unicode site, a whole book is sometimes only pursued by enthusiasts in a particular area. As a programmer, or even as a user, you need to know what Unicode is, how it works, what are character encodings, what the transformation formats [UTF] are and what UCS means. Though depending on your role, the amount you need to know differs, I'd say reading till Chapter 2 would be recommended even for general users. We, Indians are a part of the group who extensively use non-Latin characters for communication and we, as programmers are a part of the group who cater to a clientele from various parts of the world, many of whom extensively use non-Latin characters for communication. So, let's begin with 'all about Unicode'.
Introduction
Unicode is an industry standard which is used to store, retrieve and transmit text in many of the world's writing systems. Unicode's coverage of writing system is exhaustive. It is so exhaustive that it even contains symbols from some fictional scripts. Not only fictional scripts, Unicode also includes some of the commonly used symbols like the gender symbols for male/female etc. Before going to Unicode, we'll first cover 'character encoding'. Once we learn what character encoding is, we'll go into 'The Unicode family' which would be an introduction to the various standards and technologies that come under Unicode. Later on chapters are targetted at programmers. In this chapters all numbers are in decimal unless prefixed by 0x or U+ [meaning hexadecimal] or suffixed by 'b' [meaning binary].
Chapter 1: Character Encoding
All data on computers and in electronics is stored in terms of numbers. At the actual hardware level, it may be stored as charge on a capacitor, a magnetic field, may be transmitted as a particular amplitude or a frequency in a waveform, but for programmers, they all correspond to numbers. The letter 'A' is a number. The symbol it represents is stored in the form of 'glyphs' [glyphs are what form 'fonts'] which are basically drawings, which are represented using using various numbers. In this tutorial, we would not be going into how Unicode text is rendered or how fonts work, but we would be talking about how text from various writing systems is represented as a sequence of numbers. The display of Unicode characters is handled well by applications today, who evolved a great ability to use character from various fonts to render the text.
Coming back to the story of numbers, text also needs to be represented by numbers. To do this, various standards exist. These standards are collectively known as 'Character Encodings'. A popular character encoding is ASCII. What a character encoding does is pair a number and a symbol. For ex., in ASCII the number 97 is 'a'. Now, ASCII does not tell anything about the shape of the symbol 'a'. It talks about the character 'a'. We can say something like in ASCII the number 97 is the first letter of the english alphabet, lower case. So, when an application receives a text stream, concludes it to be encoded in ASCII and encounters the number 97 in the stream, the application renders the 'a' symbol on the screen. Here '97' is referred to as the 'code point' of 'a'. The information of the symbol is stored in fonts [actually, glyphs]. The fonts, on the other hand, pair a number and a glyph, so that given a particular number, a particular glyph i.e. a symbol can be displayed.
1.1| The ASCII encoding system
ASCII is a 8-bit encoding system. As we all know that in computers all numbers at the lowest level are manipulated in the binary numeral system. Every digit is a called a 'bit'. Each bit can have a value of either '1' or '0'. And 8-bits make a byte. In ASCII, every byte represents a character. However, ASCII does not use the 8th bit. It uses only 7 bits i.e the range from 00000000b - 01111111b i.e 128 characters. The first bit was earlier used for carrying information about the data protocol over the stream (most commonly to carry the parity bit). This is the reason some that many UTF-8 text streams appear corrupted while using email [covered later in the chapter]. ASCII contains 95 printable characters. Though these characters are more than sufficient for communication in English, and bashing on IRC, the need for coverage of more characters was clearly felt because these characters weren't enough to transfer text in some languages that even used the Latin alphabet. Other languages including Chinese/Japanese/Korean, Devenagari etc. were a different issue altogether.
Other than printable characters, ASCII reserved the first 32 characters for 'control characters'. These characters are not used for transmitting readable text, but rather to control devices. For ex., the number '17' in ASCII is designated as 'Device Control 1' and was oftenly used to switch off devices, whereas '19', 'Device Control 3' was oftenly used to switch off devices. Due to the complexity involved in todays hardware, this method is completely deprecated.
1.2| The ISO-8859 standard
The ISO/IEC put forward a standard named ISO/IEC 8859 to overcome the shortcomings of ASCII. Most of the characters had the same code points as in ASCII to maintain backward compatibility. ISO-8859 increased the number of characters that can be represented by a single byte by making use of the 8th byte, giving in a total of 256 characters: twice that of ASCII. However, this still did not meet the requirements and many subsequent 'parts' were developed for encoding other characters. There were 16 such parts. For example, the 3rd part known as ISO-8859-3 encodes characters in the Turkish, Maltese, and Esperanto writing system. This still was not enough as a text stream could be encoded in only one encoding stream. So, it was not possible to have Turkish [ISO-8859-3] and Lativian [ISO-8859-4] characters in the same file. [Not completely true. The ISO 2022 standard had a solution. More more information, search for ISO 2022].
An encoding system that could cover all the writing systems was needed.
1.3| ISCII, JIS X 0201 and other encodings
The characters sets that were not implemented in the ISO-8859 standard were then implemented using alternative encodings. Devanagari was earlier a part of ISO-8859-12, but then ISCII took over. ISCII for Indian Standard Code for Information Interchange gave an encoding system for many Indian writing systems including Devanagari, Gurumukhi etc.
JIS X 0201 'Japanese Industrial Standard X 0201', was a character encoding for the Katakana writing system.
Such encoding systems were proving to be a problem as they could not be used in the same file. Moreover, due to more number of encodings, protability was also proving to be a problem.
1.4| Problems with typography
Many characters important for typography were not a part of these standards for example, curly quotation marks. It was because of the already existing shortage of code points that all standards focused on transmitting text rather than the design aspect of it. Including the design aspect, the multi-lingual aspect and the internet bringing people from all segments and locations on the computer, it was necessary for an encoding that supported a huge number of characters to become existent.
1.5| Understanding a text stream
A stream is just a sequence of numbers. This sequence of numbers can come in from a file on your hard disk or via a channel on your network. The application that receives this stream is to decide what to do with that stream. If it detects it as text, it's a text stream. If it detects is as some video file, it is a multimedia stream. From now on, you need to know that files are NOT categorized as text files and binary files. Only binary files exist. Files are nothing but a specific segement of the storage which stores some numbers in a sequence. If the categorization is not there, why have you been taught so in school? This is what we'll go into the later parts of the chapter. For now, we'll cover what Unicode is all about.
Last edited by a moderator: