How Do Different Factors Affect the Number of Bits Needed to Represent a Letter?

  • Thread starter: Raghav Gupta
AI Thread Summary
A letter typically requires 8 bits for representation due to the binary encoding systems like ASCII, which assigns numerical values to characters. While ASCII uses 7 bits for standard characters, extended systems can accommodate additional symbols, including foreign characters through Unicode, which often uses 16 bits. The discussion also touches on how integers and characters are processed differently in computers, with ASCII converting letters to numerical values before binary representation. Various coding systems exist, including EBCDIC and extended ASCII, each with its own character limits. The conversation highlights the complexity of character encoding and the potential for data compression, suggesting that the number of bits per letter can be reduced significantly through advanced techniques.
Raghav Gupta
Why is it that a letter requires 8 bits, which is a combination of 0's and 1's?
In one byte we can have 2^8 combinations. Apart from letters and numbers, what more?
Why is 2^10 famous?
How does a combination of 0's and 1's, which is just "on" and "off", end up printing letters?
 
You can look at the table here and see all 256 symbols. Don't forget upper and lower case, punctuation marks, etc.
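
For a quick look at that mapping yourself, here is a minimal Python 3 sketch (my own illustration, not from the linked table) that prints the standard printable ASCII range next to its numeric codes:

Python:
# Print each printable ASCII character (codes 32 through 126) next to its number.
for code in range(32, 127):
    print(f"{code:3d}  {chr(code)}")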
 
phyzguy said:
You can look at the table here and see all 256 symbols. Don't forget upper and lower case, punctuation marks, etc.
But there can be so many other characters as well, like Japanese letters, etc.
How can a computer understand numbers? I thought it only understands 1 and 0.
If it understands the number, then how does the printing of a letter take place?
 
Raghav Gupta said:
Why is it that a letter requires 8 bits, which is a combination of 0's and 1's?
In one byte we can have 2^8 combinations. Apart from letters and numbers, what more?
Why is 2^10 famous?
How does a combination of 0's and 1's, which is just "on" and "off", end up printing letters?

In the Stone Age some machines had 7-bit characters to save money. There are upper-case and lower-case letters, numerals, and punctuation, which total more than 64.

2^10 is famous because it is approximately equal to one thousand.

The "on" and "off" printing characters are for teletypes, now obsolete.
 
Foreign languages typically use an encoding called Unicode, which takes two bytes (and sometimes four). Do you understand that a byte is 8 bits, so that once this translation has been done, everything is again 1's and 0's? Also, the number "16" as text is coded differently in the computer than the number 16 as an integer, which is coded differently than the number 16.0 as a floating-point number. Explaining all of this in detail is the topic of a whole book. Why don't you find a good textbook on how computers work and start there? If you have specific questions after reading it, come back and ask.
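
To make that concrete, here is a small Python 3 sketch (my own illustration, assuming a 4-byte big-endian integer and IEEE 754 single precision just for the example) showing that the same "16" ends up as three different bit patterns:

Python:
import struct

# The two-character text "16" as ASCII/UTF-8 bytes:
print('"16" as text :', "16".encode("ascii").hex())     # 3136 (codes 0x31 '1', 0x36 '6')

# The integer 16 as a 4-byte big-endian value:
print("16 as integer:", (16).to_bytes(4, "big").hex())  # 00000010

# The float 16.0 in IEEE 754 single precision:
print("16.0 as float:", struct.pack(">f", 16.0).hex())  # 41800000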
 
Unicode (UTF-8) even includes emoji:

https://en.wikipedia.org/wiki/Emoji#Blocks

How many of them display properly depends on the font that your browser uses. When new characters are added to the standard set, fonts have to be updated to include them.
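
As a small illustration (a Python 3 sketch of my own, using the "grinning face" emoji U+1F600 as the example), one emoji is a single Unicode code point but takes four bytes in UTF-8:

Python:
s = "\U0001F600"                # the "grinning face" emoji
print(hex(ord(s)))              # 0x1f600 - its Unicode code point
print(s.encode("utf-8").hex())  # f09f9880 - the four UTF-8 bytes that store it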
 
phyzguy said:
Why don't you find a good textbook on how computers work and start there?
Can you give a reference for this type of book?
 
Raghav Gupta said:
But there can be so many other characters as well, like Japanese letters, etc.
How can a computer understand numbers? I thought it only understands 1 and 0.
If it understands the number, then how does the printing of a letter take place?
The Japanese writing systems (there are at least four in common use) don't use letters. One system does use the Roman alphabet (romaji) to transliterate the sounds of Japanese, but the other three use either a syllabary (hiragana and katakana) or full-on ideograms (kanji), of which there are thousands of different signs.

https://en.wikipedia.org/wiki/Japanese_writing_system

The kanji are derived from Chinese ideograms, but are not equivalent for the most part. Typewriters and keyboards which can handle kanji or Chinese ideograms are cumbersome devices with many keys, which take a long time to master. A typical Japanese student does not become fully fluent in speaking and writing his native language (all 4 forms of writing) until he reaches his teen years. Calligraphers practice the art of drawing Japanese ideograms often for a lifetime.

The computer works with binary equivalents of numbers. Integers are converted into their binary equivalents. Floating point numbers are converted to a specially-coded binary format, which is manipulated by the computer, and the results are decoded back to a decimal number for display or printing.

When printing, the computer sends a stream of data to the printer. The printer decodes the data stream and prints the proper character. Likewise, when data is displayed by the computer on screen, the internal data is decoded into human readable characters.
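
Here is a minimal Python 3 sketch (my own illustration; assume the sender and receiver have agreed on ASCII) of what that decoding step looks like:

Python:
# A "stream of data" is just numbers; the receiving end decodes them back
# into characters using the agreed-upon encoding.
stream = bytes([72, 101, 108, 108, 111])   # the values a sender might transmit
print(list(stream))            # [72, 101, 108, 108, 111]
print(stream.decode("ascii"))  # Hello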
 
SteamKing said:
Integers are converted into their binary equivalents.
But if integers can be converted into their binary equivalents, why can't letters be directly converted to their binary equivalents? Why is ASCII needed to convert letters into numbers first? Suppose ASCII assigns the value 65 to 'A'; then what happens if we have to print just the number 65?
 
  • #10
Raghav Gupta said:
But if integers can be converted into their binary equivalents, why can't letters be directly converted to their binary equivalents? Why is ASCII needed to convert letters into numbers first? Suppose ASCII assigns the value 65 to 'A'; then what happens if we have to print just the number 65?
What's the binary equivalent of 'A' or 'q' or '&'?

The ASCII code for 'A' is 65 decimal, but the computer uses the binary equivalent 100 0001, which is also 41 hex. If you want to print the numeral '65', you must print each decimal digit, '6' and '5', in the proper order for a human to understand it.
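
A short Python 3 sketch (my own, just to make the distinction visible) shows both sides:

Python:
# The letter 'A' is stored as the number 65 (binary 100 0001, hex 41).
print(ord("A"), bin(ord("A")), hex(ord("A")))   # 65 0b1000001 0x41

# Printing the number 65 as text means printing two characters, '6' and '5',
# each of which has its own ASCII code.
print([ord(digit) for digit in str(65)])        # [54, 53]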

ASCII is a coded representation of letters, numerals, and other characters commonly found in American English writing. ASCII is not the only such system, but it is the one around which most computers operate. There are also extended ASCII, Unicode, and several other code systems in use:

https://en.wikipedia.org/wiki/ASCII

An older coding system, developed by IBM for their mainframes, was known as EBCDIC:

https://en.wikipedia.org/wiki/EBCDIC
 
  • #11
Raghav Gupta said:
Can you give a reference for this type of book?

I've heard good things about "The Elements of Computing Systems: Building a Modern Computer from First Principles" by Noam Nisan and Shimon Schocken.
 
  • #12
Raghav Gupta said:
Why is it that a letter requires 8 bits
Well, it does and it doesn't. Back in the infancy of computers, a character (the general term for "letter") used a different number of bits:
  • Telex code = 5 bits
  • Some computers used 6 bits/character
  • Communication with early alphanumeric terminals: 7 data bits plus a parity bit (ASCII), or 8-bit EBCDIC on IBM equipment
The "modern" age of computers:
 
  • #13
So how many bits does a letter really need/carry?

A 27-symbol alphabet (26 letters plus space) would directly require 5 bits per letter.
Using a base-27 numeral system, we would need lg(27) ~ 4.75 bits/letter instead (the sketch at the end of this post reproduces these first two estimates).
Huffman coding would need ~ 4.12 bits/letter.
A better order-0 entropy coder: ~ 4.08 bits/letter.
Order 1 (Markov: the previous letter is the context): ~ 3.3 bits/letter.
Order 2: ~ 3.1 bits/letter.
Using the probability distribution among whole words: ~ 2.1 bits/letter.
...
The best compressor, "cmix v8" ( http://mattmahoney.net/dc/text.html ),
compresses a 10^9-byte text file to 123,930,173 bytes, which is less than 1 bit/letter.
...
The Hilberg conjecture suggests that, due to long-range correlations, the entropy of text grows sublinearly:
H(n) ~ n^beta, where beta < 0.9.
http://www.ipipan.waw.pl/~ldebowsk/docs/seminaria/hilberg.pdf
In other words, when compressing two texts concatenated, we need less than the sum of the compressed sizes of the separate files.
So does this conjecture suggest that the number of bits per letter approaches zero?
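
Here is a minimal Python 3 sketch (my own; the sample text and the resulting order-0 number are only illustrative, since the ~4.1 figure above refers to real English text) that reproduces the first two estimates in this post:

Python:
import math
from collections import Counter

# A toy sample: letters plus spaces, lower-cased.
text = ("the quick brown fox jumps over the lazy dog " * 100).lower()
letters = [c for c in text if c.isalpha() or c == " "]

# Uniform coding of a 27-symbol alphabet (26 letters + space):
print("uniform, 27 symbols:", round(math.log2(27), 2), "bits/letter")

# Order-0 entropy of the actual symbol frequencies in the sample:
counts = Counter(letters)
total = sum(counts.values())
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
print("order-0 entropy    :", round(entropy, 2), "bits/letter")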
 