# Information Problem

How many books could you write onto a 40GB hard drive given the following assumptions?

Assumptions-

• Must be able to uniquely encode the numerals 0-9
• Must be able to uniquely encode the lower and upper case letters of the alphabet
• Must be able to uniquely encode spaces and full stops
• A book is assumed to have approximately 400 pages in it

Basically the problem is asking these questions:

• How many bits does it take to uniquely encode a character as described in the above assumptions?
• How many characters can be encoded onto a 40GB hard drive?
• How many books does that correspond to?

I want to see how people go about finding a solution to this problem. I got bored on the train to university this morning and wasted the time thinking about this problem. When a few people have had a go at the problem I'll post my working out.

chroot
Staff Emeritus
Gold Member
Take the total number of symbols in the alphabet you've specified to be N.

If you want to see how many books you can fit naively, just assume that you need ceil(log2(N)) bits to encode each symbol. The number of books you can store using this naive encoding is extremely easy to calculate.

If you want to be more sophisticated and see how many books you can actually fit into 40GB, perform a frequency analysis over a good-sized sample, and create a Huffman coding for your alphabet. Encode your books using this coding. You may then want to apply some kind of sliding-window or dictionary-based compression scheme. The number of books you can store using these techniques is likely to be something on the order of seven times as many as you can store in the naive encoding, largely because English text has only (IIRC) 1.8 bits per symbol of irreducible information.

- Warren

HallsofIvy
Homework Helper
You have required that there be 63 distinct symbols. n bits can encode 2^n symbols so we would need 6 bits (2^6= 63). This is basically the "ceil(log2(N))" chroot gave.

There are, however, a few pieces of information you did not give. You say "A book is assumed to have approximately 400 pages in it" but you DON'T say how many symbols to the page! Also, while you talk about individual characters, your "40 GB" is 40 million BYTES (well actually 40*(1,024,000)= 40,960,000 bytess). You don't say how we are to assume bits are fit into bytes. Assuming that we ONLY need the 63 symbols given and are trying to fit as many symbols as possible onto a drive, so that we abut one 6 bit symbol to another, 40 GB= 327680000 or 327680000/6 = 54,613,333 symbols on the drive (with 2 bits left over).

That would be a lot of work and it would be much simpler to use 8 bits, the entire byte, to represent a symbol as is actually done.
In that case we could fit 327680000/8= 40,960,000 symbols on a drive,
allowing for 2^8= 256 different symbols.

We still can't answer "How many books?" because you haven't told us how many symbols to a page (or how many symbols to a book).

If a drive can hold 40GB

Hurkyl
Staff Emeritus
Gold Member
And there are also some fancy space saving techniques like saving multiple occurances of a single string in multiple books as one entry on the hard drive. In an extreme case, you could fit an infinite number of books in your 40 GB hard drive if they were all the same by just saving one copy and symbolically linking all other copies to that one book.

Hurkyl

1GB=1024MB=1024*1024KB ~ 1 billion bytes=8 billion bits...
Or am I wrong ?

Good points Hurkyl.

I used microsoft word to work out that on an A4 page, you can fit around 2500-3000 symbols on a page (using 12 sized font and times new roman), and since an A4 page is roughly twice the size of a typical paperback page, you could fit roughly 1250-1500 symbols to a paperback page. Assuming 400 pages to a book then you could typically fit 500,000-600,000 symbols in a book.

Using your 54,613,333 symbols on a 40GB hard drive then you could fit 91-109 books on a 40GB hard drive.

damgo
^^^ Whoa, that's way to low. Uncompressed:

There are 64 characters you require, so that 6 bits. Let's round up to ASCII, 8 bits/character, to give us some breathing space: 1char/byte.

So 40 billion symbols in a 40gig drive = 80,000 books

Plain Huffman compression should at least double or triple this, and a book-optimized encoding scheme even more -- with chroot's 7x figure, ~half a million books.

Originally posted by damgo

So 40 billion symbols in a 40gig drive = 80,000 books

I somehow lost an entire order of magnitude in my calculations. You are entirely correct, it is between 66,666 and 80,000 books. That seems much more reasonable than approximately 100 books.

I have ignored compression throughout this problem, as I wanted to see the numbers that popped out assuming only very basic encoding. Those numbers above are still very impressive indeed.

Last edited:
chroot
Staff Emeritus