Information Problem

  • Thread starter Pauly Man
  • Start date
126
0
How many books could you write onto a 40GB hard drive given the following assumptions?

Assumptions-

  • Must be able to uniquely encode the numerals 0-9
  • Must be able to uniquely encode the lower and upper case letters of the alphabet
  • Must be able to uniquely encode spaces and full stops
  • A book is assumed to have approximately 400 pages in it

Basically the problem is asking these questions:

  • How many bits does it take to uniquely encode a character as described in the above assumptions?
  • How many characters can be encoded onto a 40GB hard drive?
  • How many books does that correspond to?

I want to see how people go about finding a solution to this problem. I got bored on the train to university this morning and wasted the time thinking about this problem. When a few people have had a go at the problem I'll post my working out.
 

chroot

Staff Emeritus
Science Advisor
Gold Member
10,166
34
Take the total number of symbols in the alphabet you've specified to be N.

If you want to see how many books you can fit naively, just assume that you need ceil(log2(N)) bits to encode each symbol. The number of books you can store using this naive encoding is extremely easy to calculate.

If you want to be more sophisticated and see how many books you can actually fit into 40GB, perform a frequency analysis over a good-sized sample, and create a Huffman coding for your alphabet. Encode your books using this coding. You may then want to apply some kind of sliding-window or dictionary-based compression scheme. The number of books you can store using these techniques is likely to be something on the order of seven times as many as you can store in the naive encoding, largely because English text has only (IIRC) 1.8 bits per symbol of irreducible information.

- Warren
 

HallsofIvy

Science Advisor
Homework Helper
41,712
876
You have required that there be 63 distinct symbols. n bits can encode 2^n symbols so we would need 6 bits (2^6= 63). This is basically the "ceil(log2(N))" chroot gave.

There are, however, a few pieces of information you did not give. You say "A book is assumed to have approximately 400 pages in it" but you DON'T say how many symbols to the page! Also, while you talk about individual characters, your "40 GB" is 40 million BYTES (well actually 40*(1,024,000)= 40,960,000 bytess). You don't say how we are to assume bits are fit into bytes. Assuming that we ONLY need the 63 symbols given and are trying to fit as many symbols as possible onto a drive, so that we abut one 6 bit symbol to another, 40 GB= 327680000 or 327680000/6 = 54,613,333 symbols on the drive (with 2 bits left over).

That would be a lot of work and it would be much simpler to use 8 bits, the entire byte, to represent a symbol as is actually done.
In that case we could fit 327680000/8= 40,960,000 symbols on a drive,
allowing for 2^8= 256 different symbols.

We still can't answer "How many books?" because you haven't told us how many symbols to a page (or how many symbols to a book).



If a drive can hold 40GB
 

Hurkyl

Staff Emeritus
Science Advisor
Gold Member
14,845
17
And there are also some fancy space saving techniques like saving multiple occurances of a single string in multiple books as one entry on the hard drive. In an extreme case, you could fit an infinite number of books in your 40 GB hard drive if they were all the same by just saving one copy and symbolically linking all other copies to that one book.

Hurkyl
 
191
0
1GB=1024MB=1024*1024KB ~ 1 billion bytes=8 billion bits...
Or am I wrong ?
 
126
0
Good points Hurkyl.

I used microsoft word to work out that on an A4 page, you can fit around 2500-3000 symbols on a page (using 12 sized font and times new roman), and since an A4 page is roughly twice the size of a typical paperback page, you could fit roughly 1250-1500 symbols to a paperback page. Assuming 400 pages to a book then you could typically fit 500,000-600,000 symbols in a book.

Using your 54,613,333 symbols on a 40GB hard drive then you could fit 91-109 books on a 40GB hard drive.
 

damgo

^^^ Whoa, that's way to low. Uncompressed:

There are 64 characters you require, so that 6 bits. Let's round up to ASCII, 8 bits/character, to give us some breathing space: 1char/byte.

So 40 billion symbols in a 40gig drive = 80,000 books

Plain Huffman compression should at least double or triple this, and a book-optimized encoding scheme even more -- with chroot's 7x figure, ~half a million books.
 
126
0
Originally posted by damgo

So 40 billion symbols in a 40gig drive = 80,000 books

I somehow lost an entire order of magnitude in my calculations. You are entirely correct, it is between 66,666 and 80,000 books. That seems much more reasonable than approximately 100 books.

I have ignored compression throughout this problem, as I wanted to see the numbers that popped out assuming only very basic encoding. Those numbers above are still very impressive indeed.
 
Last edited:

chroot

Staff Emeritus
Science Advisor
Gold Member
10,166
34
1 GB is defined (by hard-drive manufacturers anyway) to be 10^9 (1 billion) bytes.

- Warren
 

Related Threads for: Information Problem

  • Posted
Replies
2
Views
1K
  • Posted
Replies
5
Views
4K
  • Posted
Replies
1
Views
4K
  • Posted
Replies
3
Views
795
  • Posted
2
Replies
42
Views
4K
  • Posted
Replies
13
Views
8K
  • Posted
2
Replies
32
Views
4K
  • Posted
Replies
4
Views
2K

Physics Forums Values

We Value Quality
• Topics based on mainstream science
• Proper English grammar and spelling
We Value Civility
• Positive and compassionate attitudes
• Patience while debating
We Value Productivity
• Disciplined to remain on-topic
• Recognition of own weaknesses
• Solo and co-op problem solving

Hot Threads

Top