Calculating Books on 40GB Hard Drive: Assumptions & Solutions

  • Context: Undergrad 
  • Thread starter Thread starter Pauly Man
  • Start date Start date
  • Tags Tags
    Information
Click For Summary
SUMMARY

This discussion focuses on calculating the number of books that can be stored on a 40GB hard drive based on specific encoding assumptions. Participants established that 63 distinct symbols require 6 bits for encoding, allowing for approximately 54,613,333 symbols on the drive. Using an average of 500,000-600,000 symbols per book, the naive calculation suggests that 91-109 books can be stored, while advanced compression techniques like Huffman coding could increase this capacity to around 80,000 books or more, potentially reaching half a million with optimized encoding.

PREREQUISITES
  • Understanding of binary encoding and bit calculations
  • Familiarity with Huffman coding and compression techniques
  • Knowledge of character encoding standards (e.g., ASCII)
  • Basic grasp of file storage capacities (GB to bytes conversion)
NEXT STEPS
  • Research Huffman coding implementation for text compression
  • Explore advanced encoding schemes for optimizing storage
  • Learn about character encoding standards and their applications
  • Investigate file system structures and their impact on storage efficiency
USEFUL FOR

This discussion is beneficial for software developers, data scientists, and anyone involved in data storage optimization, particularly those interested in text encoding and compression methodologies.

Pauly Man
Messages
129
Reaction score
0
How many books could you write onto a 40GB hard drive given the following assumptions?

Assumptions-

  • Must be able to uniquely encode the numerals 0-9
  • Must be able to uniquely encode the lower and upper case letters of the alphabet
  • Must be able to uniquely encode spaces and full stops
  • A book is assumed to have approximately 400 pages in it

Basically the problem is asking these questions:

  • How many bits does it take to uniquely encode a character as described in the above assumptions?
  • How many characters can be encoded onto a 40GB hard drive?
  • How many books does that correspond to?

I want to see how people go about finding a solution to this problem. I got bored on the train to university this morning and wasted the time thinking about this problem. When a few people have had a go at the problem I'll post my working out.
 
Physics news on Phys.org
Take the total number of symbols in the alphabet you've specified to be N.

If you want to see how many books you can fit naively, just assume that you need ceil(log2(N)) bits to encode each symbol. The number of books you can store using this naive encoding is extremely easy to calculate.

If you want to be more sophisticated and see how many books you can actually fit into 40GB, perform a frequency analysis over a good-sized sample, and create a Huffman coding for your alphabet. Encode your books using this coding. You may then want to apply some kind of sliding-window or dictionary-based compression scheme. The number of books you can store using these techniques is likely to be something on the order of seven times as many as you can store in the naive encoding, largely because English text has only (IIRC) 1.8 bits per symbol of irreducible information.

- Warren
 
You have required that there be 63 distinct symbols. n bits can encode 2^n symbols so we would need 6 bits (2^6= 63). This is basically the "ceil(log2(N))" chroot gave.

There are, however, a few pieces of information you did not give. You say "A book is assumed to have approximately 400 pages in it" but you DON'T say how many symbols to the page! Also, while you talk about individual characters, your "40 GB" is 40 million BYTES (well actually 40*(1,024,000)= 40,960,000 bytess). You don't say how we are to assume bits are fit into bytes. Assuming that we ONLY need the 63 symbols given and are trying to fit as many symbols as possible onto a drive, so that we abut one 6 bit symbol to another, 40 GB= 327680000 or 327680000/6 = 54,613,333 symbols on the drive (with 2 bits left over).

That would be a lot of work and it would be much simpler to use 8 bits, the entire byte, to represent a symbol as is actually done.
In that case we could fit 327680000/8= 40,960,000 symbols on a drive,
allowing for 2^8= 256 different symbols.

We still can't answer "How many books?" because you haven't told us how many symbols to a page (or how many symbols to a book).



If a drive can hold 40GB
 
And there are also some fancy space saving techniques like saving multiple occurances of a single string in multiple books as one entry on the hard drive. In an extreme case, you could fit an infinite number of books in your 40 GB hard drive if they were all the same by just saving one copy and symbolically linking all other copies to that one book.

Hurkyl
 
1GB=1024MB=1024*1024KB ~ 1 billion bytes=8 billion bits...
Or am I wrong ?
 
Good points Hurkyl.

I used microsoft word to work out that on an A4 page, you can fit around 2500-3000 symbols on a page (using 12 sized font and times new roman), and since an A4 page is roughly twice the size of a typical paperback page, you could fit roughly 1250-1500 symbols to a paperback page. Assuming 400 pages to a book then you could typically fit 500,000-600,000 symbols in a book.

Using your 54,613,333 symbols on a 40GB hard drive then you could fit 91-109 books on a 40GB hard drive.
 
^^^ Whoa, that's way to low. Uncompressed:

There are 64 characters you require, so that 6 bits. Let's round up to ASCII, 8 bits/character, to give us some breathing space: 1char/byte.

So 40 billion symbols in a 40gig drive = 80,000 books

Plain Huffman compression should at least double or triple this, and a book-optimized encoding scheme even more -- with chroot's 7x figure, ~half a million books.
 
Originally posted by damgo

So 40 billion symbols in a 40gig drive = 80,000 books


I somehow lost an entire order of magnitude in my calculations. You are entirely correct, it is between 66,666 and 80,000 books. That seems much more reasonable than approximately 100 books.

I have ignored compression throughout this problem, as I wanted to see the numbers that popped out assuming only very basic encoding. Those numbers above are still very impressive indeed.
 
Last edited:
1 GB is defined (by hard-drive manufacturers anyway) to be 10^9 (1 billion) bytes.

- Warren
 

Similar threads

  • · Replies 11 ·
Replies
11
Views
2K
  • · Replies 11 ·
Replies
11
Views
2K
  • · Replies 1 ·
Replies
1
Views
1K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 28 ·
Replies
28
Views
3K
Replies
4
Views
2K
Replies
5
Views
3K
  • · Replies 33 ·
2
Replies
33
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K