Calculating Books on 40GB Hard Drive: Assumptions & Solutions

  • Context: Undergrad 
  • Thread starter Thread starter Pauly Man
  • Start date Start date
  • Tags Tags
    Information
Click For Summary

Discussion Overview

The discussion revolves around calculating how many books can be stored on a 40GB hard drive based on specific encoding assumptions for characters. Participants explore various methods of encoding, the implications of compression, and the number of symbols per page in books.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant proposes calculating the number of bits required to encode characters based on the specified symbols and assumptions about book length.
  • Another suggests using frequency analysis and Huffman coding to potentially increase the number of books that can be stored compared to naive encoding.
  • A participant calculates that 63 distinct symbols would require 6 bits for encoding, but notes the lack of information about symbols per page and how bits fit into bytes.
  • One participant introduces the idea of saving space by linking multiple occurrences of the same string, theoretically allowing for an infinite number of books if they are identical.
  • Another participant provides estimates of symbols per page based on font size and page dimensions, leading to a calculation of approximately 91-109 books fitting on the drive.
  • One participant revises their earlier calculation, suggesting that with proper encoding and compression, the number of books could be as high as 80,000, and potentially much more with optimized encoding schemes.
  • Another participant clarifies the definition of 1GB as 10^9 bytes, which influences calculations regarding storage capacity.

Areas of Agreement / Disagreement

Participants express differing views on the number of books that can be stored, with some calculations suggesting figures ranging from 91-109 to as high as 80,000 or more, indicating a lack of consensus on the final number. The discussion remains unresolved regarding the exact impact of compression and encoding methods.

Contextual Notes

Participants note missing assumptions, such as the number of symbols per page and how bits are allocated within bytes, which complicates the calculations. The discussion also highlights the variability in definitions of storage capacity, particularly regarding the interpretation of gigabytes.

Pauly Man
Messages
129
Reaction score
0
How many books could you write onto a 40GB hard drive given the following assumptions?

Assumptions-

  • Must be able to uniquely encode the numerals 0-9
  • Must be able to uniquely encode the lower and upper case letters of the alphabet
  • Must be able to uniquely encode spaces and full stops
  • A book is assumed to have approximately 400 pages in it

Basically the problem is asking these questions:

  • How many bits does it take to uniquely encode a character as described in the above assumptions?
  • How many characters can be encoded onto a 40GB hard drive?
  • How many books does that correspond to?

I want to see how people go about finding a solution to this problem. I got bored on the train to university this morning and wasted the time thinking about this problem. When a few people have had a go at the problem I'll post my working out.
 
Physics news on Phys.org
Take the total number of symbols in the alphabet you've specified to be N.

If you want to see how many books you can fit naively, just assume that you need ceil(log2(N)) bits to encode each symbol. The number of books you can store using this naive encoding is extremely easy to calculate.

If you want to be more sophisticated and see how many books you can actually fit into 40GB, perform a frequency analysis over a good-sized sample, and create a Huffman coding for your alphabet. Encode your books using this coding. You may then want to apply some kind of sliding-window or dictionary-based compression scheme. The number of books you can store using these techniques is likely to be something on the order of seven times as many as you can store in the naive encoding, largely because English text has only (IIRC) 1.8 bits per symbol of irreducible information.

- Warren
 
You have required that there be 63 distinct symbols. n bits can encode 2^n symbols so we would need 6 bits (2^6= 63). This is basically the "ceil(log2(N))" chroot gave.

There are, however, a few pieces of information you did not give. You say "A book is assumed to have approximately 400 pages in it" but you DON'T say how many symbols to the page! Also, while you talk about individual characters, your "40 GB" is 40 million BYTES (well actually 40*(1,024,000)= 40,960,000 bytess). You don't say how we are to assume bits are fit into bytes. Assuming that we ONLY need the 63 symbols given and are trying to fit as many symbols as possible onto a drive, so that we abut one 6 bit symbol to another, 40 GB= 327680000 or 327680000/6 = 54,613,333 symbols on the drive (with 2 bits left over).

That would be a lot of work and it would be much simpler to use 8 bits, the entire byte, to represent a symbol as is actually done.
In that case we could fit 327680000/8= 40,960,000 symbols on a drive,
allowing for 2^8= 256 different symbols.

We still can't answer "How many books?" because you haven't told us how many symbols to a page (or how many symbols to a book).



If a drive can hold 40GB
 
And there are also some fancy space saving techniques like saving multiple occurances of a single string in multiple books as one entry on the hard drive. In an extreme case, you could fit an infinite number of books in your 40 GB hard drive if they were all the same by just saving one copy and symbolically linking all other copies to that one book.

Hurkyl
 
1GB=1024MB=1024*1024KB ~ 1 billion bytes=8 billion bits...
Or am I wrong ?
 
Good points Hurkyl.

I used microsoft word to work out that on an A4 page, you can fit around 2500-3000 symbols on a page (using 12 sized font and times new roman), and since an A4 page is roughly twice the size of a typical paperback page, you could fit roughly 1250-1500 symbols to a paperback page. Assuming 400 pages to a book then you could typically fit 500,000-600,000 symbols in a book.

Using your 54,613,333 symbols on a 40GB hard drive then you could fit 91-109 books on a 40GB hard drive.
 
^^^ Whoa, that's way to low. Uncompressed:

There are 64 characters you require, so that 6 bits. Let's round up to ASCII, 8 bits/character, to give us some breathing space: 1char/byte.

So 40 billion symbols in a 40gig drive = 80,000 books

Plain Huffman compression should at least double or triple this, and a book-optimized encoding scheme even more -- with chroot's 7x figure, ~half a million books.
 
Originally posted by damgo

So 40 billion symbols in a 40gig drive = 80,000 books


I somehow lost an entire order of magnitude in my calculations. You are entirely correct, it is between 66,666 and 80,000 books. That seems much more reasonable than approximately 100 books.

I have ignored compression throughout this problem, as I wanted to see the numbers that popped out assuming only very basic encoding. Those numbers above are still very impressive indeed.
 
Last edited:
1 GB is defined (by hard-drive manufacturers anyway) to be 10^9 (1 billion) bytes.

- Warren
 

Similar threads

  • · Replies 11 ·
Replies
11
Views
2K
  • · Replies 11 ·
Replies
11
Views
2K
  • · Replies 1 ·
Replies
1
Views
2K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 2 ·
Replies
2
Views
3K
  • · Replies 28 ·
Replies
28
Views
4K
Replies
4
Views
3K
Replies
9
Views
5K
  • · Replies 33 ·
2
Replies
33
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K