Answer: Dictionary Compression: Solving the Mystery of "01

  • Context: Comp Sci 
  • Thread starter Thread starter Villiers
  • Start date Start date
  • Tags Tags
    Compression
Click For Summary

Discussion Overview

The discussion revolves around the concept of dictionary compression in data storage, particularly focusing on the encoding of words in a document to reduce the amount of data required. Participants explore the implications of assigning binary codes to words and the calculations involved in determining the efficiency of this method.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant questions the meaning of a reduction of 38 bits for each occurrence of the word 'because' when assigned the binary code 01.
  • Another participant suggests that assigning a number to words could require 19 bits for Webster's Dictionary, but notes a discrepancy in the bit count used in the original explanation.
  • A correction is made regarding the lookup list, clarifying that it pertains to only the words present in the document rather than the entire English language dictionary.
  • One participant calculates that each occurrence of 'because' requires 40 bits, arguing that the ASCII calculation is not relevant in this context.
  • Another participant agrees with the correction about the lookup list and reiterates that the word 'because' is indexed as '01' in the text.

Areas of Agreement / Disagreement

Participants express differing interpretations of the original explanation and calculations related to dictionary compression. There is no clear consensus on the implications of the bit reductions or the assumptions made in the discussion.

Contextual Notes

There are unresolved assumptions regarding the definitions of bit requirements and the scope of the dictionary being referenced. The calculations presented may depend on specific interpretations of the data compression method discussed.

Villiers
Messages
7
Reaction score
0
Homework Statement
In a text document each letter could be stored as an ASCII code of 8 bits
In this document the word ‘because’ requires 56 bits of data (7 letters x 8 bits)
Instead, the word could be added to a dictionary and assigned the binary code 01 which is a reduction of 38 bits for each occurrence
A saving for 50 occurrences of the word:
Relevant Equations
50 x 54 bits saving = 2,700 bits saved, or 338 bytes
Hi

I have the answer to the dictionary compression question- but can't understand the following in the notes:

Instead, the word could be added to a dictionary and assigned the binary code 01 which is a reduction of 38 bits for each occurrence - what does this mean?

This is the extract in full:
In compressing larger volumes, a document of 100 pages could contain the word ‘because’ 50 times, resulting in 2000 bits of data being required. Instead the word could be added to a dictionary and assigned the code 01 which is a reduction of 38 bits for each occurrence. There would still be a slight overhead in terms of the storage of the dictionary but this would only be a one-off entry per word.
 
Physics news on Phys.org
Words could be assigned a number, which is stored instead of the entire word. Webster's Dictionary contains 470,000 words. That would require 19 bits to give each word a unique number. The explanation you quoted seems to use 18 bits instead of 19 (because 56-38=18). That would allow you to assign unique numbers to ##2^{18} = 262,144## words.

CORRECTION: It looks like it is talking about making a lookup list of only the words in the document, not an entire English language dictionary. So the word ‘because’ is the first word in the lookup list with index 01 and any occurrence of that word in the text is replaced by '01'.
 
Last edited:
thank you for your feedback
 
FactChecker said:
Words could be assigned a number, which is stored instead of the entire word. Webster's Dictionary contains 470,000 words. That would require 19 bits to give each word a unique number. The explanation you quoted seems to use 18 bits instead of 19 (because 56-38=18). That would allow you to assign unique numbers to ##2^{18} = 262,144## words.
I don't think this is what the question means, you are adding in assumptions which I can't see are justified.

Let's have another look:
Villiers said:
In compressing larger volumes, a document of 100 pages could contain the word ‘because’ 50 times, resulting in 2000 bits of data being required.
So each occurrence of 'because' requires 2000 ÷ 50 = 40 bits. This is clearly not ASCII so the calculation 7 x 8 = 56 bits is not relevant.

Villiers said:
Instead the word could be added to a dictionary and assigned the code 01 which is a reduction of 38 bits for each occurrence.
01 is two bits, 40 - 2 = 38.
 
  • Like
Likes   Reactions: FactChecker
pbuk said:
I don't think this is what the question means, you are adding in assumptions which I can't see are justified.
I stand corrected. It looks like it is talking about making a lookup list of only the words in the document, not an entire English language dictionary. So the word ‘because’ is the first word in the lookup list with index 01 and any occurrence of that word in the text is replaced by '01'.
 
  • Like
Likes   Reactions: pbuk

Similar threads

  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 12 ·
Replies
12
Views
4K
  • · Replies 3 ·
Replies
3
Views
3K
Replies
8
Views
5K
  • · Replies 19 ·
Replies
19
Views
9K
  • · Replies 25 ·
Replies
25
Views
6K
  • · Replies 5 ·
Replies
5
Views
3K
  • · Replies 5 ·
Replies
5
Views
3K
  • · Replies 15 ·
Replies
15
Views
6K