Comp Sci Answer: Dictionary Compression: Solving the Mystery of "01"

Villiers · Sep 25, 2022

Hi

I have the answer to the dictionary compression question, but I can't understand the following in the notes:

"Instead, the word could be added to a dictionary and assigned the binary code 01 which is a reduction of 38 bits for each occurrence." What does this mean?

This is the extract in full:
In compressing larger volumes, a document of 100 pages could contain the word ‘because’ 50 times, resulting in 2000 bits of data being required. Instead the word could be added to a dictionary and assigned the code 01 which is a reduction of 38 bits for each occurrence. There would still be a slight overhead in terms of the storage of the dictionary but this would only be a one-off entry per word.

FactChecker · Sep 25, 2022

Words could be assigned a number, which is stored instead of the entire word. Webster's Dictionary contains 470,000 words. That would require 19 bits to give each word a unique number. The explanation you quoted seems to use 18 bits instead of 19 (because 56-38=18). That would allow you to assign unique numbers to ##2^{18} = 262,144## words.
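As a quick check of the figures above: the number of bits needed to give each of N words a unique number is the ceiling of log2(N). A minimal sketch, where the function name `bits_needed` is made up for illustration:

```python
def bits_needed(n_words: int) -> int:
    """Smallest number of bits that can give n_words items distinct codes."""
    if n_words < 1:
        raise ValueError("need at least one word")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, with no float rounding.
    return (n_words - 1).bit_length()

print(bits_needed(470_000))  # 19
print(2 ** 18)               # 262144, fewer than 470,000, so 18 bits is not enough
```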

CORRECTION: It looks like it is talking about making a lookup list of only the words in the document, not an entire English language dictionary. So the word ‘because’ is the first word in the lookup list with index 01 and any occurrence of that word in the text is replaced by '01'.

Villiers · Sep 25, 2022

Thank you for your feedback.

pbuk · Sep 26, 2022

FactChecker said:

Words could be assigned a number, which is stored instead of the entire word. Webster's Dictionary contains 470,000 words. That would require 19 bits to give each word a unique number. The explanation you quoted seems to use 18 bits instead of 19 (because 56-38=18). That would allow you to assign unique numbers to ##2^{18} = 262,144## words.

I don't think this is what the question means; you are adding assumptions that I can't see are justified.

Let's have another look:

Villiers said:

In compressing larger volumes, a document of 100 pages could contain the word ‘because’ 50 times, resulting in 2000 bits of data being required.

So each occurrence of 'because' requires 2000 ÷ 50 = 40 bits. This is clearly not 8-bit ASCII, so the calculation 7 letters × 8 bits = 56 bits is not relevant.

Villiers said:

Instead the word could be added to a dictionary and assigned the code 01 which is a reduction of 38 bits for each occurrence.

01 is two bits, 40 - 2 = 38.
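The same arithmetic can be written out as a short script. The figures 50, 2000 and the code 01 come from the extract; the size of the one-off dictionary entry is not given in the notes, so the value used here (the word itself plus its code) is only an assumption for illustration:

```python
occurrences = 50    # times 'because' appears (from the extract)
total_bits = 2000   # bits the extract says those occurrences need
code = "01"         # dictionary code from the extract

bits_per_word = total_bits // occurrences    # 2000 / 50 = 40
saving_per_word = bits_per_word - len(code)  # 40 - 2 = 38
compressed_bits = occurrences * len(code)    # 50 * 2 = 100

# Assumption (not in the notes): the one-off dictionary entry costs
# the uncompressed word plus its code.
dictionary_overhead = bits_per_word + len(code)

net_saving = total_bits - compressed_bits - dictionary_overhead
print(bits_per_word, saving_per_word, compressed_bits, net_saving)
```

Whatever the dictionary entry really costs, it is paid once, while the 38-bit saving is repeated for every occurrence.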

FactChecker · Sep 26, 2022

pbuk said:

I don't think this is what the question means; you are adding assumptions that I can't see are justified.

I stand corrected. It looks like it is talking about making a lookup list of only the words in the document, not an entire English language dictionary. So the word ‘because’ is the first word in the lookup list with index 01 and any occurrence of that word in the text is replaced by '01'.
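A rough sketch of that per-document lookup list, to make the idea concrete. The function names, the sample sentence, and the choice to number entries from 1 in order of first appearance are all assumptions for illustration; the notes do not say how the list is built:

```python
def build_lookup(words):
    """Give each distinct word an index, starting at 1, in order of first appearance."""
    lookup = {}
    for w in words:
        if w not in lookup:
            lookup[w] = len(lookup) + 1
    return lookup

def compress(text):
    """Replace every word with its index; return the indices and the lookup list."""
    words = text.split()
    lookup = build_lookup(words)
    return [lookup[w] for w in words], lookup

def decompress(indices, lookup):
    """Rebuild the text by mapping each index back to its word."""
    by_index = {i: w for w, i in lookup.items()}
    return " ".join(by_index[i] for i in indices)

text = "because it rained because it was cold"
indices, lookup = compress(text)
print(lookup)   # {'because': 1, 'it': 2, 'rained': 3, 'was': 4, 'cold': 5}
print(indices)  # [1, 2, 3, 1, 2, 4, 5]
```

The lookup list has to be stored alongside the indices so the text can be rebuilt, which is the one-off overhead the extract mentions.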
