Probability of compressing n random bits by k bits

In summary: In practice there is probably a correlation to exploit, in that encrypted data looks random and random data is likely to be incompressible. As for the original question, it is a matter of combinatorics to count the strings that can compress by k bits: with n input bits there are 2^n possible inputs but only 2^(n-k) outputs of n-k bits, so the "probability" is that count divided by the total number of strings, 2^(n-k)/2^n = 2^(-k), whatever the algorithm. The chance of compressing random data by k bits therefore falls off exponentially in k, and compressing a fragment to see whether it shrinks further is not a reliable way to tell encrypted data from already-compressed data.
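A minimal sketch of that counting argument (it counts every output length up to n - k, so the bound comes out just under 2^(1-k) rather than exactly 2^(-k); the function name and the example numbers are only illustrative):

```python
from fractions import Fraction

def max_fraction_compressible(n: int, k: int) -> Fraction:
    """Upper bound on the fraction of n-bit strings that ANY lossless
    compressor can shorten by at least k bits.

    A lossless compressor must be injective, and there are only
    2**0 + 2**1 + ... + 2**(n-k) = 2**(n-k+1) - 1 distinct outputs of
    length <= n - k bits, so at most that many of the 2**n inputs can
    shrink that far."""
    if k > n:
        return Fraction(0)
    outputs = 2 ** (n - k + 1) - 1          # strings of length 0 .. n-k
    return Fraction(min(outputs, 2 ** n), 2 ** n)

# A 4 KB fragment (32768 bits) and a 64-bit string give essentially the same
# bound for k = 32: just under 2**-31, i.e. about 4.7e-10.
print(float(max_fraction_compressible(32768, 32)))
print(float(max_fraction_compressible(64, 32)))
```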
  • #1
karamand
Suppose that I have n bits of random data.
What is the probability that I can compress it by k bits?
A colleague assures me that the largest possible fraction of random inputs that would compress by e.g. 32 bits would be 2^(-32).
I am struggling to cope with this.
I can see that if I have n bits, then there are 2^n possible arrangements of inputs.
If it compresses by 32 bits then there are 2^(n-32) possible arrangements of outputs.
Hence the fraction that can compress by 32 bits is 2^(n-32)/2^n = 2^(-32), but I fail to see what randomness has to do with this - it is true of any n bits.
Does anybody have any idea where I can start to calculate the probability that random data can be compressed by k bits?
 
  • #2
Hey karamand and welcome to the forums.

For uniformly random bits, no choice of code gives you any real compression benefit unless you make some assumption about either the structure of the data or how the data is used.

One of the most important issues in compression is not what the data is, but what the data means.

General-purpose compression algorithms exist that do not care about the interpretation or the internal structure (if one exists) of the data; examples are Huffman coding, arithmetic coding, the Burrows-Wheeler transform (BWT), and so on.

But be aware that a lot of compression algorithms are not like the above: they are based on knowing the structure and the use of the data and you'll find these algorithms in both lossy and lossless form for things like audio, video, 3D mesh and animation data, images and other kinds of data.

The best examples are the MP3 and MPEG codec schemes for audio and video respectively: these use known properties not only of the data, its type, and how it is used, but also of the interpretation context of the data, which includes the human eyes and ears.

The models for these are based on psycho-acoustic models of how the ear hears things, as well as models of light and colour reception in the cones of the eye, so that you only store combinations of quantized values that make a perceptible difference.

This matters because it lets people pick better representations that still retain the content as well as possible, without having to solve the much harder general-purpose problem.

If you want to investigate your question, look at information theory and look at the entropy of 1 bit as well as the entropy of n-bits.

Also take a look at the counting principle.

You might also want to look at how Huffman and arithmetic coding actually work, since both are statistical techniques and understanding them in depth will answer your question indirectly.
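As a rough illustration (a toy sketch of my own, not a production coder): the average Huffman code length per byte is close to the empirical entropy, and for uniformly random bytes it leaves essentially nothing to save, while skewed data drops well below 8 bits per byte.

```python
import heapq
import os
from collections import Counter

def huffman_code_lengths(freqs):
    """Code length (in bits) per symbol for a Huffman code built from counts."""
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

def avg_bits_per_byte(data: bytes) -> float:
    """Average Huffman code length, i.e. roughly the empirical entropy per byte."""
    freqs = Counter(data)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs) / len(data)

print(avg_bits_per_byte(os.urandom(65536)))          # ~8.0: uniform random bytes have no slack
print(avg_bits_per_byte(b"abracadabra" * 1000))      # ~2.1: skewed frequencies compress well
```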

As a final thought, consider the difference between the raw data on its own and what it means, and how it is used, in the context of an application, and you'll get two completely different approaches to compression.
 
  • #3
Thanks for your reply, chiro.
To put the problem in some context, I have a file fragment (4K bytes) that may be either encrypted or compressed. If it is compressed then it will be lossless compression (jpg, mp3 etc. are specifically excluded).
Most files are compressed, as you say, using a non-optimal compression algorithm.
By compressing the fragment using an efficient method, I want to distinguish between compressed and encrypted fragments.
A compressed fragment should compress further, an encrypted one shouldn't (much).
I want to put a figure on the probability that a file is encrypted given that it compresses by k bits.
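For concreteness, here is a rough sketch of the experiment using zlib as the "efficient method"; the stand-ins for the two kinds of fragment are only illustrative (OS randomness for ciphertext, truncated zlib output for already-compressed data):

```python
import os
import random
import zlib

random.seed(0)

def second_pass_ratio(fragment: bytes) -> float:
    """How much a second, general-purpose compression pass (zlib) shrinks a fragment."""
    return len(zlib.compress(fragment, 9)) / len(fragment)

# Stand-in for an encrypted fragment: 4 KiB of OS randomness
# (ciphertext from a good cipher should look statistically similar).
encrypted_like = os.urandom(4096)

# Stand-in for an already-compressed fragment: zlib output of moderately
# redundant data (random bytes drawn from a 16-symbol alphabet), cut to 4 KiB.
redundant = bytes(random.randrange(16) for _ in range(65536))
compressed_like = zlib.compress(redundant, 9)[:4096]

print(second_pass_ratio(encrypted_like))    # slightly above 1.0: no gain on random data
print(second_pass_ratio(compressed_like))   # also ~1.0: little slack when the original compressor was good
```

Note that with a good compressor standing in for both roles, neither fragment shrinks; the test only discriminates when the original compressor left noticeable redundancy behind.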
 
  • #4
Fragments compressed with the standard general-purpose lossless algorithms won't compress any further: you can show that the various Huffman codes are optimal, so you won't do any better within those kinds of schemes.
 
  • #5
The original post makes sense as an abstract question: given that it is a random string, what is the probability of happening to achieve a given degree of compression (with some specified algorithm). But as a way of spotting encrypted files I doubt it is of much use. As has already been discussed, a compressed file may or may not be prone to further compression, depending on the efficiency of the algorithm. (Btw, I suspect that no matter how good the compression algorithm, you can find a file that would gain from running the compression twice over.)
Similarly, encryption is not guaranteed (to tend) to make a file incompressible. In principle, it could add various fields with high redundancy without compromising the encryption.
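For what it's worth, here is a hedged sketch of the Bayes calculation behind the figure asked for in post #3; the prior and the chance that an already-compressed fragment still shrinks by k bits are made-up placeholders that would have to be measured for the compressors in question.

```python
def p_encrypted_given_compresses(prior_encrypted: float,
                                 k: int,
                                 p_shrink_given_compressed: float) -> float:
    """Bayes' rule for P(encrypted | fragment shrinks by >= k bits).

    prior_encrypted: prior probability that the fragment is encrypted.
    k: observed number of bits saved.
    p_shrink_given_compressed: chance that an already-compressed (but not
        optimally compressed) fragment shrinks by >= k bits -- this has to
        be estimated empirically; the value used below is made up.
    The likelihood for an encrypted fragment is bounded by 2**-k, since
    ciphertext is modelled here as uniformly random bits.
    """
    p_given_encrypted = 2.0 ** -k
    num = p_given_encrypted * prior_encrypted
    den = num + p_shrink_given_compressed * (1.0 - prior_encrypted)
    return num / den

# Example with made-up numbers: 50/50 prior, k = 32, and a 10% chance that a
# typical compressed fragment still shrinks by 32 bits under the better compressor.
print(p_encrypted_given_compresses(0.5, 32, 0.1))   # ~2.3e-9: almost certainly not encrypted
```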
 

1. What is the "probability of compressing n random bits by k bits"?

The probability of compressing n random bits by k bits is the likelihood that a uniformly random string of n bits can be represented, without losing any information, using only n - k bits.

2. How is the probability of compression calculated?

An upper bound comes from dividing the number of possible outputs of n - k bits by the number of possible inputs of n bits: P <= 2^(n-k)/2^n = 2^(-k). For example, for k = 32 the bound is 2^(-32), roughly 2.3 x 10^(-10), independent of n.

3. What factors affect the probability of compression?

The probability of compression is affected by the length n of the original string, the number of bits k to be saved, the algorithm used, and how random the data is. The more random (higher-entropy) the data, the lower the probability of successful compression; uniformly random bits are essentially incompressible.

4. Can the probability of compression be greater than 1?

No, the probability of compression cannot be greater than 1. That would require more strings to compress by k bits than there are strings of n bits in total, which the counting argument above rules out.

5. How does the probability of compression relate to data compression algorithms?

The probability of compression matters in the design and evaluation of data compression algorithms. The counting bound applies to every lossless algorithm: no algorithm can shrink more than a fraction 2^(-k) of all inputs by k bits, so practical compressors succeed by exploiting structure in typical, non-random data rather than by beating that bound.
