Probability of compressing n random bits by k bits

AI Thread Summary
The probability of compressing n bits of random data by k bits is largely dependent on the structure and meaning of the data rather than its randomness alone. General-purpose compression algorithms like Huffman coding and Arithmetic Coding do not account for the data's context, which is crucial for effective compression. Compressed files may not compress further if they have already been processed by efficient algorithms, while encrypted files may still exhibit redundancy that allows for some level of compression. The discussion emphasizes the importance of understanding information theory and entropy when evaluating compression probabilities. Ultimately, distinguishing between compressed and encrypted data requires careful analysis of the compression method and the data's characteristics.
karamand:
Suppose that I have n bits of random data.
What is the probability that I can compress it by k bits?
A colleague assures me that the largest possible fraction of random inputs that would compress by, say, 32 bits is 2^(-32).
I am struggling to get my head around this.
I can see that if I have n bits, then there are 2^n possible arrangements of inputs.
If it compresses by 32 bits then there are 2^(n-32) possible arrangements of outputs.
Hence the fraction that can compress by 32 bits is 2^(n-32) / 2^n = 2^(-32), but I fail to see what randomness has to do with this - it is true of any n-bit input.
Does anybody have any idea where I can start to calculate the probability that random data can be compressed by k bits?
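To make the counting argument concrete, here is a minimal sketch (my own illustration, not part of the original post) of the pigeonhole bound in Python:

```python
# A lossless code is an injective map, so at most 2^(n-k) of the 2^n possible
# n-bit inputs can land on outputs of length n - k.  The fraction is therefore
# at most 2^(n-k) / 2^n = 2^-k, for any n; randomness only enters because a
# uniformly random input hits each of the 2^n strings with equal probability.

def max_fraction_shrunk_by_exactly(k: int) -> float:
    """Upper bound on the fraction of n-bit inputs that any injective code
    can map to outputs exactly k bits shorter (independent of n)."""
    return 2.0 ** (-k)

def max_fraction_shrunk_by_at_least(k: int) -> float:
    """Counting every output at least k bits shorter (lengths 0 .. n-k),
    there are 2^(n-k+1) - 1 of them, so the fraction is just under 2^-(k-1);
    return that (slightly loose) upper bound."""
    return 2.0 ** (-(k - 1))

print(max_fraction_shrunk_by_exactly(32))    # ~2.33e-10
print(max_fraction_shrunk_by_at_least(32))   # ~4.66e-10
```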
 
Hey karamand and welcome to the forums.

For genuinely random bits, you won't get any real compression benefit, whatever code you use, unless you make some assumption about either the structure of the data or how it is used.
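As a quick empirical illustration (my own sketch, using Python's zlib as a stand-in general-purpose compressor, not something from the original post):

```python
import os
import zlib

random_block = os.urandom(4096)        # high-entropy input: effectively random bytes
structured_block = b"abab" * 1024      # highly redundant input of the same size

# zlib typically cannot shrink the random block (it usually grows by a few
# header bytes), but it collapses the redundant block to a tiny fraction of its size.
print(len(zlib.compress(random_block, 9)))      # typically ~4100 bytes, i.e. no gain
print(len(zlib.compress(structured_block, 9)))  # a few dozen bytes
```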

One of the most important issues in compression is not what the data is, but what the data means.

General-purpose compression algorithms exist that don't care what the interpretation or internal structure (if one exists) of the data is; these are things like Huffman coding, arithmetic coding, the Burrows-Wheeler Transform (BWT), and so on.

But be aware that a lot of compression algorithms are not like the above: they are based on knowing the structure and the intended use of the data. You'll find these algorithms in both lossy and lossless forms for things like audio, video, 3D mesh and animation data, images, and other kinds of data.

The best examples are the MP3 and MPEG codec schemes for audio and video respectively: these use known properties not only of the data, its type, and how it is used, but also of the interpretation context of the data, which includes the human eyes and ears.

The models behind them are psycho-acoustic models of how the ear hears things, together with models of light and colour reception in the cones of the eye, so that you only store the combinations of quantized values that make a perceptible difference.

This is really important because it lets people pick better representations while still retaining the content as well as possible, without having to rely on a general-purpose mechanism, which is a much harder problem.

If you want to investigate your question, look at information theory, and in particular at the entropy of a single bit as well as the entropy of n bits.
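For instance, a small sketch (my own, not from the thread) of those entropy quantities:

```python
import math

def bit_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a single bit that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(bit_entropy(0.5))                 # 1.0: a fair bit carries a full bit of entropy
print(bit_entropy(0.1))                 # ~0.469: a biased bit is compressible on average
print(4096 * 8 * bit_entropy(0.5))      # n i.i.d. fair bits carry n bits of entropy (32768 here)
```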

Also take a look at the counting principle.

You might also want to look at how Huffman and arithmetic coding actually work statistically and probabilistically; these methods are based on statistical techniques, and understanding them in depth will answer your question indirectly.
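To illustrate the statistical side (a compact sketch of my own, not taken from the thread), here is how Huffman coding assigns shorter codes to more probable symbols, with the average code length approaching the entropy:

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return {symbol: code length} for a Huffman code built from byte frequencies."""
    counts = Counter(data)
    # Heap entries: (weight, tiebreak, {symbol: depth so far}); the unique tiebreak
    # keeps the dicts from ever being compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    if len(heap) == 1:                       # degenerate single-symbol input
        return {sym: 1 for sym in heap[0][2]}
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}   # one level deeper in the tree
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

data = b"aaaaaaaabbbbccd"                    # skewed symbol frequencies
lengths = huffman_code_lengths(data)
counts = Counter(data)
avg_bits = sum(lengths[s] * c for s, c in counts.items()) / len(data)
print(lengths)                               # the most frequent byte gets the shortest code
print(avg_bits)                              # ~1.67 bits/symbol, close to the ~1.64-bit entropy
```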

As a final thought, consider the difference between the raw data on its own and what it means and how it is used in the context of an application, and you'll get two completely different approaches to compression.
 
Thanks for your reply, chiro.
To put the problem in some context, I have a file fragment (4K bytes) that may be either encrypted or compressed. If it is compressed then it will be lossless compression (jpg, mp3 etc. are specifically excluded).
Most files are compressed, as you say, using a non-optimal compression algorithm.
By compressing the fragment with an efficient method, I want to distinguish between compressed and encrypted fragments.
A compressed fragment should compress further, an encrypted one shouldn't (much).
I want to put a figure on the probability that a file is encrypted given that it compresses by k bits.
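One way to put a figure on it is Bayes' rule. The sketch below is my own hedged framing, not something from the thread: the prior and the likelihood for already-compressed fragments are placeholder assumptions you would have to estimate, while for an encrypted (effectively random) fragment the counting bound gives P(compresses by at least k bits) < 2^-(k-1).

```python
def p_encrypted_given_compression(k: int,
                                  prior_encrypted: float = 0.5,
                                  p_compress_if_compressed: float = 0.5) -> float:
    """P(fragment is encrypted | an efficient compressor shrank it by >= k bits).
    The default prior and the likelihood for compressed fragments are assumed
    placeholders, not measured values."""
    # Pigeonhole upper bound for random-looking data; since the posterior is
    # increasing in this likelihood, the result is itself an upper bound.
    p_compress_if_encrypted = 2.0 ** (-(k - 1))
    numerator = p_compress_if_encrypted * prior_encrypted
    denominator = numerator + p_compress_if_compressed * (1 - prior_encrypted)
    return numerator / denominator

print(p_encrypted_given_compression(32))   # ~9e-10: 32 bits of shrinkage makes
                                           # "encrypted" extremely unlikely
```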
 
Compressed fragments produced by the standard general-purpose lossless algorithms won't compress any further: you can show that a Huffman code is optimal for its symbol distribution, so you won't do any better within that kind of scheme.
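A quick empirical check of that behaviour (my own sketch with Python's zlib, not part of the original post):

```python
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 200
once = zlib.compress(text, 9)
twice = zlib.compress(once, 9)
# The second pass operates on near-random-looking bytes and typically makes the
# data slightly larger, not smaller.
print(len(text), len(once), len(twice))
```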
 
The original post makes sense as an abstract question: given that it is a random string, what is the probability of happening to achieve a given degree of compression (with some specified algorithm). But as a way of spotting encrypted files I doubt it is of much use. As has already been discussed, a compressed file may or may not be prone to further compression, depending on the efficiency of the algorithm. (Btw, I suspect that no matter how good the compression algorithm, you can find a file that would gain from running the compression twice over.)
Similarly, encryption is not guaranteed (nor does it necessarily tend) to make a file incompressible. In principle, it could add various fields with high redundancy without compromising the encryption.
 