Probability of compressing n random bits by k bits

Suppose that I have n bits of random data.
What is the probability that I can compress it by k bits?
A colleague assures me that the largest possible fraction of random inputs that would compress by e.g. 32 bits would be 2^(-32).
I am struggling to cope with this.
I can see that if I have n bits, then there are 2^n possible arrangements of inputs.
If it compresses by 32 bits then there are 2^(n-32) possible arrangements of outputs.
Hence the fraction that can compress by 32 bits is 2^(n-32) / 2^n = 2^(-32), but I fail to see what randomness has to do with this; it is true of any n bits.
Does anybody have any idea where I can start to calculate the probability that random data can be compressed by k bits?
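A minimal sketch of the counting (pigeonhole) argument behind this, in Python. Note that "compresses by at least k bits" counts outputs of every length up to n-k, not just length exactly n-k, which loosens the bound slightly from 2^(-k) to just under 2^(1-k):

```python
from fractions import Fraction

def compressible_fraction_bound(n, k):
    """Pigeonhole bound: a lossless (injective) compressor can send at most one
    n-bit input to each output of length <= n-k, and there are 2^(n-k+1) - 1
    bit strings of length 0..n-k.  So at most this fraction of the 2^n possible
    inputs can shrink by at least k bits."""
    return Fraction(2**(n - k + 1) - 1, 2**n)

n = 4096 * 8  # a 4K-byte fragment
for k in (8, 32, 128):
    bound = compressible_fraction_bound(n, k)
    print(f"k = {k:3d}: at most {float(bound):.3e} of inputs (< 2^{1 - k}) "
          f"can compress by {k} or more bits")
```

The randomness only enters when you turn this counting statement into a probability: if all 2^n inputs are equally likely, the fraction of inputs is the probability. For structured (non-random) inputs the count still holds, but the probability of landing on a compressible input can be anything.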
 
Hey karamand and welcome to the forums.

For truly random bits, whatever codes you use, you won't get any real benefit from compression unless you make some assumption about either the structure of the data or how it is used.

One of the most important issues in compression is not what the data is, but what the data means.

General-purpose compression algorithms exist that do not care about the interpretation or internal structure (if one exists) of the data; examples are Huffman coding, arithmetic coding, the Burrows-Wheeler transform (BWT), and so on.

But be aware that a lot of compression algorithms are not like the above: they are based on knowing the structure and the use of the data, and you'll find them in both lossy and lossless form for things like audio, video, 3D mesh and animation data, images, and other kinds of data.

The best examples are the MP3 and MPEG codec schemes for audio and video respectively: these use known properties not only of the data, its type, and how it is used, but also of the interpretation context of the data, which includes the human eyes and ears.

The models for these are based on psycho-acoustic models of how the ear hears things, as well as models of light and colour reception in the cones of the eye, where you only store combinations of quantized values that make a perceptible difference.

This is really important because it lets people pick better representations that still retain the content as well as possible, without having to rely on a general-purpose mechanism, which is a lot harder.

If you want to investigate your question, look at information theory and look at the entropy of 1 bit as well as the entropy of n bits.
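For reference, the standard Shannon entropy being pointed at here (nothing in this block is specific to the thread):

```latex
H(X) = -\sum_i p_i \log_2 p_i
% one fair bit:  p_0 = p_1 = 1/2  =>  H = 1 bit
% n independent fair bits:            H = n bits
```

Shannon's source-coding theorem then says no lossless code can have an expected output length below the entropy, which is the formal reason uniformly random data does not compress on average.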

Also take a look at the counting principle.

You might also want to look at how Huffman and arithmetic coding actually work, since these methods are based on statistical and probabilistic techniques, and understanding them in depth will answer your question indirectly.
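A minimal sketch of what "statistical" means here, assuming a simple order-0 byte-frequency model and using Python's heapq to build the Huffman tree (an illustration only, not any particular library's implementation):

```python
import heapq
import os
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from counts."""
    if len(freqs) == 1:                        # degenerate case: a lone symbol still needs 1 bit
        return {next(iter(freqs)): 1}
    heap = [(count, i, (sym,)) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    tiebreak = len(heap)                       # unique counter so ties never compare symbol tuples
    while len(heap) > 1:
        c1, _, group1 = heapq.heappop(heap)    # two least frequent groups
        c2, _, group2 = heapq.heappop(heap)
        for sym in group1 + group2:            # each merge adds one bit to these symbols' codes
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tiebreak, group1 + group2))
        tiebreak += 1
    return lengths

def avg_bits_per_byte(data):
    freqs = Counter(data)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs) / len(data)

skewed = b"a" * 900 + b"b" * 90 + b"c" * 10    # very uneven byte frequencies
random_bytes = os.urandom(4096)                # roughly uniform byte frequencies

print("skewed :", avg_bits_per_byte(skewed), "bits/byte vs 8 raw")        # about 1.1
print("random :", avg_bits_per_byte(random_bytes), "bits/byte vs 8 raw")  # about 8
```

The skewed source gets short codes for its frequent bytes; the uniform source gives every byte an essentially equal 8-bit code, so there is nothing to save (and in practice you also have to store the code table, which makes a short random fragment come out bigger, not smaller).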

As a final thought, consider the difference between the raw data on its own and what it means and how it is used in the context of an application, and you'll get two completely different approaches to compression.
 
Thanks for your reply, chiro.
To put the problem in some context, I have a file fragment (4K bytes) that may be either encrypted or compressed. If it is compressed then it will be lossless compression (jpg, mp3 etc. are specifically excluded).
Most files are compressed, as you say, using a non-optimal compression algorithm.
By compressing the fragment with an efficient method, I want to distinguish between compressed and encrypted fragments.
A compressed fragment should compress further; an encrypted one shouldn't (much).
I want to put a figure on the probability that a file is encrypted given that it compresses by k bits.
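To actually put a figure on it you would need Bayes' theorem plus a model of how much further compression each class of fragment typically allows. A sketch of the form it takes (the prior P(enc) and both likelihoods are placeholders to be estimated from sample fragments; they are not given anywhere in this thread):

```latex
P(\text{enc} \mid \Delta = k)
  = \frac{P(\Delta = k \mid \text{enc})\, P(\text{enc})}
         {P(\Delta = k \mid \text{enc})\, P(\text{enc})
          + P(\Delta = k \mid \text{comp})\, P(\text{comp})}
```

Here \Delta is the number of bits saved by re-compressing the fragment. For an encrypted (effectively random) fragment, the counting argument above gives P(\Delta \ge k \mid enc) < 2^(1-k); the likelihood for already-compressed fragments depends on how good the original compressor was, so it has to be measured empirically.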
 
Compressed fragments produced with the standard general-purpose lossless algorithms won't compress any further to speak of: you can show that Huffman codes are optimal (among prefix codes, for a given symbol distribution), so those kinds of schemes won't do any better on their own output.
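A rough way to check this, assuming zlib's DEFLATE as the general-purpose compressor (other codecs will differ in detail):

```python
import os
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 200   # highly redundant input
once = zlib.compress(text, 9)
twice = zlib.compress(once, 9)               # re-compress the already-deflated bytes
rand = zlib.compress(os.urandom(4096), 9)    # an "incompressible" random 4K fragment

print("original            :", len(text), "bytes")
print("compressed once     :", len(once), "bytes")
print("compressed twice    :", len(twice), "bytes")   # gains little, usually grows slightly
print("random 4K, deflated :", len(rand), "bytes")    # slightly larger than 4096
```

Whether a fragment compressed by some other tool yields a few more bits under a different, stronger codec is exactly the margin being hoped for here, and it depends on how good the original compressor was.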
 
The original post makes sense as an abstract question: given that it is a random string, what is the probability of happening to achieve a given degree of compression (with some specified algorithm). But as a way of spotting encrypted files I doubt it is of much use. As has already been discussed, a compressed file may or may not be prone to further compression, depending on the efficiency of the algorithm. (Btw, I suspect that no matter how good the compression algorithm, you can find a file that would gain from running the compression twice over.)
Similarly, encryption is not guaranteed (to tend) to make a file incompressible. In principle, it could add various fields with high redundancy without compromising the encryption.
 