Probability of compressing n random bits by k bits

In summary: In practice there is probably a correlation to exploit, in that encrypted data looks random and random data is likely to be incompressible. As for the original question, it is a matter of combinatorics to count the strings that can compress by k bits: with n input bits there are 2^n possible inputs but only 2^(n-k) outputs of n-k bits, so the "probability" is that count divided by the total number of strings, 2^(n-k)/2^n = 2^(-k), whatever the algorithm. The chance of compressing random data by k bits therefore falls off exponentially in k, and compressing a fragment to see whether it shrinks further is not a reliable way to tell encrypted data from already-compressed data.
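A minimal sketch of that counting argument (it counts every output length up to n - k, so the bound comes out just under 2^(1-k) rather than exactly 2^(-k); the function name and the example numbers are only illustrative):

```python
from fractions import Fraction

def max_fraction_compressible(n: int, k: int) -> Fraction:
    """Upper bound on the fraction of n-bit strings that ANY lossless
    compressor can shorten by at least k bits.

    A lossless compressor must be injective, and there are only
    2**0 + 2**1 + ... + 2**(n-k) = 2**(n-k+1) - 1 distinct outputs of
    length <= n - k bits, so at most that many of the 2**n inputs can
    shrink that far."""
    if k > n:
        return Fraction(0)
    outputs = 2 ** (n - k + 1) - 1          # strings of length 0 .. n-k
    return Fraction(min(outputs, 2 ** n), 2 ** n)

# A 4 KB fragment (32768 bits) and a 64-bit string give essentially the same
# bound for k = 32: just under 2**-31, i.e. about 4.7e-10.
print(float(max_fraction_compressible(32768, 32)))
print(float(max_fraction_compressible(64, 32)))
```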
  • #1
karamand
Suppose that I have n bits of random data.
What is the probability that I can compress it by k bits?
A colleague assures me that the largest possible fraction of random inputs that would compress by e.g. 32 bits would be 2^(-32).
I am struggling to cope with this.
I can see that if I have n bits, then there are 2^n possible arrangements of inputs.
If it compresses by 32 bits then there are 2^(n-32) possible arrangements of outputs.
Hence the fraction that can compress by 32 bits is 2^(n-32)/2^n = 2^(-32), but I fail to see what randomness has to do with this - it is true of any n bits.
Does anybody have any idea where I can start to calculate the probability that random data can be compressed by k bits?
 
  • #2
Hey karamand and welcome to the forums.

For uniformly random bits, no choice of code gives you any real compression benefit unless you make some assumption about either the structure of the data or how the data is used.

One of the most important issues in compression is not what the data is, but what the data means.

General-purpose compression algorithms exist that do not care about the interpretation or the internal structure (if one exists) of the data; examples are Huffman coding, arithmetic coding, the Burrows-Wheeler transform (BWT), and so on.

But be aware that a lot of compression algorithms are not like the above: they are based on knowing the structure and the use of the data and you'll find these algorithms in both lossy and lossless form for things like audio, video, 3D mesh and animation data, images and other kinds of data.

The best examples are the MP3 and MPEG codec schemes for audio and video respectively: these use known properties not only of the data, its type, and how it is used, but also of the interpretation context of the data, which includes the human eyes and ears.

The models for these are based on psycho-acoustic models of how the ear hears things, as well as models of light and colour reception in the cones of the eye, so that you only store combinations of quantized values that make a perceptible difference.

This matters because it lets people pick better representations that still retain the content as well as possible, without having to solve the much harder general-purpose problem.

If you want to investigate your question, look at information theory and look at the entropy of 1 bit as well as the entropy of n-bits.

Also take a look at the counting principle.

You might also want to look at how Huffman and arithmetic coding actually work, since both are statistical techniques and understanding them in depth will answer your question indirectly.
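As a rough illustration (a toy sketch of my own, not a production coder): the average Huffman code length per byte is close to the empirical entropy, and for uniformly random bytes it leaves essentially nothing to save, while skewed data drops well below 8 bits per byte.

```python
import heapq
import os
from collections import Counter

def huffman_code_lengths(freqs):
    """Code length (in bits) per symbol for a Huffman code built from counts."""
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

def avg_bits_per_byte(data: bytes) -> float:
    """Average Huffman code length, i.e. roughly the empirical entropy per byte."""
    freqs = Counter(data)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs) / len(data)

print(avg_bits_per_byte(os.urandom(65536)))          # ~8.0: uniform random bytes have no slack
print(avg_bits_per_byte(b"abracadabra" * 1000))      # ~2.1: skewed frequencies compress well
```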

As a final thought, consider the difference between the raw data on its own and what it means, and how it is used, in the context of an application, and you'll get two completely different approaches to compression.
 
  • #3
Thanks for your reply, chiro.
To put the problem in some context, I have a file fragment (4K bytes) that may be either encrypted or compressed. If it is compressed then it will be lossless compression (jpg, mp3 etc. are specifically excluded).
Most files are compressed, as you say, using a non-optimal compression algorithm.
By compressing the fragment using an efficient method, I want to distinguish between compressed and encrypted fragments.
A compressed fragment should compress further, an encrypted one shouldn't (much).
I want to put a figure on the probability that a file is encrypted given that it compresses by k bits.
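For concreteness, here is a rough sketch of the experiment using zlib as the "efficient method"; the stand-ins for the two kinds of fragment are only illustrative (OS randomness for ciphertext, truncated zlib output for already-compressed data):

```python
import os
import random
import zlib

random.seed(0)

def second_pass_ratio(fragment: bytes) -> float:
    """How much a second, general-purpose compression pass (zlib) shrinks a fragment."""
    return len(zlib.compress(fragment, 9)) / len(fragment)

# Stand-in for an encrypted fragment: 4 KiB of OS randomness
# (ciphertext from a good cipher should look statistically similar).
encrypted_like = os.urandom(4096)

# Stand-in for an already-compressed fragment: zlib output of moderately
# redundant data (random bytes drawn from a 16-symbol alphabet), cut to 4 KiB.
redundant = bytes(random.randrange(16) for _ in range(65536))
compressed_like = zlib.compress(redundant, 9)[:4096]

print(second_pass_ratio(encrypted_like))    # slightly above 1.0: no gain on random data
print(second_pass_ratio(compressed_like))   # also ~1.0: little slack when the original compressor was good
```

Note that with a good compressor standing in for both roles, neither fragment shrinks; the test only discriminates when the original compressor left noticeable redundancy behind.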
 
  • #4
Fragments compressed with the standard general-purpose lossless algorithms won't compress any further: you can show that the various Huffman codes are optimal, so you won't do any better within those kinds of schemes.
 
  • #5
The original post makes sense as an abstract question: given that it is a random string, what is the probability of happening to achieve a given degree of compression (with some specified algorithm). But as a way of spotting encrypted files I doubt it is of much use. As has already been discussed, a compressed file may or may not be prone to further compression, depending on the efficiency of the algorithm. (Btw, I suspect that no matter how good the compression algorithm, you can find a file that would gain from running the compression twice over.)
Similarly, encryption is not guaranteed (to tend) to make a file incompressible. In principle, it could add various fields with high redundancy without compromising the encryption.
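For what it's worth, here is a hedged sketch of the Bayes calculation behind the figure asked for in post #3; the prior and the chance that an already-compressed fragment still shrinks by k bits are made-up placeholders that would have to be measured for the compressors in question.

```python
def p_encrypted_given_compresses(prior_encrypted: float,
                                 k: int,
                                 p_shrink_given_compressed: float) -> float:
    """Bayes' rule for P(encrypted | fragment shrinks by >= k bits).

    prior_encrypted: prior probability that the fragment is encrypted.
    k: observed number of bits saved.
    p_shrink_given_compressed: chance that an already-compressed (but not
        optimally compressed) fragment shrinks by >= k bits -- this has to
        be estimated empirically; the value used below is made up.
    The likelihood for an encrypted fragment is bounded by 2**-k, since
    ciphertext is modelled here as uniformly random bits.
    """
    p_given_encrypted = 2.0 ** -k
    num = p_given_encrypted * prior_encrypted
    den = num + p_shrink_given_compressed * (1.0 - prior_encrypted)
    return num / den

# Example with made-up numbers: 50/50 prior, k = 32, and a 10% chance that a
# typical compressed fragment still shrinks by 32 bits under the better compressor.
print(p_encrypted_given_compresses(0.5, 32, 0.1))   # ~2.3e-9: almost certainly not encrypted
```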
 

1. What is the "probability of compressing n random bits by k bits"?

The probability of compressing n random bits by k bits is the likelihood that a uniformly random string of n bits can be represented, without losing any information, using only n - k bits.

2. How is the probability of compression calculated?

An upper bound comes from dividing the number of possible outputs of n - k bits by the number of possible inputs of n bits: P <= 2^(n-k)/2^n = 2^(-k). For example, for k = 32 the bound is 2^(-32), roughly 2.3 x 10^(-10), independent of n.

3. What factors affect the probability of compression?

The probability of compression is affected by the length n of the original string, the number of bits k to be saved, the algorithm used, and how random the data is. The more random (higher-entropy) the data, the lower the probability of successful compression; uniformly random bits are essentially incompressible.

4. Can the probability of compression be greater than 1?

No, the probability of compression cannot be greater than 1. That would require more strings to compress by k bits than there are strings of n bits in total, which the counting argument above rules out.

5. How does the probability of compression relate to data compression algorithms?

The probability of compression matters in the design and evaluation of data compression algorithms. The counting bound applies to every lossless algorithm: no algorithm can shrink more than a fraction 2^(-k) of all inputs by k bits, so practical compressors succeed by exploiting structure in typical, non-random data rather than by beating that bound.
