How to practically measure entropy of a file?

  • #1
Paul Uszak
I'm trying to measure how much non-redundant (actual) information my file contains. Some call this the amount of entropy.

Of course there is the standard formula, −∑ p(x) log p(x), but I think that Shannon was only considering it from the point of view of transmitting through a channel. Hence the formula requires a choice of block size (say in bits, 8 typically). For a large file this calculation is fairly useless, since it ignores short- to long-range correlations between symbols.
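Concretely, this is the kind of calculation I mean. A minimal sketch (Python assumed; the block size is exactly the free parameter mentioned above, and `hiss.wav` is just the example file from the list below):

```python
import math
from collections import Counter

def shannon_entropy(path, block_size=1):
    """Frequency-based Shannon entropy of a file, treating each
    non-overlapping block of `block_size` bytes as one symbol.
    Returns bits per symbol."""
    with open(path, "rb") as f:
        data = f.read()
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    counts = Counter(blocks)
    total = len(blocks)
    # H = -sum_x p(x) * log2 p(x)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# e.g. shannon_entropy("hiss.wav")     -> bits per byte, between 0 and 8
#      shannon_entropy("hiss.wav", 2)  -> bits per 2-byte symbol
```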

There are binary tree and Lempel-Ziv methods, but these seem highly academic in nature.

Compressibility is also regarded as a measure of entropy, but there seems to be no known lower limit on how far a file can be compressed (a small bits-per-byte sketch of this follows the list below). For my file hiss.wav,

- original hiss.wav = 5.2 MB
- entropy via the Shannon formula = 4.6 MB
- hiss.zip = 4.6 MB
- hiss.7z = 4.2 MB
- hiss.wav.fp8 = 3.3 MB
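To make that comparison concrete, a quick sketch (Python assumed) that turns compressed sizes into bits-per-byte upper bounds, with the standard-library zlib and lzma compressors standing in for zip and 7z:

```python
import lzma
import zlib

def compression_bound_bits_per_byte(path):
    """Rough upper bounds on information content, in bits per original byte,
    using off-the-shelf compressors: smaller means more redundancy found."""
    with open(path, "rb") as f:
        data = f.read()
    n = len(data)
    return {
        "zlib": 8 * len(zlib.compress(data, 9)) / n,
        "lzma": 8 * len(lzma.compress(data, preset=9)) / n,
    }

# e.g. compression_bound_bits_per_byte("hiss.wav") -> {'zlib': ..., 'lzma': ...}
```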

Is there some reasonably practicable method of measuring how much entropy exists within hiss.wav?
 
  • #2
Well, it's good to look at Shannon's original formulation of entropy. One thing you'll notice is its simplicity, and that it really provides a sort of upper bound on things, but that's it. The most basic implementation can't tell the difference between 01010101... and a totally random string of zeros and ones (with equal probability of each), even though the latter clearly contains MUCH more information than the former.
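A tiny sketch of that failure mode (Python assumed), measuring only single-bit frequencies:

```python
import math
import random
from collections import Counter

def bit_entropy(bits):
    """Shannon entropy per bit when the only events are {0, 1}."""
    counts = Counter(bits)
    n = len(bits)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

alternating = "01" * 500                                        # 0101... pattern
random_bits = "".join(random.choice("01") for _ in range(1000))

print(bit_entropy(alternating))   # exactly 1.0 bit per bit
print(bit_entropy(random_bits))   # also ~1.0 bit per bit: indistinguishable
```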

"Compressibility is also regarded as a measure of entropy, but there seems to be no lower limit as to the degree of compression."

True - or, more specifically, there is a smallest compression (the Kolmogorov complexity), but we provably can't ever find it with certainty (it's not computable). So with that interpretation of Shannon information, we can only establish a sort of upper bound: the best compression we have found so far. For you, it's 3.3 MB.
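In bits per byte, that bound works out to about 8 × 3.3 MB / 5.2 MB ≈ 5.1 bits per original byte, until something finds a smaller representation.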
 
  • #3
I think you're right, but that both depresses and surprises me. Compression is ubiquitous: ADSL service, hard disks, the icons on this page, etc. I would have thought that with the scale of research dollars globally expended on maximising data transmission and storage, there would be a more developed way of estimating (at least) how much of the darned stuff you're actually dealing with. I wouldn't have thought it beyond the realms of possibility that there would be a file utility that you pass over some data and that outputs a theoretical entropy estimate. Just what are the telcos and disk manufacturers playing at :confused: ?
 
  • #4
Paul Uszak said:
I think you're right, but that both depresses and surprises me. Compression is ubiquitous: ADSL service, hard disks, the icons on this page, etc. I would have thought that with the scale of research dollars globally expended on maximising data transmission and storage, there would be a more developed way of estimating (at least) how much of the darned stuff you're actually dealing with. I wouldn't have thought it beyond the realms of possibility that there would be a file utility that you pass over some data and that outputs a theoretical entropy estimate. Just what are the telcos and disk manufacturers playing at :confused: ?

Well... One way to think of it is that you CAN make some entropy estimate pretty easily; the question is how meaningful it is. For instance, it's really easy to just measure the Shannon entropy of some file in terms of "odds the next bit is a one, or zero". But how meaningful is it? Like I said, a file that is just the string 01010101... will show maximum entropy, but that's not a meaningful measure of the information in the file.

The issue is, Shannon entropy is defined in terms of the probabilities of events (the outcomes of some random variable), but it leaves what that random variable is up to you. You could choose the events to be which bit comes next, in {0, 1}, or you could define them as which 2 bits come next, in {00, 01, 10, 11}, in which case the 010101... string above has zero entropy, as it should, whereas with the first choice {0, 1} it had maximum entropy. You can further expand this notion to a probability space of computer programs, so the digits of a number like pi or e actually have very little information/entropy beyond the algorithm which produces them (as they should), rather than the infinite information reported by a more naive/limited choice of events in the probability space.
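Here is that dependence on the choice of events as a small sketch (Python assumed):

```python
import math
from collections import Counter

def event_entropy(s, k):
    """Shannon entropy when each event is 'which k characters come next'."""
    events = [s[i:i + k] for i in range(0, len(s) - k + 1, k)]
    counts = Counter(events)
    n = len(events)
    # H = sum_x p(x) * log2(1 / p(x)), equivalent to -sum_x p(x) * log2 p(x)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

s = "01" * 1000
print(event_entropy(s, 1))   # 1.0: events {0, 1}, the string looks maximally random
print(event_entropy(s, 2))   # 0.0: events {00, 01, 10, 11}, only "01" ever occurs
```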

Shannon entropy, as an upper bound, is useful though. The most basic choice of events {0, 1} will tell you that a file that is just 100 megabytes of zeros has basically no information, as will the choice {00, 01, 10, 11}, as will the one that uses computer programs, etc. As your choice of events becomes more complex, your assessment of the entropy becomes ever more accurate; it just never quite gets all the way there. It's unsatisfying in that regard, but it's in these spaces that creative ideas blossom, I have found. :)
 
  • #5
I wonder what you get when you use the Shannon formula for entropy on your compressed files?
 

Related to How to practically measure entropy of a file?

1. What is entropy and why is it important to measure it in a file?

Entropy is a measure of the randomness or disorder in a system. In the context of a file, it refers to the unpredictability or complexity of its data. It is important to measure entropy in a file because it can provide insights into its level of compression, encryption, or even the presence of hidden data.

2. How is entropy calculated for a file?

Entropy is calculated using the Shannon entropy formula, which takes into account the frequency of occurrence of each symbol (e.g. a byte or character) in the file. With byte-valued symbols and base-2 logarithms, this yields a value between 0 and 8 bits per symbol, with higher values indicating higher entropy.
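For example, a file in which every byte value 0-255 occurs equally often gives H = −∑ (1/256) log2(1/256) = 8 bits per byte, while a file consisting of a single repeated byte value gives H = 0.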

3. What tools can be used to measure the entropy of a file?

There are several tools available for measuring the entropy of a file, such as Entropy, Dieharder, and the NIST Statistical Test Suite. These tools use different algorithms and statistical methods to estimate entropy and report a numerical value as output.

4. Can entropy be measured for any type of file?

Yes, entropy can be measured for any type of file, including text files, images, audio files, and even executable files. However, the accuracy of the measurement may vary depending on the type of file and the specific tool or method used.

5. How can the entropy of a file be used in practical applications?

The entropy of a file can be used in various practical applications, such as data compression, cryptography, and data forensics. It can also be used to detect the presence of malware or other hidden information in a file. Additionally, measuring the entropy of a file can help in analyzing and improving its efficiency and security.
