
How to practically measure entropy of a file?

  1. Aug 21, 2015 #1
    I'm trying to measure how much non-redundant (actual) information my file contains. Some call this the amount of entropy.

    Of course there is the standard Shannon formula, H = -sum p(x) log2 p(x), but I think that Shannon was only considering it from the point of view of transmitting through a channel. Hence the formula requires a symbol (block) size, typically 8 bits. For a large file this calculation is fairly useless, since it ignores short- and long-distance correlations between symbols.
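
    For reference, here is a minimal Python sketch of that per-symbol calculation, assuming 8-bit symbols (bytes) that are treated as independent of one another. The function names are illustrative, not taken from any particular tool.

```python
import math
from collections import Counter


def byte_entropy_bits(data: bytes) -> float:
    """Order-0 Shannon entropy in bits per byte: H = sum p(x) * log2(1 / p(x))."""
    n = len(data)
    if n == 0:
        return 0.0
    counts = Counter(data)  # frequency of each byte value 0..255
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def entropy_size_bytes(data: bytes) -> float:
    """File size implied by the order-0 entropy, in bytes.

    This ignores every correlation between bytes, so it is only a crude
    estimate of the information content.
    """
    return byte_entropy_bits(data) * len(data) / 8
```

    Reading a file with open(path, "rb").read() and passing the bytes to entropy_size_bytes gives a size figure that can be compared directly with the compressed sizes.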

    There are binary tree and Ziv-Lempel methods, but these seem highly academic in nature.

    Compressibility is also regarded as a measure of entropy, but there seems to be no lower limit as to the degree of compression. For my file hiss.wav,

    - original hiss.wav = 5.2 MB
    - entropy via the Shannon formula = 4.6 MB
    - hiss.zip = 4.6 MB
    - hiss.7z = 4.2 MB
    - hiss.wav.fp8 = 3.3 MB

    Is there some reasonably practicable method of measuring how much entropy exists within hiss.wav?
  2. jcsd
  3. Aug 21, 2015 #2
    Well, it's good to look at Shannon's original formulation of entropy. One thing you'll notice is its simplicity, and that it really provides a sort of upper bound on things, but that's it. The most basic implementation, which only counts how often 0 and 1 occur, can't tell the difference between 01010101... and some totally random string of zeros and ones (with equal probability of each), even though the latter clearly contains MUCH more information than the former.

    "Compressibility is also regarded as a measure of entropy, but there seems to be no lower limit as to the degree of compression."

    True. Or, more specifically, there is a smallest possible compressed size (the Kolmogorov complexity), but we provably can't ever find it with certainty (it's not computable). So with that interpretation of Shannon information, we can only establish a sort of upper bound: the best compression we have found so far. For you, it's 3.3 MB.
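
    A rough sketch of the "best compression found so far" idea, using only the compressors that ship with Python's standard library. The function names are hypothetical, and the result is an upper bound only: a better compressor may always turn up.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Compressed size in bytes under each standard-library codec."""
    return {
        "zlib": len(zlib.compress(data, 9)),  # DEFLATE, highest level
        "bz2": len(bz2.compress(data, 9)),    # bzip2, largest block size
        "lzma": len(lzma.compress(data)),     # xz container, default preset
    }


def best_upper_bound(data: bytes) -> tuple:
    """Return (codec name, size) for the smallest output found.

    This bounds the information content from above; it never proves
    that the data cannot be compressed further.
    """
    sizes = compressed_sizes(data)
    name = min(sizes, key=sizes.get)
    return name, sizes[name]
```

    Each size includes the container's own header overhead, so data that is already close to random can come out slightly larger than the input.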
  4. Aug 22, 2015 #3
    I think you're right, but that both depresses and surprises me. Compression is ubiquitous: ADSL service, hard disks, the icons on this page, etc. I would have thought that with the scale of research dollars globally expended on maximising data transmission and storage, there would be a more developed way of estimating (at least) how much of the darned stuff you're actually dealing with. I wouldn't have thought it beyond the realms of possibility that there would be a file utility that you pass over some data and that outputs the theoretical entropy estimate. Just what are the telcos and disk manufacturers playing at :confused: ?
  5. Aug 22, 2015 #4
    Well... One way to think of it is that you CAN make some entropy estimate pretty easily; the question is, how meaningful is it? For instance, it's really easy to just measure the Shannon entropy of some file in terms of "odds the next bit is a one or a zero". But how meaningful is it? Like I said, a file that is just the string 01010101... will show max entropy, but that's not a meaningful measure of the information in the file.

    The issue is, Shannon entropy is defined in terms of probabilities of events (outcomes of random variables) but leaves the choice of those random variables up to you. You could choose the events as which bit is next in {1, 0}, or you could define them as which 2 bits are next in {00, 01, 10, 11}, in which case the 010101... string above has zero entropy, as it should, whereas with the first choice {0, 1} it had max entropy. You can further expand this notion to a probability space of computer programs, so the digits of a number like pi or e actually have very little information/entropy beyond the algorithm which produces them (as they should), rather than the infinite information reported by a more naive/limited choice of events in the probability space.
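
    To make the {0, 1} versus {00, 01, 10, 11} comparison concrete, here is a small Python sketch that measures the entropy of a bit string over non-overlapping blocks of a chosen width. The function name and the use of a text string of '0'/'1' characters are assumptions made purely for illustration.

```python
import math
from collections import Counter


def block_entropy_per_bit(bits: str, block: int) -> float:
    """Shannon entropy over non-overlapping `block`-bit symbols,
    normalised to bits of entropy per input bit."""
    # Cut the string into consecutive blocks; a trailing partial block is dropped.
    symbols = [bits[i:i + block] for i in range(0, len(bits) - block + 1, block)]
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    h = sum((c / n) * math.log2(n / c) for c in counts.values())
    return h / block


alternating = "01" * 5000
print(block_entropy_per_bit(alternating, 1))  # events {0, 1}: 1.0 (max entropy)
print(block_entropy_per_bit(alternating, 2))  # events {00, 01, 10, 11}: 0.0
```

    With 1-bit events the alternating string looks maximally random; with 2-bit events every symbol is "01", so the measured entropy drops to zero.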

    Shannon entropy, as an upper bound, is useful though. The most basic choice of events {0, 1} will tell you that a file that is just 100 megabytes of zeros has basically no information, as will the next choice {00, 01, 10, 11}, as will the one that uses computer programs, etc. As your choice of events becomes more complex, you become ever more accurate in your assessment of entropy; one just never quite gets all the way there. It's unsatisfying in that regard, but it's in these spaces that creative ideas blossom, I have found. :)
  6. Aug 29, 2015 #5


    I wonder what you get when you use the Shannon formula for entropy on your compressed files?
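
    One way to try this, sketched in Python under some loud assumptions: the "file" is a made-up stand-in (random words drawn from a tiny vocabulary), zlib stands in for whichever compressor is actually used, and the entropy is the simple per-byte Shannon figure. The expectation is that the compressed bytes score much closer to 8 bits per byte than the original, since a good compressor removes exactly the redundancy this measure can see.

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy_bits(data: bytes) -> float:
    """Order-0 Shannon entropy in bits per byte."""
    n = len(data)
    if n == 0:
        return 0.0
    counts = Counter(data)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def make_sample_text(n_words: int = 20000, seed: int = 0) -> bytes:
    """Hypothetical stand-in for a real file: random words from a small vocabulary."""
    vocab = ["entropy", "file", "signal", "noise", "channel", "symbol", "code", "bit"]
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode("ascii")


original = make_sample_text()
compressed = zlib.compress(original, 9)

h_original = byte_entropy_bits(original)
h_compressed = byte_entropy_bits(compressed)

print(f"original:   {len(original)} bytes, {h_original:.2f} bits/byte")
print(f"compressed: {len(compressed)} bytes, {h_compressed:.2f} bits/byte")
```

    The same comparison can be run on real files by replacing make_sample_text() with the bytes read from disk.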