What is the Shannon Entropy of the String 'QXZ'?

  • Context: Graduate
  • Thread starter: Karl Coryat
  • Tags: Entropy, Shannon entropy
SUMMARY

The Shannon entropy of the single string "QXZ" is not something the standard formula H(P) = –Σ pᵢ log₂ pᵢ computes: Shannon entropy is defined for a probability distribution, here the distribution over all possible three-letter strings, not for one realization drawn from it. The confusion in the thread arises from conflating entropy with surprisal: the quantity that grows for unlikely letters is the surprisal –log₂ pᵢ, whereas each entropy term –pᵢ log₂ pᵢ is weighted by the small probability pᵢ. For analyzing specific strings, Kolmogorov complexity is suggested as a more suitable measure.

PREREQUISITES
  • Understanding of Shannon entropy and its formula H(P) = –Σ pᵢ log₂ pᵢ
  • Familiarity with probability distributions and their properties
  • Basic knowledge of information theory concepts
  • Awareness of Kolmogorov complexity as an alternative measure
NEXT STEPS
  • Study the application of Shannon entropy in probability distributions
  • Explore Kolmogorov complexity and its implications for string analysis
  • Read "Information Theory, Inference and Learning Algorithms" by MacKay for deeper insights
  • Investigate the frequency distribution of letters in English text
USEFUL FOR

Students and professionals in information theory, data scientists analyzing string data, and anyone interested in the mathematical foundations of entropy and complexity.

Karl Coryat
Shannon entropy of "QXZ"

Hello everyone. I am trying to determine the Shannon entropy of the string of letters QXZ, taking into consideration those letters' frequency in English. I am using the formula:

[tex]H(P) = -\sum_i p_i \log_2 p_i[/tex]

What's puzzling me is that I expected to calculate a high entropy, since QXZ is an unexpected string in the context of English letter frequencies -- but the leading [tex]p_i[/tex] factor in each term, which takes very small values (e.g., 0.0008606 for Q), is greatly diminishing my result. I am obviously making a wrong assumption or misapplying something, because as I understand it, letters with high surprisal should increase the entropy of the string, not reduce it.

Concretely, here is the calculation I am doing, as a small Python sketch (the Q value is from the frequency table I am using; the X and Z values are rough placeholder figures):
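[code]
import math

# Approximate English letter frequencies; only the Q value is from the
# table quoted above, X and Z are rough placeholder figures.
p = {"Q": 0.0008606, "X": 0.0015, "Z": 0.00074}

# Summing -p_i * log2(p_i) over just the three letters in the string:
# each term is weighted by the tiny p_i, so the total comes out small.
H = -sum(p[c] * math.log2(p[c]) for c in "QXZ")
print(f"{H:.4f} bits")  # ~0.03 bits -- the puzzlingly small result
[/code]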

Thank you in advance for your generous help.
 
Shannon entropy is defined for a probability distribution. You are apparently assigning a probability to one particular string of letters and trying to apply the formula for Shannon entropy to the probability of that string occurring. Shannon entropy can be computed for the probability distribution over all 3-letter strings (i.e., it applies to a set of probabilities that sum to 1.0); it does not apply to a single realization of a 3-letter string drawn from that distribution.

As a minimal sketch (Python; the letter-frequency table is rounded and only approximates real English), here is the entropy of that distribution, assuming the three letters are drawn independently:
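[code]
import math

# Rounded approximate English letter frequencies (they sum to ~1).
freq = {
    'E': .127, 'T': .091, 'A': .082, 'O': .075, 'I': .070, 'N': .067,
    'S': .063, 'H': .061, 'R': .060, 'D': .043, 'L': .040, 'C': .028,
    'U': .028, 'M': .024, 'W': .024, 'F': .022, 'G': .020, 'Y': .020,
    'P': .019, 'B': .015, 'V': .0098, 'K': .0077, 'J': .0015,
    'X': .0015, 'Q': .00095, 'Z': .00074,
}

# Entropy of ONE letter drawn from this distribution:
H_letter = -sum(p * math.log2(p) for p in freq.values())

# Under independence, the entropy of the whole 3-letter ensemble is
# just 3 * H_letter -- a property of the distribution, not of "QXZ".
print(f"H(letter)   ~ {H_letter:.2f} bits")      # ~4.2 bits
print(f"H(3-string) ~ {3 * H_letter:.2f} bits")  # ~12.5 bits
[/code]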

Perhaps you should try Kolmogorov complexity if you want to deal with definite strings of letters.
 
Karl,

I don't know much about information theory, but I think the Shannon information content of "Q" in English text is simply [tex]-\log_2 P(Q)[/tex]. The formula you quote for H(P) gives the entropy of an "ensemble" (or distribution), e.g. the entropy of a randomly selected letter in English text.

For example (a small Python sketch; the letter probabilities are rough assumed values from a standard frequency table), the information content of each letter, and of the whole string if you treat the letters as independent, comes out large, as you expected:
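[code]
import math

# Rough assumed letter probabilities (any standard English frequency
# table will give similar values):
p = {"Q": 0.00095, "X": 0.0015, "Z": 0.00074}

# Shannon information content (surprisal) of each letter: -log2(P(c))
for c in "QXZ":
    print(f"h({c}) = {-math.log2(p[c]):.2f} bits")

# Treating the letters as independent, surprisals add, so the whole
# string carries roughly 30 bits -- the large number you expected.
total = -sum(math.log2(p[c]) for c in "QXZ")
print(f"h(QXZ) = {total:.2f} bits")
[/code]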

Reference: "Information Theory, Inference and Learning Algorithms" by MacKay (which is available for free download) http://www.inference.phy.cam.ac.uk/itila/
 
