Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Logarithm of a discrete random variable

  1. Jul 10, 2009 #1
    I am trying to explore a number of things regarding the entropy of random strings and am wondering how a character set of random size would affect the entropy of strings made from that set.

    Using the following formula, I need to take the log of a discrete random variable
    [tex]H = L\log_2 N[/tex]

    H is the entropy of the string in bits,
    L is the length of the string in characters
    N is the discrete random variable representing the number of possible characters to choose from

    How do you take the logarithm of a discrete random variable? Is there a general method that takes into account any maximum or minimum size of this variable?

    Thanks in advance
  2. jcsd
  3. Jul 10, 2009 #2
  4. Jul 10, 2009 #3
    Sorry, let me clarify

    In the event that you are using the 26 characters from the alphabet, then 26 would be the maximum value, 1 would be the minimum value for this discrete random variable. The number of characters from this 26 character set that you use does not have to be 26, it is random. Your string could be all 'A' 's, all vowels, etc. All you know about N is that it is between 1 and 26 (if you are using the alphabetic characters)

    I don't know how useful this specific idea really is as far as application is concerned, but I would like to know what the log of a random (discrete or not) variable is and how you would calculate it.

    Thanks for the speedy reply, hope this helps
  5. Jul 10, 2009 #4
    I think you are suffering from a few misconceptions. Foremost among them is the fact that strings do not have entropy (except perhaps kolmogorov entropy, but that's a different story). Random variables are the things that have entropy.
  6. Jul 10, 2009 #5
    If you have a string built entirely from iid random characters, wouldn't the string be a random variable and therefore have entropy?

    Shouldn't the entropy of the random string be the product of the number of iid characters in the string and the logarithm of the number of possible characters to choose from for each character of the string?

    Thanks again, I really appreciate the help.
  7. Jul 10, 2009 #6
    It seems like the string would have an entropy associated with it, sure. Strings can have more or less order... a simple example:

    E = {a, b, c}

    W3 = {aaa, aab, aac, aba, abb, abc, aca, acb, acc, baa, bab, bac, bba, bbb, bbc, bca, bcb, bcc, caa, cab, cac, cba, cbb, cbc, cca, ccb, ccc}


    3 a's, 0 b's, 0 c's: 1/27
    3 b's, 0 a's, 0 c's: 1/27
    3 c's, 0 a's, 0 c's: 1/27
    2 a's, 1 b, 0 c's: 3/27
    2 a's, 0 b's, 1 c: 3/27
    2 b's, 1 a, 0 c's: 3/27
    2 b's, 0 a's, 1 c: 3/27
    2 c's, 1 a, 0 b's: 3/27
    1 a, 1 b, 1 c: 6/27

    So the 1/1/1 state would be the highest-entropy state.

    Sorry if this is completely missing the point. Physics major talking.
  8. Jul 10, 2009 #7
    If the "string" was built from iid random variables each taking values over an alphabet, it wouldn't really be a string. It would be a sequence of random variables. It would have an entropy, but if you _sampled_ it to get an actual specific string, that string wouldn't have entropy. That distinction is important.

    The entropy of the sequence of random variables would be the sum of entropies of the individual random variables. If each individual random variable had the uniform distribution over 1,2,...,N, then the entropy of the sequence of random variables would be L log_2 N, where N and L are fixed constants, not random variables.

    AUMathTutor, there is a way to assign "entropy" to fixed strings, namely kolmogorov complexity, but that is not the same kind of entropy referred to in the source coding theorem (and in general it's impossible to determine exactly).
  9. Jul 10, 2009 #8
    So if you were to roll three six-sided fair die to determine a value for L, this would not impact the entropy of the "string"? In this case, would you just assume the maximum possible value for L which would be 18? Likewise for N?

    That would make sense, but what if you don't know the maximum. For example, a character set of user defined glyphs that could for all we know have an N of infinite (although this is almost surely not the case). I've never had formal education on any of this but I would guess that infinite entropy would violate some physical laws?

    Thanks again guys
  10. Jul 10, 2009 #9
    If you were to use some random process to determine L before producing the string, then yes, that would affect the entropy of the resulting random variable. The entropy of the R.V. would still be a constant, however; it would not depend on L in that case.
  11. Jul 11, 2009 #10
    Alright, I think I understand now. If [tex]H = L \log_2 N[/tex] where either L or N are random variables, then H will be a random variable and completely esoteric.

    Example: You are a literate English speaking person and are randomly introduced to a random culture where you have no idea about their language other than the fact that they have written communication involving strings of characters. You know nothing about the size of their alphabet, case-sensitivity, etc. Relative to you and in a defined context, the entropy of the string in which a literate local would understand to mean what you understand of the string "hello" from your culture (given the same context) would be random. Relative to the literate local, the entropy of this string would be non-random and not esoteric.

    Maybe I have been confusing random with chaotic in the context of the logarithm. Would the logarithm of a random variable be mathematically different from the random variable itself only if the random variable was deterministic? Example: If you were to take the natural log of the random variable X, where X is the roll of two 6-sided fair die, would ln(X) be the natural log of X before or after the dice are rolled.

    Thanks again.
  12. Jul 11, 2009 #11
    Perhaps I'm misunderstanding your question or oversimplifying, but the (Shannon) entropy (H) of any randomly generated string of length L from a fixed set of n available unique characters is: H= -log(1/n)^L where (1/n) is the probability under a uniform probability distribution of choosing a particular character from a set of n unique characters.

    So if the character set consists of 26 unique characters, than the probability of choosing any particular character is just 1/26 for each trial. If we have string length L and choose base 2 than H=-logb2(1/26)^L. It's easy to see how the values of L and n affect H.
    Last edited: Jul 11, 2009
  13. Jul 11, 2009 #12
    For clarity's sake, it's best to maintain the distinction between a randomly generated string, and a sequence of random variables. A specific string does not have a Shannon entropy because it is not a random variable.
  14. Jul 11, 2009 #13
    I don't follow that. You can very well ask what the probability of a random generator generating a given string is, based on the size of the character set. You can ask what the probability of getting HHHH in four fair coin tosses is. You can also ask what the probability of getting THHT or any other sequence specified a priori is. The probability is 1/16 regardless of what the specification is. The entropy of a randomly generated string is dependent on two variables: the size of the character set (n), and the length of the string (L).
    Last edited: Jul 11, 2009
  15. Jul 11, 2009 #14
    Precisely. The entropy of any four toss sequence would be four bits:

    [tex]H = \sum^{I = 4} _{i = 1} log_2 P(h_i)[/tex]

    Where [tex]H[/tex] is entropy in bits, [tex]i[/tex] is the toss in the sequence, [tex]I[/tex] is the sequence length, [tex]P(h_i)[/tex] is the probability of getting heads.


    [tex]H = 4 log_2{1/2} = 4 * 1 = 4[/tex]

    So again I ask:
    If [tex]P(h_i)[/tex] is itself a random process, is the entropy of a sequence [tex]I[/tex] characters in length random and esoteric?

    Thanks for your patience
  16. Jul 11, 2009 #15
    If the entropy of a string is fully determined by two variables, and one or both of these variables are random variables then the entropy is also a random variable. That is, the entropy of the string you actually get is the product of a random process. However, you cannot know the parameters of the distribution of possible strings without knowing the parameters of p(n) and/or p(L). Does that answer your question?

    Frankly, this is pretty esoteric.

    PS: I used to use hierarchies of probabilities to simulate baseball games. I never could do it for football or basketball, let alone soccer however. However, you have to set the parameters for the distribution you're using. I used real data for the first level (ie batting average), but made up values for the second level.
    Last edited: Jul 11, 2009
  17. Jul 11, 2009 #16
    Yes, the probability of getting THHT is 1/16 (_assuming_ you are generating only strings of length 4, and assuming a uniform probability distribution). However, events do not have Shannon entropies, only random variables do. This is a clear question of definition which can be resolved by you looking up the definition of Shannon entropy.

    You COULD speak of the surprisal of the event. If X is a random variable with the discrete uniform distribution over those strings of the alphabet {T,H} which have length exactly 4, then the surprisal of THHT would be -ln_2 (1/16) = 4 bits. But this fact is easily misleading, because it highly depends on X having that specific distribution. Also important to note is that with this distribution, strings of length other than 4, such as THH, have no surprisal defined because X does not take those values.

    Suppose instead that X is a random variable distributed over all nonempty strings of the alphabet {T,H}, such that P(X=x) = 4^{- |x|} where |x| is the length of the string x. You can verify that this is a valid distribution--start by showing P(|X|=y) = 2^{-y}. Now, with this distribution, the surprisal of THHT is -ln_2 (1/256) = 8 bits.
  18. Jul 11, 2009 #17
    Thank you sir/madam! This is precisely what I needed to know in order to move on to the next phase (aside from the logarithm thing which was covered earlier). Any future questions (of mine) regarding this information will be topics for other threads.

    Thanks again :approve:
  19. Jul 11, 2009 #18
    The practical lesson to take from this is that strings themselves do not carry information; they only have information (surprisal) with respect to a particular person's assumed probability distribution. The string "Paris is in France" carries less information to someone who already knows that fact than to someone who does not.

    Also, if the surprisal depends on the parameter L and you make L a random variable, what you would normally do is change the distribution of strings. The surprisal you get is then calculated from that new distribution, and as before it is a constant, not a random variable.
    Last edited: Jul 11, 2009
  20. Jul 11, 2009 #19
    Yes, I assume that each possible trial outcome is equally likely for randomly generated discrete string elements (digit strings, etc) unless otherwise specified. The Shannon equation reduces to the one I used when making this assumption. I prefer to keep things simple unless otherwise required. I made clear that I was talking about a uniform distribution (post 11). I believe this is what the OP had in mind.
    Last edited: Jul 11, 2009
  21. Jul 11, 2009 #20
    Yes again. You could argue that a randomly generated string, once it is generated, has no information in that it's a completed known fact. This could also be true of an interpretable string which resolves some uncertainty. Strictly speaking, the IT definition of information is counter-intuitive. There is no information once uncertainty is resolved but there is a high surprisal when you learn you won a lottery against all odds. I suppose the "information" is knowing the probabilities when you buy a lottery ticket.

    Nevertheless, entropy can be defined in terms of the number of discrete states in which a system can exist. The more states, the higher the entropy. A randomly generated character string can be thought of as one of a countable number of states in which a string could exist. In this sense, such a string has measurable entropy.
    Last edited: Jul 12, 2009
  22. Jul 12, 2009 #21
    The probability of THH is (1/8). Shannon entropy is a function of probability and nothing else. Where am I wrong? In terms of surprisal: H=-logb2(1/2)^3=3. Why can't surprisal be defined for this? The binary representation of three is (11). Can surprisal only be evaluated for binary values like 10, 100, 1000, etc? If so, it seems to be a rather restricted measure.
    Last edited: Jul 12, 2009
  23. Jul 12, 2009 #22
    As I defined X earlier, it is uniformly distributed over those strings of length 4 over the alphabet {T,H}. The probability of THH with that distribution is 0, so the surprisal of THH is not defined.
  24. Jul 12, 2009 #23
    ?????Certainly there is a probability of THH. I don't care if some distribution of X is restricted to binary values: 10, 100, 1000, 10000, etc. The probability of THH or any pre-specified sequence of three coin tosses is 0.125. If you have relevant probabilities and nothing else, you can calculate Shannon entropy. We've already discussed how Shannon entropy might be interpreted (a priori or a posteriori). It's clear I'm taking the former view, you're taking the latter. In the latter view, it seems information doesn't exist. Once an event occurs, it has a probability of one.
    Last edited: Jul 12, 2009
  25. Jul 12, 2009 #24
    Maybe you have some confusion over terminology. The strings over the alphabet {T,H} are the empty string, T, H, TT, TH, HT, HH, TTT, TTH, THT, THH, HTT, and so forth to infinity. The strings over the alphabet {T,H} of length 4 are the strings TTTT, TTTH, TTHT, TTHH, THTT, THTH, THHT, THHH, HTTT, HTTH, HTHT, HTHH, HHTT, HHTH, HHHT, HHHH. There are only sixteen of those. X is distributed uniformly over those sixteen strings, giving each string of length 4 a probability of 1/16. It should now be clear to you that P(X=THH) = 0, because THH is not a string of length 4. Also, if P(X=THH) were greater than 0, then X would not have a valid distribution because the sum of the probabilities of each event of X would be greater than 1.
    Last edited: Jul 12, 2009
  26. Jul 12, 2009 #25
    There's more than confusion over terminology here. You gave the alphabet of {T,H} as including THH. Now you're saying that THH has probability zero because its not a string of length 4. Tell me why this isn't patent nonsense. The probability of any pre-specified sequence of three coin tosses is (0.5)(0.5)(0.5)= 0.125. What does this have to do with summing probabilities????? Are we on the same planet?
    Last edited: Jul 12, 2009
Share this great discussion with others via Reddit, Google+, Twitter, or Facebook