It is often said, that entropy is the average number of yes-no
questions you have to ask to obtain the information "which event"
occurred.
In Honerkamp "Statistical Physics An Advanced Approach with
Applications" (2.4.1) I found an explanation of the above dictum, but
I don't understand it. I first cite Honerkamp, tell you what I don't
understand and pose some further questions below.
Perhaps someone can help me!
Honerkamp writes:
"Let {A_1,...,A_N} be a complete, disjoint set of events, i.e A_1
u ....u A_N = \Omega.
Furthermore, let P be a probability defined for those events. We then
define the entropy as S = -k \sum_{i=1}^{N} P(A_i) ln(P(A_i)). Here k
represents a factor which we set 1 for the moment. In the framework of
statistical mechanics k will be Boltzmann's constant k_B."
...
"If an event has occurred, then, as we will show in a moment, -log_2
P(A_j) is a good measure of the number of questions to be asked in
order to find out that it is just A_j which is realized."
...
and now the section I don't understand:
"To show that -log_2 P(A_j) is just equal to the number of required
yes-or-no questions, we first divide \Omega into two disjoint domains
\Omega_1 and \Omega_2 such that \sum_{A_i \in \Omega_1} P(A_i) =
\sum_{A_i \in \Omega_2} P(A_i) = 1/2."
I think this is in general not possible. Consider just the trivial
case P(A_1) = 1 and P(A_i) = 0 for i != 1.
Honerkamp proceeds:
"The first question is now: Is A_j in \Omega_1? Having the answer to
this question we next consider the set containing A_j and multiply the
probabilities for this set by a factor of 2. The sum of probabilities
for this set is now again equal to 1, and we are in the same position
as before with the set \Omega: We divide it again and ask the
corresponding yes-or-no question.This procedure ends after k steps,
where k is the smallest integer such that 2^kP(A_j) becomes equal to
or larger than 1. Consequently, -log_2P(A_j) is a good measure of the
number of yes-or-no questions needed."
Is there a (simple) way to modify this algorithm to get a satisfactory
interpretation of the concept of entropy? Let me make this question
more concrete:
Assume you know the probabilities P_1,...,P_N, you want to find out
what event (A_1 or ... or A_N) occurred and you have an algorithm aa
to do so. Then, if you apply your algorithm and suppose the result
will be A_j, you have to ask, say f_aa(A_j) questions.
Since I have statistical mechanics in mind, let N be a natural number
or infinity (Honerkamp doesn't say something about this as I can see)
and let us assume, that the following series converge (in the case of
N = infinity) (do you agree, that this can be assumed??).
In that case, the average number of questions you have to ask is
C_aa((P_k)_k\in{1...N}) := \sum_j P(A_j)f_aa(A_j).
Then the question is: Is there a modification mm of the algorithm
which Honerkamp described (and which doesn't work in my opinion), that
leads to f_mm = -log_2? If not, does an algorithm ll exist at all,
which leads to f_ll = -log_2? If not, how would you motivate the
definition of entropy? If yes, is it true that there is no other
algorithm aa such that C_aa((P_k)_{k\in\{1...N\}}) < C_ll((P_k)_{k\in
\{1...N\}}) for some (P_k)_{k \in \{1...N\}} (P_k \in [0,1]...). If
yes, can you give me a mathematical rigorous proof. If not, can you
give me an algorithm cc which serves as counterexample? In the latter
case: why is then entropy not defined as sum_j P(A_j) f_cc(A_j) ?
Are there other ways to motivate, that -log_2P(A_j) is a good measure
for the "increment of knowledge" by receiving the information that A_j
is realized.
Surely -log_2P(A_j) >= 0 and -log_2P(A_j) = 0 iff P(A_j) = 1, which
are reasonable properties of such a measure, but that's not enough to
motivate the formula -log_2P(A_j)...
References to books or papers where (parts of) my questions were
answered would be nice.
Thanks,
Chris M
Ian Parker
Jun26-08, 06:00 AM
On 23 Jun, 23:09, chris.meyer...@googlemail.com wrote:
> It is often said, that entropy is the average number of yes-no
> questions you have to ask to obtain the information "which event"
> occurred.
>
> In Honerkamp "Statistical Physics An Advanced Approach with
> Applications" (2.4.1) I found an explanation of the above dictum, but
> I don't understand it. I first cite Honerkamp, tell you what I don't
> understand and pose some further questions below.
>
> Perhaps someone can help me!
>
> Honerkamp writes:
>
> "Let {A_1,...,A_N} be a complete, disjoint set of events, i.e A_1
> u ....u A_N = \Omega.
>
> Furthermore, let P be a probability defined for those events. We then
> define the entropy as S = -k \sum_{i=1}^{N} P(A_i) ln(P(A_i)). Here k
> represents a factor which we set 1 for the moment. In the framework of
> statistical mechanics k will be Boltzmann's constant k_B."
> ..
> "If an event has occurred, then, as we will show in a moment, -log_2
> P(A_j) is a good measure of the number of questions to be asked in
> order to find out that it is just A_j which is realized."
> ..
>
> and now the section I don't understand:
>
> "To show that -log_2 P(A_j) is just equal to the number of required
> yes-or-no questions, we first divide \Omega into two disjoint domains
> \Omega_1 and \Omega_2 such that \sum_{A_i \in \Omega_1} P(A_i) =
> \sum_{A_i \in \Omega_2} P(A_i) = 1/2."
>
> I think this is in general not possible. Consider just the trivial
> case P(A_1) = 1 and P(A_i) = 0 for i != 1.
>
> Honerkamp proceeds:
>
> "The first question is now: Is A_j in \Omega_1? Having the answer to
> this question we next consider the set containing A_j and multiply the
> probabilities for this set by a factor of 2. The sum of probabilities
> for this set is now again equal to 1, and we are in the same position
> as before with the set \Omega: We divide it again and ask the
> corresponding yes-or-no question.This procedure ends after k steps,
> where k is the smallest integer such that 2^kP(A_j) becomes equal to
> or larger than 1. Consequently, -log_2P(A_j) is a good measure of the
> number of yes-or-no questions needed."
>
> Is there a (simple) way to modify this algorithm to get a satisfactory
> interpretation of the concept of entropy? Let me make this question
> more concrete:
>
> Assume you know the probabilities P_1,...,P_N, you want to find out
> what event (A_1 or ... or A_N) occurred and you have an algorithm aa
> to do so. Then, if you apply your algorithm and suppose the result
> will be A_j, you have to ask, say f_aa(A_j) questions.
>
> Since I have statistical mechanics in mind, let N be a natural number
> or infinity (Honerkamp doesn't say something about this as I can see)
> and let us assume, that the following series converge (in the case of
> N = infinity) (do you agree, that this can be assumed??).
>
> In that case, the average number of questions you have to ask is
> C_aa((P_k)_k\in{1...N}) := \sum_j P(A_j)f_aa(A_j).
>
> Then the question is: Is there a modification mm of the algorithm
> which Honerkamp described (and which doesn't work in my opinion), that
> leads to f_mm = -log_2? If not, does an algorithm ll exist at all,
> which leads to f_ll = -log_2? If not, how would you motivate the
> definition of entropy? If yes, is it true that there is no other
> algorithm aa such that C_aa((P_k)_{k\in\{1...N\}}) < C_ll((P_k)_{k\in
> \{1...N\}}) for some (P_k)_{k \in \{1...N\}} (P_k \in [0,1]...). If
> yes, can you give me a mathematical rigorous proof. If not, can you
> give me an algorithm cc which serves as counterexample? In the latter
> case: why is then entropy not defined as sum_j P(A_j) f_cc(A_j) ?
>
> Are there other ways to motivate, that -log_2P(A_j) is a good measure
> for the "increment of knowledge" by receiving the information that A_j
> is realized.
>
> Surely -log_2P(A_j) >= 0 and -log_2P(A_j) = 0 iff P(A_j) = 1, which
> are reasonable properties of such a measure, but that's not enough to
> motivate the formula -log_2P(A_j)...
>
> References to books or papers where (parts of) my questions were
> answered would be nice.
>
I think your book explains things very badly. Let us look at a quantum
mechanical system. No matter what its complexity it can be expressed
in terms of a large but finite number of quantum states. A finite
number of quantum states can be expressed as a binary number. It says
a bit placed un an up/down true/false position.
Quantum theory gives us a clear numerical value, without quantum
theory we would have to express states in terms of Cantor's infinite
sets.
The second law of thermodynamics says that heat flows from a higher to
a lower temperature. We define temperature such that dE/dS = T
T = Temperature
E = Energy
S = Entropy
This is the normal definition of temperature. Entropy increases and
heat flows from a higher to a lower temperature. Entropy defined
statistically is
Log(number of states)
We may thus say that the temperature of a body is dE/d(Log N) where N
is the number of states. This is where we take our two partitions.
Physicists tend to think in terms of "e" whereas information theory
tends to use the base 2. Heat flows so as to maximize the number of
Quantum states. We thus find out how many states are created by a
nanojoule in each partition.
A statistical view point means that we can readily assign temperatures
to sub systems within a solid or a gas. We can define color
temperatures, we can say that the temperature of the electrons in a
florescent tube is 20,000K, similarly for carriers in an LED. What is
the meaning of these "temperatures"? It is quantum mechanical and
statistical.
A binary number does have a feel of Hutter. We can express the first
100 Megabytes of Wikipaedia in the form of a binary number. Hutter has
defined AI in entropic terms. What is entropy? It is the same thing as
compression.
- Ian Parker
Einar Andreas Rodland
Jun27-08, 06:00 AM
[Moderator's note: Lines reformatted to make them shorter. Although 80
is a maximum to enable nice display on almost all terminals, it's better
to keep your own, original, unquoted text in posts at 72 characters per
line or less, to enable it to be quoted a couple of times and still keep
things at less than 80 characters per line total. Also, don't use more
than 2 characters (including space) for a quote symbol. -P.H.]
<chris.meyer123@googlemail.com> wrote in message
news:a41d7e38-0715-401f-9ebd-430431dc3cc2@f63g2000hsf.googlegroups.com...
> It is often said, that entropy is the average number of yes-no
> questions you have to ask to obtain the information "which event"
> occurred.
This is an explanation of entropy which I suppose is more popular in
information sciences than in physics, hence Ian Parker's response.
However, let me address your issue from the point you make it.
> Honerkamp writes:
>
> "Let {A_1,...,A_N} be a complete, disjoint set of events, i.e A_1
> u ....u A_N = \Omega.
>
> Furthermore, let P be a probability defined for those events. We then
> define the entropy as S = -k \sum_{i=1}^{N} P(A_i) ln(P(A_i)). Here k
> represents a factor which we set 1 for the moment. In the framework of
> statistical mechanics k will be Boltzmann's constant k_B."
> ..
> "If an event has occurred, then, as we will show in a moment, -log_2
> P(A_j) is a good measure of the number of questions to be asked in
> order to find out that it is just A_j which is realized."
> ..
>
> and now the section I don't understand:
>
> "To show that -log_2 P(A_j) is just equal to the number of required
> yes-or-no questions, we first divide \Omega into two disjoint domains
> \Omega_1 and \Omega_2 such that \sum_{A_i \in \Omega_1} P(A_i) =
> \sum_{A_i \in \Omega_2} P(A_i) = 1/2."
>
> I think this is in general not possible. Consider just the trivial
> case P(A_1) = 1 and P(A_i) = 0 for i != 1.
Well, in your trivial counter-example, you don't have to ask any
questions at all: you already know the answer is A_1. However, writing
P_i for P(A_i), let's take (P_1,P_2)=(2/3,1/3), and it also breaks down
since you always have to ask one question, but the entropy is less than
one bit: the reason for this is you don't get the optimal 1/2--1/2 split
corresponding to 1 bit of information, but only the 2/3--1/3 split
corresponding to less than 1 bit of information.
Honerkamp's explanation thus works only to the extent you can divide the
remaining alternatives into two groups such that each group has
likelihood 1/2. Of course, this will generally not be the case, and the
average number of yes-no questions will therefore be at least the number
of bits in the entropy: i.e. at least -Sum[P_i*log2(P_i)].
However, let's instead say you have X_1, X_2,...,X_K drawn independently
from 1...N with probabilities (P_1,...,P_N). Again, if the entropy of
the distribution is S = s bits, then even given an optimal strategy the
expected number of questions would have to be at least K*s. However,
given any epsilon>0, for sufficiently large K you can do with less than
K*(s+epsilon).
E.g. with (P_1,P_2)=(2/3,1/3) and K=2, your first question could be "Is
X_1=X_2=1?" which could at least help you get the expected number of
questions below 2 (to 17/9 if I'm right).
> Honerkamp proceeds:
>
> "The first question is now: Is A_j in \Omega_1? Having the answer to
> this question we next consider the set containing A_j and multiply the
> probabilities for this set by a factor of 2. The sum of probabilities
> for this set is now again equal to 1, and we are in the same position
> as before with the set \Omega: We divide it again and ask the
> corresponding yes-or-no question.This procedure ends after k steps,
> where k is the smallest integer such that 2^kP(A_j) becomes equal to
> or larger than 1. Consequently, -log_2P(A_j) is a good measure of the
> number of yes-or-no questions needed."
>
>
> Is there a (simple) way to modify this algorithm to get a satisfactory
> interpretation of the concept of entropy? Let me make this question
> more concrete:
I seem to recall there is an algorithm where you encode
(X_1,X_2,...,X_K) as a number in the interval [0,1]---X_1 determines
which of the intervals [0,P_1),[P_1,P_1+P_2),..., X_2 determines
subintervals similarly, etc.---and you can split the remaining interval
in half by each yes-no question. I suspect you cannot take the
K=infinity limit directly with respect to the encoding in [0,1] since
you'd want 0.99999...<1 in this encoding, but if you use the equivalent
ordering of (X_1,...), you can always make perfect 1/2--1/2 splits, so
Honerkamp's algorithm would have worked (with some caution I suppose
since K=infinity).
I'm afraid it's too long since I was looking at this to remember any
references.
Einar
a student
Jun27-08, 06:00 AM
On Jun 24, 8:09am, chris.meyer...@googlemail.com wrote:
> It is often said, that entropy is the average number of yes-no
> questions you have to ask to obtain the information "which event"
> occurred.
>
> In Honerkamp "Statistical Physics An Advanced Approach with
> Applications" (2.4.1) I found an explanation of the above dictum, but
> I don't understand it. I first cite Honerkamp, tell you what I don't
> understand and pose some further questions below.
>
The proof you describe does sound rather handwaving. One good book,
which from memory discusses this interpretation (and others) of
entropy, is `Information Theory' by Robert Ash there is a recent
Dover edition.
Entropy is only an `average' concept, and one has to consider a number
of events to pull out its meaning. You can check Ash or other
textbooks for complete rigour, but here is the basic argument.
1. For a random event generator, that generates outcome k with
probability p_k, consider a sequence of N outcomes (k1,k2,=85,kN) from
the generator (eg, a dice is thrown N times). It is assumed that each
outcome is generated independently of the others. The probability of
generating a given sequence is then given by
P = (p_1)^(n_1) (p_2)^(n_2) =85. ,
where n_k is the number of times that outcome k appears in the
sequence.
2. Now, as N is increased, we expect that any particular outcome k
will, typically, appear in the sequence a total of approximately
n_k = Np_k
times (rounded to the nearest integer). Such a sequence is called a
`typical sequence'. It follows that the probability of generating
given typical sequence is given by, approximately,
P_typ = (p_1)^(Np_1) (p_2)^(Np_2) =85. = 2^{-NH] ,
where H is the Shannon entropy function
H = - sum_k p_k log_2 p_k .
3. It can be shown rigorously is that, for N sufficiently large, the
probability of obtaining a `typical' sequence of outcomes becomes as
close to 1 as one desires. Hence, any non-typical sequences can be
ignored FAPP. You will have to consult Ash or another textbook for
this 'law of large numbers'.
4. Since each typical sequence has the same probability, P_typ, of
occurring, and the sum of these probabilities is very close to unity
for large N, the number of typical sequences of length N must be
approximately given by
N_typ = 1 / P_typ = 2^{NH} .
5. We are now almost finished. To determine which actual typical
sequence has occurred in a given run of N outcomes, how many yes/no
questions must I ask? But this is same problem as guessing a number
from 1 to N_typ. I number the possible typical sequences from 1 to
N_typ, and ask `is the sequence of outcomes in the first half or in
the second half. And so on. This will take at most log_2 N_typ
questions, i.e., the number of questions required is
N_q = log_2 N_typ = NH.
6. Finally, the average number of questions required per outcome is
then
N_q / N = H.
Arnold Neumaier
Jun29-08, 06:00 AM
chris.meyer123@googlemail.com schrieb:
> It is often said, that entropy is the average number of yes-no
> questions you have to ask to obtain the information "which event"
> occurred.
>
> References to books or papers where (parts of) my questions were
> answered would be nice.
You may find the discussion in Appendix A of
http://lanl.arxiv.org/abs/0705.3790
illuminating.