# Need sum help (information theory)

## Main Question or Discussion Point


Hello.

I'm working on an information-theory problem that involves doing a nasty sum. The problem is this: in a widget factory there is a conveyor belt with $N$ widgets on it, and an unknown fraction $\xi = a/N$ of them are defective. You examine a sample of $n$ widgets and find that a fraction $\eta = b/n$ is defective. What is the mutual information $I(\eta : \xi)$ between the random variables $\eta$ and $\xi$? The idea, I think, is to see how large a sample $n$ you need to take so that the sample defect rate gives you information about the actual defect rate. Let $A_a$ be the event that there are $a$ defective parts in the whole lot and $B_b$ be the event that there are $b$ defective parts in the sample. Then the formula for mutual information is:

$$I (\eta : \xi) = \sum_{a=1}^{N} \sum_{b=1}^{n} P(A_{a}) P(B_{b} | A_{a}) \log_{2} { \frac{P(B_{b} | A_{a}) }{P(B_{b})} }$$

which is always nonnegative. Here's what I've got so far: $P(A_a) = 1/N$ by the principle of insufficient reason ($a$ could be anything from 1 to $N$ with equal probability), and

$$P(B_{b} | A_{a}) = \frac{ \binom{a}{b} \binom{N-a}{n-b} }{ \binom{N}{n} } = \frac{ \binom{n}{b} \binom{N-n}{a-b} }{ \binom{N}{a} }$$
$$P(B_{b}) = \sum_{a=1}^{N} P(A_{a}) P(B_{b} | A_{a}) = \sum_{a=1}^{N} \frac{1}{N} \frac{ \binom{n}{b} \binom{N-n}{a-b} }{ \binom{N}{a} } \approx \int_{0}^{1} \binom{n}{b} x^{b} (1 - x)^{n-b}\, dx = \frac{ \binom{n}{b} }{ \binom{n}{b} (n+1)} = \frac{1}{n+1}$$

if you pretend it is a Riemann sum and assume that $N \gg n$ and $a \gg b$, which I'm not sure is OK to do. I'm guessing the idea is to get some asymptotic formula for the mutual information as $N$ becomes large, but how do you retain the dependence on $N$ in the sum? For instance, if I apply the "large $N$" approximation for $P(B_{b} | A_{a})$, which is $\binom{n}{b} \left(\frac{a}{N}\right)^{b} \left(1 - \frac{a}{N}\right)^{n-b}$, and do the Riemann sum, I get an expression that has no dependence on $N$ and apparently diverges to negative infinity (weird, because mutual information is nonnegative).
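The marginal sum can also be checked numerically with Python's standard library. This is an editor's sketch, not from the thread; with the prior taken uniform over $a = 0,\dots,N$ (the ranges discussed later in the thread), the marginal appears to come out as $1/(n+1)$ even for moderate $N$:

```python
# Sketch (editor's check, not from the thread): compute the exact marginal
# P(B_b) = sum_a P(A_a) P(B_b|A_a) with a uniform prior over a = 0..N and
# compare it to the 1/(n+1) value obtained from the Riemann-sum argument.
from math import comb

def p_b_given_a(b, a, N, n):
    """Hypergeometric: P(b defective in a sample of n | a defective out of N)."""
    if b > a or n - b > N - a:
        return 0.0
    return comb(a, b) * comb(N - a, n - b) / comb(N, n)

def p_b(b, N, n):
    """Marginal P(B_b) under the uniform prior P(A_a) = 1/(N+1)."""
    return sum(p_b_given_a(b, a, N, n) for a in range(N + 1)) / (N + 1)

N, n = 1000, 10
for b in range(n + 1):
    print(b, p_b(b, N, n), 1 / (n + 1))  # the two columns agree
```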

This is not a homework problem, just a "something to think about" problem I came across in an informal book on information theory.

thanks



Here's what I've got so far: P(A_a) = 1/N by principle of insufficient reason (a could be anything from 1 to N with equal probability),
Don't forget the case $a=0$ ;]

$$P(B_{b}) = \sum_{a=1}^{N} P(A_{a}) P(B_{b} | A_{a}) = \sum_{a=1}^{N} \frac{1}{N} \frac{ \binom{n}{b} \binom{N-n}{a-b} }{ \binom{N}{a} } \approx \int_{0}^{1} \binom{n}{b} x^{b} (1 - x)^{n-b}\, dx = \frac{ \binom{n}{b} }{ \binom{n}{b} (n+1)} = \frac{1}{n+1}$$

if you pretend it is a Riemann sum and assume that $N \gg n$ and $a \gg b$, which I'm not sure is OK to do.
Don't you also have to assume that $N\to\infty$ for the Riemann sum? But the final answer seems reasonable regardless: uniform over the number of possible samples, just like the principle of insufficient reason got us for $P(A_a)$.

Plugging those three expressions for $P(A_a)$, $P(B_b)$ and $P(B_b|A_a)$ into the formula for mutual information gives a not-terribly tedious result:

$$I(A,B) = \log_2(n+1) + \frac{1}{n+1}\sum_{b=0}^n \log_2\binom{n}{b} + \frac{1}{N+1}\sum_{b=0}^n \sum_{a=0}^N \frac{\binom{n}{b}\binom{N-n}{a-b}}{\binom{N}{a}}\log_2\binom{N-n}{a-b} - \frac{1}{N+1}\sum_{a=0}^N \log_2\binom{N}{a}$$

I'm sure you can simplify that further, and maybe use a clever approximation or two, to get something that provides more intuition, though.
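As a brute-force sanity check on that expression (an editor's sketch, using the $a = 0,\dots,N$ and $b = 0,\dots,n$ ranges), the mutual information can also be computed directly from its definition and verified to lie between $0$ and $\log_2(n+1)$:

```python
# Sketch (editor's check): exact I(A;B) in bits for the uniform prior,
# computed straight from the double-sum definition of mutual information.
from math import comb, log2

def mutual_info_uniform(N, n):
    pa = 1 / (N + 1)  # uniform prior P(A_a), a = 0..N

    def pba(b, a):  # hypergeometric P(B_b | A_a)
        if b > a or n - b > N - a:
            return 0.0
        return comb(a, b) * comb(N - a, n - b) / comb(N, n)

    pb = [sum(pa * pba(b, a) for a in range(N + 1)) for b in range(n + 1)]
    I = 0.0
    for a in range(N + 1):
        for b in range(n + 1):
            p = pa * pba(b, a)
            if p > 0.0:
                I += p * log2(pba(b, a) / pb[b])
    return I

I = mutual_info_uniform(100, 10)
print(I, log2(10 + 1))  # I is nonnegative and bounded by H(B) = log2(n+1)
```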

Don't forget the case $a=0$ ;]
Ah yes! I forgot both $a=0$ and $b=0$. Shame on me! I was also a little too quick to substitute the large-$N$ approximation for $P(B_b|A_a)$. I get the same expression you do after simplification. However, since I am assuming $N \gg n$ and $a \gg b$, I decided to use the other exact expression for $P(B_b|A_a)$, namely:

$$\frac{ \binom{a}{b}\binom{N-a}{n-b} }{\binom{N}{n}}$$

to get a similar expression for $I(A,B)$:

$$I(A,B) = \log_2(n+1) - \log_2\binom{N}{n} + \frac{1}{N+1}\sum_{b=0}^n \sum_{a=0}^N \frac{\binom{a}{b}\binom{N-a}{n-b}}{\binom{N}{n}}\log_2\left[\binom{a}{b}\binom{N-a}{n-b}\right]$$

I did this so I could apply the Stirling approximation to the binomial coefficients where the top is much larger than the bottom:

$$\binom{P}{Q} \approx \frac{P^{Q}}{Q!} \quad \text{when } P \gg Q$$

I applied this approximation in the argument of the log function in the double sum, and after much simplification I got

$$I(A,B) = \log_2(n+1) + \frac{1}{n+1}\sum_{b=0}^{n} \log_2\binom{n}{b} + \frac{2n}{N+1}\sum_{a=0}^{N} \frac{a}{N}\log_2\frac{a}{N}$$

At least this is nonnegative. I'm still trying to figure out what the significance of it is, other than that for very large N the mutual information apparently depends on n only.

thanks for the help!

Sounds good. Another approximation that might be useful here is:

$$\log\binom{n}{k} \leq k\left(1 + \log\frac{n}{k}\right)$$

which, IIRC, is tight for $k \ll n$.
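That bound is the logarithm of the standard estimate $\binom{n}{k} \leq (en/k)^k$, with natural logs; a quick stdlib spot-check (editor's sketch):

```python
# Sketch (editor's check): verify log C(n,k) <= k*(1 + log(n/k)) with
# natural logarithms, i.e. the bound C(n,k) <= (e*n/k)^k, over a grid.
from math import comb, log

for n in (10, 50, 200):
    for k in range(1, n + 1):
        assert log(comb(n, k)) <= k * (1 + log(n / k)) + 1e-9
print("bound holds on the tested grid")
```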

That's a good suggestion. With that approximation I get

$$I(A,B) \leq \log_2(n+1) - \log_2\binom{N}{n} + n + \frac{2n}{N(N+1)} \sum_{a=0}^{N} a\log_2 a - \frac{2}{n+1}\sum_{b=0}^{n} b\log_2 b$$

Not only should I(A,B) be nonnegative, but it should also be less than or equal to log_2 (n+1), the uncertainty H(B). Just eyeballing it, the above looks like it meets that requirement, but I haven't proved it yet. Still, I think this approximation is easier to work with than the first one I used. I'll think about it some more.

There's probably a good bound for those xlogx summations, but I can't recall one offhand.

However, I think there's a different approach to this problem that might be closer to what the author had in mind. Instead of using the principle of insufficient reason to assume a uniform distribution on $A$ (and so on $B$ as well), let's make a different assumption based on the idea that this is an assembly line in a factory. Specifically, let's model the widget assembly process as an i.i.d. sequence of Bernoulli trials, with some failure probability $\gamma$. Then, $A$ has a binomial distribution:

$$P(A_a) = \bold{B}(a;N,\gamma) = \binom{N}{a} \gamma^a(1-\gamma)^{N-a}$$

where I'm using a bold $\bold{B}$ for the Binomial distribution, versus a regular $B$ for the random variable. Using the same expression for $P(B_b|A_a)$ as before, it turns out that $B$ is distributed as $\bold{B}(b;n,\gamma)$ and, more interestingly, $P(A_a|B_b) = \bold{B}(a-b;N-n,\gamma)$.
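Both of those distributional claims are easy to confirm numerically; the following sketch (editor's check, with arbitrary parameter values) tests the marginal of $B$ and the posterior of $A$ against the stated binomial forms:

```python
# Sketch (editor's check): under P(A_a) = Binom(a; N, g) with a
# hypergeometric sampling channel, verify that B ~ Binom(n, g) and that
# P(A_a | B_b) = Binom(a - b; N - n, g).
from math import comb

def binom_pmf(k, m, g):
    return comb(m, k) * g**k * (1 - g)**(m - k) if 0 <= k <= m else 0.0

def hyper(b, a, N, n):
    if b > a or n - b > N - a:
        return 0.0
    return comb(a, b) * comb(N - a, n - b) / comb(N, n)

N, n, g = 30, 7, 0.2
# Marginal: sum_a Binom(a; N, g) * Hyper(b | a) == Binom(b; n, g)
for b in range(n + 1):
    marg = sum(binom_pmf(a, N, g) * hyper(b, a, N, n) for a in range(N + 1))
    assert abs(marg - binom_pmf(b, n, g)) < 1e-10
# Posterior: Bayes' rule gives Binom(a - b; N - n, g)
b = 2
for a in range(N + 1):
    post = binom_pmf(a, N, g) * hyper(b, a, N, n) / binom_pmf(b, n, g)
    assert abs(post - binom_pmf(a - b, N - n, g)) < 1e-10
print("binomial marginal and posterior confirmed")
```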

Recall this expression for mutual information:

$$\mathcal{I}\left(A;B\right) = \mathcal{H}(A) - \mathcal{H}(A|B)$$

where

$$\mathcal{H}(A|B) = -\sum_b P(B_b) \sum_a P(A_a|B_b) \log_2 P(A_a|B_b)$$

That is, we're expressing the mutual information as the average number of bits required to describe the number of bad widgets in the entire lot before doing the sample, minus the number required after the sample has been completed. Since $b$ only appears in $P(A_a|B_b)$ as an offset, the inner summation in the expression for $\mathcal{H}(A|B)$ does not depend on $b$ (which is a consequence of the i.i.d. model of the widget assembly process), and so we get:

$$\mathcal{I}\left(A;B\right) = \mathcal{H}\left(\bold{B}(N,\gamma)\right) - \mathcal{H}\left(\bold{B}(N-n,\gamma)\right)$$

where $\mathcal{H}\left( \bold{B}(N,\gamma) \right)$ denotes the entropy of a Binomial-distributed random variable with parameters $N$ and $\gamma$. At this point, without even worrying about obtaining a particular expression for the binomial entropy, we can get some intuition about the mutual information. That is, the first term is the number of bits required to describe the failures in a lot of $N$ widgets, while the second is the number required to describe the failures in $N-n$ widgets, where both lots are Binomial with the same failure rate. So, in a sense, the mutual information consists of a reduction in the size of the untested lot of widgets.
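The identity can also be verified numerically end to end; this sketch (editor's check, arbitrary parameters) compares the direct double-sum mutual information to the difference of binomial entropies:

```python
# Sketch (editor's check): I(A;B) computed from the joint distribution
# should equal H(Binom(N, g)) - H(Binom(N - n, g)) exactly.
from math import comb, log2

def binom_pmf(k, m, g):
    return comb(m, k) * g**k * (1 - g)**(m - k)

def binom_entropy(m, g):
    """Entropy in bits of a Binomial(m, g) random variable."""
    return -sum(p * log2(p)
                for k in range(m + 1) if (p := binom_pmf(k, m, g)) > 0)

def mi_direct(N, n, g):
    def hyper(b, a):
        if b > a or n - b > N - a:
            return 0.0
        return comb(a, b) * comb(N - a, n - b) / comb(N, n)

    pb = [binom_pmf(b, n, g) for b in range(n + 1)]  # marginal of B
    I = 0.0
    for a in range(N + 1):
        pa = binom_pmf(a, N, g)
        for b in range(n + 1):
            h = hyper(b, a)
            if pa > 0 and h > 0:
                I += pa * h * log2(h / pb[b])
    return I

N, n, g = 25, 6, 0.3
print(mi_direct(N, n, g))
print(binom_entropy(N, g) - binom_entropy(N - n, g))  # same value
```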

While I'm not aware of an exact expression for the entropy of the binomial distribution, there is a very popular approximation for large $N$ and fixed $\gamma$:

$$\bold{B}(a;N,\gamma) \approx \bold{N}(a;N\gamma,N\gamma(1-\gamma))$$

where $\bold{N}(a;\mu,\sigma^2)$ is the Normal distribution with mean $\mu$ and variance $\sigma^2$. The differential entropy, in nats, of the Normal distribution is given by $\frac{1}{2}\ln\left( 2\pi e \sigma^2 \right)$. Plugging this approximation into the previous expression gives a very tidy result:

$$\mathcal{I}(A;B) \approx \frac{1}{2}\ln\frac{N}{N-n}$$

Note that this expression does not depend on $\gamma$, so we avoid the issue of having introduced a spurious parameter not mentioned in the original problem. Also, that answer is expressed in nats, not bits.

There's probably a good bound for those xlogx summations, but I can't recall one offhand.

However, I think there's a different approach to this problem that might be closer to what the author had in mind. Instead of using the principle of insufficient reason to assume a uniform distribution on $A$ (and so on $B$ as well), let's make a different assumption based on the idea that this is an assembly line in a factory. Specifically, let's model the widget assembly process as an i.i.d. sequence of Bernoulli trials, with some failure probability $\gamma$. Then, $A$ has a binomial distribution:
That is a much more realistic model than the uniform distribution. It fits in with the usual communications idea of a symbol source and channel that has a fixed error rate.

Using the same expression for $P(B_b|A_a)$ as before, it turns out that $B$ is distributed as $\bold{B}(b;n,\gamma)$ and, more interestingly, $P(A_a|B_b) = \bold{B}(a-b;N-n,\gamma)$.
That last part was a total surprise to me. Did not expect everything to cancel out so neatly. Cool.

Recall this expression for mutual information:

$$\mathcal{I}\left(A;B\right) = \mathcal{H}(A) - \mathcal{H}(A|B)$$

where

$$\mathcal{H}(A|B) = -\sum_b P(B_b) \sum_a P(A_a|B_b) \log_2 P(A_a|B_b)$$

That is, we're expressing the mutual information as the average number of bits required to describe the number of bad widgets in the entire lot before doing the sample, minus the number required after the sample has been completed. Since $b$ only appears in $P(A_a|B_b)$ as an offset, the inner summation in the expression for $\mathcal{H}(A|B)$ does not depend on $b$ (which is a consequence of the i.i.d. model of the widget assembly process)....
That one also came as a surprise. I see how it comes out that way, in that all occurrences of $b$ are of the form $a-b$ and the sum goes from $a-b = 0$ to $a-b = N-n$, so that you can just replace $a-b$ with a different index. But it still seems surprising that

H(A| B=b) does not depend on b.

and so we get:

$$\mathcal{I}\left(A;B\right) = \mathcal{H}\left(\bold{B}(N,\gamma)\right) - \mathcal{H}\left(\bold{B}(N-n,\gamma)\right)$$
That is a very nice intuitive result! I also like the fact that so far no assumptions have been made about the relative sizes of N, n, a, and b.

The differential entropy, in nats, of the Normal distribution is given by $\frac{1}{2}\ln\left( 2\pi e \sigma^2 \right)$. Plugging this approximation into the previous expression gives a very tidy result:

$$\mathcal{I}(A;B) \approx \frac{1}{2}\ln\frac{N}{N-n}$$
Yes, this must be what the author had in mind. I haven't learned about differential entropy yet, but I'm assuming it extends discrete entropy to the continuous by replacing the sum with an integral. To convert to bits in this expression, you would just replace the 'ln' with 'log_2', right? It makes sense in the N >> n case: you would only gain a tiny fraction of a bit, proportional to n/N, by taking the sample. On the other hand, if you sampled half of the N widgets, you would gain a half bit of information. Somehow that doesn't seem like a whole lot, considering the uncertainty of A is of order log(N). Does this mean widget makers should quality test almost all their widgets?
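For what it's worth, a quick comparison (editor's sketch, arbitrary parameter choices) of the exact binomial-entropy difference against $\frac{1}{2}\ln\frac{N}{N-n}$ converted to bits shows how the approximation behaves as $n$ grows toward $N$:

```python
# Sketch (editor's check): exact H(Binom(N,g)) - H(Binom(N-n,g)) in bits
# versus the Gaussian approximation (1/2) ln(N/(N-n)) converted to bits.
from math import comb, log, log2

def binom_entropy_bits(m, g):
    """Entropy in bits of Binomial(m, g), from the exact pmf."""
    H = 0.0
    for k in range(m + 1):
        p = comb(m, k) * g**k * (1 - g)**(m - k)
        if p > 0.0:
            H -= p * log2(p)
    return H

g = 0.3
for N, n in [(1000, 10), (1000, 500), (1000, 990)]:
    exact = binom_entropy_bits(N, g) - binom_entropy_bits(N - n, g)
    approx = 0.5 * log(N / (N - n)) / log(2)  # nats -> bits via /ln 2
    print(N, n, round(exact, 4), round(approx, 4))
```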

That is a much more realistic model than the uniform distribution. It fits in with the usual communications idea of a symbol source and channel that has a fixed error rate.
Yeah, in engineering-related fields like information theory, the phrase "a factory has an assembly line producing widgets" is basically a codeword for "Bernoulli process."

That last part was a total surprise to me. Did not expect everything to cancel out so neatly. Cool.
Yeah, it was partly the expression for $P(B_b|A_a)$ that motivated me to use binomial marginal distributions.

That one also came as a surprise. I see how it comes out that way, in that all occurrences of $b$ are of the form $a-b$ and the sum goes from $a-b = 0$ to $a-b = N-n$, so that you can just replace $a-b$ with a different index. But it still seems surprising that

H(A| B=b) does not depend on b.
Right, intuitively one expects the former to depend on the latter. Certainly, for more complex models of the widget assembly process, that will be the case (in fact, it is the case for the previous assumptions, with uniform marginals, right?). So this set of assumptions is a special case: since we've assumed that each widget is assembled independently of the others, testing some subset of them doesn't let you infer much about the untested ones, only reduce their number. That everything works out so nicely for the binomial case seems, to me, to be a strong indicator that this is the "right" answer.

Yes, this must be what the author had in mind. I haven't learned about differential entropy yet, but I'm assuming it extends discrete entropy to the continuous by replacing the sum with an integral. To convert to bits in this expression, you would just replace the 'ln' with 'log_2', right?
Yeah, that's the idea. The interpretation of differential entropy is a little different, in that it would always take an infinite number of bits to describe the value of a continuous random variable exactly. So, the differential entropy tells you the number of bits required to specify the r.v. to unit resolution.

It makes sense in the N >> n case: you would only gain a tiny fraction of a bit, proportional to n/N, by taking the sample.
And specifically, for n=0, you get exactly 0 bits of information.

On the other hand, if you sampled half of the N widgets, you would gain a half bit of information. Somehow that doesn't seem like a whole lot, considering the uncertainty of A is of order log(N). Does this mean widget makers should quality test almost all their widgets?
One place that the differential entropy approximation shows its faults is for $n=N$, in which case it blows up. This is not such a big deal for an actual differential entropy, as it's possible to have infinite mutual information between two continuous r.v.'s, each of which has finite differential entropy (more generally, the differential mutual information is not required to be less than the differential entropy of each variable).

As for interpretations, to properly design a widget test process, you'd want to use a more conventional statistical approach, where you specify some tolerance on the number of total failures, and then work out how many you need to test to fulfill that tolerance with, say, 95% confidence. It's not immediately clear what the value of X bits of mutual information between the test and total sets would be, unless you intend to build a Huffman code for the number of failures or something like that. I think the idea is more to have you work through an example that's common to other areas of prob/stat, and so check some of your intuitions about mutual information.

Awesome. You have been a huge help. Thanks again!