# Maximizing the entropy of a constrained distribution

1. Feb 18, 2014

### stlukits

It is well-known that with known marginal probabilities a_{i} and
b_{j} the joint probability distribution maximizing the entropy

$$H(P)=-\sum_{i=1}^{m}\sum_{j=1}^{n}p_{ij}\log{}p_{ij}$$

is $$p_{ij}=a_{i}b_{j}$$

For m=3 and n=3, a=(0.2,0.3,0.5), b=(0.1,0.6,0.3), for example,

$$\label{eq:r1} P=\left( \begin{array}{rrr} 0.02 & 0.12 & 0.06 \\ 0.03 & 0.18 & 0.09 \\ 0.05 & 0.30 & 0.15 \end{array} \right)$$

Now, I have a problem where the joint probability distribution is
constrained (much like a random walk where the path from one node to
another is blocked). For example, let m, n, a, and b be as above with
the constraint that

$$\label{eq:r2} P=\left( \begin{array}{rrr} p_{11} & 0 & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \\ \end{array} \right)$$

Because a and b are known, it suffices to find out any 2x2 matrix
contained in the 3x3 matrix, for example (x,y,z the variables by which
w_{1}, w_{2}, v_{1}, v_{2}, and sigma are expressible, given a and b)

$$\label{eq:r3} P=\left( \begin{array}{rrr} x & 0 & w_{1} \\ y & z & w_{2} \\ v_{1} & v_{2} & \sigma \\ \end{array} \right)$$

I use this to write out the entropy and differentiate with respect to
x, y, and z to find out that the maximum will be where

$$\label{eq:r4} \det\left( \begin{array}{ll} x & w_{1} \\ v_{1} & \sigma \end{array} \right)=0$$

$$\label{eq:r5} \det\left( \begin{array}{ll} y & w_{2} \\ v_{1} & \sigma \end{array} \right)=0$$

$$\label{eq:r6} \det\left( \begin{array}{ll} z & w_{2} \\ v_{2} & \sigma \end{array} \right)=0$$

This is a system of 3 non-linear equations which are awkward to solve
algebraically. In the end, I am interested to know which property a
and b need to have to ensure that P is a proper probability
distribution (i.e. no negative elements). For now, however, I'd be
thrilled if anybody could give me a hint how I could find the
solutions for x, y, and z algebraically without these non-linear
equations.

The numeric solution for this is

$$\label{eq:r7} P=\left( \begin{array}{rrr} 0.093 & 0.000 & 0.107 \\ 0.001 & 0.113 & 0.186 \\ 0.006 & 0.487 & 0.007 \end{array} \right)$$

But I have definitely seen systems where all solutions were complex
and some of the probabilities ended up <0, especially when m and n are
larger than 3 and there are more than one zero constraint in the joint
probability matrix.

Last edited: Feb 18, 2014
2. Feb 19, 2014

### Stephen Tashi

I'm curious how the matrix in the numerical example can be interpreted as describing a random walk between nodes. I think of such a walk being described by the transition matrix for a markov chain rather than a matrix that is a joint probability density.

3. Feb 19, 2014

### stlukits

Yes, that's correct. I mentioned the transition matrix for a Markov
chain because in Cover and Thomas (exercise 4.16) there is a really
interesting observation about the eigenvalues of a matrix derived from
the transition matrix and the maximum entropy of the transition
matrix. What I am trying to do has nothing to do with random walks,
but Cover and Thomas's observation and the fact that random walks also
have constraints such as a broken path from a node, made it worth
mentioning.

Here is what I need to solve. I have a prior joint probability
distribution

$$\label{eq:r1} \left( \begin{array}{rrrr} p_{11} & p_{12} & 0 & 0 \\ p_{21} & p_{22} & 0 & 0 \\ 0 & p_{32} & 0 & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \\ p_{51} & p_{52} & p_{53} & p_{54} \end{array} \right)$$

The marginals are
$(\alpha_{1},\alpha_{2},\alpha_{3},\alpha_{4},1-\sum\alpha_{i})^{T}$
and $(0.2,0.3,0.4,0.1)$; and a posterior joint probability
matrix

$$\label{eq:r2} \left( \begin{array}{rrrr} q_{11} & q_{12} & 0 & 0 \\ q_{21} & q_{22} & 0 & 0 \\ 0 & q_{32} & 0 & q_{34} \\ q_{41} & q_{42} & q_{43} & q_{44} \\ 0 & 0 & 0 & 0 \end{array} \right)$$

where the marginals are $(.40,.30,.20,.10,0)^{T}$ and
$(\beta_{1},\beta_{2},\beta_{3},1-\sum\beta_{i})$. I want
to populate these joint probability distributions using the principle
of maximum entropy and find out
$(\beta_{1},\beta_{2},\beta_{3})$.

4. Feb 21, 2014

### Stephen Tashi

This is just a random thought:

A naive computer programmer who was given some marginal densities a[], b[] and some constraints like p[1][2] = 0 might implement sampling from a distribution from those properties by the straightforward algorithm:

Pick i from the distribution defined by a[], pick j from the distribution defined by b[j]. If p[j] is is not constrained to be zero, return the event (i,j). If p[j] is zero, repeat the process.

The problem with that implementation is that the marginals of the realized distribution are not a[] and b[]. However, it makes me wonder if there are "doctored" marginal distributions a'[] and b'[] that the programmer could use in order to attain a[] and b[] with the above algorithm.

For example, if the single constraint is p[r] = 0 then I think the doctored distributions would have to satisfy

a[r] = a'[r](1 - b')/ (1 - a'[r]b')
a = a'/ ( 1 - a'[r]b['s]) , for i not equal to r

and similar equations for b'[].

5. Feb 21, 2014

### stlukits

What do you mean by a'[r]? In my original posting (before the replies), I was looking for m=3 and n=3 (3x3 probability matrix), a=(0.2,0.3,0.5), b=(0.1,0.6,0.3), and

$$P=\left( \begin{array}{rrr} p_{11} & 0 & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \\ \end{array} \right)$$

What would your doctored distributions yield (and how?)?

6. Feb 21, 2014

### Stephen Tashi

The constraint in the example is at p[1][2]. So r = 1 , s = 2.

Suppose we use the marginal distributions a'[] ,b'[] to randomly select i according to a'[] and randomly select j according to b'[] until an (i,j) is selected other than (1,2).

The probability of selecting something in the first row is the sum of the probabilities from selecting it on the 1st, 2nd, 3rd,.... try.

The probability of selecting something from row 1 on the 1st try is a'[1]( 1 - b'[2])

The probability of selecting something from row 1 on the 2nd try = the probability of failing to select anything on the 1st try and selecting something from row 1 on the second try
= a'[1] b'[2] a'[1](1 - b'[2])

Continuing in this manner, the probability of selecting something from the first row is given by a geometric series in powers of a'[1]b'[2]. It sums to a'[1](1 - b'[2])( 1 / ( 1 - a'[1]b'[2]).

If we want the selection process to effectively give a probability of a[1] for picking something from the first row we must have a[1] = a'[1](1 - b'[2])( 1 / ( 1 - a'[1]b'[2]).

I'm suggesting setting up similar equations and trying to solve for the doctored distributions a'[] and b'[] from the given distributions a[] and b[].

I don't know if the doctored distributions would yield the joint maximum entropy distribution but intuitively, they yield a joint distribution with the desired marginal distributions that results from a process where (i,j) is selected by selecting i and j independently..

7. Feb 22, 2014

### stlukits

I see what you mean. You would end up with a joint probability matrix which contains 0s in the right places and p[j]'s that are suggestive of the desired distribution, but the marginal probabilities would be off. Unfortunately, I think, you'd need (again) some kind of maximum entropy procedure to "doctor" things so that the marginals fall into place, but that introduces the above equations ... I will continue working on this. Thanks for your help.

PS.: There is very little literature on this. There is an obscure article by a Czech physicist (Majernik, 2000) about finding the marginal distributions when you are in possession of the conditional probabilities. Here we have the opposite problem: we are trying to find the conditional probabilities (which immediately follow from the joint probabilities and the marginals) given the marginal probabilities and some, but not all, conditional probabilities, using MAXENT. An American mathematician, Carl Wagner, wrote a piece on this in 1992 ("Generalized Probability Kinematics"), but I am almost certain that he made a mistake, and I am trying to fix it (unsuccessfully, so far).

Last edited: Feb 22, 2014
8. Feb 22, 2014

### Stephen Tashi

In the example, the doctored marginals are determined by equations with only two unknowns, a'[1] and b'[2]. Once these are determined, the other values of the doctored marginals are determined. The values of the p[j] are functions of the doctored marginals.

9. Feb 22, 2014

### stlukits

thx -- will update when I have something ...