Maximizing the entropy of a constrained distribution

In summary, the principle of maximum entropy can be used to find the joint probability distribution that maximizes entropy given known marginal probabilities a_{i} and b_{j}. This distribution is p_{ij}=a_{i}b_{j}. However, if there are constraints on the joint probability matrix, such as a blocked path in a random walk, then additional steps are needed to find the maximum entropy distribution. These steps involve solving a system of non-linear equations and can be difficult to do algebraically. The process can also be applied to other scenarios, such as finding the posterior joint probability distribution using the principle of maximum entropy.
  • #1
noowutah
57
3
It is well-known that with known marginal probabilities a_{i} and
b_{j} the joint probability distribution maximizing the entropy

[tex]H(P)=-\sum_{i=1}^{m}\sum_{j=1}^{n}p_{ij}\log{}p_{ij}[/tex]

is [tex]p_{ij}=a_{i}b_{j}[/tex]

For m=3 and n=3, a=(0.2,0.3,0.5), b=(0.1,0.6,0.3), for example,

[tex]\begin{equation}
\label{eq:r1}
P=\left(
\begin{array}{rrr}
0.02 & 0.12 & 0.06 \\
0.03 & 0.18 & 0.09 \\
0.05 & 0.30 & 0.15
\end{array}
\right)
\end{equation}[/tex]

Now, I have a problem where the joint probability distribution is
constrained (much like a random walk where the path from one node to
another is blocked). For example, let m, n, a, and b be as above with
the constraint that

[tex]\begin{equation}
\label{eq:r2}
P=\left(
\begin{array}{rrr}
p_{11} & 0 & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33} \\
\end{array}
\right)
\end{equation}[/tex]

Because a and b are known, it suffices to find out any 2x2 matrix
contained in the 3x3 matrix, for example (x,y,z the variables by which
w_{1}, w_{2}, v_{1}, v_{2}, and sigma are expressible, given a and b)

[tex]\begin{equation}
\label{eq:r3}
P=\left(
\begin{array}{rrr}
x & 0 & w_{1} \\
y & z & w_{2} \\
v_{1} & v_{2} & \sigma \\
\end{array}
\right)
\end{equation}[/tex]

I use this to write out the entropy and differentiate with respect to
x, y, and z to find out that the maximum will be where

[tex]\begin{equation}
\label{eq:r4}
\det\left(
\begin{array}{ll}
x & w_{1} \\
v_{1} & \sigma
\end{array}
\right)=0
\end{equation}[/tex]

[tex]\begin{equation}
\label{eq:r5}
\det\left(
\begin{array}{ll}
y & w_{2} \\
v_{1} & \sigma
\end{array}
\right)=0
\end{equation}[/tex]

[tex]\begin{equation}
\label{eq:r6}
\det\left(
\begin{array}{ll}
z & w_{2} \\
v_{2} & \sigma
\end{array}
\right)=0
\end{equation}[/tex]

This is a system of 3 non-linear equations which are awkward to solve
algebraically. In the end, I am interested to know which property a
and b need to have to ensure that P is a proper probability
distribution (i.e. no negative elements). For now, however, I'd be
thrilled if anybody could give me a hint how I could find the
solutions for x, y, and z algebraically without these non-linear
equations.

The numeric solution for this is

[tex]\begin{equation}
\label{eq:r7}
P=\left(
\begin{array}{rrr}
0.093 & 0.000 & 0.107 \\
0.001 & 0.113 & 0.186 \\
0.006 & 0.487 & 0.007
\end{array}
\right)
\end{equation}[/tex]

But I have definitely seen systems where all solutions were complex
and some of the probabilities ended up <0, especially when m and n are
larger than 3 and there are more than one zero constraint in the joint
probability matrix.
 
Last edited by a moderator:
Physics news on Phys.org
  • #2
stlukits said:
Now, I have a problem where the joint probability distribution is
constrained (much like a random walk where the path from one node to
another is blocked). For example, let m, n, a, and b be as above with
the constraint that

[tex]\begin{equation}
\label{eq:r2}
P=\left(
\begin{array}{rrr}
p_{11} & 0 & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33} \\
\end{array}
\right)
\end{equation}[/tex]

I'm curious how the matrix in the numerical example can be interpreted as describing a random walk between nodes. I think of such a walk being described by the transition matrix for a markov chain rather than a matrix that is a joint probability density.
 
  • #3
Yes, that's correct. I mentioned the transition matrix for a Markov
chain because in Cover and Thomas (exercise 4.16) there is a really
interesting observation about the eigenvalues of a matrix derived from
the transition matrix and the maximum entropy of the transition
matrix. What I am trying to do has nothing to do with random walks,
but Cover and Thomas's observation and the fact that random walks also
have constraints such as a broken path from a node, made it worth
mentioning.

Here is what I need to solve. I have a prior joint probability
distribution

[tex]
\begin{equation}
\label{eq:r1}
\left(
\begin{array}{rrrr}
p_{11} & p_{12} & 0 & 0 \\
p_{21} & p_{22} & 0 & 0 \\
0 & p_{32} & 0 & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44} \\
p_{51} & p_{52} & p_{53} & p_{54}
\end{array}
\right)
\end{equation}
[/tex]

The marginals are
[itex](\alpha_{1},\alpha_{2},\alpha_{3},\alpha_{4},1-\sum\alpha_{i})^{T}[/itex]
and [itex](0.2,0.3,0.4,0.1)[/itex]; and a posterior joint probability
matrix

[tex]
\begin{equation}
\label{eq:r2}
\left(
\begin{array}{rrrr}
q_{11} & q_{12} & 0 & 0 \\
q_{21} & q_{22} & 0 & 0 \\
0 & q_{32} & 0 & q_{34} \\
q_{41} & q_{42} & q_{43} & q_{44} \\
0 & 0 & 0 & 0
\end{array}
\right)
\end{equation}
[/tex]

where the marginals are [itex](.40,.30,.20,.10,0)^{T}[/itex] and
[itex](\beta_{1},\beta_{2},\beta_{3},1-\sum\beta_{i})[/itex]. I want
to populate these joint probability distributions using the principle
of maximum entropy and find out
[itex](\beta_{1},\beta_{2},\beta_{3})[/itex].
 
  • #4
This is just a random thought:

A naive computer programmer who was given some marginal densities a[], b[] and some constraints like p[1][2] = 0 might implement sampling from a distribution from those properties by the straightforward algorithm:

Pick i from the distribution defined by a[], pick j from the distribution defined by b[j]. If p[j] is is not constrained to be zero, return the event (i,j). If p[j] is zero, repeat the process.

The problem with that implementation is that the marginals of the realized distribution are not a[] and b[]. However, it makes me wonder if there are "doctored" marginal distributions a'[] and b'[] that the programmer could use in order to attain a[] and b[] with the above algorithm.

For example, if the single constraint is p[r] = 0 then I think the doctored distributions would have to satisfy

a[r] = a'[r](1 - b')/ (1 - a'[r]b')
a = a'/ ( 1 - a'[r]b['s]) , for i not equal to r

and similar equations for b'[].
 
  • Like
Likes 1 person
  • #5
What do you mean by a'[r]? In my original posting (before the replies), I was looking for m=3 and n=3 (3x3 probability matrix), a=(0.2,0.3,0.5), b=(0.1,0.6,0.3), and

[tex]\begin{equation}
P=\left(
\begin{array}{rrr}
p_{11} & 0 & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33} \\
\end{array}
\right)
\end{equation}
[/tex]

What would your doctored distributions yield (and how?)?
 
  • #6
stlukits said:
What do you mean by a'[r]? In my original posting (before the replies), I was looking for m=3 and n=3 (3x3 probability matrix), a=(0.2,0.3,0.5), b=(0.1,0.6,0.3), and

[tex]\begin{equation}
P=\left(
\begin{array}{rrr}
p_{11} & 0 & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33} \\
\end{array}
\right)
\end{equation}
[/tex]

The constraint in the example is at p[1][2]. So r = 1 , s = 2.

Suppose we use the marginal distributions a'[] ,b'[] to randomly select i according to a'[] and randomly select j according to b'[] until an (i,j) is selected other than (1,2).

The probability of selecting something in the first row is the sum of the probabilities from selecting it on the 1st, 2nd, 3rd,... try.

The probability of selecting something from row 1 on the 1st try is a'[1]( 1 - b'[2])

The probability of selecting something from row 1 on the 2nd try = the probability of failing to select anything on the 1st try and selecting something from row 1 on the second try
= a'[1] b'[2] a'[1](1 - b'[2])

Continuing in this manner, the probability of selecting something from the first row is given by a geometric series in powers of a'[1]b'[2]. It sums to a'[1](1 - b'[2])( 1 / ( 1 - a'[1]b'[2]).

If we want the selection process to effectively give a probability of a[1] for picking something from the first row we must have a[1] = a'[1](1 - b'[2])( 1 / ( 1 - a'[1]b'[2]).

I'm suggesting setting up similar equations and trying to solve for the doctored distributions a'[] and b'[] from the given distributions a[] and b[].

What would your doctored distributions yield (and how?)?

I don't know if the doctored distributions would yield the joint maximum entropy distribution but intuitively, they yield a joint distribution with the desired marginal distributions that results from a process where (i,j) is selected by selecting i and j independently..
 
  • #7
I see what you mean. You would end up with a joint probability matrix which contains 0s in the right places and p[j]'s that are suggestive of the desired distribution, but the marginal probabilities would be off. Unfortunately, I think, you'd need (again) some kind of maximum entropy procedure to "doctor" things so that the marginals fall into place, but that introduces the above equations ... I will continue working on this. Thanks for your help.

PS.: There is very little literature on this. There is an obscure article by a Czech physicist (Majernik, 2000) about finding the marginal distributions when you are in possession of the conditional probabilities. Here we have the opposite problem: we are trying to find the conditional probabilities (which immediately follow from the joint probabilities and the marginals) given the marginal probabilities and some, but not all, conditional probabilities, using MAXENT. An American mathematician, Carl Wagner, wrote a piece on this in 1992 ("Generalized Probability Kinematics"), but I am almost certain that he made a mistake, and I am trying to fix it (unsuccessfully, so far).
 
Last edited by a moderator:
  • #8
stlukits said:
but that introduces the above equations

In the example, the doctored marginals are determined by equations with only two unknowns, a'[1] and b'[2]. Once these are determined, the other values of the doctored marginals are determined. The values of the p[j] are functions of the doctored marginals.
 
  • #9
thx -- will update when I have something ...
 

FAQ: Maximizing the entropy of a constrained distribution

What is entropy?

Entropy is a measure of the randomness or disorder in a system. In the context of a distribution, entropy represents the uncertainty or unpredictability of the values that the distribution can take.

Why is maximizing entropy important?

Maximizing entropy is important because it helps us create a more unbiased and diverse distribution. This can be useful in various fields such as data compression, information theory, and machine learning.

What does it mean to constrain a distribution?

Constraining a distribution means putting limitations or restrictions on the possible values that the distribution can take. This can be done by setting boundaries or fixing some parameters of the distribution.

How is entropy maximized in a constrained distribution?

To maximize entropy in a constrained distribution, we use mathematical techniques such as Lagrange multipliers to find the distribution that has the highest entropy while satisfying the given constraints.

What are some real-world examples of maximizing entropy in a constrained distribution?

One example is in the field of image processing, where image compression algorithms use entropy maximization to reduce the file size while preserving the visual quality. Another example is in finance, where portfolio optimization techniques use entropy maximization to create a diverse and risk-averse investment portfolio.

Similar threads

Replies
2
Views
1K
Replies
28
Views
2K
Replies
5
Views
3K
Replies
20
Views
4K
Replies
4
Views
1K
Back
Top