# Solving for P_x(X) given P_y(Y) and P_(x+y)(X+Y)

I am considering two discrete random variables, X and Y, with values in {0,1,...,9}.
From my data, I can obtain an estimate for the distribution of Y (the function $P_{Y}(y) = P(Y=y)$ for y in {0,1,...,9}).
I can also obtain experimentally the distribution of the sum, $P_{X+Y}(s) = P(X+Y = s)$ for s in {0, 1, ..., 9}, where the sum is taken modulo 10.

From that, I want to solve for the distribution function of X ($P_{X}(x) =$ the probability that X = x, for x in {0,1,...,9}).

I furthermore assume that X and Y are independent variables, so that their joint probability is the product of their individual probabilities.

From probability theory, I know I can write $P_{X+Y}(s) = \sum_{x = 0}^{9}P_{Y}((s-x) \mod 10) P_{X}(x)$
Since I want to solve for $P_{X}(x)$, I need to solve a 10x10 system of equations whose coefficient matrix is given by the values $P_{Y}((s-x) \mod 10)$ and whose right-hand side is the vector of values $P_{X+Y}(s)$ for s in {0,...,9}.
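The linear system above can be sketched in a few lines of NumPy. The distributions here are random placeholders standing in for the experimental estimates; the coefficient matrix is the circulant matrix built from $P_Y$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical distributions standing in for the experimental estimates.
p_y = rng.dirichlet(np.ones(10))        # estimate of P_Y
p_x_true = rng.dirichlet(np.ones(10))   # the unknown P_X (for demonstration only)

# Coefficient matrix: A[s, x] = P_Y((s - x) mod 10), so that A @ p_x = p_{x+y}.
A = np.array([[p_y[(s - x) % 10] for x in range(10)] for s in range(10)])
p_s = A @ p_x_true                      # plays the role of the measured P_{X+Y}

# Solve the 10x10 system for P_X.
p_x_solved = np.linalg.solve(A, p_s)
```

With exact data this recovers $P_X$; with noisy estimates of $P_Y$ and $P_{X+Y}$, the same solve can (as described below) return negative entries.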

Unfortunately, perhaps due to the estimates not being perfectly accurate, the solution of this system of equations (done numerically) gives me negative values for some of the probabilities in $P_X(x)$.

So is there another way of obtaining an estimate for the values of $P_X(x)$?

Thank you.

Stephen Tashi
One of my professors said "All problems can be viewed as optimization problems".

You can view the problem to be something like curve fitting - find the values of the unknowns that make predictions that closely fit the observed data.

Treat both $P_X(i)$ and $P_Y(i)$ as unknowns, for all $i$. Define a function of those variables that predicts some aspects of the observed data. Define a function that measures the "error" between the predicted aspects and the observed aspects. Treat the problem as finding values of the unknowns that minimize the "error" of the prediction, subject to constraints on the variables that make them legitimate probabilities (e.g. $0 \le P_X(i) \le 1$ , $\sum_{i=0}^9 P_X(i) = 1$ , $0 \le P_Y(i) \le 1$ , $\sum_{i=0}^9 P_Y(i) = 1$ ).

For example, look at the definition of the $\chi^2$ statistic. http://en.wikipedia.org/wiki/Pearson's_chi-squared_test Forget about $\chi^2$ being used in a statistical test and just consider it as a function that measures the error between predicted values (the expected values) and the observed values. In your data you have two types of "bins", Y-bins and X+Y bins.

In the "Y-bins" you have counts of how many times particular Y-values were observed. For example if you ran 1000 experiments and observed Y = 3 happened 88 times, the expected value $E_3 = (1000)P_Y(3)$ and the observed value $O_3 = 88$.

In the "X+Y" bins, you may have run a different number of experiments, let's say it was 2000. Then, for example, the expected number in the bin for X+Y=3 is $E_3 = 2000\left( P_X(0)P_Y(3) + P_X(1)P_Y(2) + P_X(2)P_Y(1) + P_X(3)P_Y(0) + P_X(4)P_Y(9) + \cdots + P_X(9)P_Y(4)\right)$ (corresponding to 2000 multiplied by the expression that appears in one of your ten equations).

The formula for $\chi^2$ defines a function in the unknowns $P_X(i), P_Y(i)$. Values of the unknowns that make this function small produce a better fit to the observed data. Solve for the values of the unknowns that minimize $\chi^2$, subject to the given constraints.
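A minimal sketch of this constrained $\chi^2$ minimization, using SciPy's `minimize` (the observed counts are simulated here as placeholders for real data; a small `eps` guards the denominators, one of the concerns raised later in the thread):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic "true" distributions and observed counts (placeholders for real data).
px_true = rng.dirichlet(np.ones(10))
py_true = rng.dirichlet(np.ones(10))
n_y, n_s = 1000, 2000
obs_y = rng.multinomial(n_y, py_true)
ps_true = np.array([sum(px_true[x] * py_true[(s - x) % 10] for x in range(10))
                    for s in range(10)])
obs_s = rng.multinomial(n_s, ps_true)

def chi2(theta):
    # Unknowns packed as [P_X(0..9), P_Y(0..9)].
    px, py = theta[:10], theta[10:]
    e_y = n_y * py
    e_s = n_s * np.array([sum(px[x] * py[(s - x) % 10] for x in range(10))
                          for s in range(10)])
    eps = 1e-12  # guard against zero expected counts in the denominators
    return (np.sum((obs_y - e_y) ** 2 / (e_y + eps))
            + np.sum((obs_s - e_s) ** 2 / (e_s + eps)))

# Each block of 10 unknowns must be a probability distribution.
cons = [{"type": "eq", "fun": lambda t: t[:10].sum() - 1},
        {"type": "eq", "fun": lambda t: t[10:].sum() - 1}]
res = minimize(chi2, np.full(20, 0.1), bounds=[(0, 1)] * 20,
               constraints=cons, method="SLSQP")
px_hat, py_hat = res.x[:10], res.x[10:]
```

Unlike the raw linear solve, the bounds force every entry of `px_hat` to be a legitimate probability.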

---
There are different methods for doing statistical estimation and they may not give the same answers. Which method is best? That is a complicated question. Defining "best" involves technicalities and gets into the fact that estimation procedures that are best in one respect may not be best in another respect.

Hello,
I see, this seems like a good solution. So if I use the chi-squared function as my error measurement, I would compute $\sum_{y=0}^{9}\frac{(E_Y(y) - O_Y(y))^2}{E_Y(y)} + \sum_{s=0}^{9}\frac{(E_{X+Y}(s) - O_{X+Y}(s))^2}{E_{X+Y}(s)}$ where $E_{X+Y}(s)$ is replaced by the convolution sum in terms of $E_X$ and $E_Y$.
And my optimization would run on the space of values $E_X(x), E_Y(y)$ for x, y in {0, 1, ..., 9}, constrained by $\frac{1}{|X|}\sum_{x=0}^{9}E_X(x) = \frac{1}{|Y|}\sum_{y=0}^{9}E_Y(y) = 1$ (where $|X|$ and $|Y|$ denote the numbers of experiments in each data set) and by each expected count lying between 0 and the corresponding number of experiments.

Any suggestions on how to perform the optimization? The first thing that came to mind was the zero-gradient test with Lagrange multipliers, but I don't like these unknowns occurring in the denominators...

Stephen Tashi
If the form of the probability model for the data is perfect then when the expected value associated with a given bin is zero, no observations should appear in that bin. (Since the random variable involved is a non-negative count, it can't have a zero mean except by being zero with probability 1.) So you could define the function $\frac{ (O(s) - E(s))^2}{E(s)}$ to be zero when $E(s) = 0$.

If we acknowledge the probability model is only an imperfect approximation to how the data is generated, then we'll have to think about a different error function to minimize. (Perhaps, a better term than "error function" would be "penalty function".)

If I had to write a computer program this morning to do the problem, I wouldn't do anything that depended on finding a symbolic expression for the gradient of the penalty function or on Lagrange multipliers. Someone who was good at using computer symbolic algebra software might be able to do that. I would rely on approximating the gradient numerically. To overcome local minima, I'd pick a large number of initial points and use approximate gradient following to see where each of them leads.
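This multi-start strategy can be sketched as projected gradient descent with a finite-difference gradient. The `penalty` function here is a toy multimodal stand-in for the $\chi^2$ function, and the step size and iteration counts are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def penalty(p):
    # Toy multimodal penalty on the probability simplex (stands in for chi^2).
    return np.sum((p - 0.1) ** 2) + 0.05 * np.sin(20 * p[0])

def num_grad(f, p, h=1e-6):
    # Central finite-difference approximation of the gradient.
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

def project_simplex(v):
    # Euclidean projection onto {p >= 0, sum(p) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(v.size) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

best = None
for _ in range(20):                       # many random starts to escape local minima
    p = rng.dirichlet(np.ones(10))
    for _ in range(500):                  # projected gradient descent
        p = project_simplex(p - 0.05 * num_grad(penalty, p))
    if best is None or penalty(p) < penalty(best):
        best = p
```

The projection step keeps every iterate a legitimate probability vector, so no constraint handling beyond it is needed.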

I don't know how much effort you want to spend on the problem. If it is an important problem then I suggest the following:

1) Define one or more utility functions that measure the utility (or dis-utility, if you prefer) of a wrong estimate. If the actual value of the unknowns is A and your estimate of the unknowns is B then find a function F(A,B) that measures the utility of the situation. ( F(A,B) may or may not be related to the function you minimize to make the estimate B. )

2) Do a Monte-Carlo simulation where you generate some random "true" values of A and apply your estimation method to obtain a B. Look at the distribution of F(A,B) as a random variable. That indicates how well your estimation method works "in general".
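Steps 1) and 2) can be sketched as follows. The estimator and the utility function are placeholders (empirical frequencies and total variation distance); in practice one would swap in the constrained $\chi^2$ fit and whatever $F(A,B)$ suits the application:

```python
import numpy as np

rng = np.random.default_rng(3)

def estimate(counts):
    # Placeholder estimator: empirical frequencies (swap in the chi^2 fit here).
    return counts / counts.sum()

def disutility(a, b):
    # Example F(A, B): total variation distance between true and estimated P_X.
    return 0.5 * np.abs(a - b).sum()

errors = []
for _ in range(200):                      # Monte Carlo trials
    a = rng.dirichlet(np.ones(10))        # random "true" distribution A
    counts = rng.multinomial(1000, a)     # simulated data from A
    errors.append(disutility(a, estimate(counts)))

errors = np.array(errors)
# Distribution of F(A, B): how well the estimation method works "in general".
summary = (errors.mean(), np.quantile(errors, 0.95))
```

Looking at the whole distribution of `errors` (not just its mean) shows whether the method is merely good on average or also reliable in the worst cases.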

Thank you for the advice. For the optimization, I found that a fast way is to use MATLAB: it has a function called "fmincon" which can solve nonlinear optimization problems with linear constraints (in this case, the requirements that the probabilities add up to 1 and lie between 0 and 1). As for the penalty/error function, I am not sure whether I should be minimizing the sum of the chi-squared values of $P_Y$ and $P_{X+Y}$, or something else.

I am not sure what you mean by the probability model, but to give you a concrete example of the data I am considering: Y would be the difference between two consecutive plaintext letters in some alphabet, and X would be the difference between two consecutive key values (this is in the context of Vigenère / running-key cryptograms). I am basically studying some properties of my plaintext and ciphertext distributions to try to infer something about the key distribution. (My variables X and Y would actually take values in a set as large as the alphabet used.) Note, the plaintext distribution does not come from the plaintext being enciphered, but from typical English-language data. I may not know the content of the encrypted data, except that it is in English and of reasonable length (thousands of letters).

Stephen Tashi
I mean the assumption that the realizations of $X$ and $Y$ are actually independent realizations from the probability distributions $P_X(i),\ P_Y(i)$. If that isn't exactly true then it might happen that the "best" fit of such a model to the data would have (for example) $P_{X+Y}( 4) = 0$ even though the actual data might have (say) one occurrence of $X + Y = 4$ in several thousand realizations.