Solving for P_X(x) given P_Y(y) and P_{X+Y}(s)

Boorglar
I am considering two discrete random variables, X and Y, with values in {0,1,...,9}.
From my data, I can obtain an estimate for the distribution of Y (the function P_{Y}(y) = P(Y=y) for y in {0,1,...,9}).
I can also obtain experimentally the distribution of the sum, P_{X+Y}(s) = P(X+Y = s) for s in {0, 1, ..., 9}, where the sum is taken modulo 10.

From that, I want to solve for the distribution of X (P_{X}(x) = probability that X = x, for x in {0,1,...,9}).

I furthermore assume that X and Y are independent variables, so that their joint probability is the product of their individual probabilities.

From probability theory, I know I can write P_{X+Y}(s) = \sum_{x = 0}^{9}P_{Y}((s-x) \mod 10) P_{X}(x)
Since I want to solve for P_{X}(x), I need to solve a 10x10 system of equations whose coefficient matrix has entries P_{Y}((s-x) \mod 10) and whose right-hand side is the vector of values P_{X+Y}(s) for s in {0,...,9}.
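
For concreteness, here is a minimal sketch of this direct linear solve in Python/NumPy; the random stand-in data and variable names are illustrative assumptions, not data from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)
p_y = rng.dirichlet(np.ones(10))   # stand-in for the estimated P_Y
p_xy = rng.dirichlet(np.ones(10))  # stand-in for the estimated P_{X+Y}

# Coefficient matrix A[s, x] = P_Y((s - x) mod 10), so that A @ p_x = p_xy.
A = np.array([[p_y[(s - x) % 10] for x in range(10)] for s in range(10)])

p_x = np.linalg.solve(A, p_xy)
print(p_x)  # with noisy estimates, some entries may come out negative
```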

Unfortunately, perhaps due to the estimates not being perfectly accurate, the solution of this system of equations (done numerically) gives me negative values for some of the probabilities in P_X(x).

So is there another way of obtaining an estimate for the values of P_X(x)?

Thank you.
 
One of my professors said "All problems can be viewed as optimization problems".

You can view the problem as something like curve fitting: find the values of the unknowns that make predictions which closely fit the observed data.

Treat both P_X(i) and P_Y(i) as unknowns, for all i. Define a function of those variables that predicts some aspects of the observed data. Define a function that measures the "error" between the predicted aspects and the observed aspects. Treat the problem as finding values of the unknowns that minimize the "error" of the prediction, subject to constraints on the variables that make them legitimate probabilities (e.g. 0 \le P_X(i) \le 1, \sum_{i=0}^9 P_X(i) = 1, 0 \le P_Y(i) \le 1, \sum_{i=0}^9 P_Y(i) = 1).

For example, look at the definition of the \chi^2 statistic. http://en.wikipedia.org/wiki/Pearson's_chi-squared_test Forget about \chi^2 being used in a statistical test and just consider it as a function that measures the error between predicted values (the expected values) and the observed values. In your data you have two types of "bins", Y-bins and X+Y bins.

In the "Y-bins" you have counts of how many times particular Y-values were observed. For example if you ran 1000 experiments and observed Y = 3 happened 88 times, the expected value E_3 = (1000)P_Y(3) and the observed value O_3 = 88.

In the "X+Y" bins, you may have run a different number of experiments, let's say it was 2000. Then, for example, the expected number in the bin for X+Y=3 is E_3 = 2000( P_X(0)P_Y(3) + P_X(1)P_Y(2) + P_X(2)P_Y(1) + ...P_X(9)P_Y(4) + ..) (corresponding to 2000 mutiplied by expression that appears on one of your ten equations.).

The formula for \chi^2 defines a function in the unknowns P_X(i), P_Y(i). Values of the unknowns that make this function small produce a better fit to the observed data. Solve for the values of the unknowns that minimize \chi^2, subject to the given constraints.
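
A minimal sketch of this constrained fit, using scipy.optimize.minimize (SLSQP) as one possible stand-in for a constrained optimizer; the function names, the simulated data, and the small eps regularizer are all illustrative assumptions rather than part of the suggestion above:

```python
import numpy as np
from scipy.optimize import minimize

def circ_conv(px, py):
    """P_{X+Y}(s) = sum_x P_X(x) * P_Y((s - x) mod 10)."""
    return np.array([sum(px[x] * py[(s - x) % 10] for x in range(10))
                     for s in range(10)])

def fit_chi2(obs_y, obs_s, x0=None, eps=1e-9):
    """Minimize the two-part chi-squared over (P_X, P_Y) jointly."""
    n_y, n_s = obs_y.sum(), obs_s.sum()   # numbers of trials in each data set

    def chi2(v):
        px, py = v[:10], v[10:]
        e_y = n_y * py                    # expected Y-bin counts
        e_s = n_s * circ_conv(px, py)     # expected (X+Y)-bin counts
        # eps guards against zero expected counts (see the discussion below)
        return (np.sum((obs_y - e_y) ** 2 / (e_y + eps)) +
                np.sum((obs_s - e_s) ** 2 / (e_s + eps)))

    cons = [{"type": "eq", "fun": lambda v: v[:10].sum() - 1.0},
            {"type": "eq", "fun": lambda v: v[10:].sum() - 1.0}]
    if x0 is None:
        x0 = np.full(20, 0.1)             # start at the uniform distribution
    res = minimize(chi2, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * 20, constraints=cons)
    return res.x[:10], res.x[10:], res

# Simulated example data (1000 Y-trials and 2000 X+Y-trials, as above):
rng = np.random.default_rng(1)
true_px, true_py = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
obs_y = rng.multinomial(1000, true_py)
obs_s = rng.multinomial(2000, circ_conv(true_px, true_py))
px_hat, py_hat, info = fit_chi2(obs_y, obs_s)
```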

---
There are different methods for doing statistical estimation and they may not give the same answers. Which method is best? That is a complicated question. Defining "best" involves technicalities and gets into the fact that estimation procedures that are best in one respect may not be best in another respect.
 
Hello,
I see, this seems like a good solution. So if I use the chi-squared function as my error measurement, I would compute \sum_{y=0}^{9}\frac{(E_Y(y) - O_Y(y))^2}{E_Y(y)} + \sum_{s=0}^{9}\frac{(E_{X+Y}(s) - O_{X+Y}(s))^2}{E_{X+Y}(s)} where E_{X+Y}(s) is replaced by the convolution sum in terms of E_X and E_Y.
And my optimization would run over the values E_X(x), E_Y(y) for x, y in {0, 1, ..., 9}, constrained by \frac{1}{|X|}\sum_{x=0}^{9}E_X(x) = \frac{1}{|Y|}\sum_{y=0}^{9}E_Y(y) = 1 (where |X| and |Y| denote the numbers of trials in each data set), with each expected count between 0 and its trial count (equivalently, each probability between 0 and 1).

Any suggestions on how to perform the optimization? The first thing that came to mind was using the zero-gradient test and Lagrange multipliers, but I don't like these unknowns appearing in the denominators...
 
If the form of the probability model for the data is perfect then when the expected value associated with a given bin is zero, no observations should appear in that bin. (Since the random variable involved is a non-negative count, it can't have a zero mean except by being zero with probability 1.) So you could define the function \frac{ (O(s) - E(s))^2}{E(s)} to be zero when E(s) = 0.
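
One way this convention might be implemented (the names here are illustrative):

```python
import numpy as np

def chi2_terms(obs, exp):
    """Per-bin chi-squared terms, defined as 0 wherever the expected
    count is 0 (the convention suggested above)."""
    obs, exp = np.asarray(obs, float), np.asarray(exp, float)
    out = np.zeros_like(exp)
    nz = exp > 0
    out[nz] = (obs[nz] - exp[nz]) ** 2 / exp[nz]
    return out  # caller sums these to get the full chi-squared
```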

If we acknowledge the probability model is only an imperfect approximation to how the data is generated, then we'll have to think about a different error function to minimize. (Perhaps, a better term than "error function" would be "penalty function".)

If I had to write a computer program this morning to do the problem, I wouldn't do anything that depended on finding a symbolic expression for the gradient of the penalty function or on Lagrange multipliers. Someone who was good at using computer symbolic algebra software might be able to do that. I would rely on approximating the gradient numerically. To overcome local minima, I'd pick a large number of initial points and use approximate gradient following to see where they lead.
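
A possible sketch of that multi-start strategy, reusing fit_chi2, obs_y, obs_s, and rng from the earlier sketch; SLSQP falls back to a finite-difference gradient when none is supplied, and the restart count here is an arbitrary choice:

```python
best = None
for _ in range(50):                        # number of random restarts
    x0 = np.concatenate([rng.dirichlet(np.ones(10)),
                         rng.dirichlet(np.ones(10))])
    px, py, res = fit_chi2(obs_y, obs_s, x0=x0)
    if res.success and (best is None or res.fun < best.fun):
        best, best_px, best_py = res, px, py
```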

I don't know how much effort you want to spend on the problem. If it is an important problem then I suggest the following:

1) Define one or more utility functions that measure the utility (or dis-utility, if you prefer) of a wrong estimate. If the actual value of the unknowns is A and your estimate of the unknowns is B then find a function F(A,B) that measures the utility of the situation. ( F(A,B) may or may not be related to the function you minimize to make the estimate B. )

2) Do a Monte-Carlo simulation where you generate some random "true" values of A and apply your estimation method to obtain a B. Look at the distribution of F(A,B) as a random variable. That indicates how well your estimation method works "in general".
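
A sketch of such a simulation, again reusing circ_conv and fit_chi2 from the earlier sketch; total variation distance is just one arbitrary choice for F(A,B), and the trial counts are illustrative:

```python
def total_variation(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

rng = np.random.default_rng(2)
errors = []
for _ in range(200):
    a_px = rng.dirichlet(np.ones(10))            # random "true" unknowns A
    a_py = rng.dirichlet(np.ones(10))
    oy = rng.multinomial(1000, a_py)             # simulated observations
    os_ = rng.multinomial(2000, circ_conv(a_px, a_py))
    b_px, _, _ = fit_chi2(oy, os_)               # the estimate B
    errors.append(total_variation(a_px, b_px))

print(np.mean(errors), np.percentile(errors, 95))
```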
 
Thank you for the advice. For the optimization, I found that a fast way would be to use MATLAB, which has a function called "fmincon" that can solve nonlinear optimization problems with linear constraints (in this case, the requirements that the probabilities add up to 1 and that they lie between 0 and 1). As for the penalty/error function, I am not sure whether I should be minimizing the sum of the chi-squared values of P_Y and P_{X+Y}, or something else.

I am not sure what you mean by the probability model, but to give you a concrete example of the data I am considering: Y would be the difference between two consecutive plaintext letters in some alphabet, and X would be the difference between two consecutive key values (this is in the context of Vigenère / Running Key cryptograms). I am basically studying some properties of my plaintext and ciphertext distributions to try to infer something about the key distribution. (My variables X and Y would actually take values in a set as large as the alphabet used.) Note that the plaintext distribution does not come from the plaintext being enciphered, but from typical English-language data. I may not know the content of the encrypted data, except that it is in English and of reasonable length (thousands of letters).
 
Boorglar said:
I am not sure what you mean by the probability model,

I mean the assumption that the realizations of X and Y are actually independent realizations from the probability distributions P_X(i),\ P_Y(i). If that isn't exactly true then it might happen that the "best" fit of such a model to the data would have (for example) P_{X+Y}( 4) = 0 even though the actual data might have (say) one occurrence of X + Y = 4 in several thousand realizations.
 