Solving for P_X(x) given P_Y(y) and P_{X+Y}(s)

Boorglar
I am considering two discrete random variables, X and Y, with values in {0,1,...,9}.
From my data, I can obtain an estimate for the distribution of Y (the function P_{Y}(y) = P(Y=y) for y in {0,1,...,9}).
I can also obtain experimentally the distribution of the sum, P_{X+Y}(s) = P(X+Y = s) for s in {0, 1, ..., 9}, where the sum is taken modulo 10.

From that, I want to solve for the distribution of X (P_{X}(x) = probability that X = x, for x in {0,1,...,9}).

I furthermore assume that X and Y are independent variables, so that their joint probability is the product of their individual probabilities.

From probability theory, I know I can write P_{X+Y}(s) = \sum_{x = 0}^{9}P_{Y}((s-x) \mod 10) P_{X}(x)
Since I want to solve for P_{X}(x), I need to solve a 10x10 system of equations whose coefficient matrix has entries P_{Y}((s-x) \mod 10) and whose right-hand side is the vector of values P_{X+Y}(s) for s in {0,...,9}.
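
For concreteness, here is a minimal sketch of this direct linear solve in Python/NumPy; the random stand-in data and variable names are illustrative assumptions, not data from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)
p_y = rng.dirichlet(np.ones(10))   # stand-in for the estimated P_Y
p_xy = rng.dirichlet(np.ones(10))  # stand-in for the estimated P_{X+Y}

# Coefficient matrix A[s, x] = P_Y((s - x) mod 10), so that A @ p_x = p_xy.
A = np.array([[p_y[(s - x) % 10] for x in range(10)] for s in range(10)])

p_x = np.linalg.solve(A, p_xy)
print(p_x)  # with noisy estimates, some entries may come out negative
```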

Unfortunately, perhaps due to the estimates not being perfectly accurate, the solution of this system of equations (done numerically) gives me negative values for some of the probabilities in P_X(x).

So is there another way of obtaining an estimate for the values of P_X(x)?

Thank you.
 
One of my professors said "All problems can be viewed as optimization problems".

You can view the problem as something like curve fitting: find the values of the unknowns that make predictions which closely fit the observed data.

Treat both P_X(i) and P_Y(i) as unknowns, for all i. Define a function of those variables that predicts some aspects of the observed data. Define a function that measures the "error" between the predicted aspects and the observed aspects. Treat the problem as finding values of the unknowns that minimize the "error" of the prediction, subject to constraints on the variables that make them legitimate probabilities (e.g. 0 \le P_X(i) \le 1, \sum_{i=0}^9 P_X(i) = 1, 0 \le P_Y(i) \le 1, \sum_{i=0}^9 P_Y(i) = 1).

For example, look at the definition of the \chi^2 statistic. http://en.wikipedia.org/wiki/Pearson's_chi-squared_test Forget about \chi^2 being used in a statistical test and just consider it as a function that measures the error between predicted values (the expected values) and the observed values. In your data you have two types of "bins", Y-bins and X+Y bins.

In the "Y-bins" you have counts of how many times particular Y-values were observed. For example if you ran 1000 experiments and observed Y = 3 happened 88 times, the expected value E_3 = (1000)P_Y(3) and the observed value O_3 = 88.

In the "X+Y" bins, you may have run a different number of experiments, let's say it was 2000. Then, for example, the expected number in the bin for X+Y=3 is E_3 = 2000( P_X(0)P_Y(3) + P_X(1)P_Y(2) + P_X(2)P_Y(1) + ...P_X(9)P_Y(4) + ..) (corresponding to 2000 mutiplied by expression that appears on one of your ten equations.).

The formula for \chi^2 defines a function in the unknowns P_X(i), P_Y(i). Values of the unknowns that make this function small produce a better fit to the observed data. Solve for the values of the unknowns that minimize \chi^2, subject to the given constraints.
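
A minimal sketch of this constrained fit, using scipy.optimize.minimize (SLSQP) as one possible stand-in for a constrained optimizer; the function names, the simulated data, and the small eps regularizer are all illustrative assumptions rather than part of the suggestion above:

```python
import numpy as np
from scipy.optimize import minimize

def circ_conv(px, py):
    """P_{X+Y}(s) = sum_x P_X(x) * P_Y((s - x) mod 10)."""
    return np.array([sum(px[x] * py[(s - x) % 10] for x in range(10))
                     for s in range(10)])

def fit_chi2(obs_y, obs_s, x0=None, eps=1e-9):
    """Minimize the two-part chi-squared over (P_X, P_Y) jointly."""
    n_y, n_s = obs_y.sum(), obs_s.sum()   # numbers of trials in each data set

    def chi2(v):
        px, py = v[:10], v[10:]
        e_y = n_y * py                    # expected Y-bin counts
        e_s = n_s * circ_conv(px, py)     # expected (X+Y)-bin counts
        # eps guards against zero expected counts (see the discussion below)
        return (np.sum((obs_y - e_y) ** 2 / (e_y + eps)) +
                np.sum((obs_s - e_s) ** 2 / (e_s + eps)))

    cons = [{"type": "eq", "fun": lambda v: v[:10].sum() - 1.0},
            {"type": "eq", "fun": lambda v: v[10:].sum() - 1.0}]
    if x0 is None:
        x0 = np.full(20, 0.1)             # start at the uniform distribution
    res = minimize(chi2, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * 20, constraints=cons)
    return res.x[:10], res.x[10:], res

# Simulated example data (1000 Y-trials and 2000 X+Y-trials, as above):
rng = np.random.default_rng(1)
true_px, true_py = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
obs_y = rng.multinomial(1000, true_py)
obs_s = rng.multinomial(2000, circ_conv(true_px, true_py))
px_hat, py_hat, info = fit_chi2(obs_y, obs_s)
```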

---
There are different methods for doing statistical estimation and they may not give the same answers. Which method is best? That is a complicated question. Defining "best" involves technicalities and gets into the fact that estimation procedures that are best in one respect may not be best in another respect.
 
Hello,
I see, this seems like a good solution. So if I use the chi-squared function as my error measurement, I would compute \sum_{y=0}^{9}\frac{(E_Y(y) - O_Y(y))^2}{E_Y(y)} + \sum_{s=0}^{9}\frac{(E_{X+Y}(s) - O_{X+Y}(s))^2}{E_{X+Y}(s)} where E_{X+Y}(s) is replaced by the convolution sum in terms of E_X and E_Y.
And my optimization would run over the values E_X(x), E_Y(y) for x, y in {0, 1, ..., 9}, constrained by \frac{1}{|X|}\sum_{x=0}^{9}E_X(x) = \frac{1}{|Y|}\sum_{y=0}^{9}E_Y(y) = 1 (where |X| and |Y| denote the numbers of trials in each data set), with each expected count between 0 and its trial count (equivalently, each probability between 0 and 1).

Any suggestions on how to perform the optimization? The first thing that came to mind was using the zero-gradient test and Lagrange multipliers, but I don't like these unknowns appearing in the denominators...
 
If the form of the probability model for the data is perfect then when the expected value associated with a given bin is zero, no observations should appear in that bin. (Since the random variable involved is a non-negative count, it can't have a zero mean except by being zero with probability 1.) So you could define the function \frac{ (O(s) - E(s))^2}{E(s)} to be zero when E(s) = 0.
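
One way this convention might be implemented (the names here are illustrative):

```python
import numpy as np

def chi2_terms(obs, exp):
    """Per-bin chi-squared terms, defined as 0 wherever the expected
    count is 0 (the convention suggested above)."""
    obs, exp = np.asarray(obs, float), np.asarray(exp, float)
    out = np.zeros_like(exp)
    nz = exp > 0
    out[nz] = (obs[nz] - exp[nz]) ** 2 / exp[nz]
    return out  # caller sums these to get the full chi-squared
```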

If we acknowledge the probability model is only an imperfect approximation to how the data is generated, then we'll have to think about a different error function to minimize. (Perhaps, a better term than "error function" would be "penalty function".)

If I had to write a computer program this morning to do the problem, I wouldn't do anything that depended on finding a symbolic expression for the gradient of the penalty function or on Lagrange multipliers. Someone who was good at using computer symbolic algebra software might be able to do that. I would rely on approximating the gradient numerically. To overcome local minima, I'd pick a large number of initial points and use approximate gradient following to see where they lead.
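
A possible sketch of that multi-start strategy, reusing fit_chi2, obs_y, obs_s, and rng from the earlier sketch; SLSQP falls back to a finite-difference gradient when none is supplied, and the restart count here is an arbitrary choice:

```python
best = None
for _ in range(50):                        # number of random restarts
    x0 = np.concatenate([rng.dirichlet(np.ones(10)),
                         rng.dirichlet(np.ones(10))])
    px, py, res = fit_chi2(obs_y, obs_s, x0=x0)
    if res.success and (best is None or res.fun < best.fun):
        best, best_px, best_py = res, px, py
```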

I don't know how much effort you want to spend on the problem. If it is an important problem then I suggest the following:

1) Define one or more utility functions that measure the utility (or dis-utility, if you prefer) of a wrong estimate. If the actual value of the unknowns is A and your estimate of the unknowns is B then find a function F(A,B) that measures the utility of the situation. ( F(A,B) may or may not be related to the function you minimize to make the estimate B. )

2) Do a Monte-Carlo simulation where you generate some random "true" values of A and apply your estimation method to obtain a B. Look at the distribution of F(A,B) as a random variable. That indicates how well your estimation method works "in general".
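
A sketch of such a simulation, again reusing circ_conv and fit_chi2 from the earlier sketch; total variation distance is just one arbitrary choice for F(A,B), and the trial counts are illustrative:

```python
def total_variation(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

rng = np.random.default_rng(2)
errors = []
for _ in range(200):
    a_px = rng.dirichlet(np.ones(10))            # random "true" unknowns A
    a_py = rng.dirichlet(np.ones(10))
    oy = rng.multinomial(1000, a_py)             # simulated observations
    os_ = rng.multinomial(2000, circ_conv(a_px, a_py))
    b_px, _, _ = fit_chi2(oy, os_)               # the estimate B
    errors.append(total_variation(a_px, b_px))

print(np.mean(errors), np.percentile(errors, 95))
```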
 
Thank you for the advice. For the optimization, I found that a fast way would be to use MATLAB, which has a function called "fmincon" that can solve nonlinear optimization problems with linear constraints (in this case, the requirements that the probabilities add up to 1 and that they lie between 0 and 1). As for the penalty/error function, I am not sure whether I should be minimizing the sum of the chi-squared values of P_Y and P_{X+Y}, or something else.

I am not sure what you mean by the probability model, but to give you a concrete example of the data I am considering: Y would be the difference between two consecutive plaintext letters in some alphabet, and X would be the difference between two consecutive key values (this is in the context of Vigenère / Running Key cryptograms). I am basically studying some properties of my plaintext and ciphertext distributions to try to infer something about the key distribution. (My variables X and Y would actually take values in a set as large as the alphabet used.) Note that the plaintext distribution does not come from the plaintext being enciphered, but from typical English-language data. I may not know the content of the encrypted data, except that it is in English and of reasonable length (thousands of letters).
 
Boorglar said:
I am not sure what you mean by the probability model,

I mean the assumption that the realizations of X and Y are actually independent realizations from the probability distributions P_X(i),\ P_Y(i). If that isn't exactly true then it might happen that the "best" fit of such a model to the data would have (for example) P_{X+Y}( 4) = 0 even though the actual data might have (say) one occurrence of X + Y = 4 in several thousand realizations.
 