Subtracting out a random variable

SUMMARY

This discussion explores "subtracting out" a random variable: given a discrete random variable X and Y = f(X), find a function g such that Z = g(X) is independent of Y while the entropy H(Z) is maximized. The examples show when such a g exists and when only constant functions qualify. Two workarounds are considered: minimizing the mutual information I(f(X);g(X)) subject to H(X|f(X),g(X)) = 0, and letting g depend on an auxiliary random variable W that is independent of X. Iverson brackets and level sets are used to state and visualize the constructions.

PREREQUISITES
  • Understanding of discrete random variables and their distributions
  • Familiarity with concepts of independence in probability theory
  • Knowledge of mutual information and entropy, specifically H(X|W)
  • Proficiency in using Iverson brackets for mathematical notation
NEXT STEPS
  • Study the properties of mutual information and its applications in information theory
  • Explore the concept of entropy in depth, particularly conditional entropy H(X|W)
  • Learn about the use of joint distributions in probability and statistics
  • Investigate advanced techniques for independence testing in random variables
USEFUL FOR

Mathematicians, statisticians, data scientists, and anyone interested in advanced probability theory and the manipulation of random variables for analytical purposes.

mXSCNT

Let X be a discrete random variable and let Y = f(X) for some function f. I wish to find a function g such that Y and Z = g(X) are independent, and such that the uncertainty (entropy) H(Z) is maximized. For example, suppose X is uniformly distributed over {0,1,2,3,4,5,6,7} and f(x) = 0 if x < 4, f(x) = 1 otherwise. Then g(x) = x mod 4 satisfies the requirements. One could interpret Z as the random variable X from which Y has been "subtracted out." Encoding the values of X in binary as 000,001,010,011,100,101,110,111, we see that f extracts the left bit of X, and g extracts the remaining two bits.
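The first example can be checked numerically. This is only a small sketch: the names `joint`, `p_y`, `p_z` are mine, and entropy is measured in bits (base-2 logarithm), which the post does not specify.

```python
from collections import Counter
from fractions import Fraction
from math import log2

p = Fraction(1, 8)                      # X uniform on {0,...,7}

def f(x):
    return 0 if x < 4 else 1            # left bit of x

def g(x):
    return x % 4                        # remaining two bits of x

# Joint distribution of (Y, Z) = (f(X), g(X)) and its marginals
joint, p_y, p_z = Counter(), Counter(), Counter()
for x in range(8):
    joint[(f(x), g(x))] += p
    p_y[f(x)] += p
    p_z[g(x)] += p

# Independence: P(Y=y, Z=z) == P(Y=y) P(Z=z) for every pair (y, z)
independent = all(joint[(y, z)] == p_y[y] * p_z[z] for y in p_y for z in p_z)

# Entropy of Z in bits
h_z = -sum(float(q) * log2(float(q)) for q in p_z.values())
print(independent, h_z)
```

Here Z is uniform over four values, so H(Z) = 2 bits and H(Y) + H(Z) = 3 bits = H(X).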

However, this is not always possible. For example, suppose X is uniformly distributed over {0,1,2} and f(x) = 1 if x = 2, f(x) = 0 otherwise. Given f(X) = 1, the value g(X) = g(2) is deterministic, so independence forces g(X) to take that same value when f(X) = 0 as well. Hence the only functions g such that g(X) is independent of f(X) are those that map all of X to a single value, which does not capture the idea of "subtracting out" f(X). For one thing, one would like to be able to deduce the value of X by observing the values of f(X) and g(X), and that is not possible here.
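The claim about the three-point example can be confirmed by brute force. The sketch below enumerates every g from {0,1,2} to {0,1,2} (three output labels are enough, since only the partition induced by g matters) and keeps those for which g(X) is independent of f(X). The helper name `is_independent` is mine.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

xs = [0, 1, 2]
p = Fraction(1, 3)                      # X uniform on {0,1,2}

def f(x):
    return 1 if x == 2 else 0

def is_independent(g):
    """g is a tuple (g(0), g(1), g(2)); test P(y,z) == P(y)P(z) for all y, z."""
    joint, p_y, p_z = Counter(), Counter(), Counter()
    for x in xs:
        joint[(f(x), g[x])] += p
        p_y[f(x)] += p
        p_z[g[x]] += p
    return all(joint[(y, z)] == p_y[y] * p_z[z] for y in p_y for z in p_z)

independent_gs = [g for g in product(range(3), repeat=3) if is_independent(g)]
print(independent_gs)   # only the constant functions survive
```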

As a compromise, one could instead seek a function g such that, writing W = (f(X), g(X)) for the pair, H(X|W) = 0 and the mutual information I(f(X);g(X)) is minimized. But then, in general, f(X) and g(X) would not be independent.
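For the three-point example this compromise can be explored by exhaustive search. The following is a sketch under my own conventions: entropies are in bits, and the helper names `mutual_information` and `determines_x` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

xs = [0, 1, 2]
p = 1 / 3                               # X uniform on {0,1,2}

def f(x):
    return 1 if x == 2 else 0

def entropy(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def mutual_information(g):
    """I(f(X); g(X)) = H(f) + H(g) - H(f, g), with g given as a tuple."""
    joint, p_y, p_z = Counter(), Counter(), Counter()
    for x in xs:
        joint[(f(x), g[x])] += p
        p_y[f(x)] += p
        p_z[g[x]] += p
    return entropy(p_y) + entropy(p_z) - entropy(joint)

def determines_x(g):
    # H(X | f(X), g(X)) = 0 iff x -> (f(x), g(x)) is injective on the support
    return len({(f(x), g[x]) for x in xs}) == len(xs)

candidates = [g for g in product(range(3), repeat=3) if determines_x(g)]
best = min(candidates, key=mutual_information)
best_mi = mutual_information(best)
print(best, round(best_mi, 4))
```

In this search the minimizers separate 0 from 1 and reuse one of those two labels for x = 2, leaving roughly 0.25 bits of mutual information rather than zero.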

Any help would be appreciated--especially a pointer to other material that deals with "subtracting out" a random variable in a similar manner to this!
 


In an extension of the problem, allow g(X,W) instead of simply g(X), for some discrete random variable W that is independent of X. Then seek g and W together such that g(X,W) and f(X) are independent and such that H(X|g(X,W), f(X)) = 0. If X has finite support (that is, it takes only finitely many values with nonzero probability), this can always be done.

In the second example from above, we can solve it by letting W be uniformly distributed over {0,1}, and letting

[tex]g(x,w) = [x = 0] \cdot 0 + [x = 1] \cdot 1 + [x = 2] \cdot w[/tex]

where [] is the Iverson bracket: [P] = 1 if the proposition P is true, [P] = 0 if P is false.
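As a sanity check of this example, the sketch below builds the joint distribution of (f(X), g(X,W)) exactly and tests both requirements. Iverson brackets are written as `int(...)` of a boolean; the names `joint` and `preimages` are mine.

```python
from collections import Counter
from fractions import Fraction

def f(x):
    return int(x == 2)

def g(x, w):
    # [x = 0] * 0 + [x = 1] * 1 + [x = 2] * w
    return int(x == 0) * 0 + int(x == 1) * 1 + int(x == 2) * w

joint = Counter()      # distribution of (f(X), g(X, W))
preimages = {}         # which x values can produce each pair (y, z)
for x in range(3):             # X uniform on {0,1,2}
    for w in range(2):         # W uniform on {0,1}, independent of X
        key = (f(x), g(x, w))
        joint[key] += Fraction(1, 3) * Fraction(1, 2)
        preimages.setdefault(key, set()).add(x)

p_y, p_z = Counter(), Counter()
for (y, z), q in joint.items():
    p_y[y] += q
    p_z[z] += q

independent = all(joint[(y, z)] == p_y[y] * p_z[z] for y in p_y for z in p_z)
recoverable = all(len(s) == 1 for s in preimages.values())   # H(X | g, f) = 0
print(independent, recoverable)
```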

From here on, denote the level set [tex]\{x : f(x) = y\}[/tex] by h(y). The condition that g(X,W) and f(X) are independent is equivalent to saying that for every [tex]y_1,y_2[/tex] in the image of f, and for every z in the image of g, [tex]P(g(X,W) = z | X \in h(y_1)) = P(g(X,W) = z | X \in h(y_2))[/tex]. Given independence, the condition H(X|g(X,W),f(X)) = 0 is equivalent to saying that for each y in the image of f and each z in the image of g, there is exactly one x in h(y) with P(X = x) > 0 such that [tex]P(g(x,W) = z) \neq 0[/tex].

This can be visualized as follows. Imagine that each value x of X corresponds to a rectangle of width 1 and height proportional to P(X=x). Partition the rectangles into groups h(y) for each y. Then we seek to "color" the rectangles in each group such that each color appears within exactly one rectangle for each group (it does not need to cover the entire rectangle) and such that the area of each color within a group, divided by the total area of its group, is constant over all the groups. Each color represents a possible value of g(X,W). Here is what such a diagram would look like for the preceding example:
[Figure: fig1.png]

Let's take a slightly more complicated example. Suppose that X is uniformly distributed over {0,1,2,3,4} and f(X) = [X > 2] (that is, f(X) = 1 if X > 2, f(X) = 0 otherwise). One can solve the problem in this case by letting W be uniformly distributed over {0,1,2,3,4,5} and letting g(x,w) = [x <= 2] * (2*x + (w mod 2)) + [x > 2] * (3*(x-3) + (w mod 3)). Given either value of f(X), g(X,W) is then uniform over {0,1,2,3,4,5}. This is illustrated below.
[Figure: fig2.png]
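The five-point example can be verified the same way. In this sketch `recover_x` is a hypothetical helper that inverts (f, g), and the conditional distributions of g(X,W) given each value of f(X) are compared as exact fractions.

```python
from collections import Counter
from fractions import Fraction

def f(x):
    return int(x > 2)

def g(x, w):
    if x <= 2:
        return 2 * x + (w % 2)      # block of width 2 for each x in h(0) = {0,1,2}
    return 3 * (x - 3) + (w % 3)    # block of width 3 for each x in h(1) = {3,4}

def recover_x(y, z):
    # invert (f, g): integer-divide z by the block width of group y
    return z // 2 if y == 0 else 3 + z // 3

joint = Counter()
for x in range(5):                  # X uniform on {0,...,4}
    for w in range(6):              # W uniform on {0,...,5}
        joint[(f(x), g(x, w))] += Fraction(1, 30)

p_y = {0: Fraction(3, 5), 1: Fraction(2, 5)}
cond = {y: [joint[(y, z)] / p_y[y] for z in range(6)] for y in p_y}

same_law = cond[0] == cond[1]       # independence of g(X, W) and f(X)
invertible = all(recover_x(f(x), g(x, w)) == x
                 for x in range(5) for w in range(6))
print(same_law, invertible)
```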

You may be able to see a pattern between these two examples. Now, somewhat more generally, suppose X is uniformly distributed over {1,2,...,n}, and the range of f(X) is {1,2,...,m}. Let [tex]L = \operatorname{lcm}_{y \in \{1,\dots,m\}} |h(y)|[/tex], and let W be uniformly distributed over {0,1,...,L-1}. Let r(x) = i iff x is the i'th smallest element of h(f(x)) (beginning at i=0), and write k(x) = L / |h(f(x))|. Now define g(x, w) = r(x) * k(x) + (w mod k(x)). Given f(X) = y, g(X,W) is uniform over {0,1,...,L-1} for every y, and x can be recovered from f(x) together with r(x) = floor(g(x,w) / k(x)); this solves the problem for the case when X is uniformly distributed.
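A sketch of the uniform-case construction follows. Consistent with the two worked examples, the offset is reduced modulo the block width L / |h(f(x))|, so that the blocks assigned to different x in the same level set do not overlap. The function names `build_g` and `conditional_laws` and the test function f are my own choices, and `math.lcm` with several arguments needs Python 3.9 or later.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from math import lcm  # Python 3.9+

def build_g(xs, f):
    """Return (g, L) for X uniform on xs; assumes the values in xs are sortable."""
    level = defaultdict(list)                 # level[y] = sorted h(y)
    for x in sorted(xs):
        level[f(x)].append(x)
    L = lcm(*(len(members) for members in level.values()))
    rank = {x: i for members in level.values() for i, x in enumerate(members)}

    def g(x, w):
        block = L // len(level[f(x)])         # width of the block owned by x
        return rank[x] * block + (w % block)

    return g, L

def conditional_laws(xs, f, g, L):
    """Distribution of g(X, W) given f(X) = y, for each y, as exact fractions."""
    counts = defaultdict(Counter)
    for x in xs:
        for w in range(L):
            counts[f(x)][g(x, w)] += 1
    return {y: {z: Fraction(c, sum(cnt.values())) for z, c in cnt.items()}
            for y, cnt in counts.items()}

# Hypothetical test case: level sets of sizes 2, 3 and 4, so L = 12
xs = list(range(1, 10))
f = lambda x: 1 if x <= 2 else (2 if x <= 5 else 3)
g, L = build_g(xs, f)
laws = conditional_laws(xs, f, g, L)
print(L, all(law == laws[1] for law in laws.values()))
```

If every conditional law is the same uniform distribution over {0,...,L-1}, then g(X,W) is independent of f(X).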

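However, that construction relies on X being uniform, so it doesn't work for the following case: let X be distributed over {0,1,2,3} such that [tex]P(X=x) = \frac{1}{\sqrt{2} + 3}([x = 0] \cdot \sqrt{2} + [x > 0])[/tex], and f(x) = [x > 1]. This case can be solved by letting W be distributed over {0,1,2,3} with [tex]P(W=w) = \frac{1}{2(\sqrt{2} + 1)} ([w \le 1] \cdot \sqrt{2} + [w > 1])[/tex], and letting g(x, w) = [x <= 1] * (2x + (w mod 2)) + [x > 1] * (x - 2 + 2 * [w > 1]), as illustrated below. Here w mod 2 is a fair bit and [w > 1] equals 1 with probability [tex]1/(\sqrt{2}+1) = P(X = 1 | X \le 1)[/tex], and the two are independent, so g(X,W) has the same distribution given either value of f(X).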
[Figure: fig3.png]


Using the intuition from the preceding example, let's again suppose that X has support {1, 2, ..., n} and that the range of f(X) is {1,2,...,m}, but relax the restriction that X must be uniformly distributed. For each y, let W_y be a random variable on h(y) with [tex]P(W_y = x) = P(X=x | X \in h(y))[/tex], where the W_y are mutually independent and independent of X. Let [tex]W = (W_1, W_2, \ldots, W_m) \in \mathbb{Z}^m[/tex]. For a value w of W, let [tex]w(i,x) \in \mathbb{Z}^m[/tex] be w with its i'th coordinate replaced by x. Then define [tex]g(x, w) = w(f(x),x)[/tex]. Conditioned on f(X) = y, the y'th coordinate of g(X,W) has the same distribution as W_y, so g(X,W) is distributed like W regardless of y, and x can be read off as the f(x)'th coordinate of g(x,w).

Another question is whether this can be extended to the case when X has infinite support. I suspect that in this case W might have to be continuous.
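The product construction for non-uniform X can be sketched as follows, tested on the distribution with the sqrt(2) weight from the preceding example. This is an illustration under my own conventions: W is represented as a tuple with one coordinate per level set, probabilities are floats compared with a tolerance, and the names `build`, `p_w` and `cond` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import isclose, sqrt

def build(p_x, f):
    """p_x maps x to P(X = x). Returns (p_w, g), where W ranges over tuples."""
    level = defaultdict(list)                       # level[y] = h(y)
    for x in sorted(p_x):
        level[f(x)].append(x)
    ys = sorted(level)
    mass = {y: sum(p_x[x] for x in level[y]) for y in ys}

    # Independent coordinates with P(W_y = x) = P(X = x | f(X) = y)
    p_w = {}
    for w in product(*(level[y] for y in ys)):
        q = 1.0
        for y, x in zip(ys, w):
            q *= p_x[x] / mass[y]
        p_w[w] = q

    def g(x, w):
        i = ys.index(f(x))                          # coordinate owned by f(x)
        return w[:i] + (x,) + w[i + 1:]             # w with coordinate i replaced by x

    return p_w, g

c = 3 + sqrt(2)
p_x = {0: sqrt(2) / c, 1: 1 / c, 2: 1 / c, 3: 1 / c}
f = lambda x: int(x > 1)
p_w, g = build(p_x, f)

# Conditional distribution of g(X, W) given f(X) = y
cond = defaultdict(lambda: defaultdict(float))
for x, px in p_x.items():
    for w, pw in p_w.items():
        cond[f(x)][g(x, w)] += px * pw
for y in cond:
    total = sum(cond[y].values())
    for z in cond[y]:
        cond[y][z] /= total

# Independence: each conditional law should coincide with the law of W
matches = all(isclose(cond[y][z], p_w[z]) for y in cond for z in p_w)
print(matches)
```

Recovering x needs no computation here, since x is literally stored in the f(x)'th coordinate of g(x, w). The price is that W has as many coordinates as f has values.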

The technique of using the extra random variable W to satisfy the conditions H(X|g(X,W), f(X)) = 0, I(g(X,W);f(X)) = 0, comes at a cost: the uncertainty in g(X,W) may become relatively large. It would be ideal if H(g(X,W)) + H(f(X)) = H(X), but much of the time this is not nearly the case.
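The size of this cost can be computed for the three-point example with W uniform on {0,1}. Since f(X) and g(X,W) are independent and together determine X, H(g(X,W)) + H(f(X)) = H(X) + H(g(X,W)|X), so the excess is exactly the randomness of W that leaks into g. The sketch below uses bits as the unit, which is my assumption.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(q * log2(q) for q in probs if q > 0)

h_x = entropy([1 / 3, 1 / 3, 1 / 3])   # X uniform on {0,1,2}
h_f = entropy([2 / 3, 1 / 3])          # f(X) = [X = 2]
h_g = entropy([1 / 2, 1 / 2])          # g(X, W) is uniform on {0,1}

overhead = h_g + h_f - h_x             # equals H(g(X,W) | X)
print(round(h_x, 4), round(h_g + h_f, 4), round(overhead, 4))
```

The overhead here is 1/3 of a bit: W contributes one full bit, but only when X = 2, which happens with probability 1/3.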

Any commentary? I am especially looking for a place where similar material (about the subtraction of the effect of one random variable from another) has been treated elsewhere.
 

Attachments: fig1.png, fig2.png, fig3.png
