Are There Any Theorems Relating Joint Distributions to Marginals?

WWCY · Feb 1, 2019

Hi all,

I was wondering if there exist any theorems that allow one to relate any joint distribution to its marginals in the form of an inequality, whether or not ##X,Y## are independent. For example, is it possible to make a general statement like this?
$$f_{XY}(x,y) \geq f_X (x) f_Y(y)$$

Also, I came across the following inequality on stackexchange (link below):
$$F_X(x) + F_Y(y) - 1 \leq F_{X,Y}(x,y) \leq \sqrt{F_X(x) F_Y(y)}$$

Is this true for all ##F##? And if so, does this inequality have a name?

Many thanks in advance!

https://stats.stackexchange.com/que...g-joint-cumulative-and-marginal-distributions

StoneTemplePython · Feb 1, 2019

WWCY said:

Hi all,

I was wondering if there exist any theorems that allow one to relate any joint distribution to its marginals in the form of an inequality, whether or not ##X,Y## are independent. For example, is it possible to make a general statement like this?
$$f_{XY}(x,y) \geq f_X (x) f_Y(y)$$

given what you've said, I don't see how this could possibly be true.

Your right hand side is a product of two marginal distributions and says nothing about the joint distribution. Via coupling or other techniques one can come up with many different joint distributions for the LHS. Have you tried to come up with a simple counterexample?

WWCY said:

Also, I came across the following inequality on stackexchange (link below):
$$F_X(x) + F_Y(y) - 1 \leq F_{X,Y}(x,y) \leq \sqrt{F_X(x) F_Y(y)}$$

Is this true for all ##F##? And if so, does this inequality have a name?

Many thanks in advance!

https://stats.stackexchange.com/que...g-joint-cumulative-and-marginal-distributions

I'm not sure what is going on with that link since they don't use the correct standard form for CDFs which is ## X \leq x## not ##X \lt x##, though the results hold.

In many of my posts including responses to your questions, I've emphasized the role of events. Can you write out ##F_{X,Y}(x,y)## as an event (and specifically as the union or intersection of 2 events)?

Look at this event and its complement; the associated probabilities of course sum to one. Play around with the modelling here, apply Boole's Inequality / Union bound, and re-arrange terms to get the lower bound. Please do this and show work. You should be able to interpret and work with events at this stage. For what it's worth, I could eyeball the lower bound and see it was Boole's Inequality, but I had to slow down at the end of deriving it to get the symbol manipulation of sets correct.

- - - -
edit:
here is one way of proving the upper bound:

for the upper bound, consider the following:

##F_{X,Y}(x,y) = F_X(x)\cdot F_{Y|X\leq x}(y\vert X \leq x) \leq \sqrt{F_X(x)F_Y(y)}##

If ##F_X(x) =0## then we have equality, which satisfies the above. (Also note that if ##F_Y(y) = 0## then the upper bound is zero, but so must be the lower bound, why?, so for proving the inequality we can assume that both ##F_X(x)## and ##F_Y(y)## are positive)

Otherwise it is positive and we can divide by it to get the equivalent claim

##F_{Y|X\leq x}(y\vert X \leq x) \leq \sqrt{\frac{F_Y(y)}{F_X(x)}}##

switching to the language of events i.e.
##X(\omega) \leq x ## is event ##A## and
##Y(\omega) \leq y ## is event ##B##

the result follows by application of Bayes Rule / basic conditional probability.

so the claim to prove is

##P\big(B \big \vert A\big) \leq \sqrt{\frac{P(B)}{P(A)}}##

but by Bayes Rule we have

##P\big(B \big \vert A\big) = P\big(A\big \vert B\big) \cdot \frac{P\big(B\big)}{P\big(A\big)}\leq P\big(A\big \vert B\big)^\frac{1}{2} \cdot \Big(\frac{P\big(B\big)}{P\big(A\big)}\Big)^\frac{1}{2} \leq \Big(\frac{P\big(B\big)}{P\big(A\big)}\Big)^\frac{1}{2} = \sqrt{\frac{P\big(B\big)}{P\big(A\big)}}##

as desired.

where the first inequality holds because the product ##P\big(A\big \vert B\big) \cdot \frac{P(B)}{P(A)}## equals ##P\big(B \big \vert A\big) \in [0,1]##, and taking a square root never decreases a number in ##[0,1]##; the second holds because ##P\big(A\big \vert B\big)^\frac{1}{2} \in [0,1]##

I leave the lower bound open as an exercise for you. If you get stuck, post what you've done and I can help you step through it.

- - - -
additional edit:
A much more satisfying, and slick, way to prove the upper bound is to recognize it follows almost immediately from Cauchy-Schwarz.

If you are comfortable with the fact that expectations give an inner product and can see how to set this up with indicator random variables, then you are done.

If you aren't familiar / comfortable with this, then you can streamline the argument via the use of a Gram Matrix. So consider the matrix below, with random variables ##X_1## and ##X_2##, which can be any real valued random variables that have a second moment

##\mathbf G := \begin{bmatrix}
E[X_1 X_1] & E[X_1 X_2] \\
E[X_2 X_1] & E[X_2 X_2]\\
\end{bmatrix}##

(note if these random variables had zero mean you'd see this referred to as a covariance matrix, but there is a general result here that I'm showing which holds for any arbitrary real valued random variables and it includes any arbitrary means-- something that we'll exploit later)

for any ##\mathbf v \in \mathbb R^2## you have

##\mathbf v^T \mathbf G \mathbf v ##
##= v_1^2 E[X_1 X_1] + v_1 v_2 E[X_1 X_2] + v_2 v_1 E[X_2 X_1] + v_2^2 E[X_2 X_2] ##
##= v_1^2 E[X_1^2] + 2 \cdot v_1 v_2 E[X_1 X_2] + v_2^2 E[X_2^2] ##
##= E\Big[ v_1^2 X_1^2 + 2 \cdot v_1 v_2 X_1 X_2 + v_2^2 X_2^2\Big] ##
##= E\Big[\big( v_1 X_1 + v_2 X_2\big)^2\Big]##
##\geq 0##

where we made use of Linearity of Expectations and the fact that over reals, a sum of squares is non-negative (and so is the expectation).

Hence ##\mathbf G## is real symmetric positive semi-definite.

Thus by Cauchy Schwarz we know
## \big(E[X_1 X_2]\big)^2 \leq E[X_1^2] \cdot E[X_2^2] ##

Or if you prefer we can say both eigenvalues of ##\mathbf G## are real non-negative so
##\det\big(\mathbf G\big) = E[X_1^2] \cdot \ E[X_2^2] - \big(E[X_1 X_2]\big)^2 = \lambda_1 \lambda_2 \geq 0##
which is the same thing. Side note: this gives us the equality conditions: i.e. the inequality is strict unless
##\begin{bmatrix}
E[X_1 X_1] \\
E[X_2 X_1] \\
\end{bmatrix}\propto
\begin{bmatrix}
E[X_1 X_2] \\
E[X_2 X_2]\\
\end{bmatrix}##
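A quick numerical sanity check of the positive semi-definiteness claim (not a proof) can be sketched as follows. It builds the empirical Gram matrix from samples and checks that the determinant and a batch of quadratic forms are non-negative. The helper names `gram_matrix` and `quadratic_form`, the Gaussian samples, and the seed are all illustrative assumptions, not part of the argument above.

```python
import random

def gram_matrix(samples):
    """Empirical 2x2 Gram matrix [[E[X1 X1], E[X1 X2]], [E[X2 X1], E[X2 X2]]]."""
    m = len(samples)
    g11 = sum(a * a for a, _ in samples) / m
    g12 = sum(a * b for a, b in samples) / m
    g22 = sum(b * b for _, b in samples) / m
    return [[g11, g12], [g12, g22]]

def quadratic_form(g, v):
    """Evaluate v^T G v for a symmetric 2x2 matrix G and a length-2 vector v."""
    return (v[0] * v[0] * g[0][0]
            + 2 * v[0] * v[1] * g[0][1]
            + v[1] * v[1] * g[1][1])

rng = random.Random(1)

# Correlated samples with non-zero means (assumed for illustration),
# to stress that no centering is needed for the Gram matrix argument.
samples = []
for _ in range(1000):
    x1 = rng.gauss(2.0, 1.0)
    x2 = 0.5 * x1 + rng.gauss(-1.0, 3.0)
    samples.append((x1, x2))

g = gram_matrix(samples)

# Cauchy-Schwarz in determinant form: E[X1^2] E[X2^2] - (E[X1 X2])^2 >= 0.
det = g[0][0] * g[1][1] - g[0][1] ** 2

# v^T G v for a batch of random vectors v.
forms = [quadratic_form(g, (rng.uniform(-5, 5), rng.uniform(-5, 5)))
         for _ in range(100)]

print(det >= 0, min(forms) >= 0)
```

The means are deliberately non-zero here, since the argument works with raw second moments rather than covariances.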

now select ##X_1 :=\mathbb I_A## and ##X_2 :=\mathbb I_B##, where as in the above we have
##X(\omega) \leq x ## is event ##A## and
##Y(\omega) \leq y ## is event ##B##

The result is

##\Big(F_{X,Y}(x,y)\Big)^2 = \Big(P\big(A \cap B\big)\Big)^2 = \Big(E\big[\mathbb I_A \mathbb I_B\big]\Big)^2 \leq E\big[\mathbb I_A^2\big] E\big[\mathbb I_B^2\big] = E\big[\mathbb I_A\big] E\big[\mathbb I_B\big] = P\big(A\big)P\big(B\big) = F_X(x)F_Y(y) ##

where we make use of idempotence of Indicator Random Variables (Bernoullis), e.g. ##\mathbb I_A^2 = \mathbb I_A##

taking square roots gives the result
##F_{X,Y}(x,y) \leq \sqrt{F_X(x)F_Y(y)}##
as desired

WWCY said:

Is this true for all ##F##? And if so, does this inequality have a name?

Yes. The Lower Bound is called Union Bound and the Upper Bound is called Cauchy-Schwarz. These are two of the most fundamental inequalities out there.
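For readers who want to see both bounds in action, here is a minimal sketch that checks ##F_X(x) + F_Y(y) - 1 \leq F_{X,Y}(x,y) \leq \sqrt{F_X(x) F_Y(y)}## on randomly generated discrete joint distributions. It is a sanity check under stated assumptions, not a proof: the grid size, the seed, the tolerance, and the function names `random_joint_pmf` and `bounds_hold` are all made up for illustration.

```python
import math
import random

def random_joint_pmf(n_x, n_y, rng):
    """Random joint pmf on the grid {0..n_x-1} x {0..n_y-1}."""
    weights = [[rng.random() for _ in range(n_y)] for _ in range(n_x)]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

def bounds_hold(pmf, tol=1e-12):
    """Check F_X + F_Y - 1 <= F_XY <= sqrt(F_X * F_Y) at every grid point."""
    n_x, n_y = len(pmf), len(pmf[0])
    for x in range(n_x):
        for y in range(n_y):
            # Joint CDF P(X <= x, Y <= y) and the two marginal CDFs.
            f_xy = sum(pmf[i][j] for i in range(x + 1) for j in range(y + 1))
            f_x = sum(pmf[i][j] for i in range(x + 1) for j in range(n_y))
            f_y = sum(pmf[i][j] for i in range(n_x) for j in range(y + 1))
            if f_x + f_y - 1 > f_xy + tol:          # union-bound side
                return False
            if f_xy > math.sqrt(f_x * f_y) + tol:   # Cauchy-Schwarz side
                return False
    return True

rng = random.Random(0)
results = [bounds_hold(random_joint_pmf(4, 5, rng)) for _ in range(200)]
print(all(results))
```

The small tolerance only absorbs floating-point rounding; the inequalities themselves are exact.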

mathman · Feb 1, 2019

##\int\int f_X(x)f_Y(y)dxdy=\int\int f_{X,Y}(x,y)dxdy=1## makes it impossible for one to dominate the other. If ##X## and ##Y## are independent, the joint density is the product of the individual densities.

Ray Vickson · Feb 3, 2019

WWCY said:

Hi all,

I was wondering if there exist any theorems that allow one to relate any joint distribution to its marginals in the form of an inequality, whether or not ##X,Y## are independent. For example, is it possible to make a general statement like this?
$$f_{XY}(x,y) \geq f_X (x) f_Y(y)$$

Also, I came across the following inequality on stackexchange (link below):
$$F_X(x) + F_Y(y) - 1 \leq F_{X,Y}(x,y) \leq \sqrt{F_X(x) F_Y(y)}$$

Is this true for all ##F##? And if so, does this inequality have a name?

Many thanks in advance!

https://stats.stackexchange.com/que...g-joint-cumulative-and-marginal-distributions

The statement ##f_{XY}(x,y) \geq f_X(x) \, f_Y(y)## sometimes fails. Here is a counterexample to the statement. To keep it simple, start with a discrete bivariate probability mass function ##p_{XY}(1,2) = p_{XY}(2,1) = 1/2.## The marginals are
$$p_X(u) = p_Y(u) = \begin{cases} 1/2, & \text{if} \; u = 1\\
1/2, &\text{if} \; u=2 \\
0 & \text{otherwise}
\end{cases} $$
We have ##p_{XY}(1,1) = 0## but ##p_X(1) p_Y(1) = 1/4.## You can get a continuously-distributed example (with density functions instead of mass functions) just by spreading out the point masses a bit.
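The discrete counterexample can be checked mechanically. The sketch below uses exact fractions and lists the points where the joint mass falls strictly below the product of the marginals; the dictionary layout and the helper name `marginals` are illustrative choices, not anything from the thread.

```python
from fractions import Fraction

# Joint pmf from the counterexample: mass 1/2 at (1, 2) and at (2, 1).
joint = {(1, 2): Fraction(1, 2), (2, 1): Fraction(1, 2)}

def marginals(pmf):
    """Return the marginal pmfs of X and Y as dictionaries."""
    p_x, p_y = {}, {}
    for (x, y), p in pmf.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return p_x, p_y

p_x, p_y = marginals(joint)

# Points where the joint mass is strictly below the product of the marginals.
violations = sorted(
    (x, y)
    for x in p_x
    for y in p_y
    if joint.get((x, y), 0) < p_x[x] * p_y[y]
)
print(violations)
```

By symmetry the same failure occurs at ##(2,2)## as at ##(1,1)##.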

WWCY · Feb 3, 2019

Hi everyone, thanks for the responses. I do believe I understand how these inequalities are proven and how they work/don't work now.

For the union bound, ##A \equiv X\leq x, B \equiv Y\leq y##

$$P(A \cup B) = P(A) + P(B) - P(A\cap B)$$
$$P(A\cap B) = P(A) + P(B) - P(A \cup B) \geq P(A) + P(B) - 1 $$

Is this right?

StoneTemplePython · Feb 3, 2019

WWCY said:

Hi everyone, thanks for the responses. I do believe I understand how these inequalities are proven and how they work/don't work now.

For the union bound, ##A \equiv X\leq x, B \equiv Y\leq y##

$$P(A \cup B) = P(A) + P(B) - P(A\cap B)$$
$$P(A\cap B) = P(A) + P(B) - P(A \cup B) \geq P(A) + P(B) - 1 $$

Is this right?

Nicely done. What I was thinking of was:

##P(A \cap B) ##
##= 1 - P(A^C \cup B^C) ##
##\geq 1 - \Big(P(A^C) + P(B^C) \Big)##
##= 1 - \Big(\big(1-P(A)\big) + \big(1-P(B)\big)\Big) ##
##= P(A) + P(B) -1##
by the union bound

- - - -
it isn't needed but one advantage of this setup is that it immediately gives a proof of
##P(A_1\cap A_2 \cap ... \cap A_n) \geq P(A_1) + P(A_2) + ... + P(A_n) - (n-1)##

via an identical argument, courtesy of the union bound.

The exact calculation approach would be a lot more difficult (though you could use bunching and induction to avoid most of the unpleasantness).
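The ##n##-event lower bound can also be sanity-checked on a small finite sample space. This is only a sketch under assumptions: the sample space size, the way events are drawn, the seed, and the helper names `random_probability_space`, `prob` and `lower_bound_holds` are all hypothetical.

```python
import random

def random_probability_space(size, rng):
    """Random probabilities on the sample space {0, ..., size-1}."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def prob(event, probs):
    """Probability of an event given as a set of outcomes."""
    return sum(probs[w] for w in event)

def lower_bound_holds(events, probs, tol=1e-12):
    """Check P(A_1 and ... and A_n) >= P(A_1) + ... + P(A_n) - (n - 1)."""
    intersection = set.intersection(*events)
    bound = sum(prob(a, probs) for a in events) - (len(events) - 1)
    return prob(intersection, probs) >= bound - tol

rng = random.Random(2)
checks = []
for _ in range(500):
    probs = random_probability_space(12, rng)
    n = rng.randint(2, 5)
    # Large events, so that the bound is often positive and hence non-trivial.
    events = [{w for w in range(12) if rng.random() < 0.9} for _ in range(n)]
    checks.append(lower_bound_holds(events, probs))
print(all(checks))
```

When the events are small the right hand side is negative and the bound is vacuous, which is why the sketch draws large events.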

StoneTemplePython · Feb 11, 2019

By the way, I'm not sure whether this is "obvious" or not, but I will also point out that the n-variable upper bound holds as well, via Cauchy-Schwarz

##P(A_1\cap A_2 \cap ... \cap A_n)= P(X_1 \leq x_1, X_2 \leq x_2, ..., X_n \leq x_n) \leq \Big(P(X_1 \leq x_1) P(X_2 \leq x_2) ... P(X_n \leq x_n)\Big)^{\frac{1}{n}}##
The underlying idea seems like higher dimensional Hölder inequality (or super additivity of the geometric mean as shown in the edit), but Cauchy Schwarz plus induction works just fine. So we already have the result for the n=2 case, which is our base case.

So we want to prove it for natural number ##n\geq 3##, i.e. we have the result for ##n-1## but need to show that this implies the result for ##n##. In what follows, we have ##n## events in total.

Now define:
##\mathbb I_{A_k}## to be the indicator r.v. that takes value 1 when the event ##A_k## occurs, and
##\mathbb I_{B_k} := \prod_{i\neq k}\mathbb I_{A_i}## i.e. the event ##\bigcap_{i\neq k}A_i##

so by the earlier argument, we have

## 0\leq P(X_1 \leq x_1, X_2 \leq x_2, ..., X_n \leq x_n)^2 = P(A_1\cap A_2 \cap ... \cap A_n)^2 = \big(E\big[\mathbb I_{A_k}\mathbb I_{B_k}\big]\big)^2 \leq E\big[\mathbb I_{A_k}^2\big] E\big[\mathbb I_{B_k}^2\big] = E\big[\mathbb I_{A_k}\big] E\big[\mathbb I_{B_k}\big] ##

i.e. we have
##0\leq P(X_1 \leq x_1, X_2 \leq x_2, ..., X_n \leq x_n) \leq E\big[\mathbb I_{A_k}\big]^\frac{1}{2}E\big[\mathbb I_{B_k}\big]^\frac{1}{2} = P\big(A_k\big)^\frac{1}{2}E\big[\mathbb I_{B_k}\big]^\frac{1}{2} \leq P\big(A_k\big)^\frac{1}{2}\Big(\prod_{i\neq k}P\big(A_i\big)^\frac{1}{n-1}\Big)^\frac{1}{2}##

where the right hand side inequality comes from applying the induction hypothesis.
Taking advantage of non-negativity, we can apply this for ##k=1, 2, ..., n## and multiply these inequalities to get

##0\leq P(X_1 \leq x_1, X_2 \leq x_2, ..., X_n \leq x_n)^n \leq \prod_{k=1}^{n} P\big(A_k\big)^\frac{1}{2}P\big(A_k\big)^\frac{n-1}{2(n-1)} = \prod_{k=1}^{n} P\big(A_k\big)##

taking nth roots gives
##0\leq P(X_1 \leq x_1, X_2 \leq x_2, ..., X_n \leq x_n) \leq \Big(\prod_{k=1}^{n} P\big(A_k\big)\Big)^\frac{1}{n}=\Big(P(X_1 \leq x_1) P(X_2 \leq x_2) ... P(X_n \leq x_n)\Big)^{\frac{1}{n}}##
which is the n dimensional form of the inequality

edit:
the result also follows by the super-additivity of the geometric mean.

consider the function
##f: \mathbb{R}^n_{\geq 0}\rightarrow \mathbb{R}_{\geq 0}##
where
##f\big(\mathbf v\big) = \big(\prod_{i=1}^n v_i\big)^\frac{1}{n}##
Equivalently this is just the geometric mean of ##\mathbf v##.

The function is negative convex (aka concave), because, for any ##p \in (0,1)## we have
##p\cdot f\Big(\mathbf v_1 \Big) + (1-p)\cdot f\Big(\mathbf v_2 \Big) = f\Big(p\mathbf v_1 \Big) + f\Big((1-p)\mathbf v_2 \Big)\leq f\Big(p\mathbf v_1+ (1-p)\mathbf v_2 \Big)##
by super additivity of the geometric mean

for convenience consider
##\mathbf V := \begin{bmatrix}
\mathbb I_{A_1} \\
\mathbb I_{A_2}\\
\vdots \\
\mathbb I_{A_n}
\end{bmatrix}##
i.e. a vector of (indicator) random variables

##P(X_1 \leq x_1, X_2 \leq x_2, ..., X_n \leq x_n) ##
##= E\big[\mathbb I_{A_1}\mathbb I_{A_2} ... \mathbb I_{A_n}\big]##
##= E\big[(\mathbb I_{A_1}\mathbb I_{A_2} ... \mathbb I_{A_n})^\frac{1}{n}\big]##
## = E\Big[f\big(\mathbf V\big)\Big]##
## \leq f\Big(E\big[\mathbf V\big]\Big)##
##= \Big(E\big[\mathbb I_{A_1}\big] E\big[\mathbb I_{A_2}\big] ... E\big[\mathbb I_{A_n}\big]\Big)^\frac{1}{n}##
##=\Big(P(X_1 \leq x_1) P(X_2 \leq x_2) ... P(X_n \leq x_n)\Big)^{\frac{1}{n}}##
by Jensen's Inequality.
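The super-additivity property that drives the concavity claim, ##f(\mathbf u) + f(\mathbf v) \leq f(\mathbf u + \mathbf v)## on non-negative vectors, is easy to probe numerically. The following is a minimal sketch, not a proof; the names `geometric_mean` and `is_superadditive_at`, the sampling range, and the tolerance are assumptions made for illustration.

```python
import math
import random

def geometric_mean(v):
    """Geometric mean of a vector with non-negative entries."""
    return math.prod(v) ** (1.0 / len(v))

def is_superadditive_at(u, v, tol=1e-9):
    """Check f(u) + f(v) <= f(u + v) for the geometric mean f."""
    total = [a + b for a, b in zip(u, v)]
    return geometric_mean(u) + geometric_mean(v) <= geometric_mean(total) + tol

rng = random.Random(3)
checks = []
for _ in range(1000):
    n = rng.randint(1, 6)
    u = [rng.uniform(0, 10) for _ in range(n)]
    v = [rng.uniform(0, 10) for _ in range(n)]
    checks.append(is_superadditive_at(u, v))
print(all(checks))
```

Together with the homogeneity ##f(p\mathbf v) = p f(\mathbf v)## for ##p \geq 0##, this is exactly what the concavity line above uses.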

Alternatively, since the Indicator Random variables are simple, and indeed binary, we could enumerate the outcomes and actually write the lower bound as a convex combination of ##2^n## geometric means (where at most one value is non zero) and the upper bound as a geometric mean of this convex combination -- the result then follows directly by super additivity of the geometric mean.

a later edit:
it seems a simpler and better solution slipped my mind at first.
a better approach to the upper bound is

##P(A_1\cap A_2 \cap ... \cap A_n)##
##= P(X_1 \leq x_1, X_2 \leq x_2, ..., X_n \leq x_n) ##
##\leq \min\Big\{ P(X_1 \leq x_1), P(X_2 \leq x_2), ... ,P(X_n \leq x_n)\Big\}##
##\leq \Big(P(X_1 \leq x_1) P(X_2 \leq x_2) ... P(X_n \leq x_n)\Big)^{\frac{1}{n}} ##

1.) There are finitely many terms, so there is some minimum probability in the below set
##\Big\{ P(X_1 \leq x_1), P(X_2 \leq x_2), ... ,P(X_n \leq x_n)\Big\}##

instead of introducing notation, we can assume WLOG that ##P(X_1 \leq x_1)## has the minimum, i.e. that
##P(X_1 \leq x_1) \leq p## for every ##p \in \Big\{ P(X_1 \leq x_1),P(X_2 \leq x_2), ... ,P(X_n \leq x_n)\Big\}##

2.) Via the original work with Cauchy Schwarz in the 2 variable case (or some other argument) we know that if ##P(X_1 \leq x_1) = 0##, then
##P(A_1\cap A_2 \cap ... \cap A_n)= P(X_1 \leq x_1, X_2 \leq x_2, ..., X_n \leq x_n) =0##, which gives an equality case of what we are trying to prove for an upper bound

3.) So to finish the argument we assume ## 0 \lt P(X_1 \leq x_1) ##, and since this is non-zero we may condition on it
##P\big(A_1\cap A_2 \cap ... \cap A_n\big) ##
##=P\big(A_1\big) \cdot P\big( A_2 \cap A_3 \cap ... \cap A_n \big \vert A_1 \big)##
## \leq P\big(A_1\big)##
##=P(X_1 \leq x_1)##
##\leq \Big(P(X_1 \leq x_1) P(X_2 \leq x_2) ... P(X_n \leq x_n)\Big)^{\frac{1}{n}} ##

where the inequalities follow because probabilities are bounded between zero and one, and a geometric mean is at least as large as its minimum component. I.e. taking nth roots of the inequalities in step 1 gives:

##0 \lt P(X_1 \leq x_1)^{\frac{1}{n}} \leq P(X_j \leq x_j)^{\frac{1}{n}}##
for ##1 \leq j \leq n##

and taking advantage of positivity and multiplying over this bound
## P(X_1 \leq x_1) = \prod_{j=1}^n P(X_1 \leq x_1)^{\frac{1}{n}} \leq \big(\prod_{j=1}^n P(X_j \leq x_j)\big)^{\frac{1}{n}}##
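The chain "intersection probability, then minimum, then geometric mean" can be checked on a small finite sample space as well. Again this is a hedged sketch rather than a proof: the space size, event sampling, seed, and the helper names `prob` and `chain_holds` are hypothetical.

```python
import math
import random

def prob(event, probs):
    """Probability of an event given as a set of outcomes."""
    return sum(probs[w] for w in event)

def chain_holds(events, probs, tol=1e-12):
    """Check P(intersection) <= min_k P(A_k) <= (prod_k P(A_k))^(1/n)."""
    p_events = [prob(a, probs) for a in events]
    p_all = prob(set.intersection(*events), probs)
    smallest = min(p_events)
    geo_mean = math.prod(p_events) ** (1.0 / len(p_events))
    return p_all <= smallest + tol and smallest <= geo_mean + tol

rng = random.Random(4)
checks = []
for _ in range(500):
    # Random probabilities on the sample space {0, ..., 9}.
    weights = [rng.random() for _ in range(10)]
    total = sum(weights)
    probs = [w / total for w in weights]
    n = rng.randint(2, 6)
    events = [{w for w in range(10) if rng.random() < 0.7} for _ in range(n)]
    checks.append(chain_holds(events, probs))
print(all(checks))
```

The middle term, the minimum, is the sharper bound; the geometric mean form is weaker but symmetric in the events.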

WWCY · Feb 15, 2019

Hi @StoneTemplePython , it wasn't obvious in the least to me, so thank you for pointing out this general inequality!

Thanks for taking the time, your comments on my probability theory posts have been absolutely invaluable to me.
