Question on random variables and histograms

1. Feb 14, 2014

mnb96

Hello,

I have two random variables $X$ and $Y$ that can take values of the kind $(a,b)$ where $a,b\in \{ 0,1,2,3 \}$. Thus, the sample space has only 16 elements.
I have, say, N observations for both X and Y, and I would like to know if there is some correlation between X and Y.

- How is this typically done? Perhaps by calculating the mutual information between X and Y?

- Should I treat X and Y as 2-dimensional random variables? Or is it better to convert them to 1-dimensional random variables by using the mapping $(a,b)\longmapsto 4a+b$ ?
In this way the sample space would simply become $\{ 0,1,2,\ldots,15\}$.
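
As a minimal sketch of the proposed relabelling (the helper names `encode` and `decode` are made up for illustration), the mapping is a bijection between the 16 pairs and the integers 0 to 15. It therefore leaves unchanged any quantity that depends only on the probabilities of the outcomes, but it can change quantities that depend on the numerical values themselves.

```python
def encode(a, b):
    """Map a pair (a, b) with a, b in {0, 1, 2, 3} to one integer in 0..15."""
    return 4 * a + b


def decode(k):
    """Inverse of encode: recover the pair (a, b) from k in 0..15."""
    return divmod(k, 4)


# Enumerate the whole sample space and its one-dimensional relabelling.
pairs = [(a, b) for a in range(4) for b in range(4)]
codes = [encode(a, b) for a, b in pairs]
```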

2. Feb 14, 2014

pamparana

3. Feb 14, 2014

Stephen Tashi

Are you asking how to estimate "correlation" in the technical sense of that word (e.g. "correlation coefficient", "Pearson's product-moment coefficient")? Or are you asking how to quantify the "dependence" between the random variables in a more general way?

(In either case, you won't "know" with certainty any facts about the correlation between X and Y just from a set of sample realizations of the random variables. You won't know with certainty the yes-or-no answer to whether they are correlated, and you won't know with certainty the probability that they are correlated or the probability that they are not. To get definite mathematical answers to such questions, you must make some specific assumptions about the random variables.

A common procedure in statistics is to assume the random variables are independent, quantify the probability that some aspect of the data has the value that was observed based on that assumption, and "accept" or "reject" the idea that the variables are independent. This is simply a procedure and does not answer any of the above questions. )

4. Feb 14, 2014

mnb96

Yes. The reason is that in this case the numbers are merely "labels". In the scenario I have in mind, an observation of the random variable X, say (a,b)=(2,3), could mean, for instance, that person-2 bought item-3.

The random variable Y has the same meaning, but the sample space of Y contains different persons than the ones of X.

We formulate the hypothesis that the two groups of people are independent, i.e. the purchases of the people of the first group do not have any influence on the purchases of the people in the second group.

Assuming we have collected a large number of observations, how can we test, and possibly reject, the above hypothesis of independence?

5. Feb 14, 2014

mathman

Random variables are correlated if E(XY)≠E(X)E(Y). There are procedures in sampling theory which can test whether or not this holds, with confidence limits.
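
A minimal Python sketch of the plug-in estimate of E(XY) - E(X)E(Y) is given here. It assumes the observations are numeric and come in matched pairs (x_i, y_i), which the thread has not established; the function name and the toy data are made up for illustration.

```python
def sample_covariance(xs, ys):
    """Plug-in estimate of E(XY) - E(X)E(Y) from paired numeric samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return mean_xy - mean_x * mean_y


# Toy paired data (hypothetical): Y moves with X, then against X.
cov_same = sample_covariance([0, 1, 2, 3], [0, 1, 2, 3])
cov_opposite = sample_covariance([0, 1, 2, 3], [3, 2, 1, 0])
```

A value near zero is only what uncorrelated variables would tend to produce; a formal test still has to decide whether an observed deviation from zero is larger than sampling noise.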

6. Feb 14, 2014

Stephen Tashi

If there is no natural order to the values of the variables (a natural order would be something like level of education: elementary, high school, college), you could write the data as a 4x4 "contingency table" and use a chi-square test.

7. Feb 18, 2014

mnb96

Let's say the sample space for A is given by all the pairs $(name,\; item)$ where:

$name \; \in\{ Alice, Abbey, Andrew, Adam \}$, and
$item \; \in\{ 1,2,3,4 \}$

The sample space for B is given instead by the pairs $(name,\; item)$ where:

$name \; \in\{ Bob, Brian, Barbara, Beth \}$, and
$item \; \in\{ 1,2,3,4 \}$

I have many observations, say 10000 observations for A, and 10000 observations for B. Some examples of observations of A are: (Adam,1), (Alice,3), (Alice,4), (Andrew,2), and so forth...

- In order to test the independence between the people of the two groups, should I just make a 2x4 contingency table where:
the rows are: Group A, Group B
the columns are: item 1, item 2, item 3, item 4
and then apply a Chi-square test?
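
The 2x4 table and the chi-square statistic described in this bullet can be sketched in Python as follows. The observations are toy data invented for illustration, and the helper names are hypothetical. The statistic is compared with 7.815, the 5% critical value of the chi-square distribution with (2-1)(4-1) = 3 degrees of freedom.

```python
from collections import Counter

ITEMS = [1, 2, 3, 4]
CRITICAL_5PCT_DF3 = 7.815  # chi-square critical value, 3 degrees of freedom


def item_counts(observations):
    """Count purchases per item, ignoring the name of the buyer."""
    counts = Counter(item for _name, item in observations)
    return [counts[i] for i in ITEMS]


def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip cells in an all-zero column
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy observations, repeated to get 100 purchases per group.
obs_a = [("Adam", 1), ("Alice", 3), ("Alice", 4), ("Andrew", 2)] * 25
obs_b = [("Bob", 1), ("Brian", 1), ("Barbara", 2), ("Beth", 4)] * 25

table = [item_counts(obs_a), item_counts(obs_b)]  # rows: group A, group B
stat = chi_square_statistic(table)
reject = stat > CRITICAL_5PCT_DF3
```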

- As an alternative method, can I instead populate a two-dimensional 16x16 histogram where on one axis I have all the possible events of A: (Alice,1) (Alice,2) (Alice,3) (Alice,4) (Abbey,1) (Abbey,2) (Abbey,3) (Abbey,4), and so forth, and on the other axis all the possible events of B?
The histogram would give me a reasonable estimate of the joint distribution of A and B, and from that I could calculate the mutual information, which is essentially the KL-divergence between $p_{AB}(x,y)$ and $p_A(x)p_B(y)$
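
A minimal Python sketch of this plug-in mutual information estimate is given here, assuming a table of joint counts is available, that is, that each observation of A is matched with an observation of B. The 2x2 tables are toy data chosen so the result can be checked by hand; a 16x16 table would be handled the same way. The function name is hypothetical.

```python
from math import log


def mutual_information(joint_counts):
    """Plug-in mutual information in nats: KL(p_AB || p_A * p_B)."""
    total = sum(sum(row) for row in joint_counts)
    p_row = [sum(row) / total for row in joint_counts]
    p_col = [sum(col) / total for col in zip(*joint_counts)]
    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing (0 * log 0 = 0)
                p = count / total
                mi += p * log(p / (p_row[i] * p_col[j]))
    return mi


# Toy joint histograms (hypothetical data).
mi_independent = mutual_information([[10, 10], [10, 10]])
mi_dependent = mutual_information([[20, 0], [0, 20]])
```

With a finite sample the plug-in estimate is generally positive even when the variables are independent, so a nonzero value on its own is not evidence of dependence.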

Which one of the two alternatives is "better"?

8. Feb 18, 2014

Stephen Tashi

I'm glad you put "better" in quotes. It shows you're prepared for the usual interrogation about what you are trying to accomplish.

First, we should mention that your proposed use of the Kullback-Leibler divergence makes it a function of the sample values (because you are estimating probabilities from observed frequencies). Thus you are proposing to use something that ought to be called a "Kullback-Leibler statistic" (just as chi-square is a "statistic").

Hypothesis testing is a crude way to approach a problem. It usually answers the question "Is there a difference?" and the answers are restricted to "yes" or "no". Estimation is a more refined approach. If you want to estimate the Kullback-Leibler divergence between the two probability distributions, then you might care to estimate the difference accurately even if it is a small difference. Most people who have the goal of a yes-or-no answer aren't disturbed if two things with small differences are "accepted" as being the same.

In searching the web for estimators of the K-L divergence, I found this PDF (which I haven't studied). https://www.google.com/url?sa=t&rct...vIfkv1GWksFlYA&bvm=bv.61535280,d.aWM&cad=rja/ Its references might help us. I didn't find anything useful in a quick search for "Kullback Leibler statistic". I did find some things using "empirical Kullback Leibler divergence".

My standard advice is to investigate problems with computer simulations. All applications of probability theory and statistics depend on assuming some model or models for how the data is generated. Are you a competent programmer in some computer language?
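
As one possible illustration of the simulation approach, here is a minimal Python sketch (Python is used only for illustration; the same idea carries over to any language). It assumes a simple model in which each group draws items independently from a fixed distribution, builds the 2x4 table of counts, and records how often the chi-square statistic exceeds 7.815 (the 5% critical value for 3 degrees of freedom). The probabilities, sample sizes, and function names are all made up.

```python
import random

CRITICAL_5PCT_DF3 = 7.815  # chi-square critical value, 3 degrees of freedom


def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip cells in an all-zero column
                stat += (observed - expected) ** 2 / expected
    return stat


def simulate_counts(rng, probs, n):
    """Draw n purchases from the item distribution probs and count them."""
    counts = [0] * len(probs)
    for item in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[item] += 1
    return counts


def rejection_rate(probs_a, probs_b, n=500, trials=200, seed=0):
    """Fraction of simulated data sets in which the test rejects at 5%."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        table = [simulate_counts(rng, probs_a, n),
                 simulate_counts(rng, probs_b, n)]
        if chi_square_statistic(table) > CRITICAL_5PCT_DF3:
            rejections += 1
    return rejections / trials


# Same item distribution in both groups, then clearly different ones.
same = rejection_rate([0.25] * 4, [0.25] * 4)
different = rejection_rate([0.25] * 4, [0.40, 0.30, 0.20, 0.10])
print(same, different)
```

Under this model the first rate should come out near the nominal 5% and the second should be much higher; changing the assumed model changes what the simulation says about the procedure.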

9. Feb 18, 2014

Stephen Tashi

Strictly speaking, that hypothesis test doesn't test for independence of the two groups. It would test whether two populations have the same probability distribution. My interpretation of the two populations being sampled would be:

Population 1: Pick a person P from group A from the probability distribution f_1() over the persons in it and then pick a choice that person P made from a probability distribution f_P() over the possible choices, where f_P() depends on P.

Population 2: Pick a person Q from group B from the probability distribution f_2() over the persons in it and then pick a choice that person Q made from a probability distribution f_Q() over the possible choices where f_Q() depends on Q and may be different than any f_P() from Population 1.

My intuition is that those two populations might have the same distribution, even if f_1 is not the same as f_2 and even if there is no way to match up people between the two groups who have the same probability distribution over the preferences. (e.g. In group A, perhaps Alice likes 1's a lot and Abbey doesn't like them much. In group B, perhaps Bob likes 1's somewhat and so does Brian. The bottom line for the two groups picking 1's might be the same.)
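
The intuition in the preceding paragraph can be checked with a small Python sketch. All the numbers are invented for illustration: each person is picked with probability 1/4, the people in group A have different preferences from one another, and everyone in group B has the same moderate preference. The overall item distribution of a group is the mixture of f_P(item) over persons P, weighted by the probability of picking P.

```python
def item_distribution(person_probs, choice_probs):
    """Overall item distribution: sum over persons P of f(P) * f_P(item)."""
    n_items = len(next(iter(choice_probs.values())))
    return [
        sum(person_probs[p] * choice_probs[p][k] for p in person_probs)
        for k in range(n_items)
    ]


# Hypothetical group A: Alice likes item 1 a lot, Abbey does not.
f_1 = {"Alice": 0.25, "Abbey": 0.25, "Andrew": 0.25, "Adam": 0.25}
prefs_a = {
    "Alice": [0.7, 0.1, 0.1, 0.1],
    "Abbey": [0.1, 0.3, 0.3, 0.3],
    "Andrew": [0.4, 0.2, 0.2, 0.2],
    "Adam": [0.4, 0.2, 0.2, 0.2],
}

# Hypothetical group B: everybody likes item 1 somewhat.
f_2 = {"Bob": 0.25, "Brian": 0.25, "Barbara": 0.25, "Beth": 0.25}
prefs_b = {name: [0.4, 0.2, 0.2, 0.2] for name in f_2}

dist_a = item_distribution(f_1, prefs_a)
dist_b = item_distribution(f_2, prefs_b)
```

Both groups end up with the item distribution (0.4, 0.2, 0.2, 0.2), so a 2x4 table of group against item could not tell them apart, even though no person in group B shares Alice's or Abbey's preferences.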

So the question is whether you want to "accept" two such populations as being the same?

10. Feb 18, 2014

Stephen Tashi

I don't see how you populate the cells of this histogram with data. A cell of the histogram would represent an event such as "Alice picked 3 and Brian picked 2". Do you have such data? You said your data consists of pairs such as {Alice, 3} and {Brian, 2} but you didn't say that you had a way to match up these samples. (i.e. you didn't say that you observed an outcome such as "Alice picked 3 when Brian picked 2").

11. Feb 18, 2014

mnb96

OK. I think I understand the difference between hypothesis testing and mutual information. The former gives a "yes/no" answer based on a specific decision procedure, while the latter gives a sort of measure of "distance" between two probability distributions (i.e. the joint distribution and the product of the marginal distributions).

This is actually good advice that I should follow, since I am very familiar with MATLAB.

I tried to follow your example and, after thinking about it, I started to suspect that in this particular example we could avoid keeping track of the name of the person who bought an item.
For instance, let's suppose that my hypothesis is that the people whose names start with an 'A' (the group A) have different tastes than the people whose names start with a 'B' (the group B).
In this case it should be sufficient to make a 2x4 contingency table where the rows stand for group A/group B and the columns represent item 1/2/3/4. Am I right?

You are right. I did not give (and I do not have) information from my data to populate a histogram in that way. I got the idea of using mutual information from this short section of a Wikipedia article, but I don't know if mutual information can be used to describe any statistical difference between group A and group B.

Last edited: Feb 18, 2014