Question on random variables and histograms

In summary, the Kullback-Leibler divergence is a measure of how different two probability distributions are from one another. Mutual information, which is the KL divergence between a joint distribution and the product of its marginals, quantifies general statistical dependence between two variables, whereas the Pearson product-moment coefficient captures only linear correlation.
  • #1
mnb96
Hello,

I have two random variables [itex]X[/itex] and [itex]Y[/itex] that take values of the form [itex](a,b)[/itex] where [itex]a,b\in \{ 0,1,2,3 \}[/itex]. Thus, the sample space has only 16 elements.
I have, say, N observations for both X and Y, and I would like to know if there is some correlation between X and Y.

- How is this typically done? Perhaps by calculating the mutual information between X and Y?

- Should I treat X and Y as 2-dimensional random variables? Or is it better to convert them to 1-dimensional random variables by using the mapping [itex](a,b)\longmapsto 4a+b[/itex] ?
In this way the sample space would simply become [itex]\{ 0,1,2,\ldots,15\}[/itex].
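
For concreteness, here is a small sketch of that encoding (Python, just for illustration):

[code]
# Sketch: encode a pair (a, b), with a, b in {0,1,2,3}, as 4*a + b in {0,...,15}.
def encode(a, b):
    assert a in range(4) and b in range(4)
    return 4 * a + b

def decode(k):
    return divmod(k, 4)  # recovers (a, b)

assert encode(2, 3) == 11
assert decode(11) == (2, 3)
[/code]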
 
  • #3
mnb96 said:
I would like to know if there is some correlation between X and Y.

Are you asking how to estimate "correlation" in the technical sense of that word (e.g. "correlation coefficient", "Pearson's product-moment coefficient")? Or are you asking how to quantify the "dependence" between the random variables in a more general way?

(In either case, using the data alone, you won't "know" with certainty any facts about the correlation between X and Y just from a set of sample realizations of the random variables. You won't know with certainty the yes-or-no answer to whether they are correlated and you won't know with certainty the probability they are correlated or the probability they are not correlated. To get definite mathematical answers to such questions, you must make some specific assumptions about the random variables.

A common procedure in statistics is to assume the random variables are independent, quantify the probability that some aspect of the data has the value that was observed based on that assumption, and "accept" or "reject" the idea that the variables are independent. This is simply a procedure and does not answer any of the above questions.)
 
  • #4
pamparana said:
Any reason why you don't want to use the usual correlation by estimating the mean and standard deviation of your two populations from the samples?

http://www.mathsisfun.com/data/correlation.html
Yes. The reason is that in this case the numbers are merely "labels". In the scenario I have in mind, an observation of the random variable X, say [itex](a,b)=(2,3)[/itex], could mean, for instance, that person-2 bought item-3.

The random variable Y has the same meaning, but the sample space of Y contains different persons from those of X.

We formulate the hypothesis that the two groups of people are independent, i.e. the purchases of the people of the first group do not have any influence on the purchases of the people in the second group.

Assuming we have collected a large number of observations, how can we test, and possibly reject, the above hypothesis of independence?
 
  • #5
Random variables are correlated if [itex]E(XY)\neq E(X)E(Y)[/itex]. There are procedures in sampling theory which can test whether or not this is true with confidence limits.
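
For instance, a minimal sketch using SciPy (the data below are made up, and note that it treats the encoded labels as ordinary numbers, which may or may not be appropriate here):

[code]
# Sketch: estimate Pearson's r and test H0 "no correlation".
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.integers(0, 16, size=1000)  # hypothetical encoded observations of X
y = rng.integers(0, 16, size=1000)  # hypothetical encoded observations of Y

r, p = pearsonr(x, y)
print(f"sample r = {r:.3f}, p-value = {p:.3f}")
[/code]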
 
  • #6
mnb96 said:
Yes. The reason is that in this case the numbers are merely "labels".

If there is no natural order to the values of the variables (an order like that of level of education: elementary, high school, college), you could write the data as a 4x4 "contingency table" and use a chi-square test.
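
A minimal sketch of that test in Python (the table entries are invented for illustration):

[code]
# Sketch: chi-square test of independence on a 4x4 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: values of one variable; columns: values of the other (hypothetical counts).
table = np.array([[30, 12,  8, 10],
                  [14, 25, 11, 10],
                  [ 9, 13, 28, 10],
                  [11, 10,  9, 30]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.4g}")
[/code]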
 
  • #7
Let's say the sample space for A is given by all the pairs [itex](name,\; item)[/itex] where:

[itex]name \; \in\{ Alice, Abbey, Andrew, Adam \}[/itex], and
[itex]item \; \in\{ 1,2,3,4 \}[/itex]

The sample space for B is given instead by the pairs [itex](name,\; item)[/itex] where:

[itex]name \; \in\{ Bob, Brian, Barbara, Beth \}[/itex], and
[itex]item \; \in\{ 1,2,3,4 \}[/itex]

I have many observations, say 10000 observations for A, and 10000 observations for B. Some examples of observations of A are: (Adam,1), (Alice,3), (Alice,4), (Andrew,2), and so forth...

- In order to test the independence between the people of the two groups, should I just make a 2x4 contingency table where:
the rows are: Group A, Group B
the columns are: item 1, item 2, item 3, item 4
and then apply a Chi-square test?

- As an alternative method, could I instead populate a two-dimensional 16x16 histogram, where on one axis I have all the possible events of A: (Alice,1), (Alice,2), (Alice,3), (Alice,4), (Abbey,1), (Abbey,2), and so forth, and on the other axis all the possible events of B?
The histogram would give me a reasonable estimate of the joint distribution of A and B, and from that I could calculate the mutual information, which is essentially the KL-divergence between [itex]p_{AB}(x,y)[/itex] and [itex]p_A(x)p_B(y)[/itex].
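
To make this concrete, here is a sketch of how I would compute it (Python; it assumes paired observations of A and B, and the data are simulated placeholders):

[code]
# Sketch: estimate mutual information I(A;B) from a 16x16 joint histogram.
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 16, size=10000)  # hypothetical encoded events of A
b = rng.integers(0, 16, size=10000)  # hypothetical encoded events of B

joint, _, _ = np.histogram2d(a, b, bins=(16, 16), range=[[0, 16], [0, 16]])
p_ab = joint / joint.sum()              # estimate of p_AB(x, y)
p_a = p_ab.sum(axis=1, keepdims=True)   # marginal p_A(x)
p_b = p_ab.sum(axis=0, keepdims=True)   # marginal p_B(y)

mask = p_ab > 0                         # avoid log(0)
mi = np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask]))
print(f"estimated mutual information: {mi:.4f} nats")
[/code]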

Which one of the two alternatives is "better"?
 
  • #8
I'm glad you put "better" in quotes. It shows you're prepared for the usual interrogation about what you are trying to accomplish.

First, we should mention that your proposed use of the Kullback-Leibler divergence makes it a function of the sample values (because you are estimating probabilities from observed frequencies). Thus you are proposing to use something that ought to be called a "Kullback-Leibler statistic". (Just as chi-square is a "statistic".)

Hypothesis Testing is a crude way to approach a problem. It usually answers the question "Is there a difference?" and the answers are restricted to "yes" or "no". Estimation is a more refined approach. If you want to estimate the Kullback-Leibler divergence between the two probability distributions, then you might care to estimate the difference accurately even if it is a small difference. Most people who have the goal of a yes-or-no answer aren't disturbed if two things with small differences are "accepted" as being the same.

In searching the web for estimators of the K-L divergence, I found this PDF (which I haven't studied): https://www.google.com/url?sa=t&rct...vIfkv1GWksFlYA&bvm=bv.61535280,d.aWM&cad=rja/ Its references might help us. I didn't find anything useful in a quick search for "Kullback Leibler statistic". I did find some things using "empirical Kullback Leibler divergence".

My standard advice is to investigate problems with computer simulations. All applications of probability theory and statistics depend on assuming some model or models for how the data is generated. Are you a competent programmer in some computer language?
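
For instance, here is a sketch of the kind of simulation I mean (the generating model, independent uniform choices, is a pure assumption for illustration): simulate data under independence many times, compute your statistic each time, and see where the observed value would fall.

[code]
# Sketch: null distribution of the "Kullback-Leibler statistic" by simulation.
import numpy as np

rng = np.random.default_rng(0)

def mi_statistic(a, b, k=16):
    joint, _, _ = np.histogram2d(a, b, bins=(k, k), range=[[0, k], [0, k]])
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / (pa @ pb)[m]))

# Assumed model: A and B independent and uniform on {0,...,15}.
null = [mi_statistic(rng.integers(0, 16, 10000),
                     rng.integers(0, 16, 10000)) for _ in range(500)]
print(f"null MI: mean = {np.mean(null):.4f}, "
      f"95th percentile = {np.percentile(null, 95):.4f}")
[/code]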
 
  • #9
mnb96 said:
- In order to test the independence between the people of the two groups, should I just make a 2x4 contingency table where:
the rows are: Group A, Group B
the columns are: item 1, item 2, item 3, item 4
and then apply a Chi-square test?

Strictly speaking, that hypothesis test doesn't test for independence of the two groups. It would test whether two populations have the same probability distribution. My interpretation of the two populations being sampled would be:

Population 1: Pick a person P from group A from the probability distribution f_1() over the persons in it and then pick a choice that person P made from a probability distribution f_P() over the possible choices, where f_P() depends on P.

Population 2: Pick a person Q from group B from the probability distribution f_2() over the persons in it and then pick a choice that person Q made from a probability distribution f_Q() over the possible choices, where f_Q() depends on Q and may be different from any f_P() from Population 1.

My intuition is that those two populations might have the same distribution, even if f_1 is not the same as f_2 and even if there is no way to match up people between the two groups who have the same probability distribution over the preferences. (e.g. In group A, perhaps Alice likes 1's a lot and Abbey doesn't like them much. In group B, perhaps Bob likes 1's somewhat and so does Brian. The bottom line for the two groups picking 1's might be the same.)
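
A quick numeric check of that intuition (the per-person distributions are invented):

[code]
# Sketch: different per-person preferences, identical group-level distributions.
import numpy as np

# Rows: persons (picked uniformly); columns: P(item | person) for items 1..4.
group_a = np.array([[0.70, 0.10, 0.10, 0.10],   # Alice: likes 1's a lot
                    [0.10, 0.30, 0.30, 0.30]])  # Abbey: doesn't like them much

group_b = np.array([[0.40, 0.20, 0.20, 0.20],   # Bob: likes 1's somewhat
                    [0.40, 0.20, 0.20, 0.20]])  # Brian: likes 1's somewhat

# "Bottom line" item distribution for each group: average over its persons.
print(group_a.mean(axis=0))  # [0.4 0.2 0.2 0.2]
print(group_b.mean(axis=0))  # [0.4 0.2 0.2 0.2] -- the same
[/code]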

So the question is whether you want to "accept" two such populations as being the same?
 
  • #10
mnb96 said:
The histogram would give me a reasonable estimate of the joint distribution of A and B

I don't see how you populate the cells of this histogram with data. A cell of the histogram would represent an event such as "Alice picked 3 and Brian picked 2". Do you have such data? You said your data consists of pairs such as {Alice, 3} and {Brian, 2} but you didn't say that you had a way to match up these samples. (i.e. you didn't say that you observed an outcome such as "Alice picked 3 when Brian picked 2").
 
  • #11
Stephen Tashi said:
Hypothesis Testing is a crude way to approach a problem. It usually answers the question "Is there a difference?" and the answers are restricted to "yes" or "no". Estimation is a more refined approach.

Ok, I think I understand the difference between hypothesis testing and mutual information: the former gives a "yes/no" answer based on a specific decision procedure, while the latter gives a sort of measure of "distance" between two probability distributions (i.e. the joint distribution and the product of the marginal distributions).
Stephen Tashi said:
My standard advice is to investigate problems with computer simulations. All applications of probability theory and statistics depend on assuming some model or models for how the data is generated. Are you a competent programmer in some computer language?

This is actually good advice that I should follow, since I am very familiar with MATLAB.
Stephen Tashi said:
Strictly speaking, that hypothesis test doesn't test for independence of the two groups. It would test whether two populations have the same probability distribution. My interpretation of the two populations being sampled would be:

[...]

So the question is whether you want to "accept" two such populations as being the same?

I tried to follow your example, and after thinking about it, I started to suspect that perhaps, in this particular example, we could avoid keeping track of the name of the person who bought an item.
For instance, let's suppose that my hypothesis is that the people whose names start with an 'A' (the group A) have different tastes than the people whose names start with a 'B' (the group B).
In this case it should be sufficient to make a 2x4 contingency table where the rows stand for group A/group B and the columns represent item 1/2/3/4. Am I right?
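
Concretely, I would build the table from the raw observations like this (a sketch; the observation lists are placeholders):

[code]
# Sketch: 2x4 contingency table (rows: group A / group B; columns: items 1..4)
# built from raw (name, item) observations, then a chi-square test.
import numpy as np
from scipy.stats import chi2_contingency

obs_a = [("Adam", 1), ("Alice", 3), ("Alice", 4), ("Andrew", 2)]  # placeholder data
obs_b = [("Bob", 2), ("Beth", 2), ("Brian", 1), ("Barbara", 4)]   # placeholder data

table = np.zeros((2, 4), dtype=int)
for _, item in obs_a:
    table[0, item - 1] += 1
for _, item in obs_b:
    table[1, item - 1] += 1

chi2, p, dof, _ = chi2_contingency(table)
print(table)
print(f"p-value = {p:.4f}")
[/code]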
Stephen Tashi said:
I don't see how you populate the cells of this histogram with data. A cell of the histogram would represent an event such as "Alice picked 3 and Brian picked 2". Do you have such data? You said your data consists of pairs such as {Alice, 3} and {Brian, 2} but you didn't say that you had a way to match up these samples. (i.e. you didn't say that you observed an outcome such as "Alice picked 3 when Brian picked 2").

You are right. I did not give (and I do not have) information in my data to populate a histogram in that way. I got the idea of using mutual information from this short section of a Wikipedia article, but I don't know if mutual information can be used to describe any statistical difference between group A and group B.
 

1. What is a random variable?

A random variable is a variable whose value is determined by chance or randomness. It is often denoted by the letter "X" and can take on different values based on the outcome of a particular experiment or situation.

2. How is a random variable different from a regular variable?

A regular variable has a specific and known value, while a random variable has a range of possible values determined by chance. Regular variables are used to represent known quantities, while random variables are used to represent uncertain quantities.

3. What is a histogram?

A histogram is a graphical representation of the distribution of a dataset. It consists of a series of bars, where the height of each bar represents the frequency or relative frequency of values falling within a certain range or bin.

4. How are random variables and histograms related?

Random variables are often used to represent the data in a dataset, and histograms are used to visualize the distribution of that data. The values of the random variable are plotted on the x-axis, while the frequency or relative frequency is plotted on the y-axis in a histogram.
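
For example, a minimal sketch in Python:

[code]
# Sketch: text histogram of observed values of a discrete random variable.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 16, size=1000)  # hypothetical observations

values, counts = np.unique(x, return_counts=True)
for v, c in zip(values, counts):
    print(f"{v:2d}: {'#' * (c // 5)}  ({c})")  # one bar per value
[/code]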

5. What is the purpose of using a histogram to represent random variables?

Histograms provide a visual representation of the data, making it easier to understand the distribution and characteristics of the random variable. They allow for quick comparisons between different datasets and can reveal patterns or trends in the data.
