Comparing two multivariate distributions (two matrices)

In summary, the poster wants help comparing two datasets for an MS thesis. Their search turned up tests for multivariate normality, but they want to compare the datasets whatever their distributions are. They cite an old paper suggesting that the eigenvalues of the covariance matrix measure the spread or scatter of the data, want parameters of comparison that still apply when the data are not normal, and welcome suggestions from a mathematical point of view. They later explain that the datasets will be used to build regression models, and that they want to relate differences in data structure to the resulting models' performance. One reply suggests comparing the marginals (individual variables) first and then the joint distribution.
  • #1
rayms
I urgently need some help with a problem for my MS thesis. I have two datasets with the same variables but different numbers of observations, i.e. the same number of columns but not the same number of rows. The variables are identical for both sets. I want to compare the multivariate distributions of the two datasets. I have done some Google research on the matter, and all I could find are tests for normality of multivariate samples. Although that information is also useful, I am more interested in comparing my two datasets whatever their distributions may be. In what way should I compare them? What are the parameters of comparison? For a multivariate normal distribution, I have read in an old paper (1983) by Hannu Oja that the eigenvalues of the covariance matrix are a measure of the spread or scatter of the data, or at least that is how I understood it. Please confirm whether I am right or wrong. This is as far as my search for answers could go. I want to have parameters of comparison even if my datasets are not normal. Since I am dealing with two matrices, I welcome suggestions from the mathematical point of view as well. Thank you very much in advance for any and all suggestions.
Rayms
 
  • #2
I'd stick with simple methods as far as possible - find out the purpose of the exercise; explore the data; decide which 2 or 3 variables are most "important"; then compare using some appropriate charts (e.g. scatter, histogram, box/whisker etc).
 
  • #3
bpet said:
I'd stick with simple methods as far as possible - find out the purpose of the exercise; explore the data; decide which 2 or 3 variables are most "important"; then compare using some appropriate charts (e.g. scatter, histogram, box/whisker etc).

Thanks for the reply, bpet. I was already assuming nobody cares and nobody reads my thread. But I exaggerate. The problem is that I cannot reduce the dimension of my data any further; it is already reduced from the original. I guess I have to be more specific about what I am using the data for. The datasets will be used to build regression models. One of my hypotheses is that the validity or predictive accuracy of the models depends on the internal structure of the datasets used. This internal structure can be described by their distributions and other mathematical properties. In other words, I am trying to look at differences in the data structure of the two sets and relate these differences to the resulting models' performance.
 
  • #4
Ok, probably the easiest way is first to compare the marginals and then the joint distribution.

The marginals (i.e. individual variables) can be compared using the usual univariate nonparametric two-sample tests (KS, AD, CvM, MW etc).
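As a rough illustration of the marginal comparison, here is a minimal sketch that runs a two-sample KS test on each column. It assumes NumPy and SciPy are available; the function name `compare_marginals`, the Bonferroni correction, and the synthetic data are illustrative choices, not something from the thread.

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_marginals(X, Y, alpha=0.05):
    """Two-sample KS test on each column of X and Y.

    X and Y are (rows, columns) arrays with the same number of columns;
    the row counts may differ.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape[1] != Y.shape[1]:
        raise ValueError("X and Y must have the same number of columns")
    n_vars = X.shape[1]
    results = []
    for j in range(n_vars):
        res = ks_2samp(X[:, j], Y[:, j])
        # Bonferroni correction: one test per variable, so compare
        # each p-value with alpha / n_vars.
        reject = bool(res.pvalue < alpha / n_vars)
        results.append((j, float(res.statistic), float(res.pvalue), reject))
    return results

# Synthetic example (hypothetical data): the third variable is shifted.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = rng.normal(size=(150, 3))
Y[:, 2] += 1.0
marginal_results = compare_marginals(X, Y)
for j, stat, p, reject in marginal_results:
    print(f"variable {j}: D = {stat:.3f}, p = {p:.3g}, differs = {reject}")
```

The same loop works with other univariate tests (for example `scipy.stats.mannwhitneyu`) by swapping the call inside it.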

If no significant differences are found in the marginals, and if the marginals are continuous, you could descale the data by converting the rank order of each variable to percentiles. Then try some graphical tools (such as parallel coordinates, Andrews plots, scatter matrices etc). There also exist several multivariate distribution-free two-sample tests, but I don't know a lot about that area.
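The rank-to-percentile step could look something like the sketch below. It assumes continuous data (no ties) and ranks each variable within the pooled sample so that both datasets end up on a common scale; pooling and the function name `to_percentiles` are my assumptions, not part of the suggestion above.

```python
import numpy as np

def to_percentiles(X, Y):
    """Replace each value by its rank-based percentile within the pooled data.

    Ranks are computed column by column, so the marginal scales are removed
    and only the ordering (and hence the dependence structure) is kept.
    Assumes no ties, i.e. continuous variables.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    pooled = np.vstack([X, Y])
    n = pooled.shape[0]
    # argsort of argsort gives 0-based ranks; shift to 1..n.
    ranks = pooled.argsort(axis=0).argsort(axis=0) + 1
    U = ranks / (n + 1)  # strictly between 0 and 1
    return U[: len(X)], U[len(X):]

# Tiny hypothetical example: two variables on very different scales.
UX, UY = to_percentiles([[1.0, 10.0], [3.0, 30.0]], [[2.0, 20.0]])
print(UX)
print(UY)
```

The transformed arrays can then be passed to whatever plotting tool is used for the scatter matrices or parallel coordinates.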

Also the above assumes that your data sets consist of IID observations, e.g. for time series models other methods might be more suitable.
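One example of a multivariate distribution-free two-sample test is a permutation test on the energy distance between the two samples. This is a technique I am swapping in for illustration, not one named in the thread, and it relies on exactly the IID assumption above (the rows must be exchangeable under the null). A minimal NumPy sketch, with hypothetical function names and synthetic data:

```python
import numpy as np

def pairwise_distances(Z):
    """Euclidean distance between every pair of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def energy_statistic(D, idx_x, idx_y):
    """Energy distance between two groups of rows, read off a
    precomputed distance matrix D of the pooled data."""
    between = D[np.ix_(idx_x, idx_y)].mean()
    within_x = D[np.ix_(idx_x, idx_x)].mean()
    within_y = D[np.ix_(idx_y, idx_y)].mean()
    return 2.0 * between - within_x - within_y

def energy_permutation_test(X, Y, n_perm=500, seed=0):
    """Permutation p-value for H0: X and Y come from the same distribution."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    pooled = np.vstack([X, Y])
    n_x, n_total = len(X), len(pooled)
    D = pairwise_distances(pooled)
    observed = energy_statistic(D, np.arange(n_x), np.arange(n_x, n_total))
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(n_perm):
        # Randomly reassign rows to the two groups, keeping group sizes.
        perm = rng.permutation(n_total)
        if energy_statistic(D, perm[:n_x], perm[n_x:]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value

# Synthetic example (hypothetical data): Y is shifted in every variable.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
Y = rng.normal(size=(50, 3)) + 1.5
stat, p_value = energy_permutation_test(X, Y, n_perm=200)
print(f"energy distance = {stat:.3f}, permutation p-value = {p_value:.4f}")
```

The full distance matrix needs memory proportional to the square of the pooled sample size, so for large datasets a subsample may be needed.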

HTH
 
  • #5


I understand your urgency in finding a solution for your thesis problem. Comparing two multivariate distributions is a complex task and requires careful consideration of various factors.

Firstly, it is important to determine the purpose of your comparison. Are you looking to assess the similarity or difference between the two datasets? Are you interested in identifying potential outliers or patterns within the data? Once you have a clear objective, you can then choose an appropriate statistical method for comparison.

One approach could be a multivariate analysis of variance (MANOVA), which for two groups reduces to Hotelling's T² test. It compares the mean vectors of the two groups across all variables at once, but note that the classical version assumes multivariate normality and equal covariance matrices; only rank-based or permutation variants relax this. Another option is a nonparametric test such as the Kolmogorov-Smirnov test, which compares empirical cumulative distribution functions. The standard KS test is univariate, so it would be applied to one variable at a time.

In terms of parameters of comparison, you could consider measures such as the mean, median, standard deviation, and range of each variable in the two datasets. Additionally, you could look at measures of association, such as correlation coefficients, to see whether the variables relate to each other in the same way within each dataset.
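A minimal sketch of such a side-by-side summary, assuming NumPy; the function name `summarize` and the synthetic data are only for illustration.

```python
import numpy as np

def summarize(X):
    """Per-variable summary statistics plus the correlation matrix."""
    X = np.asarray(X, dtype=float)
    return {
        "mean": X.mean(axis=0),
        "median": np.median(X, axis=0),
        "std": X.std(axis=0, ddof=1),            # sample standard deviation
        "range": X.max(axis=0) - X.min(axis=0),
        "corr": np.corrcoef(X, rowvar=False),    # columns are variables
    }

# Synthetic example (hypothetical data) with different row counts.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 3))
Y = rng.normal(loc=0.5, size=(80, 3))
summary_x, summary_y = summarize(X), summarize(Y)
for key in ("mean", "median", "std", "range"):
    print(key, "difference:", np.round(summary_x[key] - summary_y[key], 3))
print("largest correlation difference:",
      np.abs(summary_x["corr"] - summary_y["corr"]).max().round(3))
```

These are descriptive comparisons only; they do not by themselves say whether a difference is statistically significant.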

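Regarding your question about the eigenvalues of the covariance matrix, it is correct that they can be used as a measure of spread: each eigenvalue is the variance of the data along one principal axis, their sum is the total variance, and their product is the determinant of the covariance matrix (the generalized variance). However, this only captures second-order structure (variances and covariances). It says nothing about location, skewness, tails, or nonlinear dependence, and for non-normal data two quite different distributions can share the same covariance matrix.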
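To make the eigenvalue idea concrete, here is a small sketch that computes the eigenvalues of each sample covariance matrix along with two scalar summaries of spread. It assumes NumPy; the function name `covariance_spread` and the example data are hypothetical.

```python
import numpy as np

def covariance_spread(X):
    """Eigenvalues of the sample covariance matrix and two scalar summaries."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)              # columns are variables
    # eigvalsh is for symmetric matrices and returns ascending order;
    # reverse so the largest variance direction comes first.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    return {
        "eigenvalues": eigvals,
        "total_variance": eigvals.sum(),        # equals trace(cov)
        "generalized_variance": eigvals.prod(), # equals det(cov)
    }

# Synthetic example (hypothetical data): Y is more spread out than X.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))
Y = 2.0 * rng.normal(size=(100, 3))
spread_x, spread_y = covariance_spread(X), covariance_spread(Y)
print("X eigenvalues:", np.round(spread_x["eigenvalues"], 3))
print("Y eigenvalues:", np.round(spread_y["eigenvalues"], 3))
```

Comparing the sorted eigenvalues shows whether one dataset is more spread out overall, but not whether the principal directions coincide; for that the eigenvectors would have to be compared as well.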

In addition to statistical methods, you could also explore visualization techniques such as scatter plots, box plots, and parallel coordinate plots to compare the distributions of the two datasets.

Overall, it is important to carefully consider your research question and choose an appropriate method for comparison. I hope this information helps and wish you the best of luck with your thesis.
 

1. What is the purpose of comparing two multivariate distributions?

The purpose of comparing two multivariate distributions, here represented by two data matrices with the same columns, is to find out whether the two samples plausibly come from the same population and, if not, how they differ. This can help identify patterns, similarities, and differences in the location, spread, and dependence of the variables in the two datasets.

2. How do you compare two multivariate distributions?

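To compare two multivariate distributions, you can test for a difference in mean vectors with multivariate analysis of variance (MANOVA), or compare the covariance structure of each dataset with principal component analysis (PCA). Canonical correlation analysis (CCA) is less suited to this problem, since it relates two sets of variables measured on the same observations rather than two separate samples. These methods can help to identify significant differences between the two distributions and to visualize the data in a meaningful way.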

3. What are some common measures used to compare two multivariate distributions?

Some common measures used to compare two multivariate distributions include the mean, standard deviation, correlation coefficient, and covariance. These measures provide information on the central tendency, spread, and relationship between the variables in the two distributions.

4. Can two multivariate distributions with different dimensions be compared?

It depends on which dimension differs. Two datasets with different numbers of observations (rows) can be compared without difficulty, since two-sample methods do not require equal sample sizes. Two datasets with different numbers of variables (columns) cannot be compared directly as distributions; the comparison has to be restricted to the variables they share, or both datasets have to be mapped into a common space first.

5. How can comparing two multivariate distributions be useful in scientific research?

Comparing two multivariate distributions can be useful in scientific research as it allows for a deeper understanding of the relationship between different variables. This can help to identify important factors, variables, or patterns that may be influencing the outcome of a study. It can also aid in hypothesis testing and in identifying areas for further research.
