Comparing two multivariate distributions (two matrices)

Discussion Overview

The discussion revolves around comparing two multivariate datasets that share the same variable dimensions but differ in the number of observations. Participants explore methods for comparing the distributions of these datasets, regardless of their underlying distributions, and seek parameters for such comparisons.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant seeks guidance on comparing multivariate distributions and questions the relevance of eigenvalues of the covariance matrix as a measure of data spread.
  • Another participant suggests using simple methods, emphasizing the importance of understanding the purpose of the analysis and focusing on key variables before employing visual comparison techniques such as scatter plots and histograms.
  • A later reply proposes starting with comparisons of marginal distributions using nonparametric tests, and if no significant differences are found, suggests graphical tools for further analysis.
  • There is mention of the need to consider the internal structure of the datasets and its relation to the performance of regression models, indicating that the analysis aims to understand how differences in data structure affect model validity.
  • Concerns are raised about the assumption of IID observations, suggesting that alternative methods may be necessary for time series data.

Areas of Agreement / Disagreement

Participants express various methods for comparison, with no consensus on a single approach. There are differing opinions on the importance of dimensionality reduction and the types of tests or visualizations to employ.

Contextual Notes

Some limitations include the assumption of IID observations for certain methods, and the discussion does not resolve the appropriateness of various statistical tests for the datasets in question.

Who May Find This Useful

Researchers and students working with multivariate data analysis, particularly in the context of regression modeling and distribution comparison.

rayms
I urgently need some help with a problem for my MS thesis. I have two datasets with the same variable dimension but different numbers of observations, i.e. the same number of columns but not the same number of rows. The variables are identical for both sets. I want to compare the multivariate distributions of the two datasets. I have done some Google research on the matter, and all I could find are tests for normality of multivariate samples. Although that information is also useful, I am more interested in comparing my two datasets whatever their distributions may be. In what way should I compare them? What are the parameters of comparison? For a multivariate normal distribution, I have read in an old paper (1983) by Hannu Oja that the eigenvalues of the covariance matrix are a measure of the spread or scatter of the data, or at least that is how I understood it. Please confirm whether I am right or wrong. This is as far as my search for answers could go. I want to have parameters of comparison even if my datasets are not normal. Since I am dealing with two matrices, I also welcome suggestions from the mathematical point of view. Thank you very much in advance for any and all suggestions.
Rayms
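
On the eigenvalue question: the eigenvalues of a sample covariance matrix are the variances along its principal axes, so they do describe spread, and they can be computed for any dataset, normal or not (for non-normal data they just no longer characterize the whole distribution). A minimal sketch in Python with NumPy, assuming each dataset is an array with observations in rows; the function name `spread_summary` and the random arrays `A` and `B` are hypothetical stand-ins, not the thesis data:

```python
import numpy as np

def spread_summary(X):
    """Eigenvalue-based spread measures for an (n_obs, n_vars) array."""
    cov = np.cov(X, rowvar=False)            # sample covariance, columns are variables
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # real eigenvalues of a symmetric matrix, descending
    return {
        "eigenvalues": eigvals,
        "total_variance": eigvals.sum(),         # equals trace(cov)
        "generalized_variance": eigvals.prod(),  # equals det(cov)
    }

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))          # hypothetical dataset 1
B = rng.normal(size=(350, 3)) * 2.0    # hypothetical dataset 2, same columns, larger scale
sa, sb = spread_summary(A), spread_summary(B)
print(sa["eigenvalues"], sb["eigenvalues"])
```

The sum of the eigenvalues equals the trace of the covariance matrix (total variance) and their product equals its determinant (generalized variance); both are single-number summaries that can be compared across two datasets with different numbers of rows.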
 
I'd stick with simple methods as far as possible - find out the purpose of the exercise; explore the data; decide which 2 or 3 variables are most "important"; then compare using some appropriate charts (e.g. scatter, histogram, box/whisker etc).
 
bpet said:
I'd stick with simple methods as far as possible - find out the purpose of the exercise; explore the data; decide which 2 or 3 variables are most "important"; then compare using some appropriate charts (e.g. scatter, histogram, box/whisker etc).

Thanks for the reply, bpet. I was already assuming nobody cares and nobody reads my thread. But I exaggerate. The problem is that I cannot reduce the dimension of my data any further; it is already reduced from the original. I guess I have to be more specific about what I am using the data for. The datasets will be used to build regression models. One of my hypotheses is that the validity or predictive accuracy of the models depends on the internal structure of the datasets used. This internal structure can be described by their distributions and other mathematical properties. In other words, I am trying to look at differences in the data structure of the two sets and relate these differences to the resulting models' performance.
 
Ok, probably the easiest way is first to compare the marginals and then the joint distribution.

The marginals (i.e. individual variables) can be compared using the usual univariate nonparametric two-sample tests (Kolmogorov-Smirnov, Anderson-Darling, Cramér-von Mises, Mann-Whitney etc).
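
A rough sketch of that column-by-column comparison, assuming Python with SciPy (version 1.7 or later for `cramervonmises_2samp`). The helper `compare_marginals`, the Bonferroni threshold, and the simulated arrays `X` and `Y` are illustrative assumptions, not something specified in this thread; `scipy.stats.anderson_ksamp` could be added in the same way:

```python
import numpy as np
from scipy import stats

def compare_marginals(X, Y, alpha=0.05):
    """Two-sample tests per column; X and Y may have different numbers of rows."""
    n_vars = X.shape[1]
    results = []
    for j in range(n_vars):
        ks = stats.ks_2samp(X[:, j], Y[:, j])
        cvm = stats.cramervonmises_2samp(X[:, j], Y[:, j])
        mw = stats.mannwhitneyu(X[:, j], Y[:, j], alternative="two-sided")
        results.append({
            "variable": j,
            "ks_p": ks.pvalue,
            "cvm_p": cvm.pvalue,
            "mw_p": mw.pvalue,
            # Bonferroni: divide alpha by the number of variables tested
            "ks_reject": bool(ks.pvalue < alpha / n_vars),
        })
    return results

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))   # hypothetical dataset 1
Y = rng.normal(size=(350, 3))   # hypothetical dataset 2
Y[:, 2] += 1.0                  # shift the last variable so a difference exists
results = compare_marginals(X, Y)
for row in results:
    print(row)
```

Because one test is run per variable, some multiple-testing correction is advisable; Bonferroni is used here only because it is the simplest choice.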

If no significant differences are found in the marginals, and if the marginals are continuous, you could remove the scale from the data by converting the rank order to percentiles. Then try some graphical tools (such as parallel coordinates, Andrews plots, scatter plot matrices etc). There also exist several multivariate distribution-free two-sample tests, but I don't know a lot about that area.
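
The rank-to-percentile step might look like the sketch below, assuming Python with SciPy and assuming the ranks are computed within each dataset separately. The name `to_percentiles` and the array `A` are hypothetical:

```python
import numpy as np
from scipy.stats import rankdata

def to_percentiles(X):
    """Replace each column by rank / (n + 1), giving values strictly inside (0, 1)."""
    n = X.shape[0]
    return rankdata(X, axis=0) / (n + 1)

rng = np.random.default_rng(2)
A = rng.lognormal(size=(200, 3))   # hypothetical skewed data
U = to_percentiles(A)

# With the marginal scales removed, only the dependence structure remains;
# for continuous data (no ties) this matrix is the Spearman rank correlation.
rank_corr = np.corrcoef(U, rowvar=False)
print(rank_corr)
```

After this transform every column is approximately uniform regardless of its original units, so plots of `U` from the two datasets can be overlaid directly even when the row counts differ.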

Also, the above assumes that your datasets consist of IID observations; for time series data, for example, other methods might be more suitable.
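
As one concrete instance of a multivariate distribution-free two-sample test that leans on exactly this IID assumption, here is a sketch of a permutation test based on the energy distance. This technique is not named anywhere in the thread; it is chosen here purely for illustration, and the function names, sample sizes, and number of permutations are all assumptions:

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_statistic(X, Y):
    """Energy distance estimate: 2 E|X-Y| - E|X-X'| - E|Y-Y'| with Euclidean norms."""
    return 2 * cdist(X, Y).mean() - cdist(X, X).mean() - cdist(Y, Y).mean()

def permutation_test(X, Y, n_perm=200, seed=0):
    """Permutation p-value; valid only if rows are exchangeable (IID) under the null."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    n = len(X)
    observed = energy_statistic(X, Y)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))   # reshuffle the dataset labels
        stat = energy_statistic(pooled[idx[:n]], pooled[idx[n:]])
        count += int(stat >= observed)
    return observed, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))          # hypothetical dataset 1
Y = rng.normal(size=(80, 3)) + 1.0    # hypothetical dataset 2, shifted mean
observed, p_value = permutation_test(X, Y)
print(observed, p_value)
```

Shuffling the rows between the two datasets is only justified when the observations are exchangeable, which is why serially dependent data would need a different resampling scheme.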

HTH
 
