Comparing two multivariate distributions (two matrices)

  1. Nov 11, 2011 #1
    I urgently need some help with a problem in my MS thesis. I have two datasets with the same variables but different numbers of observations, i.e. the same number of columns but not the same number of rows. The variables are identical for both sets. I want to compare the multivariate distributions of the two datasets. I have done some Google research on the matter and all I could find are tests for normality of multivariate samples. Although that information is also useful, I am more interested in comparing my two datasets whatever their distributions may be. In what way should I compare them? What are the parameters of comparison? For a multivariate normal distribution, I have read in an old paper (1983) by Hannu Oja that the eigenvalues of the covariance matrix are a measure of the spread or scatter of the data, or at least that is how I understood it. Please confirm whether I am right or wrong. This is as far as my search for answers could go. I want to have parameters of comparison even if my datasets are not normal. Since I am dealing with two matrices, I welcome suggestions from the mathematical point of view as well. Thank you very much in advance for any and all suggestions.
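
    On the eigenvalue question: the eigenvalues of a sample covariance matrix are the variances along its principal axes, so their sum is the total variance and their product is the generalized variance (the determinant). The following is only a minimal sketch of how the two spectra could be put side by side, assuming NumPy is available. The arrays `a` and `b` are hypothetical stand-ins for the two datasets, and `covariance_spectrum` is a name made up for this illustration.

```python
import numpy as np


def covariance_spectrum(data):
    """Eigenvalues of the sample covariance matrix, largest first."""
    data = np.asarray(data, dtype=float)
    cov = np.cov(data, rowvar=False)     # rows are observations, columns are variables
    eigvals = np.linalg.eigvalsh(cov)    # symmetric input: real eigenvalues, ascending
    return eigvals[::-1]


# Hypothetical data: the same 3 variables, different numbers of observations.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 3))
b = rng.normal(size=(350, 3)) * np.array([1.0, 2.0, 0.5])

for name, x in (("A", a), ("B", b)):
    ev = covariance_spectrum(x)
    print(name, "eigenvalues:", np.round(ev, 3))
    print(name, "total variance (sum):", round(float(ev.sum()), 3))
    print(name, "generalized variance (product):", round(float(ev.prod()), 3))
```

    These quantities describe second-order structure only, so two non-normal datasets can share the same spectrum and still differ in distribution.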
  3. Nov 11, 2011 #2
    I'd stick with simple methods as far as possible - find out the purpose of the exercise; explore the data; decide which 2 or 3 variables are most "important"; then compare using some appropriate charts (e.g. scatter, histogram, box/whisker etc).
  4. Nov 12, 2011 #3
    Thanks for the reply, bpet. I was already assuming nobody cares and nobody reads my thread. But I exaggerate. The problem is that I cannot reduce the dimension of my data any further; it is already reduced from the original. I guess I have to be more specific about what I'm using the data for. The datasets will be used to build regression models. One of my hypotheses is that the validity or predictive accuracy of the models depends on the internal structure of the datasets used. This internal structure can be described by their distributions and other mathematical properties. In other words, I'm trying to look at differences in the data structure of the two sets and relate these differences to the resulting models' performance.
  5. Nov 12, 2011 #4
    Ok, probably the easiest way is first to compare the marginals and then the joint distribution.

    The marginals (i.e. individual variables) can be compared using the usual univariate nonparametric two-sample tests: Kolmogorov-Smirnov (KS), Anderson-Darling (AD), Cramér-von Mises (CvM), Mann-Whitney (MW), etc.
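
    As a rough illustration of the marginal comparison, here is a sketch that computes the two-sample KS statistic column by column and gets a p-value by permutation, assuming only NumPy. The data, the function names `ks_statistic` and `permutation_pvalue`, and the choice of 500 permutations are all assumptions made for the example. A permutation p-value is used here instead of the usual asymptotic KS p-value simply to keep the sketch self-contained.

```python
import numpy as np


def ks_statistic(x, y):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    pooled = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, pooled, side="right") / x.size
    cdf_y = np.searchsorted(y, pooled, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))


def permutation_pvalue(x, y, n_perm=500, seed=0):
    """KS statistic and a permutation p-value under the null of equal distributions."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ks_statistic(x, y)
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)   # reshuffle the group labels
        if ks_statistic(perm[:x.size], perm[x.size:]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: column 1 is shifted in the second dataset, column 0 is not.
rng = np.random.default_rng(1)
a = rng.normal(size=(200, 2))
b = rng.normal(size=(350, 2)) + np.array([0.0, 0.7])

for j in range(a.shape[1]):
    d, p = permutation_pvalue(a[:, j], b[:, j])
    print(f"variable {j}: D = {d:.3f}, permutation p = {p:.4f}")
```

    With many variables the per-column p-values would need a multiple-comparison adjustment (Bonferroni, for instance). In practice a library routine such as `scipy.stats.ks_2samp` or `scipy.stats.mannwhitneyu` would normally replace the hand-written statistic.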

    If no significant differences are found in the marginals, and if the marginals are continuous, you could descale the data by converting the rank order of each variable to percentiles. Then try some graphical tools (such as parallel coordinates, Andrews plots, scatter matrices etc). There also exist several multivariate distribution-free two-sample tests, but I don't know a lot about that area.
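
    The rank-to-percentile step can be sketched as follows, again assuming NumPy and continuous data without ties. The function name `to_percentiles` and the example data are hypothetical. Once every column has been mapped to percentiles, what remains is the dependence between variables. One simple numeric summary to go alongside the plots is the correlation matrix of the percentiles, which is the Spearman rank correlation matrix. Comparing these matrices is an addition for illustration, not something prescribed above.

```python
import numpy as np


def to_percentiles(data):
    """Replace each column by its ranks scaled into (0, 1). Assumes no ties."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # argsort applied twice gives 0-based ranks within each column
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    return ranks / (n + 1)


# Hypothetical data: same standard normal marginals, different dependence.
rng = np.random.default_rng(2)
a = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200)
b = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=350)

u_a = to_percentiles(a)
u_b = to_percentiles(b)

# Pearson correlation of the percentiles equals the Spearman rank correlation.
rho_a = np.corrcoef(u_a, rowvar=False)
rho_b = np.corrcoef(u_b, rowvar=False)
print("rank correlation, set A:\n", np.round(rho_a, 3))
print("rank correlation, set B:\n", np.round(rho_b, 3))
print("difference:\n", np.round(rho_a - rho_b, 3))
```

    The percentile arrays `u_a` and `u_b` are also what one would feed into the scatter matrices or parallel-coordinate plots, since the scale of each variable has been removed.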

    Also, the above assumes that your datasets consist of IID observations; for time series data, for example, other methods might be more suitable.
