Outlier identification of multivariate data

serbring · Mar 20, 2015

Dear all,

I have more than 1500 matrices, each holding the occurrence frequencies of a bivariate dataset, something like the following: [0.1, 0.3, 0.1; 0.4, 0.05, 0.05]. I need to extract the few that are really different from the others. Instead of visually inspecting each one and struggling to spot something, is there a way to do this programmatically, perhaps with an index similar to the z-score used for outlier identification in univariate data?

Thanks

Best regards

Serbring

wabbit · Mar 20, 2015

I do not quite understand what your matrices are: what exactly is on each row? And more importantly, do all the matrices have the same shape (2x3 in your example)?

If the latter is true, a generic approach could be to treat all matrices as points in 6D space, and look for points far away from the others.

A basic implementation of this idea would be to compute the average of the 1500 matrices, ##\bar M={1\over N}\sum_{i}M_i##, and then for each matrix, its distance from this average ## d_i=d(M_i,\bar M)=\sqrt{\sum_{j,k}(M_{i;jk}-\bar M_{jk})^2}##.
The largest ##d_i##'s then give you (an interpretation of) the outliers.
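As a minimal sketch of the recipe above, assuming each matrix is stored as a list of rows: the code below averages the matrices element-wise, then computes each matrix's Euclidean distance ##d_i## from that average. The function names and the four toy matrices are made up for illustration; nothing here comes from the original data.

```python
import math

def mean_matrix(matrices):
    """Element-wise average of equally sized matrices (lists of rows)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[j][k] for m in matrices) / n for k in range(cols)]
            for j in range(rows)]

def distance(m, ref):
    """Euclidean distance between two matrices, treated as flat vectors."""
    return math.sqrt(sum((a - b) ** 2
                         for row_m, row_ref in zip(m, ref)
                         for a, b in zip(row_m, row_ref)))

def distances_from_mean(matrices):
    """Distance d_i of every matrix from the average matrix."""
    ref = mean_matrix(matrices)
    return [distance(m, ref) for m in matrices]

# Hypothetical toy data: three identical matrices and one with swapped rows.
matrices = [
    [[0.1, 0.3, 0.1], [0.4, 0.05, 0.05]],
    [[0.1, 0.3, 0.1], [0.4, 0.05, 0.05]],
    [[0.1, 0.3, 0.1], [0.4, 0.05, 0.05]],
    [[0.4, 0.05, 0.05], [0.1, 0.3, 0.1]],
]
d = distances_from_mean(matrices)

# Index of the matrix farthest from the average.
most_unusual = max(range(len(d)), key=d.__getitem__)
```

Here the matrix with swapped rows ends up farthest from the average, so it is the first candidate outlier.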

You can improve on that by using more information about the matrices and what it means for two of them to be near each other, but perhaps the above might serve as a starting point.

serbring · Mar 22, 2015

Hi Wabbit,

Thanks for your reply. Each matrix is the joint probability between two signals (torque and speed of a shaft) and all the matrices have the same size.

That distance is the Euclidean distance, right? What additional information could I use?

Thanks

wabbit · Mar 22, 2015

It is still unclear to me what the matrices are: I don't see an obvious way in which "joint probability between torque and speed" should take the form of a 2x3 matrix. Also, I assume you mean frequency, not probability.

Other than that, yes, this is the Euclidean distance, but that was just a default choice: any distance will do. This is one of the areas of improvement: for instance, if torque is more important than speed, then you may use a different distance, etc.

Have you tried this basic version to see if it gives you something usable?

serbring · Mar 22, 2015

Thanks. Yes, I meant frequency. The two signals are first binned, and then I compute the discrete joint frequency distribution using the Matlab function "hist3".

I've tried the method and it is rather effective. What I'm trying to understand now is how to properly set a "threshold" for outlier identification.

How can I weight torque more than speed? It might be useful to me.

wabbit · Mar 22, 2015

OK your description is now clear. I didn't think it was that because your rows do not sum to 1.

First, the threshold: you could just set the percentage of outliers (that's what a z-score does under a fancy name), or you could look at a chart of ##d_i## as a function of rank, i.e. sort them first and see whether some points or a group stand out (like a break in the curve...). Actually, you're looking at a univariate distribution now (the ##d_i##'s), so anything you usually do in that case is applicable; that was the reason for introducing them.
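Both threshold ideas can be sketched on a plain list of distances. This is only an illustration: the cutoff of 3.0 is a common convention rather than something fixed by this thread, the helper names are invented, and the distances are made-up numbers. Only large distances are flagged, since a small ##d_i## means the matrix is close to the average.

```python
import statistics

def zscores(values):
    """Standard scores of the values (population standard deviation)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # All values identical: nothing stands out.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def flag_by_zscore(values, threshold=3.0):
    """Indices whose z-score exceeds the threshold (one-sided test)."""
    return [i for i, z in enumerate(zscores(values)) if z > threshold]

def flag_by_largest_gap(values):
    """Indices lying above the biggest jump in the sorted values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    gaps = [values[order[i + 1]] - values[order[i]]
            for i in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return sorted(order[cut + 1:])

# Hypothetical distances: eleven ordinary values and one far-away point.
d = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10,
     0.11, 0.12, 0.10, 0.11, 0.12, 0.95]

by_z = flag_by_zscore(d)
by_gap = flag_by_largest_gap(d)
```

The gap rule mimics looking for a break in the sorted curve; it always splits the data somewhere, so it is worth checking that the jump is actually large before trusting it.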

A weighted distance would be
##d'_i=d'(M_i,\bar M)=\sqrt{\sum_{j,k}w_{jk}(M_{i;jk}-\bar M_{jk})^2}## where ##w_{jk}\geq 0, \sum_{j,k}w_{jk}=1## is a set of weights you like.

You can pick any weights that make sense for the problem - say the second bin matters more than the others: increase the ##w_{j2}##. Torque distribution matters more than speed: increase ##w_{1k}##, etc.
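A small sketch of the weighted distance, assuming the weights are given as a matrix of the same shape and rescaled inside the function so that they sum to 1. The weight values, the reference matrix and the two test matrices are all hypothetical; the weights double the importance of the second column, as in the "second bin matters more" case.

```python
import math

def weighted_distance(m, ref, weights):
    """Weighted Euclidean distance; weights are rescaled to sum to 1."""
    total = sum(w for row in weights for w in row)
    if total <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    return math.sqrt(sum((w / total) * (a - b) ** 2
                         for row_m, row_ref, row_w in zip(m, ref, weights)
                         for a, b, w in zip(row_m, row_ref, row_w)))

ref = [[0.1, 0.3, 0.1], [0.4, 0.05, 0.05]]

# a differs from ref only in the second column, b only in columns 1 and 3.
a = [[0.1, 0.2, 0.1], [0.4, 0.15, 0.05]]
b = [[0.2, 0.3, 0.0], [0.4, 0.05, 0.05]]

uniform = [[1, 1, 1], [1, 1, 1]]
second_bin = [[1, 2, 1], [1, 2, 1]]  # illustrative choice of w_{j2}

d_a_uniform = weighted_distance(a, ref, uniform)
d_b_uniform = weighted_distance(b, ref, uniform)
d_a_weighted = weighted_distance(a, ref, second_bin)
d_b_weighted = weighted_distance(b, ref, second_bin)
```

With uniform weights the two matrices are equally far from the reference; with the second column emphasised, `a` moves further away than `b`, which is the intended effect of the weighting.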

There are other ways of getting a better selection, but these would require you to think more about what properties of the matrices are important to you and what criterion for two matrices being "close" makes the most sense - then translating those thoughts into mathematical form.

But as long as this one works, why bother :)

serbring · Mar 22, 2015

Oh great, so I can use any univariate outlier identification method; I'll use the z-score.

Thanks a lot for your help. On Friday I was going crazy visually analyzing all those matrices. I'll keep you updated ;)

