Outlier identification of multivariate data

AI Thread Summary
The discussion focuses on identifying outliers in a set of over 1500 matrices representing joint frequency distributions of torque and speed. A suggested method involves treating the matrices as points in a 6D space and calculating their Euclidean distance from the average matrix to identify outliers. The user seeks to establish a threshold for outlier identification and is interested in weighting the importance of torque over speed in the distance calculation. It is recommended to use univariate outlier detection techniques, such as the z-score, to facilitate this process. The conversation highlights the importance of defining appropriate weights and criteria for determining matrix similarity.
serbring
Dear all,

I have more than 1500 matrices, each holding the occurrence frequencies of a bivariate dataset, something like the following: [0.1, 0.3, 0.1; 0.4, 0.05, 0.05]. I need to extract the few that are really different from the others. Instead of visually analyzing each of them and struggling to spot anything, is there a way to do this programmatically, perhaps with an index similar to the z-score used for outlier identification in univariate data?

Thanks

Best regards

Serbring
 
I do not quite understand what your matrices are: what exactly is on each row? And more importantly, do all the matrices have the same shape (2x3 in your example)?

If the latter is true, a generic approach could be to treat all matrices as points in 6D space, and look for points far away from the others.

A basic implementation of this idea would be to compute the average of the 1500 matrices, ##\bar M={1\over N}\sum_{i}M_i##, and then for each matrix, its distance from this average ## d_i=d(M_i,\bar M)=\sqrt{\sum_{j,k}(M_{i;jk}-\bar M_{jk})^2}##.
The largest ##d_i##'s then give you (an interpretation of) the outliers.
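
As a rough illustration of the averaging-and-distance idea above, here is a minimal sketch in plain Python. The function names (`mean_matrix`, `distance`) and the four toy matrices are made up for the example; matrices are assumed to be lists of rows, all of the same shape.

```python
import math

def mean_matrix(matrices):
    """Element-wise average of equally sized matrices (lists of rows)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[j][k] for m in matrices) / n for k in range(cols)]
            for j in range(rows)]

def distance(m, ref):
    """Euclidean distance between two matrices viewed as flat vectors."""
    return math.sqrt(sum((a - b) ** 2
                         for row_m, row_r in zip(m, ref)
                         for a, b in zip(row_m, row_r)))

# Hypothetical toy data: three similar matrices and one that clearly differs.
matrices = [
    [[0.10, 0.30, 0.10], [0.40, 0.05, 0.05]],
    [[0.10, 0.30, 0.10], [0.40, 0.05, 0.05]],
    [[0.12, 0.28, 0.10], [0.40, 0.05, 0.05]],
    [[0.40, 0.05, 0.05], [0.10, 0.30, 0.10]],
]

m_bar = mean_matrix(matrices)
d = [distance(m, m_bar) for m in matrices]

# Matrix indices ordered from the largest distance to the smallest.
ranked = sorted(range(len(d)), key=lambda i: d[i], reverse=True)
```

With this toy data the last matrix ends up first in `ranked`, which is the behaviour one would hope for; on real data the ranking is only one possible interpretation of "different".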

You can improve on that by using more information about the matrices and what it means for two of them to be near each other, but perhaps the above might serve as a starting point.
 
Hi Wabbit,

Thanks for your reply. Each matrix is the joint probability between two signals (torque and speed of a shaft) and all the matrices have the same size.

That distance is the Euclidean distance, right? What additional information could be added?

Thanks
 
Still unclear to me what the matrices are, I don't see an obvious way in which "joint probability between torque and speed" should take the form of a 2x3 matrix. Also I assume you mean frequency not probability.

Other than that, yes, this is the Euclidean distance, but that was just a default choice; any distance will do. This is one of the areas for improvement: for instance, if torque is more important than speed, then you may use a different distance, etc.

Have you tried this basic version to see if it gives you something usable?
 
Thanks. Yes, I meant frequency. The two signals are first binned, and then I compute the discrete joint frequency distribution using the Matlab function "hist3".
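
For readers without Matlab, the binning step described here can be sketched in plain Python. This is not a reimplementation of "hist3"; the function name `joint_frequency`, the bin edges and the sample values are all hypothetical. The one point that matters for the comparison is that every matrix is built on the same bin edges.

```python
def joint_frequency(x, y, x_edges, y_edges):
    """Relative joint frequencies of paired samples on a fixed 2-D grid."""
    def bin_index(v, edges):
        # Half-open bins [e_i, e_{i+1}); the last bin also includes its right edge.
        if v == edges[-1]:
            return len(edges) - 2
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                return i
        return None  # sample falls outside the grid and is ignored

    counts = [[0] * (len(y_edges) - 1) for _ in range(len(x_edges) - 1)]
    n = 0
    for xv, yv in zip(x, y):
        i, j = bin_index(xv, x_edges), bin_index(yv, y_edges)
        if i is not None and j is not None:
            counts[i][j] += 1
            n += 1
    # Divide by the number of counted samples so the entries sum to 1.
    return [[c / n for c in row] for row in counts] if n else counts

# Hypothetical samples: rows are torque bins, columns are speed bins.
torque = [10, 20, 30, 40]
speed = [100, 200, 300, 250]
freq = joint_frequency(torque, speed, [0, 25, 50], [0, 150, 250, 350])
```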

I've tried the method and it is rather effective. What I'm trying to understand now is how to properly set a "threshold" for outlier identification.

How can I weight torque more than speed? It might be useful to me.
 
OK your description is now clear. I didn't think it was that because your rows do not sum to 1.

First the threshold: you could just set the percentage of outliers (that's what a z-score does under a fancy name), or you could look at a chart of ##d_i## as a function of rank, i.e. sort them first and see if some points or a group stand out (like a break in the curve...). Actually, you're looking at a univariate distribution now (the ##d_i##'s), so anything you usually do in that case is applicable; that was the reason for introducing them.
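
Both thresholding options can be sketched in a few lines of plain Python. The function names, the cut-off of 3 and the 5% fraction are arbitrary choices for illustration, and the distances are toy values. Since distances are non-negative and usually skewed, a z-score cut-off is only a rough rule of thumb here, and only large distances are flagged.

```python
import statistics

def zscore_outliers(distances, threshold=3.0):
    """Indices whose distance lies more than `threshold` std devs above the mean."""
    mu = statistics.mean(distances)
    sigma = statistics.pstdev(distances)
    if sigma == 0:
        return []  # all distances are equal, nothing stands out
    return [i for i, d in enumerate(distances) if (d - mu) / sigma > threshold]

def top_fraction(distances, fraction=0.01):
    """Indices of the largest distances, keeping a fixed share of the data."""
    k = max(1, round(fraction * len(distances)))
    order = sorted(range(len(distances)), key=lambda i: distances[i], reverse=True)
    return order[:k]

# Hypothetical distances: twenty ordinary matrices and one far from the mean.
distances = [0.05] * 20 + [0.9]
flagged_by_z = zscore_outliers(distances)
flagged_by_share = top_fraction(distances, fraction=0.05)
```

Plotting `sorted(distances)` against rank is the third option mentioned above; a visible break in that curve suggests a natural cut-off.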

A weighted distance would be
##d'_i=d'(M_i,\bar M)=\sqrt{\sum_{j,k}w_{jk}(M_{i;jk}-\bar M_{jk})^2}## where ##w_{jk}\geq 0, \sum_{j,k}w_{jk}=1## is a set of weights you like.

You can pick any weights that make sense for the problem - say the second bin matters more than the others: increase the ##w_{j2}##. Torque distribution matters more than speed: increase ##w_{1k}##, etc.
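
A minimal sketch of the weighted distance, again in plain Python. The raw weights below (tripling the second column, i.e. the ##w_{j2}##) are purely illustrative; which rows or columns to emphasise depends on how torque and speed are laid out in your matrices.

```python
import math

def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def weighted_distance(m, ref, weights):
    """Square root of the weighted sum of squared element-wise differences."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for row_m, row_r, row_w in zip(m, ref, weights)
                         for a, b, w in zip(row_m, row_r, row_w)))

# Hypothetical weights: the second column counts three times as much.
w = normalize([[1.0, 3.0, 1.0], [1.0, 3.0, 1.0]])
uniform = normalize([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])

# Two toy matrices that differ in the second and third entries of the first row.
a = [[0.1, 0.3, 0.1], [0.4, 0.05, 0.05]]
b = [[0.1, 0.2, 0.2], [0.4, 0.05, 0.05]]

d_weighted = weighted_distance(a, b, w)
d_uniform = weighted_distance(a, b, uniform)
```

Here `d_weighted` comes out larger than `d_uniform`, because part of the difference sits in the column that was given more weight.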

There are other ways of getting a better selection, but these would require you to think more about what properties of the matrices are important to you and what criterion for two matrices being "close" makes the most sense - then translating those thoughts into mathematical form.

But as long as this one works, why bother :)
 
Oh great, so I can use any univariate outlier identification method. I'll use the z-score then.

Thanks a lot for your help; on Friday I was going crazy visually analyzing all those matrices. I'll keep you updated ;)
 
