Outlier identification of multivariate data

Discussion Overview

The discussion revolves around the identification of outliers in a set of multivariate data represented as matrices, specifically focusing on a bivariate dataset that reflects joint frequency distributions of two signals: torque and speed. Participants explore methods for programmatically detecting outliers without visual inspection.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant seeks a method to identify outliers among 1500 matrices representing joint frequency distributions, suggesting the use of an index similar to the z-score.
  • Another participant proposes treating the matrices as points in a 6D space and calculating the Euclidean distance from the average matrix to identify outliers.
  • A clarification is made regarding the nature of the matrices, emphasizing that they represent joint frequency distributions rather than probabilities.
  • Participants discuss the choice of distance metric, with one suggesting that different metrics could be used depending on the importance of the variables involved.
  • One participant expresses the need to establish a threshold for outlier identification and inquires about weighting the distance to prioritize torque over speed.
  • A weighted distance formula is introduced, allowing for the adjustment of weights based on the significance of different components of the matrices.
  • Participants agree that using univariate outlier identification methods, such as the z-score, could be applicable to the derived distances.

Areas of Agreement / Disagreement

Participants converge on a workable approach: compute each matrix's distance from the average matrix, then apply a univariate outlier test such as the z-score to those distances. The choice of distance metric and weights is left open, since it depends on which properties of the matrices matter for the problem.

Contextual Notes

Participants note that the rows of the example matrix do not sum to 1, which initially obscured its interpretation as a joint frequency distribution (the entries of the example as a whole do sum to 1). The discussion also highlights the need for careful consideration of the properties of the matrices and the criteria for determining closeness between them.

Who May Find This Useful

This discussion may be useful for researchers or practitioners working with multivariate datasets, particularly in fields related to signal processing, data analysis, or statistical modeling, who are interested in outlier detection techniques.

serbring
Dear all,

I have more than 1500 matrices that are the occurrence frequencies of a bivariate dataset, something like the following: [0.1, 0.3, 0.1; 0.4, 0.05, 0.05]. I need to extract the few of them that are really different from the others. Instead of visually analyzing each of them and struggling to spot something, is there any way to do this programmatically, maybe with an index similar to the z-score used for outlier identification in univariate data?

Thanks

Best regards

Serbring
 
I do not quite understand what your matrices are: what is on each line exactly? And more importantly, are all the matrices of the same form (2x3 in your example)?

If the latter is true, a generic approach could be to treat all matrices as points in 6D space, and look for points far away from the others.

A basic implementation of this idea would be to compute the average of the 1500 matrices, ##\bar M={1\over N}\sum_{i}M_i##, and then for each matrix, its distance from this average ## d_i=d(M_i,\bar M)=\sqrt{\sum_{j,k}(M_{i;jk}-\bar M_{jk})^2}##.
The largest ##d_i##'s then give you (an interpretation of) the outliers.

You can improve on that by using more information about the matrices and what it means for two of them to be near each other, but perhaps the above might serve as a starting point.
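
The basic version described above can be sketched as follows. This is only an illustration in plain Python (standard library only); the sample matrices and the helper names `mean_matrix` and `euclidean_distance` are made up for the example and are not from the thread.

```python
import math

def mean_matrix(matrices):
    """Element-wise average of equally sized matrices (lists of rows)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[j][k] for m in matrices) / n for k in range(cols)]
            for j in range(rows)]

def euclidean_distance(a, b):
    """Square root of the summed squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

# Hypothetical data: three similar matrices and one that clearly differs.
matrices = [
    [[0.10, 0.30, 0.10], [0.40, 0.05, 0.05]],
    [[0.12, 0.28, 0.10], [0.40, 0.05, 0.05]],
    [[0.10, 0.30, 0.12], [0.38, 0.05, 0.05]],
    [[0.40, 0.05, 0.05], [0.10, 0.30, 0.10]],
]

m_bar = mean_matrix(matrices)
distances = [euclidean_distance(m, m_bar) for m in matrices]

# Indices sorted from largest to smallest distance: candidates come first.
ranking = sorted(range(len(matrices)), key=lambda i: distances[i], reverse=True)
print(ranking, [round(d, 3) for d in distances])
```

With 1500 matrices the same loop applies unchanged; only the input list grows.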
 
Hi Wabbit,

Thanks for your reply. Each matrix is the joint probability between two signals (torque and speed of a shaft) and all the matrices have the same size.

That distance is the Euclidean distance, right? What additional information could I use?

Thanks
 
Still unclear to me what the matrices are, I don't see an obvious way in which "joint probability between torque and speed" should take the form of a 2x3 matrix. Also I assume you mean frequency not probability.

Other than that, yes, this is the Euclidean distance, but that was just a choice by default; any distance will do. This is one of the areas of improvement: for instance, if torque is more important than speed then you may use a different distance, etc.

Have you tried this basic version to see if it gives you something usable?
 
Thanks. Yes, I meant frequency. The two signals are first binned, and then I compute the discrete joint frequency distribution using the Matlab function "hist3".

I've tried the method and it is rather effective. What I'm trying to understand is how to properly set a "threshold" for outlier identification.

How can I weight torque more than speed? It might be useful to me.
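
For readers without Matlab, the binning step described above can be sketched roughly as below. This is not a reimplementation of `hist3`; the function name `joint_frequency`, the bin edges, the sample values, and the choice to divide the counts by the number of samples are all assumptions made for illustration.

```python
def joint_frequency(torque, speed, torque_edges, speed_edges):
    """Relative joint frequencies; rows index torque bins, columns speed bins."""
    def bin_index(value, edges):
        # Half-open bins [edges[i], edges[i+1]); the last bin also
        # includes its upper edge.
        if value == edges[-1]:
            return len(edges) - 2
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                return i
        return None  # value falls outside the binning range

    counts = [[0] * (len(speed_edges) - 1)
              for _ in range(len(torque_edges) - 1)]
    total = 0
    for t, s in zip(torque, speed):
        j, k = bin_index(t, torque_edges), bin_index(s, speed_edges)
        if j is not None and k is not None:
            counts[j][k] += 1
            total += 1
    if total == 0:
        raise ValueError("no samples fall inside the bin edges")
    return [[c / total for c in row] for row in counts]

# Hypothetical samples and bin edges (2 torque bins x 3 speed bins).
torque = [10, 20, 60, 70, 80, 30, 90, 15]
speed = [100, 900, 1500, 2500, 500, 1200, 2900, 1800]
freq = joint_frequency(torque, speed, [0, 50, 100], [0, 1000, 2000, 3000])
print(freq)
```

Normalized this way, the entries of each matrix sum to 1 while individual rows generally do not.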
 
OK your description is now clear. I didn't think it was that because your rows do not sum to 1.

First the threshold: you could just set the percentage of outliers (that's what a z-score does under a fancy name), or you could look at a chart of ##d_i## as a function of rank, i.e. sort them first and see if there are some points or a group standing out (like a break in the curve...). Actually, you're looking at a univariate distribution now (the ##d_i##'s), so anything you usually do in that case is applicable; that was the reason for introducing them.
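
Two of these thresholding options can be sketched as below: a z-score cut on the distances and a fixed top fraction. The distance values, the function names, and the threshold of 2.0 are hypothetical choices for the example, not recommendations from the thread.

```python
import statistics

def zscores(values):
    """Standard scores using the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

def flag_by_zscore(values, threshold=3.0):
    """Indices whose z-score exceeds the threshold (large values only)."""
    return [i for i, z in enumerate(zscores(values)) if z > threshold]

def flag_top_fraction(values, fraction=0.01):
    """Indices of the largest `fraction` of the values (at least one)."""
    n = max(1, round(fraction * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(order[:n])

# Hypothetical distances d_i: most are small, the last two stand out.
d = [0.05, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06, 0.04, 0.05, 0.06,
     0.05, 0.04, 0.06, 0.05, 0.07, 0.05, 0.06, 0.05, 0.45, 0.50]

print(flag_by_zscore(d, threshold=2.0))
print(flag_top_fraction(d, fraction=0.10))
```

Note that large outliers inflate the standard deviation themselves, so the z-score threshold may need to be lower than expected; sorting the ##d_i## and looking for a break in the curve remains a useful cross-check.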

A weighted distance would be
##d'_i=d'(M_i,\bar M)=\sqrt{\sum_{j,k}w_{jk}(M_{i;jk}-\bar M_{jk})^2}## where ##w_{jk}\geq 0, \sum_{j,k}w_{jk}=1## is a set of weights you like.

You can pick any weights that make sense for the problem - say the second bin matters more than the others: increase the ##w_{j2}##. Torque distribution matters more than speed: increase ##w_{1k}##, etc.
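
A minimal sketch of the weighted distance, assuming weights that emphasize the second column (the ##w_{j2}## case above). The weight values, the matrices, and the helper names are hypothetical and chosen only to show the effect.

```python
import math

def weighted_distance(m, m_bar, weights):
    """sqrt(sum_jk w_jk * (m_jk - m_bar_jk)^2) with non-negative weights."""
    return math.sqrt(sum(w * (x - y) ** 2
                         for row_m, row_b, row_w in zip(m, m_bar, weights)
                         for x, y, w in zip(row_m, row_b, row_w)))

def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

# Hypothetical weights: the second column counts four times as much.
weights = normalize([[1, 4, 1], [1, 4, 1]])
uniform = normalize([[1, 1, 1], [1, 1, 1]])

m_bar = [[0.10, 0.30, 0.10], [0.40, 0.05, 0.05]]
# Both matrices move 0.10 of frequency out of the same cell,
# but a moves it into column 2 and b into column 1.
a = [[0.10, 0.40, 0.10], [0.30, 0.05, 0.05]]
b = [[0.20, 0.30, 0.10], [0.30, 0.05, 0.05]]

d_a = weighted_distance(a, m_bar, weights)
d_b = weighted_distance(b, m_bar, weights)
print(d_a, d_b)
```

With uniform weights the two matrices are equally far from the average; with the column-2 weights, `a` ends up farther than `b`, which is the intended effect of the weighting.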

There are other ways of getting a better selection, but these would require you to think more about what properties of the matrices are important to you and what criterion for two matrices being "close" makes the most sense - then translating those thoughts into mathematical form.

But as long as this one works, why bother :)
 
Oh great, so I can use any univariate outlier identification method. I'll use the z-score.

Thanks a lot for your help. On Friday I was going crazy visually analyzing all those matrices. I'll keep you updated ;)
 