Calculating match between two data sets

Discussion Overview

The discussion revolves around methods for quantifying the "match" between two data sets in the context of a Monte Carlo experiment. Participants explore various approaches to calculate a percentage match, aiming for a clear numerical representation of similarity or difference between the data sets.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests using the average of the absolute value of the differences between the two data sets at each index, but finds the result uninformative due to scale issues.
  • Another approach discussed is the average of the absolute value of the percentage differences, which also presents challenges as percentages can exceed 100%.
  • A correlation coefficient is mentioned as a potential method, but it is noted that it only indicates linear relationships, which may not reflect the actual similarity of the data sets.
  • One participant recommends defining a mathematically precise measure of similarity and suggests normalizing the data sets to address scaling issues.
  • Another participant emphasizes the importance of understanding the boundaries of the data to classify similarity and difference effectively.
  • Statistical techniques for comparing distributions, such as the Chi-square goodness of fit test, are proposed as a method for assessing similarity.
  • Markovian analysis is suggested as a way to characterize the evolution of processes represented by the data sets.
  • Entropy methods are mentioned as a means to quantify information density and assess similarity based on entropy measures.
  • A participant proposes a sum square difference approach, indicating that scaling differences per index may be necessary for accurate comparison.

Areas of Agreement / Disagreement

Participants express various methods and considerations for quantifying similarity, but there is no consensus on a single approach. Multiple competing views and uncertainties regarding definitions and methodologies remain evident throughout the discussion.

Contextual Notes

Participants highlight limitations related to the definitions of similarity and difference, the scaling of data sets, and the need for precise mathematical formulations. The discussion includes unresolved mathematical steps and varying assumptions about the nature of the data.

jordanstreet
Hey guys, I'm developing a program for comparing the effects of various terms in a Monte Carlo experiment. Right now I have it so you can visually see the effect of "switching" terms on and off and need a way of quantifying how much two lines "match".

-----

What I need is to be able to compare two data sets and get a number which represents how much the two data sets "match". Here are some methods I have tried and their effectiveness.

1) Average(Absolute Value(Difference between the two sets at each index)) - this gives me a number, but the number doesn't really mean anything to me because it depends on the scale of the two data sets. This led me to my next attempt

2) Average(Absolute Value(Percentage difference between the two sets at each index)) - this was better but again the percentages could range well over 100%

- also with the above two strategies I would be getting the average difference which I would need to somehow manipulate into a percentage match

3) Correlation coefficient - this looked promising but I then realized this only tells me "how linear" a relationship the two sets had. They could be totally different sets of data but as long as they shared the same linear relationship they would be a 100% correlation

-------

Basically the goal is to calculate a percentage match where 100% would be the two sets are identical and 0% would mean they are infinitely different. Any help would be greatly appreciated. Thanks!
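
For reference, the three attempts listed above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and the percentage difference is taken relative to the first set, which is an assumption since the post does not say which set is the reference.

```python
from statistics import mean

def mean_abs_diff(a, b):
    """Attempt 1: average absolute difference per index (in the data's own units)."""
    return mean(abs(x - y) for x, y in zip(a, b))

def mean_abs_pct_diff(a, b):
    """Attempt 2: average absolute percentage difference.

    Assumption: the difference is taken relative to `a`; this fails if `a` contains 0.
    """
    return mean(abs(x - y) / abs(x) * 100 for x, y in zip(a, b))

def pearson_r(a, b):
    """Attempt 3: Pearson correlation coefficient."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

y1 = [1.0, 2.0, 3.0, 4.0]
y2 = [2.0 * v for v in y1]        # same shape, twice the size

print(mean_abs_diff(y1, y2))      # 2.5, but the units are those of the data
print(mean_abs_pct_diff(y1, y2))  # 100.0, and nothing stops it going higher
print(pearson_r(y1, y2))          # 1.0, even though the sets clearly differ
```

The last line shows the problem described in (3): a set and its double are perfectly correlated.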


Hi Jordan,

It seems you are still brainstorming and not yet sure what you want, since defining 0% as "infinitely different" is not very precise. So the first thing I would recommend is to work out a mathematically precise definition of what you want to measure, one that is relevant to your problem.

But anyway, to add to the brainstorming: you could normalize the variances of X1 and X2 to avoid the scaling problem; you simply compute X/σ for both sets. Once that problem is gone you can work with the distribution of distances per point. Again, there are many ways to do this; for instance, you could define a score such as "% of points within a certain distance or less", or any other that fits your needs.
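
A rough sketch of that idea in Python, assuming the population standard deviation is used for σ and that the tolerance `tol` is something you pick yourself (both are assumptions, and the function names are hypothetical):

```python
from statistics import pstdev

def scale_by_sigma(data):
    """Divide every value by the data set's own standard deviation (X / sigma)."""
    sigma = pstdev(data)  # assumption: population standard deviation
    return [v / sigma for v in data]

def fraction_within(a, b, tol):
    """Share of indices where the sigma-scaled sets differ by at most `tol`."""
    sa, sb = scale_by_sigma(a), scale_by_sigma(b)
    close = sum(1 for x, y in zip(sa, sb) if abs(x - y) <= tol)
    return close / len(sa)

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.1, 2.0, 2.9, 4.5]
print(f"{100 * fraction_within(x1, x2, tol=0.2):.0f}% of points within 0.2 sigma")
```

Keep in mind that scaling each set by its own σ makes a set and a rescaled copy of it look identical, so this only fits if overall scale is not supposed to count as a difference.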


Hey jordanstreet and welcome to the forums.

Just a few comments for your particular problem:

1) Defining 'different' and 'similar' with respect to 1 and 0.

When considering this, you need to think about the bounds on the values your data can take, so that you can decide how to relate two data sets and how to class them as x% 'similar' or 'different'.

If you know the range of your data, then you can use it to classify difference or similarity.

For example, if you know that everything lies between 0 and 1 as a real number, then you know that a distance of 1 means completely different. This is known as understanding the global constraints of your data, whether forced or inferred.

The above will help you define a metric (or maybe a norm under certain conditions) so that you can quantify similarity and difference of two data points. Smaller values of the metric mean more similar and larger values mean more different. You can transform this so that you get an appropriate value corresponding to 0 for same and 1 for completely different.
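
As a minimal sketch of point 1, assuming the bounds `lo` and `hi` are known in advance and that mean absolute difference is an acceptable metric (both assumptions; the names are illustrative):

```python
def normalized_distance(a, b, lo, hi):
    """Mean absolute difference divided by the largest possible gap (hi - lo).

    Returns 0.0 for identical sets and 1.0 when every pair of values sits
    at opposite ends of the allowed range.
    """
    span = hi - lo
    return sum(abs(x - y) for x, y in zip(a, b)) / (len(a) * span)

def percent_match(a, b, lo, hi):
    """Turn the 0..1 distance into a 100%..0% match score."""
    return 100.0 * (1.0 - normalized_distance(a, b, lo, hi))

print(percent_match([0.0, 0.5, 1.0], [0.0, 0.5, 1.0], lo=0.0, hi=1.0))  # 100.0
print(percent_match([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], lo=0.0, hi=1.0))  # 0.0
```

With a bounded range, "0% match" has a concrete meaning instead of "infinitely different".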

2) Does order count or does only the probabilistic properties/frequency information count?

If the order of the data does not count and you do not necessarily need to compare the data in a pair-wise or similar fashion, then you can convert your data to a distribution.

The benefit of doing this is that there are statistical techniques for comparing two distributions. Under frequentist statistics, you run a hypothesis test of whether an expected and an observed distribution come from the same distribution; if the test fails to reject that hypothesis, the two are statistically consistent with each other.

As a start you would want to look at a Goodness of Fit test (Chi-square goodness of fit) which is implemented in most major statistical software packages.

The above is good when you just want to compare things in terms of a standard one off process as opposed to something where the position and relationship to other local data points are relevant.
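
To make point 2 concrete, here is a small sketch that bins two data sets and computes Pearson's chi-square statistic by hand. The bin range, the bin count, and the choice to treat one set as "expected" are all assumptions for illustration; turning the statistic into a p-value needs the chi-square distribution from a statistics package.

```python
def histogram(data, lo, hi, bins):
    """Count values in `bins` equal-width bins on [lo, hi] (values assumed in range)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in data:
        k = min(int((v - lo) / width), bins - 1)  # v == hi goes in the last bin
        counts[k] += 1
    return counts

def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over the bins.

    Expected counts are rescaled to the observed total. Bins with E == 0
    are skipped, which is a simplification.
    """
    scale = sum(observed) / sum(expected)
    stat = 0.0
    for o, e in zip(observed, expected):
        e *= scale
        if e > 0:
            stat += (o - e) ** 2 / e
    return stat

reference = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
sample = [0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.6, 0.9]
expected = histogram(reference, 0.0, 1.0, bins=2)
observed = histogram(sample, 0.0, 1.0, bins=2)
print(chi_square_statistic(observed, expected))  # 0 would mean identical bin counts
```

A larger statistic means the binned distributions differ more; note that all ordering information is thrown away.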

3) Are Markovian attributes desirable?

If you want to consider at the very least the simplest characterization of a process with regard to how it evolves, then a Markovian analysis is a good place to start.

With this you can calculate the transition matrix given your data and then compare the corresponding operators for each data set (i.e. the transition matrix). You can use norm techniques for operators as well as adaptation of the goodness of fit test.

For more complicated methods, you will probably need to read around.
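
A minimal sketch of the transition-matrix comparison, assuming the data have already been discretized into integer states 0..n-1 and using the Frobenius norm as the operator norm (one choice among many; the function names are hypothetical):

```python
def transition_matrix(states, n_states):
    """Estimate first-order transition probabilities from a state sequence."""
    counts = [[0] * n_states for _ in range(n_states)]
    for s, t in zip(states, states[1:]):
        counts[s][t] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left gets an all-zero row (a simplification).
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def frobenius_distance(p, q):
    """Square root of the summed squared entry-wise differences."""
    return sum((a - b) ** 2
               for row_p, row_q in zip(p, q)
               for a, b in zip(row_p, row_q)) ** 0.5

alternating = transition_matrix([0, 1, 0, 1], n_states=2)  # always switches state
constant = transition_matrix([0, 0, 0, 0], n_states=2)     # never leaves state 0
print(frobenius_distance(alternating, constant))
```

A distance of 0 means both sequences have the same estimated one-step dynamics.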

4) Entropy methods

Entropy methods are a way to quantify the information density of a particular representation of information. Once you define the alphabet and quantify the probabilities, you can get the entropy of said representation with respect to its alphabet.

In terms of similarity, you can then apply a metric to the entropy measure to look at similarity of information content. Different measures will look at different things and you will have to research this if you want more complicated and general schemes, but the idea is to use the entropy measure you are interested in and then use the metric to assess similarity.
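
As one possible illustration, the sketch below computes the Shannon entropy of a sequence of discrete symbols and uses the absolute difference of two entropies as the metric. Both choices are assumptions; real-valued data would first have to be binned into an alphabet.

```python
from collections import Counter
from math import log2

def shannon_entropy(symbols):
    """Shannon entropy in bits, using observed symbol frequencies as probabilities."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def entropy_gap(a, b):
    """Absolute difference between the entropies of two symbol sequences."""
    return abs(shannon_entropy(a) - shannon_entropy(b))

print(shannon_entropy("aabb"))      # 1.0 bit: two equally likely symbols
print(entropy_gap("aabb", "aaaa"))  # 1.0: the second sequence carries no information
```

Two quite different data sets can still have the same entropy, so a gap of 0 only says the information content matches, not that the values do.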

These should give you a few ideas for your application.
 


From your item (3), it sounds like adjusting the datasets to the same scale won't do it. If $y_1 = x$ and $y_2 = 2x$ then they are to be viewed as quite different datasets. I'm guessing your reference to scale in (1) refers to the common scale, not separate scales per dataset. I.e. if $\{y_{1i}\}$ and $\{y_{2i}\}$ have a difference measure of $m$ then $\{2y_{1i}\}$ and $\{2y_{2i}\}$ should have the same difference measure.
I also get the impression that they share a common index, so order matters.
A possibility would be sum square difference / sum squares: $\sum_i (y_{1i}-y_{2i})^2 / \sum_i (y_{1i}^2+y_{2i}^2)$
I think the max possible value of that is 2 (reached when $y_{2i} = -y_{1i}$ for every $i$).
However, you may have meant that the difference is to be scaled per index. So that would be: $\frac{1}{n}\sum_i \frac{(y_{1i}-y_{2i})^2}{y_{1i}^2+y_{2i}^2}$
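
Both variants are easy to try out. The sketch below implements them in Python; the final mapping from the measure (range 0 to 2) to a percentage match is an assumption added for illustration, not something proposed above, and the function names are made up.

```python
def pooled_sq_diff(y1, y2):
    """Sum of squared differences divided by the sum of squares of both sets."""
    num = sum((a - b) ** 2 for a, b in zip(y1, y2))
    den = sum(a * a + b * b for a, b in zip(y1, y2))
    return num / den  # undefined if both sets are all zeros

def per_index_sq_diff(y1, y2):
    """Same ratio, but scaled per index and then averaged over n points."""
    terms = [(a - b) ** 2 / (a * a + b * b) for a, b in zip(y1, y2)]
    return sum(terms) / len(terms)  # undefined where both values are zero

def match_percent_from_measure(measure):
    """Assumed linear mapping: measure 0 -> 100% match, measure 2 -> 0% match."""
    return 100.0 * (1.0 - measure / 2.0)

y1 = [1.0, 2.0, 3.0]
print(pooled_sq_diff(y1, y1))                     # 0.0 for identical sets
print(pooled_sq_diff(y1, [-v for v in y1]))       # 2.0, the maximum
print(pooled_sq_diff(y1, [2.0 * v for v in y1]))  # y and 2y are no longer "identical"
```

Unlike the correlation coefficient, this treats y and 2y as different, while doubling both sets together leaves the result unchanged.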
 
