Checking Triangle Inequality for List Similarity Metric

Click For Summary

Discussion Overview

The discussion revolves around the validity of a distance metric based on similarity scores for comparing two lists, specifically examining whether this metric satisfies the triangle inequality. The context includes theoretical exploration of list similarity metrics and their mathematical properties.

Discussion Character

  • Exploratory
  • Mathematical reasoning

Main Points Raised

  • One participant proposes a similarity score defined as the intersection of two lists divided by their union, with a corresponding distance metric calculated as one minus the similarity score.
  • Another participant suggests that the triangle inequality can be proven for equal list sizes and provides a mathematical expression for distances based on pairwise overlaps.
  • A different participant notes that the scoring metric may be skewed when lists are large with small overlaps, indicating a need for potential corrections to the metric based on overlap size.
  • Concerns are raised about the accuracy of the distance metric in scenarios where one list is significantly larger than the other.

Areas of Agreement / Disagreement

Participants express uncertainty regarding the general validity of the distance metric for arbitrary list sizes, and no consensus is reached on whether the triangle inequality holds in all cases.

Contextual Notes

Participants acknowledge limitations related to list size and overlap, suggesting that these factors may affect the accuracy of the distance metric without providing a definitive resolution.

hoffmann
Messages
65
Reaction score
0
Say I have two lists, List1 and List2 containing elements such as words. Some words are common two both List1 and List2. I want to create a distance metric that tells me how far apart the two lists are based on a similarity "score". The similarity score and distance metric are as follows:

Similarity score: Intersection(List1,List2) / Union(List1,List2)
Distance = 1 - Similarity Score

In other words, the similarity score is just the percentage overlap between the two lists and the distance is 0 when the two lists are the same. Say I generalize this to n lists and I calculate the distances between lists (a symmetric matrix of distances). My question is, is this distance formula valid? In other words, does it satisfy the triangle inequality? How do I check this?
 
Physics news on Phys.org
It's not very hard to prove that it satisfies the triangle inequality if all list sizes are equal. Let N = size of the list, x, y & z = pairwise overlaps between three lists. You must have that x >= y+z-N. Distance between lists that give you x is 1-x/(2N-x). With some algebra you can conclude that the triangle inequality is satisfied for all valid x, y & z.

It seems to be true for arbitrary list sizes too, but I don't see an easy way to prove that right now ...
 
yes, that is also not coming to me. if it helps, i know that this scoring metric is skewed when the sizes of the lists are large and their overlaps are small (only in comparison to the sizes of the lists, but still pretty large in comparison to smaller list/overlap sizes). it seems that i need to correct the metric for the size of the overlap to get a more accurate value. also the same sort of situation when the sizes of the lists are small and their overlaps are ~0. finally, when one of the lists is big and the other small (or vice versa), i could also run into a skewed distance value.
 
http://pnylab.com/pny/papers/nmet/nmet/index.html
 
beautiful! thank you!
 

Similar threads

  • · Replies 2 ·
Replies
2
Views
1K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 5 ·
Replies
5
Views
4K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 8 ·
Replies
8
Views
3K
  • · Replies 19 ·
Replies
19
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 5 ·
Replies
5
Views
4K
  • · Replies 5 ·
Replies
5
Views
2K