Checking Triangle Inequality for List Similarity Metric

Click For Summary
SUMMARY

The discussion centers on the validity of a distance metric for measuring similarity between two lists, defined as Distance = 1 - (Intersection(List1, List2) / Union(List1, List2)). This metric is shown to satisfy the triangle inequality under certain conditions, specifically when list sizes are equal. The author notes that while the triangle inequality appears to hold for arbitrary list sizes, proving this remains challenging. Additionally, the metric may become skewed when list sizes differ significantly or when overlaps are small relative to list sizes, necessitating adjustments for accurate distance calculations.

PREREQUISITES
  • Understanding of set operations: Intersection and Union
  • Familiarity with distance metrics in mathematics
  • Knowledge of triangle inequality in metric spaces
  • Basic algebra for manipulating distance formulas
NEXT STEPS
  • Research the properties of distance metrics in metric spaces
  • Explore methods for correcting skewed distance metrics
  • Learn about advanced set theory concepts relevant to list comparisons
  • Investigate algorithms for calculating pairwise distances in large datasets
USEFUL FOR

Data scientists, mathematicians, and software developers interested in list similarity metrics and their applications in data analysis and machine learning.

hoffmann
Messages
65
Reaction score
0
Say I have two lists, List1 and List2 containing elements such as words. Some words are common two both List1 and List2. I want to create a distance metric that tells me how far apart the two lists are based on a similarity "score". The similarity score and distance metric are as follows:

Similarity score: Intersection(List1,List2) / Union(List1,List2)
Distance = 1 - Similarity Score

In other words, the similarity score is just the percentage overlap between the two lists and the distance is 0 when the two lists are the same. Say I generalize this to n lists and I calculate the distances between lists (a symmetric matrix of distances). My question is, is this distance formula valid? In other words, does it satisfy the triangle inequality? How do I check this?
 
Physics news on Phys.org
It's not very hard to prove that it satisfies the triangle inequality if all list sizes are equal. Let N = size of the list, x, y & z = pairwise overlaps between three lists. You must have that x >= y+z-N. Distance between lists that give you x is 1-x/(2N-x). With some algebra you can conclude that the triangle inequality is satisfied for all valid x, y & z.

It seems to be true for arbitrary list sizes too, but I don't see an easy way to prove that right now ...
 
yes, that is also not coming to me. if it helps, i know that this scoring metric is skewed when the sizes of the lists are large and their overlaps are small (only in comparison to the sizes of the lists, but still pretty large in comparison to smaller list/overlap sizes). it seems that i need to correct the metric for the size of the overlap to get a more accurate value. also the same sort of situation when the sizes of the lists are small and their overlaps are ~0. finally, when one of the lists is big and the other small (or vice versa), i could also run into a skewed distance value.
 
http://pnylab.com/pny/papers/nmet/nmet/index.html
 
beautiful! thank you!
 

Similar threads

  • · Replies 2 ·
Replies
2
Views
1K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 5 ·
Replies
5
Views
4K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 19 ·
Replies
19
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 5 ·
Replies
5
Views
4K
  • · Replies 5 ·
Replies
5
Views
2K