Distance Metric

1. Aug 17, 2010

hoffmann

Say I have two lists, List1 and List2 containing elements such as words. Some words are common two both List1 and List2. I want to create a distance metric that tells me how far apart the two lists are based on a similarity "score". The similarity score and distance metric are as follows:

Similarity score: Intersection(List1,List2) / Union(List1,List2)
Distance = 1 - Similarity Score

In other words, the similarity score is just the percentage overlap between the two lists and the distance is 0 when the two lists are the same. Say I generalize this to n lists and I calculate the distances between lists (a symmetric matrix of distances). My question is, is this distance formula valid? In other words, does it satisfy the triangle inequality? How do I check this?

2. Aug 17, 2010

hamster143

It's not very hard to prove that it satisfies the triangle inequality if all list sizes are equal. Let N = size of the list, x, y & z = pairwise overlaps between three lists. You must have that x >= y+z-N. Distance between lists that give you x is 1-x/(2N-x). With some algebra you can conclude that the triangle inequality is satisfied for all valid x, y & z.

It seems to be true for arbitrary list sizes too, but I don't see an easy way to prove that right now ...

3. Aug 17, 2010

hoffmann

yes, that is also not coming to me. if it helps, i know that this scoring metric is skewed when the sizes of the lists are large and their overlaps are small (only in comparison to the sizes of the lists, but still pretty large in comparison to smaller list/overlap sizes). it seems that i need to correct the metric for the size of the overlap to get a more accurate value. also the same sort of situation when the sizes of the lists are small and their overlaps are ~0. finally, when one of the lists is big and the other small (or vice versa), i could also run into a skewed distance value.

4. Aug 17, 2010

hamster143

5. Aug 17, 2010

hoffmann

beautiful! thank you!!