Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Distance Metric

  1. Aug 17, 2010 #1
    Say I have two lists, List1 and List2 containing elements such as words. Some words are common two both List1 and List2. I want to create a distance metric that tells me how far apart the two lists are based on a similarity "score". The similarity score and distance metric are as follows:

    Similarity score: Intersection(List1,List2) / Union(List1,List2)
    Distance = 1 - Similarity Score

    In other words, the similarity score is just the percentage overlap between the two lists and the distance is 0 when the two lists are the same. Say I generalize this to n lists and I calculate the distances between lists (a symmetric matrix of distances). My question is, is this distance formula valid? In other words, does it satisfy the triangle inequality? How do I check this?
  2. jcsd
  3. Aug 17, 2010 #2
    It's not very hard to prove that it satisfies the triangle inequality if all list sizes are equal. Let N = size of the list, x, y & z = pairwise overlaps between three lists. You must have that x >= y+z-N. Distance between lists that give you x is 1-x/(2N-x). With some algebra you can conclude that the triangle inequality is satisfied for all valid x, y & z.

    It seems to be true for arbitrary list sizes too, but I don't see an easy way to prove that right now ...
  4. Aug 17, 2010 #3
    yes, that is also not coming to me. if it helps, i know that this scoring metric is skewed when the sizes of the lists are large and their overlaps are small (only in comparison to the sizes of the lists, but still pretty large in comparison to smaller list/overlap sizes). it seems that i need to correct the metric for the size of the overlap to get a more accurate value. also the same sort of situation when the sizes of the lists are small and their overlaps are ~0. finally, when one of the lists is big and the other small (or vice versa), i could also run into a skewed distance value.
  5. Aug 17, 2010 #4
  6. Aug 17, 2010 #5
    beautiful! thank you!!
Share this great discussion with others via Reddit, Google+, Twitter, or Facebook