
Evaluating similarity between two subsets

  1. Apr 29, 2009 #1
    I have the following problem that I can't figure out.
    I have a set of IDs which I pseudo-randomly split into 2 subsets, A and B. Let's say for the sake of simplicity I did it only twice, so I have subsets A & B and A` & B`. The sizes are the same across the different splittings: not the sizes of A and B, but the sizes of A and A`, and of B and B`.
    What I need to know is whether there is a standard way of telling how different those splits are. I just want to make sure that subset A does not differ from A` or B` by just a couple of IDs (the sizes of A and B differ by only 1).
  3. Apr 30, 2009 #2
    Well, if you don't have a distance measure between pairs of IDs (if two IDs are either equal or unequal, no shades of gray), then your problem is simple. Define the similarity between two subsets to be the size of their intersection. If you want to compare subsets of different sizes, one possibility is to define the similarity of A and B to be 2 |A ∩ B| / (|A| + |B|).
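
A minimal sketch of the normalized overlap 2 |A ∩ B| / (|A| + |B|) described above, in Python. The function name `dice_similarity`, the sample IDs, and the choice to treat two empty subsets as identical are assumptions made for illustration; none of them come from the thread. This ratio is commonly known as the Sørensen-Dice coefficient.

```python
def dice_similarity(a, b):
    """Return 2 |A ∩ B| / (|A| + |B|) for two collections of IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty subsets are treated as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical volunteer IDs, for illustration only.
A = {101, 102, 103, 104}
A_prime = {101, 102, 105, 106}

# Two shared IDs out of 4 + 4, so 2 * 2 / 8.
print(dice_similarity(A, A_prime))  # 0.5
```

The value is 1 when the two subsets are identical and 0 when they share no IDs.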
  4. Apr 30, 2009 #3
    Thank you for your reply!
    I don't think I was clear about my problem, but what you suggested is similar to what I tried to do.
    The IDs are all different from each other (we do clinical research, and the IDs are the identifying numbers of our volunteers).
    What I did was assign the number 1 to an ID if it is in subgroup A and -1 if it is in B. Then I did the same for the second selection: assign 1 if the ID is in subgroup A1 and -1 if it is in B1.
    So I get vectors for the 2 different selections, let's name them C and C1, consisting of 1's and -1's.
    Then I just take the scalar product and divide by the total number of IDs.
    In the extreme cases, if A is the same as A1 and B is the same as B1, we get 1; and if A is the same as B1 and B is the same as A1, we get -1. The closer this value is to 0, the more they differ.
    Now, is there some widely accepted name for this kind of calculation? Like a similarity or correlation coefficient? Or something else?
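
The calculation in the post above can be sketched in Python as follows. The function name `split_agreement`, the validation checks, and the sample IDs are assumptions made for illustration, not part of the original post. Since every entry is +1 or -1, the score equals (number of IDs placed on the same side in both splits minus number placed on opposite sides) divided by the total number of IDs; because both vectors have the same length, it also coincides with their cosine similarity.

```python
def split_agreement(a, b, a1, b1):
    """Score two splits of the same ID set, coding A/A1 as +1 and B/B1 as -1.

    Returns the scalar product of the two +/-1 vectors divided by the
    total number of IDs, a value in [-1, 1].
    """
    a, b, a1, b1 = set(a), set(b), set(a1), set(b1)
    ids = a | b
    # Assumption: both splits must partition the same non-empty set of IDs.
    if not ids or ids != (a1 | b1) or (a & b) or (a1 & b1):
        raise ValueError("both splits must partition the same non-empty ID set")

    total = 0
    for i in ids:
        c = 1 if i in a else -1    # entry of vector C
        c1 = 1 if i in a1 else -1  # entry of vector C1
        total += c * c1
    return total / len(ids)


# Hypothetical IDs, for illustration only.
A, B = {1, 2, 3}, {4, 5, 6}
A1, B1 = {1, 2, 4}, {3, 5, 6}

# IDs 1, 2, 5, 6 agree and IDs 3, 4 disagree, so the score is (4 - 2) / 6.
score = split_agreement(A, B, A1, B1)
print(round(score, 3))  # 0.333
```

Identical splits give 1, fully swapped splits give -1, and a score of 0 means exactly half of the IDs landed on the same side in both splits.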