# Evaluating similarity between two subsets

I have the following problem that I can't figure out.
I have a set of IDs which I pseudo-randomly split into 2 subsets, A and B. Let's say for the sake of simplicity I did it only twice, so I have subsets A & B from the first split and A1 & B1 from the second. The sizes are the same across the two splits: not the sizes of A and B, but the sizes of A and A1, and of B and B1.
What I need to know is whether there is a standard way of telling how different those splits are. I just want to make sure that subset A does not differ from A1 or B1 by just a couple of IDs (the sizes of A and B differ by only 1).

## Answers and Replies

Well, if you don't have a distance measure between pairs of IDs (two IDs are either equal or unequal, with no shades of gray), then your problem is simple. Define the similarity between two subsets to be the size of their intersection. If you want to compare subsets of different sizes, one possibility is to define the similarity of A and B to be 2 |A ∩ B| / (|A| + |B|).
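For illustration, here is a minimal Python sketch of the normalized overlap suggested above, 2 |A ∩ B| / (|A| + |B|). The function name `overlap_similarity` and the example IDs are made up for this sketch, and treating two empty subsets as identical is an assumption.

```python
def overlap_similarity(a, b):
    """Return 2 * |A ∩ B| / (|A| + |B|) for two collections of IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty subsets count as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical volunteer IDs, for illustration only.
first = {101, 102, 103}
second = {102, 103, 104}
print(overlap_similarity(first, second))  # 2 shared IDs out of 3 + 3
```

The value is 1 when the two subsets are identical and 0 when they share no IDs.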

Thank you for your reply!
I don't think I was clear about my problem, but what you suggested is close to what I tried to do.
The IDs are all different from each other (we do clinical research, and the IDs are the identifying numbers of our volunteers).
What I did was assign the number 1 to an ID if it is in subgroup A and -1 if it is in B. Then I did the same for the second selection: assign 1 if the ID is in subgroup A1 and -1 if it is in B1.
So I get a vector for each of the 2 selections, let's name them C and C1, consisting of 1's and -1's.
Then I just take the scalar product of C and C1 and divide by the total number of IDs.
In the extreme cases, if A is the same as A1 and B is the same as B1, we get 1; and if A is the same as B1 and B is the same as A1, we get -1. The closer this value is to 0, the more the two splits differ.
Now, is there some widely accepted name for this kind of calculation? Like a similarity or correlation coefficient? Or something else?
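The ±1 encoding described above could be sketched in Python roughly as follows. The function name `split_agreement` and the example IDs are hypothetical; the sketch assumes every ID falls in exactly one subgroup per split, so only A and A1 need to be passed in. Note that the result equals (agreements − disagreements) / n, which is 2 × (fraction of IDs placed on the same side in both splits) − 1.

```python
def split_agreement(ids, a, a1):
    """Scalar product of the two +1/-1 membership vectors, divided by n.

    ids: all IDs; a: subgroup A of the first split; a1: subgroup A1 of the
    second split. IDs not in a (or a1) are assumed to be in B (or B1).
    """
    ids = list(ids)
    a, a1 = set(a), set(a1)
    c = [1 if i in a else -1 for i in ids]    # vector C for the first split
    c1 = [1 if i in a1 else -1 for i in ids]  # vector C1 for the second split
    return sum(x * y for x, y in zip(c, c1)) / len(ids)


# Hypothetical example: 7 IDs, so the two subgroup sizes differ by 1.
ids = range(1, 8)
a = {1, 2, 3, 4}
print(split_agreement(ids, a, {1, 2, 3, 4}))  # same split
print(split_agreement(ids, a, {5, 6, 7}))     # A1 is the old B
print(split_agreement(ids, a, {1, 2, 5, 6}))  # near-random overlap
```

With an odd number of IDs the value cannot be exactly 0, since the agreements and disagreements cannot balance.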