# Evaluating similarity between two subsets

I have the following problem that I can't figure out.
I have a set of IDs, which I pseudo-randomly split into two subsets A and B. Let's say, for the sake of simplicity, that I did it only twice, so I have subsets A1 & B1 and A2 & B2. The sizes in the different splittings are the same: not the sizes of A and B, but the sizes of A1 and A2, and of B1 and B2.
What I need to know is whether there is a standard way of telling how different those splittings are. I just want to make sure that subset A1 does not differ from A2 or B2 by just a couple of IDs (the sizes of A and B differ by only 1).

## Answers and Replies

Well, if you don't have a distance measure between pairs of IDs (if two IDs are either equal or unequal, no shades of gray), then your problem is simple. Define the similarity between two subsets to be the size of their intersection. If you want to compare subsets of different sizes, one possibility is to define the similarity of A and B to be 2 |A ∩ B| / (|A| + |B|), which is 1 when the two subsets are identical and 0 when they share no IDs.
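For what it's worth, the normalized measure above (commonly known as the Sørensen–Dice coefficient) takes only a few lines to compute. The following is a minimal Python sketch, not a definitive implementation: the function name `dice_similarity` and the ten example IDs are made up for illustration, and returning 1.0 for two empty subsets is my own assumption, since the formula is undefined in that case.

```python
def dice_similarity(a, b):
    """Return 2 * |A ∩ B| / (|A| + |B|) for two collections of IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: treat two empty subsets as identical (formula is 0/0 here).
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical example: two different splits of the IDs 1..10.
a1, b1 = {1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}
a2, b2 = {1, 2, 3, 4, 6}, {5, 7, 8, 9, 10}

print(dice_similarity(a1, a2))  # 0.8 -> the two A subsets differ by one ID
print(dice_similarity(a1, b2))  # 0.2 -> A1 shares only one ID with B2
```

A value close to 1 (or close to 0 against the opposite subset) would mean the second split is nearly a copy of the first, which is the situation the question wants to rule out.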