Evaluating similarity between two subsets

bagatur · Apr 29, 2009

I have the following problem that I can't figure out.
I have a set of IDs which I pseudo-randomly split into two subsets, A and B. Let's say, for the sake of simplicity, that I did it only twice, so I have subsets A & B and A` & B`. The sizes match across the splittings: not the sizes of A and B, but the sizes of A and A`, and of B and B`.
What I need to know is whether there is a standard way of telling how different those distributions of IDs are. I just want to make sure that subset A does not differ from A` or B` by just a couple of IDs (the sizes of A and B differ by only 1).

mXSCNT · Apr 30, 2009

Well, if you don't have a distance measure between pairs of IDs (if two IDs are either equal or unequal, no shades of gray) then your problem is simple. Define the similarity between two subsets to be the size of their intersection. If you want to compare subsets of different sizes, one possibility is to define the similarity of A and B to be 2 |A ∩ B| / (|A| + |B|).
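
For illustration, the normalized overlap 2 |A ∩ B| / (|A| + |B|) could be computed roughly as below. This is only a minimal sketch: the function name `overlap_similarity`, the example IDs, and the choice to treat two empty subsets as identical are assumptions made for the example, not something from the thread.

```python
def overlap_similarity(a, b):
    """Return 2 * |A ∩ B| / (|A| + |B|) for two collections of IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty subsets are treated as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical IDs, for illustration only.
A = {101, 102, 103, 104}
A_prime = {101, 102, 105, 106}

# Two shared IDs out of 4 + 4: 2 * 2 / 8 = 0.5
print(overlap_similarity(A, A_prime))
```

The value is 1 when the two subsets are identical and 0 when they share no IDs.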

bagatur · Apr 30, 2009

Thank you for your reply!
I don't think I was clear about my problem, but what you suggested is close to what I tried to do.
The IDs are all different from each other (we do clinical research, and the IDs are the identifying numbers of our volunteers).
What I did was assign the number 1 to an ID if it is in subgroup A and -1 if it is in B. Then I did the same for the second selection: 1 if the ID is in subgroup A1 and -1 if it is in B1.
So I get one vector for each of the 2 selections, let's name them C and C1, consisting of 1's and -1's.
Then I just take their scalar product and divide by the total number of IDs.
In the extreme cases, if A is the same as A1 and B is the same as B1, we get 1; and if A is the same as B1 and B is the same as A1, we get -1. The closer this value is to 0, the more they differ.
Now, is there a widely accepted name for this kind of calculation? Like a similarity or correlation coefficient? Or something else?
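
The calculation described in this post can be sketched as follows. The function name `split_agreement` and the seven example IDs are hypothetical, chosen only to make the arithmetic visible. Two observations that follow directly from the definition: the result equals 2p - 1, where p is the fraction of IDs that land on the same side in both splits, and because each ±1 vector has length sqrt(N), it is also the cosine similarity of C and C1.

```python
def split_agreement(ids, a, a1):
    """Normalized scalar product of the +1/-1 membership vectors of two splits.

    ids: all IDs
    a:   IDs placed in A in the first split (the rest are in B)
    a1:  IDs placed in A1 in the second split (the rest are in B1)
    """
    ids = list(ids)
    a, a1 = set(a), set(a1)
    c = [1 if i in a else -1 for i in ids]    # vector C
    c1 = [1 if i in a1 else -1 for i in ids]  # vector C1
    return sum(x * y for x, y in zip(c, c1)) / len(ids)


# Hypothetical example: 7 volunteers, group sizes 4 and 3.
ids = range(1, 8)
A = {1, 2, 3, 4}    # so B  = {5, 6, 7}
A1 = {1, 2, 5, 6}   # so B1 = {3, 4, 7}

# IDs 1, 2 and 7 agree, the other four disagree: (3 - 4) / 7
print(split_agreement(ids, A, A1))
```

Identical splits give 1 and fully swapped splits give -1, matching the extreme cases described above.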
