Evaluating similarity between two subsets

  • Context: Undergrad 
  • Thread starter: bagatur
  • Tags: Subsets
SUMMARY

The discussion focuses on evaluating the similarity between two subsets of unique identifiers (IDs) derived from clinical research volunteers. The user proposes a method of assigning values to IDs based on their membership in subsets A, B, A', and B', creating vectors C and C1. The similarity is calculated using the scalar product of these vectors divided by the total number of IDs, yielding values between -1 and 1 to indicate similarity. The user seeks a widely accepted term for this calculation, which resembles a correlation coefficient.

PREREQUISITES
  • Understanding of set theory and subset operations
  • Familiarity with vector mathematics and scalar products
  • Knowledge of clinical research data management
  • Basic concepts of similarity measures in statistics
NEXT STEPS
  • Research "Jaccard similarity coefficient" for comparing set similarities
  • Learn about "Cosine similarity" and its applications in vector analysis
  • Explore "Pearson correlation coefficient" for understanding correlation measures
  • Investigate "Statistical methods for clinical research data analysis"
USEFUL FOR

Researchers in clinical studies, data analysts, and statisticians interested in evaluating the similarity of subsets within datasets.

bagatur
I have the following problem that I can't figure out.
I have a set of IDs which I pseudo-randomly split into 2 subsets, A and B. Let's say, for the sake of simplicity, that I did it only twice, so I have subsets A & B and A` & B`. The sizes match across the two splittings: not the sizes of A and B, but the sizes of A and A`, and of B and B`.
What I need to know is whether there is a standard way of telling how different those splittings are. I just want to make sure that subset A does not differ from A` or B` by just a couple of IDs (the sizes of A and B differ by only 1).
 
Well, if you don't have a distance measure between pairs of IDs (if two IDs are either equal or unequal, no shades of gray), then your problem is simple. Define the similarity between two subsets to be the size of their intersection. If you want to compare subsets of different sizes, one possibility is to define the similarity of A and B to be 2|A ∩ B| / (|A| + |B|).
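
As a rough illustration of the formula in this reply, here is a minimal Python sketch. The function name `overlap_similarity` and the sample IDs are made up for the example, and treating two empty sets as identical is an assumption, not something stated in the thread. This ratio is commonly known as the Sørensen–Dice coefficient.

```python
def overlap_similarity(a, b):
    """Return 2 * |A ∩ B| / (|A| + |B|) for two collections of IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty subsets are treated as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical volunteer IDs, invented for illustration only.
A = {101, 102, 103, 104}
A_prime = {101, 102, 105, 106}

# Two of the four IDs are shared, so the value is 2*2 / (4+4).
print(overlap_similarity(A, A_prime))  # 0.5
```

The value is 1 when the two subsets are identical and 0 when they share no IDs.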
 
Thank you for your reply!
I don't think I was clear about my problem, but what you suggested is similar to what I tried to do.
The IDs are all different from each other (we do clinical research, and the IDs are the identifying numbers of our volunteers).
What I did was assign the number 1 to an ID if it is in subgroup A and -1 if it is in B. Then I did the same for the second selection: assign 1 if the ID is in subgroup A1 and -1 if it is in B1.
So I get vectors for the 2 different selections, let's name them C and C1, consisting of 1's and -1's.
Then I just take the scalar product and divide by the total number of IDs.
In the extreme cases, if A is the same as A1 and B is the same as B1, we get 1; and if A is the same as B1 and B is the same as A1, we get -1. The closer this value is to 0, the more they differ.
Now, is there some widely accepted name for this kind of calculation? Like a similarity or correlation coefficient? Or something else?
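
The calculation described in this post can be sketched in Python as follows. The function name `split_agreement` and the seven sample IDs are hypothetical, and the sketch assumes every ID belongs to exactly one subgroup in each selection. Since each product is +1 when an ID lands on the same side in both selections and -1 otherwise, the result equals (agreements - disagreements) / n, or equivalently 1 - 2 * (number of IDs that changed sides) / n.

```python
def split_agreement(ids, a, a1):
    """Mean product of the +1/-1 membership vectors C and C1.

    ids: all IDs; a: subgroup A of the first selection;
    a1: subgroup A1 of the second selection.
    Any ID not in a (or a1) is assumed to be in B (or B1).
    """
    ids = list(ids)
    a, a1 = set(a), set(a1)
    c = [1 if i in a else -1 for i in ids]    # vector C
    c1 = [1 if i in a1 else -1 for i in ids]  # vector C1
    return sum(x * y for x, y in zip(c, c1)) / len(ids)


# Hypothetical example: 7 IDs, subgroup sizes 4 and 3 (differ by 1).
ids = range(1, 8)
A = {1, 2, 3, 4}   # so B  = {5, 6, 7}
A1 = {1, 2, 5, 6}  # so B1 = {3, 4, 7}

# IDs 1, 2, 7 agree and IDs 3, 4, 5, 6 disagree: (3 - 4) / 7.
print(split_agreement(ids, A, A1))
```

In terms of standard names, this quantity is 2 * SMC - 1, where SMC is the simple matching coefficient (the fraction of IDs assigned to the same side in both selections). When the subgroups are nearly equal in size, so that C and C1 each average close to 0, it is also close to the Pearson correlation of the two vectors.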
 
