How to compare two data sets with multiple samples

SUMMARY

This discussion focuses on comparing two data sets, A and B, each containing five samples of integers taken at different times and locations. One recommended approach is Analysis of Variance (ANOVA), which tests whether the means of the subsets are statistically different. Averaging the sample means is a quick method and gives an unbiased estimate, but when sample sizes are uneven it weights observations unequally and has a larger variance than pooling all observations into one list. The discussion emphasizes the importance of understanding the experimental design and the specific questions being addressed before selecting a statistical method.

PREREQUISITES
  • Understanding of Analysis of Variance (ANOVA)
  • Basic knowledge of statistical concepts such as mean and standard deviation
  • Familiarity with normal distribution and its properties
  • Concept of degrees of freedom in statistical testing
NEXT STEPS
  • Research the implementation of ANOVA in statistical software like R or Python's SciPy library
  • Learn about matched-pairs t-tests and their application in comparing sample means
  • Explore regression models with fixed effects for analyzing data with multiple samples
  • Investigate the implications of sample size on statistical analysis and variance estimation
USEFUL FOR

Statisticians, data analysts, researchers comparing multiple data sets, and anyone interested in understanding the nuances of statistical testing and data interpretation.

wahaj
I have two data sets A and B which correspond to two different settings of the system. Both sets contain 5 separate lists of integers; I took 5 samples, at different points in time and location, for each set to reduce the random error in the data. How would I go about comparing sets A and B? If I only had one sample in sets A and B, I could simply calculate the mean and standard deviation and compare the data. This is pretty much the extent of statistics I've learned. But I have multiple samples in each set, so should I combine all the samples into one big list and calculate the mean and SD? The number of integers in each sample may not be the same.

I'm not looking for something very accurate here. Something quick and dirty will do.
 
You can combine all the distributions into one for the property you want to compare.
i.e. each of the 5 samples in a set will have a mean ... from those you can find the mean of the means, and a standard deviation of the means ... which will let you compare the two sets.
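
As a rough sketch of the mean-of-means comparison described above, assuming each set is stored as a list of samples (each sample a list of integers). The data values and the names `mean_of_means`, `set_a`, and `set_b` are made up for illustration and are not from the thread:

```python
from statistics import mean, stdev

def mean_of_means(samples):
    """Return (mean of the sample means, standard deviation of the sample means)."""
    means = [mean(s) for s in samples]
    return mean(means), stdev(means)

# Hypothetical data: a few short samples per setting (not the poster's data).
set_a = [[10, 12, 11], [9, 13, 12, 10], [11, 11, 12]]
set_b = [[14, 15, 13], [16, 14, 15, 15], [13, 15, 14]]

for name, samples in (("A", set_a), ("B", set_b)):
    m, s = mean_of_means(samples)
    # The relative SD (s / m) is one quick way to judge the spread between samples.
    print(f"set {name}: mean of means = {m:.2f}, SD of means = {s:.2f}, SD/mean = {s / m:.1%}")
```

Note that every sample counts equally here regardless of how many integers it contains.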
 
This is probably the simplest way to go about doing this. Thanks for the help
 
Simple is good. Unless you are planning to do something subsequent with your system - in that case good might be a better choice.
Since we do not have much to go on, let's start with ANOVA - analysis of variance, and an assumption that the data fit a normal distribution.

Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data.

It will tell you whether the means of subset A and subset B are consistent with a single population. Or not. It is hard to tell what you did because you said 'settings', which is plural. Your experimental design and the answers you need impact how you analyze the data. Big time.
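
To make the ANOVA idea concrete, here is a minimal sketch of the one-way F statistic computed by hand. It assumes the observations of each set are pooled into one group and are roughly normal; the function name `one_way_anova_f` and the data are hypothetical. SciPy's `scipy.stats.f_oneway` carries out the same test and also returns a p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical pooled observations for the two settings.
pooled_a = [10, 12, 11, 9, 13, 12, 10, 11, 11, 12]
pooled_b = [14, 15, 13, 16, 14, 15, 15, 13, 15, 14]

f_stat, df_b, df_w = one_way_anova_f([pooled_a, pooled_b])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the F distribution with those degrees of freedom suggests the group means differ by more than chance. With only two groups this is equivalent to a two-sample t-test with equal variances.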

If you are just playing, then @Simon Bridge gave you the best answer.
 
The answer partially depends on the degrees of freedom your samples have. If each sample is large, you can perform a separate matched-pairs t-test on each sample. If the results all point to the same conclusion, then you're home. But suppose 3 point in one direction and 2 in the other. Then you may conclude that "time" and "location" matter. If you don't have sufficient DF, then my advice is to estimate a regression model with "sample fixed effects."
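
One possible reading of the "sample fixed effects" suggestion, sketched under a strong assumption: sample i of set A and sample i of set B were taken at the same time and location, so each pair forms a block with its own baseline. With a single A-versus-B indicator, the fixed-effects estimate can be obtained by demeaning the values and the indicator within each block. The function name `fixed_effects_shift` and the data are hypothetical.

```python
from statistics import mean

def fixed_effects_shift(blocks):
    """Estimate the B - A shift with one fixed effect per block.

    blocks: list of (a_values, b_values) pairs, where each pair is assumed
    to share one time/location (an assumption, not stated in the thread).
    """
    num = 0.0
    den = 0.0
    for a_vals, b_vals in blocks:
        y = list(a_vals) + list(b_vals)
        d = [0.0] * len(a_vals) + [1.0] * len(b_vals)  # 1 marks setting B
        y_bar = mean(y)
        d_bar = mean(d)
        # Demeaning within the block removes the block's own baseline.
        for yi, di in zip(y, d):
            num += (di - d_bar) * (yi - y_bar)
            den += (di - d_bar) ** 2
    return num / den

# Hypothetical blocks with different baselines and unequal sample sizes.
blocks = [
    ([10, 12, 11], [14, 15, 13]),
    ([20, 22], [24, 25, 26]),
]
print(f"estimated shift from A to B: {fixed_effects_shift(blocks):.2f}")
```

Because each block is compared only with itself, differences between times and locations do not leak into the estimated effect of the setting.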
 
When I calculated the means of each sample, they were pretty close to each other. Close enough for me anyway; the standard deviation of the sample means was about 10% of the mean for set A and 6.5% for set B. I suspect that if I took more samples in each set I could get even better standard deviations. The mean of means thing worked well enough for me since I don't want to spend several hours learning and then programming a complicated statistics test for my simple comparison.
 
I know the problem is solved, but I just want to add that taking the mean of means is a bad idea when the samples don't have the same number of observations. Intuitively, if you have one group with one observation and the other ##4## groups have ##1000## observations each, then this ##1## observation will receive far more weight than any other single observation, which is not what you want.

Mathematically, if each ##X_i## has a normal distribution ##N(\mu,\sigma^2)## and the observations are independent, then the mean ##\overline{X}## of ##n## observations has distribution ##N(\mu, \sigma^2/n)##.
So if we take the mean of ##5## means, then this will certainly be an unbiased estimator, no problem there. But the variance of the mean of ##5## means is
$$\frac{\sigma^2}{25}\sum_{i=1}^{5} \frac{1}{n_i}$$
where ##n_i## are the sample sizes of the ##5## groups.

Compare this with the variance if we just put all the groups together:
$$\frac{\sigma^2}{\sum_{i=1}^{5} n_i}$$
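This variance is never larger than that of the mean of means, with equality exactly when ##n_1=...=n_5##. This follows from the Cauchy-Schwarz inequality, which gives ##\left(\sum_{i=1}^{5} n_i\right)\left(\sum_{i=1}^{5} \frac{1}{n_i}\right) \geq 25##.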

So in order to get more precise estimates, it is better to just put all the observations together in one big set (and perhaps introduce a blocking factor) than taking a mean of means.
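
The two variance formulas above can be compared numerically. This is a small sketch with hypothetical function names, taking ##\sigma^2 = 1## by default and assuming every observation has the same variance:

```python
def var_mean_of_means(sizes, sigma2=1.0):
    """Variance of the mean of the group means: sigma^2 / k^2 * sum(1 / n_i)."""
    k = len(sizes)
    return sigma2 / k ** 2 * sum(1.0 / n for n in sizes)

def var_pooled_mean(sizes, sigma2=1.0):
    """Variance of the mean of all observations pooled: sigma^2 / sum(n_i)."""
    return sigma2 / sum(sizes)

# One tiny group next to four large ones, then five equal groups.
for sizes in ([1, 1000, 1000, 1000, 1000], [60, 60, 60, 60, 60]):
    print(sizes,
          "mean of means:", var_mean_of_means(sizes),
          "pooled:", var_pooled_mean(sizes))
```

With very uneven sizes the mean of means is far noisier than the pooled mean; with equal sizes the two variances coincide.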
 
So basically the problem lies in the uneven sample sizes, which is what I was worried about too when I made my original post. But as I found when I started working on the data, the weights are not an issue. All the samples I collected contained approximately 60 numbers each, and there were only small differences in the sizes of the samples, if any occurred at all.
 
Hey wahaj.

It might help to understand what property you are trying to assess.

Means, medians and things like variances have a very good interpretation but it may not necessarily be what you are looking for.

To get this answer you have to understand what questions you are trying to answer, independently of any statistics, and then formulate the appropriate test statistics, collection techniques, and processing of information to answer them.
 