How to compare two data sets with multiple samples

Discussion Overview

The discussion centers around comparing two data sets, A and B, each containing multiple samples of integers taken under different conditions. Participants explore various statistical methods for comparison, including means, standard deviations, and ANOVA, while addressing the implications of sample sizes and distributions.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests combining all samples into a single list to calculate the mean and standard deviation for comparison.
  • Another proposes calculating the mean of the means from each sample set, which could provide a quick comparison.
  • A different viewpoint emphasizes the importance of using ANOVA to determine if the samples come from the same population, highlighting the need for proper experimental design.
  • Concerns are raised about the degrees of freedom and the appropriateness of using matched-pairs t-tests depending on sample sizes.
  • One participant argues that the mean of means approach is flawed when sample sizes are uneven: the estimate stays unbiased, but small samples receive too much weight and the result is less precise than pooling all observations.
  • Another participant notes that despite the potential issues with sample sizes, their own calculations showed close means and acceptable standard deviations.
  • A later reply stresses the necessity of understanding the specific property being assessed before choosing statistical methods.

Areas of Agreement / Disagreement

Participants express differing opinions on the best approach to compare the data sets, particularly regarding the implications of sample sizes and the validity of the mean of means method. There is no consensus on a single method, and the discussion remains unresolved.

Contextual Notes

Participants highlight limitations related to sample sizes and the assumptions underlying statistical methods, such as normal distribution and the impact of uneven sample sizes on variance calculations.

wahaj
I have two data sets A and B which correspond to two different settings of the system. Both sets contain 5 separate lists of integers; I took 5 samples, at different points in time and location, for each set to reduce the random error in the data. How would I go about comparing sets A and B? If I only had one sample in sets A and B, I could simply calculate the mean and standard deviation and compare the data. This is pretty much the extent of statistics I've learned. But I have multiple samples in each set, so should I combine all the samples into one big list and calculate the mean and SD? The number of integers in each sample may not be the same.

I'm not looking for something very accurate here. Something quick and dirty will do.
 
You can combine all the distributions into one for the property you want to compare.
i.e. each of the 5 samples in a set will have a mean ... from that you can find the mean of the means, and a standard deviation of the means ... which will let you compare the means.
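
The mean-of-means comparison described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name `mean_of_means` and the numbers in `set_a` and `set_b` are made up for the example and are not the original poster's data.

```python
from statistics import mean, stdev

def mean_of_means(samples):
    """Return (mean of the sample means, standard deviation of the sample means)."""
    means = [mean(s) for s in samples]
    return mean(means), stdev(means)

# Hypothetical data: five samples per setting; sample sizes need not match.
set_a = [[10, 12, 11], [9, 13, 12, 10], [11, 11, 12], [10, 14], [12, 12, 9, 11]]
set_b = [[15, 16, 14], [17, 15, 16, 16], [14, 15], [16, 18, 17], [15, 15, 16]]

mean_a, sd_a = mean_of_means(set_a)
mean_b, sd_b = mean_of_means(set_b)
print(f"A: {mean_a:.2f} +/- {sd_a:.2f}")
print(f"B: {mean_b:.2f} +/- {sd_b:.2f}")
```

If the two means differ by much more than the standard deviations of the means, the settings probably differ; this is a rough check, not a formal test.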
 
This is probably the simplest way to go about doing this. Thanks for the help
 
Simple is good. Unless you are planning to do something subsequent with your system - in that case good might be a better choice.
Since we do not have much to go on, let's start with ANOVA - analysis of variance, and an assumption that the data fit a normal distribution.

Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data.

It will tell you if subset A and subset B are from the same population. Or not. It is hard to tell what you did because you said 'settings', which is plural. Your experimental design and the answers you need impact how you analyze data. Big time.
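
As a rough sketch of what a one-way ANOVA computes, the F statistic can be built by hand from the between-group and within-group sums of squares. The function name `one_way_anova_f` and the example numbers are hypothetical; turning F into a p-value needs the F distribution, which a statistics library would supply (for instance `scipy.stats.f_oneway`, assuming SciPy is available).

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Assumes at least two groups and some variation within the groups
    (otherwise the within-group mean square is zero and F is undefined).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical pooled observations for settings A and B.
pooled_a = [10, 12, 11, 9, 13, 12, 10]
pooled_b = [15, 16, 14, 17, 15, 16, 16]
f_stat, df_b, df_w = one_way_anova_f([pooled_a, pooled_b])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F suggests the group means differ by more than the scatter within the groups would explain, under the normality assumption mentioned above.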

If you are just playing, then @Simon Bridge gave you the best answer.
 
The answer partially depends on the degrees of freedom (DF) your samples have. If each sample is large, you can perform a separate matched-pairs t test on each sample. If their results point to the same conclusion, then you're home. But suppose 3 point in one direction and 2 in the other. Then you may conclude that "time" and "location" matter. If you don't have sufficient DF, then my advice is to estimate a regression model with "sample fixed effects."
 
When I calculated the means of each sample, they were pretty close to each other. Close enough for me anyways; the standard deviation of the mean of means was about 10% of the mean for set A and 6.5% for set B. I suspect that if I took more samples in each set I could get even better standard deviations. The mean of means thing worked well enough for me since I don't want to spend several hours learning and then programming a complicated statistics test for my simple comparison.
 
I know the problem is solved, but I just want to add that taking the mean of means is a bad idea when the samples don't have the same number of observations. Intuitively, if you have one group with one observation and the other ##4## groups have ##1000## observations each, then this ##1## observation will receive far more weight than any other observation, which is not what you want.
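
To make the weighting issue concrete, here is a small sketch with invented numbers: one sample holding a single outlying observation and four samples of ##1000## observations each. The function names are hypothetical.

```python
from statistics import mean

def unweighted_mean_of_means(samples):
    """Average the per-sample means, giving every sample equal weight."""
    return mean([mean(s) for s in samples])

def pooled_mean(samples):
    """Average all observations together, giving every observation equal weight."""
    return mean([x for s in samples for x in s])

# Hypothetical data: one sample with a single observation of 100,
# four samples with 1000 observations of 10 each.
samples = [[100]] + [[10] * 1000 for _ in range(4)]

# In the mean of means the single observation carries 1/5 of the weight;
# in the pooled mean it carries 1/4001 of the weight.
print(unweighted_mean_of_means(samples))
print(pooled_mean(samples))
```

The first number is pulled strongly toward the lone observation, while the second barely moves.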

Mathematically, if each ##X_i## has a normal distribution ##N(\mu,\sigma^2)##, then the mean ##\overline{X}## of ##n## observations has distribution ##N(\mu, \sigma^2/n)##.
So if we take the mean of ##5## means, then this will certainly be an unbiased estimator, no problem there. But the variance of the mean of ##5## means is
$$\frac{\sigma^2}{25}\sum_{i=1}^{5} \frac{1}{n_i}$$
where ##n_i## are the sample sizes of the ##5## groups.

Compare this with the variance if we just put all the groups together:
$$\frac{\sigma^2}{\sum_{i=1}^{5} n_i}$$
It is easy to see that this second variance is never larger than the first, with equality exactly when ##n_1=...=n_5##.

So in order to get more precise estimates, it is better to just put all the observations together in one big set (and perhaps introduce a blocking factor) than taking a mean of means.
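
The two variance formulas above are easy to check numerically. The sketch below simply evaluates them for given sample sizes; the function names and the choice ##\sigma^2 = 1## are assumptions made for illustration.

```python
def var_mean_of_means(sizes, sigma2=1.0):
    """Variance of the unweighted mean of k sample means: sigma^2 / k^2 * sum(1 / n_i)."""
    k = len(sizes)
    return sigma2 / k ** 2 * sum(1.0 / n for n in sizes)

def var_pooled_mean(sizes, sigma2=1.0):
    """Variance of the mean of all observations pooled: sigma^2 / sum(n_i)."""
    return sigma2 / sum(sizes)

# Roughly equal sizes (as in the original poster's data) vs. very uneven sizes.
for sizes in ([60, 60, 60, 60, 60], [1, 1000, 1000, 1000, 1000]):
    print(sizes, var_mean_of_means(sizes), var_pooled_mean(sizes))
```

With equal sizes the two variances coincide; with one tiny sample the mean of means is far noisier than the pooled mean.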
 
So basically the problem lies in the uneven sample sizes, which is what I was worried about too when I made my original post. But as I found when I started working on the data, the weights are not an issue. All the samples I collected contained approximately 60 numbers each, and there were only small differences in the sizes of the samples, if they occurred at all.
 
Hey wahaj.

It might help to understand what property you are trying to assess.

Means, medians and things like variances have a very good interpretation but it may not necessarily be what you are looking for.

To get this answer, you first have to understand what questions you are trying to answer, independently of any statistics, and then formulate the appropriate test statistics, collection techniques, and processing of information to answer them.
 