How to compare two data sets with multiple samples

In summary, the data from sets A and B are not statistically different. Taking the mean of means is a bad idea when the samples differ in size, because each observation in a small sample is weighted more heavily than each observation in a large one.
  • #1
wahaj
I have two data sets A and B which correspond to two different settings of the system. Both sets contain 5 separate lists of integers; I took 5 samples, at different points in time and location, for each set to reduce the random error in the data. How would I go about comparing sets A and B? If I only had one sample in sets A and B, I could simply calculate the mean and standard deviation and compare the data. This is pretty much the extent of statistics I've learned. But I have multiple samples in each set, so should I combine all the samples into one big list and calculate the mean and SD? The number of integers in each sample may not be the same.

I'm not looking for something very accurate here. Something quick and dirty will do.
 
  • #2
You can combine all the distributions into one summary of the property you want to compare.
i.e., each of the 5 samples will have a mean ... from those you can find the mean of the means, and a standard deviation of the means ... which will let you compare the two sets.
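A minimal sketch of the mean-of-means summary described above, using only Python's standard library. The data values and the name `summarize` are invented for illustration; they are not the original poster's data.

```python
from statistics import mean, stdev

# Hypothetical data: five samples for one setting (values invented for illustration).
set_a = [
    [12, 15, 11, 14],
    [13, 16, 12],
    [11, 14, 15, 13, 12],
    [14, 13, 15],
    [12, 12, 16, 14],
]

def summarize(samples):
    """Return (mean of the sample means, standard deviation of the sample means)."""
    sample_means = [mean(s) for s in samples]
    return mean(sample_means), stdev(sample_means)

mean_of_means, sd_of_means = summarize(set_a)
print(f"mean of means = {mean_of_means:.2f}, SD of means = {sd_of_means:.2f}")
```

Running the same function on set B gives a second pair of numbers, and the two means can then be compared against the spread of the sample means.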
 
  • #3
This is probably the simplest way to go about doing this. Thanks for the help
 
  • #4
Simple is good. Unless you are planning to do something subsequent with your system - in that case good might be a better choice.
Since we do not have much to go on, let's start with ANOVA - analysis of variance, and an assumption that the data fit a normal distribution.

Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data.

It will tell you whether the means of subset A and subset B are consistent with coming from the same population. Or not. It is hard to tell what you did because you said 'settings' which is plural. Your experimental design and the answers you need impact how you analyze data. Big time.

If you are just playing, then @Simon Bridge gave you the best answer.
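For reference, a rough sketch of the one-way ANOVA F statistic mentioned above, written with the standard library only. It assumes the observations for each setting have been pooled into one list per group; the data and the function name `one_way_f` are invented for illustration, and no p-value is computed here.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical pooled observations for the two settings (invented values).
set_a = [12, 15, 11, 14, 13, 16, 12]
set_b = [14, 17, 15, 18, 16, 15, 17]

f_stat, df1, df2 = one_way_f([set_a, set_b])
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

A large F relative to the F distribution with those degrees of freedom suggests the group means differ. If SciPy is available, `scipy.stats.f_oneway` computes both the statistic and a p-value.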
 
  • #5
The answer partially depends on degrees of freedom your samples have. If each sample is large you can perform a separate matched-pairs t test on each sample. If their results point to the same conclusion then you're home. But suppose 3 point in one direction and 2 in the other. Then you may conclude "time" and "location" matter. If you don't have sufficient DF then my advice is to estimate a regression model with "sample fixed effects."
 
  • #6
When I calculated the mean of each sample, they were pretty close to each other. Close enough for me anyway: the standard deviation of the sample means was about 10% of the mean for set A and 6.5% for set B. I suspect that if I took more samples in each set I could get even better standard deviations. The mean of means thing worked well enough for me since I don't want to spend several hours learning and then programming a complicated statistics test for my simple comparison.
 
  • #7
I know the problem is solved, but I just want to add that taking the mean of means is a bad idea when the samples don't have the same number of observations. Intuitively, if you have one group with ##1## observation and the other ##4## groups have ##1000## observations each, then that single observation will receive far more weight than any other observation, which is not what you want.
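The weighting in that example can be checked with a few lines of arithmetic. This is only an illustrative sketch; the variable names are made up, and the group sizes are the ones from the example above.

```python
# Sample sizes from the example above: one lone observation and four groups of 1000.
sizes = [1, 1000, 1000, 1000, 1000]
k = len(sizes)
n_total = sum(sizes)

# Mean of means: each group gets weight 1/k, split evenly among its n observations.
weight_in_mean_of_means = [1 / (k * n) for n in sizes]
# Pooled mean: every observation gets the same weight 1/N.
weight_in_pooled_mean = 1 / n_total

lone = weight_in_mean_of_means[0]
typical = weight_in_mean_of_means[1]
print(f"lone observation: {lone:.4f}, observation in a large group: {typical:.6f}")
print(f"pooled weight for every observation: {weight_in_pooled_mean:.6f}")
```

In the mean of means the lone observation carries one fifth of the total weight, a thousand times the weight of any observation in the large groups.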

Mathematically, if each ##X_i## is independent with a normal distribution ##N(\mu,\sigma^2)##, then the mean ##\overline{X}## of ##n## such observations has distribution ##N(\mu, \sigma^2/n)##.
So if we take the mean of ##5## means, then this will certainly be an unbiased estimator, no problem there. But the variance of the mean of ##5## means is
[tex]\frac{\sigma^2}{25}\sum_{i=1}^{5} \frac{1}{n_i}[/tex]
where ##n_i## are the sample sizes of the ##5## groups.

Compare this with the variance if we just put all the groups together:
[tex]\frac{\sigma^2}{\sum_{i=1}^{5} n_i}[/tex]
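It is easy to see (from the inequality between the arithmetic and harmonic means of the ##n_i##) that this variance is never larger than that of the mean of means, with equality exactly when ##n_1=...=n_5##.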

So in order to get more precise estimates, it is better to just put all the observations together in one big set (and perhaps introduce a blocking factor) than to take a mean of means.
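The two variance formulas above can be compared numerically. The sketch below is illustrative only: the function names are invented, ##\sigma^2## is set to 1 by default, and the formulas are written for a general number of groups ##k##, which reduces to the factor ##1/25## when ##k=5##.

```python
from math import isclose

def var_mean_of_means(sizes, sigma2=1.0):
    """Variance of the mean of the group means: sigma^2 / k^2 * sum(1 / n_i)."""
    k = len(sizes)
    return sigma2 / k**2 * sum(1 / n for n in sizes)

def var_pooled_mean(sizes, sigma2=1.0):
    """Variance of the mean of all observations pooled: sigma^2 / sum(n_i)."""
    return sigma2 / sum(sizes)

# Equal sizes (about 60 per sample, as in the thread) and a deliberately lopsided case.
for sizes in ([60] * 5, [1, 1000, 1000, 1000, 1000]):
    mom = var_mean_of_means(sizes)
    pooled = var_pooled_mean(sizes)
    print(f"sizes={sizes}: mean of means {mom:.6f}, pooled {pooled:.6f}")
```

With equal sample sizes the two variances coincide, so the choice only matters when the sizes differ noticeably.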
 
  • #8
So basically the problem lies in the uneven sample sizes, which is what I was worried about too when I made my original post. But as I found when I started working on the data, the weights are not an issue. All the samples I collected contained approximately 60 numbers each, and there were only small differences in sample size, if any.
 
  • #9
Hey wahaj.

It might help to understand what property you are trying to assess.

Means, medians and things like variances have a very good interpretation but it may not necessarily be what you are looking for.

To get this answer you have to understand what questions you are trying to answer that are completely independent of statistics and then formulate the appropriate test statistics, collection techniques, and processing of information to facilitate this answer.
 

1. What statistical test should I use to compare two data sets with multiple samples?

The appropriate statistical test to use depends on the type of data and the research question. Some commonly used tests for comparing two data sets with multiple samples include t-tests, ANOVA, and Mann-Whitney U test. It is important to consult with a statistician or refer to statistical guidelines to determine the most suitable test for your data.

2. How do I interpret the results of a statistical test for comparing two data sets with multiple samples?

The interpretation of the results will depend on the type of statistical test used. Generally, the test will provide a p-value, which is the probability of obtaining results at least as extreme as those observed if the null hypothesis (no difference between the two data sets) is true. If the p-value is less than the chosen significance level (usually 0.05), the result is considered statistically significant, and the null hypothesis can be rejected. Other measures, such as effect size and confidence intervals, can also be used to interpret the results.
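To make the p-value definition concrete, here is a small sketch of a permutation test for a difference in means. This technique is not one proposed in the thread; it is used here only because it computes a p-value directly from its definition with the standard library. The function name, the data, and the choice to pool all observations per set (ignoring which sample they came from) are assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Estimate a two-sided permutation p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    count = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (count + 1) / (n_resamples + 1)

# Hypothetical pooled observations for the two settings (invented values).
set_a = [12, 15, 11, 14, 13, 16, 12]
set_b = [14, 17, 15, 18, 16, 15, 17]
print(f"p = {permutation_p_value(set_a, set_b):.4f}")
```

The returned value is the fraction of label shufflings that produce a difference in means at least as large as the observed one, which is the quantity the paragraph above describes.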

3. Can I compare two data sets with different sample sizes?

Yes, it is possible to compare two data sets with different sample sizes. However, it is important to note that larger sample sizes typically provide more accurate estimates and increase the power of statistical tests to detect differences between groups.

4. How can I visualize the differences between two data sets with multiple samples?

There are various ways to visualize the differences between two data sets with multiple samples, including bar charts, box plots, and scatter plots. These graphs can help identify any patterns or trends in the data and provide a clear visual representation of the differences between the two data sets.

5. Is there a preferred method for comparing two data sets with multiple samples?

There is no one preferred method for comparing two data sets with multiple samples, as different statistical tests may be appropriate depending on the type of data and research question. It is important to carefully select the appropriate method and consult with a statistician to ensure accurate and meaningful results.
