How to compare two data sets with multiple samples

wahaj · Jun 16, 2016

I have two data sets A and B which correspond to two different settings of the system. Both sets contain 5 separate lists of integers; I took 5 samples, at different points in time and location, for each set to reduce the random error in the data. How would I go about comparing sets A and B? If I only had one sample in sets A and B, I could simply calculate the mean and standard deviation and compare the data. This is pretty much the extent of statistics I've learned. But I have multiple samples in each set, so should I combine all the samples into one big list and calculate the mean and SD? The number of integers in each sample may not be the same.

I'm not looking for something very accurate here. Something quick and dirty will do.

Simon Bridge · Jun 16, 2016

You can combine all the distributions into one about the property you want to compare.
i.e. each of the 5 samples in a set has a mean; from those you can find the mean of the means and the standard deviation of the means, which lets you compare the two sets.
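A minimal Python sketch of this mean-of-means comparison, assuming each set is stored as a list of 5 lists of integers. The data values and the `summarize` helper are made up for illustration and are not from the original poster's measurements.

```python
from statistics import mean, stdev

# Hypothetical data: each set holds 5 samples (lists of integers)
# whose lengths need not match. These numbers are invented.
set_a = [[12, 15, 11], [14, 13, 16, 12], [10, 12], [15, 14, 13], [11, 13, 12, 14]]
set_b = [[18, 20, 19], [17, 21, 22, 20], [19, 18], [21, 20, 22], [20, 19, 18, 21]]

def summarize(samples):
    """Return (mean of the sample means, standard deviation of the sample means)."""
    sample_means = [mean(s) for s in samples]
    return mean(sample_means), stdev(sample_means)

for name, samples in (("A", set_a), ("B", set_b)):
    m, s = summarize(samples)
    print(f"set {name}: mean of means = {m:.2f}, SD of means = {s:.2f}")
```

If the two means of means differ by much more than the standard deviations of the means, that is a rough sign that the two settings really behave differently.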

wahaj · Jun 16, 2016

This is probably the simplest way to go about doing this. Thanks for the help.

jim mcnamara · Jun 17, 2016

Simple is good. Unless you are planning to do something subsequent with your system, in which case "good" might be a better choice than "simple".
Since we do not have much to go on, let's start with ANOVA - analysis of variance, and an assumption that the data fit a normal distribution.

Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data.

It will tell you if subset A and subset B are from the same population. Or not. It is hard to tell what you did because you said 'settings', which is plural. Your experimental design and the answers you need impact how you analyze data. Big time.

If you are just playing, then @Simon Bridge gave you the best answer.
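For readers who want to see what one-way ANOVA actually computes, here is a rough pure-Python sketch of the F statistic. It assumes roughly normal data, as stated above, and the `one_way_anova_f` helper and the numbers are hypothetical. It stops at the F value: turning that into a p-value needs the F distribution, for instance from a table or from a library routine such as SciPy's `scipy.stats.f_oneway`.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: scatter of values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    # Note: fails with ZeroDivisionError if there is no within-group scatter at all.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical pooled observations for the two settings.
pooled_a = [12, 15, 11, 14, 13, 16, 12, 10]
pooled_b = [18, 20, 19, 17, 21, 22, 20, 19]
f_stat, df1, df2 = one_way_anova_f([pooled_a, pooled_b])
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

A large F means the group means are far apart relative to the scatter within each group. The same function could be fed the 5 samples of a single set to check whether time and location make a difference.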

EnumaElish · Jun 18, 2016

The answer partially depends on the degrees of freedom (DF) your samples have. If each sample is large you can perform a separate matched-pairs t test on each sample. If their results point to the same conclusion then you're home. But suppose 3 point in one direction and 2 in the other. Then you may conclude "time" and "location" matter. If you don't have sufficient DF then my advice is to estimate a regression model with "sample fixed effects."
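A small sketch of the matched-pairs t statistic for one pair of samples, in plain Python. It assumes the k-th sample of A and the k-th sample of B have the same length and that their observations can be paired one to one, which may not hold for the data in this thread. The `paired_t` helper and the numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for a matched-pairs t test."""
    if len(sample_a) != len(sample_b):
        raise ValueError("matched pairs need samples of equal length")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t_stat, n - 1

# Hypothetical paired observations from one sample of each set.
t_stat, df = paired_t([5, 7, 9, 6], [4, 5, 8, 4])
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

Running this once per sample pair gives 5 t values, and the question is then whether they all point in the same direction.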

wahaj · Jun 18, 2016

When I calculated the means of each sample, they were pretty close to each other. Close enough for me anyway: the standard deviation of the sample means was about 10% of the mean of means for set A and 6.5% for set B. I suspect that if I took more samples in each set I could get even better standard deviations. The mean of means thing worked well enough for me since I don't want to spend several hours learning and then programming a complicated statistics test for my simple comparison.

micromass · Jun 18, 2016

I know the problem is solved, but I just want to add that taking the mean of means is a bad idea since the groups don't have the same number of observations. Intuitively, if you have one group with one observation and the other ##4## groups have ##1000## observations each, then this ##1## observation will receive far more weight than any other, which is not what you want.

Mathematically, if each ##X_i## has a normal distribution ##N(\mu,\sigma^2)##, then the mean ##\overline{X}## of ##n## observations has distribution ##N(\mu, \sigma^2/n)##.
So if we take the mean of ##5## means, then this will certainly be an unbiased estimator, no problem there. But the variance of the mean of ##5## means is
$$\frac{\sigma^2}{25}\sum_{i=1}^{5} \frac{1}{n_i}$$
where ##n_i## are the sample sizes of the ##5## groups.

Compare this with the variance if we just put all the groups together:
$$\frac{\sigma^2}{\sum_{i=1}^{5} n_i}$$
It is easy to see that this variance is never larger than the previous one, with equality exactly when ##n_1=\dots=n_5##.

So in order to get more precise estimates, it is better to just put all the observations together in one big set (and perhaps introduce a blocking factor) than taking a mean of means.
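The two variance formulas above can be compared numerically with a short Python sketch. The function names and the choice ##\sigma^2 = 1## are assumptions made only for illustration, and the group sizes are hypothetical.

```python
def var_mean_of_means(sizes, sigma2=1.0):
    """Variance of the unweighted mean of the group means."""
    k = len(sizes)
    return sigma2 / k**2 * sum(1 / n for n in sizes)

def var_pooled_mean(sizes, sigma2=1.0):
    """Variance of the mean of all observations pooled into one set."""
    return sigma2 / sum(sizes)

unequal = [1, 1000, 1000, 1000, 1000]  # one tiny group, four large ones
equal = [60, 60, 60, 60, 60]           # all groups the same size

for sizes in (unequal, equal):
    print(sizes, var_mean_of_means(sizes), var_pooled_mean(sizes))
```

With the unequal sizes the mean of means has a variance of about 0.040 against about 0.00025 for the pooled mean, so the single observation dominates. With equal sizes the two values agree.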

wahaj · Jun 18, 2016

So basically the problem lies in the uneven sample sizes, which is what I was worried about too when I made my original post. But as I found when I started working on the data, the weights are not an issue. All the samples I collected contained approximately 60 numbers each, and there were only small differences in sample size, if any.

chiro · Jun 18, 2016

Hey wahaj.

It might help to understand what property you are trying to assess.

Means, medians and things like variances have a very good interpretation but it may not necessarily be what you are looking for.

To get there, you first have to pin down the questions you are trying to answer, independently of any statistics, and then formulate the appropriate test statistics, collection techniques, and processing of information to answer them.
