# Standard deviation for a biological assay

1. Nov 19, 2015

### lavoisier

Hi everyone, I have a basic question on statistics.

Suppose you have a biological assay to test a given property of some molecules.
At some point in time you have tested N different molecules M1, M2, ..., Mi, ..., MN, and for each you have repeated the test a number of times Ri. The results are Pi.
Example. You have tested N=4 molecules doing Ri={3,2,1,1} repeats. The results are Pi={{5.5, 6.0, 5.7}, {7.9, 8.1}, {6.3}, {8.5}}.
In practice, results are reported (at least where I work) as follows: for each Mi, the mean and the standard error of the mean, computed only from that molecule's own repeats.
Which means that for molecule 2 above the standard error of the mean is quite small, for molecule 1 it's larger, and for the last 2 molecules it can't be calculated at all.
And even more worryingly, when two repeats happen to give exactly the same result, you get s.e.m. = 0!

I can't help thinking that this is wrong. But my knowledge of statistics is limited, that's why I'm asking for help here.

Doesn't a 'method', and therefore an assay, have an inherent standard deviation SD that is based on the total observed variance? Shouldn't we express all results based on this single, general SD?

So in practice, wouldn't it be more correct to express the results as:

$P(M_i) = \overline{P_i} \pm SD/\sqrt{R_i}$

This way, molecules that are tested only once would still get a standard error, and those for which repeats give close results wouldn't get an unreasonably small s.e.m.
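As a sketch of that reporting scheme (the data are just the hypothetical example numbers from this post, and the assay-wide SD value is an assumed placeholder, not a measured one):

```python
import math

# Hypothetical results from the example above: four molecules,
# tested 3, 2, 1 and 1 times respectively.
results = [[5.5, 6.0, 5.7], [7.9, 8.1], [6.3], [8.5]]

ASSAY_SD = 0.22  # assumed single, assay-wide SD (placeholder value)

def report(values, sd):
    """Mean of the repeats +/- sd / sqrt(number of repeats)."""
    mean = sum(values) / len(values)
    sem = sd / math.sqrt(len(values))
    return mean, sem

for vals in results:
    mean, sem = report(vals, ASSAY_SD)
    print(f"{mean:.2f} +/- {sem:.2f}")
```

Note that with a shared SD, a molecule tested once still gets a standard error (the full SD), and close repeats can no longer produce an implausible s.e.m. of zero.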

Or am I wrong? What do you think?

Thanks
L

2. Nov 19, 2015

### RUber

Of course you want to find a better way to estimate the standard deviation of your method.
However, if you don't have a benchmark on your method, and the four molecules are different, it would be inappropriate to lump all the measurements together to estimate the error.
If the results above are all you have, and the assumption is that all (or most) of the observed difference between measurements is due to measurement error, then you would estimate the total SD of measurement as:
$Var = \frac{\sum_{i=1}^N \sum_{j=1}^{R_i} (x_{ij} - \overline x_i ) ^2 }{\sum_{i=1}^N R_i - 1}$
$SD = \sqrt{Var}$
If you could assume that the molecules are from the same population, and may have the same mean...then building a SD estimate using all the values would work, but might obscure the measurement error.

3. Nov 19, 2015

### lavoisier

Thank you RUber! That's exactly what I needed to understand.
Indeed, I wasn't thinking of lumping all the molecules together into one mean. I probably misrepresented that with my notation.
What I didn't know was that I could calculate a variance by summing squared differences from different means. You clarified that.
As for the main source of error being the measurement itself rather than the molecule, that's a very good question. I would say it is in most cases. Only occasionally will there be molecules that interfere with the fundamental mechanism of the assay, generating systematic errors or larger variability.
FYI, in the past another PF user sent me a link to a very interesting article on calculating the error of a specific type of assay, where several data points are collected and some parameters are fitted to a logistic-like equation. In that case the assay result is the fitted regression parameter(s), which should come with an error of its own (just as the slope and intercept of a least squares fit can be given with their own errors). No such luck: those who run the assay aren't happy to provide that piece of information.
So maybe the next best thing we can do as end users is adopt the above approach.
In any case, if I see that the distribution of the differences (x - xmean) is not normal, I guess that may be an indication that the error is not random.
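One way to check the normality of those differences, as a rough sketch (the data are the hypothetical example numbers from post 1, and the Shapiro-Wilk test is just one possible choice of normality test):

```python
import numpy as np
from scipy import stats

# Hypothetical repeat data (the example numbers from post 1).
results = [[5.5, 6.0, 5.7], [7.9, 8.1], [6.3], [8.5]]

# Within-molecule differences (x - xmean); molecules tested only
# once carry no information about the error, so they are skipped.
residuals = np.concatenate(
    [np.asarray(vals) - np.mean(vals) for vals in results if len(vals) >= 2]
)

# Shapiro-Wilk test: a small p-value would suggest the differences,
# and hence perhaps the measurement error, are not normally distributed.
stat, p = stats.shapiro(residuals)
print(f"W = {stat:.3f}, p = {p:.3f}")
```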
Thanks again.
L

4. Nov 19, 2015

### RUber

One other thought: if you think the measurement error might be proportional to the size of the measurement, you could also normalize the error term by dividing by the mean measurement.
You would notice this if you plotted the (measured - mean) errors against the means. If the plot looks like a cone, you might assume that the error is not independent of the molecule's measured value.
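That diagnostic only takes a few lines to set up; a minimal sketch using the hypothetical example numbers from post 1 (the pairs are what one would feed to a scatter plot):

```python
import statistics

# Hypothetical repeat data (the example numbers from post 1).
results = [[5.5, 6.0, 5.7], [7.9, 8.1], [6.3], [8.5]]

# Build (mean, measured - mean) pairs; plotting residual vs. mean and
# seeing a cone shape would suggest the error grows with the mean.
pairs = []
for vals in results:
    m = statistics.fmean(vals)
    for x in vals:
        pairs.append((m, x - m))

for mean, resid in pairs:
    print(f"mean={mean:.2f}  residual={resid:+.2f}")
```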

5. Nov 20, 2015

### lavoisier

I tried the method today. The differences (x - xmean) were normally distributed. I didn't try plotting them against the means; I've only just read this new post.
I adapted the formula a little; in particular, I subtracted from the denominator of Var the number of molecules that were tested only once.
If I left them in, the SD was way too low, because most molecules were indeed tested only once, so each of them contributed 0 to the numerator but 1 to the denominator.
Unless your formula was intended as:

$Var = \frac{\sum_{i=1}^N \sum_{j=1}^{R_i} (x_{ij} - \overline x_i ) ^2 }{\sum_{i=1}^N (R_i - 1)} = \frac{\sum_{i=1}^N \sum_{j=1}^{R_i} (x_{ij} - \overline x_i ) ^2 }{-N+\sum_{i=1}^N R_i }$
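A minimal sketch of that corrected (pooled) formula in Python; the data are the hypothetical example numbers from post 1, which happen to give about 0.22 as well:

```python
import math

# Hypothetical repeat data (the example numbers from post 1).
results = [[5.5, 6.0, 5.7], [7.9, 8.1], [6.3], [8.5]]

def pooled_sd(groups):
    """Pooled SD: squared deviations from each group's own mean,
    summed over all groups, divided by sum(R_i - 1), i.e. by
    (total measurements - number of groups)."""
    ss = 0.0    # sum of squared deviations
    dof = 0     # degrees of freedom
    for vals in groups:
        mean = sum(vals) / len(vals)
        ss += sum((x - mean) ** 2 for x in vals)
        dof += len(vals) - 1  # single-repeat groups contribute 0
    return math.sqrt(ss / dof)

print(f"SD = {pooled_sd(results):.2f}")
```

Molecules tested only once contribute nothing to either the numerator or the denominator, which is exactly the adjustment described above.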

I got SD = 0.22, which was not far from the mean of the SD's of the individual repeats.
Great! Tx

6. Nov 20, 2015

### RUber

Yes, sorry...I forgot the parentheses in the denominator. Your sharp analysis kept you from being led astray. Good work.