# Sampling and analysis of variance

1. Oct 30, 2016

### imsolost

Dear forumers,
In advance, thank you for reading this post and helping me solve this.

The problem is the following :

I have a population (say, a bag) from which i take 3 samples.

Each sample gets analysed once by a laboratory in order to know their concentration of a product.

The result from the laboratory for each sample "i" is in the form : the real value of the concentration is somewhere (i.e. normally distributed) around µ_i with the variance sigma²_i.

sigma²_i depends on the time of measurement t_i : the longer the measurement, the more accurate the result is, which means that µ_i gets closer to the real value of the concentration of sample "i". The relation between sigma_i and t_i is known.

So far, so good.

Now the question is simply :

• What do i know about the concentration of my whole bag ? With which uncertainty (expressed as a variance sigma²) ?
• If i take a 4th sample, what can I expect from it ? If I measure it during a time t_4, what will be the associated uncertainty (using all the information I have) ?
• What do i know about the heterogeneity of my bag ? (i.e. how can i split sigma² into one term that depends on all the sigma²_i and another "heterogeneity" term representing the variance of the 3 means µ1, µ2 and µ3).

To me, the problem seems a bit similar to an ANOVA analysis, except that, for ANOVA, u have multiple values for each sample while here, u have only one value with a known uncertainty. I don't know, I'm lost...

Would be amazing if u guys can give me some help on this !

2. Oct 30, 2016

### FactChecker

This is really a question about sampling techniques more than ANOVA. The formula for combining results with different variances can be found in https://en.wikipedia.org/wiki/Stratified_sampling It's not completely clear to me that this directly applies to your problem but I think it is getting close. Since you are talking about such a tiny sample, you will not have a good sample estimate of the variance. Unless you have some way of estimating the variance as a function of time, I don't think you can do anything.

3. Oct 30, 2016

### imsolost

Hi FactChecker.

I don't really know if stratified sampling applies here...

I mean, I guess u consider each of my 3 samples as a "stratum", but then quantity as the size of a stratum is not really defined, so I can't apply this theory here..

I am a bit confused.

Btw, the number of samples, 3 in this case, is just for the example.

Thank you for trying to help me though !

4. Oct 30, 2016

### FactChecker

Sorry. I had assumed that the reference equation for the sample mean would have weightings inversely proportional to the variance. I see that it doesn't and it is not immediately clear to me how it should be done. I still think that the subject of sampling techniques is the right subject to address your question.

5. Oct 30, 2016

### Stephen Tashi

What is the relation between sigma_i and t_i? Is it deterministic?

6. Oct 30, 2016

### imsolost

For the measurement of sample $i$ during a time $t_i$ giving a result $\mu_i$, the variance is $\sigma_i^2 = \mu_i / t_i$.

7. Oct 30, 2016

### Stephen Tashi

It isn't clear what the equation $\sigma_i = \mu_i/ t$ signifies.

For a population parameter, we have to define the associated population. If you are trying to estimate the population mean of a set of samples and we denote the mean of this population as $\mu$ then what population is associated with the mean $\mu_i$ and how it related to $\mu$ ?

Likewise we need to know the population associated with $\sigma_i$. For example, what population is associated with $\mu_2$ and $\sigma_2$ ?

8. Oct 30, 2016

### imsolost

Please note that $\mu_i/t = \sigma_i^2$ (the square disappeared from your expression).

Now, I'll try to answer your remark. The real value of the concentration in sample $i$, let's call it $\rho_i$, is unknown, but considering we measured $\mu_i$, we assume that $\rho_i$ is somewhere around $\mu_i$ and that $\mu_i$ is the best estimate. We assume here that $f(\rho_i \mid \mu_i)$, i.e. the probability distribution of $\rho_i$ given that $\mu_i$ was measured, is a Gaussian with mean $\mu_i$ and standard deviation $\sqrt{\mu_i/t}$.

And the whole problem is about :
1. testing the following hypothesis : $H_0$ : for all $i, j$ : $\rho_i = \rho_j$
2. expressing the total variance $s^2$ as a split between one term that represents the heterogeneity (probably something of the form $\sum_i (\mu_i - \mu)^2$, where $\mu$ is the mean of all the $\mu_i$) and another term that involves all the $\sigma_i$. In a sense, it's very similar to the ANOVA approach.

I hope this clarifies these things a bit :-/

And thank you for putting your time into helping me on this !

9. Oct 30, 2016

### FactChecker

Is this a time series of measurements of a changing population at times $t_i$, or are they measurements of the same population that take different elapsed times $t_i$?

If they are a time series of a changing population, then you should apply time series analysis. I am guessing that you don't mean that.

If they are measurements of the same population that take different elapsed times $t_i$, then there is only one true mean that you want to estimate. The usual way to combine the mean estimators of the same thing from several samples of different sizes is to weight each estimate by its sample size. Since the relationship between the sample size and the estimator variance is sample_mean_sigma² = true_sigma² / sample_size, I would suggest weighting each estimator by 1/sample_mean_sigma². In your case, that would be a time-weighted estimate of the mean. A sample that took twice as long would have twice as much weight in the average estimator.
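A minimal sketch of that weighting, assuming every measurement shares one variance per unit time (`unit_variance`, a hypothetical value), so that the variance of measurement $i$ is `unit_variance / t_i`. The function name and the numbers are made up for illustration. Under that assumption the weights 1/variance are proportional to the elapsed times.

```python
def inverse_variance_mean(values, variances):
    """Combine estimates of one quantity, weighting each by 1/variance."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total_weight
    # Variance of the combined estimate when the input variances are exact.
    combined_variance = 1.0 / total_weight
    return mean, combined_variance


# Made-up measurements and elapsed times.
values = [10.0, 12.0, 11.0]
times = [100.0, 200.0, 400.0]

# Assumption: the same variance per unit time for every measurement.
unit_variance = 11.0
variances = [unit_variance / t for t in times]

mean, combined_variance = inverse_variance_mean(values, variances)
print(mean, combined_variance)
```

With these inputs the result is the same as weighting each value directly by its elapsed time, which is the "twice as long, twice the weight" rule.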

10. Oct 31, 2016

### Stephen Tashi

There is zero probability that two samples from a gaussian distribution are exactly equal.

You used the notation "$p_i$" to denote both a random variable and a parameter. To do a hypothesis test, you should test a statement about the distribution of a random variable. For example, you might test a hypothesis about a population parameter of its distribution.

If we do Bayesian statistics then we can assume a population parameter is a random variable. But unless $p_i$ and $p_j$ are discrete random variables it doesn't make sense to test the hypothesis that they are equal.

So you are using $\mu_i$ to denote the result of a single measurement?

I think you need to straighten out your notation. If $\sigma_i$ is a standard deviation of some population then it can't be given by a function that gives a different answer depending on the value of a sample that is drawn from that population (i.e. by $\sigma_i^2 = \mu_i/t$). If you want $\sigma_i$ to be an estimator of a population parameter then its value can depend on the value(s) in a sample.

"Somewhere around" is too wishy-washy to translate into a mathematical statement. "Best estimate" is also subject to interpretation because there are several different criteria for what makes an estimate "good" or "best" (e.g. unbiased, minimum variance, maximum likelihood, least squares).

Perhaps you intend to assume that $\mu_i$ is an unbiased estimator of $p_i$.

How do you define the concentration of the whole bag? Is it the average concentration of the 3 samples ?

-----

A difficulty with your problem is that $\sigma_i^2 = \mu_i/t$ must be interpreted as an estimate of the variance (of the population of all imaginable tests of a given concentration c that last for the time t). So there is "uncertainty" in the estimator $\sigma_i$. Hence if you attempt to estimate the "uncertainty" (i.e. standard deviation) of $(p_1 + p_2 + p_3)/3$ you must examine how the uncertainty in the $\sigma_i$ contributes to the uncertainty in your estimate.

There are familiar statistical scenarios where the population standard deviation is assumed to be correctly estimated by the sample standard deviation. The justification for this assumption is that the uncertainty of the sample standard deviation is small when it is computed from a large sample by the usual formula. But in your case, you are not computing $\sigma_i$ by the usual formula.

11. Oct 31, 2016

### imsolost

Each sample doesn't change as a function of time : $\rho_i$ is constant over time.
As stated above, it's just that the longer the measurement, the more accurate it gets. The estimate $\mu_i$ of the real value $\rho_i$ gets closer to $\rho_i$ (see my post above with the distribution function $f(\rho_i \mid \mu_i)$).

Now, is it the same population ? Well, all 3 samples were taken from the same lot indeed. The unknown concentration of the whole lot is $\rho$. I'd like to have an estimate of $\rho$ and an evaluation of the uncertainty. But keep in mind that my 3 samples aren't necessarily the same. The lot can be heterogeneous : each sample "i" has its own concentration $\rho_i$ which is normally distributed around $\rho$. That's why I thought I should first test the following hypothesis : $H_0$ : for all $i, j$ : $\rho_i = \rho_j$. I don't know how to test that.

I don't know what time series are, but indeed the population isn't changing over time, so i guess you guessed well.

The concept of sample "size", in this problem, isn't very clear to me. But I was also expecting a weight that is a function of time, although I can't see any rigorous explanation for it. In your expression, can you define sample_mean_sigma and true_sigma ?

Again, I'd like to thank you guys for helping me on this.

edit : just saw the post of Stephen Tashi. Gonna read it and answer soon.

12. Oct 31, 2016

### imsolost

Okay so I just read Tashi's very interesting post and I think we definitely are getting close to the core of my problem and the mistakes I'm probably making here.

Honestly I think I was doing both at the same time (conventional stats and Bayesian ones) and that's probably not appropriate. So let's say $\rho_i$ is a population parameter and not a random variable.

That said, now that $\rho_i$ is a parameter for all $i$, I don't see why I couldn't test a hypothesis like $H_0$ : for all $i, j$ : $\rho_i = \rho_j$.

Small side-conversation : {
Let's compare to the ANOVA analysis. Isn't it what they do ? Testing if sub-populations differ according to some explanatory variable ; which here translates into testing if my 3 samples differ by more than their "within-sample" deviation. The only difference with a classical ANOVA being that here, i have only one single measurement for each level, which gives me an estimator of the mean and of the dispersion, while ANOVA has multiple results for each level from which it calculates... its mean and its dispersion !
But I can't use ANOVA theory since there is no such thing as sample size or degrees of freedom in my problem.
}
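One possible sketch, not something proposed in this thread: with a single value and a known variance per sample, a commonly used stand-in for the ANOVA comparison is Cochran's Q statistic from meta-analysis, together with the DerSimonian-Laird moment estimate of the between-sample variance. The sketch assumes the measurement errors are normal and the variances are exactly known, which is only approximately true here since $\mu_i/t_i$ is itself an estimate. All names and numbers are hypothetical.

```python
def heterogeneity_q(values, variances):
    """Cochran's Q and a moment estimate of the between-sample variance.

    Assumption: each value is normally distributed around its own true
    value with a known variance.
    """
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / w_sum
    # Weighted squared deviations from the inverse-variance weighted mean.
    q = sum(w * (x - mean) ** 2 for w, x in zip(weights, values))
    dof = len(values) - 1
    # DerSimonian-Laird estimate, truncated at zero.
    scale = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - dof) / scale)
    return q, dof, tau2


# Made-up results and measurement times; variance taken as value / time.
values = [10.0, 12.0, 11.0]
times = [100.0, 200.0, 400.0]
variances = [x / t for x, t in zip(values, times)]

q, dof, tau2 = heterogeneity_q(values, variances)
print(q, dof, tau2)
```

If all samples share the same true value, Q is approximately chi-square distributed with `dof` degrees of freedom, so a Q far above `dof` hints at heterogeneity, and `tau2` is a rough size for the "between-sample" term.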

Yes. $\mu_i$ is the result of one single measurement on sample $i$. But you are right, maybe I should have used another letter for it, since it confuses people into thinking it's a mean. That said, if you guys are okay with this, I propose keeping this notation in this post for consistency.

You are totally right. Big mistake on my part, so let's rephrase this : $\sqrt{\mu_i/t}$ is an estimator of the standard deviation.

Yes ! Thank you for correcting me.

Well that's one of my questions.

I guess the best use of the information I have would be to say that the concentration of the whole bag is best estimated (unbiased estimator ?) by the mean of the 3 results (on the 3 samples). So something like $(\mu_1 + \mu_2 + \mu_3)/3$ ? I don't know. Someone suggested using a weighted average that involves the elapsed time of measurement ? What would be the justification for this ? My intuition makes me think that, indeed, I should give more credit to a more accurate result and give it more weight. I'm confused :/

Yep ! But how can i do that ?

I didn't really understand that part.

Last edited: Oct 31, 2016

13. Oct 31, 2016

### Stephen Tashi

Assuming $p_i$ is a population parameter there is no conceptual objection to a hypothesis test. But there may be a practical obstacle to doing a hypothesis test because you only have 1 sample from each population.

The statement of your problem is ambiguous because the equation $\sigma_i^2 = \mu_i/t$ is merely the definition of an estimator of a variance. We don't have any probability model that explains why that formula ought to work (in any sense) as an estimator of the variance.

If that estimator is suggested by some data, it would help to know the format of that data. If the estimator is a theoretical result then what is the probability model that the theory uses?

If you can write computer programs, you can investigate the behavior of the estimator by the Monte-Carlo method. Pick an arbitrary "true" concentration value $p_i$ and a time t. Set the population standard deviation $\sigma$ equal to $\sqrt{p_i/t}$. Generate random samples from a normal distribution with mean $p_i$ and standard deviation $\sigma$. For each sample value $s_k$, compute the value of the estimator $\sigma_k = \sqrt{s_k/t}$ and look at the distribution of the $\sigma_k$.
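The recipe above could be sketched roughly as follows. The true value, the time, the number of draws and the function name are arbitrary choices for illustration, and negative draws are clipped to zero before the square root, which is an extra assumption not in the recipe.

```python
import math
import random
import statistics


def simulate_sigma_estimator(p_true, t, n_draws, seed=0):
    """Draw measurements s_k ~ Normal(p_true, sqrt(p_true / t)) and
    return the true sigma together with the estimates sqrt(s_k / t)."""
    rng = random.Random(seed)
    sigma_true = math.sqrt(p_true / t)
    estimates = []
    for _ in range(n_draws):
        s_k = rng.gauss(p_true, sigma_true)
        # Clip at zero so the square root is always defined (assumption).
        estimates.append(math.sqrt(max(s_k, 0.0) / t))
    return sigma_true, estimates


sigma_true, estimates = simulate_sigma_estimator(p_true=100.0, t=10.0, n_draws=20000)
print("true sigma:        ", sigma_true)
print("mean of estimates: ", statistics.mean(estimates))
print("spread of estimates:", statistics.stdev(estimates))
```

Comparing the mean and spread of the estimates with the true sigma for several choices of `p_true` and `t` shows how much the estimator itself fluctuates.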

14. Oct 31, 2016

### Stephen Tashi

If you knew that $p_1 = p_2 = p_3 = p$ then it would make intuitive sense to give more weight to estimators of $p$ that had less variance. But if the 3 estimators are estimating different things, then it isn't clear that giving an estimator of $p_1$ more weight helps you estimate the mean $(p_1 + p_2 + p_3)/3$ better.

To answer questions about a good estimator for $(p_1 + p_2 + p_3)/3$ , you need a probability model for how $p_1,p_2,p_3$ are selected from some population of concentrations.

The behavior of the various estimators of $(p_1 + p_2 + p_3)/3$ is another topic that can be investigated by Monte-Carlo simulations.
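A rough sketch of such a simulation, under a probability model assumed purely for illustration: each $p_i$ is drawn from a normal distribution with mean `rho` and standard deviation `tau`, and each measurement is normal around $p_i$ with variance $p_i/t_i$. It compares the equal-weight mean with the time-weighted mean as estimators of $(p_1 + p_2 + p_3)/3$. All names and numbers are hypothetical.

```python
import math
import random


def compare_estimators(rho, tau, times, n_trials=5000, seed=1):
    """Return the RMS error of the equal-weight and time-weighted means
    as estimators of the average of the true sample values."""
    rng = random.Random(seed)
    total_time = sum(times)
    sq_err_equal = 0.0
    sq_err_timew = 0.0
    for _ in range(n_trials):
        # Assumed model: true sample values scatter around rho.
        p = [rng.gauss(rho, tau) for _ in times]
        target = sum(p) / len(p)
        # One measurement per sample, with variance p_i / t_i.
        x = [rng.gauss(p_i, math.sqrt(max(p_i, 0.0) / t_i))
             for p_i, t_i in zip(p, times)]
        equal = sum(x) / len(x)
        timew = sum(t_i * x_i for x_i, t_i in zip(x, times)) / total_time
        sq_err_equal += (equal - target) ** 2
        sq_err_timew += (timew - target) ** 2
    return math.sqrt(sq_err_equal / n_trials), math.sqrt(sq_err_timew / n_trials)


times = [1.0, 10.0, 100.0]
print("homogeneous bag:  ", compare_estimators(rho=100.0, tau=0.0, times=times))
print("heterogeneous bag:", compare_estimators(rho=100.0, tau=20.0, times=times))
```

Under this assumed model the time-weighted mean has the smaller error when `tau` is zero, while the equal-weight mean does better once the scatter between samples dominates the measurement noise.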


15. Oct 31, 2016

### imsolost

I'll take more time to read your last 2 posts carefully, but I can already answer your first question. The measurand is not really a concentration. It's a count rate which relates to a radioactive activity per unit mass (see here for a quick summary). Basically, for a given number of counts $N_i$, the uncertainty (variance) is $(\sqrt{N_i})^2 = N_i$. So for counting during a time $t_i$, the measured count rate is $R_i = N_i/t_i$ with a variance $N_i/t_i^2$, which is equal to $R_i/t_i$. This is the expression written in the above posts.
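As a small sketch of that arithmetic, assuming the variance of the raw count is the count itself (Poisson counting statistics); the function name and the numbers are made up:

```python
def count_rate_with_variance(n_counts, t):
    """Count rate R = N / t and its variance N / t**2 (equal to R / t).

    Assumption: the variance of the raw count N is estimated by N itself.
    """
    rate = n_counts / t
    variance = n_counts / t**2
    return rate, variance


rate, variance = count_rate_with_variance(n_counts=400, t=20.0)
print(rate, variance, rate / 20.0)
```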

Now, gonna study the rest of your post !

Last edited: Oct 31, 2016

16. Oct 31, 2016

### FactChecker

Here is the point I was trying to make. The normal method of combining estimates from sample groups of different sizes is by weighting each estimate by its sample size. That is because there is a direct inverse relationship between the sample size and the variance of that sample estimate. In your case, you have a direct inverse relationship between measurement elapsed time and variance. So measurement elapsed time takes the place of the sample size. Weigh your estimates by the measurement elapsed time.
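For count rates specifically, a time-weighted average of the rates is the same number as pooling all counts over the total time. A minimal sketch with made-up counts and times, assuming all samples share one true rate and that the variance of a count is the count itself:

```python
def pooled_count_rate(counts, times):
    """Total counts over total time, with variance (total counts) / T**2.

    Assumptions: one common true rate, and Var(N) estimated by N.
    """
    total_counts = sum(counts)
    total_time = sum(times)
    rate = total_counts / total_time
    variance = total_counts / total_time**2
    return rate, variance


# Made-up counts and measurement times.
counts = [1000, 2400, 4400]
times = [100.0, 200.0, 400.0]

rates = [n / t for n, t in zip(counts, times)]
time_weighted = sum(t * r for r, t in zip(rates, times)) / sum(times)
pooled, pooled_variance = pooled_count_rate(counts, times)
print(time_weighted, pooled, pooled_variance)
```

So weighting by elapsed time amounts to treating the separate measurements as one long measurement, which is only justified if the samples really are "the same thing".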

17. Nov 2, 2016

### Stephen Tashi

I agree with that idea if there is some statistical relation among the concentrations (or counts) in the 3 samples.

However, consider a situation at the other extreme. Suppose we are trying to estimate the mean of 3 random variables that are "unrelated". For example let X1 be the age of a randomly selected resident of California, let X2 be the closing price of the stock of the Monsanto company on a randomly selected day in 2015, and let X3 be the mileage on a randomly selected automobile that is registered in the state of North Carolina. If we have different-sized samples of each of these random variables, is it wise to estimate the mean value of the random variable Y = X1+X2+X3 by using an unequally weighted sum of the sample means ?

From the original statement of the problem:
This hints that the 3 samples might be samples "of the same thing". So the estimate of the mean concentration of the 3 samples, is also an estimate of the mean concentration of the population of things in the bag. In this case, I agree with using a weighted sum of the sample means.

But the question is also posed:
So perhaps the things in the bag are not samples of the same thing.

18. Nov 2, 2016

### FactChecker

That's a good point. But how is that really different from any other source of random variation? I admit that if the amount of time of a measurement correlates with different types being selected from the bag, then there is a problem. But otherwise, I would just consider the differences due to the selection of a type from the bag just another source of random variation.

This entire problem seems so much like the problem of averaging poll results given only the published average result and margin of error. But I am not sure that anyone does that without also knowing the sample sizes, which they might just use directly for weighting the individual results. I searched for details of their methods, but didn't find any.

I also think that this is closely related to importance sampling and stratified sampling methods, where the "bag" is not homogeneous. But I could not find a reference that directly weighted the individual results using 1/strata_variance_i, rather than the sample size, $N_i$. The only thing I can think of is to weight the individual results by the time spent on the measurement, since that is given to be proportional to 1/strata_variance_i. I admit that I am kind of "winging it" here more than I should.