Sampling and analysis of variance

imsolost · Oct 30, 2016

Dear forumers,
In advance, thank you for reading this post and helping me solve this.

The problem is the following :

I have a population (say, a bag) from which i take 3 samples.

Each sample gets analysed once by a laboratory in order to know their concentration of a product.

The result from the laboratory for each sample "i" is in the form : the real value of the concentration is somewhere (i.e. normally distributed) around µ_i with the variance sigma²_i.

sigma²_i depends on the time of measurement t_i : the longer the measurement, the more accurate the result is, which means that µ_i gets closer to the real value of the concentration of the sample "i". The relation between sigma_i and t_i is known.

So far, so good.

Now the question is simply :

What do i know about the concentration of my whole bag ? With which uncertainty (expressed as a variance sigma) ?
If i take a 4th sample, what can I expect from it ? If I measure it during a time t_4, what will be the associated uncertainty (using all the information I have) ?
What do i know about the heterogeneity of my bag ? (i.e. how can i split sigma between a relation with all the sigma_i and another "heterogeneity" term representing the variance of the 3 means µ1, µ2 and µ3).

To me, the problem seems a bit similar to an ANOVA analysis, except that, for ANOVA, u have multiple values for each sample while here, u have only one value with a known uncertainty. I don't know, I'm lost...

Would be amazing if u guys can give me some help on this !

FactChecker · Oct 30, 2016

This is really a question about sampling techniques more than ANOVA. The formula for combining results with different variances can be found in https://en.wikipedia.org/wiki/Stratified_sampling It's not completely clear to me that this directly applies to your problem but I think it is getting close. Since you are talking about such a tiny sample, you will not have a good sample estimate of the variance. Unless you have some way of estimating the variance as a function of time, I don't think you can do anything.

imsolost · Oct 30, 2016

Hi FactChecker.

I don't really know if stratified sampling applies here...

I mean, I guess u consider each of my 3 samples as a "stratum", but then quantity as the size of a stratum is not really defined, so I can't apply this theory here..

I am a bit confused.

Btw, the number of samples, 3 in this case, is just for the example.

Thank you for trying to help me though !

FactChecker · Oct 30, 2016

Sorry. I had assumed that the reference equation for the sample mean would have weightings inversely proportional to the variance. I see that it doesn't and it is not immediately clear to me how it should be done. I still think that the subject of sampling techniques is the right subject to address your question.

Stephen Tashi · Oct 30, 2016

imsolost said:

The relation between sigma_i and t_i is known.

What is that relation? Is it deterministic?

imsolost · Oct 30, 2016

What is that relation? Is it deterministic?

For the measurement of sample_i during a time t_i giving a result µ_i, the variance is σ_i² = µ_i / t_i

Stephen Tashi · Oct 30, 2016

imsolost said:

For the measurement of sample_i during a time t_i giving a result µ_i, the variance is σ_i² = µ_i / t_i

It isn't clear what the equation ##\sigma_i = \mu_i/ t ## signifies.

For a population parameter, we have to define the associated population. If you are trying to estimate the population mean of a set of samples and we denote the mean of this population as ##\mu## then what population is associated with the mean ##\mu_i## and how it related to ##\mu## ?

Likewise we need to know the population associated with ##\sigma_i##. For example, what population is associated with ##\mu_2## and ##\sigma_2## ?

imsolost · Oct 30, 2016

Stephen Tashi said:

It isn't clear what the equation ##\sigma_i = \mu_i/ t ## signifies.

For a population parameter, we have to define the associated population. If you are trying to estimate the population mean of a set of samples and we denote the mean of this population as ##\mu## then what population is associated with the mean ##\mu_i## and how it related to ##\mu## ?

Likewise we need to know the population associated with ##\sigma_i##. For example, what population is associated with ##\mu_2## and ##\sigma_2## ?

Please note that µ_i/t = σ_i² (the square disappeared from your expression).

Now, I'll try to answer your remark. The real value of the concentration in the sample i, let's call it ρ_i, is unknown but considering we measured µ_i, we assume that ρ_i is somewhere around µ_i and µ_i is the best estimate. We assume here that f(ρ_i | µ_i), i.e. the probability distribution of ρ_i considering µ_i was measured, follows a gaussian with a standard deviation of √(µ_i/t) and a mean µ_i.

And the whole problem is about :

testing the following hypothesis : H₀ : For all i,j : ρ_i=ρ_j
Expressing the total variance s² as a split between one term that represents the heterogeneity (probably something of the form Σ_i(µ_i-µ)², where µ is the mean of all the µ_i) and another term that involves all the σ_i. In a sense, it's very similar to the ANOVA approach.

I hope this clarifies these things a bit :-/

And thank you for putting your time into helping me on this !

FactChecker · Oct 30, 2016

"sigma²_i depends on the time of measurement t_i, : the longer the measurement, the more accurate the result is"

Is this a time series of measurements of a changing population at times t_i or are they measurements of the same population that take different elapsed times t_i?

If they are a time series of a changing population, then you should apply time series analysis. I am guessing that you don't mean that.

If they are measurements of the same population that take different elapsed times t_i, then there is only one true mean that you want to estimate. The usual way to combine the mean estimators of the same thing from several samples of different sizes is to weight each estimate by its sample size. Since the relationship between the sample size and the estimator variance is sample_mean_sigma² = true_sigma² / sample_size, I would suggest weighting each estimator by 1/sample_mean_sigma². In your case, that would be a time-weighted estimate of the mean. A sample that took twice as long would have twice as much weight in the average estimator.
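A minimal sketch of the inverse-variance weighting described above, in Python. The function name `inverse_variance_mean` and the numbers are illustrative assumptions, not data from this thread. The combined variance `1 / sum(weights)` is only exact when all results estimate the same value and the variances are known rather than estimated.

```python
def inverse_variance_mean(estimates, variances):
    """Combine independent estimates of one quantity, weighting each by 1/variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total
    # Variance of the weighted mean, assuming the input variances are exact
    return mean, 1.0 / total


# Hypothetical results for three samples (not from the thread)
mu = [10.0, 12.0, 11.0]      # measured values mu_i
t = [100.0, 200.0, 50.0]     # measurement times t_i
var = [m / ti for m, ti in zip(mu, t)]  # relation from the thread: sigma_i^2 = mu_i / t_i

combined, combined_var = inverse_variance_mean(mu, var)
print(combined, combined_var)
```

With the relation σ_i² = µ_i / t_i, the weight 1/σ_i² equals t_i / µ_i, so the weights are proportional to the measurement time only to the extent that the µ_i are close to each other.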

Stephen Tashi · Oct 31, 2016

imsolost said:

i.e. the probability distribution of ρ_i considering µ_i was measured, follows a gaussian with a standard deviation of √(µ_i/t) and a mean µ_i.

testing the following hypothesis : H₀ : For all i,j : ρ_i=ρ_j

There is zero probability that two samples from a gaussian distribution are exactly equal.

You used the notation "##p_i##" to denote both a random variable and a parameter. To do a hypothesis test, you should test a statement about the distribution of a random variable . For example, you might test a hypothesis about a population parameter of its distribution.

If we do Bayesian statistics then we can assume a population parameter is a random variable. But unless ##p_i## and ##p_j## are discrete random variables it doesn't make sense to test the hypothesis that they are equal.

The real value of the concentration in the sample i, let's call it ρ_i, is unknown but considering we measured µ_i, we assume that ρ_i is somewhere around µ_i and µ_i is the best estimate.

So you are using ##\mu_i## to denote the result of a single measurement?

I think you need to straighten out your notation. If ##\sigma_i## is a standard deviation of some population then it can't be given by a function that gives a different answer depending on the value of a sample that is drawn from that population (i.e. by ## \sigma_i^2 = \mu_i/t##). If you want ##\sigma_i## to be an estimator of a population parameter then its value can depend on the value(s) in a sample.

we assume that ρ_i is somewhere around µ_i and µ_i is the best estimate.

"Somewhere around" is too wishy-washy to translate into a mathematical statement. "Best estimate" is also subject to interpretation because there are several different criteria for what makes an estimate "good" or "best". (e.g. unbiased, minimum variance, maximum likelihood, least squares).

Perhaps you intend to assume that ##\mu_i## is an unbiased estimator of ##p_i##.

What do i know about the concentration of my whole bag ?

How do you define the concentration of the whole bag? Is it the average concentration of the 3 samples ?

-----
A difficulty with your problem is that ##\sigma_i^2 = \mu_i/t## must be interpreted as an estimate of the variance (of the population of all imaginable tests of a given concentration c that last for the time t). So there is "uncertainty" in the estimator ##\sigma_i##. Hence if you attempt to estimate the "uncertainty" (i.e. standard deviation) of ##(p_1 + p_2 + p_3)/3## you must examine how the uncertainty in the ##\sigma_i## contributes to the uncertainty in your estimate.

There are familiar statistical scenarios where the population standard deviation is assumed to be correctly estimated by the sample standard deviation. The justification for this assumption is that the uncertainty of the sample standard deviation is small when it is computed from a large sample by the usual formula. But in your case, you are not computing ##\sigma_i## by the usual formula.

imsolost · Oct 31, 2016

FactChecker said:

Is this a time series of measurements of a changing population at times t_i or are they measurements of the same population that take different elapsed times t_i?

Each sample doesn't change as a function of time : ρ_i is a constant over time.
As stated above, it's just that the longer the measurement, the more accurate it gets. The estimate µ_i of the real value ρ_i gets closer to ρ_i (see my post above with the distribution function f(ρ_i | µ_i)).

Now, is it the same population ? Well, all 3 samples were indeed taken from the same lot. The unknown concentration of the whole lot is ρ. I'd like to have an estimate of ρ and an evaluation of the uncertainty. But keep in mind that my 3 samples aren't necessarily the same. The lot can be heterogeneous : each sample "i" has its own concentration ρ_i which is normally distributed around ρ. That's why I thought I should first test the following hypothesis: H0 : For all i,j : ρ_i=ρ_j. I don't know how to test that.

FactChecker said:

If they are a time series of a changing population, then you should apply time series analysis. I am guessing that you don't mean that.

I don't know what time series are, but indeed the population isn't changing over time, so i guess you guessed well

FactChecker said:

If they are measurements of the same population that take different elapsed times t_i, then there is only one true mean that you want to estimate. The usual way to combine the mean estimators of the same thing from several samples of different sizes is to weight each estimate by its sample size. Since the relationship between the sample size and the estimator variance is sample_mean_sigma² = true_sigma² / sample_size, I would suggest weighting each estimator by 1/sample_mean_sigma². In your case, that would be a time-weighted estimate of the mean. A sample that took twice as long would have twice as much weight in the average estimator.

The concept of sample "size", in this problem, isn't very clear to me. But I was also expecting a weight that is a function of time, although I can't see any rigorous explanation for it. In your expression, can you define sample_mean_sigma and true_sigma ?

Again, I'd like to thank you guys for helping me on this.

edit : just saw the post of Stephen Tashi. Gonna read it and answer soon.

imsolost · Oct 31, 2016

Okay so I just read Tashi's very interesting post and I think we definitely are getting close to the core of my problem and the mistakes I'm probably making here.

Stephen Tashi said:

There is zero probability that two samples from a gaussian distribution are exactly equal.

You used the notation "##p_i##" to denote both a random variable and a parameter. To do a hypothesis test, you should test a statement about the distribution of a random variable . For example, you might test a hypothesis about a population parameter of its distribution.

If we do Bayesian statistics then we can assume a population parameter is a random variable. But unless ##p_i## and ##p_j## are discrete random variables it doesn't make sense to test the hypothesis that they are equal.

Honestly I think I was doing both at the same time (conventional stats and bayesian ones) and that's probably not appropriate. So let's say ρ_i is a population parameter and not a random variable.

That said, now that ρ_i for all i is a parameter, I don't see why I couldn't test an hypothesis like Ho : for all i,j : ρ_i=ρ_j.

Small side-conversation : {
Let's compare to the ANOVA analysis. Isn't it what they do ? Testing if sub-populations differ according to some explanatory variable ; which here translates into testing if my 3 samples differ by more than their "within-sample" deviation. The only difference with a classical ANOVA being that here, i have only one single measurement for each level which gives me an estimator of the mean and of the dispersion, while ANOVA has multiple results for each level from which it calculates... its mean and its dispersion !
But I can't use ANOVA theory since there is no such thing as sample size or degrees of freedom in my problem.
}

Stephen Tashi said:

So you are using ##\mu_i## to denote the result of a single measurement?

Yes. µ_i is the result of one single measurement on the sample i. But you are right, maybe I should have used another letter for it, it confuses people thinking it's a mean. That said, if you guys are okay with this, I propose keeping this notation in this post for consistency.

Stephen Tashi said:

I think you need to straighten out your notation. If ##\sigma_i## is a standard deviation of some population then it can't be given by a function that gives a different answer depending on the value of a sample that is drawn from that population (i.e. by ## \sigma_i^2 = \mu_i/t##). If you want ##\sigma_i## to be an estimator of a population parameter then its value can depend on the value(s) in a sample.

You are totally right. Big mistake on my part, so let's rephrase this : √(µ_i/t) is an estimator of the standard deviation.

Stephen Tashi said:

"Somewhere around" is too wishy-washy to translate into a mathematical statement. "Best estimate" is also subject to interpretation because there are several different criteria for what makes an estimate "good" or "best". (e.g. unbiased, minimum variance, maximum likelihood, least squares).

Perhaps you intend to assume that ##\mu_i## is an unbiased estimator of ##p_i##.

Yes ! Thank you for correcting me.

Stephen Tashi said:

How do you define the concentration of the whole bag? Is it the average concentration of the 3 samples ?

Well that's one of my questions.

I guess the best use of the information I have would be to say that the concentration of the whole bag is best estimated (unbiased estimator?) by the mean of the 3 results (on the 3 samples). So something like (µ1+µ2+µ3)/3 ? I don't know. Someone suggested using a weighted average that involves the elapsed time of measurement ? What would be the justification for this ? My intuition makes me think that, indeed, I should give more credit to a more accurate result and give it more weight. I'm confused :/

Stephen Tashi said:

(...) if you attempt to estimate the "uncertainty" (i.e. standard deviation) of ##(p_1 + p_2 + p_3)/3## you must examine how the uncertainty in the ##\sigma_i## contributes to the uncertainty in your estimate.

Yep ! But how can i do that ?

Stephen Tashi said:

There are familiar statistical scenarios where the population standard deviation is assumed to be correctly estimated by the sample standard deviation. The justification for this assumption is that the uncertainty of the sample standard deviation is small when it is computed from a large sample by the usual formula. But in your case, you are not computing ##\sigma_i## by the usual formula.

I didn't really understand that part.

Stephen Tashi · Oct 31, 2016

imsolost said:

That said, now that ρ_i for all i is a parameter, I don't see why I couldn't test an hypothesis like Ho : for all i,j : ρ_i=ρ_j.

Assuming ##p_i## is a population parameter there is no conceptual objection to a hypothesis test. But there may be a practical obstacle to doing a hypothesis test because you only have 1 sample from each population.
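One conventional way around the single-measurement obstacle, not proposed anywhere in this thread, is to treat each ##\sigma_i^2## as if it were known and compute a weighted sum of squared deviations from the inverse-variance weighted mean. If all the true values are equal, the measurements are Gaussian and the variances are exact, this statistic follows a chi-square distribution with (number of samples - 1) degrees of freedom. Here the variances are themselves estimates, so the result is only approximate. The function name and the numbers below are assumptions for illustration.

```python
def homogeneity_statistic(estimates, variances):
    """Weighted sum of squared deviations from the inverse-variance weighted mean.

    Returns the statistic and its degrees of freedom (number of samples - 1).
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * x for w, x in zip(weights, estimates)) / sum(weights)
    q = sum(w * (x - pooled) ** 2 for w, x in zip(weights, estimates))
    return q, len(estimates) - 1


# Hypothetical results (not from the thread)
mu = [10.0, 12.0, 11.0]
t = [100.0, 200.0, 50.0]
var = [m / ti for m, ti in zip(mu, t)]  # sigma_i^2 = mu_i / t_i, treated as known

q, dof = homogeneity_statistic(mu, var)
print(q, dof)
```

A value of `q` much larger than `dof` would suggest that the samples differ by more than their measurement uncertainties; the exact threshold comes from a chi-square table for `dof` degrees of freedom.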

The statement of your problem is ambiguous because the equation ##\sigma_i^2 = \mu_i/t## is merely the definition of an estimator of a variance. We don't have any probability model that explains why that formula ought to work (in any sense) as an estimator of the variance.

If that estimator is suggested by some data, it would help to know the format of that data. If the estimator is a theoretical result then what is the probability model that the theory uses?

If you can write computer programs, you can investigate the behavior of the estimator by the Monte-Carlo method . Pick an arbitrary "true" concentration value ##p_i## and a time t. Set the population standard deviation ##\sigma## equal to ##\sqrt{ p_i/ t}## . Generate random samples from a normal distribution with mean ##p_i## and standard deviation ##\sigma##. For each sample value ##s_k## , compute the value of the estimator ##\sigma_k = \sqrt{s_k/t}## and look at the distribution of the ##\sigma_k##.
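The simulation described above could be sketched as follows in Python, using only the standard library. The function name, the number of draws, the seed and the example values are assumptions for illustration. Non-positive draws are skipped because ##\sqrt{s_k/t}## is undefined for them; that is a choice of this sketch, not part of the description above.

```python
import math
import random
import statistics


def simulate_sigma_estimator(true_value, t, n_draws=10_000, seed=0):
    """Draw samples s_k ~ Normal(true_value, sqrt(true_value / t)) and
    return (true sigma, mean of sqrt(s_k / t), spread of sqrt(s_k / t))."""
    rng = random.Random(seed)
    true_sigma = math.sqrt(true_value / t)
    estimates = []
    for _ in range(n_draws):
        s = rng.gauss(true_value, true_sigma)
        if s <= 0:
            continue  # estimator undefined for non-positive draws; skipped here
        estimates.append(math.sqrt(s / t))
    return true_sigma, statistics.mean(estimates), statistics.stdev(estimates)


# Hypothetical "true" value and measurement time
true_sigma, mean_est, spread = simulate_sigma_estimator(true_value=10.0, t=100.0)
print(f"true sigma = {true_sigma:.4f}")
print(f"mean of estimates = {mean_est:.4f}, spread of estimates = {spread:.4f}")
```

Repeating the run with a smaller `true_value` or a shorter `t` shows how the spread of the estimator changes when the relative noise in a single measurement is larger.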

Stephen Tashi · Oct 31, 2016

imsolost said:

My intuition makes me think that, indeed, I should give more credit to a more accurate result and give it more weight.

If you knew that ##p_1 = p_2 = p_3 = p ## then it would make intuitive sense to give more weight to estimators of ##p## that had less variance. But if the 3 estimators are estimating different things, then it isn't clear that giving an estimator of ##p_1## more weight helps you estimate the mean ##(p_1 + p_2 + p_3)/3## better.

To answer questions about a good estimator for ##(p_1 + p_2 + p_3)/3## , you need a probability model for how ##p_1,p_2,p_3## are selected from some population of concentrations.

Comparing the various estimators of ##(p_1 + p_2 + p_3)/3## is another topic that can be investigated by Monte-Carlo simulations.
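A sketch of such a simulation in Python, under a probability model that is purely an assumption here: each true value ##p_i## is drawn from a normal distribution with mean `rho` and standard deviation `tau`, and each measurement is normal around ##p_i## with variance ##p_i/t_i##. The function name and all numbers are hypothetical. It compares the root-mean-square error of the equally weighted mean and the time-weighted mean as estimators of ##(p_1 + p_2 + p_3)/3##.

```python
import math
import random


def compare_estimators(rho, tau, times, n_trials=20_000, seed=0):
    """Return (RMSE of equal-weight mean, RMSE of time-weighted mean)
    for estimating the average of the true per-sample values."""
    rng = random.Random(seed)
    total_time = sum(times)
    sq_err_equal = 0.0
    sq_err_timed = 0.0
    for _ in range(n_trials):
        # True per-sample values; max() guards against non-positive draws
        true_vals = [max(rng.gauss(rho, tau), 1e-9) for _ in times]
        # One noisy measurement per sample, variance p_i / t_i
        measured = [rng.gauss(p, math.sqrt(p / t)) for p, t in zip(true_vals, times)]
        target = sum(true_vals) / len(true_vals)
        equal = sum(measured) / len(measured)
        timed = sum(m * t for m, t in zip(measured, times)) / total_time
        sq_err_equal += (equal - target) ** 2
        sq_err_timed += (timed - target) ** 2
    return math.sqrt(sq_err_equal / n_trials), math.sqrt(sq_err_timed / n_trials)


times = [10.0, 100.0, 1000.0]  # hypothetical, deliberately very unequal
homogeneous = compare_estimators(rho=10.0, tau=0.0, times=times)
heterogeneous = compare_estimators(rho=10.0, tau=3.0, times=times)
print("tau = 0:", homogeneous)
print("tau = 3:", heterogeneous)
```

With these assumed numbers, the time-weighted mean has the smaller error when `tau` is zero, and the equally weighted mean has the smaller error when `tau` is large compared with the measurement noise, which is the distinction drawn above.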

imsolost · Oct 31, 2016

I'll take more time to read your last 2 posts carefully, but I can already answer your first question. The measurand is not really a concentration. It's a count rate, which relates to a radioactive activity per unit mass (see here for a quick summary). Basically, for a given number of counts N_i, the uncertainty (variance) is (√N_i)² = N_i. So for counting during a time t_i, the measured count rate is R_i = N_i/t_i, with a variance N_i/t_i², which is equal to R_i/t_i. This is the expression written in the above posts.
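As a small worked example of that arithmetic in Python (the function name and the numbers are made up for illustration, and the variance of the counts is taken equal to the counts, as stated above):

```python
import math


def count_rate_with_uncertainty(counts, t):
    """Return (rate, variance of rate, standard deviation of rate)."""
    rate = counts / t
    variance = counts / t**2  # Var(N) / t^2 with Var(N) taken as N
    return rate, variance, math.sqrt(variance)


# Hypothetical measurement: 400 counts in 100 time units
rate, variance, sigma = count_rate_with_uncertainty(counts=400, t=100.0)
print(rate, variance, sigma)
# The variance equals rate / t, which is the relation used in this thread
print(math.isclose(variance, rate / 100.0))
```

Here the rate is 4 per unit time, the variance is 0.04 and the standard deviation is 0.2.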

Now, going to study the rest of your post !

FactChecker · Oct 31, 2016

imsolost said:

The concept of sample "size", in this problem, isn't very clear to me. But I was also expecting a weight that is a function of time, although I can't see any rigorous explanation for it. In your expression, can you define sample_mean_sigma and true_sigma ?

Here is the point I was trying to make. The normal method of combining estimates from sample groups of different sizes is by weighting each estimate by its sample size. That is because there is a direct inverse relationship between the sample size and the variance of that sample estimate. In your case, you have a direct inverse relationship between measurement elapsed time and variance. So measurement elapsed time takes the place of the sample size. Weigh your estimates by the measurement elapsed time.
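A minimal Python sketch of that time weighting, assuming all samples share one true rate. The function name and the counts are hypothetical. For count rates R_i = N_i / t_i, the time-weighted mean is the same number as total counts divided by total time.

```python
import math


def time_weighted_rate(rates, times):
    """Weight each rate by its measurement time.

    Returns the weighted rate and its variance estimate (rate / total time),
    which assumes a single common true rate for all samples.
    """
    total_time = sum(times)
    rate = sum(r * t for r, t in zip(rates, times)) / total_time
    return rate, rate / total_time


# Hypothetical counts and measurement times
counts = [1000, 2400, 550]
times = [100.0, 200.0, 50.0]
rates = [n / t for n, t in zip(counts, times)]

weighted, weighted_var = time_weighted_rate(rates, times)
pooled = sum(counts) / sum(times)  # total counts / total time
print(weighted, pooled, weighted_var)
print(math.isclose(weighted, pooled))
```

If the samples do not share one true rate, the variance returned here understates the uncertainty, because it ignores any spread between the samples.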

Stephen Tashi · Nov 2, 2016

FactChecker said:

Weigh your estimates by the measurement elapsed time.

I agree with that idea if there is some statistical relation among the concentrations (or counts) in the 3 samples.

However, consider a situation at the other extreme. Suppose we are trying to estimate the mean of 3 random variables that are "unrelated". For example let X1 be the age of a randomly selected resident of California, let X2 be the closing price of the stock of the Monsanto company on a randomly selected day in 2015, and let X3 be the mileage on a randomly selected automobile that is registered in the state of North Carolina. If we have different-sized samples of each of these random variables, is it wise to estimate the mean value of the random variable Y = X1+X2+X3 by using an unequally weighted sum of the sample means ?

From the original statement of the problem:

I have a population (say, a bag) from which i take 3 samples.

Each sample gets analysed once by a laboratory in order to know their concentration of a product.

This hints that the 3 samples might be samples "of the same thing". So the estimate of the mean concentration of the 3 samples, is also an estimate of the mean concentration of the population of things in the bag. In this case, I agree with using a weighted sum of the sample means.

But the question is also posed:

What do i know about the heterogeneity of my bag ?

So perhaps the things in the bag are not samples of the same thing.

FactChecker · Nov 2, 2016

Stephen Tashi said:

But the question is also posed:
What do i know about the heterogeneity of my bag ?
So perhaps the things in the bag are not samples of the same thing.

That's a good point. But how is that really different from any other source of random variation? I admit that if the amount of time of a measurement correlates with different types being selected from the bag, then there is a problem. But otherwise, I would just consider the differences due to the selection of a type from the bag just another source of random variation.

This entire problem seems so much like the problem of averaging poll results given only the published average result and margin of error. But I am not sure that anyone does that without also knowing the sample sizes, which they might just use directly for weighting the individual results. I searched for details of their methods, but didn't find any.

I also think that this is closely related to importance sampling and stratified sampling methods, where the "bag" is not homogeneous. But I could not find a reference that directly weighted the individual results using 1/strata_variance_i, rather than the sample size, N_i. The only thing I can think of is to weight the individual results by the time spent on the measurement, since that is given to be proportional to 1/strata_variance_i. I admit that I am kind of "winging it" here more than I should.
