Sampling and analysis of variance

  • #1
imsolost
Dear forumers,
In advance, thank you for reading this post and helping me solve this.

The problem is the following :

I have a population (say, a bag) from which I take 3 samples.

Each sample gets analysed once by a laboratory in order to know its concentration of a product.

The result from the laboratory for each sample "i" is of the form: the real value of the concentration is normally distributed around µ_i with variance σ²_i.

σ²_i depends on the measurement time t_i: the longer the measurement, the more accurate the result, meaning that µ_i gets closer to the real concentration of sample "i". The relation between σ_i and t_i is known.

So far, so good.

Now the question is simply :

  • What do I know about the concentration of my whole bag? With which uncertainty (expressed as a variance σ)?
  • If I take a 4th sample, what can I expect from it? If I measure it during a time t_4, what will be the associated uncertainty (using all the information I have)?
  • What do I know about the heterogeneity of my bag? (i.e. how can I split σ between a term related to all the σ_i and another "heterogeneity" term representing the variance of the 3 means µ_1, µ_2 and µ_3?)
To me, the problem seems a bit similar to an ANOVA, except that in ANOVA you have multiple values for each sample, while here you have only one value with a known uncertainty. I don't know, I'm lost...

Would be amazing if you guys can give me some help on this!
 
  • #2
This is really a question about sampling techniques more than ANOVA. The formula for combining results with different variances can be found in https://en.wikipedia.org/wiki/Stratified_sampling It's not completely clear to me that this directly applies to your problem but I think it is getting close. Since you are talking about such a tiny sample, you will not have a good sample estimate of the variance. Unless you have some way of estimating the variance as a function of time, I don't think you can do anything.
 
  • #3
Hi FactChecker.

I don't really know if stratified sampling applies here...

I mean, I guess you could consider each of my 3 samples as a "stratum", but then quantities such as the size of a stratum are not really defined, so I can't apply that theory here.

I am a bit confused.

Btw, the number of samples, 3 in this case, is just for the example.

Thank you for trying to help me though !
 
  • #4
Sorry. I had assumed that the reference equation for the sample mean would have weightings inversely proportional to the variance. I see that it doesn't and it is not immediately clear to me how it should be done. I still think that the subject of sampling techniques is the right subject to address your question.
 
  • #5
imsolost said:
The relation between sigma_i and t_i is known.

What is that relation? Is it deterministic?
 
  • #6
What is that relation? Is it deterministic?

For the measurement of sample i during a time t_i giving a result µ_i, the variance is σ_i² = µ_i / t_i
 
  • #7
imsolost said:
For the measurement of samplei during a time ti giving a result µi, then the variance σi² = (µi / ti)

It isn't clear what the equation ##\sigma_i = \mu_i/ t ## signifies.

For a population parameter, we have to define the associated population. If you are trying to estimate the population mean of a set of samples and we denote the mean of this population as ##\mu##, then what population is associated with the mean ##\mu_i##, and how is it related to ##\mu##?

Likewise we need to know the population associated with ##\sigma_i##. For example, what population is associated with ##\mu_2## and ##\sigma_2## ?
 
  • #8
Stephen Tashi said:
It isn't clear what the equation ##\sigma_i = \mu_i/ t## signifies.

For a population parameter, we have to define the associated population. If you are trying to estimate the population mean of a set of samples and we denote the mean of this population as ##\mu##, then what population is associated with the mean ##\mu_i##, and how is it related to ##\mu##?

Likewise we need to know the population associated with ##\sigma_i##. For example, what population is associated with ##\mu_2## and ##\sigma_2##?

Please note that µ_i/t = σ_i² (the square disappeared from your expression).

Now, I'll try to answer your remark. The real value of the concentration in sample i, let's call it ρ_i, is unknown, but given that we measured µ_i, we assume that ρ_i is somewhere around µ_i and that µ_i is the best estimate. We assume here that f(ρ_i | µ_i), i.e. the probability distribution of ρ_i given that µ_i was measured, follows a Gaussian with mean µ_i and standard deviation √(µ_i/t).

And the whole problem is about :
  1. testing the following hypothesis: H0: for all i, j: ρ_i = ρ_j
  2. expressing the total variance s² as a split between one term that represents the heterogeneity (probably something of the form Σ_i (µ_i − µ)², where µ is the mean of all the µ_i) and another term that involves all the σ_i. In a sense, it's very similar to the ANOVA approach.
I hope this clarifies these things a bit :-/

And thank you for putting your time into helping me on this !
 
  • #9
"sigma²_i depends on the time of measurement t_i: the longer the measurement, the more accurate the result is"
Is this a time series of measurements of a changing population at times ti or are they measurements of the same population that take different elapsed times ti?

If they are a time series of a changing population, then you should apply time series analysis. I am guessing that you don't mean that.

If they are measurements of the same population that take different elapsed times ti, then there is only one true mean that you want to estimate. The usual way to combine the mean estimators of the same thing from several samples of different sizes is to weight each estimate by its sample size. Since the relationship between the sample size and the estimator variance is sample_mean_sigma2 = true_sigma2 / sample_size, I would suggest weighting each estimator by 1/sample_mean_sigma2. In your case, that would be a time-weighted estimate of the mean. A sample that took twice as long would have twice as much weight in the average estimator.
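To make the weighting concrete, here is a minimal Python sketch of the inverse-variance (equivalently, time-weighted) combination described above, assuming the relation σ_i² = µ_i/t_i stated earlier in the thread; the values of µ_i and t_i are made up for illustration.

```python
# Inverse-variance weighted mean, assuming sigma_i^2 = mu_i / t_i as stated
# in the thread. The numbers below are hypothetical.
mu = [4.8, 5.1, 5.0]        # measured results mu_i
t = [100.0, 200.0, 400.0]   # measurement times t_i

# Weight each result by 1 / sigma_i^2 = t_i / mu_i, i.e. (roughly)
# proportionally to its measurement time.
var = [m / ti for m, ti in zip(mu, t)]
w = [1.0 / v for v in var]

# Standard inverse-variance combination: weighted mean and its variance.
mu_hat = sum(wi * mi for wi, mi in zip(w, mu)) / sum(w)
var_hat = 1.0 / sum(w)
print(mu_hat, var_hat)
```

Note that a measurement that took twice as long gets (roughly) twice the weight, matching the time-weighting proposed above, and that the combined variance is smaller than any individual variance.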
 
  • #10
imsolost said:
i.e. the probability distribution of ρ_i given that µ_i was measured, follows a Gaussian with a standard deviation of √(µ_i/t) and a mean µ_i.

  1. testing the following hypothesis: H0: for all i, j: ρ_i = ρ_j

There is zero probability that two samples from a gaussian distribution are exactly equal.

You used the notation "##p_i##" to denote both a random variable and a parameter. To do a hypothesis test, you should test a statement about the distribution of a random variable. For example, you might test a hypothesis about a population parameter of its distribution.

If we do Bayesian statistics then we can assume a population parameter is a random variable. But unless ##p_i## and ##p_j## are discrete random variables it doesn't make sense to test the hypothesis that they are equal.

The real value of the concentration in the sample i, let's call it ρi, is unknown but considering we measured µi, we assume that ρi is somewhere around µi and µi is the best estimate.

So you are using ##\mu_i## to denote the result of a single measurement?

I think you need to straighten out your notation. If ##\sigma_i## is the standard deviation of some population then it can't be given by a function that gives a different answer depending on the value of a sample drawn from that population (i.e. by ## \sigma_i^2 = \mu_i/t##). If you want ##\sigma_i## to be an estimator of a population parameter, then its value can depend on the value(s) in a sample.

we assume that ρi is somewhere around µi and µi is the best estimate.
"Somewhere around" is too wishy-washy to translate into a mathematical statement. "Best estimate" is also subject to interpretation because there are several different criteria for what makes an estimate "good" or "best" (e.g. unbiased, minimum variance, maximum likelihood, least squares).

Perhaps you intend to assume that ##\mu_i## is an unbiased estimator of ##p_i##.

What do i know about the concentration of my whole bag ?

How do you define the concentration of the whole bag? Is it the average concentration of the 3 samples ?

-----

A difficulty with your problem is that ##\sigma_i^2 = \mu_i/t## must be interpreted as an estimate of the variance (of the population of all imaginable tests of a given concentration c that last for the time t). So there is "uncertainty" in the estimator ##\sigma_i##. Hence if you attempt to estimate the "uncertainty" (i.e. standard deviation) of ##(p_1 + p_2 + p_3)/3## you must examine how the uncertainty in the ##\sigma_i## contributes to the uncertainty in your estimate.

There are familiar statistical scenarios where the population standard deviation is assumed to be correctly estimated by the sample standard deviation. The justification for this assumption is that the uncertainty of the sample standard deviation is small when it is computed from a large sample by the usual formula. But in your case, you are not computing ##\sigma_i## by the usual formula.
 
  • #11
FactChecker said:
Is this a time series of measurements of a changing population at times ti or are they measurements of the same population that take different elapsed times ti?
Each sample doesn't change as a function of time: ρ_i is constant over time.
As stated above, it's just that the longer the measurement, the more accurate it gets: the estimate µ_i gets closer to the real value ρ_i (see my post above with the distribution function f(ρ_i | µ_i)).

Now, is it the same population? Well, all 3 samples were taken from the same lot indeed. The unknown concentration of the whole lot is ρ. I'd like to have an estimate of ρ and an evaluation of the uncertainty. But keep in mind that my 3 samples aren't necessarily the same. The lot can be heterogeneous: each sample "i" has its own concentration ρ_i, which is normally distributed around ρ. That's why I thought I should first test the following hypothesis: H0: for all i, j: ρ_i = ρ_j. I don't know how to test that.

FactChecker said:
If they are a time series of a changing population, then you should apply time series analysis. I am guessing that you don't mean that.

I don't know what time series are, but indeed the population isn't changing over time, so i guess you guessed well :smile:

FactChecker said:
If they are measurements of the same population that take different elapsed times ti, then there is only one true mean that you want to estimate. The usual way to combine the mean estimators of the same thing from several samples of different sizes is to weight each estimate by its sample size. Since the relationship between the sample size and the estimator variance is sample_mean_sigma2 = true_sigma2 / sample_size, I would suggest weighting each estimator by 1/sample_mean_sigma2. In your case, that would be a time-weighted estimate of the mean. A sample that took twice as long would have twice as much weight in the average estimator.

The concept of sample "size" in this problem isn't very clear to me. But I was also expecting a weight that is a function of time, although I can't see any rigorous justification for it. In your expression, can you define sample_mean_sigma and true_sigma?

Again, I'd like to thank you guys for helping me on this.

edit : just saw the post of Stephen Tashi. Gonna read it and answer soon.
 
  • #12
Okay so I just read Tashi's very interesting post and I think we definitely are getting close to the core of my problem and the mistakes I'm probably making here.

Stephen Tashi said:
There is zero probability that two samples from a gaussian distribution are exactly equal.

You used the notation "##p_i##" to denote both a random variable and a parameter. To do a hypothesis test, you should test a statement about the distribution of a random variable. For example, you might test a hypothesis about a population parameter of its distribution.

If we do Bayesian statistics then we can assume a population parameter is a random variable. But unless ##p_i## and ##p_j## are discrete random variables it doesn't make sense to test the hypothesis that they are equal.

Honestly, I think I was doing both at the same time (conventional stats and Bayesian ones), and that's probably not appropriate. So let's say ρ_i is a population parameter and not a random variable.

That said, now that ρ_i is a parameter for all i, I don't see why I couldn't test a hypothesis like H0: for all i, j: ρ_i = ρ_j.

Small side-conversation : {
Let's compare to the ANOVA analysis. Isn't that what they do? Testing whether sub-populations differ according to some explanatory variable, which here translates into testing whether my 3 samples differ by more than their "within-sample" deviation. The only difference from a classical ANOVA being that here I have only one single measurement for each level, which gives me an estimator of the mean and of the dispersion, while ANOVA has multiple data points for each level from which it calculates... its mean and its dispersion!
But I can't use ANOVA theory since there is no such thing as a sample size or degrees of freedom in my problem.
}
Stephen Tashi said:
So you are using ##\mu_i## to denote the result of a single measurement?

Yes. µ_i is the result of one single measurement on sample i. But you are right, maybe I should have used another letter for it; it confuses people into thinking it's a mean. That said, if you guys are okay with this, I propose keeping this notation in this thread for consistency.

Stephen Tashi said:
I think you need to straighten out your notation. If ##\sigma_i## is the standard deviation of some population then it can't be given by a function that gives a different answer depending on the value of a sample drawn from that population (i.e. by ## \sigma_i^2 = \mu_i/t##). If you want ##\sigma_i## to be an estimator of a population parameter, then its value can depend on the value(s) in a sample.

You are totally right. Big mistake on my part, so let's rephrase this: √(µ_i/t) is an estimator of the standard deviation.

Stephen Tashi said:
"Somewhere around" is too wishy-washy to translate into a mathematical statement. "Best estimate" is also subject to interpretation because there are several different criteria for what makes an estimate "good" or "best" (e.g. unbiased, minimum variance, maximum likelihood, least squares).

Perhaps you intend to assume that ##\mu_i## is an unbiased estimator of ##p_i##.

Yes ! Thank you for correcting me.

Stephen Tashi said:
How do you define the concentration of the whole bag? Is it the average concentration of the 3 samples ?

Well that's one of my questions. :biggrin:

I guess the best use of the information I have would be to say that the concentration of the whole bag is best estimated (unbiased estimator? :confused:) by the mean of the 3 results (on the 3 samples). So something like (µ_1+µ_2+µ_3)/3? I don't know o_O. Someone suggested using a weighted average that involves the elapsed measurement time? What would be the justification for this? My intuition makes me think that, indeed, I should give more credit to a more accurate result and give it more weight. I'm confused :/
Stephen Tashi said:
(...) if you attempt to estimate the "uncertainty" (i.e. standard deviation) of ##(p_1 + p_2 + p_3)/3## you must examine how the uncertainty in the ##\sigma_i## contributes to the uncertainty in your estimate.

Yep! But how can I do that?

Stephen Tashi said:
There are familiar statistical scenarios where the population standard deviation is assumed to be correctly estimated by the sample standard deviation. The justification for this assumption is that the uncertainty of the sample standard deviation is small when it is computed from a large sample by the usual formula. But in your case, you are not computing ##\sigma_i## by the usual formula.

I didn't really understand that part.
 
  • #13
imsolost said:
That said, now that ρ_i is a parameter for all i, I don't see why I couldn't test a hypothesis like H0: for all i, j: ρ_i = ρ_j.

Assuming ##p_i## is a population parameter there is no conceptual objection to a hypothesis test. But there may be a practical obstacle to doing a hypothesis test because you only have 1 sample from each population.

The statement of your problem is ambiguous because the equation ##\sigma_i^2 = \mu_i/t## is merely the definition of an estimator of a variance. We don't have any probability model that explains why that formula ought to work (in any sense) as a estimator of the variance.

If that estimator is suggested by some data, it would help to know the format of that data. If the estimator is a theoretical result then what is the probability model that the theory uses?

If you can write computer programs, you can investigate the behavior of the estimator by the Monte-Carlo method. Pick an arbitrary "true" concentration value ##p_i## and a time t. Set the population standard deviation ##\sigma## equal to ##\sqrt{ p_i/ t}##. Generate random samples from a normal distribution with mean ##p_i## and standard deviation ##\sigma##. For each sample value ##s_k##, compute the value of the estimator ##\sigma_k = \sqrt{s_k/t}## and look at the distribution of the ##\sigma_k##.
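A minimal Python sketch of the Monte-Carlo experiment just described; the values of p and t are arbitrary choices for illustration, not values from the thread:

```python
import random
import statistics

# Monte-Carlo sketch of the estimator behaviour described above.
# The "true" concentration p and the measurement time t are arbitrary.
random.seed(0)
p, t = 5.0, 100.0
sigma = (p / t) ** 0.5                 # population standard deviation sqrt(p/t)

# Draw samples s_k ~ N(p, sigma) and compute the estimator sqrt(s_k / t).
samples = [random.gauss(p, sigma) for _ in range(100_000)]
estimates = [(s / t) ** 0.5 for s in samples]

# The estimator scatters around the true sigma; its spread shows how much
# "uncertainty about the uncertainty" a single measurement carries.
print(statistics.mean(estimates), statistics.stdev(estimates), sigma)
```

With these numbers the spread of the estimator is small relative to sigma, but the same experiment with short times or low concentrations would show it growing.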
 
  • #14
imsolost said:
My intuition makes me think that, indeed, I should give more credit to a more accurate result and give it more weight.

If you knew that ##p_1 = p_2 = p_3 = p ## then it would make intuitive sense to give more weight to estimators of ##p## that had less variance. But if the 3 estimators are estimating different things, then it isn't clear that giving an estimator of ##p_1## more weight helps you estimate the average ##(p_1 + p_2 + p_3)/3## better.

To answer questions about a good estimator for ##(p_1 + p_2 + p_3)/3## , you need a probability model for how ##p_1,p_2,p_3## are selected from some population of concentrations.

The various estimators of ##(p_1 + p_2 + p_3)/3## are another topic that can be investigated by Monte-Carlo simulations.
 
  • #15
I'll take more time reading your last 2 posts carefully, but I can already answer your first question. The measurand is not really a concentration. It's a count rate, which relates to a radioactive activity per unit mass (see here for a quick summary). Basically, for a given number of counts N_i, the uncertainty (variance) is (√N_i)². So for a counting during a time t_i, the measured count rate is N_i/t_i = R_i, with a variance N_i/t_i², which is equal to R_i/t_i. This is the expression written in the above posts.
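For concreteness, a tiny Python check of that counting-statistics relation; the counts and counting time below are hypothetical numbers:

```python
# Counting statistics as stated in the post: N_i counts observed over a
# time t_i give a rate R_i = N_i / t_i with Poisson variance N_i on the
# counts, so Var(R_i) = N_i / t_i**2, which equals R_i / t_i.
N_i, t_i = 10_000, 200.0        # hypothetical counts and counting time
R_i = N_i / t_i                 # measured count rate
var_R = N_i / t_i**2            # variance of the rate, (sqrt(N_i) / t_i)**2

print(R_i, var_R)
```

Doubling the counting time at the same rate doubles N_i but quarters N_i/t_i², halving the variance of the rate, which is exactly the time dependence discussed above.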

Now, going to study the rest of your post !
 
  • #16
imsolost said:
The concept of sample "size" in this problem isn't very clear to me. But I was also expecting a weight that is a function of time, although I can't see any rigorous justification for it. In your expression, can you define sample_mean_sigma and true_sigma?
Here is the point I was trying to make. The normal method of combining estimates from sample groups of different sizes is by weighting each estimate by its sample size. That is because there is a direct inverse relationship between the sample size and the variance of that sample estimate. In your case, you have a direct inverse relationship between measurement elapsed time and variance. So measurement elapsed time takes the place of the sample size. Weigh your estimates by the measurement elapsed time.
 
  • #17
FactChecker said:
Weigh your estimates by the measurement elapsed time.

I agree with that idea if there is some statistical relation among the concentrations (or counts) in the 3 samples.

However, consider a situation at the other extreme. Suppose we are trying to estimate the mean of 3 random variables that are "unrelated". For example, let X1 be the age of a randomly selected resident of California, let X2 be the closing price of the stock of the Monsanto company on a randomly selected day in 2015, and let X3 be the mileage on a randomly selected automobile that is registered in the state of North Carolina. If we have different-sized samples of each of these random variables, is it wise to estimate the mean value of the random variable Y = X1+X2+X3 by using an unequally weighted sum of the sample means?

From the original statement of the problem:
I have a population (say, a bag) from which I take 3 samples.

Each sample gets analysed once by a laboratory in order to know its concentration of a product.

This hints that the 3 samples might be samples "of the same thing". So the estimate of the mean concentration of the 3 samples is also an estimate of the mean concentration of the population of things in the bag. In this case, I agree with using a weighted sum of the sample means.

But the question is also posed:
What do i know about the heterogeneity of my bag ?

So perhaps the things in the bag are not samples of the same thing.
 
  • #18
Stephen Tashi said:
But the question is also posed:
What do i know about the heterogeneity of my bag ?
So perhaps the things in the bag are not samples of the same thing.
That's a good point. But how is that really different from any other source of random variation? I admit that if the amount of time of a measurement correlates with different types being selected from the bag, then there is a problem. But otherwise, I would just consider the differences due to the selection of a type from the bag as another source of random variation.

This entire problem seems so much like the problem of averaging poll results given only the published average result and margin of error. But I am not sure that anyone does that without also knowing the sample sizes, which they might just use directly for weighting the individual results. I searched for details of their methods, but didn't find any.

I also think that this is closely related to importance sampling and stratified sampling methods, where the "bag" is not homogeneous. But I could not find a reference that directly weighted the individual results using 1/strata_variance_i rather than the sample size N_i. The only thing I can think of is to weight the individual results by the time spent on the measurement, since that is given to be proportional to 1/strata_variance_i. I admit that I am kind of "winging it" here more than I should.
 

