Choosing the number of trials when the data stream is unlimited

In summary, the conversation discusses collecting data from a Geiger-Muller radiation detector and finding the average clicks per second. It is mentioned that the clicks follow a Poisson distribution and that a long trial can be performed instead of multiple trials. The questions revolve around whether to break the data stream into segments and how to assess the standard error in the measurement. It is also mentioned that the terms "uncertainty" and "precision" are not formal statistical terms but are related to statistical concepts.
  • #1
Roger Dodger
TL;DR Summary
I can collect data from a device for as long as I want. Do I stop and start data collection to generate multiple trials?
I am collecting data from a Geiger-Muller radiation detector, which generates clicks that correspond to particles entering the detector. These clicks arrive purely at random, so the number of clicks in a given time interval is governed by the Poisson distribution. My job is to find the average clicks per second.

I can leave the detector on as long as I want and therefore can collect data for as long as I want. Unlike most experiments, which require the researcher to perform multiple trials, I can perform one long trial.

Should I break the long data stream into N segments, with each segment treated as a trial?

If it is better to use the one long trial, how do I assess the standard error in the measurement? Or, can I?

I have one more question: I performed a Google search to answer my questions, but found nothing. Are the answers to my questions common knowledge? I feel they are rather stupid, but I don't know the answer so I have no choice but to seek help.

Thanks in advance.
 
  • #2
It won't make any difference. If total count is k, then the variance is k, so your rate is k/t and rate std. dev. is ##\sqrt{k}/t##.
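A minimal sketch of that estimate (assuming Python with numpy; the true rate and counting time are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

true_rate = 7.3   # clicks per second -- hypothetical, unknown in a real experiment
t = 600.0         # total counting time in seconds

# One long trial: the total count is a single Poisson draw with mean true_rate * t
k = rng.poisson(true_rate * t)

rate_hat = k / t           # estimated rate, k/t
rate_se = np.sqrt(k) / t   # estimated std. dev. of the rate, sqrt(k)/t

print(f"rate = {rate_hat:.3f} +/- {rate_se:.3f} clicks/s")
```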
 
  • #3
mathman said:
If total count is k, then the variance is k,

To clarify, the total count ##k## is an estimator for the population variance.

so your rate is k/t
##\hat{r} = k/t## is an estimator for the population rate

and rate std. dev. is ##\sqrt{k}/t##.
##\hat{\sigma}_{\hat{r}} = \sqrt{k}/t## is an estimator for the (population) standard deviation of ##\hat{r}##.

Roger Dodger said:
Are the answers to my questions common knowledge?

Your questions ("is it better" etc.) are not well defined mathematical questions. They could be re-phrased as topics in the theory of statistical estimation. It is not common to find people who understand this and are able to re-phrase them.

I'd say it's common practice to believe that the more samples we take, the better our estimates. (Hence, taking counts over a longer interval is better than taking counts over a shorter one.) To supply a technical meaning for "better" and prove one way is better than another can be complicated. That task does not involve common knowledge.

Should I break the long data stream into N segments, with each segment treated as a trial?
Intuition says this won't make your estimate "better", but how do we wish to define "better"? - and how do we define the estimator?
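For the segmenting question specifically, a quick sketch (same hypothetical setup as above) shows that averaging the rates of N equal segments reproduces the pooled one-long-trial rate exactly, so the point estimate cannot change:

```python
import numpy as np

rng = np.random.default_rng(1)
true_rate, t, n_seg = 7.3, 600.0, 20   # hypothetical values
seg_t = t / n_seg

# Counts in 20 equal, independent segments of the same data stream
seg_counts = rng.poisson(true_rate * seg_t, size=n_seg)

pooled = seg_counts.sum() / t            # one long trial
averaged = (seg_counts / seg_t).mean()   # mean of the 20 segment rates

print(pooled, averaged)   # identical up to floating-point rounding
```

What the segments do add is an empirical spread (the sample standard deviation of the segment rates), which can be compared against the Poisson prediction.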
 
  • #4
One usually tries to achieve a certain precision in a particular measurement. In this case that depends on the length of the counting time, assuming this is the dominant source of uncertainty. Correct?

For Poisson-distributed events such as nuclear decays, it makes no difference whether you count for one long time or for multiple short periods from which you calculate a mean.

If you want X% precision and your rate is ##R##, then you can relate the chosen precision to the rate with ##\sigma_R/R = X/100##, where ##\sigma_R = (R/t)^{1/2} = \sqrt{n}/t## is the uncertainty of ##R##. This gives the simple relation ##n = (100/X)^2##, which (surprise) is independent of the time.
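A worked check of that relation (a sketch; n is the required total count):

```python
def counts_for_precision(x_percent: float) -> float:
    """Total counts n needed so that sqrt(n)/n = x_percent / 100."""
    return (100.0 / x_percent) ** 2

for x in (10, 3, 1, 0.5):
    print(f"{x}% precision -> {counts_for_precision(x):,.0f} counts")
# 1% precision needs 10,000 counts, however the counting time is divided up
```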
 
  • #5
gleem said:
where ##\sigma_R = (R/t)^{1/2} = \sqrt{n}/t## is the uncertainty of ##R##.

I don't know how to define the terms "uncertainty" and "precision" using the language of mathematical statistics, or whether these terms relate to confidence intervals. In case they do, the mathematical theory of a confidence interval for the mean of a Poisson distribution (versus the sample mean) looks more complicated than estimating a standard deviation by taking a square root. For example, see the section "Confidence interval" in the Wikipedia article https://en.wikipedia.org/wiki/Poisson_distribution.
 
  • #6
I used the term uncertainty a little loosely. ##\sqrt{n}/t## is the variance of the rate which has a specific meaning for a Poisson distribution, which quickly approaches a Gaussian distribution as the number of counts increases. Approximately 68% of measurements should lie within the interval of one variance from the calculated rate. If the interval is extended to two variances, then about 95% of the measurements are expected to be in that interval. An interval of three variances should encompass about 99% of measurements.
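Those coverages can be checked against the discrete Poisson distribution directly (a sketch with scipy; the mean of 100 is an arbitrary large value, and the one-sigma coverage lands slightly above the Gaussian 68% because both integer endpoints are included):

```python
import numpy as np
from scipy.stats import poisson

lam = 100.0                # arbitrary large mean, where the Gaussian limit applies
sigma = np.sqrt(lam)

for m in (1, 2, 3):
    lo, hi = lam - m * sigma, lam + m * sigma
    # P(lo <= X <= hi) for the integer-valued Poisson count X
    p = poisson.cdf(np.floor(hi), lam) - poisson.cdf(np.ceil(lo) - 1, lam)
    print(f"{m} sigma: {p:.3f}")   # about 0.71, 0.96, 0.998
```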

The statistical precision X that I referenced as a goal for the measurement is the interval in which I would expect 68% of the determinations of the rate to lie. If the rate is 100 and X = 1%, I would expect 68% of the measurements to lie in the interval 99 to 101.
 
  • #7
gleem said:
I used the term uncertainty a little loosely. ##\sqrt{n}/t## is the variance of the rate which has a specific meaning for a Poisson distribution

What does "##n##" represent? Is it a parameter of the distribution? - or is it a value measured in a sample?

If ##n## is a parameter of the Poisson distribution then, in the case at hand, it is unknown. If ##n## is a value measured in a sample then it has no deterministic relation to the probability of the sample mean being within certain limits. Perhaps there is some convention involved in the definitions of "uncertainty" and "precision" that says we will treat a value measured in a sample as being exactly equal to a parameter of the parent distribution?
 
  • #8
n is the data, in this case, the number of events detected.

AFAIK, uncertainty and precision are not formal statistical terms, but as I indicated they are related to statistical concepts. I am (was) an experimental physicist, and my use of and experience with statistics and statistical inference is not related to the usual applications found in the social or medical sciences.

Stephen Tashi said:
Perhaps there is some convention involved in the definitions of "uncertainty" and "precision" that says we will treat a value measured in a sample as being exactly equal to a parameter of the parent distribution?

The Poisson distribution. If n is a number of events of some kind that have a low probability of occurrence in some defined domain such as time or space, then n can be assumed to be an estimate of the mean whose parent distribution has a variance of ##\sqrt{n}##. So when you measure the number of events, you have some information about the parent distribution. Smaller values of n have larger variances and therefore larger uncertainties relative to the mean.
 
  • #9
gleem said:
If n is a number of events of some kind that have a low probability of occurrence in some defined domain such as time or space, then n can be assumed to be an estimate of the mean whose parent distribution has a variance of ##\sqrt{n}##. So when you measure the number of events, you have some information about the parent distribution. Smaller values of n have larger variances and therefore larger uncertainties relative to the mean.

I understand that as set of conventions, but they don't translate (unambiguously) into statements about mathematical statistics. I think we agree on this.

For example "smaller values of ##n## have larger variances". If ##n## is a quantity measured in a sample, it isn't a particular value of ##n## that has a variance, rather it is the distribution of possible values of ##n## that has a variance. So I think what is meant is that "smaller values of ##n## suggest that the distribution of ##n## has a smaller variance". However that statement is not a specific mathematical claim since "suggest" is ambiguous.

It would not be correct to say "smaller values of ##n## make it more likely that the variance of ##n## is smaller" unless we have assumed a Bayesian model where speaking of a probable value for the variance makes sense - i.e. that there exists a prior probability distribution for the value of the variance.
 
  • #10
Stephen Tashi said:
For example "smaller values of n have larger variances". If n is a quantity measured in a sample, it isn't a particular value of n that has a variance, rather it is the distribution of possible values of n that has a variance.

I am not sure what this discussion is trying to resolve. We do not need to measure a sample mean or variance as is usual for much of statistical data. If you are measuring the height of people in some area, measuring one person's height does not tell you anything about the distribution. To get an estimate of the mean and variance, you must take a sample from a representative group and compute a mean and variance.

For nuclear decay events, n is taken as an estimate of the mean of the parent population. ##\sqrt{n}## is an estimate of the variance of the parent population. It is expected that about 2/3 of all measurements will lie in the interval ##n \pm \sqrt{n}##.

Because the relative uncertainty ##1/\sqrt{n}## decreases as n increases, one always tries to accumulate as many events as possible to obtain a required accuracy. This is standard practice in measurements of nuclear decays.
The larger the n, the closer it comes to the actual mean value.
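A sketch of that tightening (the rate is hypothetical; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
true_rate = 7.3   # hypothetical

for t in (10, 100, 1_000, 10_000):
    k = rng.poisson(true_rate * t)
    # relative uncertainty 1/sqrt(k) shrinks as the count grows
    print(f"t = {t:6d} s: rate = {k / t:.3f} +/- {np.sqrt(k) / t:.3f}")
```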

I believe the OP question has been answered.
 
  • #11
Roger Dodger said:
Should I break the long data stream into N segments, with each segment treated as a trial?
It is often better to do this, because when the cleaner unplugs the experiment by mistake and plugs it back in again later hoping no-one notices, you will easily spot the outlier when you plot the results for that hour and be able to exclude it from the results.
 
  • #12
gleem said:
For nuclear decay events, n is taken as an estimate of the mean of the parent population. ##\sqrt{n}## is an estimate of the variance of the parent population.

I agree, in general. (Is the estimated variance ##n## or ##\sqrt{n}##? )

It is expected that about 2/3 of all measurements will lie in the interval ##n \pm \sqrt{n}##.

That may be an expectation in the intuitive sense of the word "expected". However, I question the literal mathematical interpretation of that statement. The mathematics of a 2/3 confidence interval for the sample mean vs the population mean of a Poisson distribution looks more complicated than the above formula. As I mentioned before, see the "Confidence interval" section of
https://en.wikipedia.org/wiki/Poisson_distribution#Confidence_interval. Does the simple formula approximate that complicated formula? Maybe it does.
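As a rough numerical answer (a sketch; the function below implements the exact Garwood interval from that Wikipedia section via chi-square quantiles): for k = 25 the exact 68.3% interval comes out close to the naive ##25 \pm \sqrt{25}##, so at counts of this size the simple formula is not a bad approximation.

```python
from scipy.stats import chi2

def poisson_ci(k: int, conf: float) -> tuple[float, float]:
    """Exact (Garwood) confidence interval for a Poisson mean, given observed count k."""
    alpha = 1.0 - conf
    lo = 0.0 if k == 0 else 0.5 * chi2.ppf(alpha / 2, 2 * k)
    hi = 0.5 * chi2.ppf(1.0 - alpha / 2, 2 * (k + 1))
    return lo, hi

print(poisson_ci(25, conf=0.6827))   # roughly (20.0, 31.1) vs naive (20, 30)
```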
I believe the OP question has been answered.

You're probably correct. People who ask questions about statistics in this section of the forum are often satisfied with empirical answers - even before their questions are made mathematically well defined! That's fine with me. I only object to answers that justify themselves with incorrect mathematics.
 
  • #13
Stephen Tashi said:
Intuition says this won't make your estimate "better", but how do we wish to define "better"? - and how do we define the estimator?

This seems woefully pedantic.
Here is my definition of "better": closer to the result obtained if I could do the experiment an arbitrarily large number of times.
And the estimator is whatever I choose it to be. This seems a question of experimental design.

I am truly interested in why this is not adequate...
 
  • #14
hutchphd said:
This seems woefully pedantic.
Correct mathematics is woefully pedantic.

Here is my definition of "better": closer to the result obtained if I could do the experiment an arbitrarily large number of times.
And the estimator is whatever I choose it to be.

To put that in mathematical terms, explain it using the concept of limits. "Do the experiment an arbitrarily large number of times" has no mathematical implications unless you are more specific.

This seems a question of experimental design.
Perhaps you can rephrase the original question as a well defined question in the topic of experimental design. Experimental Design is a field with its own jargon.
 
  • #15
Stephen Tashi said:
That may be an expectation in the intuitive sense of the word "expected"

Expectation is mathematically defined. For a Poisson distribution, the expectation is the number of events (per unit interval). When we measure the number of events we expect that the mean probably (68%) has a value in the interval defined by ##n \pm \sqrt{n}##.

Stephen Tashi said:
People who ask questions about statistics in this section of the forum are often satisfied with empirical answers - even before their questions are made mathematically well defined!

Do you think the OP question was properly stated? If not how would you state it?

Stephen Tashi said:
I only object to answers that justify themselves with incorrect mathematics.

As do I.
 
  • #16
gleem said:
Expectation is mathematically defined.
Provided we say what random variable's expectation this refers to.

For a Poisson distribution, the expectation is the number of events (per unit interval).
The number of events per unit interval varies. I think you intend to say that for a Poisson random variable, its expectation gives the expected number of events per unit interval.

When we measure the number of events we expect that the mean probably (68%) has a value in the interval defined by ##n \pm \sqrt{n}##.

Again, the distinction between ##n## as a parameter of the Poisson distribution versus ##n## as a specific count taken in a sample is not made. What you say is approximately true for the ##n## that is the parameter of the Poisson distribution. That's useful if you happen to know the exact value of the parameter.

If we say ##n## is a value measured in a sample, then we don't know that ##n## is the exact value of the parameter, and we have the confidence interval given in the Wikipedia article.

Furthermore, if we measure n = 25 counts, that information is not sufficient for computing the probability that the number of counts in similar experiments will fall in the interval [25-5, 25+5]. Claiming to compute this probability is a common misinterpretation of the concept of confidence intervals. For more on this, see the section "Misunderstandings" in the Wikipedia article https://en.wikipedia.org/wiki/Confidence_interval#Meaning_and_interpretation. To compute the probability that future counts will fall in the interval [25-5, 25+5] requires knowing the exact value of the parameter of the Poisson distribution.
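The correct coverage reading can be demonstrated by simulation (a sketch; the true mean is fixed inside the simulation, which is exactly what a real experimenter does not get to see):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 25.0        # true Poisson mean, known only to the simulation
n_rep = 100_000

k = rng.poisson(lam, size=n_rep)
se = np.sqrt(np.maximum(k, 1))     # guard against a zero-width interval at k = 0
covered = (np.abs(k - lam) <= se).mean()

# About 0.68 of the intervals k +/- sqrt(k) contain the true mean; that is a
# statement about the procedure over many repetitions, not about any one interval.
print(covered)
```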
Do you think the OP question was properly stated?

I'm more troubled by statements in some of the answers than by the original question!

Asking if it is "better" to run one long trial versus several shorter ones can be answered in a technical manner if "better" is precisely defined. One can also answer the question intuitively or based on one's empirical experience. What doesn't work is to claim that an intuitive or empirical answer is justified by some mathematical hocus pocus that does not define "better".

The part of the question that asks "How can I assess the standard error of measurement" requires us to interpret "assess". I'd translate it into math as "How can I estimate the standard deviation of the measured number of counts?"
 
  • #17
Stephen Tashi said:
To compute the probability that future counts will fall in the interval [25-5, 25+5] requires knowing the exact value of the parameter of the Poisson distribution.

Can we not predict the probability that a value will occur, or be greater or less than something, without knowing the exact parameters of the distribution? We hardly ever know the mean of a distribution precisely.

Stephen Tashi said:
I'd translate it into math as "How can I estimate the standard deviation of the measured number of counts?"

Make several determinations of the mean counts and calculate the variance, or, for a Poisson-distributed quantity, take the square root of the number of counts from a single measurement.

I would not call this approach of estimating the mean and variance a heuristic, or say that it is an empirical observation, because it is, for lack of a better term, implied by the Poisson and binomial distributions. The mean being the most probable value is the one most likely to appear in any measurement. Since we also get an estimate of the variance, we now have a range in which the actual mean most probably will occur.
 
  • #18
gleem said:
Can we not predict the probability that a value will occur, or be greater or less than something, without knowing the exact parameters of the distribution? We hardly ever know the mean of a distribution precisely.
By "predict the probability", do you mean "compute the probability"? No, we cannot compute the probability that the measured counts or rates of Poisson random variable will fall in a specific numerical interval like [20,30] without knowing the parameters of the distribution. The fact that we hardly ever know the exact parameters of a distribution doesn't change this. Math is heartless!

Make several determinations of the mean counts and calculate the variance, or, for a Poisson-distributed quantity, take the square root of the number of counts from a single measurement.
Yes, that describes a procedure of parameter estimation. Perhaps you intend that the square root of the sample counts estimates the standard deviation of the distribution, not the variance.

I would not call this approach of estimating the mean and variance a heuristic, or say that it is an empirical observation, because it is, for lack of a better term, implied by the Poisson and binomial distributions.

You can learn the theory of parameter estimation from standard texts on statistics. They cover the type of questions you are asking.

My general advice is to overcome the notion that statistics is conceptually simple. It is not.

By way of anti-analogy, people who study trigonometry encounter various problems with ladders leaning against walls, poles casting shadows, and so forth. A student can take the attitude that the conceptual complications are all in the trigonometry, and that one doesn't need to study ladders and poles in exhaustive detail. The ladders and poles in trigonometry can be handled by an intelligent person with common sense. Students can be successful in trigonometry while holding this outlook.

By contrast, a student who has studied algebra and calculus may regard problems in probability and statistics as being conceptually complex only in regards to the algebra and calculus. They may think that problems in statistics can be solved by common sense and intelligent conversation. This outlook won't work. Just stating a problem in statistics unambiguously can be conceptually complex.
The mean being the most probable value is the one most likely to appear in any measurement.
The probability of getting a sample mean equal to the population mean may be zero. For example, consider the mean of a binomial distribution with parameters N = 10 and p = .33. Even when the mean is the most probable value, it still may not be very probable.
 
  • #19
Stephen Tashi said:
No, we cannot compute the probability that the measured counts or rates of a Poisson random variable will fall in a specific numerical interval like [20,30] without knowing the parameters of the distribution. The fact that we hardly ever know the exact parameters of a distribution doesn't change this.

How do we determine confidence levels if we do not know the parameters of the distribution? How do we determine whether two statistics are significantly different?

Stephen Tashi said:
Perhaps you intend that the square root of the sample counts estimates the standard deviation of the distribution, not the variance.

Yes, actually I did, because the standard deviation is determined from the variance.

Stephen Tashi said:
You can learn the theory of parameter estimation from standard texts on statistics.

Thank you for the suggestion.

Stephen Tashi said:
The probability of getting a sample mean equal to the population mean may be zero. For example, consider the mean of a binomial distribution with parameters N = 10 and p = .33. Even when the mean is the most probable value, it still may not be very probable.

[Plot: binomial distribution with N = 10 and p = 0.33]
The mean is 3.3 and the variance is 2.2. The probability of observing the mean, in this case the closest value to the mean, is .26, hardly close to zero. The probability of finding the value of the mean in the range mean ± σ is .81, pretty high, and where you would expect to find the mean.
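Those numbers can be reproduced with scipy (a sketch; note that the within-one-sigma probability depends on which integers the interval is taken to include):

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.33
mean, sigma = n * p, np.sqrt(n * p * (1 - p))   # 3.3 and about 1.49

print(binom.pmf(3, n, p))   # ~0.26: the most probable value, nearest the mean

ks = np.arange(n + 1)
inside = (ks >= mean - sigma) & (ks <= mean + sigma)
print(binom.pmf(ks[inside], n, p).sum())        # integers 2..4 -> ~0.69
print(binom.pmf(np.arange(2, 6), n, p).sum())   # ~0.82 if k = 5 is counted too
```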

But for a Poisson approximation to be valid for a binomial distribution, p << 1 and the mean should be much smaller than the maximum number of events.
 
  • #20
gleem said:
How do we determine confidence levels if we do not know the parameters of the distribution?

Confidence intervals for the parameter of a distribution do not state the probability that the parameter is in an interval with definite numerical endpoints like [20,30]. A confidence interval gives information like "There is a .68 probability that a mean of a random sample is within plus or minus 5 of the population mean". It doesn't say what the actual population mean is. It doesn't say that if you have one particular sample mean such as 25 that there is a .68 probability that the population mean is within plus or minus 5 of that particular number. (Thinking that a confidence interval does give that information is a common misconception about confidence intervals.)

How do we determine whether two statistics are significantly different?
First, you have to know the technical definition of "significance". Once you know that definition, you will see that levels of significance are set in a subjective manner.

The mean is 3.3 and the variance is 2.2. The probability of observing the mean, in this case the closest value to the mean, is .26, hardly close to zero.

"The closest value to the mean" is not the population mean. The probability of observing exactly 3.3 is zero. That's my point. It's in reply to what you wrote:
The mean being the most probable value is the one most likely to appear in any measurement.

The probability of finding the value of the mean in the range mean ± σ is .81, pretty high, and where you would expect to find the mean.
If you're going to change your argument so it discusses the probability of values within 1 sigma of the mean, then it needs to be revised.

I think you are seeking a commonsense argument in support of an intuitive (and imprecise) idea that we can deduce or approximate the parameters of a distribution by taking random samples. To make that idea precise, you need a technical definition of "approximation" that applies when probability is involved. It is not simple to formulate such an idea in a logically consistent manner. For example, if we assume a random variable has a specific but unknown distribution, then the mean of the distribution (if it exists) is a specific but unknown number. Hence it makes no sense to talk about the probability of that mean being in a particular numerical interval such as [20, 30], because we are not describing an experiment where the mean is chosen at random in some way. The mean is not a random variable. It has no probability distribution.

If you want to formulate a statistical problem in a way where speaking of the probability of the mean being in [20,30] is not nonsense, then you have to formulate the problem so the mean is a random variable. Bayesian statistics does this.
 
  • #21
Before we continue could you tell me about your background in statistics?
 
  • #22
I took graduate level courses in statistics 40 years ago - master's degree type courses, nothing at the level of PhD research. The most advanced and most memorable course was based on the text Optimal Statistical Decisions by Morris DeGroot (1970 edition), so I have a Bayesian outlook. My career (as an operations research analyst) involved mainly contributing code to computer simulations, not doing statistics.
 
  • #23
pbuk said:
It is often better to do this, because when the cleaner unplugs the experiment by mistake and plugs it back in again later hoping no-one notices, you will easily spot the outlier when you plot the results for that hour and be able to exclude it from the results.
I have had a few laughs for this comment - although phrased lightheartedly, I meant it seriously. 20 short experiments, where you can identify and isolate the effects of an earthquake/sunspots/phase of the moon/an unknown factor to be investigated (I'd start with the cleaner), are much better than 1 long one where you simply end up with a less accurate answer.
 
  • #24
pbuk said:
I have had a few laughs for this comment - although phrased lightheartedly, I meant it seriously.

That highlights the difference between a statistical problem in the sense of a real-life problem versus a problem in the sense of a well-defined mathematical problem. Solving a real-life problem mathematically involves turning it into specific mathematics by making assumptions. The OP simplifies the problem to sampling a Poisson process with a constant but unknown parameter. Your point of view is valid in regards to considering the real-life problem.
 
  • #25
From a physics perspective: Collect data in several intervals, check if there are outliers or trends visible. If not, pool the measurements. If there are trends or outliers, find out what went wrong.
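A sketch of that workflow (the segment counts below are hypothetical, and the 3-sigma cut is one common, debatable choice):

```python
import numpy as np

def pooled_rate_with_checks(seg_counts, seg_t, z_cut=3.0):
    """Flag suspicious segments against the pooled mean, then pool the rest."""
    seg_counts = np.asarray(seg_counts, dtype=float)
    mu = seg_counts.mean()                  # expected counts per segment
    z = (seg_counts - mu) / np.sqrt(mu)     # Poisson-motivated z-scores
    keep = np.abs(z) <= z_cut
    if not keep.all():
        print("inspect segments:", np.flatnonzero(~keep))   # find out what went wrong
    return seg_counts[keep].sum() / (keep.sum() * seg_t)

# e.g. the detector was unplugged during the fifth 20-second segment
print(pooled_rate_with_checks([148, 152, 139, 160, 17, 151], seg_t=20.0))
```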
 

1. How do I determine the optimal number of trials when the data stream is unlimited?

There is no one-size-fits-all answer to this question as it depends on various factors such as the type of data, the desired level of accuracy, and the resources available. However, some methods that can help determine the optimal number of trials include statistical power analysis, simulation studies, and sequential testing.

2. Can I just keep collecting data indefinitely to ensure accuracy?

While it may seem like collecting an unlimited amount of data would guarantee accuracy, this is not always the case. In fact, continuing to collect data beyond a certain point may not significantly improve the results and can be a waste of resources. It is important to balance the trade-off between accuracy and efficiency.

3. How do I deal with the possibility of changing data patterns in an unlimited data stream?

In an unlimited data stream, it is possible for the underlying data patterns to change over time. To address this, researchers can use adaptive methods that adjust the number of trials based on the changing patterns. This allows for more efficient use of resources and can improve the accuracy of the results.

4. Is there a minimum number of trials required for accurate results?

The minimum number of trials required for accurate results depends on the type of data and the desired level of accuracy. In general, a larger sample size will result in more precise estimates and better statistical power. However, the minimum number of trials may also be affected by other factors such as the variability of the data and the effect size.

5. How can I ensure the validity of my results when choosing the number of trials in an unlimited data stream?

To ensure the validity of results, it is important to carefully plan and design the study. This includes choosing an appropriate sample size, using randomization and control groups, and considering potential biases. Additionally, conducting multiple analyses and sensitivity tests can help validate the results and ensure their reliability.
