# Sampling for hypothesis testing

Hi guys. I'm not a statistician, although I use statistics enough that I'm surprised something is bothering me. I'm doing hypothesis testing on a population >100,000. What I'm wondering is whether there is any difference whatsoever between performing multiple tests on several samples and just doing one test on a larger sample. For example, is one test on a sample of size 100 equivalent in all respects to 4 tests on samples of size 25 each? Is there any additional information to be gained by one method versus the other? If so, I can't find a reference (which leads me to believe there is no difference).

A bit of explanation might be helpful. This is an economics issue with essentially an infinite number of assets. There is some number of participants, each of whom may randomly choose some small fixed number of assets, say 20, from that infinite number available. So what I would like to say is, for example, each participant has a 95% chance of choosing a set which has a mean of property x in some interval as opposed to saying with 95% confidence that the population mean of property x lies in some interval. So it somehow seems to me that pulling multiple samples of size 20 and testing those would give me a better indication of the distribution of sample means than pulling one large sample. On the other hand that seems dumb and the two methods should be equivalent. Any guidance would be appreciated.

## Answers and Replies

I think I answered my own question but input is certainly still welcome.

Stephen Tashi
Science Advisor
> For example, is one test on a sample of size 100 equivalent in all respects to 4 tests on samples of size 25 each.

What do you mean by "equivalent"? A hypothesis test is a procedure that "accepts" or "rejects" a statement. For example, if the single test "accepts" the statement, then what are you going to call an "equivalent" result from the 4 tests? That all 4 "accept" the statement? That 3 out of 4 accept it?
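To illustrate why the combining rule matters, here is a minimal simulation sketch. It rests on assumptions that are not in the thread: a two-sided z-test with known standard deviation at the 5% level, normally distributed data, and three hypothetical rules for turning four small tests into one decision. The function names and the effect size of 0.3 are made up for illustration.

```python
import math
import random
import statistics

# Two-sided 5% critical value for a z-test (assumed test; sigma treated as known).
Z_CRIT = statistics.NormalDist().inv_cdf(0.975)

def rejects(sample, mu0, sigma):
    """Return True if the z-test rejects H0: mean == mu0."""
    z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return abs(z) > Z_CRIT

def simulate(true_mu, mu0=0.0, sigma=1.0, trials=4000, seed=1):
    """Rejection rates for one test on n=100 versus four tests on n=25."""
    rng = random.Random(seed)
    counts = {"one_n100": 0, "all_4": 0, "at_least_3": 0, "any_of_4": 0}
    for _ in range(trials):
        data = [rng.gauss(true_mu, sigma) for _ in range(100)]
        # Split the same 100 observations into four disjoint samples of 25.
        n_rej = sum(rejects(data[i:i + 25], mu0, sigma) for i in range(0, 100, 25))
        counts["one_n100"] += rejects(data, mu0, sigma)
        counts["all_4"] += n_rej == 4
        counts["at_least_3"] += n_rej >= 3
        counts["any_of_4"] += n_rej >= 1
    return {name: c / trials for name, c in counts.items()}

under_null = simulate(true_mu=0.0)  # H0 true: rates are false-positive rates
under_alt = simulate(true_mu=0.3)   # H0 false: rates are power
print("H0 true :", under_null)
print("H0 false:", under_alt)
```

In this setup each combining rule gives a different false-positive rate and a different power, so "4 tests on 25" is not a single well-defined procedure until the rule is fixed.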

> So what I would like to say is, for example, each participant has a 95% chance of choosing a set which has a mean of property x in some interval as opposed to saying with 95% confidence that the population mean of property x lies in some interval.

It's not clear how you select the interval.

> So it somehow seems to me that pulling multiple samples of size 20 and testing those would give me a better indication of the distribution of sample means than pulling one large sample.

Using typical assumptions, we can estimate the population distribution parameters from the large sample and calculate the distribution of the sample mean from the estimated population distribution. Typical assumptions are that the measurements of individuals are realizations of independent, identically distributed random variables from some given distribution.
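A minimal sketch of that approach, under assumptions that are not in the thread: property x is normally distributed, the population mean and standard deviation (`MU`, `SIGMA`) are made-up values, and the large sample has 2000 observations. The interval for a participant's mean of 20 assets is computed from the large-sample estimates, then checked against simulated participants.

```python
import math
import random
import statistics

def predicted_interval(large_sample, k, z=1.96):
    """Central ~95% interval for the mean of k IID draws,
    using the mean and standard deviation estimated from one large sample."""
    mu_hat = statistics.mean(large_sample)
    sigma_hat = statistics.stdev(large_sample)
    half_width = z * sigma_hat / math.sqrt(k)
    return mu_hat - half_width, mu_hat + half_width

def fraction_inside(lo, hi, k, n_participants, rng, mu, sigma):
    """Simulate participants who each pick k assets; return the share whose mean is in [lo, hi]."""
    hits = 0
    for _ in range(n_participants):
        m = statistics.mean([rng.gauss(mu, sigma) for _ in range(k)])
        hits += lo <= m <= hi
    return hits / n_participants

rng = random.Random(0)
MU, SIGMA = 5.0, 2.0  # hypothetical population parameters, unknown in practice
large_sample = [rng.gauss(MU, SIGMA) for _ in range(2000)]

lo, hi = predicted_interval(large_sample, k=20)
frac = fraction_inside(lo, hi, k=20, n_participants=5000, rng=rng, mu=MU, sigma=SIGMA)
print(f"predicted interval for a 20-asset mean: [{lo:.3f}, {hi:.3f}]")
print(f"share of simulated participants inside: {frac:.3f}")
```

If the normality assumption is doubtful, the same check could be done by resampling sets of 20 from the large sample instead of using the formula.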

> On the other hand that seems dumb and the two methods should be equivalent.

I don't know exactly what the two methods are yet. Your first question is about "hypothesis testing" and your next question seems to be about "estimation". Those are two different statistical tasks.

chiro
Science Advisor
Hey alan2.

Whenever it comes to uncertainty in a distribution, what matters most is the information content of the sample. This is captured by a theorem in statistics that relates information to uncertainty (and variance) in a distribution.

A sample carries the most information when all observations are independent. The worst case is when every observation is completely correlated with a single value. Typically we hope for the former and cringe at the latter, but in practical situations we hope that it is somewhere in between the two (and far closer to the independent and identically distributed ideal).

With hypothesis testing and sampling distributions, the idea is that if you construct an interval at confidence level p (p = 1 - a, where a is the significance level), then over repeated sampling p*100 percent of the intervals built this way will contain the true parameter. That is basically the best we can do statistically unless we find a way to reduce the uncertainty altogether, which in most cases is not really possible theoretically or even practically.

When you talk about your partitioning example you need to keep in mind the idea of information mentioned above. In statistics we call it Fisher information, which takes the form of a matrix when the distribution has several parameters. In information theory there is a different measure (similar to Fisher information but not exactly the same) called Shannon entropy.
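As a worked case for the partitioning example, assume (this is not stated in the thread) that the observations are IID normal with unknown mean $\mu$ and known variance $\sigma^2$. Under that assumption the Fisher information about $\mu$ in a sample of size $n$, and the variance of the sample mean, are:

$$
I_n(\mu) = \frac{n}{\sigma^2}, \qquad \operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n} = \frac{1}{I_n(\mu)}
$$

Fisher information adds across independent samples, so four independent samples of 25 give:

$$
\sum_{j=1}^{4} I_{25}(\mu) = 4 \cdot \frac{25}{\sigma^2} = \frac{100}{\sigma^2} = I_{100}(\mu)
$$

In this simple model the four small samples, taken together, carry the same information about $\mu$ as one sample of 100. That equality holds for the pooled data; reducing each small sample to an accept/reject decision before combining would generally discard some of it.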

You need to assess whether one strategy has an advantage over the other from an information point of view. In general it's a lot better to use the strategy that has maximum information, where that information is probabilistically representative and relevant to the parameter being estimated.

As a reminder about information: correlation (for things like means) or some other dependence (for non-linear relationships) will reduce information content relative to the IID assumption, and the absence of such dependence will do the opposite, moving information towards its maximum in an IID setting.
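A rough sketch of how correlation eats into the information about a mean, under an assumed model that is not from the thread: n observations with unit variance and the same pairwise correlation `rho`, generated from one shared factor. For that model the variance of the sample mean is (1/n)(1 + (n - 1)·rho), which suggests an "effective sample size" of n / (1 + (n - 1)·rho). The function names are made up for illustration.

```python
import math
import random
import statistics

def var_of_mean_theory(n, rho, sigma=1.0):
    """Variance of the mean of n equicorrelated variables with common variance sigma^2."""
    return sigma ** 2 / n * (1 + (n - 1) * rho)

def effective_sample_size(n, rho):
    """Number of independent observations giving the same variance of the mean."""
    return n / (1 + (n - 1) * rho)

def var_of_mean_simulated(n, rho, trials=5000, seed=2):
    """Simulate equicorrelated normals via a shared factor and measure the variance of their mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)
        # Each x has variance rho + (1 - rho) = 1 and pairwise covariance rho.
        xs = [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0)
              for _ in range(n)]
        means.append(statistics.mean(xs))
    return statistics.pvariance(means)

for rho in (0.0, 0.3):
    print(
        f"rho={rho}: theory={var_of_mean_theory(20, rho):.4f}, "
        f"simulated={var_of_mean_simulated(20, rho):.4f}, "
        f"effective n={effective_sample_size(20, rho):.2f}"
    )
```

With `rho = 0` the effective sample size equals n, which is the IID ideal described above; any positive correlation pushes it below n.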
