Sampling for hypothesis testing

alan2
Hi guys. I'm not a statistician, although I use statistics enough that I'm surprised something is bothering me. I'm doing hypothesis testing on a population >100,000. What I'm wondering is whether there is any difference whatsoever between performing multiple tests on several samples and doing one test on a larger sample. For example, is one test on a sample of size 100 equivalent in all respects to 4 tests on samples of size 25 each? Is there any additional information to be gained by one method versus the other? If so, I can't find a reference (which leads me to believe there is no difference).

A bit of explanation might be helpful. This is an economics issue with essentially an infinite number of assets. There is some number of participants, each of whom may randomly choose some small fixed number of assets, say 20, from that infinite number available. So what I would like to say is, for example, each participant has a 95% chance of choosing a set which has a mean of property x in some interval as opposed to saying with 95% confidence that the population mean of property x lies in some interval. So it somehow seems to me that pulling multiple samples of size 20 and testing those would give me a better indication of the distribution of sample means than pulling one large sample. On the other hand that seems dumb and the two methods should be equivalent. Any guidance would be appreciated.
 
I think I answered my own question but input is certainly still welcome.
 
alan2 said:
For example, is one test on a sample of size 100 equivalent in all respects to 4 tests on samples of size 25 each?

What do you mean by "equivalent"? A hypothesis test is a procedure that "accepts" or "rejects" a statement. For example, if the single test "accepts" the statement, then what are you going to call an "equivalent" result from the 4 tests? That all 4 "accept" the statement? That 3 out of 4 accept it?
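To make the point about decision rules concrete, here is a minimal simulation sketch. Everything specific in it is an assumption chosen for illustration and not something from the question: normally distributed data with known standard deviation 1, a two-sided z-test of "mean = 0" at the 5% level, a true mean of 0.3, and the helper name `rejects`. It compares how often a single test on 100 observations rejects with how often two possible "combined" rules on four tests of 25 observations reject.

```python
import numpy as np

rng = np.random.default_rng(1)

def rejects(sample, mu0=0.0, sigma=1.0, z_crit=1.96):
    """Two-sided z-test of H0: mean == mu0, assuming sigma is known.

    Tests are run along the last axis, so a (trials, 4, 25) array
    yields a (trials, 4) array of reject/accept decisions.
    """
    n = sample.shape[-1]
    z = (sample.mean(axis=-1) - mu0) / (sigma / np.sqrt(n))
    return np.abs(z) > z_crit

TRUE_MEAN = 0.3   # assumed true mean, so H0 (mean 0) is actually false
TRIALS = 20000    # number of simulated experiments

# Each row is one experiment with 100 observations.
data = rng.normal(TRUE_MEAN, 1.0, size=(TRIALS, 100))

one_test = rejects(data)                           # one test on n = 100
four_tests = rejects(data.reshape(TRIALS, 4, 25))  # four tests on n = 25

power_single = one_test.mean()
power_all_four = four_tests.all(axis=1).mean()
power_three_of_four = (four_tests.sum(axis=1) >= 3).mean()

print("single test, n=100:      ", power_single)
print("all 4 small tests reject:", power_all_four)
print(">= 3 of 4 small tests:   ", power_three_of_four)
```

Under these assumptions the three rules reject at quite different rates on the same data, so "4 tests versus 1 test" is not well defined until you say how the four outcomes are combined.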

So what I would like to say is, for example, each participant has a 95% chance of choosing a set which has a mean of property x in some interval as opposed to saying with 95% confidence that the population mean of property x lies in some interval.

It's not clear how you select the interval.

So it somehow seems to me that pulling multiple samples of size 20 and testing those would give me a better indication of the distribution of sample means than pulling one large sample.

Using typical assumptions, we can estimate the parameters of the population distribution from the large sample and then calculate the distribution of the sample mean from the estimated population distribution. The typical assumptions are that the measurements of individuals are realizations of independent, identically distributed random variables from some given distribution.
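As a rough sketch of that procedure, the code below estimates the mean and standard deviation from one large sample, derives an approximate 95% interval for the mean of a future 20-asset selection, and then checks the interval against many directly simulated 20-asset selections. The normal population, its parameters (`MU`, `SIGMA`), the large-sample size `N`, and the use of 1.96 as a normal critical value are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population for illustration only: property x ~ Normal(MU, SIGMA).
MU, SIGMA = 5.0, 2.0
K = 20      # assets chosen by each participant
N = 2000    # size of the single large sample (assumed)

# Step 1: estimate the population parameters from one large sample.
big_sample = rng.normal(MU, SIGMA, size=N)
mu_hat = big_sample.mean()
sigma_hat = big_sample.std(ddof=1)

# Step 2: the mean of K i.i.d. draws has standard deviation sigma / sqrt(K).
# The extra 1/N term accounts for the uncertainty in mu_hat itself.
half_width = 1.96 * sigma_hat * np.sqrt(1.0 / K + 1.0 / N)
interval = (mu_hat - half_width, mu_hat + half_width)

# Step 3: compare with many directly simulated 20-asset selections.
selection_means = rng.normal(MU, SIGMA, size=(10000, K)).mean(axis=1)
inside = (selection_means >= interval[0]) & (selection_means <= interval[1])
coverage = inside.mean()

print("interval for a 20-asset mean:", interval)
print("share of simulated selections inside it:", coverage)
```

If the i.i.d. assumption holds, the share of simulated selections inside the interval should come out close to 95%, which suggests that drawing many small samples adds little beyond what the large sample already provides.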

On the other hand that seems dumb and the two methods should be equivalent.

I don't know exactly what the two methods are yet. Your first question is about "hypothesis testing" and your next question seems to be about "estimation". Those are two different statistical tasks.
 
Hey alan2.

When it comes to uncertainty in a distribution, what matters most is the information content of the sample. This is captured by a theorem in statistics (the Cramér-Rao bound) that relates information to the variance of an estimator.

The information content of a sample is greatest when all observations are independent. The worst case is when every observation is completely correlated with a single value. Typically we hope for the former and cringe at the latter, but in practical situations we hope that it is somewhere in between the two (and far closer to the independent and identically distributed ideal).

With hypothesis testing and sampling distributions, the idea is that you construct an interval at some level p (p = 1 - a, where a is the significance level) so that, over repeated samples, p*100 percent of the intervals built this way contain the true parameter. That is basically the best we can do statistically unless we find a way to reduce the uncertainty altogether, which in most cases is not possible theoretically or even practically.

When you think about your partitioning example, keep in mind the idea of information mentioned above. In statistics it is called Fisher information, which takes the form of a matrix for distributions with several parameters. Information theory has a different measure, Shannon entropy, which is related to Fisher information but not the same thing.
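As a worked illustration for the partitioning example, assume the observations are i.i.d. normal with unknown mean \mu and known variance \sigma^2 (an assumption made here only to keep the algebra simple). Fisher information about \mu is additive over independent observations, so four independent samples of 25 carry the same information as one sample of 100:

```latex
I_1(\mu) = \frac{1}{\sigma^2}, \qquad
I_n(\mu) = n \, I_1(\mu) = \frac{n}{\sigma^2}, \qquad
\operatorname{Var}(\hat{\mu}) \ge \frac{1}{I_n(\mu)} = \frac{\sigma^2}{n}

4 \cdot I_{25}(\mu) = 4 \cdot \frac{25}{\sigma^2} = \frac{100}{\sigma^2} = I_{100}(\mu)
```

Under these assumptions the sample mean attains the bound, so splitting the data neither adds nor removes information about \mu. Any difference comes from how the four results are combined.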

You need to assess whether one strategy has an advantage over the other from an information point of view. If one does, use the strategy that carries the most information, provided that information is probabilistically representative and relevant to the parameter being estimated.

As a reminder about information: correlation (for things like means) or some other dependence (for non-linear relationships) reduces the information content relative to the IID case, and the absence of such dependence does the opposite, moving the information towards its maximum in an IID setting.
 