# Sampling for hypothesis testing

In summary, the conversation discusses the use of hypothesis testing and sampling distributions in economics and how the choice of sample size and method can impact the information content and accuracy of the results. The importance of independent and identically distributed data is highlighted, as well as the concept of information and its impact on uncertainty in a distribution. The conversation also touches on the comparison between different statistical tasks, such as hypothesis testing and estimation, and the importance of choosing the method that maximizes information and is relevant to the parameter of interest.
alan2
Hi guys. I'm not a statistician, although I use statistics enough that I'm surprised something is bothering me. I'm doing hypothesis testing on a population >100,000. What I'm wondering is whether there is any difference whatsoever between performing multiple tests on several samples and just doing one test on a larger sample. For example, is one test on a sample of size 100 equivalent in all respects to 4 tests on samples of size 25 each? Is there any additional information to be gained by one method versus the other? If so, I can't find a reference (which leads me to believe there is no difference).

A bit of explanation might be helpful. This is an economics issue with essentially an infinite number of assets. There is some number of participants, each of whom may randomly choose some small fixed number of assets, say 20, from that infinite number available. So what I would like to say is, for example, each participant has a 95% chance of choosing a set which has a mean of property x in some interval as opposed to saying with 95% confidence that the population mean of property x lies in some interval. So it somehow seems to me that pulling multiple samples of size 20 and testing those would give me a better indication of the distribution of sample means than pulling one large sample. On the other hand that seems dumb and the two methods should be equivalent. Any guidance would be appreciated.

I think I answered my own question but input is certainly still welcome.

alan2 said:
For example, is one test on a sample of size 100 equivalent in all respects to 4 tests on samples of size 25 each?

What do you mean by "equivalent"? A hypothesis test is a procedure that "accepts" or "rejects" a statement. For example, if the single test "accepts" the statement, then what are you going to call an "equivalent" result from the 4 tests? That all 4 "accept" the statement? That 3 out of 4 accept it?
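To see why the combining rule matters, here is a minimal sketch. It assumes things the thread does not state: four independent tests, each run at a 5% significance level, and a one-sided z-test with a known standard deviation. The function names and the effect size of 0.3 are made up for illustration.

```python
from statistics import NormalDist

ALPHA = 0.05  # per-test significance level (assumed, not from the thread)

def combined_type1_rates(alpha: float, k: int) -> dict:
    """Type I error rate of two ways of combining k independent tests."""
    return {
        # Reject the null if at least one of the k tests rejects.
        "any_rejects": 1 - (1 - alpha) ** k,
        # Reject the null only if every one of the k tests rejects.
        "all_reject": alpha ** k,
    }

def power_one_sided_z(effect: float, sigma: float, n: int,
                      alpha: float = ALPHA) -> float:
    """Power of a one-sided z-test of H0: mu = mu0 against mu = mu0 + effect."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect * n ** 0.5 / sigma)

rates = combined_type1_rates(ALPHA, k=4)
power_100 = power_one_sided_z(effect=0.3, sigma=1.0, n=100)
power_25 = power_one_sided_z(effect=0.3, sigma=1.0, n=25)

print(rates)
print(f"power with n=100: {power_100:.3f}, with n=25: {power_25:.3f}")
```

Neither combining rule reproduces the 5% error rate of the single test, and each small test has less power than the large one, so "equivalent" has to be defined before the two designs can be compared.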

So what I would like to say is, for example, each participant has a 95% chance of choosing a set which has a mean of property x in some interval as opposed to saying with 95% confidence that the population mean of property x lies in some interval.

It's not clear how you select the interval.

So it somehow seems to me that pulling multiple samples of size 20 and testing those would give me a better indication of the distribution of sample means than pulling one large sample.

Using typical assumptions, we can estimate the population distribution parameters from the large sample and calculate the distribution of the sample mean from the estimated population distribution. Typical assumptions are that the measurements of individuals are realizations of independent identically distributed random variables from some given distribution.
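A rough sketch of that approach, under assumptions of my own: the population is simulated as Normal(5, 2), the large sample has 2000 observations, and each participant picks 20 assets. None of these numbers come from the original post, and the normal approximation for the sample mean is also an assumption.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

def draw(n):
    """Draw n values of property x from a hypothetical Normal(5, 2) population."""
    return [random.gauss(5.0, 2.0) for _ in range(n)]

# Step 1: estimate the population parameters from one large sample.
large_sample = draw(2000)
mu_hat, sd_hat = mean(large_sample), stdev(large_sample)

# Step 2: derive the distribution of the mean of k assets from those estimates.
k = 20
z = NormalDist().inv_cdf(0.975)
half_width = z * sd_hat / k ** 0.5
lo, hi = mu_hat - half_width, mu_hat + half_width

# Step 3 (check only): draw many size-k samples and see how often
# their mean falls inside the interval computed in step 2.
sample_means = [mean(draw(k)) for _ in range(5000)]
share_inside = sum(lo <= m <= hi for m in sample_means) / len(sample_means)

print(f"interval for a size-{k} mean: ({lo:.2f}, {hi:.2f})")
print(f"share of simulated size-{k} means inside: {share_inside:.3f}")
```

Step 3 is the "pull many samples of size 20" idea. In this IID setting it mainly confirms what the single large sample already gives in step 2.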

On the other hand that seems dumb and the two methods should be equivalent.

I don't know exactly what the two methods are yet. Your first question is about "hypothesis testing" and your next question seems to be about "estimation". Those are two different statistical tasks.

Hey alan2.

Whenever it comes to uncertainty in a distribution, the thing that matters most is the information content of the sample. This is captured by a theorem in statistics that relates information to uncertainty (and variance) in a distribution.

The information content of a sample is at its best when all observations are independent. The worst case is when every piece of information is completely correlated with one value. Typically we hope for the former and cringe at the latter, but in practical situations we hope that it is somewhere in between the two (and far closer to the independent and identically distributed ideal).

With hypothesis testing and sampling distributions, the idea is that if you construct an interval at confidence level p (p = 1 - a, where a is the significance level), then over repeated samples about p*100 percent of the intervals built this way will contain the true parameter. That is basically the best we can do statistically unless we find a way to reduce the uncertainty altogether, which in most cases is not possible theoretically or even practically.
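The "p*100 percent of the time" reading can be checked by simulation. The sketch below assumes a Normal(5, 2) population, samples of size 100, and a normal-approximation interval for the mean. All of these are illustrative choices, not details from the thread.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_MEAN, TRUE_SD = 5.0, 2.0   # hypothetical population values
N, REPS, LEVEL = 100, 2000, 0.95

z = NormalDist().inv_cdf(0.5 + LEVEL / 2)

def mean_interval(sample):
    """Normal-approximation confidence interval for the population mean."""
    m = mean(sample)
    half_width = z * stdev(sample) / len(sample) ** 0.5
    return m - half_width, m + half_width

hits = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    lo, hi = mean_interval(sample)
    hits += lo <= TRUE_MEAN <= hi   # True counts as 1

coverage = hits / REPS
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95. Note that this is a statement about the procedure over many samples, not about any single interval.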

When you talk about your partitioning example you need to keep in mind the idea of information mentioned above. In statistics we call it Fisher information (a matrix for multi-parameter distributions), and in information theory there is a different but related measure called Shannon entropy.

You need to assess whether one strategy has an advantage over another from an information point of view. If it does, it's a lot better to use the one that has maximum information, provided that information is probabilistically representative and relevant to the parameter of interest being estimated.

As a reminder about information: correlation (for things like means), or some other dependence (for non-linear relationships), reduces information content relative to the IID case, and removing that dependence does the opposite, moving information towards its maximum in an IID setting.
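As a worked illustration of the partitioning question, take a case that is an added assumption here, not something stated in the thread: the observations are normal with unknown mean $\mu$ and known variance $\sigma^2$. For $n$ IID observations the Fisher information about $\mu$ is additive, so four independent samples of 25 carry the same information as one sample of 100:

$$
I_n(\mu) = \frac{n}{\sigma^2}, \qquad 4 \cdot I_{25}(\mu) = 4 \cdot \frac{25}{\sigma^2} = \frac{100}{\sigma^2} = I_{100}(\mu).
$$

If instead every pair of observations shares a common correlation $\rho \ge 0$ (again an illustrative assumption), the variance of the sample mean grows and the information shrinks:

$$
\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}\,\bigl[1 + (n-1)\rho\bigr], \qquad I_n(\mu) = \frac{n}{\sigma^2\,\bigl[1 + (n-1)\rho\bigr]}.
$$

Setting $\rho = 0$ recovers the IID case, and $\rho = 1$ gives $I_n(\mu) = 1/\sigma^2$, the information in a single observation.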


## 1. What is sampling and why is it important for hypothesis testing?

Sampling is the process of selecting a subset of individuals or data points from a larger population. This smaller sample is used to make inferences about the larger population. Sampling is important for hypothesis testing because it allows us to make generalizations about a population without having to collect data from every single individual. A well-designed sampling scheme also limits bias and makes the study cheaper and faster to run.

## 2. What is the difference between random sampling and non-random sampling?

Random sampling, in its simplest form, is when every individual or data point in a population has an equal chance of being selected for the sample. This helps to reduce bias and increase the representativeness of the sample. Non-random sampling, on the other hand, selects individuals or data points by some other criterion, as in convenience or purposive sampling. This may introduce bias and limit the generalizability of the results.

## 3. How do you determine the appropriate sample size for hypothesis testing?

The appropriate sample size for hypothesis testing depends on several factors, such as the desired level of precision, the expected effect size, and the variability within the population. Generally, a larger sample size will provide more precise estimates and increase the power of the study. Statistical power analysis can be used to determine the necessary sample size for a specific study.
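As one concrete instance of a power calculation, the sketch below uses the standard normal-approximation formula for a two-sided, one-sample z-test, n = ((z_{1-a/2} + z_{power}) * sigma / effect)^2. The function name `required_n` and the example inputs are hypothetical, and the known-sigma z-test is an assumption; a t-test or a different design would need a different formula.

```python
from math import ceil
from statistics import NormalDist

def required_n(effect: float, sigma: float,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size for a two-sided one-sample z-test.

    effect: smallest difference from the null mean worth detecting
    sigma:  population standard deviation (assumed known)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return ceil(((z_alpha + z_power) * sigma / effect) ** 2)

# Detecting a shift of half a standard deviation at 5% significance, 80% power.
print(required_n(effect=0.5, sigma=1.0))  # → 32
```

Halving the effect size roughly quadruples the required sample, which is why the expected effect size has to be pinned down before the sample is drawn.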

## 4. Can you use data from a previous study as a sample for hypothesis testing?

Yes, data from a previous study can be used as a sample for hypothesis testing if it meets the criteria for a representative sample. This means that the data must be from a population that is similar to the population being studied, and the data must have been collected using appropriate sampling methods. Additionally, the data should be relevant to the research question being investigated.

## 5. What is the difference between a sample and a population in hypothesis testing?

A sample is a subset of individuals or data points from a larger population, whereas a population is the entire group of individuals or data points that the researcher is interested in studying. In hypothesis testing, the sample is used to make inferences about the population, which allows us to draw conclusions about the larger group without having to collect data from every individual.
