# Population proportion from a sample estimate

1. Jul 26, 2015

### Adel Makram

I would like to know which statistical test is appropriate for inferring a population proportion from a sample taken from that population.
For example, in a sample of 20 people, 7 prefer red and 13 prefer blue. Which of the following methods should be used to decide whether there is a real color preference in the population?
1) A two-tailed binomial test assuming a success rate of 0.5: calculate the sum of the probabilities for x = 0 to 7 and compare it with 0.05.
2) Estimate the population proportion from the sample proportion (7/20) using a z-test for a single population proportion, and check whether 0.5 falls within the calculated range.
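
For reference, option 1 can be carried out exactly in a few lines of Python. This is only a minimal sketch using the standard library: the helper name `binom_two_sided_p` is made up for illustration, and the two-sided p-value is taken here as the total null probability of all outcomes no more likely than the observed one, which is one common convention among several.

```python
from math import comb

def binom_two_sided_p(x: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value (hypothetical helper).

    Sums the null probabilities of every outcome that is no more
    likely than the observed count x.
    """
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    # Small relative tolerance so tied outcomes (e.g. 7 and 13) both count.
    cutoff = pmf[x] * (1 + 1e-9)
    return sum(pk for pk in pmf if pk <= cutoff)

p_value = binom_two_sided_p(7, 20)
print(f"two-sided exact p-value = {p_value:.4f}")
```

For 7 out of 20 this comes to about 0.263, well above 0.05. Note that summing only x = 0 to 7 gives the one-tailed probability (about 0.132); a two-tailed test either doubles it or compares it with 0.025.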

2. Jul 26, 2015

### FactChecker

The Chi-squared goodness of fit test is easy to apply. In your example, if there were no preference, you would expect on average 10 in each of the red and blue groups. Putting this into a calculator such as http://vassarstats.net/csfit.html gives a probability of 0.2636. This means that if there really were no preference, a result at least this lopsided would occur in about one of every 4 trials.

3. Jul 26, 2015

### Stephen Tashi

Those words suggest that you want to estimate the population proportion rather than hypothesis test whether the population proportion is different than 0.5.

4. Jul 26, 2015

### Adel Makram

I think that both statements have the same interpretation. Knowing the population proportion from a sample proportion can tell us whether it differs statistically from 0.5.

5. Jul 26, 2015

### Adel Makram

So again, what is the optimal test?
I now have four candidate tests, each giving a different result:
1) Binomial distribution with x = 7 successes out of 20, a success rate of 0.5, and sample size n = 20.
2) z-test of a population proportion from a sample proportion.
3) Comparing the two sample proportions (7/20) and (13/20) using a t-test of sample proportions.
4) Chi-square test for goodness of fit.

All of these tests gave non-significant results, which means the null hypothesis of equal proportions is not rejected.

6. Jul 26, 2015

### Adel Makram

I used the link to calculate the test statistic and found p = 0.0099, which is statistically significant. However, when I used Excel I got a value of 0.057, which is not statistically significant. Excel did not give me the critical value, so I am not sure whether 0.057 is the test statistic or the p-value.

7. Jul 26, 2015

### FactChecker

One of the well-known statistics packages is R. It is open source and well documented. The Chi-squared goodness of fit test is described in many R tutorials. I don't know if it is the best you can get, but no one would criticize you for using R. The R command for your example is:
chisq.test(c(7,13), p=c(0.5,0.5))

which gives this result:
Chi-squared test for given probabilities
data: c(7, 13)
X-squared = 1.8, df = 1, p-value = 0.1797

This is noticeably smaller than the answer from the web site link. That may be because there is not enough data for a valid Chi-squared test, or because one of them applies a correction that the other does not.
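
One way to check the "correction" explanation is to compute the statistic by hand with and without the Yates continuity correction. The sketch below is an illustration under assumptions, not a description of what either tool does internally: the helper `chisq_gof_two_cells` is hypothetical, and it relies on the fact that for one degree of freedom the chi-squared upper tail equals `erfc(sqrt(stat / 2))`.

```python
from math import erfc, sqrt

def chisq_gof_two_cells(observed, expected, yates=False):
    """Chi-squared goodness-of-fit statistic and p-value for df = 1.

    Hypothetical helper, valid only for two categories: with one degree
    of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    """
    shift = 0.5 if yates else 0.0
    stat = sum((abs(o - e) - shift) ** 2 / e
               for o, e in zip(observed, expected))
    return stat, erfc(sqrt(stat / 2))

stat_plain, p_plain = chisq_gof_two_cells([7, 13], [10, 10])
stat_yates, p_yates = chisq_gof_two_cells([7, 13], [10, 10], yates=True)
print(f"uncorrected: X2 = {stat_plain:.2f}, p = {p_plain:.4f}")
print(f"Yates:       X2 = {stat_yates:.2f}, p = {p_yates:.4f}")
```

The uncorrected value reproduces the R output (X-squared = 1.8, p about 0.1797), and the corrected one comes to about 0.2636, the figure quoted from the web calculator. That suggests, though the thread does not confirm it, that the difference is the continuity correction rather than a lack of data.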

8. Jul 27, 2015

### Stephen Tashi

No, they don't have the same interpretation in mathematical statistics.

I don't understand your language. You can't "know" the population proportion just from a sample proportion.

Perhaps you mean "Comparing a sample proportion to an assumed population proportion of 0.5 ....".

9. Jul 27, 2015

### Adel Makram

I am also interested in the reasoning behind the solution. For example, why is Chi-square more useful for this question than the binomial test or the z-test? I know that many statistical problems have more than one solution, but in my case what is the advantage of choosing one test over another?

10. Jul 27, 2015

### Adel Makram

No, I mean that if the population proportion is unknown, we can still calculate it from a sample proportion using a z-test.

11. Jul 27, 2015

### Adel Makram

If $z = (p - \pi)/\sqrt{p(1-p)/n}$, then $\pi = p \pm z\sqrt{p(1-p)/n}$, where $p$ is the sample proportion, $\pi$ is the unknown population proportion, and $n$ is the sample size.
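
As a worked instance of that rearrangement with the numbers from this thread (assuming the usual 95% level, i.e. $z = 1.96$, which the post does not state), take $p = 7/20 = 0.35$ and $n = 20$:

$$\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.35 \times 0.65}{20}} \approx 0.107, \qquad \pi \approx 0.35 \pm 1.96 \times 0.107 \approx 0.35 \pm 0.209$$

That gives an interval of roughly 0.14 to 0.56, which contains 0.5.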

12. Jul 27, 2015

### Stephen Tashi

By that definition of z, you can't calculate it without already knowing $\pi$.

I think you are using the terminology "z test" when you mean "z statistic".

Perhaps you are thinking about "confidence intervals" for an estimate, not about hypothesis testing.

13. Jul 27, 2015

### Adel Makram

Yes, I mean the confidence interval in which the population proportion lies. In this case the denominator will be the standard error, not the standard deviation. So the sample proportion can still be used to derive the population proportion.

So testing the null hypothesis that the sample proportion is not different from a population proportion of 0.5 uses the standard deviation, while estimating the population proportion from the sample proportion, without knowing it, gives a confidence interval for that estimate. Which one is more appropriate in my case? And can we face a situation where the null hypothesis is not rejected (no difference between the sample and population proportions) while the estimate of the population proportion has a confidence interval that does not include 0.5?
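
The last question can be explored numerically. The sketch below assumes the z-test uses the standard error under the null value 0.5, while the interval is the simple Wald interval built from the sample proportion. Both function names are hypothetical, and the 8-out-of-10 sample is an invented illustration, not data from this thread.

```python
from math import sqrt

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal

def z_test_rejects(x, n, p0=0.5):
    """True if the z-test (standard error from the null p0) rejects H0."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    return abs(z) > Z_CRIT

def wald_interval(x, n):
    """95% Wald interval (standard error from the sample proportion)."""
    p_hat = x / n
    half = Z_CRIT * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

for x, n in [(7, 20), (8, 10)]:  # (8, 10) is a hypothetical sample
    low, high = wald_interval(x, n)
    print(f"{x}/{n}: reject H0 = {z_test_rejects(x, n)}, "
          f"interval = ({low:.3f}, {high:.3f})")
```

For 7/20 the two approaches agree: the test does not reject and the interval contains 0.5. For the hypothetical 8/10 they disagree: the test does not reject, yet the Wald interval lies entirely above 0.5. The two standard errors differ because one is computed at 0.5 and the other at the sample proportion.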

Last edited: Jul 27, 2015

14. Jul 27, 2015

### FactChecker

I think you could use either one. I don't know which would be more powerful, or whether there would be any difference in this case. In general, the Chi-squared test allows you to test whether a sample fits a distribution with several possible categories, not just two. The other tests would not apply in that case.

15. Jul 28, 2015

### Stephen Tashi

I think we need to clarify exactly what you want to do. For example, do you want to publish a paper in a scientific journal? Or use your results to invest in the stock market? Or design some machinery?

Statistics is a very technical topic, and the concept of "confidence intervals" has one meaning for the man in the street and a very different technical meaning in statistics. How far do you need to progress in the technical understanding of statistics to accomplish your goal? You might have to meet a higher standard to publish in a scientific journal than the personal standard you'd use to pick your own stock market investments.

16. Jul 28, 2015

### Adel Makram

I never thought it would be this complicated. My question was a very simple one: does 13/20 indicate that the preference for blue is higher than for red?

If you were facing this problem, how would you solve it?

17. Jul 28, 2015

### Stephen Tashi

Now you are posing the question as a hypothesis test instead of a problem involving confidence intervals.

First I would try to clarify the exact question that I want to ask. I'll attempt to state your question. You want to do a hypothesis test of whether the population proportion is different from 0.5. You have several alternative statistics that can be used. You want to know which statistic is "optimal".

One concept of "optimal" is the concept of "most powerful". In your problem, the "power" of a statistical test at a given true value of the population proportion is the probability that the null hypothesis (which is that the population proportion is 0.5) is rejected. For example, the power of the z-test at an assumed true population proportion of 0.633 would be the probability that the z-test rejects the null hypothesis when 0.633 is the actual proportion. To illustrate this conceptually, you can imagine doing a Monte-Carlo simulation to estimate the power. You would repeatedly simulate drawing 20 individuals from a population where the true proportion (favoring a color) is 0.633. You would apply the z-test to each of these batches of 20 individuals and see what fraction of the time the z-test correctly rejects the null hypothesis.
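
The Monte-Carlo procedure described above could be sketched as follows. This is an illustrative version under assumptions the post does not fix: a two-sided z-test at the 5% level using the null standard error, 20,000 simulated samples, and a fixed random seed. The function names are made up for the example.

```python
import random
from math import sqrt

def z_test_rejects(x, n, p0=0.5, z_crit=1.959964):
    """Two-sided z-test of H0: proportion = p0 at the 5% level."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    return abs(z) > z_crit

def simulated_power(true_p, n=20, trials=20_000, seed=0):
    """Fraction of simulated samples in which the z-test rejects H0."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # Draw n individuals; each favors the color with probability true_p.
        x = sum(rng.random() < true_p for _ in range(n))
        rejections += z_test_rejects(x, n)
    return rejections / trials

power_alt = simulated_power(0.633)
size_null = simulated_power(0.5)
print(f"estimated power at 0.633: {power_alt:.3f}")
print(f"estimated rejection rate at 0.5: {size_null:.3f}")
```

With only 20 individuals the estimated power at 0.633 should come out at roughly 0.2, so a true preference of that size would usually go undetected. The rejection rate at 0.5 estimates the actual size of the test, which for discrete data need not equal the nominal 0.05.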

We can probably look up material about the relative power of various tests of population proportions. Does the "power" of tests describe what you want to know about?

18. Jul 28, 2015

### Adel Makram

If I understand you correctly, this means that the population proportion of 0.633 in this example is known, and we would like to know whether a sample proportion of 0.5 could be drawn from that population by applying the z-test. It would probably be rejected at the 0.05 significance level, and this means the z-test is powerful in this case because the p-value will be less than 0.05.

My case is the opposite. I don't know the population proportion; I only know a sample proportion, which is 13/20, and I would like to know whether the population proportion is 0.5, since the 13/20 in my sample may have come about by chance alone.

Now if I follow you, I assume that the population proportion is 0.5 and ask whether the 13/20 in my sample could be drawn from this population by applying the z-test. Then 0.5 is an assumed value, not a true value, but I can still use it to calculate the standard deviation $\sqrt{\pi(1-\pi)}$. In this case I will not include n (the sample size) in my calculation. Am I right?

Last edited: Jul 28, 2015

19. Jul 29, 2015

### Adel Makram

I would like to make one small correction: the sample size, n, should be included in the denominator to give the standard error.
So H0: population proportion is 0.5.
H1: population proportion ≠ 0.5.

20. Jul 29, 2015

### Stephen Tashi

Yes
No
We don't want to know the probability of drawing a sample where the proportion is exactly 0.5.

We want to know the probability that a specific statistical test correctly rejects the null hypothesis that the population proportion is 0.5.

For example, "What is the probability that a z-test with a significance level of 0.05 correctly rejects the null hypothesis that the population proportion is 0.5".

To completely describe the power of the z-test, we plot a curve of its power for the full range of possible true population proportions. 0.633 is just one x-coordinate on this curve.

To compare the power of two tests, we plot their power curves on the same graph and see which curve is higher. We hope that one curve is always higher than the other, in which case the statistic with the higher curve is always better at rejecting the null hypothesis correctly, no matter which true value of the population proportion we use.

It may be that the curves cross, in which case you face a subjective choice.
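
A power comparison of this kind can be computed exactly for n = 20 rather than plotted. The sketch below makes several assumptions not stated in the post: both tests are two-sided at the 5% level, the exact binomial p-value is taken as twice the smaller tail, and the helper names and the grid of true proportions are made up for illustration.

```python
from math import comb, sqrt

N, ALPHA, Z_CRIT = 20, 0.05, 1.959964

def pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def z_rejects(x, n=N):
    """z-test of H0: p = 0.5 using the null standard error."""
    return abs(x / n - 0.5) / sqrt(0.25 / n) > Z_CRIT

def exact_rejects(x, n=N):
    """Exact binomial test of H0: p = 0.5 (doubling the smaller tail)."""
    tail = sum(pmf(k, n, 0.5) for k in range(0, min(x, n - x) + 1))
    return min(1.0, 2 * tail) <= ALPHA

def power(rejects, true_p, n=N):
    """Probability of landing in the rejection region when p = true_p."""
    return sum(pmf(x, n, true_p) for x in range(n + 1) if rejects(x))

z_region = [x for x in range(N + 1) if z_rejects(x)]
exact_region = [x for x in range(N + 1) if exact_rejects(x)]
print("z-test rejects for x in", z_region)
print("exact test rejects for x in", exact_region)
for true_p in (0.5, 0.6, 0.633, 0.7, 0.8, 0.9):
    print(f"p = {true_p}: z power = {power(z_rejects, true_p):.3f}, "
          f"exact power = {power(exact_rejects, true_p):.3f}")
```

Under these assumptions both tests reject for exactly the same counts (5 or fewer, or 15 or more, out of 20), so their power curves coincide for this sample size and level. That is a property of this particular n and significance level, not a general result.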

21. Jul 29, 2015

### Adel Makram

What I observed is that the sample and population proportions can be used interchangeably. The reason is that the z-test in our case is a two-tailed test, so interchanging π (population proportion) with p (sample proportion) will not affect the result. For example, having π = 0.5, p = 13/20 is equivalent to π = 13/20 and p = 0.5.

Having said that, the probability of rejecting H0 of π = 0.5 at the 0.05 significance level is equivalent to the probability that π lies outside the range of 2 SD (standard deviations) from the point estimate p = 13/20.

Last edited: Jul 29, 2015

22. Jul 29, 2015

### RUber

The two are not exactly interchangeable. Note that the standard deviation is based on your null hypothesis. If you are testing the null hypothesis that your population proportion is .5, then you apply the hypothesized variance to your test...i.e. .25.
If you are hypothesizing that your population proportion is 13/20, and seeing how likely an outcome of 10/20 would be, then you would apply the hypothesized variance of 13/20 * 7/20 = 91/400. Although these variances are close, they are noticeably different and that has everything to do with your null hypothesis.

Assuming the null hypothesis of even odds, then you have the larger standard deviation, and the range of plus/minus 2 (or 1.96) standard deviations would be the standard normal approximation.
Additionally, you might want to add in the discrete data considerations. If you only have 20 trials, then your possible options are discrete, e.g. 12/20, 13/20, 14/20.
So, if you want to exclude 13/20 entirely, then 12.5/20 should be outside of your tolerance, since on the continuous curve, anything in [12.5, 13.5) gets rounded to 13.
If you compare these results to the binomial method, you should have similar outcomes, since n = 20 is getting close to sufficiently large to make the normal approximation.
Without having additional information about special applications, I would say that the standard normal approximation to the binomial distribution should be acceptable.
One way I get a (2 tailed) p value of about .264, another I get about .263.
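
The two variances and the continuity-corrected normal approximation mentioned above can be checked with a short calculation. This is a sketch of one reading of the post, assuming the correction shifts the observed count half a unit toward the null mean; it is not necessarily the exact calculation the poster performed.

```python
from math import erfc, sqrt

n, x, p0 = 20, 13, 0.5

var_null = p0 * (1 - p0)         # 0.25, variance under H0: p = 0.5
var_alt = (x / n) * (1 - x / n)  # 91/400, variance if p were 13/20

# Continuity correction: treat the count 13 as the interval [12.5, 13.5)
# and measure the distance from the null mean to its nearer edge.
sd_count = sqrt(n * var_null)
z_corrected = (abs(x - n * p0) - 0.5) / sd_count

# Two-tailed normal p-value: P(|Z| > z) = erfc(z / sqrt(2)).
p_normal = erfc(z_corrected / sqrt(2))
print(f"variance under H0 = {var_null}, under 13/20 = {var_alt:.4f}")
print(f"continuity-corrected z = {z_corrected:.3f}, p = {p_normal:.4f}")
```

The corrected normal approximation gives a two-tailed p-value of about 0.264, in line with the first of the two figures quoted above.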

23. Jul 29, 2015

### RUber

I am not sure you used the calculator correctly. When I put your data into the same tool, I got the same results as FactChecker in post #2 - which was entirely in line with both the normal approximation and the discrete binomial p values in my last post.

So for this data set, I am not sure you could say that one test is better than another, since they are principally based upon the same assumptions.

24. Jul 29, 2015

### Stephen Tashi

I took a brief glance at statistical opinions on the web about this problem. The summary:

The binomial distribution is preferred to the normal distribution for sample sizes as small as 20, just because the normal distribution isn't an accurate approximation.

One-dimensional chi-square tests have low power.
