# Test of hypothesis for independent samples

In summary, the conversation discusses the basics of t-tests, including the types of t-tests and when to use them. It also touches on the importance of understanding degrees of freedom and alpha levels in relation to hypothesis testing. One issue raised is the common convention of using a 5% alpha level, and the potential drawbacks of this practice. The conversation also brings up the idea of reporting effect sizes instead of just focusing on significance.

#### chwala

Gold Member
Homework Statement
Kindly look at the link below (the steps are pretty clear to me); I need some clarification though.
Relevant Equations
Stats
Reference:

https://www.statisticshowto.com/probability-and-statistics/t-test/

My question is, can we just as well 'subtract each ##x## score from each ##y## score'? Thanks.

...t-tests, after all, are easy to comprehend, as long as one knows the types;
i.e.
1. Independent samples test (compares means between two groups)
2. Paired samples test (compares means from the same group) &
3. One sample test

then you are good to go... after that it is a matter of comparing the calculated value against the critical value for the given degrees of freedom (df) and alpha level, to decide any of the given hypothesis questions.
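For concreteness, here is a minimal sketch of the three test statistics in plain Python. The function names and data are only illustrative, and the independent-samples version shown is the standard pooled-variance (equal-variance) form:

```python
import math
from statistics import mean, stdev

def t_one_sample(xs, mu0):
    """One-sample t: compare the sample mean to a fixed value mu0."""
    n = len(xs)
    t = (mean(xs) - mu0) / (stdev(xs) / math.sqrt(n))
    return t, n - 1  # (statistic, degrees of freedom)

def t_paired(xs, ys):
    """Paired t: a one-sample test on the pairwise differences."""
    return t_one_sample([x - y for x, y in zip(xs, ys)], 0.0)

def t_independent(xs, ys):
    """Independent two-sample t with a pooled variance estimate
    (assumes equal variances in the two groups)."""
    n1, n2 = len(xs), len(ys)
    sp2 = ((n1 - 1) * stdev(xs) ** 2 + (n2 - 1) * stdev(ys) ** 2) / (n1 + n2 - 2)
    t = (mean(xs) - mean(ys)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2
```

In each case you then compare ##|t|## against the critical value for the returned degrees of freedom at your chosen alpha level.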

#### Attachments

• stats1.png
Yes, you can. See the note under step 8.
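A quick numerical check of that (plain Python, made-up before/after scores): reversing the order of subtraction only flips the sign of the paired t statistic, so ##|t|## and the two-tailed p-value are unchanged either way.

```python
import math
from statistics import mean, stdev

def paired_t(diffs):
    """t statistic for H0: the mean pairwise difference is zero."""
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

before = [12.0, 15.0, 9.0, 11.0, 14.0]   # hypothetical x scores
after = [14.0, 16.0, 11.0, 10.0, 17.0]   # hypothetical y scores

t_y_minus_x = paired_t([y - x for x, y in zip(before, after)])
t_x_minus_y = paired_t([x - y for x, y in zip(before, after)])
# The two orderings give t values of equal magnitude and opposite sign.
```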

chwala
Why does the author indicate that if you do not have a specified alpha value you should use ##5\%##? Is there any specific/particular reason? Why not ##2\%## or ##10\%##, for that matter (i.e. reference step ##7##)?

For the paired t-test, from:
https://www.statisticshowto.com/probability-and-statistics/t-test/ ( My bold)

#### "When to Choose a Paired T Test / Paired Samples T Test / Dependent Samples T Test

Choose the paired t-test if you have two measurements on the same item, person or thing. But you should also choose this test if you have two items that are being measured with a unique condition. For example, you might be measuring car safety performance in vehicle research and testing and subject the cars to a series of crash tests. Although the manufacturers are different, you might be subjecting them to the same conditions.

With a “regular” two sample t test, you’re comparing the means for two different samples. For example, you might test two different groups of customer service associates on a business-related test or test students from two universities on their English skills. But if you take a random sample from each group separately and they have different conditions, your samples are independent and you should run an independent samples t test (also called between-samples and unpaired-samples).

The null hypothesis for the independent samples t-test is μ1 = μ2. So it assumes the means are equal. With the paired t test, the null hypothesis is that the mean pairwise difference between the two tests is zero (H0: µd = 0). "

A point I think is interesting here is that this technique is used in classification schemes: two objects are in the same class if the variability between them is within a limited range, and in different classes otherwise. As in: how/when do we declare two dogs to be of the same breed?

chwala
chwala said:
Why does the author indicate that if you do not have a specified alpha value you should use ##5\%##? Is there any specific/particular reason? Why not ##2\%## or ##10\%##, for that matter (i.e. reference step ##7##)?
It's become something of a standard, for no specific reason I'm aware of. This has given rise to criticism, on the basis that the choice of number is arbitrary. There has been some discussion of including effect size in such tests, in part for this reason: if, say, the difference in outcome between two medicines is significant at some level, but the effect size is minor, then you might not care as much. Meaning, if one medicine reduces the duration of a cold by 3 days (relative to no treatment) while the other reduces it by 4 days, then significance by itself is not of much value.

chwala
chwala said:
Why does the author indicate that if you do not have a specified alpha value you should use ##5\%##? Is there any specific/particular reason? Why not ##2\%## or ##10\%##, for that matter (i.e. reference step ##7##)?
It is a common convention in many fields. It's bad. The original idea sounds good: if you test one hypothesis in your study, then on average only 1 in 20 studies with no real effect will falsely call something significant. In practice, people rarely have just a single precisely defined hypothesis, and they don't correct their analyses properly for that. To make things worse, many journals don't want to publish null results, giving scientists an even larger incentive to dig up something they can call significant, and also making it impossible to see how many studies are done in total. As a result, we get tons of "significant" results that are just random fluctuations. You can see it in distributions of p-values: the range *just* below 0.05 is more common than we would expect, nicely shown in this plot (z-values, so p=0.05 corresponds to z=1.96).

Significance isn't the interesting property anyway. If your sample size is large enough, you will always find a significant effect for essentially everything; that doesn't mean it's relevant. If option 1 reduces some risk by 2% ± 0.5% (p<0.001 for having an effect) and option 2 reduces the risk by 40% ± 21% (p>0.05), which option do you prefer? The second one, of course: despite the smaller significance it is far more likely to help a lot, while the first option is a certain but minimal reduction.
More studies should report effect sizes instead of focusing on arbitrary "significance" thresholds.
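Those numbers can be checked directly. A small sketch in plain Python, treating each quoted effect ± uncertainty as a normal estimate (an approximation; the exact p-values depend on the underlying test):

```python
from statistics import NormalDist

def two_sided_p(effect, se):
    """Two-sided p-value for H0: effect = 0, normal approximation."""
    z = abs(effect / se)
    return 2 * (1 - NormalDist().cdf(z))

p1 = two_sided_p(0.02, 0.005)  # option 1: 2% +- 0.5% risk reduction
p2 = two_sided_p(0.40, 0.21)   # option 2: 40% +- 21% risk reduction
# p1 comes out below 0.001 ("highly significant") while p2 is above 0.05
# ("not significant"), yet option 2 is the one more likely to matter.
```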

WWGD
mfb said:
It is a common convention in many fields. It's bad. The original idea sounds good: if you test one hypothesis in your study, then on average only 1 in 20 studies with no real effect will falsely call something significant. In practice, people rarely have just a single precisely defined hypothesis, and they don't correct their analyses properly for that. To make things worse, many journals don't want to publish null results, giving scientists an even larger incentive to dig up something they can call significant, and also making it impossible to see how many studies are done in total. As a result, we get tons of "significant" results that are just random fluctuations. You can see it in distributions of p-values: the range *just* below 0.05 is more common than we would expect, nicely shown in this plot (z-values, so p=0.05 corresponds to z=1.96).

Significance isn't the interesting property anyway. If your sample size is large enough, you will always find a significant effect for essentially everything; that doesn't mean it's relevant. If option 1 reduces some risk by 2% ± 0.5% (p<0.001 for having an effect) and option 2 reduces the risk by 40% ± 21% (p>0.05), which option do you prefer? The second one, of course: despite the smaller significance it is far more likely to help a lot, while the first option is a certain but minimal reduction.
More studies should report effect sizes instead of focusing on arbitrary "significance" thresholds.
mfb, is this the main issue behind the problem of replicability/reproducibility of results in the social sciences? Is it a problem throughout all the sciences, rather than just the social sciences?

I think if people focused more on effect sizes and confidence intervals, and were fine with publishing null results, we would reduce the problem and increase reproducibility a lot. Reproduction would then mean results consistent within the uncertainties.

That's another issue with a binary "significant"/"not significant" classification. If one study claims an effect is significant (odds ratio 1.35, p=0.02, 95% CI from 1.05 to 1.65) and a similar study says it's not (odds ratio 1.25, p=0.10, 95% CI from 0.95 to 1.55), do they disagree? Of course not: they are within one standard deviation of each other. But they seem to say very different things.
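As a sketch of that consistency check (plain Python; this treats the quoted 95% intervals as symmetric normal intervals on the odds-ratio scale, a simplification since odds ratios are usually handled on the log scale):

```python
import math

def se_from_95ci(lo, hi, z95=1.96):
    """Standard error implied by a symmetric 95% confidence interval."""
    return (hi - lo) / (2 * z95)

se1 = se_from_95ci(1.05, 1.65)  # study 1: OR 1.35, "significant"
se2 = se_from_95ci(0.95, 1.55)  # study 2: OR 1.25, "not significant"

# z-score for the difference between the two point estimates
z_diff = (1.35 - 1.25) / math.sqrt(se1 ** 2 + se2 ** 2)
# z_diff comes out well below 1: the two studies agree within
# uncertainties, even though only one crosses the "significant" line.
```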

There are fields that do this much better, and particle physics is among them. Most studies are repetitions (typically, but not always, with better precision), most results of searches are null results (which get published without issue), and failing to reproduce previous measurements is very rare, even for measurements that find a non-zero value.

chwala and WWGD