# Confidence interval based on two-sample tests

Mutaja

## Homework Statement

The following two samples are corrosion values for 20 untreated pipes (Xi) and 20 surface-treated pipes (Yi).

∑Xi = 950, ∑Xi² = 46344 and ∑Yi = 1092, ∑Yi² = 62136

We want to examine whether there is a basis to claim that the surface treatment reduces corrosion.

a)

1: Find a 95% confidence interval for µX - µY based on the samples above. Be clear about which assumptions you have made. Does the surface treatment reduce the corrosion?

2: Perform a hypothesis test with significance level 5%. Again, be clear about which assumptions you have made in order to use this test.

A new test is being done on 20 different pipes. This time the test is done by using the surface treatment on one side (Xi), and the other side of the same pipe is left untreated (Yi). Below is the result.

b)

Do a hypothesis test of whether the surface treatment reduces corrosion. Again use significance level 5%. Be clear about which assumptions you have made in order to use this test.

## Homework Equations

Not sure. Let me know and I will edit this section.

## The Attempt at a Solution

I will focus on A for now.

First I need to identify the average corrosion values:
x̅ = ∑Xi / 20 = 950/20= 47.5
Ȳ = ∑Yi / 20 = 1092/20 = 54.6

Then I find the Z value or T value from the tables (not sure which) for 95% confidence interval.
Z0.025 = 1.96
T0.025 = 2.093

Finding the variance:

S² = 1/19 * Σ(Xi - x̅)² + Σ(Yi - Ȳ)²
I calculated this manually to be 196.41.

Which then gave me the confidence interval: 950 - 1092 +/- 2.093*√(196.41) * √(1/20 + 1/20)
= [-151.28, -132.72]

Where am I going wrong, and how do I use the numbers I get for my confidence interval to draw a conclusion?

PS: I remember from way back, when I used this forum, that there was a much easier way of typing formulas properly. Also, the problem is translated from Lithuanian, so please let me know if something is unclear.

#### Attachments

• NFnWq98.png
• wCDrtgi.png

Homework Helper
Gold Member
In your formula for ##S^2## you have assumed the samples have equal variance, which is not justified by the question.

The test to perform here is Welch's t-test for unpaired data, where equal variances are not assumed. See here.

To type clear mathematical formulae, you can use LaTeX. Instructions are here. Also, I think some of your formulae above are missing important parentheses, which makes it hard to follow them.

Mutaja
Thank you for your reply. In the link you sent, I do not understand how to use the information to get a formula I can use to determine different variance for the two samples.

With the formula $$T = \frac{\bar Y_1 - \bar Y_2}{S_p \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}},$$ I feel like both ##T## and ##S_p## are unknown; or should I use the T I found before? Slightly confused, sorry.

Thanks for showing me how the "new" LaTeX works. From memory, I think there used to be an easier drop-down menu; not sure why that's changed, but I'll work with this. Below are the formulas from the opening post, rewritten (I can't seem to edit my opening post?), since you commented that I might be missing parentheses.

Numbers from the exercise:
∑Xi = 950, ∑Xi² = 46344 and ∑Yi = 1092, ∑Yi² = 62136

Average corrosion values:
x̅ = ∑Xi / 20 = 950/20 = 47.5
Ȳ = ∑Yi / 20 = 1092/20 = 54.6

Z value and T value from the tables (turns out I only need a T value, but not this one?) for 95% confidence interval.
Z0.025 = 1.96
T0.025 = 2.093

Finding the variance:

##S^2 = \frac 1 {19} \left[ \sum_{i=1}^{n=20} (X_i - \bar X)^2 + \sum_{i=1}^{n=20} (Y_i - \bar Y)^2 \right]##

Which results in ##S^2 = 196.41##

I then get this confidence interval: ##950 - 1092 \pm 2.093 * \sqrt{196.41} * \sqrt{\frac 1 {20} + \frac 1 {20}} = [-151.28, -132.72]##

Homework Helper
Gold Member
That's the wrong formula. Immediately before that formula are the words "If equal variances are assumed, then the formula reduces to:", which are not applicable here.

Use the formula that doesn't assume equal variances, which is the one after the words "Test Statistic:". It uses the two sample variances ##s_1{}^2## and ##s_2{}^2##, which you calculate from the two samples separately with the usual formula for sample variance.
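
Since the problem statement gives only the sums, the two sample variances can also be obtained without the raw data, using the shortcut ##s^2 = \frac{\sum x_i^2 - (\sum x_i)^2/n}{n-1}##. A minimal Python sketch of that calculation (the function name `sample_variance_from_sums` is only illustrative):

```python
def sample_variance_from_sums(sum_x, sum_x2, n):
    """Unbiased sample variance computed from sum(x) and sum(x^2)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

n = 20
s2_x = sample_variance_from_sums(950, 46344, n)    # untreated pipes (X)
s2_y = sample_variance_from_sums(1092, 62136, n)   # surface-treated pipes (Y)

print(round(s2_x, 3), round(s2_y, 3))   # 64.158 132.253
```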

Mutaja
I'm sorry, I must've typed the wrong formula while looking at the LaTeX codes at the same time; I did indeed post the wrong formula.

But your reply here cleared up my confusion regardless, I will get back to you once I've completed this step if necessary.

Mutaja
I used this formula: ##T = \frac {\bar X - \bar Y}{\sqrt{\frac {S_X^2}{N_X}} + \sqrt{\frac {S_Y^2}{N_Y}}} ##

Where
##\bar X = 47.5##
##\bar Y = 54.6##
##N_X = 20##
##N_Y = 20##

And ##S_X^2## is calculated using normal sample variance:

##S_X^2 = \frac {(39-47.5)^2 + ... + (62-47.5)^2} {19} = \frac {8.5^2+ ... + 14.5^2} {19} = \frac {1219} {19} = 64.158##

##S_Y^2## was calculated using the same formula (I will type it out if wanted) = 132.253

I then got: ##T = \frac {47.5 - 54.6}{\sqrt{\frac {64.158}{20}} + \sqrt{\frac {132.253}{20}}} = -1.627##

In regards to my opening post, I'm not sure where to use this number, as I had a different (wrong) approach there.

Is this my 95% confidence interval?

95% confidence interval: ##X_i - Y_i \pm T_{0.025} * \sqrt T * \sqrt {\frac {1} {20} + \frac {1} {20}}##

##950 - 1092 \pm 2.093 * \sqrt {1.627} * \sqrt {\frac {1} {20} + \frac {1} {20}}##

= [-141.156, -142.844] (changed the negative square root to positive for obvious reasons).

I still can't make much sense out of this.

Homework Helper
Dearly Missed
Your formula for ##T## is incorrect. Look at
http://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm
for the correct form.
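
For reference, the statistic on that page has a single square root taken over the sum of the two variance terms. A small Python sketch using the numbers posted above shows how much the two versions differ (the helper name `welch_t` is illustrative, not taken from any library):

```python
import math

def welch_t(mean_x, mean_y, s2_x, s2_y, n_x, n_y):
    """Welch test statistic: one square root over the summed variance terms."""
    standard_error = math.sqrt(s2_x / n_x + s2_y / n_y)
    return (mean_x - mean_y) / standard_error

# Means and sample variances as quoted in the thread
t_one_root = welch_t(47.5, 54.6, 64.158, 132.253, 20, 20)

# Version with two separate square roots; the denominator is then
# no longer the standard error of the difference in means
t_two_roots = (47.5 - 54.6) / (math.sqrt(64.158 / 20) + math.sqrt(132.253 / 20))

print(round(t_one_root, 3), round(t_two_roots, 3))
```

The single-root version comes out near -2.27, while the two-root version reproduces the -1.627 posted above.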

Mutaja
As andrewkirk suggested above, I used the formula in the link you both gave me, the formula after "test statistic". I just renamed it using ##S_X^2## and ##S_Y^2##, to match the text in my exercise, instead of ##s_1^2## and ##s_2^2## as used in the formula.

Hope this clears it up. Is it still incorrect?

Edit: I see where it's wrong now, I used two square roots instead of one in the denominator.

##T = \frac {\bar X - \bar Y}{\sqrt{\frac {S_X^2}{N_X} +\frac {S_Y^2}{N_Y}}} ##

This gave T = -2.265. Plugging that into the formula I provided earlier, which I have in my notes for determining a confidence interval, I get this:

##X_i - Y_i \pm T_{0.025} * \sqrt T * \sqrt {\frac {1} {20} + \frac {1} {20}}##

##950 - 1092 \pm 2.093 * \sqrt {2.265} * \sqrt {\frac {1} {20} + \frac {1} {20}}##

95% confidence interval: [-141, -143]

I rounded the decimals as they were .00 and .99.

To me, this is something I still can't get a clear answer from.

Homework Helper
Dearly Missed
Your numbers must be wrong: your ##X,Y## data values each range from 39 to 70, so there is no way the difference in their means can be less than -141.

Mutaja
## \bar X ## and ## \bar Y ## are correct since they are simply ∑##X_i## and ∑##Y_i## which are gathered from the exercise text, divided by 20.

Are ##S_X^2 = 64.158## and ##S_Y^2 = 132.253## the wrong numbers? I've double- and triple-checked those numbers. Repeating my formula:

##S_X^2 = \frac {(39-47.5)^2 + ... + (62-47.5)^2} {19} = \frac {8.5^2+ ... + 14.5^2} {19} = \frac {1219} {19} = 64.158##

Same method for ##S_Y^2##.

##N_X## and ##N_Y## are 20, since I have 20 numbers in both samples, so that can't be wrong.

I will go over ##S_X^2## and ##S_Y^2## again.

Edit: They are both correct. I can't see any inputs in my formulas that are incorrect at this time.

Homework Helper
Dearly Missed
No, no, no! The confidence interval for the difference in means must include the sample mean-difference within it. The sample mean difference is ##\bar{x} - \bar{y} = 47.5 - 54.6 = -7.1##, so the confidence-interval must contain the number ##-7.1##. Your interval of ##[-143, -141]## misses the value ##-7.1## by a huge amount.

There really MUST be something wrong.
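
One way to see where a value near -142 comes from: the centre of the interval was taken as the difference of the sums, which is ##n## times the difference of the means. A tiny Python check with the totals from the problem statement (variable names are illustrative only):

```python
n = 20
sum_x, sum_y = 950, 1092

diff_of_sums = sum_x - sum_y             # what was used as the centre of the interval
diff_of_means = sum_x / n - sum_y / n    # point estimate of mu_X - mu_Y

print(diff_of_sums, round(diff_of_means, 2))    # -142 -7.1
print(round(diff_of_sums / diff_of_means, 6))   # 20.0
```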

Mutaja
There is, or there was. I took the answers I got from this thread and used with the formulas and notes I have and got this:

I have used different symbols. Want to find: ##Δµ = µ_X - µ_Y ## based on ##Δ_{\bar X \bar Y} = \bar X - \bar Y = 47.5 - 54.6 = -7.1##

Variance: ## \frac {(n_X -1)S_X^2 + (n_Y -1)S_Y^2} {n_x + n_y -2} = \frac {(19)47.5^2 + (19)54.6^2} {38} = 2618.705##

##2618.705(\frac {1} {20} + \frac {1} {20}) = 261.8705##

##\sqrt {261.8705} = 16.18##

Using the T distribution at 95% with 38 degrees of freedom (20+20-2; I used 40 because that's the nearest number my table gives me) gives the value 2.021.

## E = 2.021 * 16.18 = 32.7, [-7.1-32.7 , -7.1+32.7]##

95% confidence interval: [-39.8, 25.6]

How do I determine if ##Δ_µ = 0## is within this confidence interval?
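
A note on the pooled-variance step: the formula takes the sample variances ##S_X^2## and ##S_Y^2##, whereas the values 47.5 and 54.6 substituted above are the sample means. The following Python sketch repeats the calculation with the variances reported earlier in the thread (64.158 and 132.253). It keeps the table value 2.021 used above and assumes equal population variances, which is the assumption behind pooling; the function name `pooled_variance` is illustrative.

```python
import math

def pooled_variance(s2_x, s2_y, n_x, n_y):
    """Pooled variance estimate under the equal-variance assumption."""
    return ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)

n_x = n_y = 20
s2_p = pooled_variance(64.158, 132.253, n_x, n_y)
std_err = math.sqrt(s2_p * (1 / n_x + 1 / n_y))   # standard error of X-bar minus Y-bar
margin = 2.021 * std_err                          # 2.021: table value for 40 df, as above

print(round(s2_p, 2), round(std_err, 3), round(margin, 2))
```

With these inputs the pooled variance is about 98.2 and the margin about 6.3, far smaller than the 32.7 obtained from the means.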

Homework Helper
Dearly Missed
Are you saying that you do not know whether the number ##0## is in the interval ##[-39.8, 25.6]?##

Mutaja
I was looking through my formulas and numbers looking for a value ##µ_X - µ_Y = ?##

I also thought the value that should be within the confidence interval should be +7.1 or -7.1.

Would the conclusion " ##Δµ = 0## is within the confidence interval, we can therefore not conclude that the treated and untreated pipes have the same mean value despite there being a clear difference in the mean value in the two samples. We can not conclude that the surface treatment reduce corrosion. " be the answer?

Homework Helper
Dearly Missed
No: you have it backwards: Since your interval includes 0, you cannot conclude that the two means are different. There is at least a 5% chance that the different sample means just occur by chance in samples with the same true mean.

However, I get a different confidence interval from yours. Both of my confidence limits are < 0, so I can say (with 95% confidence) that ##\mu_X < \mu_Y##. (I just used the TwoSampleTTest module in Maple's Statistics package.)

So, for my analysis of the data, the uncoated pipes have a lower average corrosion than treated pipes, so treating the pipes makes them worse!
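
The decision rule itself can be sketched in a few lines of Python. This is only an illustration, not a reproduction of the Maple output: it uses the means and sample variances posted in this thread, and it tries the three table values mentioned so far (1.96, 2.021 and 2.093), because the exact critical value depends on the degrees of freedom chosen. The function name `interval_excludes_zero` is hypothetical.

```python
import math

def interval_excludes_zero(estimate, standard_error, critical_value):
    """True when 0 lies outside estimate +/- critical_value * standard_error."""
    margin = critical_value * standard_error
    return not (estimate - margin <= 0.0 <= estimate + margin)

mean_diff = 47.5 - 54.6                            # X-bar minus Y-bar
std_err = math.sqrt(64.158 / 20 + 132.253 / 20)    # from the posted sample variances

for crit in (1.96, 2.021, 2.093):
    margin = crit * std_err
    print(crit,
          round(mean_diff - margin, 2),
          round(mean_diff + margin, 2),
          interval_excludes_zero(mean_diff, std_err, crit))
```

With these inputs every one of the three intervals lies entirely below zero, which is consistent with both limits being negative.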

Mutaja
Looking at the sum of the coated and uncoated pipes, or the average, it looks to be the most likely result yes.

My calculations should get a confidence interval where both sums are greater than 0, but I've looked over this several times now and everything seems to be correct according to my notes and formulas. Do you have any idea what I'm missing?

Homework Helper
Dearly Missed
You must not have read carefully the link I sent you. The fact is that for two independent samples the so-called 2-sample t-test is an approximation, but one that has been tested extensively through theoretical work, Monte Carlo simulation, etc.

The relevant "degrees of freedom" for ##t## is NOT ##N_1 + N_2 - 2##, rather, it is given by the formula shown in the link:

##\text{degrees of freedom} = v = \displaystyle \frac{(s_1^2/N_1 + s_2^2/N_2)^2}{(s_1^2/N_1)^2/(N_1-1) + (s_2^2 /N_2)^2/(N_2-1)}##

For the data in your problem we get ##v = 32.0379##, so about 32 degrees of freedom. This is large enough to allow us to get a "reasonable" approximation by just using the normal distribution instead of the t distribution. (The t(32) gives a confidence interval about 5% larger than the normal.)
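
The degrees-of-freedom formula is easy to mistype, so here is a minimal Python sketch of it (the function name `welch_degrees_of_freedom` is illustrative). Note the exact result depends on the inputs: with the sample variances quoted in this thread, 64.158 and 132.253, it returns roughly 33.9 rather than 32.04, so the figure above may reflect slightly different input data.

```python
def welch_degrees_of_freedom(s2_1, s2_2, n_1, n_2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    a = s2_1 / n_1
    b = s2_2 / n_2
    return (a + b) ** 2 / (a ** 2 / (n_1 - 1) + b ** 2 / (n_2 - 1))

# Sample variances as posted in the thread, 20 observations per sample
df = welch_degrees_of_freedom(64.158, 132.253, 20, 20)
print(round(df, 1))
```

Either way the value sits in the low thirties, so the remark about the normal approximation still applies.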

Mutaja
Using that formula I got 37.82 using my numbers.

Which are ##S_1^2 = 47.5 , S_2^2 = 54.6## N is 20 for both.

Homework Helper
Dearly Missed
NO: back in message #10 you calculated ##s_1^2 = S_X^2 = 64.158## and ##s_2^2 = S_Y^2 = 132.253##. However, I (or rather, Maple) get a slightly different value of ##S_X^2##, but the same ##S_Y^2## that you got back in message #10.

Mutaja
Ok, I got the numbers confused, my bad. Looks like I've been mixing up two formulas with the ones you've provided me and the ones I have in my lecture notes.

In my post #10 I have put the numbers into a spreadsheet to calculate it so there shouldn't be any errors, not even decimals.

Question 1: Are you sure that the numbers you used to calculate the answer are correct?

Question 2: Can you indicate how slightly my ##S_X^2## value is off? Going over my calculations again to check my error would be a lot more convenient if I know what the ##S_X^2## value is supposed to be, and what the confidence interval is supposed to be. Having the right answer without the right calculations won't give me any points regardless.

Thank you for your help so far, appreciate it.

Homework Helper
Dearly Missed
I went back and checked my input, and found one single wrong entry in my X-array. After fixing it, I get exactly the same results as yours.

Mutaja
Thanks a lot for checking, I will use my answer in post #14.

Also thanks a lot for putting up with my confusion in this thread, I appreciate it.