Confidence interval - Coop Report, Need Help

Main Question or Discussion Point

Hi, I am a 3rd-year engineering student working on my coop report. My job was to design routes for survey work and come up with figures for the potential time savings if this were done on a grand scale. The sites are divided into 15 regions (by geography) but are not in any particular order.

From a population of 2062 survey sites, I chose a sample of 155 (one of the regions). I drove these unplanned, which took X minutes, then I drove them planned, which took Y minutes. My time savings was 8.6%, and I am looking to turn this into some sort of confidence interval. To do so, I was going to break it down as time per read site (min/read), so my total time in minutes / 155 reads would be used as my mean for each situation. Then I was going to compare them as a difference between means; this is where I would have the confidence interval at 95%. The problem is that I do not know how to find a standard deviation, as I have not actually measured the time per read on an individual basis, and doing so would be an impossible task. Am I taking the wrong approach in how I am going to use the data, or in the statistical method I am using? Is there a way I can still get some sort of standard deviation from the sample? Thanks for the help. John.

EnumaElish
Homework Helper
I guess, driving through 9 (or 4) more regions is out of the question?

You might assume a distributional form: say, uniform over an interval [a,b]. The endpoints a and b need to be calculated. If you have a guess about the min. (or max.) time it took you to drive unplanned, then you might assume that that many minutes is the expected value of the lowest (or highest) order statistic of a set of 155 random variables, each of which is distributed uniformly over [a,b]. Since you can also calculate a mean, you will have two equations and two unknown parameters, a and b, that you can solve for. In the case of the max. order statistic, Xmax = Max{X1, ..., X155}, the two equations are:

$$\int_a^b x\cdot f(x;a,b) \cdot dx = \overline X$$

$$\int_a^b y\cdot g(y;a,b)\cdot dy = \overline {X_\text{max}}$$

where:
f(x;a,b) = F'(x;a,b)
F(x;a,b) = (x - a)/(b - a)
g(x;a,b) = G'(x;a,b)
G(x;a,b) = Prob(Xmax < x|a,b) = Prob({X1, ..., X155} < x|a,b) = F(x;a,b)^155.
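
For the uniform case, the two equations above reduce to a closed form, because the mean of a uniform variable on [a,b] is (a + b)/2 and the expected maximum of n independent draws is a + (b - a)·n/(n + 1). The following is only a minimal sketch of that solution; the function names and the numbers in the usage example are hypothetical placeholders, not data from this thread.

```python
import math


def solve_uniform_bounds(mean, expected_max, n):
    """Solve for (a, b) of Uniform[a, b] from its mean and the expected
    maximum of n independent draws.

    Uses mean = (a + b) / 2 and E[max] = a + (b - a) * n / (n + 1).
    """
    if n < 2:
        raise ValueError("n must be at least 2, otherwise the two equations coincide")
    a = (2 * mean * n - expected_max * (n + 1)) / (n - 1)
    b = 2 * mean - a
    return a, b


def uniform_sd(a, b):
    """Standard deviation of a Uniform[a, b] variable."""
    return (b - a) / math.sqrt(12)


if __name__ == "__main__":
    # Hypothetical inputs: average of 3.0 min/read and a guessed slowest read of 5.5 min.
    a, b = solve_uniform_bounds(mean=3.0, expected_max=5.5, n=155)
    print("a =", a, "b =", b, "implied sd =", uniform_sd(a, b))
```

If the solved lower bound a comes out negative, the guessed maximum is too far from the mean for a uniform shape, which would be a sign that the uniform assumption does not fit.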

Then you can repeat this procedure for the planned drive and calculate parameters a* and b* for the planned drive.

Then you can draw a picture of the two distributions together, and see how much they overlap.

This is the best I can come up with right now.

Hi, thanks for the reply. So far I have only taken one introductory stats course, so I am not familiar with what you have posted, though if it is a possible solution to my problem I can learn it. I am on my last week, so I do not have time to drive these routes; maybe I could assemble another 100 data points. Basically, my boss told me last Friday that he was going to put the project forward, and I need better, hard numbers to back it up.

After briefly reading up on what you have posted, would it be possible to use unplanned as my X-max and then planned as my median? Since I do not have {X1... X155}, could I assume X-max was the median of the unplanned data? Then calculate what is the probability that X (min/read) is bigger than X-max (min/read)?

How could I calculate a and b for my interval? Do I just make a guess at the size of the interval? I am not sure how to properly use those equations, but then again if you can tell me this should work I will learn it. I need an idea of what sort of assumptions I may need to make so I will know if it is realistic. Thanks again, John.

I forgot to add that I have driven every single route planned, and I do have data for these routes. My lack of data is in the unplanned driving of routes. Maybe that could add another element to this? Is there a way I could get a standard deviation by looking at the 15 separate regions of planned data, weighting them by the number of reads each has, and still use that as a comparison against the unplanned data? Just some more ideas to throw out there, John.

EnumaElish
Homework Helper
In that case you might assume that your unplanned average is the constant mean (i.e. it is not a statistical variable): $\mu_u \equiv \overline{X_u}$. Under this assumption you need not worry about the unplanned standard deviation (technically you are assuming it is zero). Then you can test whether your planned data support a significant difference between $\overline{X_p}$ and $\mu_u$. I think you can do this using Excel's ttest function.

This is the exact statistical procedure that is frequently used in engineering studies to test whether the mean lifetime of an "improved light bulb" is higher or lower than the mean lifetime of an "ordinary light bulb." Say it is known that the ordinary bulbs have an expected life of about X hours, and someone were to record the lifetimes of several (say, 155) improved light bulbs and calculate their average as Y hours. Then, using the data obtained from this improved bulb sample, a t-test would be run to answer the question, "is the improved bulb's mean lifetime statistically different from X at the Z% level of significance?" Usually Z is taken to be 5% or 10%. Z = 5% is equivalent to fitting a confidence interval (CI) of 95% around Y. Z = 10% is equivalent to fitting a CI of 90% around Y.
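
As a rough illustration of the test described above, the sketch below computes the one-sample t-ratio and a confidence interval around the sample average using only the Python standard library. The function names are hypothetical, and the critical value has to be supplied by the reader from a t-table (for a 95% interval with df near 150 it is roughly 1.98).

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t, df) for testing whether the sample mean differs from mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)      # sample standard deviation (n - 1 divisor)
    se = sd / math.sqrt(n)             # standard deviation of the sample average
    return (mean - mu0) / se, n - 1


def confidence_interval(sample, t_crit):
    """Interval mean +/- t_crit * se; t_crit must be looked up for the chosen level and df."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean - t_crit * se, mean + t_crit * se


if __name__ == "__main__":
    # Hypothetical planned times per read (minutes) and a hypothetical fixed unplanned mean.
    planned = [2.7, 3.1, 2.9, 3.4, 2.6, 3.0, 2.8]
    t, df = one_sample_t(planned, mu0=3.2)
    print("t =", t, "df =", df)
```

The null value mu0 is treated as a known constant here, which matches the assumption that the unplanned average is not a random variable.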

Alright, perfect, this is starting to shape up. By t-test you mean the Student t distribution, correct? What is the correct way of taking my data for planned routes and weighting it to calculate the standard deviation? Some regions have 260 reads and some have as little as 30. I need to weight the time/read accordingly to get my standard deviation. I know this is possible, and I remember doing it. I will skim my old textbook, but if anyone knows, please let me know. Thanks again, John.

EnumaElish
Homework Helper
jubs said:
By t-test you mean student t distribution correct?
Correct; I mean: testing the hypothesis (statement) that "there is a significant probability of the t-ratio $t=(\bar x_p - \mu_u)/s_{\bar x_p} \text{ being }\ne 0$" (where the denominator is the standard deviation of the sample average $\bar x_p$).
What is the correct way of taking my data for planned routes and weighting it to calculate the standard deviation? Some regions have 260 reads and some have as little as 30. I need to weight the time/read accordingly to get my standard deviation.
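
One possible convention for this, offered only as a sketch, is to treat each region's time per read as a value weighted by its number of reads, and to compute a weighted mean and a weighted standard deviation around it. Other weighting conventions exist, and the function name and the numbers below are hypothetical.

```python
import math


def weighted_mean_sd(values, weights):
    """Weighted mean and weighted (population-style) standard deviation.

    values  : time per read for each region (min/read)
    weights : number of reads in each region
    """
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and of equal length")
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total
    variance = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical regions: time per read and read counts (not data from this thread).
    rates = [3.1, 2.8, 3.6]
    reads = [260, 155, 30]
    print(weighted_mean_sd(rates, reads))
```

This describes the spread between regional averages only; it says nothing about the read-to-read variation inside a region.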

Wow, this is going to work unbelievably well. Thanks a million for all the help, I just needed a bit of direction. Now would the df apply to the 15 regions? Or would it apply to the total number of read points? John.

EnumaElish
Homework Helper
If you're using multiple regions, then df = (15 times average reads per region) - 1 = Number of reads in all regions - 1.

But I thought you have calculated the unplanned time average in one region only. If so, restricting the analysis to reads in that one region only might be the more "conservative" approach. In this case df = number of reads in the region included in your analysis - 1.

I basically have all my stats together and I am near the end. I just need someone to tell me if my approach makes sense.

From my planned data I got a standard deviation. This was a low number due to my large sample size 2062, and thus was good for predicting a mean for a similar region. (My region is not a true random sample and could be significantly different from others)

Now I am posing the null hypothesis to see if planning should be adopted. The null hypothesis is that the mean of unplanned routes is <= the mean of planned routes. If it is accepted, route planning will not be adopted. I must use my X as the mean of unplanned routes. My sample size is n = 155. My u = the mean of the planned routes. The problem is that I don't have the standard deviation for the unplanned sample. I cannot get it because my sample was not gathered on an individual basis but as a whole. My plan is to say that, to accept the null hypothesis and thus not plan routes, the standard deviation of unplanned routes must be a certain number. This number came out to be 15x larger than the planned deviation. I think this gives a good margin of safety to see an improvement from route planning. Does this make sense, and/or is this a good way to make a case?

Sorry I have never really done this before.
Thanks, John.

EnumaElish
(Thinking aloud:) But, in this case, it seems to me, you have two random variables, Xu and Xp, so you need to derive the distribution of Y = (Xp - Xu) and then test whether Y's mean is significantly greater than zero or not. If both X's are distributed normally (let's assume they are), then Y is distributed normally with mean = $\mu_p - \mu_u$ and variance = $\sigma^2_p + \sigma^2_u - 2\text{Cov}(X_p\, ,X_u)$. You could say you have an estimate of $\sigma_p$ so it's possible to solve for the value of $\sigma_u$ that would reject your null hypothesis in terms of Y if you also knew the covariance between planned and unplanned reads, i.e. the term $\underline{-2\text{Cov}(X_p\, ,X_u)}$. Since I don't think you know or can derive this covariance, I am sceptical about what you are proposing.
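
To see why the unknown covariance matters, here is a small sketch that evaluates the standard deviation of Y from the variance formula above, with the covariance written as rho·σp·σu for an assumed correlation rho. The function name and all numbers are hypothetical.

```python
import math


def sd_of_difference(sd_p, sd_u, rho):
    """Standard deviation of Y = Xp - Xu, using
    Var(Y) = sd_p^2 + sd_u^2 - 2 * Cov(Xp, Xu), with Cov = rho * sd_p * sd_u."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    variance = sd_p ** 2 + sd_u ** 2 - 2 * rho * sd_p * sd_u
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding error


if __name__ == "__main__":
    # Hypothetical standard deviations; only the assumed correlation changes.
    for rho in (-0.5, 0.0, 0.5, 0.9):
        print(rho, sd_of_difference(sd_p=1.0, sd_u=1.5, rho=rho))
```

The same pair of standard deviations gives quite different spreads for Y depending on rho, so any threshold solved for σu is only as good as the assumed covariance.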
The other approach is to say, "I test this for one region only and the results I am going to show you are conditional on that one region." In that approach you don't need to worry about the differences between regions because in effect you are taking $\mu_u$ as a constant instead of assuming Xu is a(nother) random variable. So the only random variable you need to worry about is Xp and you have all you need for a hypothesis test.