
Confidence interval - Coop Report, Need Help

  1. Aug 23, 2005 #1
    Hi, I am a 3rd-year engineering student working on my coop report. My job was to design routes for survey work, and come up with figures for the potential time savings if this were done on a grand scale. The sites are divided into 15 regions (by geography) but are not in any particular order.

    From a population of 2062 survey sites, I chose a sample of 155 (1 of the regions). I drove these unplanned, which took X minutes, then drove them planned, which took Y minutes. My time savings was 8.6%, and I am looking to turn this into some sort of confidence interval. To do so I was going to break it down as time per read site (min/read), so my total time in minutes / 155 reads would be used as my mean for each situation. Then I was going to compare them as a difference between means; this is where I would have the confidence interval at 95%. The problem is I do not know how to find a standard deviation, as I have not actually measured the time per read on an individual basis, and doing so would be an impossible task. Am I taking the wrong approach in how I am going to use the data, or in the statistical approach I am using? Is there a way I can still get some sort of standard deviation from the sample? Thanks for the help. John.
  3. Aug 23, 2005 #2


    Science Advisor
    Homework Helper

    I guess, driving through 9 (or 4) more regions is out of the question?

    You might assume a distributional form: say, uniform over an interval [a,b]. Location parameters a and b need to be calculated. If you have a guess about the min. (or max.) time it took you to drive unplanned, then you might assume that that many minutes is the expected value of the lowest (or highest) order statistic of a set of 155 random variables, each of which is distributed uniformly over [a,b]. Since you can also calculate a mean, you will have two equations and two unknown parameters a and b that you can solve for. In the case of the max. order statistic, Xmax = Max{X1, ..., X155}, the two equations are:

    [tex]\int_a^b x\cdot f(x;a,b) \cdot dx = \overline X[/tex]

    [tex]\int_a^b y\cdot g(y;a,b)\cdot dy = \overline {X_\text{max}}[/tex]

    f(x;a,b) = F'(x;a,b)
    F(x;a,b) = (x - a)/(b - a)
    g(x;a,b) = G'(x;a,b)
    G(x;a,b) = Prob(Xmax < x|a,b) = Prob(X1 < x, ..., X155 < x|a,b) = F(x;a,b)^155.

    Then you can repeat this procedure for the planned drive and calculate parameters a* and b* for the planned drive.
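For the uniform case the two equations above have a closed form, so a and b can be solved for directly. For n independent draws from a uniform distribution on [a,b], E[X] = (a+b)/2 and E[Xmax] = a + (b-a)*n/(n+1). Below is a minimal Python sketch that assumes those two standard results. The function name and the numbers fed into it are made up for illustration and are not the data from this thread.

```python
def uniform_params_from_mean_and_max(mean, expected_max, n):
    """Solve for (a, b) of a uniform distribution on [a, b].

    Assumes:
      E[X]    = (a + b) / 2
      E[Xmax] = a + (b - a) * n / (n + 1)   for n independent draws
    Subtracting the first from the second gives
      E[Xmax] - E[X] = (b - a) * (n - 1) / (2 * (n + 1)).
    """
    if n < 2:
        raise ValueError("need at least 2 draws to separate mean from max")
    width = (expected_max - mean) * 2 * (n + 1) / (n - 1)
    a = mean - width / 2
    b = mean + width / 2
    return a, b


# Hypothetical inputs: mean of 4.0 min/read, guessed max of 5.9 min/read, 155 reads.
a, b = uniform_params_from_mean_and_max(4.0, 5.9, 155)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The same call with the planned-drive mean and max guess would give a* and b*.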

    Then you can draw a picture of the two distributions together, and see how much they overlap.

    This is the best I can come up with right now.
  4. Aug 23, 2005 #3
    Hi, thanks for the reply. So far I have only taken 1 introductory stats course, so I am not familiar with what you have posted. Though if it is a possible solution to my problem I can learn it. I am in my last week so I do not have time to drive these routes; maybe I could assemble another 100 data points. Basically my boss just told me last Friday that he was going to put the project forward and I need better, hard numbers to back it up.

    After briefly reading up on what you have posted, would it be possible to use unplanned as my X-max and then planned as my median? Since I do not have {X1... X155}, could I assume X-max was the median of the unplanned data? Then calculate what is the probability that X (min/read) is bigger than X-max (min/read)?

    How could I calculate a and b for my interval? Do I just make a guess at the size of the interval? I am not sure how to properly use those equations, but then again if you can tell me this should work I will learn it. I need an idea of what sort of assumptions I may need to make so I will know if it is realistic. Thanks again, John.
  5. Aug 23, 2005 #4
    I forgot to add that I have driven every single route planned, and I do have data for these routes. My lack of data is in the unplanned driving of routes. Maybe that could add another element to this? Is there a way I could get a standard deviation by looking at the 15 separate regions of planned data, weighting them by the number of reads each possesses, and still use that as a comparison against the unplanned data? Just some more ideas to throw out there, John.
  6. Aug 23, 2005 #5


    Science Advisor
    Homework Helper

    In that case you might assume that your unplanned average is the constant mean (i.e. it is not a statistical variable): [itex]\mu_u \equiv \overline{X_u}[/itex]. Under this assumption you need not worry about the unplanned standard deviation (technically you are assuming it is zero). Then you can test whether your planned data support a significant difference between [itex]\overline{X_p}[/itex] and [itex]\mu_u[/itex]. I think you can do this using Excel's ttest function.

    This is the exact statistical procedure that is frequently used in engineering studies to test whether the mean lifetime of an "improved light bulb" is higher or lower than the mean lifetime of an "ordinary light bulb." Say it is known that the ordinary bulbs have an expected life of about X hours, and someone were to record the lifetimes of several (say, 155) improved light bulbs and calculate their average as Y hours. Then, using the data obtained from this improved bulb sample, a t-test would be run to answer the question, "is the improved bulb's mean lifetime statistically different from X at the Z% level of significance?" Usually Z is taken to be 5% or 10%. Z = 5% is equivalent to fitting a confidence interval (CI) of 95% around Y. Z = 10% is equivalent to fitting a CI of 90% around Y.
  7. Aug 23, 2005 #6
    Alright perfect, this is starting to shape up. By t-test you mean Student's t distribution, correct? What is the correct way of taking my data for planned routes and weighting it to calculate the standard deviation? Some regions have 260 reads and some have as little as 30. I need to weight the time/read accordingly to get my standard deviation. I know this is possible, I remember doing it. I will skim my old textbook, but if anyone knows please let me know. Thanks again, John.
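One common way to do the weighting asked about here is a reads-weighted mean and standard deviation of the per-region time/read figures. The Python sketch below assumes that convention (weights equal to the number of reads, dividing by the total weight). The function name and the region figures are hypothetical, not taken from the thread. Note that this measures the spread of region averages, which is not the same thing as the spread of individual read times.

```python
import math


def weighted_mean_and_sd(values, weights):
    """Weighted mean and weighted (population-style) standard deviation.

    values  : per-region time per read (min/read)
    weights : number of reads in each region
    """
    total_w = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total_w
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total_w
    return mean, math.sqrt(var)


# Hypothetical data for three regions: min/read and number of reads.
time_per_read = [4.2, 3.8, 5.1]
reads = [260, 155, 30]
mean, sd = weighted_mean_and_sd(time_per_read, reads)
print(f"weighted mean = {mean:.3f} min/read, weighted sd = {sd:.3f}")
```

With these weights the weighted mean equals total time divided by total reads, which matches the min/read definition used earlier in the thread.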
  8. Aug 23, 2005 #7


    Science Advisor
    Homework Helper

    Correct; I mean: testing the hypothesis (statement) that "there is a significant probability of the t-ratio [itex]t=(\bar x_p - \mu_u)/s_{\bar x_p} \text{ being }\ne 0[/itex]" (where the denominator is the standard deviation of the sample average [itex]\bar x_p[/itex]).
    See these links: 1, 2.
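The t-ratio above can be computed in a few lines. This is a minimal Python sketch using only the standard library. It assumes individual planned time/read observations are available, and it uses the normal quantile as a large-sample stand-in for the exact Student's t critical value (a reasonable approximation at around 150 degrees of freedom, slightly too narrow for small samples). The function name and the sample values are hypothetical.

```python
import math
import statistics


def one_sample_t(sample, mu0, confidence=0.95):
    """Return the t-ratio (mean - mu0) / s_mean and an approximate CI for the mean.

    s_mean is the standard deviation of the sample average: s / sqrt(n).
    The interval uses a normal quantile as an approximation to the t quantile.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    s_mean = statistics.stdev(sample) / math.sqrt(n)
    t_ratio = (mean - mu0) / s_mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return t_ratio, (mean - z * s_mean, mean + z * s_mean)


# Hypothetical planned times (min/read) and a hypothetical unplanned mean mu_u.
planned = [3.9, 4.1, 3.7, 4.4, 4.0, 3.8, 4.2, 3.6]
mu_u = 4.3
t_ratio, (low, high) = one_sample_t(planned, mu_u)
print(f"t = {t_ratio:.2f}, approx 95% CI for planned mean = ({low:.2f}, {high:.2f})")
```

If mu_u falls outside the interval, the planned mean differs from mu_u at roughly the corresponding significance level.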
    Last edited: Aug 23, 2005
  9. Aug 24, 2005 #8
    Wow, this is going to work unbelievably well. Thanks a million for all the help, I just needed a bit of direction. Now would the df apply to the 15 regions? Or would it apply to the total number of read points? John.
    Last edited: Aug 24, 2005
  10. Aug 24, 2005 #9


    Science Advisor
    Homework Helper

    If you're using multiple regions then df = (15 times average reads per region) - 1 = Number of reads in all regions - 1.

    But I thought you have calculated the unplanned time average in one region only. If so, restricting the analysis to reads in that one region only might be the more "conservative" approach. In this case df = number of reads in the region included in your analysis - 1.
  11. Aug 25, 2005 #10
    I basically have all my stats together, and I am near the end I just need someone to tell me if my approach makes sense.

    From my planned data I got a standard deviation. This was a low number due to my large sample size 2062, and thus was good for predicting a mean for a similar region. (My region is not a true random sample and could be significantly different from others)

    Now I am posing the null hypothesis to see if planning should be adopted. The null hypothesis is that the true mean of unplanned routes is <= the mean of planned routes. If it is accepted, planning routes will not be adopted. I must use my X as the mean of unplanned routes. My sample size is n = 155. My u = the mean of the planned routes. The problem is I don't have the standard deviation for the unplanned sample. I cannot get it, as my sample was gathered as a whole rather than on an individual basis. My plan is to say: to accept the null hypothesis, and thus not plan routes, the standard deviation of unplanned routes must be this number. This number came out to be 15x larger than the planned deviation. I think this gives a good margin of safety to see an improvement for route planning. Does this make sense, and/or is this a good way to make a case?
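The "must be this number" step amounts to inverting the test statistic: find the unplanned standard deviation at which the one-sided test just fails to reject. Below is a rough Python sketch of that calculation. It assumes a one-sided test of the unplanned mean against a fixed planned mean u, and it uses a normal critical value as a large-sample approximation to the t value. The function name and the inputs are hypothetical, not the figures from this post.

```python
import math
from statistics import NormalDist


def break_even_sd(xbar_u, mu_p, n, alpha=0.05):
    """Largest unplanned SD at which H0 (unplanned mean <= mu_p) is still rejected.

    Rejection rule (one-sided): (xbar_u - mu_p) / (s / sqrt(n)) > critical value.
    Solving the equality for s gives the break-even standard deviation.
    """
    crit = NormalDist().inv_cdf(1 - alpha)  # normal approximation to t critical value
    return (xbar_u - mu_p) * math.sqrt(n) / crit


# Hypothetical means in min/read; n = 155 reads as in the sampled region.
s_star = break_even_sd(xbar_u=4.3, mu_p=3.95, n=155)
print(f"break-even unplanned SD = {s_star:.2f} min/read")
```

Comparing the break-even value with the planned standard deviation gives the kind of safety ratio described above.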

    Sorry I have never really done this before.
    Thanks, John.
  12. Aug 25, 2005 #11


    Science Advisor
    Homework Helper

    Since you need to take into consideration the differences between regions and the variance that will introduce to the unplanned reads, then this sounds like a sensible approach.

    (Thinking aloud:) But, in this case, it seems to me, you have two random variables, Xu and Xp, so you need to derive the distribution of Y = (Xp - Xu) and then test whether Y's mean is significantly greater than zero or not. If both X's are distributed normally (let's assume they are), then Y is distributed normally with mean = [itex]\mu_p - \mu_u[/itex] and variance = [itex]\sigma^2_p + \sigma^2_u - 2\text{Cov}(X_p\, ,X_u)[/itex]. You could say you have an estimate of [itex]\sigma_p[/itex] so it's possible to solve for the value of [itex]\sigma_u[/itex] that would reject your null hypothesis in terms of Y if you also knew the covariance between planned and unplanned reads, i.e. the term [itex]\underline{-2\text{Cov}(X_p\, ,X_u)}[/itex]. Since I don't think you know or can derive this covariance, I am sceptical about what you are proposing.
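To see why the unknown covariance matters, here is a small Python sketch of the variance formula above, written in terms of a correlation rho so that Cov(Xp, Xu) = rho * sigma_p * sigma_u. The function name and the sigma values are hypothetical and only illustrate how much the standard deviation of Y moves as rho changes.

```python
import math


def sd_of_difference(sigma_p, sigma_u, rho):
    """Standard deviation of Y = Xp - Xu.

    Var(Y) = sigma_p^2 + sigma_u^2 - 2 * Cov(Xp, Xu),
    with Cov(Xp, Xu) = rho * sigma_p * sigma_u.
    """
    cov = rho * sigma_p * sigma_u
    return math.sqrt(sigma_p ** 2 + sigma_u ** 2 - 2 * cov)


# Hypothetical standard deviations; rho is the quantity that is not known here.
for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho}: sd(Y) = {sd_of_difference(0.3, 0.6, rho):.3f}")
```

The same sigma_u can therefore lead to different test outcomes depending on the assumed correlation.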

    The other approach is to say, "I test this for one region only and the results I am going to show you are conditional on that one region." In that approach you don't need to worry about the differences between regions because in effect you are taking [itex]\mu_u[/itex] as a constant instead of assuming Xu is a(nother) random variable. So the only random variable you need to worry about is Xp and you have all you need for a hypothesis test.
    Last edited: Aug 25, 2005