Confidence interval - Coop Report, Need Help

  • #1
jubs
Hi, I am a 3rd-year engineering student working on my coop report. My job was to design routes for survey work, and come up with figures for the potential time savings if this were done on a grand scale. The sites are divided into 15 regions (by geography) but are not in any particular order.

From a population of 2062 survey sites, I chose a sample of 155 (1 of the regions). I drove these unplanned, which took X number of minutes, then I drove them planned, which took Y number of minutes. My time savings was 8.6%, and I am looking to turn this into some sort of confidence interval. To do so I was going to break it down as time per read site (min/read), and thus my total time in minutes / 155 reads would be used as my mean for each situation. Then I was going to compare them as a difference between means; this is where I would have the confidence interval at 95%. Problem is I do not know how I can find a standard deviation, as I have not actually measured the time per read on an individual basis, and doing so would be an impossible task. Am I taking a wrong approach in how I am going to use the data or in what statistical approach I am using? Is there a way I can still get some sort of standard deviation from the sample? Thanks for the help. John.
 
  • #2
I guess, driving through 9 (or 4) more regions is out of the question?

You might assume a distributional form: say, uniform over an interval [a,b]. Location parameters a and b need to be calculated. If you have a guess about the min. (or max.) time it took you to drive unplanned, then you might assume that that many minutes is the expected value of the lowest (or highest) order statistic of a set of 155 random variables each of which is distributed uniformly over [a,b]. Since you can also calculate a mean, you will have two equations and two unknown parameters a and b that you can solve for. In case of the max. order statistic, Xmax = Max{X1, ..., X155}, the two equations are:

[tex]\int_a^b x\cdot f(x;a,b) \cdot dx = \overline X[/tex]

[tex]\int_a^b y\cdot g(y;a,b)\cdot dy = \overline {X_\text{max}}[/tex]

where:
f(x;a,b) = F'(x;a,b)
F(x;a,b) = (x - a)/(b - a)
g(x;a,b) = G'(x;a,b)
G(x;a,b) = Prob(Xmax < x|a,b) = Prob(X1 < x, ..., X155 < x|a,b) = F(x;a,b)^155.
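
For the uniform case the two equations have a closed-form solution, because E[X] = (a + b)/2 and E[Xmax] = a + (b - a)·n/(n + 1). A minimal Python sketch of that solve follows. The function name and the input numbers (5.0 and 9.0 min/read) are hypothetical placeholders, not data from this thread.

```python
def fit_uniform_from_mean_and_max(mean, expected_max, n):
    """Solve for (a, b) of Uniform[a, b] given the mean and the expected
    maximum of n independent draws.

    Uses E[X] = (a + b) / 2 and E[X_max] = a + (b - a) * n / (n + 1).
    """
    if n < 2:
        raise ValueError("need at least two draws to separate mean and max")
    if expected_max <= mean:
        raise ValueError("the expected maximum must exceed the mean")
    # Subtracting the two equations gives
    # (b - a) * (n - 1) / (2 * (n + 1)) = expected_max - mean.
    width = (expected_max - mean) * 2 * (n + 1) / (n - 1)
    return mean - width / 2, mean + width / 2


# Hypothetical inputs: mean of 5.0 min/read, guessed maximum of 9.0 min/read.
a, b = fit_uniform_from_mean_and_max(mean=5.0, expected_max=9.0, n=155)
sigma = (b - a) / 12 ** 0.5  # standard deviation of Uniform[a, b]
print(f"a = {a:.3f}, b = {b:.3f}, sigma = {sigma:.3f}")
```

The implied standard deviation (b - a)/√12 depends entirely on the uniform assumption and on how good the guess for the maximum is.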

Then you can repeat this procedure for the planned drive and calculate parameters a* and b* for the planned drive.

Then you can draw a picture of the two distributions together, and see how much they overlap.

This is the best I can come up with right now.
 
  • #3
Hi, thanks for the reply. So far I have only taken 1 introductory stats course, so I am not familiar with what you have posted. Though if it is a possible solution to my problem I can learn it. I am on my last week so I do not have time to drive these routes; maybe I could assemble another 100 data points. Basically my boss just told me last Friday that he was going to put the project forward and I need better, hard numbers to back it up.

After briefly reading up on what you have posted, would it be possible to use unplanned as my X-max and then planned as my median? Since I do not have {X1... X155}, could I assume X-max was the median of the unplanned data? Then calculate what is the probability that X (min/read) is bigger than X-max (min/read)?

How could I calculate a and b for my interval? Do I just make a guess at the size of the interval? I am not sure how to properly use those equations, but then again if you can tell me this should work I will learn it. I need an idea of what sort of assumptions I may need to make so I will know if it is realistic. Thanks again, John.
 
  • #4
I forgot to add that I have driven every single route planned, and I do have data for these routes. My lack of data is in the unplanned driving of routes. Maybe that could add another element to this? Is there a way I could get a standard deviation by looking at the separate 15 regions of planned data, weighting them by the number of reads each possesses, and still use that as a comparison against unplanned data? Just some more ideas to throw out there, John.
 
  • #5
In that case you might assume that your unplanned average is the constant mean (i.e. it is not a statistical variable): [itex]\mu_u \equiv \overline{X_u}[/itex]. Under this assumption you need not worry about the unplanned standard deviation (technically you are assuming it is zero). Then you can test whether your planned data support a significant difference between [itex]\overline{X_p}[/itex] and [itex]\mu_u[/itex]. I think you can do this using Excel's ttest function.

This is the exact statistical procedure that is frequently used in engineering studies to test whether the mean lifetime of an "improved light bulb" is higher or lower than the mean lifetime of an "ordinary light bulb." Say it is known that the ordinary bulbs have an expected life of about X hours, and someone were to record the lifetimes of several (say, 155) improved light bulbs and calculate their average as Y hours. Then, using the data obtained from this improved bulb sample, a t-test would be run to answer the question, "is the improved bulb's mean lifetime statistically different from X at the Z% level of significance?" Usually Z is taken to be 5% or 10%. Z = 5% is equivalent to fitting a confidence interval (CI) of 95% around Y. Z = 10% is equivalent to fitting a CI of 90% around Y.
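
The one-sample test described above can be sketched in a few lines of Python. This is only an illustration: the `planned` values and `mu_u` are made-up numbers, and the function name is hypothetical. The sketch returns the t-ratio and its degrees of freedom; the resulting |t| would then be compared with a critical value from a t table.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t, df) for testing whether the sample mean differs from mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 divisor)
    std_err = s / math.sqrt(n)        # standard deviation of the sample average
    return (mean - mu0) / std_err, n - 1


# Hypothetical planned times (min/read) and a fixed unplanned mean.
planned = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 3.7]
mu_u = 4.5
t, df = one_sample_t(planned, mu_u)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

If SciPy is installed, `scipy.stats.ttest_1samp` performs the same calculation and also returns a p-value.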
 
  • #6
Alright perfect, this is starting to shape up. By t-test you mean Student's t distribution, correct? What is the correct way of taking my data for planned routes and weighting it to calculate the standard deviations? Some regions have 260 reads and some have as little as 30. I need to weight the time/read accordingly to get my standard deviation. I know this is possible, I remember doing it; I will skim my old textbook, but if anyone knows please let me know. Thanks again, John.
 
  • #7
jubs said:
By t-test you mean student t distribution correct?
Correct; I mean: testing the hypothesis (statement) that "there is a significant probability of the t-ratio [itex]t=(\bar x_p - \mu_u)/s_{\bar x_p} \text{ being }\ne 0[/itex]" (where the denominator is the standard deviation of the sample average [itex]\bar x_p[/itex]).
jubs said:
What is the correct way of taking my data for planned routes and weighting it to calculate the standard deviations. Some regions have 260 reads and some have as little as 30. I need to weight the time/read accordingly to get my standard deviation.
See these links: 1, 2.
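
One common way to weight regional averages by their read counts is sketched below in Python. It is an assumption about what is wanted here, not necessarily the method in the links. The function name and the regional figures are hypothetical, and the variance uses the simple frequency-weighted form (divide by the total weight).

```python
import math


def weighted_mean_and_sd(values, weights):
    """Weighted mean and standard deviation, treating weights as counts."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total
    variance = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


# Hypothetical per-region averages (min/read) and number of reads per region.
region_time_per_read = [4.2, 3.9, 4.6, 4.0]
region_reads = [260, 30, 155, 120]
mean, sd = weighted_mean_and_sd(region_time_per_read, region_reads)
print(f"weighted mean = {mean:.3f} min/read, weighted sd = {sd:.3f}")
```

Note that this measures the spread of the regional averages around the overall mean. It is not the standard deviation of individual read times, which cannot be recovered from regional totals alone.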
 
  • #8
Wow, this is going to work unbelievably well. Thanks a million for all the help, I just needed a bit of direction. Now would the df apply to the 15 regions? Or would it apply to the total number of read points? John.
 
  • #9
If you're using multiple regions then df = (15 times the average number of reads per region) - 1 = Number of reads in all regions - 1.

But I thought you have calculated the unplanned time average in one region only. If so, restricting the analysis to reads in that one region only might be the more "conservative" approach. In this case df = number of reads in the region included in your analysis - 1.
 
  • #10
I basically have all my stats together, and I am near the end I just need someone to tell me if my approach makes sense.

From my planned data I got a standard deviation. This was a low number due to my large sample size 2062, and thus was good for predicting a mean for a similar region. (My region is not a true random sample and could be significantly different from others)

Now I am posing the null hypothesis to see if planning should be adopted. The null hypothesis is if u <= then the mean of planned routes. If it is accepted, planning routes will not be adopted. I must use my X as the mean of unplanned routes. My sample size is n = 155. My u = the mean of the planned routes. Problem is I don't have the standard deviation for unplanned samples. I cannot get it, as my sample was not gathered on an individual basis but as a whole. My plan is to say: to accept the null hypothesis, and thus not plan routes, the standard deviation of unplanned routes must be this number. This number came out to be 15x larger than the planned deviation. I think this has a good margin of safety to see an improvement for route planning. Does this make sense, and/or is this a good way to make a case?

Sorry I have never really done this before.
Thanks, John.
 
  • #11
Since you need to take into consideration the differences between regions and the variance that will introduce to the unplanned reads, then this sounds like a sensible approach.

(Thinking aloud:) But, in this case, it seems to me, you have two random variables, Xu and Xp, so you need to derive the distribution of Y = (Xp - Xu) and then test whether Y's mean is significantly greater than zero or not. If both X's are distributed normally (let's assume they are), then Y is distributed normally with mean = [itex]\mu_p - \mu_u[/itex] and variance = [itex]\sigma^2_p + \sigma^2_u - 2\text{Cov}(X_p\, ,X_u)[/itex]. You could say you have an estimate of [itex]\sigma_p[/itex] so it's possible to solve for the value of [itex]\sigma_u[/itex] that would reject your null hypothesis in terms of Y if you also knew the covariance between planned and unplanned reads, i.e. the term [itex]\underline{-2\text{Cov}(X_p\, ,X_u)}[/itex]. Since I don't think you know or can derive this covariance, I am sceptical about what you are proposing.
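
To make the "solve for [itex]\sigma_u[/itex]" idea concrete, here is a rough Python sketch under a strong assumption: the covariance term is taken to be zero (independent planned and unplanned means), which is exactly the piece that is unknown in this problem. It also uses a one-sided z-test rather than a t-test and measures the saving as unplanned minus planned. The function name and all the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def threshold_sigma_u(mean_u, mean_p, sigma_p, n_p, n_u, alpha=0.05):
    """Largest sigma_u for which a one-sided z-test still rejects 'no saving'.

    ASSUMES Cov(X_p, X_u) = 0, so that
    Var(mean_u - mean_p) = sigma_p**2 / n_p + sigma_u**2 / n_u.
    """
    saving = mean_u - mean_p
    if saving <= 0:
        raise ValueError("no observed saving, so there is nothing to test")
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Rejection requires saving / sqrt(variance of the difference) >= z_crit.
    max_variance = (saving / z_crit) ** 2
    remaining = max_variance - sigma_p ** 2 / n_p
    if remaining <= 0:
        return 0.0  # planned variability alone already prevents rejection
    return math.sqrt(n_u * remaining)


# Hypothetical inputs in min/read.
limit = threshold_sigma_u(mean_u=4.5, mean_p=4.1, sigma_p=1.2, n_p=2062, n_u=155)
print(f"sigma_u would have to exceed {limit:.2f} to lose significance")
```

Any nonzero covariance between planned and unplanned reads would shift this threshold, so the number is only as trustworthy as the independence assumption.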

The other approach is to say, "I test this for one region only and the results I am going to show you are conditional on that one region." In that approach you don't need to worry about the differences between regions because in effect you are taking [itex]\mu_u[/itex] as a constant instead of assuming Xu is a(nother) random variable. So the only random variable you need to worry about is Xp and you have all you need for a hypothesis test.
 

1. What is a confidence interval?

A confidence interval is a statistical range within which the true value of a population parameter, such as the mean or proportion, is estimated to lie with a certain level of confidence.

2. How is a confidence interval calculated?

A confidence interval is calculated using a sample from a population, the desired level of confidence, and the standard error of the sample mean or proportion. The formula for a confidence interval is: sample statistic ± (critical value x standard error).
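
As an illustration of that formula for a sample mean, here is a short Python sketch. It assumes a large sample so that a normal critical value is a reasonable stand-in for the t value; the function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, confidence=0.95):
    """Return (lower, upper) = mean +/- critical value * standard error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    half_width = z * sd / math.sqrt(n)              # sd / sqrt(n) is the standard error
    return mean - half_width, mean + half_width


# Hypothetical sample: mean 4.1 min/read, sd 1.2 min/read, 155 reads.
low, high = mean_confidence_interval(mean=4.1, sd=1.2, n=155)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

For small samples the critical value should come from the t distribution with n - 1 degrees of freedom instead.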

3. What does the confidence level represent in a confidence interval?

The confidence level, typically expressed as a percentage, describes how often the interval-building procedure captures the true value of the population parameter over repeated sampling. For example, a 95% confidence level means that about 95% of intervals constructed this way from repeated samples would contain the true value; it is not the probability that any one computed interval contains it.

4. How is the sample size related to the width of a confidence interval?

The sample size is inversely related to the width of a confidence interval. As the sample size increases, the width of the interval decreases, indicating a more precise estimate of the population parameter. This is because a larger sample size reduces the standard error, making the confidence interval narrower.

5. What are some limitations of confidence intervals?

Confidence intervals of this form assume that the sampling distribution of the statistic is approximately normal and that the sample is representative of the population. If these assumptions are not met, the confidence interval may not be accurate. Additionally, confidence intervals do not provide information about the shape of the distribution or the variability within the population.
