# Combining distributions

1. Jul 21, 2008

### CRGreathouse

Warning: I've only taken one stats class, back as an undergrad (though it was a very fast-paced class designed for mathematicians). My understanding of all things statistical is consequently weak.

I'm trying to design a program to accurately time functions. The functions themselves are of no importance here, only the timing code.

At the moment my program runs the test suite (10 million runs) with an empty function, to measure overhead. It stores a fixed number of runs, 7 at the moment, then computes the average and standard deviation of the overhead. This lets me construct a 95% confidence interval for the overhead:
$$[\mu-1.96\sigma,\mu+1.96\sigma]$$

Simple enough so far, yes? So then I time each actual function once. (I don't want to run them multiple times because the real functions, as expected, take a fair bit longer than the empty function.) At this point I make the assumption that the distribution of the timing errors of the functions is the same as that of the overhead function (which seems reasonable to me). This gives me a 95% confidence interval (under my assumption) as such:
$$[t(1-1.96\sigma/\mu), t(1+1.96\sigma/\mu)]$$

Here's the part I want help on. I combine the intervals as follows: the lower bound is the low-end estimate for the function's time minus the high-end estimate for the overhead, and the upper bound is the high-end estimate for the function minus the low-end estimate for the overhead. How do I describe my confidence that this is correct? Less than 95% (errors can accumulate), more than 95% (errors are likely to cancel, maybe like sqrt(2) rather than 2?), or just 95%? Is there a better way to calculate this? Have I made mistakes or bad assumptions in my analysis?

2. Jul 21, 2008

### Focus

I am somewhat confused about what you are trying to do. Maybe if you write it out in maths language I could be of more help. From what I understand you are trying to get a 95% confidence interval for the mean (of some sort). Pardon me if I am wrong here but I am thinking that you wish to approximate $$\mu$$ for $$N(\mu,\sigma^2)$$ from a set of data by taking the mean as an estimate for $$\mu$$. The confidence interval you have makes sense only if you write it as $$[\hat{\mu}-1.96\sigma,\hat{\mu}+1.96\sigma]$$. This assumes that you know $$\sigma$$, which I highly doubt. When you also have to estimate $$\sigma$$, the confidence interval is given by $$[\hat{\mu}-t_{n-1,0.025} \hat{\sigma},\hat{\mu}+t_{n-1,0.025} \hat{\sigma}]$$ where the t is from Student's t-distribution.

Don't quite understand the rest of it but I hope this helps.

Warning: I found statistics quite boring, I may be trying to blacken its name.

3. Jul 21, 2008

### CRGreathouse

I'm being quite lax in my notation, forgive me. I wrote $\mu$ for $\hat{\mu}$ and $\sigma$ for $\hat{\sigma}.$ These figures come from a small sample of a potentially infinite data source.

Example:
I have, say, five measurements for the overhead:
[0.5, 0.6, 0.4, 0.5, 0.35]
which have average 0.47 and standard deviation 0.0975. This gives a 95% confidence interval of
[0.279, 0.661]
for the true value of the overhead. (This is the sample mean plus/minus the standard deviation times 1.96; the 1.96 comes from a z-table.)
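
For anyone checking the arithmetic, here is a minimal Python sketch that reproduces these figures. It assumes the standard deviation quoted is the sample one (n - 1 denominator), since that is what matches 0.0975; the name `z_interval` is just for illustration.

```python
import statistics

def z_interval(samples, z=1.96):
    """Return (mean, sample sd, low, high) using mean +/- z * sd."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation (n - 1 denominator)
    return mean, sd, mean - z * sd, mean + z * sd

overhead = [0.5, 0.6, 0.4, 0.5, 0.35]
mean, sd, low, high = z_interval(overhead)
print(f"mean={mean:.3f} sd={sd:.4f} interval=[{low:.3f}, {high:.3f}]")
```

One caveat: the mean plus/minus 1.96 standard deviations describes the spread of individual runs. An interval for the mean itself would use the standard deviation divided by sqrt(n).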

Now I don't actually want this interval. What I want is to subtract the true value of the overhead from a set of measurements and get the measurements less overhead. But since I don't have that, I subtract the range from the measurements:
$$[m-0.661,m-0.279]$$
But of course I don't actually have the true value for the measurements themselves; I have only a single measurement for each. So I make the assumption in my first post which lets me estimate the confidence interval for the measurements. First I construct the relative error:
$$e_\text{rel}\approx1.96\hat{\sigma}/\hat{\mu}\approx1.96\cdot0.0975/0.47\approx0.406$$
Then I form the interval about the measurement:
$$[m(1-e_\text{rel}),m(1+e_\text{rel})]$$
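
Putting the two intervals together as described, a rough Python sketch might look like this. The measurement `m = 2.0` is a made-up value purely for illustration, and `combine_worst_case` is a hypothetical helper name, not part of my actual program.

```python
import statistics

def combine_worst_case(m, overhead_samples, z=1.96):
    """Worst-case interval for (measurement - overhead).

    Assumes the measurement has the same relative error as the overhead.
    """
    mean = statistics.mean(overhead_samples)
    sd = statistics.stdev(overhead_samples)
    e_rel = z * sd / mean                      # relative error of the overhead
    m_low, m_high = m * (1 - e_rel), m * (1 + e_rel)
    o_low, o_high = mean - z * sd, mean + z * sd
    # Low end: smallest measurement minus largest overhead, and vice versa.
    return m_low - o_high, m_high - o_low

overhead = [0.5, 0.6, 0.4, 0.5, 0.35]
low, high = combine_worst_case(2.0, overhead)  # m = 2.0 is hypothetical
print(f"[{low:.3f}, {high:.3f}]")
```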

But this assumes, in effect, that the worst case of the overhead error coincides with the worst case of the measurement error, which doesn't seem likely. So I think the actual confidence of my final result is more than 95%. I'd like a way to calculate my confidence in this final interval -- in this case, so I can reduce the size of my final interval by dropping the confidence from perhaps 99% to 95%.
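
One way to put a number on this, under the added assumption that the two errors are independent and normally distributed: the difference then has standard deviation sqrt(a² + b²), where a and b are the two individual standard deviations, while the worst-case interval has half-width z(a + b). The sketch below computes the coverage that implies. The name `worst_case_coverage` is illustrative, and the calculation ignores the extra uncertainty from estimating the standard deviations.

```python
from math import hypot
from statistics import NormalDist

def worst_case_coverage(sd_a, sd_b, z=1.96):
    """Coverage of the interval with half-width z * (sd_a + sd_b)
    when the true error is N(0, sd_a**2 + sd_b**2)."""
    z_eff = z * (sd_a + sd_b) / hypot(sd_a, sd_b)
    return 2 * NormalDist().cdf(z_eff) - 1

print(f"{worst_case_coverage(1.0, 1.0):.4f}")   # two equal errors
print(f"{worst_case_coverage(1.0, 0.0):.4f}")   # one error negligible
```

With two equal errors the worst-case combination comes out at roughly 99.4% coverage, and it falls toward 95% as one error dominates. Under the same assumptions, a 95% interval would use half-width 1.96·sqrt(a² + b²) instead.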

4. Aug 1, 2008

### Focus

If you want to subtract the true value of the overhead then surely your CI should just be N(0,k). I'm still really confused about what you are trying to do so excuse me. You should also use $$[\hat{\mu}-t_{n-1,0.025} \hat{\sigma},\hat{\mu}+t_{n-1,0.025} \hat{\sigma}]$$ to compute your CI as you are estimating $$\sigma^2$$ as well. I also have no idea what an overhead is but it sounds quite fancy, good luck with it!
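
For the five overhead measurements above, the formula could be applied roughly as follows. This is only a sketch: the critical values are copied from a standard t-table (two-sided 95%) rather than computed, because Python's standard library has no t-distribution, and `t_interval` is a made-up name.

```python
import statistics

# Two-sided 95% critical values t_{df, 0.025} from a standard t-table.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def t_interval(samples):
    """Interval mean +/- t_{n-1,0.025} * sd, following the formula as written."""
    n = len(samples)
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    t = T_CRIT_95[n - 1]   # KeyError for n > 11; extend the table as needed
    return mean - t * sd, mean + t * sd

low, high = t_interval([0.5, 0.6, 0.4, 0.5, 0.35])
print(f"[{low:.3f}, {high:.3f}]")
```

With only five samples the interval widens from [0.279, 0.661] to about [0.199, 0.741].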

5. Aug 1, 2008

### CRGreathouse

Nothing fancy. I'm timing a certain process for (say) ten million iterations, and there is time taken up by the iterations themselves (and I just want to measure the time of the process). So I run a 'do-nothing' process in the same loop, and that's my overhead. The actual recorded time should be the time of the process (ten-millionfold) plus the time of the overhead. But with measurement errors, that's hard to get right -- sometimes the process is fast enough that the overhead dominates the runtime.
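
To make the setup concrete, here is a small Python sketch of the idea. It is not my actual program: the function names, the stand-in workload, and the reduced iteration count are all assumptions chosen to keep the example short.

```python
import statistics
import time

def time_loop(func, iterations):
    """Return wall-clock seconds taken to call func() `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    return time.perf_counter() - start

def do_nothing():
    pass

def measure_overhead(iterations, runs=7):
    """Time the empty function `runs` times; return (mean, sample sd)."""
    samples = [time_loop(do_nothing, iterations) for _ in range(runs)]
    return statistics.mean(samples), statistics.stdev(samples)

if __name__ == "__main__":
    n = 100_000  # far fewer than ten million, just to keep the demo quick
    overhead_mean, overhead_sd = measure_overhead(n)
    raw = time_loop(lambda: sum(range(100)), n)  # stand-in for a real function
    print(f"overhead {overhead_mean:.4f} s (sd {overhead_sd:.4f}), "
          f"net {raw - overhead_mean:.4f} s")
```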

6. Aug 1, 2008

### Focus

Well then you should measure the overhead and the process, which means your error for the process without the overhead is (given that they are independent) $$N(\hat{\mu}_1-\hat{\mu}_2,\hat{\sigma}_1^2+\hat{\sigma}_2^2)$$. Be sure to use Student's t-distribution when calculating the CI, as it accounts for the extra uncertainty from estimating the variances.
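
As a rough illustration of adding the variances, here is a Python sketch. The process timings are invented for the example, `difference_interval` is a hypothetical name, and the default critical value is the normal 1.96; a t critical value would go in its place, and choosing its degrees of freedom is a question this sketch leaves open.

```python
import statistics
from math import sqrt

def difference_interval(process_samples, overhead_samples, crit=1.96):
    """Interval for (process - overhead), assuming the two are independent.

    Uses the variance sum from above: sd = sqrt(s1**2 + s2**2).
    """
    diff = statistics.mean(process_samples) - statistics.mean(overhead_samples)
    sd = sqrt(statistics.variance(process_samples)
              + statistics.variance(overhead_samples))
    return diff - crit * sd, diff + crit * sd

overhead = [0.5, 0.6, 0.4, 0.5, 0.35]
process = [2.1, 1.9, 2.0, 2.2, 1.8]   # hypothetical timings, for illustration only
low, high = difference_interval(process, overhead)
print(f"[{low:.3f}, {high:.3f}]")
```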

Must be quite boring running do-nothings all day. I hope they are paying you well for this :D.

7. Aug 1, 2008