Convergence of 2 sample means with 95% confidence

  • #1
fahraynk
I tried to derive an equation for one sample mean to converge to another sample mean within a 95% confidence interval, but I know I am wrong. Can someone tell me what I did wrong, and what is the correct formula?

Suppose:

##\hat{x_1},\hat{\sigma_1},N## are the sample mean and standard deviation computed from ##N## samples,

##\hat{x_2},\hat{\sigma_2},n## are the sample mean and standard deviation computed from ##n## samples, with ##n\leq N##, and

##\mu,\delta## are the true mean and true standard deviation of the population.

If ##d(\hat{x_1},\hat{x_2})## is a euclidean distance function on the sample means, then:

$$
d(\hat{x_1},\hat{x_2})\leq d(\hat{x_1},\mu)+ d(\hat{x_2},\mu)\leq 4\frac{\sigma_1}{\sqrt{N}}+4\frac{\sigma_2}{\sqrt{n}}
$$

With ##95##% confidence, because ##\mu\in [\hat{x_1}-\frac{2\sigma_1}{\sqrt{N}},\hat{x_1}+\frac{2\sigma_1}{\sqrt{N}}]## with 95% confidence.

My first question is: what is the relationship between the sample standard deviation and the population standard deviation?

When I take many samples, the standard deviation of the samples changes very little, so I assume the relationship ##\sigma_1=\sigma_2=\delta## :

$$
d(\hat{x_1},\hat{x_2})\leq 4\left(\frac{\sigma_1}{\sqrt{N}}+\frac{\sigma_2}{\sqrt{n}}\right)=4\delta\frac{\sqrt{N}+\sqrt{n}}{\sqrt{N}\sqrt{n}}=4\delta\frac{\frac{\sqrt{N}}{\sqrt{n}}+1}{\sqrt{N}}\implies\\
\sqrt{N}\,d(\hat{x_1},\hat{x_2})-4\delta\leq\frac{4\delta\sqrt{N}}{\sqrt{n}}\implies\sqrt{n}\leq\frac{4\delta\sqrt{N}}{\sqrt{N}\,d(\hat{x_1},\hat{x_2})-4\delta}
$$

But this can't be true, because if I choose ##d(\hat{x_1},\hat{x_2})=0## then ##\sqrt{n}\leq-\sqrt{N}##, but ##n## and ##N## must be positive.

What is wrong here?

Also, I am sure there must be a simple way to do this. What I really want to know is how to get ##n## as a function of ##d(\hat{x_1},\hat{x_2})## and ##\phi##, where ##\phi## is a confidence level, like 95% confidence.
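As a sanity check, here is a small Monte Carlo sketch (in Python; the population parameters, sample sizes, and trial count are illustrative choices) of how often the distance between two independent sample means from the same population stays under the standard two-sample bound ##1.96\,\delta\sqrt{1/N+1/n}##:

```python
import math
import random

random.seed(0)
mu, delta = 10.0, 3.0          # true mean and standard deviation (illustrative)
N, n = 200, 50                 # the two sample sizes (illustrative)
trials = 4000

hits = 0
for _ in range(trials):
    x1 = sum(random.gauss(mu, delta) for _ in range(N)) / N
    x2 = sum(random.gauss(mu, delta) for _ in range(n)) / n
    # standard ~95% bound for the difference of two independent sample means
    bound = 1.96 * delta * math.sqrt(1.0 / N + 1.0 / n)
    if abs(x1 - x2) <= bound:
        hits += 1

coverage = hits / trials
print(round(coverage, 2))      # typically close to 0.95
```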
 
  • #2
I don't like the look of this. As I understand your scenario, n is not a variable, and d is a random variable that you cannot "choose".
Mathematically, your problem comes in the last step. If we set ##d=0##, the penultimate line is
$$-4\delta \leq \frac{4\delta\sqrt{N}}{\sqrt{n}} \implies \sqrt{n}\leq\frac{4\delta\sqrt{N}}{-4\delta}$$
You have to be careful with inequalities: it's not as simple as "swapping terms" in an equation. When you multiply or divide both sides of an inequality by a negative quantity (here ##-4\delta##), the direction of the inequality is reversed. So it should be
$$\sqrt{n}\geq\frac{4\delta\sqrt{N}}{-4\delta}$$
And more generally, if ##\sqrt{N}d-4\delta## is negative, moving it to the denominator of the right-hand side reverses the direction of the inequality.
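A quick numeric check of the sign flip (the values of ##\delta## and ##N## are arbitrary):

```python
import math

delta, N = 2.0, 100.0
for n in (1.0, 25.0, 10000.0):
    lhs = -4 * delta                              # sqrt(N)*d - 4*delta with d = 0
    rhs = 4 * delta * math.sqrt(N) / math.sqrt(n)
    assert lhs <= rhs                             # the penultimate line holds
    # the naive "swap" would claim sqrt(n) <= -sqrt(N), which is impossible:
    assert not (math.sqrt(n) <= -math.sqrt(N))
    # the correctly reversed inequality is (vacuously) true:
    assert math.sqrt(n) >= -math.sqrt(N)
print("sign flip verified")
```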
 
  • #3
fahraynk said:
<Snip>
If ##d(\hat{x_1},\hat{x_2})## is a euclidean distance function on the sample means, then:

$$
d(\hat{x_1},\hat{x_2})\leq d(\hat{x_1},\mu)+ d(\hat{x_2},\mu)\leq 4\frac{\sigma_1}{\sqrt{N}}+4\frac{\sigma_2}{\sqrt{n}}
$$

With ##95##% confidence because : ##\mu\in [\hat{x_1}-\frac{2\sigma_1}{\sqrt{N}},\hat{x_1}+\frac{2\sigma_1}{\sqrt{N}}]## with 95% confidence<Snip>
What is wrong here?

Also, I am sure there must be a simple way to do this. What I really want to know is how to get ##n## as a function of ##d(\hat{x_1},\hat{x_2})## and ##\phi##, where ##\phi## is a confidence level, like 95% confidence.

Not sure, but maybe because ##\mu## is also in a similar interval about ##\hat{x_2} ##?
 
  • #4
As an idea (which I have put off developing): maybe you can use correlation as an inner product (with some adjustments), then find the norm generated by the inner product and define a distance based on that norm. I will do it too... some day.
 
  • #5
fahraynk said:
My first question is, What is the relationship between sample standard deviation and population standard deviation?

Some red flags here: I see you're dividing by ##n##, when in fact for the sample variance you'd divide by ##n-1##. A similar issue is that while you can get unbiased estimates of the variance, you'll have a biased standard deviation estimate due to concavity issues: the square root is concave, so by Jensen's inequality ##E[\hat{\sigma}]\leq\sqrt{E[\hat{\sigma}^2]}##.

Other issues, in addition to what was raised above: I don't see why your mean estimates should have a normal distribution; this isn't stated anywhere. Sure, the CLT tells you that the normal approximation works for large enough ##n##, but I don't see the sufficiency of the size of ##n## addressed anywhere.

There's also a "ruler" problem, in that you're using estimates of the standard deviation to measure estimates of the mean. How do you know your standard deviation (or variance) estimates are any good? There are a lot of issues lurking in here... I think this is why books on statistics are long.

- - - -
If I were trying to develop some kind of estimate from scratch, I'd probably start with some kind of bounded random variable and apply Chernoff bounds or concentration inequalities. That way you don't need variance information, only the mean. Once I had this down, if feeling adventurous, it could be applied to more general random variables (that still have their first two moments) with the help of the method of truncation.
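To make the concentration-inequality suggestion concrete: for i.i.d. samples bounded in ##[a,b]##, Hoeffding's inequality gives ##P(|\bar{X}_n-\mu|\geq\epsilon)\leq 2e^{-2n\epsilon^2/(b-a)^2}##, which can be solved for ##n## with no variance information at all. A minimal sketch (the function name and example tolerances are mine, not from the thread):

```python
import math

def hoeffding_sample_size(eps: float, alpha: float, a: float = 0.0, b: float = 1.0) -> int:
    """Smallest n such that P(|sample mean - mu| >= eps) <= alpha
    for i.i.d. samples bounded in [a, b], via Hoeffding's inequality."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / alpha) / (2.0 * eps ** 2))

# mean within 0.05 of mu, with 95% confidence, for samples in [0, 1]
print(hoeffding_sample_size(eps=0.05, alpha=0.05))   # -> 738
```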
 
  • #6
fahraynk said:
I tried to derive an equation for one sample mean to converge to another sample mean within a 95% confidence interval,

Your description doesn't define a specific mathematical problem.

You might intend asking about a scenario where independent random samples are taken of a random variable. After taking ##n## samples, the sample mean is the random variable ##\mu_n##. After taking ##k## more samples, the sample mean is the random variable ##\mu_{n+k}## where the first ##n## of those samples are the same as those used to compute ##\mu_n##.

Or you might intend to ask about the situation where ##\mu_n## and ##\mu_{n+k}## are computed from two groups of samples that need not have any common samples.

You might be taking ##n## and ##k## as given and asking for an interval length ##L## such that there is a ##0.95## probability that ##| \mu_n - \mu_{k+n}| < L/2##

Or you might be taking ##L## and ##n## as given and asking for value of ##k## such that there is a 0.95 probability that ##|\mu_n - \mu_{n+k}| < L/2##

Or you might have in mind some question involving the relationship of ##\mu_n## and ##\mu_{n+k}## with the mean ##\mu## of the random variable being sampled.
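The nested version of these questions is easy to probe by simulation. The sketch below assumes i.i.d. normal samples and uses the fact that, for nested samples, ##\operatorname{Var}(\mu_n-\mu_{n+k})=\sigma^2(1/n-1/(n+k))##; all numeric values are illustrative:

```python
import math
import random

random.seed(1)
sigma, n, k = 1.0, 100, 300
trials = 5000

# For nested samples, Cov(mu_n, mu_{n+k}) = sigma^2/(n+k), so
# Var(mu_n - mu_{n+k}) = sigma^2 * (1/n - 1/(n+k)).
half = 1.96 * sigma * math.sqrt(1.0 / n - 1.0 / (n + k))  # L/2 for ~95%

inside = 0
for _ in range(trials):
    xs = [random.gauss(0.0, sigma) for _ in range(n + k)]
    mu_n = sum(xs[:n]) / n
    mu_nk = sum(xs) / (n + k)
    if abs(mu_n - mu_nk) < half:
        inside += 1

print(round(inside / trials, 2))   # typically close to 0.95
```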
 
  • #7
It seems that if ##N## and ##n## were both large enough, we could use the CLT to argue that the two sample means must be close to each other, under certain assumptions on the sampling, as @Stephen Tashi described in his post.
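A sketch of that argument for the nested-sampling scenario, assuming i.i.d. samples ##X_1, X_2, \dots## with variance ##\sigma^2##: since ##\mu_{n+k}## reuses the first ##n## samples,
$$
\operatorname{Cov}(\mu_n,\mu_{n+k})=\frac{1}{n(n+k)}\sum_{i=1}^{n}\operatorname{Var}(X_i)=\frac{\sigma^2}{n+k},
$$
$$
\operatorname{Var}(\mu_n-\mu_{n+k})=\frac{\sigma^2}{n}+\frac{\sigma^2}{n+k}-\frac{2\sigma^2}{n+k}=\sigma^2\left(\frac{1}{n}-\frac{1}{n+k}\right),
$$
which goes to ##0## as ##n\to\infty##, and by the CLT the difference is approximately normal with that (shrinking) variance.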
 
  • #8
Doesn't weak convergence allow us to say that ##\hat x_n## is Cauchy, so that for ##k, j > N## we have ##|\hat x_k -\hat x_j | < \epsilon##?
 
