Determine P(t = x) if % of sample drawn from gaussian > t

NotASmurf · Jan 26, 2019

Sorry for the bad title, limited space

A sample group of size n, as well as a number t, is drawn randomly from a normal distribution. If we know how many members of the sample group are bigger than t, can we determine a PDF for the value of t? Are there any simplifications we can use to home in on it? Any help appreciated.

Ray Vickson · Jan 26, 2019

NotASmurf said:

Sorry for the bad title, limited space

A sample group of size n, as well as a number t, is drawn randomly from a normal distribution. If we know how many members of the sample group are bigger than t, can we determine a PDF for the value of t? Are there any simplifications we can use to home in on it? Any help appreciated.

Your question is incomprehensible. You need to take more time explaining what you mean, and if you need more space in which to say it, then take more space.

As I understand it, you have some distribution ##F(x)## (which you say is normal, but never mind that for now), and you draw a sample of size ##n+1## from ##F##. You call the first sampled-value ##t##, then call the others ##X_1, X_2, \ldots, X_n.## After that, you lose me. Apparently you connect the values of ##t## and the ##X_i## in some way, and then ask about some probabilities, but I cannot figure out what you want.

Or, maybe the value of ##t## is not really drawn from the same distribution ##F## as are the ##X_i##. In that case, are you asking for the distribution of ##t## if it is to be less than all the ##X_i?## Well, that's easy:
$$P(t < X_1, t < X_2, \ldots, t < X_n) = P(X_1 > t) P(X_2 > t) \cdots P(X_n > t)\\
\hspace{4ex} = \left( \bar{F}(t) \right)^n,$$
where ##\bar{F}(t) = 1 - F(t)## is the complementary cdf. For the normal distribution ##N(\mu,\sigma)## this would be
$$\left( 1 - \Phi \left( \frac{t-\mu}{\sigma} \right) \right)^n$$ where ##\Phi(z)## is the cdf of the standard normal random variable ##N(0,1).##
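
As a quick numerical illustration of the formula above, here is a minimal sketch that assumes a fixed ##t## and independent normal ##X_i##. The function names are hypothetical, and the simulation is only a sanity check, not part of the derivation.

```python
import math
import random

def normal_sf(t, mu=0.0, sigma=1.0):
    """Complementary cdf 1 - Phi((t - mu) / sigma) of N(mu, sigma)."""
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2.0)))

def prob_all_exceed(t, n, mu=0.0, sigma=1.0):
    """P(all n independent draws exceed t) = (1 - Phi((t - mu) / sigma))^n."""
    return normal_sf(t, mu, sigma) ** n

def simulate_all_exceed(t, n, mu=0.0, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Count the trial only if every one of the n draws is above t.
        if all(rng.gauss(mu, sigma) > t for _ in range(n)):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(prob_all_exceed(-1.0, 4), simulate_all_exceed(-1.0, 4))
```

The two printed numbers should agree to within simulation noise.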

NotASmurf · Jan 26, 2019

Thanks for the reply, but they are all from the same distribution. We know that X1, X2, ..., Xk are bigger than t, but Xk+1, Xk+2, ..., Xn are smaller. We know all the Xi's, but NOT t. I seek to "determine" t, i.e. P(t = x).
Obviously, if more than half of the Xi's are bigger than t, it is more likely that t is smaller than the median, etc. I seek to build a distribution out of this.

Ray Vickson · Jan 26, 2019

NotASmurf said:

Thanks for the reply, but they are all from the same distribution. We know that X1, X2, ..., Xk are bigger than t, but Xk+1, Xk+2, ..., Xn are smaller. We know all the Xi's, but NOT t. I seek to "determine" t, i.e. P(t = x).
Obviously, if more than half of the Xi's are bigger than t, it is more likely that t is smaller than the median, etc. I seek to build a distribution out of this.

OK. You cannot hope to have the samples ##X_1, X_2, \ldots,X_n## come out in descending values: there is no reason why we cannot have ##X_1 < X_2## and ##X_2 > X_3,## for example. However the so-called order statistics ##X_{(1)}, X_{(2)}, \ldots X_{(n)}## are, by definition, the ##X_i## values re-sorted into ascending order. That is, ##X_{(1)} = ## smallest of ##S \equiv \{X_1, X_2, \ldots, X_n\},## ##X_{(2)} = ## second smallest of ##S##, all the way up to ##X_{(n)} = ## largest of ##S##. Although you want the ##X## values sorted into descending order, the standard probabilistic formulas apply to them in ascending order. For that reason, I am going to suppose that you want ##X_{(k)} < t < X_{(k+1)}##, so the first ##k## of them are less than ##t## and the remaining ##n-k## of them are ##> t##.
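
To make the order-statistic bookkeeping concrete, here is a small sketch (the helper names and the sample values are made up for illustration). It sorts a sample into ascending order and finds the ##k## for which ##X_{(k)} < t \le X_{(k+1)}##.

```python
import bisect

def order_statistics(sample):
    """Return the sample re-sorted into ascending order: X_(1), ..., X_(n)."""
    return sorted(sample)

def count_below(sample, t):
    """Return k, the number of sample values strictly less than t."""
    # bisect_left gives the index of the first element >= t in the sorted
    # list, which equals the count of elements strictly below t.
    return bisect.bisect_left(sorted(sample), t)

if __name__ == "__main__":
    xs = [2.3, -0.4, 1.1, 0.7]      # illustrative values only
    k = count_below(xs, 1.0)
    print(order_statistics(xs), k, len(xs) - k)
```

Here ##k## values lie below ##t## and the remaining ##n-k## lie at or above it; for a continuous distribution ties have probability zero.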

The question only makes sense to me if you fix the values of ##n## and ##k##, and in that case you want to know
$$P(X_{(k)} < T < X_{(k+1)})$$ where ##T## is another independently-generated sample point from the same distribution ##F## as the ##X_i.##

That is a classical problem: if the random variables ##X_i## are independent and continuous, with probability density function ##f(x)## and (cumulative) distribution function ##F(x) = \int_{-\infty}^x f(s) \, ds## then, for any pair ##u \le v## the event ##E_{u,v} = \{ X_{(k)} < u \} \cap \{X_{(k+1)} > v \}## occurs whenever ##k## of the ##X_i## are ##< u## and the remaining ##n-k## of them are ##> v##. The ones that are to be ##< u## can be chosen from the sample in ##{ n \choose k }## ways (binomial coefficient), hence
$$P(E_{u,v}) = {n \choose k} F(u)^k (1-F(v))^{n-k} $$ Setting ##u = v = t## and using the independence of ##T## from the ##X_i##, this implies that
$$P(X_{(k)} < T < X_{(k+1)}| T=t) = {n \choose k} F(t)^k (1-F(t))^{n-k} $$

The answer to your question (if I have interpreted it correctly) is
$$\text{ans.} = \int_{-\infty}^\infty P(X_{(k)} < T < X_{(k+1)}| T=t) \, f(t) \, dt = {n \choose k} \int_{-\infty}^\infty F(t)^k (1-F(t))^{n-k} f(t) \, dt.$$ The substitution ##p = F(t)## turns this into a Beta integral:
$$\text{ans.} = {n \choose k} \int_0^1 p^k (1-p)^{n-k} \, dp = {n \choose k} \frac{k! \, (n-k)!}{(n+1)!} = \frac{1}{n+1}.$$ So the unconditional answer does not depend on ##F## at all: ##T## is equally likely to fall into any of the ##n+1## gaps between the order statistics. The conditional probability, viewed as a function of ##t##, is a different matter. If the ##X_i## and ##T## all come from the same ##N(\mu,\sigma)## there will not be any closed-form formula for it (even ##F(x)## itself has no closed-form formula in terms of standard, elementary functions). Probably you would need to evaluate it numerically for a given example, and perhaps be satisfied with graphical results. However, as indicated in some of the on-line articles about order statistics, if the sample size ##n## is large you might be able to employ reasonable normal-distribution approximations to some of the quantities, and so get a bit further towards a usable formula.
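
As a rough numerical cross-check, the sketch below assumes standard normal draws and uses hypothetical helper names. For a fixed ##t##, the chance that exactly ##k## of ##n## independent draws fall below ##t## is the binomial expression ##{n \choose k} F(t)^k (1-F(t))^{n-k}##. Averaging that over an independent ##T## from the same continuous distribution gives ##1/(n+1)##, because ##T## is equally likely to land in any of the ##n+1## gaps between the order statistics. The simulation estimates that average directly.

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cdf of N(mu, sigma), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_k_below_given_t(t, n, k, mu=0.0, sigma=1.0):
    """P(exactly k of n independent normal draws fall below the fixed value t)."""
    p = normal_cdf(t, mu, sigma)
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def simulate_rank_probability(n, k, trials=20000, seed=0):
    """Estimate P(X_(k) < T < X_(k+1)) with T and the X_i all drawn from N(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t = rng.gauss(0.0, 1.0)
        below = sum(rng.gauss(0.0, 1.0) < t for _ in range(n))
        if below == k:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    n, k = 6, 2
    print(simulate_rank_probability(n, k), 1.0 / (n + 1))
```

Plotting `prob_k_below_given_t` against ##t## for fixed ##n## and ##k## shows how strongly a given count constrains ##t##.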

For more about order statistics, see
https://en.wikipedia.org/wiki/Order_statistic
and
https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec15.pdf

RPinPA · Jan 26, 2019

NotASmurf said:

Sorry for the bad title, limited space

A sample group of size n, as well as a number t, is drawn randomly from a normal distribution. If we know how many members of the sample group are bigger than t, can we determine a PDF for the value of t? Are there any simplifications we can use to home in on it? Any help appreciated.

You do not need to try to fit the entire question in the title. I am also struggling to understand what you are trying to do.

NotASmurf said:

but they are all from the same distribution

They're not? You didn't say that. What distribution is X1 from? What distribution is X2 from?

NotASmurf said:

We know that X1, X2, ..., Xk are bigger than t,

We do? How do we know that? You didn't say that before.

Isn't X1 the first random number we sample? If it's not, then what is the definition of X1?

NotASmurf said:

Obviously, if more than half of the Xi's are bigger than t,

Why would they be? How was t determined? What are the Xi? Where are they coming from?

I think you'd better try to give an example of your procedure. Then we can help you model it. I think you are using non-standard terminology for almost everything in your question.

NotASmurf · Jan 27, 2019

OK, so I made a program that enumerates all possible samples of length 6 (an arbitrary number for now). This is a simplified example in which the distribution is linear.

Now we notice that when t = 0, 59.049% of the time 5/6 elements of the sample were bigger than t; when t = 1, 20.48% of the time 3/6 elements were bigger than t, etc. The format is

"t = x : [percentage of the time that k/6 elements of the sample are bigger than t when t is x | k]"

Here is an unnormalized graph of y = percentage of the time that t = x (for the case where 5/6 elements are greater).

I believe I can answer my question if I can understand why the graph is that shape and what type of graph it is. Any ideas?
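
The program itself is not shown, so the following is only a guess at an equivalent enumeration. It assumes (hypothetically) that values are drawn uniformly from the integers 0 to 9, that t is one of the six numbers, and that the other five are compared against it. That reading reproduces both quoted figures: ##0.9^5 = 0.59049## for t = 0 and ##{5 \choose 3}(0.8)^3(0.2)^2 = 0.2048## for t = 1.

```python
from itertools import product

def count_table(t, values=range(10), others=5):
    """counts[k] = number of equally likely tuples with exactly k entries > t."""
    counts = [0] * (others + 1)
    for combo in product(values, repeat=others):
        counts[sum(x > t for x in combo)] += 1
    return counts

def posterior_over_t(k, values=range(10), others=5):
    """Normalise the k-th column over t, assuming t is uniform on `values`."""
    values = list(values)
    weights = [count_table(t, values, others)[k] for t in values]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    total = 10 ** 5
    print(count_table(0)[5] / total)   # share of tuples with all five entries > 0
    print(count_table(1)[3] / total)   # share of tuples with exactly three entries > 1
    print(posterior_over_t(5))
```

Under these assumptions, each column of the table is a binomial probability ##{m \choose k} q^k (1-q)^{m-k}## with ##m = 5## and ##q = P(X > t) = (9 - t)/10##. For the all-greater case the curve is proportional to ##(9-t)^5##, which would explain a graph that drops steeply from t = 0 and flattens towards zero. Normalising a column over t, as `posterior_over_t` does, gives a distribution for t given the count.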

WWGD · Mar 29, 2019

It seems like you are asking for the distribution of the sample median coming from a normal population? Or maybe you are coming up with a statistic to test for skewness of your population? The latter would not make sense.
