Determine P(t = x) if % of sample drawn from gaussian > t

  • Context: Undergrad 
  • Thread starter: NotASmurf
  • Tags: Gaussian

Discussion Overview

The discussion revolves around determining the probability density function (PDF) of a threshold value \( t \) based on a sample drawn from a normal distribution. Participants explore the relationship between the sample values and the threshold, particularly focusing on the implications of knowing how many sample values exceed \( t \). The scope includes theoretical reasoning and statistical modeling.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant asks whether it is possible to determine a PDF for \( t \) given a sample from a normal distribution and the number of sample values exceeding \( t \).
  • Another participant suggests that if \( t \) is less than all sampled values, the probability can be expressed using the complementary cumulative distribution function (CDF).
  • A later reply clarifies that all sample values are from the same distribution and discusses the implications of knowing how many samples are above and below \( t \).
  • Participants discuss the concept of order statistics and how they relate to the problem, suggesting that the question may involve determining the probability that \( t \) lies between certain order statistics of the sample.
  • One participant mentions a program they created to analyze sample subsets and the observed percentages of how many elements exceed \( t \) for various values of \( t \).
  • Another participant questions whether the original inquiry pertains to the distribution of the sample median or a statistic for testing population skewness.

Areas of Agreement / Disagreement

Participants express varying interpretations of the problem, with some agreeing on the need for clarification regarding the definitions and relationships between \( t \) and the sample values. There is no consensus on the exact nature of the inquiry or the appropriate statistical approach to take.

Contextual Notes

Participants note that the definitions of the variables and the relationships between them are not clearly established, leading to confusion. The discussion also highlights the complexity of deriving a closed-form solution for the problem, particularly when dealing with normal distributions and order statistics.

Who May Find This Useful

Readers interested in statistical modeling, order statistics, and the properties of normal distributions may find this discussion relevant.

NotASmurf
Sorry for the bad title, limited space.

A sample group of size n, as well as a number t, is drawn randomly from a normal distribution. If we know how many members of the sample group are bigger than t, can we determine a PDF for the value of t? Are there any simplifications we can use to home in on it? Any help appreciated.
 
NotASmurf said:
Sorry for the bad title, limited space.

A sample group of size n, as well as a number t, is drawn randomly from a normal distribution. If we know how many members of the sample group are bigger than t, can we determine a PDF for the value of t? Are there any simplifications we can use to home in on it? Any help appreciated.

Your question is incomprehensible. You need to take more time explaining what you mean, and if you need more space in which to say it, then take more space.

As I understand it, you have some distribution ##F(x)## (which you say is normal, but never mind that for now), and you draw a sample of size ##n+1## from ##F##. You call the first sampled-value ##t##, then call the others ##X_1, X_2, \ldots, X_n.## After that, you lose me. Apparently you connect the values of ##t## and the ##X_i## in some way, and then ask about some probabilities, but I cannot figure out what you want.

Or, maybe the value of ##t## is not really drawn from the same distribution ##F## as are the ##X_i##. In that case, are you asking for the distribution of ##t## if it is to be less than all the ##X_i?## Well, that's easy:
$$P(t < X_1, t < X_2, \ldots, t < X_n) = P(X_1 > t) P(X_2 > t) \cdots P(X_n > t) = \left( \bar{F}(t) \right)^n,$$
where ##\bar{F}(t) = 1 - F(t)## is the complementary cdf. For the normal distribution ##N(\mu,\sigma)## this would be
$$\left( 1 - \Phi \left( \frac{t-\mu}{\sigma} \right) \right)^n$$ where ##\Phi(z)## is the cdf of the standard normal random variable ##N(0,1).##
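The normal-distribution formula above is easy to evaluate numerically. Here is a minimal sketch using only the Python standard library; the function names `normal_sf` and `prob_all_above` are made up for this illustration, and it relies on the identity ##1 - \Phi(z) = \tfrac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})##.

```python
import math

def normal_sf(t, mu=0.0, sigma=1.0):
    """Complementary cdf 1 - Phi((t - mu) / sigma) of N(mu, sigma)."""
    z = (t - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def prob_all_above(t, n, mu=0.0, sigma=1.0):
    """P(X_1 > t, ..., X_n > t) for n independent N(mu, sigma) draws."""
    return normal_sf(t, mu, sigma) ** n

# At t = mu each draw exceeds t with probability 1/2, so n = 3 gives 1/8.
print(prob_all_above(0.0, 3))
```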
 
Thanks for the reply, but they are all from the same distribution. We know that X1, X2, ..., Xk are bigger than t, but Xk+1, Xk+2, ..., Xn are smaller. We know all the Xi's, but NOT t; I seek to "determine" t, i.e. P(t = x).
Obviously if more than half of the Xi's are bigger than t, it is more likely that t is smaller than the median, etc. I seek to build a distribution out of this.
 
NotASmurf said:
Thanks for the reply, but they are all from the same distribution. We know that X1, X2, ..., Xk are bigger than t, but Xk+1, Xk+2, ..., Xn are smaller. We know all the Xi's, but NOT t; I seek to "determine" t, i.e. P(t = x).
Obviously if more than half of the Xi's are bigger than t, it is more likely that t is smaller than the median, etc. I seek to build a distribution out of this.

OK. You cannot hope to have the samples ##X_1, X_2, \ldots,X_n## come out in descending values: there is no reason why we cannot have ##X_1 < X_2## and ##X_2 > X_3,## for example. However the so-called order statistics ##X_{(1)}, X_{(2)}, \ldots X_{(n)}## are, by definition, the ##X_i## values re-sorted into ascending order. That is, ##X_{(1)} = ## smallest of ##S \equiv \{X_1, X_2, \ldots, X_n\},## ##X_{(2)} = ## second smallest of ##S##, all the way up to ##X_{(n)} = ## largest of ##S##. Although you want the ##X## values sorted into descending order, the standard probabilistic formulas apply to them in ascending order. For that reason, I am going to suppose that you want ##X_{(k)} < t < X_{(k+1)}##, so the first ##k## of them are less than ##t## and the remaining ##n-k## of them are ##> t##.
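To make the order-statistics bookkeeping concrete, here is a small sketch that sorts a sample into ascending order and finds the ##k## for which ##X_{(k)} < t < X_{(k+1)}##. The helper name `count_below_and_above` and the sample values are invented for illustration only.

```python
from bisect import bisect_left, bisect_right

def count_below_and_above(sample, t):
    """Return (k, n_above): how many sample values are < t and how many are > t."""
    ordered = sorted(sample)                         # X_(1) <= X_(2) <= ... <= X_(n)
    k = bisect_left(ordered, t)                      # values strictly less than t
    n_above = len(ordered) - bisect_right(ordered, t)  # values strictly greater than t
    return k, n_above

sample = [2.3, -0.4, 1.1, 0.7, -1.5]   # hypothetical draws
print(count_below_and_above(sample, 0.9))   # (3, 2): X_(3) = 0.7 < 0.9 < X_(4) = 1.1
```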

The question only makes sense to me if you fix the values of ##n## and ##k##, and in that case you want to know
$$P(X_{(k)} < T < X_{(k+1)})$$ where ##T## is another independently-generated sample point from the same distribution ##F## as the ##X_i.##

That is a classical problem: if the random variables ##X_i## are continuous, with probability density function ##f(x)## and (cumulative) distribution function ##F(x) = \int_{-\infty}^x f(s) \, ds## then, for any pair ##u < v## the event ##E_{u,v} = \{ X_{(k)} < u \} \cap \{X_{(k+1)} > v \}## occurs whenever ##k## of the ##X_i## are ##< u## and the remaining ##n-k## of them are ##> v##. The ones that are to be ##< u## can be chosen from the sample in ##{ n \choose k }## ways (binomial coefficient), hence
$$P(E_{u,v}) = {n \choose k} F(u)^k (1-F(v))^{n-k} $$ This implies that
$$P(X_{(k)} < T < X_{(k+1)}| T=t) = {n \choose k} F(t)^k (1-F(t))^{n-k},$$
obtained by letting ##u## and ##v## both tend to ##t## (for a continuous distribution the chance that some ##X_i## equals ##t## exactly is zero).

The answer to your question (if I have interpreted it correctly) is
$$\text{ans.} = \int_{-\infty}^\infty P(X_{(k)} < T < X_{(k+1)}| T=t) \, f(t) \, dt = {n \choose k} \int_{-\infty}^\infty F(t)^k (1-F(t))^{n-k} f(t) \, dt.$$ In other words, if we substitute ##p = F(t)##, so that ##dp = f(t) \, dt##, then we have
$$\text{ans.} = {n \choose k} \int_0^1 p^k (1-p)^{n-k} \, dp = {n \choose k} \frac{k! \, (n-k)!}{(n+1)!} = \frac{1}{n+1},$$ whatever the continuous distribution ##F## is. If the ##X_i## and ##T## all come from the same ##N(\mu,\sigma)## there will not be any closed-form formula for the integrand ##F(t)^k (1-F(t))^{n-k} f(t)## as a function of ##t## (even ##F(x)## itself has no closed-form formula in terms of standard, elementary functions). Probably you would need to use numerical methods to evaluate it for a given example, and perhaps be satisfied with graphical results. However, as indicated in some of the on-line articles about order statistics, if the sample size ##n## is large you might be able to employ reasonable normal-distribution approximations to some of the quantities, and so get a bit further towards a usable formula.
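One possible reading of the original request for "P(t = x)" is the density of ##T## given that exactly ##k## of the ##n## sample values fall below it. By Bayes' rule, with ##T## drawn from ##F## and each ##X_i## independently below ##t## with probability ##F(t)##, that density is proportional to ##F(t)^k (1-F(t))^{n-k} f(t)##, and the normalizing factor works out to ##(n+1){n \choose k}##. The sketch below evaluates it for a normal distribution and checks numerically that it integrates to 1. The function names and the simple trapezoid rule are choices made for this illustration, not anything taken from the thread.

```python
import math

def normal_pdf(t, mu=0.0, sigma=1.0):
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(t, mu=0.0, sigma=1.0):
    return 0.5 * math.erfc(-(t - mu) / (sigma * math.sqrt(2.0)))

def threshold_density(t, n, k, mu=0.0, sigma=1.0):
    """Density of T at t, given that exactly k of the n samples are below T."""
    F = normal_cdf(t, mu, sigma)
    weight = math.comb(n, k) * F**k * (1.0 - F) ** (n - k)
    return (n + 1) * weight * normal_pdf(t, mu, sigma)

def trapezoid(func, a, b, steps=4000):
    """Plain trapezoid rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for i in range(1, steps):
        total += func(a + i * h)
    return total * h

# n = 6 samples, k = 1 below t (so 5 of 6 are above): mass sits left of the mean.
area = trapezoid(lambda t: threshold_density(t, 6, 1), -10.0, 10.0)
print(area)   # should be very close to 1
```

The resulting curve can be plotted on a grid of ##t## values; it is the same as the density of the ##(k+1)##-th smallest of ##n+1## draws from ##F##.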

For more about order statistics, see
https://en.wikipedia.org/wiki/Order_statistic
and
https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec15.pdf
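As a sanity check on ##P(X_{(k)} < T < X_{(k+1)})## with ##T## and the ##X_i## drawn independently from the same normal distribution, the probability can also be estimated by simulation. This is only a rough Monte Carlo sketch; the function name, trial count and seed are arbitrary. By symmetry each of the ##n+1## values is equally likely to occupy any rank, so the estimate should land near ##1/(n+1)## for every ##k##.

```python
import random

def estimate_gap_probability(n, k, trials=100_000, seed=0):
    """Estimate P(exactly k of n standard-normal samples fall below an independent T)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t = rng.gauss(0.0, 1.0)                              # the threshold T
        below = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) < t)
        if below == k:
            hits += 1
    return hits / trials

# With n = 5 the estimate should be close to 1/6 for each k in 0..5.
print(estimate_gap_probability(5, 2))
```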
 
NotASmurf said:
Sorry for the bad title, limited space.

A sample group of size n, as well as a number t, is drawn randomly from a normal distribution. If we know how many members of the sample group are bigger than t, can we determine a PDF for the value of t? Are there any simplifications we can use to home in on it? Any help appreciated.

You do not need to try to fit the entire question in the title. I am also struggling to understand what you are trying to do.
NotASmurf said:
but they are all from the same distribution
They're not? You didn't say that. What distribution is X1 from? What distribution is X2 from?

NotASmurf said:
we know that X1, X2..Xk are bigger than t,
We do? How do we know that? You didn't say that before.

Isn't X1 the first random number we sample? If it's not, then what is the definition of X1?

NotASmurf said:
Obviously if more than half of the Xi's are bigger than t,
Why would they be? How was t determined? What are the Xi? Where are they coming from?

I think you'd better try to give an example of your procedure. Then we can help you model it. I think you are using non-standard terminology for almost everything in your question.
 
upload_2019-1-27_11-4-46.png


Ok, so I made a program that takes all possible length-6 (an arbitrary number for now) sample subsets. This is a simplified example, where the distribution is linear.

Now we notice that when t = 0, 59.049% of the time 5/6 elements of the sample were bigger than t; when t = 1, 20.48% of the time 3/6 elements were bigger than t, etc. The format is

"t = x : [percentage of the time that k/6 elements of the sample are bigger than t when t is x | k]. "

upload_2019-1-27_11-11-0.png


Here is an unnormalized graph of y = percentage of the time that t = x (for the case where 5/6 elements are greater).

I believe I can answer my question if I can understand why the graph is that shape and what type of graph it is. Any ideas?
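A possible explanation of the shape, sketched under assumptions: for a fixed ##t##, each element is above ##t## independently with some probability ##q(t)##, so the share of samples with exactly ##k## of ##n## elements above ##t## is the binomial term ##{n \choose k} q(t)^k (1-q(t))^{n-k}##. Viewed as a function of ##t##, that is the curve being plotted. The code below is not the program from the post; it assumes a hypothetical support of equally likely integer values, so it will not reproduce the percentages quoted above. It only checks that brute-force enumeration agrees with the binomial formula.

```python
from itertools import product
from math import comb

def share_with_k_above(support, n, k, t):
    """Enumerate all length-n tuples over `support` (values equally likely) and
    return the fraction having exactly k entries strictly greater than t."""
    total = 0
    hits = 0
    for sample in product(support, repeat=n):
        total += 1
        if sum(1 for x in sample if x > t) == k:
            hits += 1
    return hits / total

def binomial_share(support, n, k, t):
    """Same quantity from the binomial formula with q = P(X > t)."""
    q = sum(1 for x in support if x > t) / len(support)
    return comb(n, k) * q**k * (1 - q) ** (n - k)

support = range(5)   # hypothetical: values 0..4, each equally likely
for t in support:
    print(t, share_with_k_above(support, 6, 5, t), binomial_share(support, 6, 5, t))
```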
 

It seems like you are asking for the distribution of the sample median coming from a normal population? Or maybe you are coming up with a statistic to test for skewness of your population? The latter would not make sense.
 
