Extreme value theory and limiting distributions for i.i.d. order statistics

Click For Summary
SUMMARY

The discussion centers on the application of extreme value theory and limiting distributions for independent and identically distributed (i.i.d.) order statistics, particularly in the context of cryptographic research. James McLaughlin seeks guidance on estimating the adequacy of the asymptotic normal distribution when analyzing the top 0.000006388% of values from a dataset of size n = 2,199,023,255,551. Key references include H.A. David's "Order Statistics" and Embrechts' "Modelling Extremal Events," which provide foundational insights into empirical cumulative distribution functions (CDFs) and exceedance probabilities. The conversation emphasizes the importance of understanding the underlying distribution's characteristics to apply appropriate statistical models.

PREREQUISITES
  • Understanding of extreme value theory and its applications in statistics.
  • Familiarity with order statistics and their asymptotic properties.
  • Knowledge of empirical cumulative distribution functions (CDFs) and their significance.
  • Basic proficiency in statistical modeling and analysis using tools like Mathematica.
NEXT STEPS
  • Study the concepts of exceedance probabilities in the context of extreme value theory.
  • Explore the derivation of the probability density function (PDF) for the m-th largest data point in a known distribution.
  • Learn about the Central Limit Theorem and its application to order statistics.
  • Review Embrechts' "Modelling Extremal Events" for advanced insights into modeling extremal behavior in datasets.
USEFUL FOR

Statisticians, data scientists, cryptographers, and researchers involved in statistical modeling and analysis of large datasets, particularly those interested in extreme value theory and order statistics.

jamie_m
Messages
14
Reaction score
0
(This question was previously posted to sci.math.research. I only received one reply; sadly the advice therein conflicted with section 9.1 of H.A. David's "Order Statistics" - and probably with the fact that there was such a field of study as "r-extreme order statistics" - hence my reposting it here.)

I've been studying some cryptographic research in which the asymptotic normal distribution of the empirical sample quartile of order q is used to construct statistical models of the amount of data required for a successful cryptanalysis.

The main issue I have is that, while I'm pretty sure that such models have continued to be used for order statistics X_i (with i near to n) where the asymptotic normal distribution is inaccurate and where something based on extreme-value theory for the mth extremes would have been better, I don't have any idea as to how to compute an estimate for the value of i (or indeed q) above which the asymptotic normal might be considered suspect.

As an example, I'm currently dealing with the situation X_1 ≤ X_2 ...≤ X_n, where n = 2^{41}-1 = 2,199,023,255,551. In particular, I'm trying to work out whether the asymptotic normal is likely to be adequate when drawing conclusions about the top 2^{17} = 131,072 values or not - and while this seems a high m for m-th-extreme, it's not so high in relation to n, and this would mean I was dealing with the top 0.000006388% of values.

Can anyone give me some advice here?

Many thanks,

James McLaughlin.
 
Physics news on Phys.org
There's a few options depending on how much is known about the distribution that your data is drawn from, e.g. you haven't said yet whether it's discrete or continuous, bounded, has a parametric form or known tail asymptotics.

Also it might be easier to consider the empirical cdf (which are scaled binomial variables) rather than the empirical quantiles, especially if the original problem can be reformulated in terms of exceedance properties.

If you have parametric or tail info then there's a few results in Embrechts' book that may be useful.

HTH
 
(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?
 
jamie_m said:
(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?

bpet is almost certainly referring to Embrechts' Modelling Extremal Events. You can find it here.
 
jamie_m said:
(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?

Ok, the data's distribution is a known parametric form so it should be no problem to write down the pdf of the m-th largest data point. From there it should be possible to derive the normal approximation directly (may require an asymptotic expression for the incomplete gamma function) or even a higher-order approximation.

The empirical cdf is more useful for the situation where the data is from an unknown distribution. Also I think I meant to write "exceedance probabilities", not "exceedance properties".

HTH
 
"Ok, the data's distribution is a known parametric form so it should be no problem to write down the pdf of the m-th largest data point." you state.

Let me see if I'm doing this right so far:

The CDF of the mth largest data point, where D is the number of data points, and where P(X ≤ x) denotes the CDF of the chi-squared distribution, is:

D
= Ʃ ((P(X ≤ x)^{i})(1-P(X ≤ x))^{D - i})(D \choose i)
i=(D - (m-1))

To obtain the PDF, I need to differentiate that using the chain rule and the product rule - this being made easier since I can express this PDF in terms of the PDF and CDF of the chi-square distribution.

Probably through numerical integration with Mathematica, I use that to obtain the mean and variance. (Hopefully (D \choose i) won't involve numbers too large for it to handle)

If there's no simpler method, and if that's what I need to do, I don't see how to obtain a Normal approximation for that. Is there somewhere I can use the Central Limit Theorem that I haven't spotted?
 
Last edited:
jamie_m said:
I don't see how to obtain a Normal approximation for that.

I don't know anything about this problem, so my remarks are naive questions - not attempts to give some subtle hint about how to solve it.

How much of the asymptotic theory of order statistics is based on approximating the binomial distribution as a normal distribution?

If the cdf for one sample of random variable is F(x) and the pdf is f(x). We ask "In N samples, what is the probability that mth largest value is v?

One might hope that the probability can be computed as the two factors (probability that one sample is v)(probability that in remaining samples, m-1 are less than v and N-m are greater). The second factor could be viewed as a computation from a binomial distribution of N-1 samples, with probabiity F(v) of success.

Unlike the usual scenario for approximating the binomial by a normal, this approximation would involve approximating the density of the mth order statistic by a different normal distribution at each possible value for it. (It also involves a problematic event in the first factor, but often one can get away with misinterpreting f(v) that way.)

I have no trouble finding things on the web about the results of extreme value theory, but I haven't found any good explanations of how these results are derrived.
 

Similar threads

  • · Replies 15 ·
Replies
15
Views
4K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 1 ·
Replies
1
Views
1K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 13 ·
Replies
13
Views
3K
  • Poll Poll
  • · Replies 2 ·
Replies
2
Views
8K
  • · Replies 13 ·
Replies
13
Views
10K
  • · Replies 10 ·
Replies
10
Views
5K