Extreme value theory and limiting distributions for i.i.d. order statistics

Click For Summary

Discussion Overview

The discussion revolves around the application of extreme value theory and limiting distributions in the context of independent and identically distributed (i.i.d.) order statistics, particularly in relation to cryptographic research. Participants explore the adequacy of asymptotic normal distributions for empirical sample quartiles and the implications of using extreme value theory for higher order statistics.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • James McLaughlin questions the reliability of asymptotic normal distribution for order statistics when dealing with extreme values and seeks guidance on determining the threshold for its adequacy.
  • Some participants suggest that the nature of the data distribution (discrete vs. continuous, bounded vs. unbounded) significantly affects the analysis and results.
  • One participant mentions the usefulness of the empirical cumulative distribution function (CDF) and proposes reformulating the problem in terms of exceedance probabilities.
  • Another participant identifies that the data follows a chi-squared distribution and discusses the implications for deriving the probability density function (PDF) of the m-th largest data point.
  • There is a discussion about the potential use of the Central Limit Theorem in approximating the distribution of the m-th order statistic, with some participants expressing uncertainty about how to apply it effectively.
  • Concerns are raised about the complexity of deriving normal approximations and the challenges associated with large combinatorial numbers in calculations.

Areas of Agreement / Disagreement

Participants express differing views on the adequacy of asymptotic normal distributions for extreme order statistics, and there is no consensus on the best approach to take. The discussion remains unresolved regarding the specific methods and assumptions that should be applied.

Contextual Notes

Participants note limitations related to the assumptions about the underlying distribution of the data, the complexity of mathematical derivations, and the potential need for numerical methods in calculations.

jamie_m
Messages
14
Reaction score
0
(This question was previously posted to sci.math.research. I only received one reply; sadly the advice therein conflicted with section 9.1 of H.A. David's "Order Statistics" - and probably with the fact that there was such a field of study as "r-extreme order statistics" - hence my reposting it here.)

I've been studying some cryptographic research in which the asymptotic normal distribution of the empirical sample quartile of order q is used to construct statistical models of the amount of data required for a successful cryptanalysis.

The main issue I have is that, while I'm pretty sure that such models have continued to be used for order statistics X_i (with i near to n) where the asymptotic normal distribution is inaccurate and where something based on extreme-value theory for the mth extremes would have been better, I don't have any idea as to how to compute an estimate for the value of i (or indeed q) above which the asymptotic normal might be considered suspect.

As an example, I'm currently dealing with the situation X_1 ≤ X_2 ...≤ X_n, where n = 2^{41}-1 = 2,199,023,255,551. In particular, I'm trying to work out whether the asymptotic normal is likely to be adequate when drawing conclusions about the top 2^{17} = 131,072 values or not - and while this seems a high m for m-th-extreme, it's not so high in relation to n, and this would mean I was dealing with the top 0.000006388% of values.

Can anyone give me some advice here?

Many thanks,

James McLaughlin.
 
Physics news on Phys.org
There's a few options depending on how much is known about the distribution that your data is drawn from, e.g. you haven't said yet whether it's discrete or continuous, bounded, has a parametric form or known tail asymptotics.

Also it might be easier to consider the empirical cdf (which are scaled binomial variables) rather than the empirical quantiles, especially if the original problem can be reformulated in terms of exceedance properties.

If you have parametric or tail info then there's a few results in Embrechts' book that may be useful.

HTH
 
(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?
 
jamie_m said:
(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?

bpet is almost certainly referring to Embrechts' Modelling Extremal Events. You can find it here.
 
jamie_m said:
(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics.", or how these apply in an order-statistics type situation.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?

Ok, the data's distribution is a known parametric form so it should be no problem to write down the pdf of the m-th largest data point. From there it should be possible to derive the normal approximation directly (may require an asymptotic expression for the incomplete gamma function) or even a higher-order approximation.

The empirical cdf is more useful for the situation where the data is from an unknown distribution. Also I think I meant to write "exceedance probabilities", not "exceedance properties".

HTH
 
"Ok, the data's distribution is a known parametric form so it should be no problem to write down the pdf of the m-th largest data point." you state.

Let me see if I'm doing this right so far:

The CDF of the mth largest data point, where D is the number of data points, and where P(X ≤ x) denotes the CDF of the chi-squared distribution, is:

D
= Ʃ ((P(X ≤ x)^{i})(1-P(X ≤ x))^{D - i})(D \choose i)
i=(D - (m-1))

To obtain the PDF, I need to differentiate that using the chain rule and the product rule - this being made easier since I can express this PDF in terms of the PDF and CDF of the chi-square distribution.

Probably through numerical integration with Mathematica, I use that to obtain the mean and variance. (Hopefully (D \choose i) won't involve numbers too large for it to handle)

If there's no simpler method, and if that's what I need to do, I don't see how to obtain a Normal approximation for that. Is there somewhere I can use the Central Limit Theorem that I haven't spotted?
 
Last edited:
jamie_m said:
I don't see how to obtain a Normal approximation for that.

I don't know anything about this problem, so my remarks are naive questions - not attempts to give some subtle hint about how to solve it.

How much of the asymptotic theory of order statistics is based on approximating the binomial distribution as a normal distribution?

If the cdf for one sample of random variable is F(x) and the pdf is f(x). We ask "In N samples, what is the probability that mth largest value is v?

One might hope that the probability can be computed as the two factors (probability that one sample is v)(probability that in remaining samples, m-1 are less than v and N-m are greater). The second factor could be viewed as a computation from a binomial distribution of N-1 samples, with probabiity F(v) of success.

Unlike the usual scenario for approximating the binomial by a normal, this approximation would involve approximating the density of the mth order statistic by a different normal distribution at each possible value for it. (It also involves a problematic event in the first factor, but often one can get away with misinterpreting f(v) that way.)

I have no trouble finding things on the web about the results of extreme value theory, but I haven't found any good explanations of how these results are derrived.
 

Similar threads

  • · Replies 15 ·
Replies
15
Views
4K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 1 ·
Replies
1
Views
1K
  • · Replies 4 ·
Replies
4
Views
3K
  • Poll Poll
  • · Replies 2 ·
Replies
2
Views
9K
  • · Replies 13 ·
Replies
13
Views
4K
  • · Replies 13 ·
Replies
13
Views
10K
  • · Replies 10 ·
Replies
10
Views
5K