Extreme value theory and limiting distributions for i.i.d. order statistics

In summary: The thread asks when the asymptotic normal approximation for an upper order statistic of i.i.d. chi-squared data stops being adequate, and works toward the CDF and PDF of the m-th largest data point.
  • #1
jamie_m
(This question was previously posted to sci.math.research. I only received one reply; sadly the advice therein conflicted with section 9.1 of H.A. David's "Order Statistics" - and probably with the fact that there was such a field of study as "r-extreme order statistics" - hence my reposting it here.)

I've been studying some cryptographic research in which the asymptotic normal distribution of the empirical sample quantile of order q is used to construct statistical models of the amount of data required for a successful cryptanalysis.

My main concern is this. I'm fairly sure that such models have continued to be used for order statistics X_i with i close to n, where the asymptotic normal distribution is inaccurate and where something based on extreme-value theory for the m-th extremes would have been better. However, I have no idea how to estimate the value of i (or indeed q) above which the asymptotic normal approximation should be considered suspect.

As an example, I'm currently dealing with the situation X_1 ≤ X_2 ≤ ... ≤ X_n, where n = 2^{41}-1 = 2,199,023,255,551. In particular, I'm trying to work out whether the asymptotic normal is likely to be adequate when drawing conclusions about the top 2^{17} = 131,072 values. While this seems a high m for an m-th extreme, it's not so high in relation to n: it means I'm dealing with roughly the top 0.00000596% of values.
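The arithmetic here can be checked in a few lines of Python (a quick sketch; the variable names are just for illustration):

```python
# Sample size and number of top values quoted in the post.
n = 2**41 - 1
m = 2**17

# Fraction of the sample made up by the top m values.
fraction = m / n

print(f"n = {n:,}")
print(f"m = {m:,}")
print(f"top fraction = {fraction:.3e} ({100 * fraction:.8f}%)")
```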

Can anyone give me some advice here?

Many thanks,

James McLaughlin.
 
  • #2
There are a few options, depending on how much is known about the distribution your data is drawn from. For example, you haven't yet said whether it's discrete or continuous, whether it's bounded, or whether it has a parametric form or known tail asymptotics.

Also it might be easier to consider the empirical cdf (whose values are scaled binomial variables) rather than the empirical quantiles, especially if the original problem can be reformulated in terms of exceedance properties.
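As a rough illustration of the exceedance idea (a sketch only, using uniform data and small hypothetical values of n, m and x rather than anything from the original problem): the m-th largest point exceeds a threshold x exactly when at least m points exceed x, and the number of exceedances is binomial with n trials and success probability 1 - F(x). The helper names below are made up for the example.

```python
import math
import random


def binom_tail(n, p, m):
    """P(Binomial(n, p) >= m), summed directly (fine for small n)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def simulate(n, m, x, trials, seed=0):
    """Draw uniform(0, 1) samples and compare the two descriptions of the event."""
    rng = random.Random(seed)
    always_agree = True
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        exceedances = sum(1 for u in sample if u > x)
        mth_largest = sorted(sample, reverse=True)[m - 1]
        # The two events should coincide on every draw.
        always_agree = always_agree and ((mth_largest > x) == (exceedances >= m))
        hits += exceedances >= m
    return always_agree, hits / trials


# Hypothetical small example: n = 50 points, m = 3, threshold x = 0.95.
agree, freq = simulate(n=50, m=3, x=0.95, trials=5000)
print("events always coincide:", agree)
print("simulated P(m-th largest > x):", freq)
print("binomial  P(at least m exceedances):", binom_tail(50, 1 - 0.95, 3))
```

For a non-uniform distribution the same check should apply with the success probability 1 - F(x) computed from that distribution's cdf.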

If you have parametric or tail info then there are a few results in Embrechts' book that may be useful.

HTH
 
  • #3
(Apologies for neglecting my own thread for so long - I had something of a crisis of confidence with the research this forms a part of)

bpet, a randomly chosen X_i has a chi-squared distribution. I don't know whether this has a "parametric form or known tail asymptotics", or how these apply in an order-statistics setting.

Could you supply the title of Embrechts' book so I could look it up?

Also, could you expound on what you mean by "exceedance properties", and provide some more information on the empirical CDF and how you would use it?
 
  • #4
bpet is almost certainly referring to Embrechts, Klüppelberg and Mikosch, "Modelling Extremal Events for Insurance and Finance".
 
  • #5
Ok, the data's distribution is a known parametric form so it should be no problem to write down the pdf of the m-th largest data point. From there it should be possible to derive the normal approximation directly (may require an asymptotic expression for the incomplete gamma function) or even a higher-order approximation.
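For reference, the standard density of the m-th largest of n i.i.d. points is written out here, with F and f the cdf and pdf of a single point (for chi-squared data F is a regularized incomplete gamma function). The symbols n and m follow the first post; F and f are notation introduced for this note.

$$f_{(m)}(x) = \frac{n!}{(m-1)!\,(n-m)!}\, F(x)^{\,n-m}\, \bigl(1-F(x)\bigr)^{\,m-1} f(x)$$

Equivalently, applying F to the m-th largest point gives a Beta(n-m+1, m) variable, which is one possible route to a normal approximation when both n-m+1 and m are large.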

The empirical cdf is more useful for the situation where the data is from an unknown distribution. Also I think I meant to write "exceedance probabilities", not "exceedance properties".

HTH
 
  • #6
"Ok, the data's distribution is a known parametric form so it should be no problem to write down the pdf of the m-th largest data point." you state.

Let me see if I'm doing this right so far:

The CDF of the m-th largest data point, where D is the number of data points and P(X ≤ x) denotes the CDF of the chi-squared distribution, is:

$$\sum_{i=D-(m-1)}^{D} \binom{D}{i}\, P(X \le x)^{i}\, \bigl(1 - P(X \le x)\bigr)^{D-i}$$

To obtain the PDF, I need to differentiate that using the chain rule and the product rule - this being made easier since I can express this PDF in terms of the PDF and CDF of the chi-square distribution.
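A small numerical check of this step, as a sketch: differentiating the binomial sum term by term telescopes to a single term, D!/((m-1)!(D-m)!) F^{D-m} (1-F)^{m-1} f. The code compares that closed form against a finite-difference derivative of the sum. The choice of 2 degrees of freedom (so that F(x) = 1 - exp(-x/2)), the small D and m, and the function names are all assumptions made only to keep the example within the Python standard library.

```python
import math


def F(x):
    """Chi-squared cdf, ASSUMING 2 degrees of freedom: 1 - exp(-x/2)."""
    return -math.expm1(-x / 2.0)


def f(x):
    """Chi-squared pdf for 2 degrees of freedom."""
    return 0.5 * math.exp(-x / 2.0)


def cdf_mth_largest(x, D, m):
    """Binomial-sum cdf of the m-th largest of D points."""
    p = F(x)
    return sum(
        math.comb(D, i) * p**i * (1 - p) ** (D - i) for i in range(D - m + 1, D + 1)
    )


def pdf_mth_largest(x, D, m):
    """Closed form obtained after the sum's derivative telescopes."""
    p = F(x)
    coeff = math.factorial(D) // (math.factorial(m - 1) * math.factorial(D - m))
    return coeff * p ** (D - m) * (1 - p) ** (m - 1) * f(x)


# Hypothetical small case: D = 20 points, m = 3, evaluated at x = 5.
x, D, m, h = 5.0, 20, 3, 1e-5
finite_diff = (cdf_mth_largest(x + h, D, m) - cdf_mth_largest(x - h, D, m)) / (2 * h)
print("finite-difference derivative:", finite_diff)
print("closed-form density:         ", pdf_mth_largest(x, D, m))
```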

I then use that, probably through numerical integration with Mathematica, to obtain the mean and variance. (Hopefully (D \choose i) won't involve numbers too large for it to handle.)
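One way to sidestep very large binomial coefficients, sketched here under assumptions: work with the logarithm of the density, drop the constant coefficient altogether, and normalize numerically. The example below does this with a plain trapezoid rule. It assumes 2 degrees of freedom (so the tail is exp(-x/2)), uses hypothetical values D = 10000 and m = 50, and picks the integration window by a rough rule of thumb; none of these come from the thread.

```python
import math


def log_kernel(x, D, m):
    """Unnormalised log-density of the m-th largest of D points,
    ASSUMING chi-squared data with 2 degrees of freedom."""
    tail = math.exp(-x / 2.0)  # 1 - F(x)
    # (D-m) log F + (m-1) log(1-F) + log f, with constants dropped.
    return (D - m) * math.log1p(-tail) + (m - 1) * (-x / 2.0) + (-x / 2.0)


def moments_mth_largest(D, m, points=20001):
    """Mean and variance by trapezoid integration of the normalised density."""
    centre = -2.0 * math.log(m / D)  # where 1 - F(x) = m / D
    width = 2.0 / math.sqrt(m)       # rough scale, used only to set the window
    lo = max(centre - 10.0 * width, 1e-12)
    hi = centre + 20.0 * width
    step = (hi - lo) / (points - 1)
    xs = [lo + i * step for i in range(points)]
    logs = [log_kernel(x, D, m) for x in xs]
    peak = max(logs)                 # subtract the peak to avoid underflow
    weights = [math.exp(v - peak) for v in logs]
    weights[0] *= 0.5                # trapezoid end weights
    weights[-1] *= 0.5
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / total
    return mean, var


mean, var = moments_mth_largest(D=10000, m=50)
print("mean:", mean)
print("variance:", var)
```

For other degrees of freedom the tail and pdf would need a regularized incomplete gamma function, which is not in the Python standard library.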

If there's no simpler method, and if that's what I need to do, I don't see how to obtain a Normal approximation for that. Is there somewhere I can use the Central Limit Theorem that I haven't spotted?
 
  • #7
jamie_m said:
I don't see how to obtain a Normal approximation for that.

I don't know anything about this problem, so my remarks are naive questions - not attempts to give some subtle hint about how to solve it.

How much of the asymptotic theory of order statistics is based on approximating the binomial distribution as a normal distribution?

Suppose the cdf for one sample of the random variable is F(x) and the pdf is f(x). We ask: "In N samples, what is the probability that the m-th largest value is v?"

One might hope that the probability can be computed as the product of two factors: (probability that one sample is v)(probability that, in the remaining samples, m-1 are greater than v and N-m are less). The second factor could be viewed as a computation from a binomial distribution of N-1 samples, with probability F(v) of success.

Unlike the usual scenario for approximating the binomial by a normal, this approximation would involve approximating the density of the mth order statistic by a different normal distribution at each possible value for it. (It also involves a problematic event in the first factor, but often one can get away with misinterpreting f(v) that way.)
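A related sketch, working with the cdf rather than the density: the m-th largest of N samples is at most v exactly when at most m-1 samples exceed v, so its cdf is a binomial probability with success probability p = 1 - F(v). The code compares that exact value with the usual normal approximation to the binomial (with continuity correction), evaluated where N·p = m. The value N = 100000 and the function names are hypothetical, and no particular F is assumed because only p enters.

```python
import math
from statistics import NormalDist


def log_binom_pmf(k, n, p):
    """log P(Binomial(n, p) = k), computed with lgamma to avoid huge coefficients."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def exact_cdf_mth_largest(n, m, p_exceed):
    """P(m-th largest <= v) = P(at most m-1 of n samples exceed v)."""
    return sum(math.exp(log_binom_pmf(k, n, p_exceed)) for k in range(m))


def normal_cdf_mth_largest(n, m, p_exceed):
    """Normal approximation to the same binomial probability."""
    mu = n * p_exceed
    sigma = math.sqrt(n * p_exceed * (1 - p_exceed))
    return NormalDist().cdf((m - 0.5 - mu) / sigma)


N = 100_000
for m in (1, 10, 100, 1000):
    p = m / N  # threshold chosen so that the expected number of exceedances is m
    exact = exact_cdf_mth_largest(N, m, p)
    approx = normal_cdf_mth_largest(N, m, p)
    print(f"m={m:5d}  exact={exact:.5f}  normal={approx:.5f}  diff={abs(exact - approx):.5f}")
```

In this setup the discrepancy is driven mainly by m rather than by N, which suggests looking at how large m is, not only at m/N, when judging the normal approximation.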

I have no trouble finding things on the web about the results of extreme value theory, but I haven't found any good explanations of how these results are derived.
 

1. What is Extreme Value Theory?

Extreme Value Theory (EVT) is a branch of statistics that deals with the analysis of extreme or rare events. It provides a framework for modeling and predicting the behavior of extreme values in a dataset.

2. What are i.i.d. order statistics?

I.i.d. order statistics are the values of a sample of independent and identically distributed (i.i.d.) random variables, arranged in ascending or descending order. The largest and smallest of these are the extreme values that EVT analyzes.

3. What are limiting distributions in EVT?

In EVT, limiting distributions are the theoretical distributions that describe the behavior of suitably normalized extreme values as the sample size tends to infinity. They are used to estimate the probability of extreme events.

4. How is EVT applied in real-world scenarios?

EVT has various applications in fields such as finance, environmental science, and engineering. It can be used to model extreme weather events, predict stock market crashes, and estimate the maximum load a bridge can withstand, among others.

5. What are some limitations of Extreme Value Theory?

The classical results assume that the data are independent and identically distributed, which may not always be the case in real-world scenarios, and that the underlying distribution lies in the domain of attraction of one of the extreme value distributions. Additionally, because the results are asymptotic, EVT may not be suitable for analyzing extremes in small datasets.
