Quantiles on a stream of real numbers

Cristiano · Dec 27, 2015

I need to calculate some quantiles for a sample of 10⁸ real numbers with unknown mean and unknown variance.
I currently store and sort those numbers, but I would try a streaming method where the numbers are not stored.
In a paper is written: "If the size of the input stream, N is known, then the following simple algorithm can compute a random sample of size k in one-pass: choose each element independently with probability k/N to include in the sample."; please, could somebody tell me what it means?
I need a procedure to retain the best possible accuracy, any suggestion?

FactChecker · Dec 27, 2015

It means that as the raw data is examined, one at a time, you generate a uniform random number between 0 and 1. If the random number is less than k/N, store the number. Otherwise, do not store it. That will store an unbiased set of about k numbers (on average).

Cristiano · Dec 27, 2015

Then I will calculate the quantile for that subsample of size k, right?

FactChecker · Dec 27, 2015

Cristiano said:

Then I will calculate the quantile for that subsample of size k, right?

Yes. But you will not always get exactly k numbers. It is only probable that you will get approximately k numbers. So keep track of how many you really get and use that number for determining the quantiles.

Cristiano · Dec 27, 2015

I won't use that method, because I generate 10⁸ numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I understood that method; thank you for the clarification, but I need a method that uses all the 10⁸ numbers.

FactChecker · Dec 27, 2015

Cristiano said:

I won't use that method, because I generate 10⁸ numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I understood that method; thank you for the clarification, but I need a method that uses all the 10⁸ numbers.

ok. That's a lot of accuracy.

Stephen Tashi · Dec 27, 2015

Perhaps something like this will do: http://arxiv.org/pdf/1007.1032.pdf

Cristiano · Dec 27, 2015

That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7) and any partitioning or subsampling method doesn't work for me. I need a streaming procedure that takes all the 10^8 numbers.

FactChecker · Dec 27, 2015

Cristiano said:

That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7) and any partitioning or subsampling method doesn't work for me. I need a streaming procedure that takes all the 10^8 numbers.

Then you are talking about making a very accurate cumulative density function. You might as well sort it and just save the whole thing to look up as needed. If there is any random aspect to it, another repeat sample would probably give a very different answer. Random luck would have a great impact on the bottom 10 values of 10^8 data elements.

Cristiano · Dec 27, 2015

FactChecker said:

Then you are talking about making a very accurate cumulative density function.

If we consider that I calculate thousands of values for each quantile (to calculate its mean) and that each quantile is obtained from 10⁸ random numbers, I'd say that the result is only reasonably accurate; I get only 2 or 3 significant digits for the extreme quantiles and 4 or 5 for the other quantiles.

FactChecker said:

You might as well sort it and just save the whole thing to look up as needed.

I need to save the whole simulation because I can resume the simulation to improve the result.
The problem in storing 10⁸ numbers is only while the simulation is running (they take a huge amount of memory).

FactChecker said:

If there is any random aspect to it, another repeat sample would probably give a very different answer. Random luck would have a great impact on the bottom 10 values of 10^8 data elements.

You're right, for that reason I'm trying to find a method that doesn't store the values. BTW, the result obtained from 10⁸ numbers seems fairly good.

Stephen Tashi · Dec 27, 2015

Cristiano said:

I need a streaming procedure that takes all the 10^8 numbers.

What's the definition of a "streaming procedure"?

If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. How to do that is more a computer science question than a question of mathematics. There are various "merge sort" algorithms.

Cristiano · Dec 28, 2015

Stephen Tashi said:

What's the definition of a "streaming procedure"?

I don't know the exact definition, but I mean something like this:
https://software.intel.com/en-us/node/497936

Stephen Tashi said:

If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. [...].

The program would be terribly slow; all the data must reside in RAM. I prefer a "small" but reasonably good number of random numbers in RAM.

pbuk · Dec 29, 2015

Cristiano said:

I won't use that method, because I generate 10⁸ numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I understood that method; thank you for the clarification, but I need a method that uses all the 10⁸ numbers.

If you get any material difference between sub-samples of 100,000 or even 10,000 points then either your distribution is not constant or it is not smooth enough for this kind of analysis.

Cristiano said:

That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7) and any partitioning or subsampling method doesn't work for me. I need a streaming procedure that takes all the 10^8 numbers.

You want to sort 10⁸ numbers into 10⁷ bins? What could possibly be the point?

Cristiano said:

If we consider that I calculate thousands of values for each quantile (to calculate its mean) and that each quantile is obtained from 10⁸ random numbers, I'd say that the result is only reasonably accurate; I get only 2 or 3 significant digits for the extreme quantiles and 4 or 5 for the other quantiles.

You should realize from this that what you are doing doesn't make sense.

Cristiano said:

I need to save the whole simulation because I can resume the simulation to improve the result.
The problem in storing 10⁸ numbers is only while the simulation is running (they take a huge amount of memory).

I don't see the problem - even if it takes 8 bytes to store each value and 8 bytes for a pointer you would only need 1.5GB for 10⁸ values.

pbuk · Dec 29, 2015

Stephen Tashi said:

What's the definition of a "streaming procedure"?

If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. How to do that is more a computer science question than a question of mathematics. There are various "merge sort" algorithms.

That would be impossibly slow. If you really want to generate an ordered list on the fly you need some sort of binary tree algorithm - the easiest way to access this would probably be to use one built into a SQL implementation.

If I wanted to do this I would stream the data to a text file (a dozen lines of code), import the file into MySQL (a couple of minutes in phpMyAdmin) and then add an index. You can then do whatever analysis you want with SELECT ... LIMIT 1 OFFSET ... code

Cristiano · Dec 29, 2015

MrAnchovy said:

If you get any material difference between sub-samples of 100,000 or even 10,000 points then either your distribution is not constant or it is not smooth enough for this kind of analysis.

If I use only 100,000 numbers how can I calculate the quantile 10^-7?

MrAnchovy said:

You want to sort 10⁸ numbers into 10⁷ bins? What could possibly be the point?

There are no bins in my procedure. I sort 10^8 numbers to calculate the quantiles of interest.

MrAnchovy said:

You should realize from this that what you are doing doesn't make sense.

What do you mean?

MrAnchovy said:

I don't see the problem - even if it takes 8 bytes to store each value and 8 bytes for a pointer you would only need 1.5GB for 10⁸ values.

I calculate 3 parameters simultaneously and my program is multi-threaded, hence I need 8 x 10^8 bytes for 1 parameter and 1 thread (how did you get 1.5 GB?) x 3 parameters x 8 threads= 17.9 GB, but I only have 16 GB, then I run only 5 threads (11.2 GB).

pbuk · Dec 29, 2015

Cristiano said:

If I use only 100,000 numbers how can I calculate the quantile 10^-7?

By interpolation from 10⁵ or 10⁴ quantiles.

Cristiano said:

There are no bins in my procedure. I sort 10^8 numbers to calculate the quantiles of interest.

Bins = quantiles

Cristiano said:

What do you mean?

If you end up with a precision ≈10^-4 what is the point in working with an intermediate calculation at a precision ≈10^-7?

Cristiano said:

I calculate 3 parameters simultaneously

Now I'm even more confused. Are you not interested in the correlations among these three paramaters? What can you hope to learn from summarising each of them by quantiles independently?

Cristiano said:

... hence I need 8 x 10^8 bytes for 1 parameter and 1 thread (how did you get 1.5 GB?)

I allowed for 1 paramater (because that's all you mentioned) and I also allowed for a 64 bit pointer in case that is required by the data structure you are using. (8 + 8) x 10⁸ = 1.6 x 10⁹ ≈ 1.5 GB.

Cristiano said:

x 8 threads

So you are collecting data for each thread separately then adding them together? Rewrite this - e.g. use 7 threads for the simulation each writing to a stream and run a separate process to collate the streams.

But if I were you before I did anything else I would talk through what I was trying to achieve with someone else - are you in an academic institution or employment or is this just recreation?

Cristiano · Dec 29, 2015

MrAnchovy said:

By interpolation from 10⁵ or 10⁴ quantiles.
[...]
If you end up with a precision ≈10^-4 what is the point in working with an intermediate calculation at a precision ≈10^-7?

Please, would you elaborate a bit those two points?

MrAnchovy said:

Now I'm even more confused. Are you not interested in the correlations among these three paramaters? [...]

No, is just to improve the program efficiency, but it is totally OT here.

My question is very easy; I need to do what I asked in my first post. Rephrasing a bit: I have many real random numbers with unknown distribution and I need to calculate a given quantile; is there any way to do that without storing all the numbers and retaining the best possible accuracy?

pbuk · Dec 29, 2015

Cristiano said:

Please, would you elaborate a bit those two points?

Google "interpolation" and "false precision".

Cristiano said:

My question is very easy; I need to do what I asked in my first post. Rephrasing a bit: I have many real random numbers with unknown distribution

OK, what do you want to do with them - investigate them to try to deduce a distribution? Model them for a game or simulation? Whatever you want to do I'm pretty sure that sorting them into 10⁷ quantiles is not going to help.

Cristiano said:

and I need to calculate a given quantile; is there any way to do that without storing all the numbers and retaining the best possible accuracy?

Not that I know of.

Cristiano · Dec 29, 2015

MrAnchovy said:

OK, what do you want to do with them - investigate them to try to deduce a distribution? Model them for a game or simulation?

I'll calculate a test statistics for some statistical tests (K-S, A-D, ...): http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
and then I'll compare that statistics with the critical values obtained from the current simulation to get the corresponding p-value (I interpolate the values using a cubic spline).

MrAnchovy said:

Whatever you want to do I'm pretty sure that sorting them into 10⁷ quantiles is not going to help.

I'm reasonably sure to understand the phrase (but with my "English" I can't be sure), but I still don't understand what you mean.
I just sort all the 10⁸ numbers and then I use the formula R-2, SAS-5: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
to calculate the wanted quantiles.

MrAnchovy said:

Not that I know of.

No problem, my current simulation is reasonably good. Thank you anyway.

Quantiles on a stream of real numbers

Similar threads

Hot Threads

B A Little Probability Puzzle

I Need help solving this Existence Algorithm for truth

A Does this computation satisfy LTL formulas?

A Prove that points which are indistinguishable from 0 exist (using logic)

A Mathematical Connection between Cosmic Expansion and Exponential Growth

Recent Insights

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers

Insights Fermat's Last Theorem

Insights Why Vector Spaces Explain The World: A Historical Perspective