# Quantiles on a stream of real numbers

I need to calculate some quantiles for a sample of 10^8 real numbers with unknown mean and unknown variance.
I currently store and sort those numbers, but I would like to try a streaming method in which the numbers are not stored.
A paper states: "If the size of the input stream, N, is known, then the following simple algorithm can compute a random sample of size k in one pass: choose each element independently with probability k/N to include in the sample." Could somebody please tell me what this means?
I also need a procedure that retains the best possible accuracy; any suggestions?

FactChecker
Gold Member
It means that as the raw data are examined, one element at a time, you generate a uniform random number between 0 and 1. If the random number is less than k/N, store the element; otherwise, do not. That yields an unbiased set of about k numbers (on average).
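This rule can be sketched in a few lines of Python; the values of `k`, `n`, and the normal stream here are illustrative stand-ins, not anything from the thread:

```python
import random

def bernoulli_sample(stream, k, n):
    """Keep each of the n stream elements independently with
    probability k/n; the number kept is Binomial(n, k/n), so it
    is only about k on average, not exactly k."""
    p = k / n
    return [x for x in stream if random.random() < p]

# Illustrative stream: one million standard normal draws.
random.seed(0)
stream = (random.gauss(0.0, 1.0) for _ in range(10**6))
sample = bernoulli_sample(stream, k=1000, n=10**6)
print(len(sample))  # about 1000, rarely exactly 1000
```

When N is not known in advance, reservoir sampling is the standard one-pass alternative that returns exactly k elements.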

Then I will calculate the quantile for that subsample of size k, right?

FactChecker
Gold Member
Then I will calculate the quantile for that subsample of size k, right?
Yes. But you will not always get exactly k numbers; you will only get approximately k of them. So keep track of how many you really get and use that count when determining the quantiles.

I won't use that method, because I generate 10^8 numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I had understood that method; thank you for the clarification, but I need a method that uses all the 10^8 numbers.

FactChecker
Gold Member
I won't use that method, because I generate 10^8 numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I had understood that method; thank you for the clarification, but I need a method that uses all the 10^8 numbers.
ok. That's a lot of accuracy.

That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7), and partitioning or subsampling methods don't work for me. I need a streaming procedure that takes all the 10^8 numbers.

FactChecker
Gold Member
That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7), and partitioning or subsampling methods don't work for me. I need a streaming procedure that takes all the 10^8 numbers.
Then you are talking about making a very accurate cumulative distribution function. You might as well sort the data and just save the whole thing to look up as needed. If there is any random aspect to it, a repeat sample would probably give a very different answer. Random luck has a great impact on the bottom 10 values of 10^8 data elements.

Then you are talking about making a very accurate cumulative distribution function.

If we consider that I calculate thousands of values for each quantile (to calculate its mean) and that each quantile is obtained from 10^8 random numbers, I'd say that the result is only reasonably accurate; I get only 2 or 3 significant digits for the extreme quantiles and 4 or 5 for the other quantiles.

You might as well sort it and just save the whole thing to look up as needed.

I need to save the whole simulation because I can resume the simulation to improve the result.
The problem in storing 10^8 numbers is only while the simulation is running (they take a huge amount of memory).

If there is any random aspect to it, a repeat sample would probably give a very different answer. Random luck has a great impact on the bottom 10 values of 10^8 data elements.

You're right; for that reason I'm trying to find a method that doesn't store the values. By the way, the result obtained from 10^8 numbers seems fairly good.

Stephen Tashi
I need a streaming procedure that takes all the 10^8 numbers.

What's the definition of a "streaming procedure"?

If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. How to do that is more a computer science question than a question of mathematics. There are various "merge sort" algorithms.

What's the definition of a "streaming procedure"?

I don't know the exact definition, but I mean something like this:
https://software.intel.com/en-us/node/497936 [Broken]

If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. [...].

The program would be terribly slow; all the data must reside in RAM. I prefer to keep a "small" but reasonably large set of random numbers in RAM.

pbuk
Gold Member
I won't use that method, because I generate 10^8 numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I had understood that method; thank you for the clarification, but I need a method that uses all the 10^8 numbers.
If you get any material difference between sub-samples of 100,000 or even 10,000 points then either your distribution is not constant or it is not smooth enough for this kind of analysis.
That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7), and partitioning or subsampling methods don't work for me. I need a streaming procedure that takes all the 10^8 numbers.
You want to sort 10^8 numbers into 10^7 bins? What could possibly be the point?
If we consider that I calculate thousands of values for each quantile (to calculate its mean) and that each quantile is obtained from 10^8 random numbers, I'd say that the result is only reasonably accurate; I get only 2 or 3 significant digits for the extreme quantiles and 4 or 5 for the other quantiles.
You should realise from this that what you are doing doesn't make sense.
I need to save the whole simulation because I can resume the simulation to improve the result.
The problem in storing 10^8 numbers is only while the simulation is running (they take a huge amount of memory).
I don't see the problem - even if it takes 8 bytes to store each value and 8 bytes for a pointer you would only need 1.5 GB for 10^8 values.

pbuk
Gold Member
What's the definition of a "streaming procedure"?

If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. How to do that is more a computer science question than a question of mathematics. There are various "merge sort" algorithms.
That would be impossibly slow. If you really want to generate an ordered list on the fly you need some sort of binary tree algorithm - the easiest way to access this would probably be to use one built in to a SQL implementation.

If I wanted to do this I would stream the data to a text file (a dozen lines of code), import the file into MySQL (a couple of minutes in phpMyAdmin) and then add an index. You can then do whatever analysis you want with SELECT ... LIMIT 1 OFFSET ... code
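The same workflow can be sketched with Python's built-in SQLite in place of MySQL; the table name, column name, and normal data are made up for illustration, and the `SELECT ... LIMIT 1 OFFSET ...` pattern is the one described above:

```python
import random
import sqlite3

# In-memory SQLite stands in for the MySQL table described above.
random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (value REAL)")
conn.executemany(
    "INSERT INTO samples (value) VALUES (?)",
    ((random.gauss(0.0, 1.0),) for _ in range(100_000)),
)
conn.execute("CREATE INDEX idx_value ON samples (value)")  # the index keeps lookups ordered

def quantile(p, n=100_000):
    # k-th smallest value via the ordered index: SELECT ... LIMIT 1 OFFSET ...
    offset = max(0, min(n - 1, int(p * n)))
    row = conn.execute(
        "SELECT value FROM samples ORDER BY value LIMIT 1 OFFSET ?", (offset,)
    ).fetchone()
    return row[0]

print(quantile(0.5))   # near the true median of a standard normal
print(quantile(0.99))  # upper-tail quantile
```

For 10^8 rows a real disk-backed database would be needed, but the queries would look the same.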

If you get any material difference between sub-samples of 100,000 or even 10,000 points then either your distribution is not constant or it is not smooth enough for this kind of analysis.

If I use only 100,000 numbers, how can I calculate the 10^-7 quantile?

You want to sort 10^8 numbers into 10^7 bins? What could possibly be the point?

There are no bins in my procedure. I sort 10^8 numbers to calculate the quantiles of interest.

You should realise from this that what you are doing doesn't make sense.

What do you mean?

I don't see the problem - even if it takes 8 bytes to store each value and 8 bytes for a pointer you would only need 1.5 GB for 10^8 values.

I calculate 3 parameters simultaneously and my program is multi-threaded, hence I need 8 × 10^8 bytes for 1 parameter and 1 thread (how did you get 1.5 GB?). Times 3 parameters and 8 threads, that is 17.9 GB, but I only have 16 GB, so I run only 5 threads (11.2 GB).

pbuk
Gold Member
If I use only 100,000 numbers, how can I calculate the 10^-7 quantile?
By interpolation from 10^5 or 10^4 quantiles.
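One way to read this suggestion, sketched in Python with a made-up Exp(1) sample: tabulate quantiles on a coarse grid and interpolate between them. Note that reaching p ≈ 10^-7 from such a grid is extrapolation beyond the lowest tabulated point and would need a tail model that is not shown here:

```python
import bisect
import random

random.seed(1)
data = sorted(random.expovariate(1.0) for _ in range(100_000))
n = len(data)

# Tabulate quantiles on a grid of 1,000 probabilities.
grid_p = [(i + 0.5) / 1000 for i in range(1000)]
grid_q = [data[int(p * n)] for p in grid_p]

def interp_quantile(p):
    """Linear interpolation between tabulated quantiles.
    For p below grid_p[0] this just clamps to the lowest
    tabulated quantile; a tail model would be needed there."""
    i = bisect.bisect_left(grid_p, p)
    if i == 0:
        return grid_q[0]
    if i == len(grid_p):
        return grid_q[-1]
    p0, p1 = grid_p[i - 1], grid_p[i]
    q0, q1 = grid_q[i - 1], grid_q[i]
    return q0 + (q1 - q0) * (p - p0) / (p1 - p0)

print(interp_quantile(0.5))  # near ln(2) ≈ 0.693 for Exp(1)
```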
There are no bins in my procedure. I sort 10^8 numbers to calculate the quantiles of interest.
Bins = quantiles
What do you mean?
If you end up with a precision ≈ 10^-4, what is the point of working with an intermediate calculation at a precision ≈ 10^-7?
I calculate 3 parameters simultaneously
Now I'm even more confused. Are you not interested in the correlations among these three parameters? What can you hope to learn from summarising each of them by quantiles independently?
... hence I need 8 x 10^8 bytes for 1 parameter and 1 thread (how did you get 1.5 GB?)
I allowed for 1 parameter (because that's all you mentioned) and I also allowed for a 64-bit pointer in case that is required by the data structure you are using: (8 + 8) × 10^8 = 1.6 × 10^9 bytes ≈ 1.5 GB.
So you are collecting data for each thread separately and then adding them together? Rewrite this - e.g. use 7 threads for the simulation, each writing to a stream, and run a separate process to collate the streams.

But if I were you before I did anything else I would talk through what I was trying to achieve with someone else - are you in an academic institution or employment or is this just recreation?

By interpolation from 10^5 or 10^4 quantiles.
[...]
If you end up with a precision ≈ 10^-4, what is the point of working with an intermediate calculation at a precision ≈ 10^-7?

Please, would you elaborate a bit those two points?

Now I'm even more confused. Are you not interested in the correlations among these three parameters? [...]

No, it is just to improve the program's efficiency, but that is totally off-topic here.

My question is very simple; I need to do what I asked in my first post. Rephrasing a bit: I have many real random numbers with unknown distribution and I need to calculate a given quantile; is there any way to do that without storing all the numbers while retaining the best possible accuracy?

pbuk
Gold Member
Please, would you elaborate a bit those two points?

My question is very easy; I need to do what I asked in my first post. Rephrasing a bit: I have many real random numbers with unknown distribution
OK, what do you want to do with them - investigate them to try to deduce a distribution? Model them for a game or simulation? Whatever you want to do, I'm pretty sure that sorting them into 10^7 quantiles is not going to help.
and I need to calculate a given quantile; is there any way to do that without storing all the numbers while retaining the best possible accuracy?
Not that I know of.

OK, what do you want to do with them - investigate them to try to deduce a distribution? Model them for a game or simulation?

I'll calculate a test statistic for some statistical tests (K-S, A-D, ...): http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
and then I'll compare that statistic with the critical values obtained from the current simulation to get the corresponding p-value (I interpolate the values using a cubic spline).
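The look-up described here might look like the following sketch; the table of critical values is invented for illustration, and plain linear interpolation stands in for the cubic spline to keep the sketch dependency-free:

```python
import bisect

# Hypothetical table: critical values of a test statistic (ascending)
# and the corresponding upper-tail probabilities. Real values would
# come from the simulation; these numbers are made up.
crit = [0.91, 1.07, 1.22, 1.52, 1.95]
pvals = [0.20, 0.10, 0.05, 0.01, 0.001]

def p_value(stat):
    """Interpolate the p-value for an observed statistic from the
    tabulated critical values, clamping outside the table."""
    if stat <= crit[0]:
        return pvals[0]
    if stat >= crit[-1]:
        return pvals[-1]
    i = bisect.bisect_left(crit, stat)
    v0, v1 = crit[i - 1], crit[i]
    p0, p1 = pvals[i - 1], pvals[i]
    return p0 + (p1 - p0) * (stat - v0) / (v1 - v0)
```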

Whatever you want to do, I'm pretty sure that sorting them into 10^7 quantiles is not going to help.

I'm reasonably sure I understand the sentence (though with my "English" I can't be certain), but I still don't understand what you mean.
I just sort all the 10^8 numbers and then use formula R-2 (SAS-5) from https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
to calculate the wanted quantiles.
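The R-2 (SAS-5) estimator referenced above can be written directly from that definition: with h = Np + 1/2, Q(p) = (x⌈h−1/2⌉ + x⌊h+1/2⌋)/2 for a 1-based sorted sample. The normal sample below is only an illustration:

```python
import math
import random

def quantile_r2(sorted_x, p):
    """Quantile estimate R-2 (SAS-5): with h = N*p + 1/2,
    Q(p) = (x[ceil(h - 1/2)] + x[floor(h + 1/2)]) / 2 (1-based).
    Indices are clamped to stay inside the sample."""
    n = len(sorted_x)
    lo = max(1, min(n, math.ceil(n * p)))       # ceil(h - 1/2)
    hi = max(1, min(n, math.floor(n * p) + 1))  # floor(h + 1/2)
    return 0.5 * (sorted_x[lo - 1] + sorted_x[hi - 1])

random.seed(0)
xs = sorted(random.gauss(0.0, 1.0) for _ in range(100_000))
print(quantile_r2(xs, 0.5))    # near 0 for a standard normal
print(quantile_r2(xs, 1e-4))   # deep lower-tail quantile
```

The 10^-4 quantile of a 10^5-point sample already rests on only the bottom ~10 order statistics, which illustrates the sampling-noise concern raised earlier in the thread.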

Not that I know of.

No problem, my current simulation is reasonably good. Thank you anyway.