Quantiles on a stream of real numbers

In summary, the conversation revolves around finding a method to calculate quantiles for a sample of 108 real numbers. The person is considering using a streaming method that does not store the numbers, but instead generates a random sample of size k by choosing each element independently with probability k/N. However, this method may not always give exactly k numbers, so it may not be accurate for extreme quantiles. The alternative suggestion is to sort and save all 108 numbers for accuracy, but this may not be feasible due to memory constraints. The conversation also discusses the possibility of using a streaming procedure that takes all 10^8 numbers, but there is uncertainty about what this means and how it would work. There is also a debate about the effectiveness and
  • #1
Cristiano
33
0
I need to calculate some quantiles for a sample of 108 real numbers with unknown mean and unknown variance.
I currently store and sort those numbers, but I would try a streaming method where the numbers are not stored.
In a paper is written: "If the size of the input stream, N is known, then the following simple algorithm can compute a random sample of size k in one-pass: choose each element independently with probability k/N to include in the sample."; please, could somebody tell me what it means?
I need a procedure to retain the best possible accuracy, any suggestion?
 
Physics news on Phys.org
  • #2
It means that as the raw data is examined, one at a time, you generate a uniform random number between 0 and 1. If the random number is less than k/N, store the number. Otherwise, do not store it. That will store an unbiased set of about k numbers (on average).
 
  • #3
Then I will calculate the quantile for that subsample of size k, right?
 
  • #4
Cristiano said:
Then I will calculate the quantile for that subsample of size k, right?
Yes. But you will not always get exactly k numbers. It is only probable that you will get approximately k numbers. So keep track of how many you really get and use that number for determining the quantiles.
 
  • #5
I won't use that method, because I generate 108 numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I understood that method; thank you for the clarification, but I need a method that uses all the 108 numbers.
 
  • #6
Cristiano said:
I won't use that method, because I generate 108 numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I understood that method; thank you for the clarification, but I need a method that uses all the 108 numbers.
ok. That's a lot of accuracy.
 
  • #8
That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7) and any partitioning or subsampling method doesn't work for me. I need a streaming procedure that takes all the 10^8 numbers.
 
  • #9
Cristiano said:
That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7) and any partitioning or subsampling method doesn't work for me. I need a streaming procedure that takes all the 10^8 numbers.
Then you are talking about making a very accurate cumulative density function. You might as well sort it and just save the whole thing to look up as needed. If there is any random aspect to it, another repeat sample would probably give a very different answer. Random luck would have a great impact on the bottom 10 values of 10^8 data elements.
 
  • Like
Likes pbuk
  • #10
FactChecker said:
Then you are talking about making a very accurate cumulative density function.

If we consider that I calculate thousands of values for each quantile (to calculate its mean) and that each quantile is obtained from 108 random numbers, I'd say that the result is only reasonably accurate; I get only 2 or 3 significant digits for the extreme quantiles and 4 or 5 for the other quantiles.

FactChecker said:
You might as well sort it and just save the whole thing to look up as needed.

I need to save the whole simulation because I can resume the simulation to improve the result.
The problem in storing 108 numbers is only while the simulation is running (they take a huge amount of memory).

FactChecker said:
If there is any random aspect to it, another repeat sample would probably give a very different answer. Random luck would have a great impact on the bottom 10 values of 10^8 data elements.

You're right, for that reason I'm trying to find a method that doesn't store the values. BTW, the result obtained from 108 numbers seems fairly good.
 
  • #11
Cristiano said:
I need a streaming procedure that takes all the 10^8 numbers.

What's the definition of a "streaming procedure"?

If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. How to do that is more a computer science question than a question of mathematics. There are various "merge sort" algorithms.
 
  • #12
Stephen Tashi said:
What's the definition of a "streaming procedure"?

I don't know the exact definition, but I mean something like this:
https://software.intel.com/en-us/node/497936

Stephen Tashi said:
If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. [...].

The program would be terribly slow; all the data must reside in RAM. I prefer a "small" but reasonably good number of random numbers in RAM.
 
Last edited by a moderator:
  • #13
Cristiano said:
I won't use that method, because I generate 108 numbers to retain the best possible accuracy; I cannot use a subsample.
I asked just to be sure that I understood that method; thank you for the clarification, but I need a method that uses all the 108 numbers.
If you get any material difference between sub-samples of 100,000 or even 10,000 points then either your distribution is not constant or it is not smooth enough for this kind of analysis.
Cristiano said:
That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7) and any partitioning or subsampling method doesn't work for me. I need a streaming procedure that takes all the 10^8 numbers.
You want to sort 108 numbers into 107 bins? What could possibly be the point?
Cristiano said:
If we consider that I calculate thousands of values for each quantile (to calculate its mean) and that each quantile is obtained from 108 random numbers, I'd say that the result is only reasonably accurate; I get only 2 or 3 significant digits for the extreme quantiles and 4 or 5 for the other quantiles.
You should realize from this that what you are doing doesn't make sense.
Cristiano said:
I need to save the whole simulation because I can resume the simulation to improve the result.
The problem in storing 108 numbers is only while the simulation is running (they take a huge amount of memory).
I don't see the problem - even if it takes 8 bytes to store each value and 8 bytes for a pointer you would only need 1.5GB for 108 values.
 
  • #14
Stephen Tashi said:
What's the definition of a "streaming procedure"?

If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. How to do that is more a computer science question than a question of mathematics. There are various "merge sort" algorithms.
That would be impossibly slow. If you really want to generate an ordered list on the fly you need some sort of binary tree algorithm - the easiest way to access this would probably be to use one built into a SQL implementation.

If I wanted to do this I would stream the data to a text file (a dozen lines of code), import the file into MySQL (a couple of minutes in phpMyAdmin) and then add an index. You can then do whatever analysis you want with SELECT ... LIMIT 1 OFFSET ... code
 
  • #15
MrAnchovy said:
If you get any material difference between sub-samples of 100,000 or even 10,000 points then either your distribution is not constant or it is not smooth enough for this kind of analysis.

If I use only 100,000 numbers how can I calculate the quantile 10^-7?

MrAnchovy said:
You want to sort 108 numbers into 107 bins? What could possibly be the point?

There are no bins in my procedure. I sort 10^8 numbers to calculate the quantiles of interest.

MrAnchovy said:
You should realize from this that what you are doing doesn't make sense.

What do you mean?

MrAnchovy said:
I don't see the problem - even if it takes 8 bytes to store each value and 8 bytes for a pointer you would only need 1.5GB for 108 values.

I calculate 3 parameters simultaneously and my program is multi-threaded, hence I need 8 x 10^8 bytes for 1 parameter and 1 thread (how did you get 1.5 GB?) x 3 parameters x 8 threads= 17.9 GB, but I only have 16 GB, then I run only 5 threads (11.2 GB).
 
  • #16
Cristiano said:
If I use only 100,000 numbers how can I calculate the quantile 10^-7?
By interpolation from 105 or 104 quantiles.
Cristiano said:
There are no bins in my procedure. I sort 10^8 numbers to calculate the quantiles of interest.
Bins = quantiles
Cristiano said:
What do you mean?
If you end up with a precision ≈10-4 what is the point in working with an intermediate calculation at a precision ≈10-7?
Cristiano said:
I calculate 3 parameters simultaneously
Now I'm even more confused. Are you not interested in the correlations among these three paramaters? What can you hope to learn from summarising each of them by quantiles independently?
Cristiano said:
... hence I need 8 x 10^8 bytes for 1 parameter and 1 thread (how did you get 1.5 GB?)
I allowed for 1 paramater (because that's all you mentioned) and I also allowed for a 64 bit pointer in case that is required by the data structure you are using. (8 + 8) x 108 = 1.6 x 109 ≈ 1.5 GB.
Cristiano said:
x 8 threads
So you are collecting data for each thread separately then adding them together? Rewrite this - e.g. use 7 threads for the simulation each writing to a stream and run a separate process to collate the streams.

But if I were you before I did anything else I would talk through what I was trying to achieve with someone else - are you in an academic institution or employment or is this just recreation?
 
  • #17
MrAnchovy said:
By interpolation from 105 or 104 quantiles.
[...]
If you end up with a precision ≈10-4 what is the point in working with an intermediate calculation at a precision ≈10-7?

Please, would you elaborate a bit those two points?

MrAnchovy said:
Now I'm even more confused. Are you not interested in the correlations among these three paramaters? [...]

No, is just to improve the program efficiency, but it is totally OT here.

My question is very easy; I need to do what I asked in my first post. Rephrasing a bit: I have many real random numbers with unknown distribution and I need to calculate a given quantile; is there any way to do that without storing all the numbers and retaining the best possible accuracy?
 
  • #18
Cristiano said:
Please, would you elaborate a bit those two points?
Google "interpolation" and "false precision".

Cristiano said:
My question is very easy; I need to do what I asked in my first post. Rephrasing a bit: I have many real random numbers with unknown distribution
OK, what do you want to do with them - investigate them to try to deduce a distribution? Model them for a game or simulation? Whatever you want to do I'm pretty sure that sorting them into 107 quantiles is not going to help.
Cristiano said:
and I need to calculate a given quantile; is there any way to do that without storing all the numbers and retaining the best possible accuracy?
Not that I know of.
 
  • #19
MrAnchovy said:
OK, what do you want to do with them - investigate them to try to deduce a distribution? Model them for a game or simulation?

I'll calculate a test statistics for some statistical tests (K-S, A-D, ...): http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
and then I'll compare that statistics with the critical values obtained from the current simulation to get the corresponding p-value (I interpolate the values using a cubic spline).

MrAnchovy said:
Whatever you want to do I'm pretty sure that sorting them into 107 quantiles is not going to help.

I'm reasonably sure to understand the phrase (but with my "English" I can't be sure), but I still don't understand what you mean.
I just sort all the 108 numbers and then I use the formula R-2, SAS-5: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
to calculate the wanted quantiles.

MrAnchovy said:
Not that I know of.

No problem, my current simulation is reasonably good. Thank you anyway.
 

1. What are quantiles on a stream of real numbers?

Quantiles on a stream of real numbers are statistical measures that divide a dataset into equal-sized subsets. They are used to understand the distribution of the numbers in the dataset and are commonly used in probability and statistics to analyze data.

2. How are quantiles calculated on a stream of real numbers?

Quantiles on a stream of real numbers are calculated by sorting the numbers in the dataset in ascending order and then dividing them into equal-sized subsets. The position of a particular quantile is determined by its percentile value. For example, the median, or 50th percentile, would divide the dataset into two equal-sized subsets.

3. What is the significance of quantiles on a stream of real numbers?

Quantiles on a stream of real numbers provide valuable insights into the distribution of the data. They help to identify the central tendency, spread, and shape of the dataset, which can be useful for making decisions and drawing conclusions. They are also useful for identifying outliers and understanding the overall range of the data.

4. How are quantiles used in data analysis?

Quantiles are commonly used in data analysis to compare and contrast different datasets, assess the variability of the data, and identify patterns and trends. They are also used to create box plots, which provide a visual representation of the distribution of the data. Additionally, quantiles are often used in hypothesis testing and to calculate confidence intervals.

5. Are there different types of quantiles on a stream of real numbers?

Yes, there are different types of quantiles that can be calculated on a stream of real numbers. Some common types include quartiles, deciles, and percentiles. Each type represents a different percentage of the dataset, with quartiles representing 25%, deciles representing 10%, and percentiles representing any percentage. Other types of quantiles include quintiles, octiles, and non-quantiles, which divide the dataset into five, eight, and any number of equal-sized subsets, respectively.

Similar threads

  • Set Theory, Logic, Probability, Statistics
Replies
1
Views
923
  • Set Theory, Logic, Probability, Statistics
Replies
3
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
7
Views
477
  • Set Theory, Logic, Probability, Statistics
Replies
13
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
6
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
5
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
7
Views
4K
  • Set Theory, Logic, Probability, Statistics
Replies
4
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
0
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
846
Back
Top