Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Quantiles on a stream of real numbers

Tags:
  1. Dec 27, 2015 #1
    I need to calculate some quantiles for a sample of 108 real numbers with unknown mean and unknown variance.
    I currently store and sort those numbers, but I would try a streaming method where the numbers are not stored.
    In a paper is written: "If the size of the input stream, N is known, then the following simple algorithm can compute a random sample of size k in one-pass: choose each element independently with probability k/N to include in the sample."; please, could somebody tell me what it means?
    I need a procedure to retain the best possible accuracy, any suggestion?
     
  2. jcsd
  3. Dec 27, 2015 #2

    FactChecker

    User Avatar
    Science Advisor
    Gold Member

    It means that as the raw data is examined, one at a time, you generate a uniform random number between 0 and 1. If the random number is less than k/N, store the number. Otherwise, do not store it. That will store an unbiased set of about k numbers (on average).
     
  4. Dec 27, 2015 #3
    Then I will calculate the quantile for that subsample of size k, right?
     
  5. Dec 27, 2015 #4

    FactChecker

    User Avatar
    Science Advisor
    Gold Member

    Yes. But you will not always get exactly k numbers. It is only probable that you will get approximately k numbers. So keep track of how many you really get and use that number for determining the quantiles.
     
  6. Dec 27, 2015 #5
    I won't use that method, because I generate 108 numbers to retain the best possible accuracy; I cannot use a subsample.
    I asked just to be sure that I understood that method; thank you for the clarification, but I need a method that uses all the 108 numbers.
     
  7. Dec 27, 2015 #6

    FactChecker

    User Avatar
    Science Advisor
    Gold Member

    ok. That's a lot of accuracy.
     
  8. Dec 27, 2015 #7

    Stephen Tashi

    User Avatar
    Science Advisor

  9. Dec 27, 2015 #8
    That method is probably good for "central" quantiles, but I need to calculate extreme quantiles (down to 10^-7) and any partitioning or subsampling method doesn't work for me. I need a streaming procedure that takes all the 10^8 numbers.
     
  10. Dec 27, 2015 #9

    FactChecker

    User Avatar
    Science Advisor
    Gold Member

    Then you are talking about making a very accurate cumulative density function. You might as well sort it and just save the whole thing to look up as needed. If there is any random aspect to it, another repeat sample would probably give a very different answer. Random luck would have a great impact on the bottom 10 values of 10^8 data elements.
     
  11. Dec 27, 2015 #10
    If we consider that I calculate thousands of values for each quantile (to calculate its mean) and that each quantile is obtained from 108 random numbers, I'd say that the result is only reasonably accurate; I get only 2 or 3 significant digits for the extreme quantiles and 4 or 5 for the other quantiles.

    I need to save the whole simulation because I can resume the simulation to improve the result.
    The problem in storing 108 numbers is only while the simulation is running (they take a huge amount of memory).

    You're right, for that reason I'm trying to find a method that doesn't store the values. BTW, the result obtained from 108 numbers seems fairly good.
     
  12. Dec 27, 2015 #11

    Stephen Tashi

    User Avatar
    Science Advisor

    What's the definition of a "streaming procedure"?

    If you can't store all the samples in memory then perhaps you need an algorithm that builds a sorted file on disk and keeps it sorted as more data comes in. How to do that is more a computer science question than a question of mathematics. There are various "merge sort" algorithms.
     
  13. Dec 28, 2015 #12
    I don't know the exact definition, but I mean something like this:
    https://software.intel.com/en-us/node/497936 [Broken]

    The program would be terribly slow; all the data must reside in RAM. I prefer a "small" but reasonably good number of random numbers in RAM.
     
    Last edited by a moderator: May 7, 2017
  14. Dec 29, 2015 #13
    If you get any material difference between sub-samples of 100,000 or even 10,000 points then either your distribution is not constant or it is not smooth enough for this kind of analysis.
    You want to sort 108 numbers into 107 bins? What could possibly be the point?
    You should realise from this that what you are doing doesn't make sense.
    I don't see the problem - even if it takes 8 bytes to store each value and 8 bytes for a pointer you would only need 1.5GB for 108 values.
     
  15. Dec 29, 2015 #14
    That would be impossibly slow. If you really want to generate an ordered list on the fly you need some sort of binary tree algorithm - the easiest way to access this would probably be to use one built in to a SQL implementation.

    If I wanted to do this I would stream the data to a text file (a dozen lines of code), import the file into MySQL (a couple of minutes in phpMyAdmin) and then add an index. You can then do whatever analysis you want with SELECT ... LIMIT 1 OFFSET ... code
     
  16. Dec 29, 2015 #15
    If I use only 100,000 numbers how can I calculate the quantile 10^-7?

    There are no bins in my procedure. I sort 10^8 numbers to calculate the quantiles of interest.

    What do you mean?

    I calculate 3 parameters simultaneously and my program is multi-threaded, hence I need 8 x 10^8 bytes for 1 parameter and 1 thread (how did you get 1.5 GB?) x 3 parameters x 8 threads= 17.9 GB, but I only have 16 GB, then I run only 5 threads (11.2 GB).
     
  17. Dec 29, 2015 #16
    By interpolation from 105 or 104 quantiles.
    Bins = quantiles
    If you end up with a precision ≈10-4 what is the point in working with an intermediate calculation at a precision ≈10-7?
    Now I'm even more confused. Are you not interested in the correlations among these three paramaters? What can you hope to learn from summarising each of them by quantiles independently?
    I allowed for 1 paramater (because that's all you mentioned) and I also allowed for a 64 bit pointer in case that is required by the data structure you are using. (8 + 8) x 108 = 1.6 x 109 ≈ 1.5 GB.
    So you are collecting data for each thread separately then adding them together? Rewrite this - e.g. use 7 threads for the simulation each writing to a stream and run a separate process to collate the streams.

    But if I were you before I did anything else I would talk through what I was trying to achieve with someone else - are you in an academic institution or employment or is this just recreation?
     
  18. Dec 29, 2015 #17
    Please, would you elaborate a bit those two points?

    No, is just to improve the program efficiency, but it is totally OT here.

    My question is very easy; I need to do what I asked in my first post. Rephrasing a bit: I have many real random numbers with unknown distribution and I need to calculate a given quantile; is there any way to do that without storing all the numbers and retaining the best possible accuracy?
     
  19. Dec 29, 2015 #18
    Google "interpolation" and "false precision".

    OK, what do you want to do with them - investigate them to try to deduce a distribution? Model them for a game or simulation? Whatever you want to do I'm pretty sure that sorting them into 107 quantiles is not going to help.
    Not that I know of.
     
  20. Dec 29, 2015 #19
    I'll calculate a test statistics for some statistical tests (K-S, A-D, ...): http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
    and then I'll compare that statistics with the critical values obtained from the current simulation to get the corresponding p-value (I interpolate the values using a cubic spline).

    I'm reasonably sure to understand the phrase (but with my "English" I can't be sure), but I still don't understand what you mean.
    I just sort all the 108 numbers and then I use the formula R-2, SAS-5: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
    to calculate the wanted quantiles.

    No problem, my current simulation is reasonably good. Thank you anyway.
     
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook




Similar Discussions: Quantiles on a stream of real numbers
  1. Is zero a real number? (Replies: 4)

Loading...