True random numbers

  1. Dec 20, 2007 #1
    Hi

    I'm writing a compression utility based on an improvement to Huffman coding and just for giggles, I thought I'd try it out on true random data from random.org to see if it did anything.

    What I found was that the data from random.org (which is derived from atmospheric noise) definitely displays biases which can be exploited to generate significant compression rates (I'm getting upwards of 55% savings.)

    I'm well aware of the "pigeonhole principle" and know that random data shouldn't be compressible so that brings up the question: Is this data somehow not random? Is randomness really just a function of scale?

    For example: if you take 1,000,000 randomly generated bits and slice them up into 8-bit samples, there are bound to be certain biases (e.g., more instances of the number 15 than of the number 16). If you slice those same 1,000,000 bits into 25-bit samples, completely different biases result. It seems that just by choosing a sample size, you automatically introduce biases into random data, which sort of makes it not random.

    The only way I can think of to create a random set without biases would be to start with equal numbers of all values for a particular sample size and mix them. That way, there would be no frequency differences to exploit. But even this fails if you choose a different sample size.

    So I'm wondering: is randomness just an illusion caused by sampling scale?
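    To make the question concrete, here's a rough sketch of the kind of tally I'm talking about (Python, with the standard pseudo-random generator standing in for the random.org data, so it's illustration only, not my actual utility):

        import random
        from collections import Counter

        def symbol_counts(bits, k):
            # Slice the bitstring into k-bit samples and tally each value.
            counts = Counter()
            for i in range(0, len(bits) - k + 1, k):
                counts[int(bits[i:i + k], 2)] += 1
            return counts

        # 1,000,000 pseudo-random bits (stand-in for the random.org download)
        bits = ''.join(random.choice('01') for _ in range(1_000_000))

        for k in (8, 25):
            counts = symbol_counts(bits, k)
            print(k, "bit samples:",
                  "most common", counts.most_common(1)[0],
                  "least common", counts.most_common()[-1])

    The spread you see at 8 bits tells you nothing about the spread at 25 bits, which is exactly what makes me suspect the "bias" is a property of the sample size rather than of the data.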
     
  3. Dec 20, 2007 #2
    Statistics are meaningful only when large enough samples are considered.
    For example, consider a coin. We "know" that coin tossing is completely random. However, you could try tossing it a few times and might end up with:
    HHHTHHTHHH
    which is obviously biased in favor of heads. However, you would expect freq(H)/freq(T) to approach 1 as you go on tossing, i.e. as your sample size becomes large enough (BTW, given a distribution, you can estimate how large "large enough" needs to be).
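    A quick simulation makes the convergence visible (a Python sketch; the pseudo-random generator stands in for a real coin):

        import random

        heads = tails = 0
        for n in range(1, 1_000_001):
            if random.random() < 0.5:
                heads += 1
            else:
                tails += 1
            if n in (10, 100, 10_000, 1_000_000):
                # The ratio wanders early on and settles toward 1 as n grows.
                print(n, heads / max(tails, 1))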

    Assaf
    Physically Incorrect
     
  4. Dec 20, 2007 #3

    chroot

    Slicing a random bitstream into groups of 8 does not make the data any less random at all.

    As far as the meaning of "random," you have to consider that getting a million heads in a row is a perfectly valid result of flipping a coin a million times.

    Your algorithm's ability to compress random data will disappear, on average, if you compress very large bitstreams, or, alternatively, if you individually compress a large number of small bitstreams.
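    You can see the averaging effect with an ordinary off-the-shelf compressor (a rough Python sketch, using zlib as a stand-in for your algorithm and os.urandom as a stand-in for random.org):

        import os
        import zlib

        trials, block = 1000, 10_000
        total_in = total_out = 0
        for _ in range(trials):
            data = os.urandom(block)                 # one small random bitstream
            total_in += len(data)
            total_out += len(zlib.compress(data, 9))
        # Averaged over many blocks, the "compressed" size comes out slightly
        # larger than the input, because random data is incompressible on average.
        print(total_out / total_in)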

    - Warren
     
  5. Dec 21, 2007 #4

    CRGreathouse

    How large are the strings that you're compressing?

    Are you including all needed information in the encoded string? Sometimes there's outside data that's used to compress, and this outside data plus the 'encoded data' is at least as large as the original. Along those lines, what format is the encoded data in?
     
  6. Dec 21, 2007 #5

    CRGreathouse

    That idea (starting with equal numbers of all values and mixing them) is, of course, a bias itself, and could be used to compress a string: 256 bytes in which each of the 256 possible values appears exactly once can be compressed from 2048 bits to 1684 bits (though admittedly not easily).
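    (The 1684 is just the information content of an ordering: there are 256! ways to arrange 256 distinct byte values, and ceil(log2(256!)) = 1684 bits are needed to say which one you have. A quick sanity check in Python:)

        import math

        # 1684 bits are enough to index one of the 256! orderings.
        print(math.factorial(256).bit_length())   # 1684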
     
  7. Dec 21, 2007 #6

    CRGreathouse

    I took 10,000 bytes from random.org (integers in the range 0-255) and stored them in a 10,000-byte bin file. I ran several compression algorithms on it (using 7-zip) and wasn't able to compress it to below 10,000 bytes. The compressed files were in the range 10,200 to 10,800 bytes or so.
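    (If anyone wants to repeat this without 7-zip, a short Python sketch does much the same thing, with os.urandom standing in for the random.org bytes:)

        import lzma
        import os
        import zlib

        data = os.urandom(10_000)                     # stand-in for the 10,000 random bytes
        print("zlib:", len(zlib.compress(data, 9)))   # typically a little over 10,000
        print("lzma:", len(lzma.compress(data)))      # likewise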
     
  8. Dec 21, 2007 #7
    10,000 bytes and yes to including header info.
     
  9. Dec 21, 2007 #8
    On a Mac, if I take a set of 10,000 random numbers from random.org (with spaces deleted) and restrict the values to the range 0-9, I can use the built-in "Create archive of ..." command in the Finder and get around 50 percent compression.

    With my algorithm, I'm getting around 5 to 20 percent better compression rates than "zip" depending on file type.

    It's still a bit rough around the edges, so I may be making an error somewhere in the code that's giving me these smaller sizes. I have the compress/write-to-file part working but still need to write the read-from-file/decompress part, which will show whether I've messed up or not.
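    (Once the decoder exists, I'm planning to verify it with a round-trip test along these lines; a Python sketch where compress and decompress are just placeholders for my own routines:)

        import os

        def round_trip_ok(compress, decompress, trials=100, size=10_000):
            # Feed random blocks through compress/decompress and confirm the
            # output matches the input byte for byte.
            for _ in range(trials):
                data = os.urandom(size)
                if decompress(compress(data)) != data:
                    return False
            return True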
     
    Last edited: Dec 21, 2007
  10. Dec 21, 2007 #9

    chroot

    Ten kilobytes is a laughably small data set. Don't draw any conclusions at all from that...

    - Warren
     
  11. Dec 21, 2007 #10

    Ben Niehoff

    Well, that's your problem. A data stream restricted to numbers 0-9 is hardly random at all! You only need 4 bits of information to represent a number from 0 to 15, so 0-9 needs even less than that. Therefore, if each character occupies one byte in the file, then it's no wonder your Mac gets 50% compression rates, because 50% of the file is zeros.

    If you want a truly random data stream, you MUST use integers in the range 0-255.
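    (To make it concrete: simply packing two digits into each byte, four bits apiece, already gets you roughly that 50% with no cleverness at all. A rough Python sketch:)

        def pack_digits(text):
            # Pack ASCII digits two per byte (4 bits each): about a 50% "compression".
            digits = [int(c) for c in text if c.isdigit()]
            if len(digits) % 2:
                digits.append(0)            # pad; a real format would record this
            return bytes((a << 4) | b for a, b in zip(digits[::2], digits[1::2]))

        print(len(pack_digits("1415926535" * 1000)))   # 10,000 digits -> 5,000 bytes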
     
  12. Dec 21, 2007 #11

    chroot

    Err.. yeah, Ben. Good catch. I was tacitly assuming that ktoz was using a random bitstream, where every bit is independent of every other bit. If he's just using a file full of ASCII characters '0' through '9' then sure, way more than half of each byte is wasted. A good compression algorithm could easily achieve better than 50% compression on such a file.

    - Warren
     
  13. Dec 21, 2007 #12
    I was kind of assuming the "zip" compressor was squeezing out the extraneous zeros, but what difference does restricting the set to 0-9, as opposed to 0-255, make to whether a sequence is random or not? It doesn't seem like base 10 is intrinsically less capable of randomness than base 256.
     
  14. Dec 21, 2007 #13

    chroot

    Are you actually putting the ASCII characters '0' through '9' in a file, and then trying to compress it? Just say yes or no.

    If so, consider the binary values for the ASCII codes for '0' through '9':

    '0': 0011 0000
    '1': 0011 0001
    '2': 0011 0010
    '3': 0011 0011
    '4': 0011 0100
    '5': 0011 0101
    '6': 0011 0110
    '7': 0011 0111
    '8': 0011 1000
    '9': 0011 1001

    As you can see, the first nibble never changes; it is duplicated in every single byte and carries no real information. The second nibble does change, but the majority of them start with a zero bit. In other words, there are fewer than 4 bits of "real information" in each 8-bit byte. That means roughly 5/8ths of your data set is wasted, which is why any decent compression algorithm can reduce it by 50% or more.
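    To put an exact number on it: a uniformly random decimal digit carries log2(10), about 3.32 bits, so the best any compressor can possibly do on such a file is roughly 41.5% of its original size. A quick check (Python, just the arithmetic):

        import math

        bits_per_digit = math.log2(10)     # about 3.3219 bits of information per digit
        print(bits_per_digit / 8)          # ~0.415: the best possible compressed ratio
        print(1 - bits_per_digit / 8)      # ~0.585 of each byte is redundant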

    - Warren
     
    Last edited: Dec 22, 2007
  15. Dec 22, 2007 #14

    CRGreathouse

    Ten bits is enough to encode three decimal digits (allowing compression to 42% of the original size), with room left over for "end of string", "append 0 then end of string", ..., "append 9 then end of string", "append 0, read 4 more bits to get one more digit, then end of string", ..., "append 9, read 4 more bits to get one more digit, then end of string", and still three spare codes for error/override conditions.
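    A minimal sketch of the basic packing (Python; it ignores the end-of-string and override codes, so it only handles digit counts that are a multiple of three):

        def pack3(digits):
            # Emit one 10-bit code (0-999) per group of three decimal digits.
            out, bitbuf, nbits = bytearray(), 0, 0
            for i in range(0, len(digits) - len(digits) % 3, 3):
                code = digits[i] * 100 + digits[i + 1] * 10 + digits[i + 2]
                bitbuf = (bitbuf << 10) | code
                nbits += 10
                while nbits >= 8:
                    nbits -= 8
                    out.append((bitbuf >> nbits) & 0xFF)
            return bytes(out)

        # 3,000 ASCII digits (3,000 bytes) pack into 1,250 bytes: about 42%.
        print(len(pack3([3, 1, 4, 1, 5, 9] * 500)))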
     