# True random numbers

1. Dec 20, 2007

### ktoz

Hi

I'm writing a compression utility based on an improvement to Huffman coding and, just for giggles, I thought I'd try it out on true random data from http://www.random.org/ to see if it did anything.

What I found was that the data from random.org (which is derived from atmospheric noise) definitely displays biases which can be exploited to generate significant compression rates (I'm getting upwards of 55% savings.)

I'm well aware of the pigeonhole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle) and know that random data shouldn't be compressible, so that brings up the question: is this data somehow not random? Is randomness really just a function of scale?

For example: if you take 1,000,000 randomly generated bits and slice them up into 8-bit samples, there are bound to be certain biases (i.e., more instances of the number 15 than 16). If you slice those 1,000,000 bits into 25-bit samples, completely different biases would result. It seems that by choosing a sample size, you automatically introduce biases into random data, which sort of makes it not random.
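For a sense of how big those slicing "biases" actually are, here's a quick sketch in Python (this is not the author's utility, and it uses Python's own PRNG rather than random.org data): slice a million random bits into 8-bit samples and count how often each value occurs.

```python
import random
from collections import Counter

random.seed(1)
bits = [random.getrandbits(1) for _ in range(1_000_000)]

# Slice the bitstream into 8-bit samples and count each value.
samples = [
    int("".join(map(str, bits[i:i + 8])), 2)
    for i in range(0, len(bits), 8)
]
counts = Counter(samples)

# With 125,000 samples spread over 256 values, the expected count is
# about 488 per value; individual counts fluctuate by roughly
# +/- sqrt(488) ~ 22, which is the "bias" one sees at this scale.
print(min(counts.values()), max(counts.values()))
```

The spread between the rarest and most common byte value is on the order of the statistical fluctuation, not a fixed property of the data.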

The only way I can think of to create a random set without biases would be to start with equal numbers of all values for a particular sample size and mix them. That way, there would be no frequency differences to exploit. But even this fails if you choose a different sample size.

So I'm wondering is randomness just an illusion caused by sampling scale?

2. Dec 20, 2007

### ozymandias

Statistics are meaningful only when large enough samples are considered.
For example, consider a coin. We "know" that coin tossing is completely random. However, you could try tossing it a few times and might end up with:
HHHTHHTHHH
which is obviously biased in favor of heads. However, you would expect freq(H)/freq(T) to approach 1 as you go on tossing, i.e. as your sample size becomes large enough (BTW, given a distribution, you can estimate how large "large enough" is).
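The convergence is easy to see in a short simulation (a sketch using Python's PRNG as the "coin"):

```python
import random

random.seed(0)

# Toss a fair coin n times for increasing n; the observed fraction of
# heads drifts toward 0.5 as the sample grows, even though short runs
# can look badly biased.
for n in (100, 10_000, 1_000_000):
    heads = sum(random.getrandbits(1) for _ in range(n))
    frac = heads / n
    print(n, frac)
```

The deviation from 0.5 shrinks roughly like 1/sqrt(n), which is one way to quantify how large "large enough" is.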

Assaf
www.physicallyincorrect.com

3. Dec 20, 2007

### chroot

Staff Emeritus
Slicing a random bitstream into groups of 8 does not make the data any less random at all.

As far as the meaning of "random," you have to consider that getting a million heads in a row is a perfectly valid result of flipping a coin a million times.

Your algorithm's ability to compress random data will disappear, on average, if you compress very large bitstreams, or, alternatively, if you individually compress a large number of small bitstreams.

- Warren

4. Dec 21, 2007

### CRGreathouse

How large are the strings that you're compressing?

Are you including all needed information in the encoded string? Sometimes there's outside data that's used to compress, and this outside data plus the 'encoded data' is at least as large as the original. Along those lines, what format is the encoded data in?

5. Dec 21, 2007

### CRGreathouse

That [starting with equal numbers of all values and mixing them] is, of course, a bias itself, and could be used to compress a string: 256 bytes in which each of the 256 possible values appears exactly once can be compressed from 2048 bits to 1684 bits, since there are only 256! such strings (though admittedly not easily).
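The 1684-bit figure can be checked directly: there are 256! strings in which each byte value appears exactly once, so an optimal code for that set needs ceil(log2(256!)) bits.

```python
import math

# Count the bits needed to index one of the 256! strings in which
# every byte value 0-255 appears exactly once.
bits_needed = math.ceil(math.log2(math.factorial(256)))
print(bits_needed)  # 1684, versus 2048 bits for the raw string
```

So the equal-frequency constraint saves 364 bits, exactly because it is a (strong) statistical bias.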

6. Dec 21, 2007

### CRGreathouse

I took 10,000 bytes from random.org (integers in the range 0-255) and stored them in a 10,000-byte bin file. I ran several compression algorithms on it (using 7-zip) and wasn't able to compress it to below 10,000 bytes. The compressed files were in the range 10,200 to 10,800 bytes or so.
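The same experiment is easy to repeat with any DEFLATE-based compressor. Here's a sketch using Python's zlib (not 7-zip, and with `os.urandom` standing in for the random.org download):

```python
import os
import zlib

# 10,000 uniformly random bytes, a stand-in for the random.org file.
data = os.urandom(10_000)
packed = zlib.compress(data, 9)

# DEFLATE cannot shrink uniformly random bytes; the output comes out
# slightly larger than the input because of stream headers and
# per-block overhead.
print(len(data), len(packed))
```

The round trip still works, of course; the compressor just can't find any redundancy to exploit.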

7. Dec 21, 2007

### ktoz

10,000 bytes and yes to including header info.

8. Dec 21, 2007

### ktoz

On a Mac, if I use a random 10,000-number set (with spaces deleted) from random.org and restrict the values to the range 0-9, I can use the built-in "Create archive of ..." command in the Finder and get around 50 percent compression.

With my algorithm, I'm getting around 5 to 20 percent better compression rates than "zip" depending on file type.

It's still a bit rough around the edges, so I may be making an error somewhere in the code that's giving me these smaller sizes. I have the compress/write-to-file part working but still need to write the read-from-file/decompress part, which will show whether or not I've messed up.

9. Dec 21, 2007

### chroot

Staff Emeritus
Ten kilobytes is a laughably small data set. Don't draw any conclusions at all from that...

- Warren

10. Dec 21, 2007

### Ben Niehoff

Well, that's your problem. A data stream restricted to the numbers 0-9 is hardly random at all at the byte level! You only need 4 bits of information to represent a number from 0 to 15, so a digit from 0 to 9 carries even less than that (about 3.32 bits). Therefore, if each character occupies one byte in the file, it's no wonder your Mac gets 50% compression rates: more than half of each byte is predictable.

If you want a truly random data stream, you MUST use integers in the range 0-255.
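The "even less than 4 bits" figure is just the entropy of a uniform decimal digit, which fixes how small such a file can possibly get (a quick check, not from the thread):

```python
import math

# Entropy of one uniformly random decimal digit, in bits.
bits_per_digit = math.log2(10)     # ~ 3.32 bits

# Stored as one ASCII byte per digit, this is the fraction of each
# byte that is real information -- the floor for any compressor.
fraction = bits_per_digit / 8      # ~ 0.415
print(bits_per_digit, fraction)
```

So an ideal compressor takes an ASCII-digit file down to about 41.5% of its original size, and no further.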

11. Dec 21, 2007

### chroot

Staff Emeritus
Err.. yeah, Ben. Good catch. I was tacitly assuming that ktoz was using a random bitstream, where every bit is independent of every other bit. If he's just using a file full of ASCII characters '0' through '9', then sure, more than half of each byte is wasted. A good compression algorithm can take such a file down toward its entropy limit of log2(10)/8 ≈ 41.5% of the original size, i.e. nearly 60% compression.

- Warren

12. Dec 21, 2007

### ktoz

I was kind of assuming the "zip" compressor was squeezing out the extraneous zeros, but what difference does restricting the set to 0-9, as opposed to 0-255, make to whether a sequence is random or not? Doesn't seem like base 10 is intrinsically less capable of randomization than base 256.

13. Dec 21, 2007

### chroot

Staff Emeritus
Are you actually putting the ASCII characters '0' through '9' in a file, and then trying to compress it? Just say yes or no.

If so, consider the binary values for the ASCII codes for '0' through '9':

'0': 0011 0000
'1': 0011 0001
'2': 0011 0010
'3': 0011 0011
'4': 0011 0100
'5': 0011 0101
'6': 0011 0110
'7': 0011 0111
'8': 0011 1000
'9': 0011 1001

As you can see, the first nibble never changes; it is duplicated in every single byte and carries no information at all. The second nibble does change, but most of its values start with a zero bit. In other words, there are fewer than 4 bits of "real information" in each 8-bit byte. That means roughly 5/8ths of your data set is wasted, which is why any decent compression algorithm can reduce it by 50% or more.
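The constant upper nibble can be reclaimed with nothing fancier than packing two digits per byte (classic BCD packing; the digit string below is just a hypothetical example):

```python
# Pack two ASCII digits into one byte, one digit per nibble. This
# discards the constant 0011 upper nibble described above and halves
# the file on its own; a real entropy coder could do slightly better
# still, since a digit needs only log2(10) ~ 3.32 bits.
digits = "31415926535897932384"

packed = bytes(
    (int(digits[i]) << 4) | int(digits[i + 1])
    for i in range(0, len(digits), 2)
)

# Unpack to verify the round trip.
unpacked = "".join(f"{b >> 4}{b & 0x0F}" for b in packed)
print(len(digits), len(packed))
```

Twenty ASCII bytes become ten packed bytes, with the original string fully recoverable.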

- Warren

14. Dec 22, 2007

### CRGreathouse

Ten bits is enough to encode three decimal digits (allowing compression to about 42% of the original ASCII size), with enough room left over for "end of string", "append 0 then end of string", ..., "append 9 then end of string", "append 0, read 4 more bits to get one more digit, then end of string", ..., "append 9, read 4 more bits to get one more digit, then end of string", and still have three codes left over for error/override conditions.
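The codeword budget behind this scheme checks out exactly (a quick tally, with the category names paraphrased from the post):

```python
codewords = 2 ** 10   # a 10-bit code has 1024 codewords
triples   = 10 ** 3   # one codeword per digit triple 000-999
end       = 1         # plain "end of string"
append1   = 10        # "append digit d, then end of string"
append2   = 10        # "append d, read 4 more bits for one final digit, end"

spare = codewords - triples - end - append1 - append2
print(spare)  # 3 codewords left for error/override conditions

# Compression ratio versus ASCII: 10 bits replace 3 bytes = 24 bits.
ratio = 10 / 24
print(round(ratio * 100))  # ~ 42 percent of the original size
```

Note 10/24 ≈ 41.7%, just above the log2(10)/8 ≈ 41.5% entropy floor, so this simple fixed-length scheme is nearly optimal.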