Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Needing a second opinion on data compression scheme

  1. Jul 28, 2008 #1
    Over the last 2 years, i have been developing a compression technique that is free of "mathematical" expressions. By doing this, the compression scheme would be free of traditional mathematic limitations usually applied to compression routines.

    I'll go right into it, please let me know if i need to clarify any of the below.

    I'll use a random set of digits for my example: 1415 9265 3589 7932

    Now the basic point of this method is to observe the frequency of each specific digit, 0 appears no times, 1 appears 2 times, 2 apears 2 times, 3 appears 1 times, etc.

    We then create a binary key sequence consiting of 10 bits. 98 7654 3210, then we proceed to turn them on, one at a time.

    98 7654 3210
    00 0000 0001

    would make the sequence above into a binary pattern of 0000 0000 0000 0000.

    98 7654 3210
    00 0000 0010

    would make the sequence above into a binary pattern of 1010 0000 0000 0000.

    Repeating this 1024 times, would create 1024 sets of 2 bytes of data. or 2048 KB of data. all of which would only have a 10 bit Key address in this particular example, as the source data does not change. 10/16 represents a compression ratio of .625 out of 1.0.

    The bigger picture: Lets say my source set was a continuous number (such as PI) And was always restricted to 8192 digit samples, starting at the first character in PI, and shifting over 1 digit, per 1024 keys. (keys 11 1111 1111, and 00 0000 0000 will not be used as they would make the entire set either all 0's or all 1's)

    8192 digits per sample = 8192 binary digits per key = 1KB of data created from a number that can easily be formed via any home computer, and would only have to have the address where to go in PI, and the key used to obtain the data.

    Now assuming we only use the currently formulated 1.2 trillion digits of PI (the world record held by Kanada Labratory) and each digit of this was held on a hard drive, each digit would only consume 4 bits, meaning the hard drive would have to be no bigger than 600 GB, which is very feasible in today's computer market.

    This system could feasibly create 1,200,000,000,000 x 8,192 bits = 9,830,400,000,000,000 bits of data, that would only need to be stored on one main computer at the compression facility.

    The key to storing the data: hash checksums.

    For every key created, 2 hash checksums are created using adler32 and crc32, the odds of 2 different strings of data, creating the same checksum in BOTH hashes are more so than the odds of creating every possibility of 8192 bits. as it would be

    2^32 to the 2^32 power odds of any 2 data strings creating the same hash. So using the hashes creating during the discovery phase to store each 1kb found using the keys as 8 bytes (2 seperate 4 byte hashes) meaning we could store on a 1TB hard drive, 125000000000 sets of 1kb, or 116.415321826934814453125 TB of data, that would NEVER need to be fully stored, anywhere else on the planet, ever again.

    Sorry for being so passionate about this, but please let me know what you think, and please ask any questions you may have, and any constructive advice would be wonderful.

    William Dean Dover
    Posted Monday, July 28, 2008 11:33AM CST
     
  2. jcsd
  3. Jul 29, 2008 #2
    I have no experience in data compression, but that makes a lot of sense. Its pretty simple, though I may just think so because I have not been initiated into data compression at any level yet. Still, pretty interesting. Sorry I cant offer any insights.
     
  4. Aug 2, 2008 #3
    I was in a hurry to post the first time, so i'll break it down and go into alot more detail, and go a little slower:

    PHASE 1: Building the Compressor:

    The compressor for this particular compression scheme is the most intricate part, and will require the most resources. However, upon completion it will eliminate the need for any piece of data to be transmitted at full size again. The Total size of the compressor disk space will be uncomparable to anything known currenlty to man.

    Taking an transcendental number - in this case PI, as it has already been formulated to the 1.2 trillionth digit, which makes it the perfect sample- The sample in question will consume roughly 4 bits per digit, 1,200,000,000,000 digits which means, you would need a hard drive roughly around 600GB just to hold the source data.

    Now, the main part of the compression technique is that the source data is transformed from regular decimal digits, directly to binary samples, thereby exploiting the natural digit frequencies found within PI to create naturally random strings of binary data.

    For Example: 3.141592653589.....

    1415 9265 3589, lets say we apply a key, that turns only decimal 9's, into binary 1's and the rest of the decimal digits from 0-8 into binary 0's. So the sequence would become:

    0000 1000 0001 or 081 in Hex, the binary key used would have the binary form of 10 0000 0000 (98 7654 3210) or 100 in hex.

    Using this method, we could then turn unlimited amounts of consistent decimal strings into valid binary sequences. If done sequentially and stored properly, it then becomes possible to eliminate the need for an addressing scheme when storing the derived binary strings.

    For example, if the program knows that i started at the first digit in PI, and all my samples consisted of 8192 digits, and that i was only applying 1022 keys to each set of 8192 digits, then it would be able to tell the exact address of the data, just by it's physical placement on the destination hard drive. Therefore, the following becomes possible:

    If for every string of 8192 decimal digits, I obtain 1022 unique strings of 8192 binary digits, then for each pair of 8192 digits, being as tho the program knows the address of the data stored, I could easily conserve space by simply storing only 2 seperate hash checksums (ADLER32 and CRC32 - both 4 byte hashes) And also allow myself an extremely easy way to "locate" a match during compression, simply by hashing the input string, and correctly matching both hashes. Even if multiple matches occur, the search will be fast, as will the reformation of the original data strings by the program and then the eventual comparison to the input strings. The completed compressed file, would only need to know the following:

    1) The location in PI to go to to retrieve the data
    2) The 10-bit key used to form the data from the PI decimal digits

    I envision the above being extremely flexible as the system grows and reaches it's completion.
    The system - altho seemingly perfect has a downside, the space and processing power required by the compressor system, and the speed and reliablitiy of the decompressor. I envision a seperate "box" system that is preloaded with the current values of PI available, with upgrade slots for several billion digits etc. Completion of this system will unlock 2 doors:

    1) No strand of data, found within this system, even up to a reasonable limit based on technology available will ever need to be transmitted in it's entirety again.
    2) All files may be compressed with this system, regardless of type, reversible coders may even be used to try to make as many matches as possible to the PI system. Any data compressed with this system, has the potential to be compressed again, using this system. The smallest possible file size being 15-30 bytes. Depending on the state of the system when the file is compressed.
     
  5. Aug 2, 2008 #4

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Your explanation is unclear. Demonstrate with concrete (and preferably small!) examples, or explicit algorithm.

    Learn some theory. The most relevant limit to data compression comes from a simple counting argument: there are more files of length N than there are files of length less-than-N. Therefore, it is impossible for a compression algorithm to compress all files.

    If you refuse to accept the theory, then implement your algorithm; that is the only way you will convince any knowledgable person of your final claims.
     
    Last edited: Aug 2, 2008
  6. Aug 2, 2008 #5
    I agree with you Hurkyl, any Mathematical based algorithm Must follow the counting argument and K's Complexity. But my method does not require any actual mathematical formula - other than obtaining the actual digits of PI, and doing hash checksums. Other than that is a simple addressing system only, there is no real mathematical algorithm in use, therefore the system is outside those basic rules of compression.

    I will use the following 96 sets of 4 digits, the first 384 digits of PI.

    1415 9265 3589 7932 3846 2643 3832 7950 2884 1971 6939 9375 1058 2097 4944 5923
    0781 6406 2862 0899 8628 0348 2534 2117 0679 8214 8086 5132 8230 6647 0938 4460
    9550 5822 3172 5359 4081 2848 1117 4502 8410 2701 9385 2110 5559 6446 2294 8954
    9303 8196 4428 8109 7566 5933 4461 2847 5648 2337 8678 3165 2712 0190 9145 6485
    6692 3460 3486 1045 4326 6482 1339 3607 2602 4914 1273 2458 0066 6315 8817 8815
    2092 0962 8292 5409 1715 3643 6789 2590 3600 1133 0530 5488 2046 6521 3841 4695

    Using the above i will apply 10 different Keys to the decimal digits, that will convert them from a 4 bit - binary digit to a 1 bit refrence digit. In the form previously stated:

    98 7654 3210
    (Binary)---------(Hex)------(Description)
    00 0000 0001----001------0 becomes a binary 1, digits 1-9 become binary 0's
    00 0000 0010----002------1 becomes a bianry 1, digits 0, 2-9 become binary 0's
    00 0000 0100----004------2 becomes a bianry 1, digits 0-1, 3-9 become binary 0's
    00 0000 1000----008------3 becomes a binary 1, digits 0-2, 4-9 become binary 0's
    00 0001 0000----010------4 becomes a binary 1, digits 0-3, 5-9 become binary 0's
    00 0010 0000----020------5 becomes a bianry 1, digits 0-4, 6-9 become binary 0's
    00 0100 0000----040------6 becomes a binary 1, digits 0-5, 7-9 become binary 0's
    00 1000 0000----080------7 becomes a binary 1, digits 0-6, 8-9 become binary 0's
    01 0000 0000----100------8 becomes a binary 1, digits 0-7, 9 become binary 0's
    10 0000 0000----200------9 becomes a bianry 1, digits 0-8 become binary 0's

    The above data would then become:
    Using the First Key 001, so the final address required to access the below data would be a refrence to the first digit in PI as the starting point, the 384th digit being the ending point, and the key being 001: in hex the address would look like:

    001 180 001:

    0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0000 0100 0100 0000 0000
    1000 0010 0000 1000 0000 1000 0000 0000 1000 0000 0100 0000 0001 0000 1000 0001
    0001 0000 0000 0000 0100 0000 0000 0010 0001 0010 0000 0001 0000 0000 0000 0000
    0010 0000 0000 0010 0000 0000 0000 0000 0000 0000 0000 0000 0000 1001 0000 0000
    0000 0001 0000 0100 0000 0000 0000 0010 0010 0000 0000 0000 1100 0000 0000 0000
    0100 1000 0000 0010 0000 0000 0000 0001 0011 0000 1001 0000 0100 0000 0000 0000

    So 48 bytes of data, could be "compressed" to a simple address = 001180001 36 bits long.

    36/384 = roughly 10% of it's original size.

    You'll notice that is did NOT actually mathematcially compress anything, I just obtained a sliver of data from a known source. Meaning, my method does not need to follow any mathematical principles, as it is not mathematically based.

    001 180 002:

    1010 0000 0000 0000 0000 0000 0000 0000 0000 1001 0000 0000 1000 0000 0000 0000
    0001 0000 0000 0000 0000 0000 0000 0110 0000 0010 0000 0100 0000 0000 0000 0000
    0000 0000 0100 0000 0001 0000 1110 0000 0010 0001 0000 0110 0000 0000 0000 0000
    0000 0100 0000 0100 0000 0000 0001 0000 0000 0000 0000 0100 0010 0100 0100 0000
    0000 0000 0000 1000 0000 0000 1000 0000 0000 0010 1000 0000 0000 0010 0010 0010
    0000 0000 0000 0000 1010 0000 0000 0000 0000 1100 0000 0000 0000 0001 0001 0000

    001 180 004:

    0000 0100 0000 0001 0000 1000 0001 0000 1000 0000 0000 0000 0000 1000 0000 0010
    0000 0000 1001 0000 0010 0000 1000 1000 0000 0100 0000 0001 0100 0000 0000 0000
    0000 0011 0001 0000 0000 1000 0000 0001 0000 1000 0000 1000 0000 0000 1100 0000
    0000 0000 0010 0000 0000 0000 0000 1000 0000 1000 0000 0000 1001 0000 0000 0000
    0001 0000 0000 0000 0010 0001 0000 0000 1001 0000 0100 1000 0000 0000 0000 0000
    1001 0001 0101 0000 0000 0000 0000 1000 0000 0000 0000 0000 1000 0010 0000 0000

    If you'd like the rest of the set's i'd be glad to make them up, but the premise behind my idea is that instead of 384 digits, we use 8192 digits (which in binary would equal 1kb).
     
  7. Aug 2, 2008 #6
    Enough said.

    A single byte can hold 1.6 digits. It is simply impossible to save any space with any lossless compression scheme on random numbers.
     
  8. Aug 2, 2008 #7

    Borek

    User Avatar

    Staff: Mentor

    So if I understand correctly, the idea is to use PI as a table of all possible 8192 bytes long blocks, and to use number of block to select the correct one?

    Apart from the obvious flaw in the idea, two questions:

    1. Why PI and not systematic of 8192 bytes long integers, starting with 0 and incrementing each next by 1?

    2. Why do you convert decimal PI digits to binary ones, instead of simply using binary PI value?
     
  9. Aug 2, 2008 #8
    I see an important problem in your method. The data you are able to compress is composed almost all of '0'. Real generic data will have many more '1'.

    In general case, even with a large string of digits (1.2 trillions from PI) there isn't enough information to code all possible patterns. For example you have used 384-bits blocks. There are 2^384 possible combinations of bits of that length. Being extremely optimistic and using many more than 10 keys, you will only be able to find about 2^40 distinct 384-bits blocks inside PI. This is crearly insufficient.

    This means that only a small portion of the data will be really compressed and all other uncompressed blocks will need some extra header indicating it is not compressed. Probably the final length will be larger than the original data.

    As Hurkyl said, there are some laws that cannot be broken.
     
  10. Aug 2, 2008 #9

    Borek

    User Avatar

    Staff: Mentor

    You don't understand the most important thing. Basis rules are just that - basic rules. They don't cover subset of compression methods, they cover all of them.
     
  11. Aug 2, 2008 #10
    Before this turns into an argument over established law vs. theory. Let's all agree on one basic thing:


    If i did take the first 1,200,000,000,000 digits of PI, which have already been figured. Started at the first digit, and sampled sets of 8192 digits til the very end:

    1,199,999,991,809 sets, (it wouldn't be able to to use all 1.2 trillion digits, unless i added 8191 digits to the end)

    If i derived the 1024 possible sets from 98 7654 3210, and eliminated the sets 11 1111 1111 and 00 0000 0000 as they would only result in all 0's or 1's. I would have:

    1,199,999,991,809 x 1022 = 1,226,399,991,628,798 sets of 1kb of data.

    Now, regardless of the complexity of any of the data sets derived from this function, they would all easily be retrievable, only from knowning where to start in PI (if 8192 is built intot he program as the set limit) and the key used to modify the decimal digits into binary.

    So the address for these sets of data would only have to be

    (in hexdecimal)

    up to 1176592E000 for the PI starting address, and only 001-3FE for the key modifier.

    00000000001 001 through 11765922FFF 3FE would be my range.

    7 byte addresses. Where to go, and what to do within PI.

    Now, 7 bytes to represent each of 1,226,399,991,628,798 different possibilities of 1024 bytes. Without getting into an argument over the mathematical principles. These 1.2 Quadrillion possiblities, would never again need to be stored at full length. I'm not breaking any laws of mathematics doing this, and the only catch would be that a normal user having to calculate to the 1.2 trillionth digit in PI would be overwhelming for a homebased computer. But 40 or 50 million digits wouldn't take long. and on the newer 4 and 8 core processors, 100 or 200 million digits wouldnt' take long to compute, let alone the fact you wouldn't be restricted solely to PI, you could use 1/PI to the 40-50 million digit, and other transcendental numbers as well.
     
  12. Aug 2, 2008 #11

    Borek

    User Avatar

    Staff: Mentor

    The sooner you understand that to address 1.2 Quadrillion possiblities you need 1.2 Quadrillion addresses, and you need exactly the same number of bytes to address these bytes that you are storing, the better for you.

    So far, you have not explained why pi, why not e, or any other number. If only pi can be used - explain why. If any number can be used - explain why you can't use systematically build table, in which address must have the same length that the table entry itself.
     
  13. Aug 2, 2008 #12
    Ok, assume you really have 1,226,399,991,628,798 sets of 1 KB of data. This means you have a total of almost 1.3e15 sets out of 1e2466 (~ 2^8192) possible combinations of 8192 bits. You can't represent all possible 1 KB blocks that you may find in the uncompressed data.

    To take this into account, at least one control bit must be added to each block in the compressed data to indicate if the following information corresponds to raw data (that hadn't been compressed) or compressed information (starting address + key).

    If there is a match, 1 KB block is compressed to 48 bits (7 Bytes) (assuming the control bit is included)
    If there isn't a match, 1 KB block is translated to 8193 bits (1KB block + 1 control bit)

    The probability that one input block of 1 KB is found into the 1.3e15 sets available is:

    1.3e15 / 1e2466 = 1.3e-2451 (virtually 0)

    The expected mean size of compressed blocks will be:

    1.3e-2451 * 48 + (1 - 1.3e-2451) * 8193 = 8193 bits

    The compressed size will be greater than the original size !!!
     
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook

Have something to add?



Similar Discussions: Needing a second opinion on data compression scheme
  1. Website opinion (Replies: 1)

  2. Opinions please! (Replies: 6)

Loading...