Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Ultra-lossless compression algorithm

  1. May 30, 2007 #1
    About a year ago I came up with an idea for a compression algorithm which could potentially compress any type of data (pictures, movies, text, ect) with equal ability, and could potentially compress it to 1/1000 or less of its original size. It is such a simple idea that I am sure that if it was practical it would have been done.

    It is based on a very simple idea; here is an example: lets say a certain picture file looks like this in binary:


    That is 233 bits long, (I know a real picture file would be way longer this is just for illustration purposes). Now lets imagine that you wanted to compress this to a much smaller size. My idea would do it like this: it would treat the file as if it were one huge number, and then it would try to find a mathematical equivalent of it this is much shorter. For example the 233 bit number above, could be written as (3^147)+2 which is only 80 bits, even when considering 8 bits per number/math symbol, so that could be optimized into some sort of special type which only includes numbers and mathematical symbols. You can try it yourself at http://world.std.com/~reinhold/BigNumCalc.html

    And the algorithm could use all kinds of mathematical operations to reduce the size. I am pretty sure it should work theoretically, because any number can probably be written in a mathematical form which is much shorter. The only question is whether or not it is practical, my guess is no, considering that it would take billions of years to crack even a 128 bit encryption key, which I would compare this to, except the difference with this is that with a proper algorithm it should be possible to quickly narrow down on the possible solutions, because after each guess, you know exactly how far you were off by.

    Has anyone ever heard of a compression algorithm similar to this or at least an idea for one, or do you know of a reason that this would be completely impossible or impractical from a number theory or computing point of view?
    Last edited by a moderator: May 30, 2007
  2. jcsd
  3. May 30, 2007 #2


    User Avatar
    Science Advisor

    It could actually be compressed into 40 bits (4 bits should be enough to encode each digit/operator)
    From a theoretical perspective, efficiency aside, i don't think it would enable you to compress every arbitrary string of n bits into 1/1000 of the size or less.
    I think necessarily it would have to be unable to compress some strings. The reason i say that is because, suppose your algorithm can compress any n bits into n/1000 bits or less. Then, given c bits i run your compression once to compress the c bits into c/1000 bits and then repeat the process, compressing the compressed bits, until we are left with 1 or less bits. This would mean we can compress any bit string into a single bit, and a single bit doesn't have enough states to encode an arbitrary amount of information.

    Not to say you wouldn't be able to achieve very good compression, but i don't know if you could do a single static algorithm which would be able to compress all of the possible inputs by a significant factor.
  4. May 30, 2007 #3
    Thats a good point, I might repost this in the number theory section, they might know the specific scenarios in which it wouldn't work. As far as the 1/1000 figure, that is just what it would be capable of if it got a perfect string. I really don't know what something like this would be capable of on average thats why I am asking.

    I never thought that it would be done with a static algorithm, if it was possible I'm sure it would require some kind of intelligent searching algorithm and tons of optimizations.
    Last edited by a moderator: May 30, 2007
  5. May 30, 2007 #4


    User Avatar
    Science Advisor

    My feeling is that the process of finding a good compression is an NP-Complete problem. It kind of reminds me of the subset sum problem.
    Last edited: May 30, 2007
  6. May 31, 2007 #5


    User Avatar
    Science Advisor
    Homework Helper

    Assuming that "any number can probably be written in a mathematical form which is much shorter" is true (and I'm not sure about that) there is one problem you didn't mention.

    If you are going to use "all kinds of mathematical operations" to reduce the size, then the compressed file needs to contain instructions telling which operations to use to decompress it. In other words, you are proposing to replace the number with an algorithm (a computer program, if you like) that can compute the number.

    Actually, that's what existing compression algorithms do, except they use a predefined algorithm. So each file only need to store the "input data", it doesn't need to store the algorithm as well.

    Clearly for some sorts of files the program could be much shorter than the number but in general it would not be. Complexity Theory (the reputable scientific sort, not Irreducible Complexity and similar pseudoscience) studies this type of problem.
  7. May 31, 2007 #6
    If it is just written in basic mathematical language like "(3^147)+2" then shouldn't that be enough? Because "(3^147)+2" that is basically mathematical instructions right there, all you would need would be a special program on the receiving computer which would basically need to be a calculator which could calculate huge numbers, to decompress it. It would be like WinZip or any other file compression program, you have a program which you need to compress and decompress the files.
  8. May 31, 2007 #7


    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    1) You're not the first person to think of such things. In fact, I've thought of it independently a long time ago, and I suspect it's been around for ages.

    2) The problem with the idea is that it does not scale. The values of polynomials, logarithms, etc. get further and further apart on the number line as coefficients increase. For example, 4x^2 produces the values { 4, 16, 36, 64, 100, 144, 196, 256 } for x = { 1, 2, 3, 4, 5, 6, 7, 8 }. As the file sizes get larger, it becomes increasingly difficult to find a close number.

    This is a manifestion of the pigeon-hole problem. Let's say you want 1,000:1 compression. If you have files that are 10,000 bits long, then you can only have 10 bits of "mathematical expression data." (The format of this data is irrelevant.) You can only represent 210 different input files with only 10 bits of data, but there were 210,000 different possible 10,000 bit files to start with. In other words, you'll only be able to achieve 1,000:1 compression on a very small number of input files.

    If you can't find a close expression with ten bits of mathematical expression data, you will need to increase the amount of that data. This means you must continue adding terms to the polynomial, which means it's going to slow to a crawl. You can mitigate this somewhat by breaking the file up into blocks and compressing each block individually, but this means you reduce the compression.

    3) What you're left with, after you find the closest possible expression (given some run-time limit, etc.) is a small mathematical expression and a residue. The residue is the killer. The residue is essentially noise, and is not generally going to be very compressible.

    Because of the pigeon-hole problem, the residue is generally going to be almost as large as the original file -- if you are trying to create very small mathematical expressions. You'll have to keep making the expressions larger in order to shrink the residue. There will be an optimum at some point, where the sum of the lengths of the expression and residue is the smallest possible.

    Is this optimum any better than existing compression algorithms? That's the question you must ask yourself.

    You can probably write a simple implementation of this algorithm in Python (which has arbitrary-precision libraries) and see how it works. I don't think you'll be that impressed in the end.

    - Warren
  9. Jun 1, 2007 #8
    What about bits with 3 options? So you could have numbers like 010201201002? Instead of either a dot or blank space.. Have a dark dot, a medium dot, and a blank space to represent the three bits on hard drives. What about 4-bits? What if it were worked into the whole computation of everything? Real-time hardware compression. Software compression on top of this.

    Also, with numbers. Why stick with a base 10 system? (numbers are base 10, they go from 0 to 9, then 10 to 19, in 10's). Hex is base 16. If you take numbers that are stored using 8-bits, then convert the number to a hex code, you will use a shorter number. If you use base 256, using all the codes possible in 8-bit storage, you will have an extremely small number. I bet much smaller than the form of compression you're talking about, and it would be much faster. This would work exceedingly well for numbers and even text files (the alphabet is base 26. Numbers + the alphabet is base 36), but not programs that use more than letters and numbers (which have base 256 available to them, but probably don't use all 256 letters, numbers, and symbols). A file could even be broken into parts. Now, if there were bits that had 3 parts, rather than 2, an even greater amount of compression would be possible. Bits right now use a base 2 system. 2*2*2*2*2*2*2*2 (8 bits) has 256 possible outcomes. 3*3*3*3*3*3*3*3 has 6561 outcomes, which is over 25 times the amount. That would be a base 6561 system of representing data. Bits with 4 possibilities would have 65536 outcomes, which is 256 times the amount. A file with nothing but numbers would be compressed to only 1/656th the size if bits had 3 possibilities. 1GB of numbers would become about 1.6mb. With bits that have 4 possibilities, 1GB of numbers would become about 164kb.
    Last edited: Jun 1, 2007
  10. Jun 1, 2007 #9


    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    C'mon BoredNL, at least label your satire as such for the uninitiated. :rofl:

    - Warren
  11. Jun 1, 2007 #10
    I want 20 bits to each bit.


    I want a quantum computer with quantum storage.
    Last edited: Jun 1, 2007
  12. Jun 1, 2007 #11
    Thanks for the explanation chroot, I had a feeling it would be something like that; the residue that would cause the problems, I just wanted to know for sure. With the increases in bandwidth and storage, it isnt even really worth compressing something unless it is >10 MB and this would make the scaling problem exponentially worse, I'm guessing that even with a very highly developed intelligent algorithm it would still take days to compress a large file.
  13. Jun 8, 2007 #12

    Chris Hillman

    User Avatar
    Science Advisor

    I probably simply know too much about the often complicated interrelationships between dozens of notions of entropy, but I'd say that you didn't discover an algorithm, but rather some (mostly impractical) intuition for how all compression algorithms work. Namely, by removing "redundancies" in some instance of some kind of mathematical structure. To obtain efficient compression with reasonable effort, however, looking at statistical structure rather than trying to find hidden number theoretical structure is probably a much better idea! But it is true that you don't need to study statistical structure, you could also study algorithmic structure, which turns out, surprisingly, to be closely related if less convenient (cf. Chaitin). There are many other kinds of structure--- only a week or so ago I mentioned the fact that you could study a group action structure and remove redundancies in that. But statistical structure seems to be the easiest to work with.

    I'd recommend Cover and Thomas, Elements of Information Theory, as an excellent first textbook on information theory. IT is a vast subject; one reason I like this textbook is that it conveys some sense of the breadth and depth of IT.

    I'd also add that your phrasing could suggest that you believe that your scheme can compress anything, which I hope you realize is not true--- no compression algorithm can compress a file which does not possess what Shannon called "redundancy". Also be aware that there are a goodly number of cranks who claim results which directly contradict Shannon's coding theorem, so you want to be careful to avoid being lumped with that group!
    Last edited: Jun 8, 2007
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook