Ultra-lossless compression algorithm

q3snt · May 30, 2007

About a year ago I came up with an idea for a compression algorithm which could potentially compress any type of data (pictures, movies, text, ect) with equal ability, and could potentially compress it to 1/1000 or less of its original size. It is such a simple idea that I am sure that if it was practical it would have been done.

It is based on a very simple idea; here is an example: let's say a certain picture file looks like this in binary:

11111110001001000011001011010111111100000000010101001001000010110101100001011010111111101001100101100011000110
00000001100010010111110110100000111000001111010111001000000011100101010000000111110010101010010100110011111000
1000111011101

That is 233 bits long, (I know a real picture file would be way longer this is just for illustration purposes). Now let's imagine that you wanted to compress this to a much smaller size. My idea would do it like this: it would treat the file as if it were one huge number, and then it would try to find a mathematical equivalent of it this is much shorter. For example the 233 bit number above, could be written as (3^147)+2 which is only 80 bits, even when considering 8 bits per number/math symbol, so that could be optimized into some sort of special type which only includes numbers and mathematical symbols. You can try it yourself at http://world.std.com/~reinhold/BigNumCalc.html

And the algorithm could use all kinds of mathematical operations to reduce the size. I am pretty sure it should work theoretically, because any number can probably be written in a mathematical form which is much shorter. The only question is whether or not it is practical, my guess is no, considering that it would take billions of years to crack even a 128 bit encryption key, which I would compare this to, except the difference with this is that with a proper algorithm it should be possible to quickly narrow down on the possible solutions, because after each guess, you know exactly how far you were off by.

Has anyone ever heard of a compression algorithm similar to this or at least an idea for one, or do you know of a reason that this would be completely impossible or impractical from a number theory or computing point of view?

-Job- · May 30, 2007

It could actually be compressed into 40 bits (4 bits should be enough to encode each digit/operator)
From a theoretical perspective, efficiency aside, i don't think it would enable you to compress every arbitrary string of n bits into 1/1000 of the size or less.
I think necessarily it would have to be unable to compress some strings. The reason i say that is because, suppose your algorithm can compress any n bits into n/1000 bits or less. Then, given c bits i run your compression once to compress the c bits into c/1000 bits and then repeat the process, compressing the compressed bits, until we are left with 1 or less bits. This would mean we can compress any bit string into a single bit, and a single bit doesn't have enough states to encode an arbitrary amount of information.

Not to say you wouldn't be able to achieve very good compression, but i don't know if you could do a single static algorithm which would be able to compress all of the possible inputs by a significant factor.

q3snt · May 30, 2007

I think necessarily it would have to be unable to compress some strings. The reason i say that is because, suppose your algorithm can compress any n bits into n/1000 bits or less. Then, given c bits i run your compression once to compress the c bits into c/1000 bits and then repeat the process, compressing the compressed bits, until we are left with 1 or less bits.

Thats a good point, I might repost this in the number theory section, they might know the specific scenarios in which it wouldn't work. As far as the 1/1000 figure, that is just what it would be capable of if it got a perfect string. I really don't know what something like this would be capable of on average that's why I am asking.

Not to say you wouldn't be able to achieve very good compression, but i don't know if you could do a single static algorithm which would be able to compress all of the possible inputs by a significant factor.

I never thought that it would be done with a static algorithm, if it was possible I'm sure it would require some kind of intelligent searching algorithm and tons of optimizations.

-Job- · May 30, 2007

My feeling is that the process of finding a good compression is an NP-Complete problem. It kind of reminds me of the subset sum problem.

AlephZero · May 31, 2007

Assuming that "any number can probably be written in a mathematical form which is much shorter" is true (and I'm not sure about that) there is one problem you didn't mention.

If you are going to use "all kinds of mathematical operations" to reduce the size, then the compressed file needs to contain instructions telling which operations to use to decompress it. In other words, you are proposing to replace the number with an algorithm (a computer program, if you like) that can compute the number.

Actually, that's what existing compression algorithms do, except they use a predefined algorithm. So each file only need to store the "input data", it doesn't need to store the algorithm as well.

Clearly for some sorts of files the program could be much shorter than the number but in general it would not be. Complexity Theory (the reputable scientific sort, not Irreducible Complexity and similar pseudoscience) studies this type of problem.

q3snt · May 31, 2007

If you are going to use "all kinds of mathematical operations" to reduce the size, then the compressed file needs to contain instructions telling which operations to use to decompress it.

If it is just written in basic mathematical language like "(3^147)+2" then shouldn't that be enough? Because "(3^147)+2" that is basically mathematical instructions right there, all you would need would be a special program on the receiving computer which would basically need to be a calculator which could calculate huge numbers, to decompress it. It would be like WinZip or any other file compression program, you have a program which you need to compress and decompress the files.

chroot · May 31, 2007

1) You're not the first person to think of such things. In fact, I've thought of it independently a long time ago, and I suspect it's been around for ages.

2) The problem with the idea is that it does not scale. The values of polynomials, logarithms, etc. get further and further apart on the number line as coefficients increase. For example, 4x^2 produces the values { 4, 16, 36, 64, 100, 144, 196, 256 } for x = { 1, 2, 3, 4, 5, 6, 7, 8 }. As the file sizes get larger, it becomes increasingly difficult to find a close number.

This is a manifestion of the pigeon-hole problem. Let's say you want 1,000:1 compression. If you have files that are 10,000 bits long, then you can only have 10 bits of "mathematical expression data." (The format of this data is irrelevant.) You can only represent 2¹⁰ different input files with only 10 bits of data, but there were 2^10,000 different possible 10,000 bit files to start with. In other words, you'll only be able to achieve 1,000:1 compression on a very small number of input files.

If you can't find a close expression with ten bits of mathematical expression data, you will need to increase the amount of that data. This means you must continue adding terms to the polynomial, which means it's going to slow to a crawl. You can mitigate this somewhat by breaking the file up into blocks and compressing each block individually, but this means you reduce the compression.

3) What you're left with, after you find the closest possible expression (given some run-time limit, etc.) is a small mathematical expression and a residue. The residue is the killer. The residue is essentially noise, and is not generally going to be very compressible.

Because of the pigeon-hole problem, the residue is generally going to be almost as large as the original file -- if you are trying to create very small mathematical expressions. You'll have to keep making the expressions larger in order to shrink the residue. There will be an optimum at some point, where the sum of the lengths of the expression and residue is the smallest possible.

Is this optimum any better than existing compression algorithms? That's the question you must ask yourself.

You can probably write a simple implementation of this algorithm in Python (which has arbitrary-precision libraries) and see how it works. I don't think you'll be that impressed in the end.

- Warren

BoredNL · Jun 1, 2007

What about bits with 3 options? So you could have numbers like 010201201002? Instead of either a dot or blank space.. Have a dark dot, a medium dot, and a blank space to represent the three bits on hard drives. What about 4-bits? What if it were worked into the whole computation of everything? Real-time hardware compression. Software compression on top of this.

Also, with numbers. Why stick with a base 10 system? (numbers are base 10, they go from 0 to 9, then 10 to 19, in 10's). Hex is base 16. If you take numbers that are stored using 8-bits, then convert the number to a hex code, you will use a shorter number. If you use base 256, using all the codes possible in 8-bit storage, you will have an extremely small number. I bet much smaller than the form of compression you're talking about, and it would be much faster. This would work exceedingly well for numbers and even text files (the alphabet is base 26. Numbers + the alphabet is base 36), but not programs that use more than letters and numbers (which have base 256 available to them, but probably don't use all 256 letters, numbers, and symbols). A file could even be broken into parts. Now, if there were bits that had 3 parts, rather than 2, an even greater amount of compression would be possible. Bits right now use a base 2 system. 2*2*2*2*2*2*2*2 (8 bits) has 256 possible outcomes. 3*3*3*3*3*3*3*3 has 6561 outcomes, which is over 25 times the amount. That would be a base 6561 system of representing data. Bits with 4 possibilities would have 65536 outcomes, which is 256 times the amount. A file with nothing but numbers would be compressed to only 1/656th the size if bits had 3 possibilities. 1GB of numbers would become about 1.6mb. With bits that have 4 possibilities, 1GB of numbers would become about 164kb.

chroot · Jun 1, 2007

C'mon BoredNL, at least label your satire as such for the uninitiated.

- Warren

BoredNL · Jun 1, 2007

I want 20 bits to each bit.

Actually.

I want a quantum computer with quantum storage.

q3snt · Jun 1, 2007

Thanks for the explanation chroot, I had a feeling it would be something like that; the residue that would cause the problems, I just wanted to know for sure. With the increases in bandwidth and storage, it isn't even really worth compressing something unless it is >10 MB and this would make the scaling problem exponentially worse, I'm guessing that even with a very highly developed intelligent algorithm it would still take days to compress a large file.

Chris Hillman · Jun 8, 2007

q3snt said:

Has anyone ever heard of a compression algorithm similar to this or at least an idea for one, or do you know of a reason that this would be completely impossible or impractical from a number theory or computing point of view?

I probably simply know too much about the often complicated interrelationships between dozens of notions of entropy, but I'd say that you didn't discover an algorithm, but rather some (mostly impractical) intuition for how all compression algorithms work. Namely, by removing "redundancies" in some instance of some kind of mathematical structure. To obtain efficient compression with reasonable effort, however, looking at statistical structure rather than trying to find hidden number theoretical structure is probably a much better idea! But it is true that you don't need to study statistical structure, you could also study algorithmic structure, which turns out, surprisingly, to be closely related if less convenient (cf. Chaitin). There are many other kinds of structure--- only a week or so ago I mentioned the fact that you could study a group action structure and remove redundancies in that. But statistical structure seems to be the easiest to work with.

I'd recommend Cover and Thomas, Elements of Information Theory, as an excellent first textbook on information theory. IT is a vast subject; one reason I like this textbook is that it conveys some sense of the breadth and depth of IT.

I'd also add that your phrasing could suggest that you believe that your scheme can compress anything, which I hope you realize is not true--- no compression algorithm can compress a file which does not possesses what Shannon called "redundancy". Also be aware that there are a goodly number of cranks who claim results which directly contradict Shannon's coding theorem, so you want to be careful to avoid being lumped with that group!

Ultra-lossless compression algorithm

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Similar threads

France to ditch Windows for Linux

Sweetspot of data compression

Other than just FizzBuzz to test programmer candidates

Is This Music AI?

Help me build my server with a laptop that has a broken screen

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect