- #1

- Thread starter Bartholomew

Does this really work? Why isn't it used instead of zip files and such? Would it just cost too much processing time to compress?

- #2

chroot

Staff Emeritus

Science Advisor

Gold Member


You can break the file up into optimal-sized chunks and apply the method to each chunk individually, but the compression is not very good.

It happens that there are some magic numbers that compress extremely well -- perhaps one message is only one or two numbers different from 107^172 + 976^2 and will be compressed quite well -- but, in general, few messages would be so lucky.

- Warren
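chroot's point can be made concrete with a toy search (my own sketch, not a real codec; `best_power_fit` and its bounds are invented for the example): try to write n as a^b + delta, and compare the cost of storing the triple (a, b, delta) against the cost of storing n directly.

```python
# Toy illustration: search for the perfect power a^b closest below n,
# and see whether the triple (a, b, delta) is shorter than n itself.

def best_power_fit(n, max_base=1000, max_exp=200):
    """Search a^b <= n over small bases and exponents, minimizing delta = n - a^b."""
    best = (0, 0, n)                      # (a, b, delta)
    for a in range(2, max_base + 1):
        b, p = 1, a
        while p <= n and b <= max_exp:
            if n - p < best[2]:
                best = (a, b, n - p)
            p *= a
            b += 1
    return best

def digits(n):
    return len(str(n)) if n else 1

lucky = 107**172 + 5                      # almost exactly a perfect power
typical = 314159265358979323846264338327950288419716939937510582097494

for n in (lucky, typical):
    a, b, d = best_power_fit(n)
    cost = digits(a) + digits(b) + digits(d)   # crude size of (a, b, delta)
    print(digits(n), "digits ->", cost, "digits")
```

The number built from 107^172 collapses to a handful of digits, but for the arbitrary 60-digit number the leftover delta is nearly as long as the number itself: the "few messages would be so lucky" problem.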

- #3


Is it proven that no compression scheme can, on average, reduce completely random static?

- #4

Zurtex

Science Advisor

Homework Helper


If such processing power existed, there would be no need to compress the data, as it could be transmitted just as easily. On the other hand, if the space existed to store all the data the compression program needs, there would be no need to compress the data at all.

Even if it is one day possible to create such a compression system, it is at the moment unforeseeable why it would ever be 'useful'.

- #5


So each message of a given length can represent only a finite number of numbers. If you want to compress a decimal number 100 digits long, then there has to be an equivalent expression for it in the compression scheme which is 1 through 89 digits long (since writing 10^100 in base 13 takes about 90 digits). If you sum up all the numbers representable by the scheme from 1 digit long to 89 digits long, you get a total several times smaller than the number of 100-digit decimal expressions. So even if ALL the numbers in the compression scheme corresponded to 100-digit decimal numbers, which they don't, they could only cover a fraction of the total; the large majority of the decimal expressions wouldn't be compressible. Of course, the flaw in this reasoning is that if all those remaining decimal numbers were compressible to exactly 90 digits (and not even using all of the 90-digit expressions... but never mind that), there would still be some small overall compression. I don't know, though... that seems wrong; I would really expect no overall compression.
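The counting in the paragraph above can be checked with exact integer arithmetic (a quick sketch following the base-13 example in the post):

```python
# How many base-13 strings are strictly shorter than the base-13
# representation of a 100-digit decimal number?

import math

decimal_messages = 10**100                          # 100-digit decimal strings (leading zeros allowed)
full_len = math.ceil(100 * math.log(10, 13))        # base-13 digits needed for 10^100
short_codes = sum(13**L for L in range(1, full_len))  # all strictly shorter base-13 strings

print(full_len)                              # 90
print(short_codes < decimal_messages)        # True: too few short codes
print(decimal_messages // short_codes)       # each short code would have to cover several messages
```

So only a minority of the 100-digit messages could possibly map to a strictly shorter base-13 string, which is the pigeonhole obstruction in exact numbers.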

That makes me think of this compression scheme: a look-up table for 100-digit numbers. Every 100-digit number is written down and linked one-to-one to a corresponding number of 100 or fewer digits. To compress a 100-digit number using this scheme, you follow the look-up and write down the other number. Most of the other numbers will be 100 digits long, but a tenth of them will be shorter... so compression is seemingly achieved overall. The one hitch is that the "stop" operator--signifying the end of a number--might have to be counted as a digit.

But what would prevent someone from using this scheme over and over, 40 or 50 times, to compress shorter and shorter? Like--compress the original number, then compress its compressed form, then compress THAT form, and so on, using other tables for numbers from 1 to 99 digits (besides the practical considerations of working with look-up tables for 100-digit numbers :p). There is obviously something wrong with this reasoning, because if you did it over and over again, it seems you could compress as much as you wanted, until every 100-digit number were compressed to a 1-digit number!! Probably the problem is having more than one input mapping to the same look-up number: say after 2 iterations you have a 100-digit number that wants to compress to a certain 34-digit number, but a different 34-digit number also wants to map to that same 34-digit number.

This is all kind of slipshod reasoning though, and it doesn't get at the main question: CAN a scheme achieve compression on average, or is it impossible?

- #6

Zurtex


Computer programmers often like to view things in base 16 as opposed to base 2, as it is easier to comprehend when viewing. In ASCII, every character can be represented as a 2-digit base-16 number, and computers treat these as a series of separate numbers. The exponent compression system instead treats the entire data as one number, which means the numbers get very big very quickly. For example, I'm halfway through writing a short and it is currently 22,545 characters long (which is 45,090 base-16 digits). That is just the text; the stored file is even bigger, since you also need to save fonts, italics, underlines, dates of save, etc. A compression system like this would need to cope with files that could be at least 2,147,483,648 base-16 digits long.

Dealing with such huge numbers means that all the normal rules of efficient data storage go out the window. Furthermore, merely working out the factors of the number would take years on a normal home computer. So finding efficient exponents would take such huge amounts of processing power that by the time we had the power, we would be dealing with much bigger files, which would suffer from the same problem.
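The size estimates above are easy to reproduce by treating a piece of text as one big integer (a toy sketch; the sample string is invented, and any ASCII text behaves the same way):

```python
# Treat text as one huge integer: each ASCII byte contributes two
# base-16 digits, so the number grows very fast.

text = "Hello, world! " * 1610          # 22,540 characters, near the post's example
n = int.from_bytes(text.encode("ascii"), "big")

print(len(text))                        # characters
print(len(f"{n:x}"))                    # base-16 digits: exactly 2 per byte here
print(n.bit_length())                   # bits in the integer
```

Python's arbitrary-precision integers handle this without complaint, but the point stands that any arithmetic search over such a number is working at the scale of tens of thousands of digits.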

- #7

chroot


It is absolutely technically feasible. Using a standard arbitrary-precision arithmetic package, you can build a system like this in a few minutes. I have already described the problems with it. I've built one of these compression algorithms, and seen for myself what the problems are with it.

And Zurtex, you are quite underestimating the computational power of your computer!

Bartholomew,

The system you propose to represent common 100-digit strings with smaller strings is a variation of a very popular scheme called Huffman coding. Huffman coding is used, well, practically everywhere already.

- Warren

- #8

Hurkyl

Staff Emeritus

Science Advisor

Gold Member


Bartholomew said:CAN a scheme achieve compression on average, or is it impossible?

It is impossible.

However, we're usually not concerned about compressing random data: we're concerned about compressing the types of data we generate.

Incidentally, I'm only talking about file size; most file systems waste space, so the savings you see on disk may be smaller than the file sizes suggest.

- #9

Zurtex


chroot said:It is absolutely technically feasible. Using a standard arbitrary-precision arithmetic package, you can build a system like this in a few minutes. I have already described the problems with it. I've built one of these compression algorithms, and seen for myself what the problems are with it.

And Zurtex, you are quite underestimating the computational power of your computer!

- Warren

I am? Then why does it take 2 weeks or more for a home computer to work out whether a number is prime when it is in the region of 7,000,000 digits long?

I find it hard to believe that such an algorithm could be created to deal with numbers greater than 2,000,000,000 digits long and work quickly on a computer, if it was designed to find the most efficient way of expressing the data.

- #10

Alkatran

Science Advisor

Homework Helper


Zurtex said:I am? Then why does it take 2 weeks or more for a home computer to work out whether a number is prime when it is in the region of 7,000,000 digits long?

I find it hard to believe that such an algorithm could be created to deal with numbers greater than 2,000,000,000 digits long and work quickly on a computer, if it was designed to find the most efficient way of expressing the data.

I wrote a program to find prime numbers on my computer. It checked each successive number against all previous prime numbers, etc. Not very advanced, but workable. It reaches 5000 (the value being checked, not the amount of numbers found) within a few SECONDS. However, to reach a million could take ten minutes.

Also, you need to remember that even though your "short" is stored in ASCII (assuming your text editor does that), it can still be read in 7-bit chunks, or 3-bit chunks. It's entirely up to the programmer what means what. If you limit yourself to 8 bits, you may miss possibilities.

As I understand it now, compression tends to work better on larger files than on smaller ones. Your standard compressor searches through the file and builds a tree of common occurrences. It then saves a file with the tree at the front, and the encoded data (saved according to the tree, so some large patterns are reduced but some short ones are enlarged) afterwards. So once you get down to very small file sizes, the tree takes up more space than the file and you shouldn't bother.
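The tree-at-the-front scheme described above is essentially Huffman coding; here is a minimal sketch (my own toy version, with an invented, crude header format) showing the body shrinking while the header stays fixed:

```python
# Minimal Huffman coder sketch: build a tree from byte frequencies,
# derive a prefix-free code, and compare encoded size against raw size.

import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict:
    """Map each byte value to a bit-string; common bytes get shorter codes."""
    freq = Counter(data)
    if len(freq) == 1:                      # degenerate one-symbol case
        return {next(iter(freq)): "0"}
    # heap of (frequency, tiebreak, tree); a tree is a byte or a (left, right) pair
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

data = b"the quick brown fox jumps over the lazy dog " * 50
code = huffman_code(data)
body_bits = sum(len(code[b]) for b in data)
header_bits = len(code) * (8 + 8)    # crude header: one (byte, code-length) pair per symbol
print(len(data) * 8, "->", header_bits + body_bits)
```

For a file of only a few dozen bytes the fixed header dominates and the "compressed" file comes out larger, which is exactly the small-file effect described above.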

- #11

Zurtex


- #12

chroot


- Warren

- #13

Alkatran


I deleted my above post because it didn't advance the thread.

- #14

Zurtex


Ahh, forget it. chroot, you most likely know more than me on the subject; I'm sure you're right.

- #15

chroot


I don't expect you to forget it just because I say so. :) Download an arbitrary-precision arithmetic package, like bignum, and play around with it. It's fun, actually -- too bad it doesn't really seem to work well as a compression algorithm. :(

- Warren

- #16


I know how data is represented in a computer--what I was talking about with "base 10" or "base 13" would not be for computer use. Data is represented in base 2 in a computer only because it is easy to deal with base 2 through circuitry. If there were some simple physical construction that could fundamentally occupy 1 of 13 states, then computers might just as easily be base 13 instead of base 2. The use of base 2 is a result of physical reality, not abstract math--at least, so I think. Maybe there are also mathematical reasons--which I don't know about--for using base 2 to represent numbers. Anyone know?

I understand why the 100-digit decimal "compression" scheme "works": the de-compressor already knows the useful fact that the number is 100 digits long, so he can pinpoint it to an order of magnitude. It makes sense that compression should be achieved 10% of the time then. It's not a practical scheme because when do you ever know exactly how long the message you're receiving is? (and learning the file-length from a website doesn't count because the file-length number is several digits long and would therefore outweigh the compression in the vast majority of cases). The compression would only be 1 digit's worth in 90% of cases that it's achieved at all.

Chroot, I looked it up, and I see what you mean about it being similar to huffman coding--with the big difference that huffman coding is practical and makes use of some previous order to the data, whereas my scheme isn't and doesn't.

Hurkyl, are you SURE compression can't be achieved on average? Do you know who proved that or a summary of the proof if it's simple? I'm sure it wouldn't be too hard to prove if you're only looking at schemes where the range of values for each digit is the same through the message or fixed by the position of the digit, but if the range of possible values for each digit depends on what other digits are (like if digit 3 could be 0-8 unless the first 2 digits are 1 and 2, in which case digit 3 can be 0-9), it seems to me it might be tricky because of difficulty in counting the total number of possible states. I lean towards believing you, but what is the proof of it? The reason I still have some doubt is that a series of digits signifying various things has some inherent structure to it, just because we know it's a series of digits. Maybe that structure can be taken advantage of in a compression scheme which is itself a series of digits, and maybe (probably) it can't.

- #17

Hurkyl


Probabilistic algorithms have been known for a while, and the AKS test is deterministic and runs in just over O((ln n)^6) time; that is the 6th power of the number of digits.
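For reference, a sketch of the probabilistic kind of test mentioned above: Miller-Rabin (a standard algorithm, not code from this thread). Each round catches a composite with probability at least 3/4 and never rejects a prime, so a few dozen rounds make the answer reliable, and each round costs only a modular exponentiation.

```python
# Miller-Rabin probabilistic primality test: polynomial in the number
# of digits, unlike trial division.

import random

def is_probable_prime(n, rounds=20):
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):       # quick trial division by small primes
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:                    # write n - 1 = d * 2^s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)   # random witness
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                 # this witness proves n composite
    return True

print(is_probable_prime(2**127 - 1))     # a known Mersenne prime; prints True
```

Even on numbers with thousands of digits this runs in seconds on a home computer, which is why the record-size primality searches are feasible at all.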

Bartholomew said:Hurkyl, are you SURE compression can't be achieved on average?

Let me amend my statement by saying that it depends on what you mean by "on average". For example, one can write a scheme where two files each get reduced in size by one byte, and one file gets increased in size by two bytes (everything else stays the same size).

But the other way of looking at it is that the set of all files of size N or less cannot fit into the set of all files of size N-1 or less... at best a compression scheme can map files of size N or less onto files of size N or less, which means the average compression cannot be positive.
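The counting in the paragraph above can be verified exactly (a tiny sketch):

```python
# There are 2^(N+1) - 1 bit-strings of length N or less (counting the
# empty string), so they can never fit injectively into the strings of
# length N-1 or less.

def strings_up_to(n):
    return sum(2**k for k in range(n + 1))   # lengths 0..n

for n in (1, 4, 8, 16):
    print(n, strings_up_to(n), strings_up_to(n - 1))
```

Since strings_up_to(N) is more than double strings_up_to(N-1), over half of all files of size N bits or less cannot shrink by even one bit, no matter the scheme.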

- #18


I stand greatly corrected about the prime algorithm. Still, it does take more time than simple arithmetic operations.

The file-size way of looking at things makes sense, but exactly how is the file size determined when the size of each digit depends on the digits before it? For example, say the first digit can be 0-9. Then multiply whatever digit that was by 3 and take the ones place; the next digit can be 0-(whatever that ones place digit was). Keep going like that so that the possible range of each digit depends on the ones place of 3x the digit before it--and then you have a file by that method which is 150 digits long, so how do you measure the size of that file?

Edit: make that the upper limit of the range of each digit is the one's place of 3x the previous digit PLUS ONE so you don't hit 0 and then stop.

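One standard way to answer the question above (a sketch I'm adding, not from the thread): count how many distinct files the rule allows at a given length, and call log2 of that count the size in bits. The code uses the amended rule from the post: first digit 0-9, and thereafter each digit ranges from 0 up to (ones place of 3 times the previous digit) plus one.

```python
# Measure the "size" of a variable-range digit file by counting how many
# distinct files of that length exist, then taking log2 of the count.

import math

def count_files(length):
    counts = {d: 1 for d in range(10)}       # counts[d] = valid prefixes ending in digit d
    for _ in range(length - 1):
        nxt = {}
        for d, c in counts.items():
            limit = (3 * d) % 10 + 1         # upper limit for the next "digit"
            for e in range(limit + 1):       # note: a "digit" here can reach 10
                nxt[e] = nxt.get(e, 0) + c
        counts = nxt
    return sum(counts.values())

print(math.log2(count_files(150)))           # the 150-digit file's size in bits
```

By this measure the file's size is determined by the information it can carry, not by how many marks it takes to write down, which sidesteps the ambiguity of digits with shifting ranges.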

- #19

Hurkyl

