Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Idea for data compression

  1. Aug 18, 2004 #1
    I was reading the science fiction story "the gold at the starbow's end" by poul anderson (I think) which contains an idea for data compression. Poul Anderson suggested that a large number be compressed to the sum or difference of a few high powers of numbers. In the story it was something like "2506 ^ 9994 + 204 ^ 31 - 73."

    Does this really work? Why isn't it used instead of zip files and such? Would it just cost too much processing time to compress?
     
  2. jcsd
  3. Aug 18, 2004 #2

    chroot

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Nope, it just suffers from the pigeonhole problem like many other compression algorithms. The larger the bases or exponents of your numbers, the more widely spaced they are. If you try it out on some sample data (I have), it turns out that in the majority of cases the offset (the -73 in your example) tends to be almost as large as the original data.

    You can break the file up into optimal-sized chunks and apply the method to each chunk individually, but the compression is not very good.

    It happens that there are some magic numbers that compress extremely well -- perhaps one message is only one or two numbers different from 107^172 + 976^2 and will be compressed quite well -- but, in general, few messages would be so lucky.

    - Warren
     
  4. Aug 21, 2004 #3
    Is it proven that no compression scheme can on average reduce completely random static?
     
  5. Aug 21, 2004 #4

    Zurtex

    User Avatar
    Science Advisor
    Homework Helper

    I would actually rather imagine that it is very possible, chroot says that the "-73" at the end is almost as big as the original data itself. However if such an advanced compression were made that it would simply find an exponent close to this number. However this suffers from 1 of 2 possible problems, the sheer amount of processing power to work out such exponents and make them efficient or the sheer amount of data to have such exponents recorded down on to the hard drive to defence from.

    If such a processing power existed then there would no need to compress such data as it could transmit just as easily. On the other hand if such space existed to save all the data for compression program there would be no need to compress the data.

    Even if it is one day possible to create such a compression system it is at the moment unforeseeable why it would ever be 'useful'.
     
  6. Aug 22, 2004 #5
    Here's my vague understanding of why the exponent's scheme can't achieve compression on average: Each "letter" in the scheme can be either a digit 0-9, an exponent sign, a plus sign, or a minus sign. Each spot must be capable of expressing any of 13 different states, as opposed to 10 states for a simple decimal number. So there is an upper limit on how many numbers can be expressed by the exponents scheme; for a 5 "letter" exponents compression, the maximum number of representable numbers is 13^5--exactly the number of representable numbers in a base 13 number system (except the exponents scheme has many numbers it can represent in more than 1 way, so it can't represent as many as the simple base 13 system can). Just changing from one base to another obviously doesn't achieve any compression. You pay for the shorter length in the greater number of digits per spot.

    So each message of a given length can represent a finite number of numbers. If you want to compress a decimal number 100 digits long, then there has to be an equivalent expression for it in the compression scheme which is 1 through 88 digits long (because 10^100 is 89 digits long in base 13). If you sum up all the representable numbers by the scheme from 1 digit long to 88 digits long, you get a number which is smaller by an order of magnitude than the numbers representable by a 100-digit decimal expression. So even if ALL the numbers in the compression scheme figured out to 100-digit decimal numbers, which they don't, they could only cover a small portion of the total; more than 9/10ths of the decimal expressions wouldn't be compressible. Of course, the flaw in this reasoning is that if all these decimal numbers were compressible to exactly 89 digits long (and not using all of the 89 digit expressions... but nm about that), there would still be some small overall compression. I don't know, though... that seems wrong, I would really expect no overall compression.

    That makes me think of this compression scheme: It is a look-up table for 100-digit numbers. Every 100 digit number is written down and linked one-to-one to a corresponding 100-or-less digit number. To compress a 100 digit number using this scheme, you follow the look-up and write down the other number. Most of the other numbers will be 100 digits long but a tenth of them will be shorter... so compression is seemingly achieved overall. The one hitch is that the "stop" operator--signifying the end of a number--might have to be counted as a digit. But what would prevent someone from using this scheme over and over, 40 or 50 times, to compress shorter and shorter? Like--compress the original number, then compress its compressed form, then compress THAT form, and so on, using other tables for numbers from 1 to 99 digits. (besides the practical considerations of working with lookup tables for 100 digits numbers :p) There is obviously something wrong with this reasoning, because if you did it over and over again it would seem you could compress it as much as you wanted, until every 100-digit number were compressed to a 1-digit number!! Probably it's the possibility of having more than 1 input to a look-up number; like, after 2 iterations, you got a 100-digit number that wants to compress to a 34-digit number, but you also have a different 34-digit number that wants to cycle to the same 34-digit number that the 100-digit number needs.

    This is all kind of slipshod reasoning though, and it doesn't get at the main question: CAN a scheme achieve compression on average, or is it impossible?
     
  7. Aug 22, 2004 #6

    Zurtex

    User Avatar
    Science Advisor
    Homework Helper

    Bartholomew I don't mean to be offensive but when reading your post I can't help thinking you don't understand very well how data is currently stored on computers. So please excuse my rudeness if you already understand everything in this explanation. Everything on computers is stored in base 2 and will always continue to be stored in base 2 until there is a revolution in computer technology (there a few on the horizon that look very promising). As it stands to store text we use ASCII which consists of 256 characters, this is a relatively efficent system and computers have no trouble converting the 8 digit 2 base number into symbols on the screen.

    Computer programmers often like to view things in base 16 as apposed to base 2, as it is easier to comprehend when viewing. In ASCII all characters can be represented as a 2 digit base 16 number, computers treat these numbers as a series of seperate numbers. The exponent compression system is such that treats the entire data as one number, this means that numbers get very big very quickly, for example I'm half way through writting a short and it is currently 22'545 chracters long (which is 45'090 base 16 digits long). This is just text and to store the data it becomes even bigger as you need to save font, italics, underlines, dates of save etc. If you had a compression system like this it would need to be able to cope with files that would be atleast 2'147'483'648 base 16 digits long.

    Dealing which such huge numbers means that all the normal rules of efficent storing of data go out the window. Furthermore more working out the multiples of the number alone would take years on a normal home computer. So working out efficent exponants would take such huge amounts of processing power by the time we had the power we would be dealing with files much bigger and finding that files of that age suffer from the same problem.
     
  8. Aug 22, 2004 #7

    chroot

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Zurtex,

    It is abolsutely technically feasible. Using a standard arbitrary-precision arithmetic package, you can build a system like this in a few minutes. I have already described the problems with it. I've built one of these compression algorithms, and seen for myself what the problems are with it.

    And Zurtex, you are quite underestimating the computational power of your computer!

    Bartholomew,

    The system you propose to represent common 100-digit strings with smaller strings is a variation of a very popular scheme called Huffman coding. Huffman coding is used, well, practically everywhere already.

    - Warren
     
  9. Aug 22, 2004 #8

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    It is impossible.

    However, we're usually not concerned about compressiong random data: we're concerned about compressing the types of data we generate.


    Incidentally, I'm only talking about file size; most file systems waste space, so you can achieve compression on average by eliminating this space.
     
  10. Aug 22, 2004 #9

    Zurtex

    User Avatar
    Science Advisor
    Homework Helper

    I am?
    Then why does it take 2 weeks or more for a home computer to work out whether a number is a prime number or not when it is within the region of 7'000'000 digits long?

    I find it hard to believe that such an algorithm could be created to deal with numbers greater than 2’000’000’000 digits long and work quickly on a computer if it was designed to find the most efficient way of expressing the data.
     
  11. Aug 22, 2004 #10

    Alkatran

    User Avatar
    Science Advisor
    Homework Helper

    I wrote a program to find prime numbers on my computer. It checked each succesive number against all previous prime numbers, etc etc. Not very advanced but workable. It reaches 5000 (the value being checked, not the amount of numbers found) within a few SECONDS. However, to reach a million could take ten minutes.

    Also, you need to remember that even though your "short" is stored so that it is in ASCII (assuming your text editing program does that) it can still be read in 7 bit chunks, or 3 bits. It's entirely up to the programmer for what means what. If you limit yourself to 8 bits, you may miss possibilities.

    As I understand it now, compression tends to work on larger files better than smaller ones. Your standard compression searches through the file and creates a tree of common occurences. It then saves a file with the tree at the front, and the file (saved according to the tree, so some large patterns are reduced but some short ones are enlarged) afterwards. So once you get to very low file sizes the tree takes up more space than the file and you shouldn't bother.
     
  12. Aug 22, 2004 #11

    Zurtex

    User Avatar
    Science Advisor
    Homework Helper

    Yes I've created programs that will check if a number is a prime within 2 seconds, but that only goes up to about 10 digits.
     
  13. Aug 22, 2004 #12

    chroot

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Who said anything about prime numbers? The exponential expressions certainly don't need to have prime numbers in them, they can work with any numbers.

    - Warren
     
  14. Aug 22, 2004 #13

    Alkatran

    User Avatar
    Science Advisor
    Homework Helper

    The prime numbers came up as an example of computer speed, to show that they can do a lot very quickly, given the right technique.

    I deleted my above post because it didn't advance the post.
     
  15. Aug 22, 2004 #14

    Zurtex

    User Avatar
    Science Advisor
    Homework Helper

    Ahh forget it, chroot you most likely know more than me on the subject I'm sure your right.
     
  16. Aug 22, 2004 #15

    chroot

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Zurtex,

    I don't expect you to forget it just because I say so. :) Download an arbitrary-precision arithmetic package, like bignum, and play around with it. It's fun, actually -- too bad it doesn't really seem to work well as a compression algorithm. :(

    - Warren
     
  17. Aug 22, 2004 #16
    Well, the reason checking for a large prime takes a long time is that it takes O(n^2) time for a number of size n. Simple arithmetic operations like doubling a number or subtracting another number from it are generally O(log n) because you're dealing with digits on their own rather than whole numbers, so it's much, much quicker. A number 2 billion digits long would take gigabytes of disk space, so you wouldn't work with it anyway; 200 million digits long is more like it at most. But you could probably do simple processing on a 2 billion digit number in a few hours rather than weeks with a home computer; at least, that's how long it takes to defrag the hard drive, and defragging the hard drive is probably O(log n) where n is not the simple number of bytes but the number representable by those bytes.

    I know how data is represented in a computer--what I was talking about with "base 10" or "base 13" would not be for computer use. Data is represented in base 2 in a computer only because it is easy to deal with base 2 through circuitry. If there were some simple physical construction that could fundamentally occupy 1 of 13 states, then computers might just as easily be base 13 instead of base 2. The use of base 2 is a result of physical reality, not abstract math--at least, so I think. Maybe there are also mathematical reasons--which I don't know about--for using base 2 to represent numbers. Anyone know?

    I understand why the 100-digit decimal "compression" scheme "works": the de-compressor already knows the useful fact that the number is 100 digits long, so he can pinpoint it to an order of magnitude. It makes sense that compression should be achieved 10% of the time then. It's not a practical scheme because when do you ever know exactly how long the message you're receiving is? (and learning the file-length from a website doesn't count because the file-length number is several digits long and would therefore outweigh the compression in the vast majority of cases). The compression would only be 1 digit's worth in 90% of cases that it's achieved at all.

    Chroot, I looked it up, and I see what you mean about it being similar to huffman coding--with the big difference that huffman coding is practical and makes use of some previous order to the data, whereas my scheme isn't and doesn't.

    Hurkyl, are you SURE compression can't be achieved on average? Do you know who proved that or a summary of the proof if it's simple? I'm sure it wouldn't be too hard to prove if you're only looking at schemes where the range of values for each digit is the same through the message or fixed by the position of the digit, but if the range of possible values for each digit depends on what other digits are (like if digit 3 could be 0-8 unless the first 2 digits are 1 and 2, in which case digit 3 can be 0-9), it seems to me it might be tricky because of difficulty in counting the total number of possible states. I lean towards believing you, but what is the proof of it? The reason I still have some doubt is that a series of digits signifying various things has some inherent structure to it, just because we know it's a series of digits. Maybe that structure can be taken advantage of in a compression scheme which is itself a series of digits, and maybe (probably) it can't.
     
  18. Aug 22, 2004 #17

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Actually, the time complexity of testing numbers for primality is polynomial in the number of digits, not in the size of the number.

    Probabilistic algorithms have been known for a while, and the AKS test is deterministic and runs in just over O((ln n)^6) time; that is the 6th power of the number of digits.



    Let me amend my statement by saying that it depends on what you mean by "on average". For example, one can write a scheme where there are two files that each get reduced in size by one byte, and one file that gets increased in size by two bytes. (everything else remains the same size)

    But the other way of looking at it is that the set of all files of size N or less cannot be fit into the set of all files of size N-1 or less... at best the compression scheme will take functions of size N or less... which means the average compression cannot be positive.
     
  19. Aug 22, 2004 #18
    I stand greatly corrected about the prime algorithm. Still, it does take more time than simple arithmetic operations.

    The file-size way of looking at things makes sense, but exactly how is the file size determined when the size of each digit depends on the digits before it? For example, say the first digit can be 0-9. Then multiply whatever digit that was by 3 and take the ones place; the next digit can be 0-(whatever that ones place digit was). Keep going like that so that the possible range of each digit depends on the ones place of 3x the digit before it--and then you have a file by that method which is 150 digits long, so how do you measure the size of that file?

    Edit: make that the upper limit of the range of each digit is the one's place of 3x the previous digit PLUS ONE so you don't hit 0 and then stop.
     
    Last edited: Aug 22, 2004
  20. Aug 22, 2004 #19

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Note that the only thing my argument assumed about the method of getting the size of a file is that, for a given N, there are only a finite number of files of size N or less.
     
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook

Have something to add?



Similar Discussions: Idea for data compression
  1. Data analysis (Replies: 2)

  2. Data Smoothing (Replies: 4)

  3. Is this idea correct? (Replies: 6)

Loading...