An algorithm for data compression?

In summary, the conversation discusses the concept of a generalized data compression algorithm and its potential applications in science and human cognition. The speaker is seeking a fundamental algorithm that can efficiently compress any compressible string of data, but acknowledges that it may be difficult to find one that can handle all forms of data. They suggest using a string of integers as input and discuss the need for criteria for the algorithm to halt and provide an answer. The conversation also touches on the subjectivity of information content in different contexts.
  • #1
A_Seagull

I am coming from a philosophy and science background and I'm particularly interested in finding out what a generalised data compression algorithm would look like. (By algorithm I mean something like a flowchart that would show how this process could be done).

Effectively, much of science is about data compression. For example 'F=ma' is a means of greatly compressing data concerning an object moving in a force field.

Also, one of the brain's intuitive processes is data compression. For example, it can take a multitude of sense-data and compress it into the simple string 'tree'.

So I am interested in how this could be achieved at a fundamental level. I am seeking a generalised algorithm that could, in principle and at a much later date, be coded into a computer language and run on a computer.

I have had several attempts myself to create such an algorithm but without much success.

I suspect that it may be very hard to find an entirely generalised algorithm that could efficiently compress any compressible string of data. Instead, it may be necessary to start with some form of 'template' to initiate the data compression process.

So for the purpose of the exercise, I suggest that one consider as input a string of integers. This input string could be considered to be infinite in length, but for the purpose of the exercise some cut-off would undoubtedly be necessary.

The output of the algorithm would contain some brief description of the 'pattern' that had been found and a process for recreating the original data string.

Clearly the input data cannot be random, otherwise no data compression would be possible. It must contain some form of 'pattern', even if it is only an approximate pattern.

Perhaps initially it can be assumed that the data can be described exactly by a pattern, in other words that there are no 'errors' in the data. (However, subsequently it would be interesting to see how an algorithm could cope with some data points not fitting the pattern perfectly, as this would be more 'realistic'.)

There would also need to be some provision for criteria for the algorithm to halt and produce its answer.

There may already be in existence some such algorithm, but if so I have not uncovered it.

Any suggestions? Any solutions?
 
  • #3
Hi,
I'm not sure if I follow, but it seems to me that you're asking what abstraction is, at its most fundamental. As such, the question is either too vague or too open; you're unlikely to get an answer when either (a) the question is not a precise one, or (b) nobody really knows the answer.

If it helps, data compression algorithms are about re-coding a message, in such a way that its information content is either only minimally lost, or not at all. But I don't see how this applies to the act of modelling nature with physical laws and concisely expressing these laws using symbols. "Information content" is a precisely defined quantity in the former case, but largely subjective in the latter.

Also, "patterns" for integer sequences are not really a good model, as there is an indefinite number of different patterns that could give you the same sequence; only to the human brain one of these patterns is the obvious choice. For example, the sequence 1, 2, 4, 8, 16, 32, ... has the "obvious" pattern [itex]2^n[/itex], but it could also be reproduced by a 6th-degree polynomial (**)... which will fail to give 64 as the next number. But 64 was *not* in the problem statement. Stating that 64 *must* be the next number is prejudicing yourself against a particular pattern.

My 2 cents.

P.S.: (**) I was improvising when I said "6th-degree polynomial". Then I went to some software and found the following 5th-degree polynomial that does the trick exactly. Just feed values of n=0,1,2,3,4,5 to [tex]\frac{n^5}{120} - \frac{n^4}{24} + \frac{5 n^3}{24} + \frac{n^2}{24} + \frac{47 n}{60} + 1[/tex] For n=6 it will give 63. :)
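If anyone wants to check this without trusting my algebra, here is a quick verification in Python, using exact rational arithmetic so there are no rounding doubts:
Code:
# Evaluate the polynomial above with exact fractions and confirm the values.
from fractions import Fraction

def p(n):
    n = Fraction(n)
    return n**5/120 - n**4/24 + 5*n**3/24 + n**2/24 + Fraction(47, 60)*n + 1

print([int(p(n)) for n in range(7)])  # [1, 2, 4, 8, 16, 32, 63]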
 
  • #4
dodo said:
Hi,
I'm not sure if I follow, but it seems to me that you're asking what abstraction is, at its most fundamental. As such, the question is either too vague or too open; you're unlikely to get an answer when either (a) the question is not a precise one, or (b) nobody really knows the answer.



If it helps, data compression algorithms are about re-coding a message, in such a way that its information content is either only minimally lost, or not at all. But I don't see how this applies to the act of modelling nature with physical laws and concisely expressing these laws using symbols. "Information content" is a precisely defined quantity in the former case, but largely subjective in the latter.

Also, "patterns" for integer sequences are not really a good model, as there is an indefinite number of different patterns that could give you the same sequence; only to the human brain one of these patterns is the obvious choice. For example, the sequence 1, 2, 4, 8, 16, 32, ... has the "obvious" pattern 2^n, but it could also be reproduced by a 6th-degree polynomial... which will fail to give 64 as the next number. But 64 was *not* in the problem statement. Stating that 64 *must* be the next number is prejudicing yourself against a particular pattern.

My 2 cents.


Regarding (a): I can certainly make the question more precise if you like - see below, following your example.

Regarding (b): That is why I am posting this thread - to get some indication of whether there is an answer!

Regarding ""Information content" is a precisely defined quantity in the former case, but largely subjective in the latter." I presume you have 'former' and 'latter' the wrong way around. With regard to the 'loss of information' I do not think it is entirely 'subjective' and the point is that the algorithm would have to set parameters that would determine an estimate for that data loss so that two possible forms of data compression could be compared and the 'better' one selected.

Regarding your example of 1,2,4,8,16,32...: for that finite sequence there may well be more than one form of data compression. But which one is the 'best'? If both 'theories' can reproduce the data equally well, then the separation between them as to which one is 'better' would come down to which one can be stated in the smaller amount of data.

Also, considering your sequence, I am certainly not suggesting that '64' would be the next integer. Far from it. It could be '95', and the following one '128'. In which case the algorithm might consider that '95' is somewhat erroneous, and still find the pattern 2^n.

Also, if an algorithm could find an appropriate 'pattern' for that sequence of data, would that same algorithm be able to compress the data sequence '1,3,5,7,9,11,...'?
 
  • #5
A_Seagull said:
I presume you have 'former' and 'latter' the wrong way around.
[strike]You're right[/strike] I don't really know, sorry about that one. (English is not my first language.)

A_Seagull said:
Also if an algorithm could find an appropriate 'pattern' to that sequence of data could that same algorithm be able to compress the data sequence : '1,3,5,7,9,11,...'?
I used a combination of an "approximate" algorithm, and some hand tweaking. But you could try this yourself. See this webpage, for example; in the text box, write two columns of numbers: the left column with the integers 0,1,2,3,4,5, and the second with the numbers in your sequence. Like this:
Code:
0 1
1 3
2 5
3 7
4 9
5 11
and choose a degree for the polynomial one less than the number of integers that you have (5 for these 6 points), which gives an exact fit. Then see and interpret the results as desired, and try a smaller degree when appropriate. While this method is intended to produce an approximation, it may give you some ideas; a programmatic version of the same fit is sketched below.
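For instance, a minimal version of the same fit in Python (assuming numpy is available):
Code:
# Least-squares polynomial fit; degree 5 through 6 points is an exact interpolation.
import numpy as np

x = np.arange(6)
y = np.array([1, 2, 4, 8, 16, 32])
coeffs = np.polyfit(x, y, 5)                       # highest power first
print(np.round(np.polyval(coeffs, np.arange(7))))  # note: 63 at n=6, not 64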

Another method that comes to mind is Newton's finite differences. Think of taking the successive differences of the integers in your sequence; for 1,3,5,7,9,11 you are done at the first step, as all of them differ by 2, and you have a pattern immediately. If not, take successive differences of the differences you obtained in the step before, and repeat until you obtain a constant difference. (The algorithm terminates because you always have one difference fewer than the numbers you started with; in the worst case you end up with a last iteration subtracting only two numbers and giving only one difference.) Then imagine a way to work backwards from each step, to go from the differences (and the initial number) back to the previous step. The method can be made more elaborate than what I wrote, but this should be enough to keep you toying for a while and gaining some insight before consulting the full mathematical treatment. A rough sketch follows.
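Here is the differencing step in Python, for whoever wants to play with it:
Code:
# Keep taking successive differences until a row is constant (or length 1);
# the rows below the first are the 'compressed' description of the sequence.
def difference_table(seq):
    rows = [list(seq)]
    while len(rows[-1]) > 1 and len(set(rows[-1])) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows

for row in difference_table([1, 3, 5, 7, 9, 11]):
    print(row)
# [1, 3, 5, 7, 9, 11]
# [2, 2, 2, 2, 2]
# To reconstruct, keep only the first element of each row plus the constant.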
 
  • #6
In the case of hardware compression, algorithms like LZ77 can be implemented using a content addressable memory to "search" for previous occurrences of a symbol (plus a single bit used to keep track of previous matches in a set of strings in the "sliding window" used by LZ77) in one content addressable memory cycle. This greatly speeds up the process compared to using a hash table or doing a linear search.
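For illustration, here is a minimal software sketch of the LZ77 idea, with a plain linear scan standing in for the content addressable memory (real implementations are considerably more careful):
Code:
# Minimal LZ77-style compression: for each position, find the longest match
# in the sliding window and emit an (offset, length, next_symbol) triple.
def lz77_compress(data, window=255):
    i, out = 0, []
    while i < len(data):
        best_off = best_len = 0
        for j in range(max(0, i - window), i):
            k = 0
            # Overlapping matches are allowed, as in real LZ77.
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

print(lz77_compress("abcabcabcd"))
# [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 6, 'd')]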
 
  • #7
A generalized algorithm would have to understand a generic way to take in an arbitrary amount of data and find a format (i.e. think linguistic representation) that minimizes its Kolmogorov complexity, with the criterion that the representation has a bijection to the original data set.

This would mean that you would, in essence, have to understand how to arbitrarily transform your source data into a language (or basis) that minimizes the complexity, at an extremely high and general level, for any representation.

This is actually what all compression algorithms do: both the lossless and the lossy.

The lossy algorithms typically use integral transforms to project the signal data onto some basis, compute the coefficients, and store these rather than the raw data. The result is a smooth or blurry output with some compression.

The lossless ones do the same thing, except that the transforms used are completely invertible (bijective), so the original data can be recovered exactly.

In the lossy ones there is typically some quantization constraint, which means that the coefficients are stored only to some fixed resolution, so information will be lost at some point.
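A toy version of that transform-and-quantize pipeline, using a DCT as the basis (the choice of basis, signal, and quantization step are all arbitrary here, just for illustration):
Code:
# Transform coding in miniature: project onto a DCT basis, store the
# coefficients only to a fixed resolution, and reconstruct (lossy).
import numpy as np
from scipy.fft import dct, idct

signal = np.sin(np.linspace(0, 4 * np.pi, 64))   # a smooth test signal
coeffs = dct(signal, norm='ortho')               # project onto the basis
quantized = np.round(coeffs * 8) / 8             # fixed-resolution storage
restored = idct(quantized, norm='ortho')         # smooth, slightly blurred output

print(np.count_nonzero(quantized), "of 64 coefficients survive quantization")
print("max reconstruction error:", np.abs(signal - restored).max())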

The algorithms aim at finding schemes which minimize the size and maximize the quality; to do this for video and audio, you need to understand how the eyes and ears work.
 
  • #8
Thank you to Steamking, dodo, rcgldr and chiro for your links and comments.

Clearly the subject is hugely complex. It would seem that a universal algorithm for data compression on all types of data strings would have to be hugely complex itself, and could only 'justify' that complexity by having a huge amount of varied data to compress. (An example of this 'justification' could be the human brain and its interaction with the exterior world.)

But what I am looking for at this stage is a much simpler algorithm which, while necessarily not universal in its ability to compress all manner of data, would at least be able to compress some.

In this respect LZ77 is of particular interest and is of a form that I had not encountered before. It is very simple and does not require a finite string. However, it is limited: it could not compress the sequence 1,2,3,4,... but would instead convert it into an inflated rather than compressed string. (Though a change of representation handles that particular case, as sketched below.)
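One simple illustration: if the string is first delta-coded, i.e. each value is replaced by its difference from the previous one, the pattern becomes trivial:
Code:
# Delta coding: store the first value and then successive differences.
seq = list(range(1, 11))        # 1, 2, 3, ..., 10
deltas = [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]
print(deltas)                   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# A run of identical values compresses easily, e.g. as the pair (1, 10).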

Just to clarify what I am looking for, what I mean by a 'universal' algorithm is one that compresses all types of data strings and generates an optimal (or approximately optimal) compression of that data. And it would do this without any outside influence from an operator or mathematician, either in the selection of compression types to search for or in the evaluation of the possible different answers.

So at this stage I am looking for a very simple algorithm that would have maximal applicability. Can any improvement be made on LZ77?
 
  • #9
A_Seagull said:
Just to clarify what I am looking for, what I mean by a 'universal' algorithm is one that compresses all types of data strings and generates an optimal (or approximately optimal) compression of that data. And it would do this without any outside influence from an operator or mathematician, either in the selection of compression types to search for or in the evaluation of the possible different answers.

Did you read the section "No Universal Compression" in this text?
http://mattmahoney.net/dc/dce.html#Section_11

But since you mentioned brains, maybe what you are looking for are neural networks? Those can also be seen as a form of lossy compression.
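As a toy illustration of that view, here is a linear autoencoder squeezing 8-dimensional data through a 2-unit bottleneck; every size and constant is made up for the example:
Code:
# A linear autoencoder as a lossy compressor: encode to 2 numbers, decode
# back to 8, and train both maps to minimize the reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))  # data with 2-d structure
W_enc = rng.normal(size=(8, 2)) * 0.1
W_dec = rng.normal(size=(2, 8)) * 0.1

for step in range(3000):
    code = X @ W_enc                  # the compressed representation
    err = code @ W_dec - X            # reconstruction error
    W_dec -= 0.01 * code.T @ err / len(X)
    W_enc -= 0.01 * X.T @ (err @ W_dec.T) / len(X)

print("reconstruction MSE:", np.mean((X @ W_enc @ W_dec - X) ** 2))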
 
  • #10
DrZoidberg said:
Did you read the section "No Universal Compression" in this text?
http://mattmahoney.net/dc/dce.html#Section_11

But since you mentioned brains, maybe what you are looking for are neural networks? Those can also be seen as a form of lossy compression.

Yes I did, thank you. And I quote from it: "Suppose there were a compression algorithm that could compress all strings of at least a certain size...". I am not looking for an algorithm that will necessarily compress all strings. I am only looking for an algorithm that will compress a number of different types of strings. And this can be a lossy compression.

I think this could be philosophically important. Because somewhere along the history of evolution, life began with no concept of an exterior world, and now humans (and other animals) have created a 'picture' of that exterior world. And this must necessarily have been achieved without any preconception of what that exterior world might consist of. So if you think of the problem as one of an input of sense data, and of the brain as some sort of data processor, then how is it possible for the brain to 'process' that data without any preconceptions of how to do so? I suspect that the only way to achieve this is through some form of data compression or pattern creation. So I am exploring possibilities for what an algorithm for that initial data compression / pattern creation process might look like.
 
  • #11
I don't have a view on whether or not it is philosophically important, but IMO it won't fly very far on a math forum until you succeed in defining the problem better.

A_Seagull said:
I am only looking for an algorithm that will compress a number of different types of strings. And this can be a lossy compression.

There are infinitely many of those. For example the Douglas Adams algorithm: replace every input string with the compressed string "42". Since "42" is the answer to "life, the universe and everything", decompressing it can obviously recover all possible input strings :biggrin:
 
  • #12
A_Seagull said:
So I am exploring possibilities for what an algorithm for that initial data compression / pattern creation process might look like.

As I said before: the algorithm you are looking for is called "neural network".
 
  • #13
People here might already know about these two classics, but just in case:

Shannon's source coding theorem establishes limits to possible data compression.

The Kolmogorov complexity of an object is a measure of the computational resources needed to specify the object.
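To get a feel for the first of these: the empirical entropy of a string lower-bounds the average number of bits per symbol that any lossless code can achieve. A quick calculation:
Code:
# Empirical Shannon entropy: H = -sum p_i * log2(p_i), in bits per symbol.
from collections import Counter
from math import log2

def entropy_bits_per_symbol(s):
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

print(entropy_bits_per_symbol("aaaaaaab"))  # ~0.54: highly compressible
print(entropy_bits_per_symbol("abcdefgh"))  # 3.0: 8 equally likely symbols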
 
  • #14
DrZoidberg said:
As I said before. The algorithm you are looking for is called "neural network".

And what is that algorithm?
 

1. What is an algorithm for data compression?

An algorithm for data compression is a set of steps or rules used to reduce the size of data files while preserving their content, either exactly or approximately. It is a common technique in computer science and data storage to save space and increase efficiency.

2. How does data compression work?

Data compression works by identifying and removing redundant or irrelevant information from a data file. This can be achieved through various techniques such as entropy coding, dictionary-based compression, or statistical modelling. The compressed data then occupies less space, making it easier to transfer and store.
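A minimal example of removing redundancy, using run-length encoding (one of the simplest techniques):
Code:
# Run-length encoding: replace each run of repeated symbols with a
# (symbol, count) pair.
from itertools import groupby

def rle(s):
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print(rle("aaaabbbcc"))  # [('a', 4), ('b', 3), ('c', 2)]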

3. What are the benefits of using data compression?

Some of the main benefits of data compression include reduced storage space, faster data transfer, and improved efficiency. Compressed data also requires less bandwidth and can help reduce costs for data storage and transmission. (Note that compression by itself is not a security measure, although compressed files are less convenient to inspect or modify directly.)

4. What are the types of data compression algorithms?

There are two main types of data compression algorithms: lossy and lossless. Lossy compression discards some information, resulting in a smaller file size at the cost of fidelity. Lossless compression, on the other hand, reduces the file size without losing any data. Within these categories there are various techniques such as Huffman coding, Lempel-Ziv-Welch (LZW) compression, and run-length encoding.
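As a sketch of one of the lossless techniques mentioned above, here is a compact (simplified) Huffman coder; frequent symbols receive short codes:
Code:
# Simplified Huffman coding: repeatedly merge the two least frequent
# groups, prefixing their codes with 0 and 1 respectively.
import heapq
from collections import Counter

def huffman_codes(text):
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
print(codes)                                        # 'a' gets the shortest code
print("".join(codes[ch] for ch in "abracadabra"))   # the compressed bit string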

5. What are the applications of data compression?

Data compression has a wide range of applications in various industries, including computer science, telecommunications, and multimedia. It is commonly used in data storage, file transfer, and data streaming. It is also essential for efficient data transmission over networks and is widely used in the compression of images, videos, and audio files.
