An algorithm for data compression?

In summary, the conversation discusses the concept of a generalized data compression algorithm and its potential applications in science and human cognition. The speaker is seeking a fundamental algorithm that can efficiently compress any compressible string of data, but acknowledges that it may be difficult to find one that can handle all forms of data. They suggest using a string of integers as input and discuss the need for criteria for the algorithm to halt and provide an answer. The conversation also touches on the subjectivity of information content in different contexts.
  • #1
A_Seagull

I am coming from a philosophy and science background and I'm particularly interested in finding out what a generalised data compression algorithm would look like. (By algorithm I mean something like a flowchart that would show how this process could be done).

Effectively, much of science is about data compression. For example 'F=ma' is a means of greatly compressing data concerning an object moving in a force field.

Also, one of the brain's intuitive processes is data compression. For example, it can take a multitude of sense-data and compress it into the simple string 'tree'.

So I am interested in how this could be achieved at a fundamental level. I am seeking a generalised algorithm that could, in principle and at a much later date, be coded into a computer language and run on a computer.

I have had several attempts myself to create such an algorithm but without much success.

I suspect that it may be very hard to find an entirely generalised algorithm that could efficiently compress any compressible string of data. Instead, it may be necessary to start with some form of 'template' to initiate the data compression process.

So for the purpose of the exercise, I suggest that one consider as input a string of integers. This input string could be considered to be infinite in length, but for the purpose of the exercise some cut-off would undoubtedly be necessary.

The output of the algorithm would contain some brief description of the 'pattern' that had been found and a process for recreating the original data string.

Clearly the input data cannot be random, otherwise no data compression would be possible. It must contain some form of 'pattern', even if it is only an approximate pattern.

Perhaps initially it can be assumed that the data can be described exactly by a pattern, in other words that there are no 'errors' in the data. (However, subsequently it would be interesting to see how an algorithm could cope with some data points not fitting the pattern perfectly, as this would be more 'realistic'.)

There would also need to be some provision for criteria for the algorithm to halt and produce its answer.

There may already be in existence some such algorithm, but if so I have not uncovered it.

Any suggestions? Any solutions?
 
  • #3
Hi,
I'm not sure if I follow, but it seems to me that you're asking what abstraction is, at its most fundamental. As such, the question is either too vague or too open; you're unlikely to get an answer when either (a) the question is not a precise one, or (b) nobody really knows the answer.

If it helps, data compression algorithms are about re-coding a message, in such a way that its information content is either only minimally lost, or not at all. But I don't see how this applies to the act of modelling nature with physical laws and concisely expressing these laws using symbols. "Information content" is a precisely defined quantity in the former case, but largely subjective in the latter.

Also, "patterns" for integer sequences are not really a good model, as there is an indefinite number of different patterns that could give you the same sequence; only to the human brain one of these patterns is the obvious choice. For example, the sequence 1, 2, 4, 8, 16, 32, ... has the "obvious" pattern [itex]2^n[/itex], but it could also be reproduced by a 6th-degree polynomial (**)... which will fail to give 64 as the next number. But 64 was *not* in the problem statement. Stating that 64 *must* be the next number is prejudicing yourself against a particular pattern.

My 2 cents.

P.S.: (**) I was improvising when I said "6th-degree polynomial". Then I went to some software and found the following 5th-degree polynomial that does the trick exactly. Just feed values of n=0,1,2,3,4,5 to [tex]\frac{n^5}{120} - \frac{n^4}{24} + \frac{5 n^3}{24} + \frac{n^2}{24} + \frac{47 n}{60} + 1[/tex] For n=6 it will give 63. :)
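If anyone wants to check this without trusting my algebra, here is a quick verification in Python, using exact rational arithmetic so there are no rounding doubts:
Code:
# Evaluate the polynomial above with exact fractions and confirm the values.
from fractions import Fraction

def p(n):
    n = Fraction(n)
    return n**5/120 - n**4/24 + 5*n**3/24 + n**2/24 + Fraction(47, 60)*n + 1

print([int(p(n)) for n in range(7)])  # [1, 2, 4, 8, 16, 32, 63]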
 
  • #4
dodo said:
Hi,
I'm not sure if I follow, but it seems to me that you're asking what abstraction is, at its most fundamental. As such, the question is either too vague or too open; you're unlikely to get an answer when either (a) the question is not a precise one, or (b) nobody really knows the answer.



If it helps, data compression algorithms are about re-coding a message, in such a way that its information content is either only minimally lost, or not at all. But I don't see how this applies to the act of modelling nature with physical laws and concisely expressing these laws using symbols. "Information content" is a precisely defined quantity in the former case, but largely subjective in the latter.

Also, "patterns" for integer sequences are not really a good model, as there is an indefinite number of different patterns that could give you the same sequence; only to the human brain one of these patterns is the obvious choice. For example, the sequence 1, 2, 4, 8, 16, 32, ... has the "obvious" pattern 2^n, but it could also be reproduced by a 6th-degree polynomial... which will fail to give 64 as the next number. But 64 was *not* in the problem statement. Stating that 64 *must* be the next number is prejudicing yourself against a particular pattern.

My 2 cents.


Regarding (a): I can certainly make the question more precise if you like - see below, following your example.

Regarding (b): That is why I am posting this thread - to get some indication of whether there is an answer!

Regarding ""Information content" is a precisely defined quantity in the former case, but largely subjective in the latter." I presume you have 'former' and 'latter' the wrong way around. With regard to the 'loss of information' I do not think it is entirely 'subjective' and the point is that the algorithm would have to set parameters that would determine an estimate for that data loss so that two possible forms of data compression could be compared and the 'better' one selected.

Regarding your example of 1,2,4,8,16,32...: for that finite sequence there may well be more than one form of data compression. But which one is the 'best'? If both 'theories' can reproduce the data equally well, then the separation between them as to which one is 'better' would come down to which one can be stated in the smaller amount of data.

Also, considering your sequence, I am certainly not suggesting that '64' would be the next integer. Far from it. It could be '95', and the following one '128'. In which case the algorithm might consider that '95' is somewhat erroneous, and still find the pattern 2^n.

Also, if an algorithm could find an appropriate 'pattern' for that sequence of data, would that same algorithm be able to compress the data sequence '1,3,5,7,9,11,...'?
 
  • #5
A_Seagull said:
I presume you have 'former' and 'latter' the wrong way around.
[strike]You're right[/strike] I don't really know, sorry about that one. (English is not my first language.)

A_Seagull said:
Also if an algorithm could find an appropriate 'pattern' to that sequence of data could that same algorithm be able to compress the data sequence : '1,3,5,7,9,11,...'?
I used a combination of an "approximate" algorithm, and some hand tweaking. But you could try this yourself. See this webpage, for example; in the text box, write two columns of numbers: the left column with the integers 0,1,2,3,4,5, and the second with the numbers in your sequence. Like this:
Code:
0 1
1 3
2 5
3 7
4 9
5 11
and choose a degree for the polynomial one less than the number of integers that you have (5 for these 6 points), which gives an exact fit. Then see and interpret the results as desired, and try a smaller degree when appropriate. While this method is intended to produce an approximation, it may give you some ideas; a programmatic version of the same fit is sketched below.
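For instance, a minimal version of the same fit in Python (assuming numpy is available):
Code:
# Least-squares polynomial fit; degree 5 through 6 points is an exact interpolation.
import numpy as np

x = np.arange(6)
y = np.array([1, 2, 4, 8, 16, 32])
coeffs = np.polyfit(x, y, 5)                       # highest power first
print(np.round(np.polyval(coeffs, np.arange(7))))  # note: 63 at n=6, not 64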

Another method that comes to mind is Newton's finite differences. Think of taking the successive differences of the integers in your sequence; for 1,3,5,7,9,11 you are done at the first step, as all of them differ by 2, and you have a pattern immediately. If not, take successive differences of the differences you obtained in the step before, and repeat until you obtain a constant difference. (The algorithm terminates because you always have one difference fewer than the numbers you started with; in the worst case you end up with a last iteration subtracting only two numbers and giving only one difference.) Then imagine a way to work backwards from each step, to go from the differences (and the initial number) back to the previous step. The method can be made more elaborate than what I wrote, but this should be enough to keep you toying for a while and gaining some insight before consulting the full mathematical treatment. A rough sketch follows.
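Here is the differencing step in Python, for whoever wants to play with it:
Code:
# Keep taking successive differences until a row is constant (or length 1);
# the rows below the first are the 'compressed' description of the sequence.
def difference_table(seq):
    rows = [list(seq)]
    while len(rows[-1]) > 1 and len(set(rows[-1])) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows

for row in difference_table([1, 3, 5, 7, 9, 11]):
    print(row)
# [1, 3, 5, 7, 9, 11]
# [2, 2, 2, 2, 2]
# To reconstruct, keep only the first element of each row plus the constant.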
 
  • #6
In the case of hardware compression, algorithms like LZ77 can be implemented using a content addressable memory to "search" for previous occurrences of a symbol (plus a single bit used to keep track of previous matches in a set of strings in the "sliding window" used by LZ77) in one content addressable memory cycle. This greatly speeds up the process compared to using a hash table or doing a linear search.
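For illustration, here is a minimal software sketch of the LZ77 idea, with a plain linear scan standing in for the content addressable memory (real implementations are considerably more careful):
Code:
# Minimal LZ77-style compression: for each position, find the longest match
# in the sliding window and emit an (offset, length, next_symbol) triple.
def lz77_compress(data, window=255):
    i, out = 0, []
    while i < len(data):
        best_off = best_len = 0
        for j in range(max(0, i - window), i):
            k = 0
            # Overlapping matches are allowed, as in real LZ77.
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

print(lz77_compress("abcabcabcd"))
# [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 6, 'd')]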
 
  • #7
A generalized algorithm would have to understand a generic way to take in an arbitrary amount of data and find a format (i.e. think linguistic representation) that minimizes its Kolmogorov complexity, with the criterion that the representation has a bijection to the original data set.

This would mean that you would, in essence, have to understand how to arbitrarily transform your source data into a language (or basis) that minimizes the complexity, at an extremely high and general level, for any representation.

This is actually what all compression algorithms do: both the lossless and the lossy.

The lossy algorithms typically use integral transforms to project the signal data onto some basis, compute the coefficients, and store these rather than the raw data. The result is a smooth or blurry output with some compression.

The lossless ones do the same thing, except that the transforms used are completely invertible (bijective), so the original data can be recovered exactly.

In the lossy ones there is typically some quantization constraint, which means that the coefficients are stored only to some fixed resolution, so information will be lost at some point.
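A toy version of that transform-and-quantize pipeline, using a DCT as the basis (the choice of basis, signal, and quantization step are all arbitrary here, just for illustration):
Code:
# Transform coding in miniature: project onto a DCT basis, store the
# coefficients only to a fixed resolution, and reconstruct (lossy).
import numpy as np
from scipy.fft import dct, idct

signal = np.sin(np.linspace(0, 4 * np.pi, 64))   # a smooth test signal
coeffs = dct(signal, norm='ortho')               # project onto the basis
quantized = np.round(coeffs * 8) / 8             # fixed-resolution storage
restored = idct(quantized, norm='ortho')         # smooth, slightly blurred output

print(np.count_nonzero(quantized), "of 64 coefficients survive quantization")
print("max reconstruction error:", np.abs(signal - restored).max())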

The algorithms aim at finding schemes which minimize the size and maximize the quality; to do this for video and audio, you need to understand how the eyes and ears work.
 
  • #8
Thank you to Steamking, dodo, rcgldr and chiro for your links and comments.

Clearly the subject is hugely complex. It would seem that a universal algorithm for data compression on all types of data strings would have to be hugely complex itself, and could only 'justify' that complexity by having a huge amount of varied data to compress. (An example of this 'justification' could be the human brain and its interaction with the exterior world.)

But what I am looking for at this stage is a much simpler algorithm which, while necessarily not universal in its ability to compress all manner of data, would at least be able to compress some.

In this respect LZ77 is of particular interest and is of a form that I had not encountered before. It is very simple and does not require a finite string. However, it is limited: it could not compress the sequence 1,2,3,4,... but would instead convert it into an inflated rather than compressed string. (Though a change of representation handles that particular case, as sketched below.)
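One simple illustration: if the string is first delta-coded, i.e. each value is replaced by its difference from the previous one, the pattern becomes trivial:
Code:
# Delta coding: store the first value and then successive differences.
seq = list(range(1, 11))        # 1, 2, 3, ..., 10
deltas = [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]
print(deltas)                   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# A run of identical values compresses easily, e.g. as the pair (1, 10).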

Just to clarify what I am looking for, what I mean by a 'universal' algorithm is one that compresses all types of data strings and generates an optimal (or approximately optimal) compression of that data. And it would do this without any outside influence from an operator or mathematician, either in the selection of compression types to search for or in the evaluation of the possible different answers.

So at this stage I am looking for a very simple algorithm that would have maximal applicability. Can any improvement be made on LZ77?
 
  • #9
A_Seagull said:
Just to clarify what I am looking for, what I mean by a 'universal' algorithm is one that compresses all types of data strings and generates an optimal (or approximately optimal) compression of that data. And it would do this without any outside influence from an operator or mathematician, either in the selection of compression types to search for or in the evaluation of the possible different answers.

Did you read the section "No Universal Compression" in this text?
http://mattmahoney.net/dc/dce.html#Section_11

But since you mentioned brains, maybe what you are looking for are neural networks? Those can also be seen as a form of lossy compression.
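As a toy illustration of that view, here is a linear autoencoder squeezing 8-dimensional data through a 2-unit bottleneck; every size and constant is made up for the example:
Code:
# A linear autoencoder as a lossy compressor: encode to 2 numbers, decode
# back to 8, and train both maps to minimize the reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))  # data with 2-d structure
W_enc = rng.normal(size=(8, 2)) * 0.1
W_dec = rng.normal(size=(2, 8)) * 0.1

for step in range(3000):
    code = X @ W_enc                  # the compressed representation
    err = code @ W_dec - X            # reconstruction error
    W_dec -= 0.01 * code.T @ err / len(X)
    W_enc -= 0.01 * X.T @ (err @ W_dec.T) / len(X)

print("reconstruction MSE:", np.mean((X @ W_enc @ W_dec - X) ** 2))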
 
  • #10
DrZoidberg said:
Did you read the section "No Universal Compression" in this text?
http://mattmahoney.net/dc/dce.html#Section_11

But since you mentioned brains, maybe what you are looking for are neural networks? Those can also be seen as a form of lossy compression.

Yes I did, thank you. And I quote from it: "Suppose there were a compression algorithm that could compress all strings of at least a certain size...". I am not looking for an algorithm that will necessarily compress all strings. I am only looking for an algorithm that will compress a number of different types of strings. And this can be a lossy compression.

I think this could be philosophically important. Because somewhere along the history of evolution, life began with no concept of an exterior world, and now humans (and other animals) have created a 'picture' of that exterior world. And this must necessarily have been achieved without any preconception of what that exterior world might consist of. So if you think of the problem as one of an input of sense data, and of the brain as some sort of data processor, then how is it possible for the brain to 'process' that data without any preconceptions of how to do so? I suspect that the only way to achieve this is through some form of data compression or pattern creation. So I am exploring possibilities for what an algorithm for that initial data compression / pattern creation process might look like.
 
  • #11
I don't have a view on whether or not it is philosophically important, but IMO it won't fly very far on a math forum until you succeed in defining the problem better.

A_Seagull said:
I am only looking for an algorithm that will compress a number of different types of strings. And this can be a lossy compression.

There are infinitely many of those. For example the Douglas Adams algorithm: replace every input string with the compressed string "42". Since "42" is the answer to "life, the universe and everything", decompressing it can obviously recover all possible input strings :biggrin:
 
  • #12
A_Seagull said:
So I am exploring possibilities for what an algorithm for that initial data compression / pattern creation process might look like.

As I said before: the algorithm you are looking for is called "neural network".
 
  • #13
People here might already know about these two classics, but just in case:

Shannon's source coding theorem establishes limits to possible data compression.

The Kolmogorov complexity of an object is a measure of the computational resources needed to specify the object.
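To get a feel for the first of these: the empirical entropy of a string lower-bounds the average number of bits per symbol that any lossless code can achieve. A quick calculation:
Code:
# Empirical Shannon entropy: H = -sum p_i * log2(p_i), in bits per symbol.
from collections import Counter
from math import log2

def entropy_bits_per_symbol(s):
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

print(entropy_bits_per_symbol("aaaaaaab"))  # ~0.54: highly compressible
print(entropy_bits_per_symbol("abcdefgh"))  # 3.0: 8 equally likely symbols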
 
  • #14
DrZoidberg said:
As I said before. The algorithm you are looking for is called "neural network".

And what is that algorithm?
 

1. What is an algorithm for data compression?

An algorithm for data compression is a set of steps or rules used to reduce the size of data files while preserving their content, either exactly or approximately. It is a common technique in computer science and data storage to save space and increase efficiency.

2. How does data compression work?

Data compression works by identifying and removing redundant or irrelevant information from a data file. This can be achieved through various techniques such as entropy coding, dictionary-based compression, or statistical modelling. The compressed data then occupies less space, making it easier to transfer and store.
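A minimal example of removing redundancy, using run-length encoding (one of the simplest techniques):
Code:
# Run-length encoding: replace each run of repeated symbols with a
# (symbol, count) pair.
from itertools import groupby

def rle(s):
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print(rle("aaaabbbcc"))  # [('a', 4), ('b', 3), ('c', 2)]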

3. What are the benefits of using data compression?

Some of the main benefits of data compression include reduced storage space, faster data transfer, and improved efficiency. Compressed data also requires less bandwidth and can help reduce costs for data storage and transmission. (Note that compression by itself is not a security measure, although compressed files are less convenient to inspect or modify directly.)

4. What are the types of data compression algorithms?

There are two main types of data compression algorithms: lossy and lossless. Lossy compression discards some information, resulting in a smaller file size at the cost of fidelity. Lossless compression, on the other hand, reduces the file size without losing any data. Within these categories there are various techniques such as Huffman coding, Lempel-Ziv-Welch (LZW) compression, and run-length encoding.
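As a sketch of one of the lossless techniques mentioned above, here is a compact (simplified) Huffman coder; frequent symbols receive short codes:
Code:
# Simplified Huffman coding: repeatedly merge the two least frequent
# groups, prefixing their codes with 0 and 1 respectively.
import heapq
from collections import Counter

def huffman_codes(text):
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
print(codes)                                        # 'a' gets the shortest code
print("".join(codes[ch] for ch in "abracadabra"))   # the compressed bit string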

5. What are the applications of data compression?

Data compression has a wide range of applications in various industries, including computer science, telecommunications, and multimedia. It is commonly used in data storage, file transfer, and data streaming. It is also essential for efficient data transmission over networks and is widely used in the compression of images, videos, and audio files.
