The data compression literature is truly enormous, with a huge body of peer-reviewed papers. Even today, researchers are still looking for better ways to compress the ever-increasing onslaught of scientific data that needs to be stored, transmitted, and analyzed.
Some of the modern compressors are:
- sz3
https://github.com/szcompressor/SZ3
- mgard
- sperr
- fpzip
- zfp
https://github.com/llnl/zfp
- pfpl
https://github.com/burtscher/PFPL
Compressors are divided into two camps:
- lossless
- lossy
Lossless compression is for data that must be bit-for-bit identical after decompression. It's very hard to develop good lossless compressors. Many work at the bit level.
Lossy compressors handle scientific data, either measured or produced by a simulation. Some fuzziness is allowed, as long as it stays within a given error bound. Lossy decompression generates data that is close to the source, but the decompressed data won't match it exactly.
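To make "within a given error bound" concrete, here is a minimal sketch of the simplest error-bounded scheme I know of, uniform quantization. This is only an illustration, not how any of the compressors listed above work internally, and the names `quantize` and `dequantize` are made up for this example. Each value is replaced by the index of a bin that is twice the error bound wide, so the restored value can be off by at most the error bound (up to floating-point rounding).

```python
import math

def quantize(values, error_bound):
    """Map each float to an integer bin index; bins are 2 * error_bound wide."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]

def dequantize(codes, error_bound):
    """Restore each value as the center of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

# Hypothetical test signal: 50 samples of a sine wave.
data = [math.sin(0.1 * i) for i in range(50)]
eb = 0.01

codes = quantize(data, eb)        # small integers, easy for a later lossless stage
restored = dequantize(codes, eb)  # close to data, but not identical

max_err = max(abs(a - b) for a, b in zip(data, restored))
print(f"max error = {max_err:.6f}, bound = {eb}")
```

The integer codes repeat and vary slowly for smooth data, which is what a follow-up lossless stage can then exploit.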
PFPL is my favorite because it's very fast, uses lossy compression within strict error bounds, and leverages the CPU and, if available, the NVIDIA GPU. Also, it was developed at my university's CS department.
Compressors are tested against some of the toughest datasets in the SDRBench (Scientific Data Reduction Benchmarks) suite.
My feeling is that you are entering this well-trodden and busy subfield of CS somewhat naively.
Try out these compressors, along with any others you find online, to see how well they perform on your machine setup.
There is no ideal compressor. Some tools compress certain data files better than others.
There are also some very intractable datasets, such as random data with no pattern to exploit.
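You can see the "random data" problem for yourself with a few lines of Python. This is a rough sketch using the standard library's `zlib` (a general-purpose lossless compressor, not one of the scientific compressors above); the buffer size, the seed, and the sine-based test signal are arbitrary choices for the demonstration.

```python
import math
import random
import zlib

N = 65536

# Pseudo-random bytes (fixed seed so the run is repeatable).
rng = random.Random(0)
random_bytes = bytes(rng.getrandbits(8) for _ in range(N))

# A smooth, periodic signal: one sine cycle every 256 samples, scaled to 0..255.
smooth_bytes = bytes(
    int(127.5 + 127.5 * math.sin(2 * math.pi * i / 256)) for i in range(N)
)

def ratio(raw):
    """Compression ratio = original size / compressed size."""
    return len(raw) / len(zlib.compress(raw, 9))

ratio_random = ratio(random_bytes)
ratio_smooth = ratio(smooth_bytes)
print(f"random: {ratio_random:.2f}x, smooth: {ratio_smooth:.2f}x")
```

The random buffer should come out at a ratio of about 1 (no gain at all), while the smooth signal shrinks by a large factor.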
My work in the field is very primitive compared to the approaches used by the best compressors. I'm using linear, quadratic, cubic, and quartic lossy compression, where I try to fit a run of consecutive data values to a polynomial. The "compressed data" is really just the coefficients of those polynomials, which regenerate the data to within a user-supplied error bound.
Sadly, my approach only gets a good compression ratio on smooth, slow-moving data, like an undulating sine curve. When the data borders on random, which other compressors still handle well, it falls apart and yields a poor compression ratio.
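To make the polynomial idea concrete, here is a simplified illustration rather than my real code. It uses only straight lines (degree 1) and a greedy rule: keep extending a segment while a least-squares line still reproduces every point within the error bound. The function names, the segment format `(length, a, b)`, and the test signals are all made up for this sketch.

```python
import math
import random

def fit_line(ys):
    """Least-squares line y = a + b*i over i = 0..len(ys)-1; returns (a, b)."""
    n = len(ys)
    if n == 1:
        return ys[0], 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def max_error(ys, a, b):
    """Largest absolute difference between the points and the line."""
    return max(abs(y - (a + b * i)) for i, y in enumerate(ys))

def compress(data, error_bound):
    """Greedy: grow each segment while a line fits it within error_bound."""
    segments = []  # each entry is (length, a, b)
    start = 0
    while start < len(data):
        length, best = 1, (data[start], 0.0)
        while start + length < len(data):
            candidate = data[start:start + length + 1]
            a, b = fit_line(candidate)
            if max_error(candidate, a, b) > error_bound:
                break
            best, length = (a, b), length + 1
        segments.append((length, best[0], best[1]))
        start += length
    return segments

def decompress(segments):
    """Regenerate the data by evaluating each stored line."""
    out = []
    for length, a, b in segments:
        out.extend(a + b * i for i in range(length))
    return out

eb = 0.01
smooth = [math.sin(0.05 * i) for i in range(200)]
rng = random.Random(0)
noisy = [rng.uniform(-1.0, 1.0) for _ in range(200)]

smooth_segments = compress(smooth, eb)
noisy_segments = compress(noisy, eb)
print(len(smooth_segments), "segments for smooth data")
print(len(noisy_segments), "segments for noisy data")
```

Every segment costs three stored numbers, so the smooth sine needs far fewer numbers than the original 200 samples, while the noisy signal breaks into so many short segments that the "compressed" form is no smaller than the input.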
But the hope is I can do better.