Apology for Lameness: How to Make Amends

  • Thread starter: Dmrwizkidz

Discussion Overview

The discussion revolves around data compression methods, particularly in the context of using mathematical constants like Pi for indexing and retrieval of data. Participants explore the implications of different chunk sizes and indexing strategies in relation to compression efficiency.

Discussion Character

  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant suggests that using larger data chunks can improve the efficiency of a compression method, but highlights the trade-off with the size of the index needed to locate data.
  • Another participant discusses the implications of using a large set of possible strings for indexing, noting that the size of the index may exceed the size of the data being compressed.
  • There is a consideration of how compression algorithms exploit regularities in input data, with the assertion that more regular data allows for greater compression.
  • A participant provides a specific example of attempting to compress a hexadecimal representation of a username using digits from Pi, illustrating the potential for varying compression ratios based on the method employed.
  • Another participant mentions the Lempel-Ziv (LZ) compression methods as popular algorithms for lossless storage and suggests looking into indexing tools to enhance pattern searching.

Areas of Agreement / Disagreement

Participants express differing views on the effectiveness of using Pi for data compression and indexing, with no consensus reached on the best approach or the implications of the discussed methods.

Contextual Notes

The discussion includes assumptions about data regularity and the efficiency of various compression methods, which remain unresolved. The effectiveness of the proposed indexing strategies is also not conclusively determined.

Dmrwizkidz
Removed due to lameness on my part.
 
Hi, Dean, welcome to the forums.

Your compression method would only pay off if you chose chunks of data large enough that the size of the key plus the start index (an end index is not needed for fixed-size input chunks) is small compared to the size of the chunk.

Put yourself for a moment in the following situation: your data chunk size is, say, 512 bits, and, instead of using Pi, you use the full collection of [tex]10^{512}[/tex] possible strings of digits, each of length 512,
0000 ... 0000
0000 ... 0001
...
9999 ... 9998
9999 ... 9999​
where you are guaranteed to find a match to your input data.

Here is the catch: given your 512-bit input chunk, the start index would be a number between 0 and [tex]10^{512}-1[/tex], which needs more than 512 bits to be stored.

If your table were made of all [tex]2^{512}[/tex] combinations of binary digits instead of decimal (and your key were either 01 or 10), the size of the index would again be 512 bits. In this case, you can see why the choice of key will not make the index smaller.
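A quick sanity check of those index sizes in Python (the variable names are mine, not from the thread):

```python
# Sanity-checking the index sizes for a 512-bit chunk.

chunk_bits = 512

# Index into a table of 10**512 decimal strings: a number in 0 .. 10**512 - 1.
decimal_index_bits = (10**512 - 1).bit_length()

# Index into a table of 2**512 binary strings: a number in 0 .. 2**512 - 1.
binary_index_bits = (2**512 - 1).bit_length()

print(decimal_index_bits)  # 1701 -- far more than the 512-bit chunk
print(binary_index_bits)   # 512  -- exactly the size of the chunk
```

Either way, writing down the index costs at least as many bits as the chunk itself.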

At the bottom of this discussion is the question of how much information you actually want to compress. Compression algorithms usually exploit known regularities in the input data (for example, that the data is always ASCII text, or a sound wave, or a picture): the more regular the input data is, the less information it actually contains, so that information can be conveyed in fewer bits.
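"More regular means less information" can be made concrete with Shannon entropy; here is a small sketch (the helper function is my own, not from the thread):

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Shannon entropy of the input, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


regular = b"ab" * 128          # highly regular: only two symbols
varied = bytes(range(256))     # every byte value appears once

print(shannon_entropy_bits(regular))  # 1.0 bit/byte: very compressible
print(shannon_entropy_bits(varied))   # 8.0 bits/byte: no regularity to exploit
```

An ideal compressor can approach the entropy rate but not beat it, which is why truly irregular data cannot be compressed.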

If there is something about your idea that I'm missing, please let me know.
 
Also Removed
 
Dmrwizkidz said:
Then my Address would be something like - 000000001-001 through 9502F9000-3FE of course assuming the chunk of data being found is a known constant as well. (6 bytes)

Sure, but on average the addresses (in that format) will take up more room than the data you're trying to compress.

Let's say you want to store 0xC3D2C7D2C5C1 (the first 7 letters of my username) by finding digits in pi. 'On average' you'll have to look 215 trillion places deep to find it, giving you an address 12 hexadecimal digits long. Let's say you're rather lucky and find such a string 70 billion digits in, 3000 times sooner than expected. Now you just need to give 10 hexadecimal digits for the start address and 10 hexadecimal digits for the end address and you've compressed 12 hexadecimal digits into 20 for a compression ratio of -66%.

If you store the length of the string instead, and make the length and address self-delimiting, you can get right back to 0% compression on average.

By contrast, noting that the string above is all lowercase letters, it could be compressed into 9 hexadecimal digits. Finally a positive compression ratio -- 25%.
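The bookkeeping in that example is easy to reproduce, assuming (as the example does) that the hex digits of pi behave like independent uniform random digits:

```python
# Cost of addressing a 12-hex-digit payload by its position in pi,
# treating pi's hex digits as i.i.d. uniform digits (an assumption).
payload_digits = 12

# A pattern with no self-overlap first appears, on average, on the
# order of position 16**12 -- hundreds of trillions of digits in.
expected_depth = 16 ** payload_digits

# Positions up to that depth take 12 hex digits to write down, so a
# start address plus an end address costs 24 digits.
address_digits = len(f"{expected_depth - 1:x}")
stored_digits = 2 * address_digits

print(expected_depth)   # 281474976710656
print(address_digits)   # 12
print(stored_digits)    # 24 digits spent to encode a 12-digit payload
```

Without any luck in the search, the two addresses together take twice as many digits as the data they point at.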
 
I don't know too much about compression, but the Lempel-Ziv (LZ) methods are among the most popular algorithms for lossless storage. http://en.wikipedia.org/wiki/Data_compression

If you want to use pi, you might want to look into some sort of indexing; indexing makes searching for patterns much faster. Look into sed and awk as pattern-matching tools; I believe they are open source.
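For a concrete taste of the LZ family, Python's built-in zlib (DEFLATE, i.e. LZ77 plus Huffman coding) shows the regular-versus-random behaviour discussed earlier in the thread:

```python
import os
import zlib

regular = b"the quick brown fox " * 100   # 2000 bytes of repetitive text
random_ = os.urandom(2000)                # 2000 bytes with no regularity

# LZ77 back-references collapse the repeats; random bytes do not shrink.
print(len(zlib.compress(regular)))   # far fewer than 2000 bytes
print(len(zlib.compress(random_)))   # slightly more than 2000 bytes

# Lossless: decompression restores the input exactly.
assert zlib.decompress(zlib.compress(regular)) == regular
```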
 
