# Set representation

1. Mar 31, 2006

### 0rthodontist

Is there a general good way to represent sets as sequences, without wasting data or conveying extra data? One way is a membership list like 11010100111010 but this only works if the possible number of elements is small. I am thinking, data _other_ than ordering, directly about the set objects, can be conveyed through ordering. Like if I'm talking about a set of eleven ten-digit numbers, and one of them has all distinct digits, then I can list the other 10 numbers in an order according to the digits of the 11th number, and from that list of 10 numbers all 11 can be reconstructed and the list does not seem to convey any extra information. Is there a general way to do this for any set of numbers?

Last edited: Mar 31, 2006
2. Apr 1, 2006

### MathematicalPhysicist

how do you propose to do this ordering?

do you mean for example, the number is 1345768902 now how do you determine the order of the other number by relating to its digits.

just for example the next numbers are: 1111111111,2222222222,3333333333,4444444444,5555555555,6666666666,7777777777,8888888888,9999999999,9999999991.
then how can you infer the order?

the basic way (and ugly) is to break it up to bases of 10 and then checking every digit in comparison to the 11th number and the other numbers, but i don't see what you're trying to accomplish this way, if this is even your way?

3. Apr 1, 2006

### Hurkyl

Staff Emeritus
There are all sorts of silly ways to represent things.

But we have a theorem: in any compression scheme, there must be incompressible sequences. In fact, to get any compression at all, some things need to be inflated.

It's easy to see why. How many n-long sequences are there? Now, how many n-long compressed representations are possible?

Anyways, the important thing about a set, usually, is that it contains objects -- I'd be worried about any representation that makes it difficult to quickly tell if your set contains a particular object, or to be able to step through all the objects in the set efficiently. And if your set is mutable, you don't want to have to do a lot of work each time you add or remove an element!

4. Apr 1, 2006

### 0rthodontist

LQG, I was saying the number 1345768902 could represent a permutation of the other elements from the ascending order. In this case the order would be 2222222222, 4444444444, 5555555555, 6666666666, 8888888888, 7777777777, 9999999991, 9999999999, 1111111111, 3333333333

It probably wouldn't be all that practical since in a set of x equal length objects, you're only going to be able to get O(x) extra bits at best, while the set contains at least the order of x ln x bits. And the processing overheads might be high. I'm just wondering if there is a clean way to do it, ignoring processing overheads. In this case you could always achieve some compression over simply representing each element of the set since the set is not a sequence so the pigeonhole principle doesn't directly apply in that way.

Basically it's the question of how do you map integers 1-1 and roughly onto to permutations.

5. Apr 1, 2006

### Hurkyl

Staff Emeritus
Ah.

You're complaining about the fact that we tend to write an n-element set as an ordered list, meaning that, information theoretically, this representation is actually storing ~n lg n bits of irrelevant information.

So, you're wondering if there is a cheap way of recovering those n lg n bits? Or just if there is any way at all of recovering those bits?

6. Apr 1, 2006

### Hurkyl

Staff Emeritus
Well, it's easy enough to put all finite subsets of N in bijection with positive integers:

But, I suppose you're looking for a way that is biased in favor of small sets, instead of a way that's biased in favor of small numbers?

7. Apr 1, 2006

### Hurkyl

Staff Emeritus
One useful method would be to store a set as a list of differences: the first element of your list is the smallest element of the set. The remaining elements are the differences of subsequent elements.

8. Apr 1, 2006

### Hurkyl

Staff Emeritus
Let's compute! Suppose we want to store the ordered list (aka sequence) of numbers {x}.

So, this will require:

n + f(x(1)) + f(x(2)) + ... + f(x(n))

bits to store. Here:

f(x) is the number of bits required to store the number x. It turns out that we cannot manage this feat with merely lg x bits. I think we can get arbitrarily close, but I think it's straightforward to manage with f(x) = lg x + lg lg x bits.

We could probably do much better with that n term, but we're considering very small n, and that term will be irrelevant for the rest of this post anyways.

Now, the representation I mentioned in my last post will take storage:

n + f(x(1)) + f(x(2) - x(1)) + ... + f(x(n) - x(n-1))

So what is the savings? Well, that's easy: it's simply:

f(x(2)) - f(x(2) - x(1)) +
f(x(3)) - f(x(3) - x(2)) + ... +
f(x(n)) - f(x(n) - x(n-1))

Let's do a gross first-order Taylor approximation (i.e. differential approximation):

x(1) f'(x(2)) + x(2) f'(x(3)) + ... + x(n-1) f'(x(n))

In the naive f(x) = lg x case, we have f'(x) = 1 / (x log 2), so the savings is:

$$\frac{1}{\log 2} \left( \frac{x_1}{x_2} + \frac{x_2}{x_3} + \cdots + \frac{x_{n-1}}{x_n} \right)$$

Notice that when the numbers are similar, we've saved almost (n-1) / (log 2) bits!

In a more realistic f(x) = lg x + lg lg x case, we have

$$f'(x) = \frac{1}{x \log 2} \left( 1 + \frac{1}{\log x} \right)$$

so in this case, we've saved

$$\frac{1}{\log 2} \left( \frac{x_1}{x_2} \left(1 + \frac{1}{\log x_2}\right) + \cdots + \frac{x_{n-1}}{x_n} \left(1 + \frac{1}{\log x_n}\right) \right)$$

so for small x, we've actually done better.

But as I said, I did a gross approximation. What if we used more terms of the Taylor series?

9. Apr 1, 2006

### 0rthodontist

Why n lg n? I thought it was just n. There are n! possible arrangements of the set elements, for a total of O(2^n lg n) arrangements by Stirling's formula. The number of bits is the log of that, which is O(n + lg lg n) = O(n).

I was thinking about the case when you have all elements the same number of bits. The idea is the same as in my first post, you store the first x elements and part of the remaining n - x elements, and the remainder of the message is the number in correspondence with the order you stored the first x elements. Here is one way to compute that correspondence:

Code (Text):

Integer-to-Permutation(k, x) // k = the integer to convert, x is as above
// this assumes k is not too large in relation to x
// (I think the condition is it can't be greater than x!, but I'm unsure)
let a be an array of integers of size x (indexed from 1)
let b also be such an array
for i <-- x down to 1
a[x - i + 1] <-- k mod i
k <-- floor(k / i)

for i <-- 1 to x
b[Empty-index(b, a[i])] <-- i

return b

Empty-index(b, c)
let z be the index of the (c+1)'th empty entry in b
return z

Last edited: Apr 1, 2006
10. Apr 1, 2006

### Hurkyl

Staff Emeritus
I've just memorized that O(log n!) = O(n log n). I'm not sure where you went wrong, but n! doesn't look anything like 2^n lg n.

Oh, I see what happened: you forgot your parentheses! That's supposed to be 2^(n lg n) -- the relevant fact is that:

n! ~ 2^O(n lg n)

And I really don't see how my approach of storing the consecutive differences looks like what you were suggesting.

11. Apr 1, 2006

### 0rthodontist

Ahh, no, I was just totally confused and it was sheer luck that I was anywhere close to anything approaching correct.

It's not, what I was suggesting was to generalize the idea of my first post to any set whose elements have equal numbers of digits (requiring that so I don't have to think about delimeters).

I think the following algorithm inverts the algorithm of my first post, or at least has the right idea.
Code (Text):

Permutation-to-Integer(b, x) //b and x are as before, b is a permutation array of size x and x is the number of elements in the permutation

let k = 0
let a be an array of size x (indexed from 1 as before)
for i <-- x down to 1
let z be the index of the largest element of b
y <-- index-to-a(b, z)
b[z] <-- empty
a[i] <-- y
k <-- k * (x - i + 1) + (a[i] - 1)
// (I realize that a is not really necessary but it is included for clarity since it's supposed to become the same a as in the previous algorithm)

return k

Index-to-a(b, c)

let y be one greater than the number of blank spaces of index less than c in b
return y

The method described in this post and the last should save O(x lg x) symbols because there are x! different numbers that can correspond to permutations for a given x, which is almost n lg n symbols. How much less is x than n? x = n-1 if there is an element greater than or equal to x lg x symbols in length. x = n - 2 if there is not such an element, but there are two elements whose combined length is greater than or equal to x lg x symbols. If all the elements have length w, I have to solve this for x:
(n - x - 1) * w <= lg (x) <= (n - x)*w

Last edited: Apr 2, 2006
12. Apr 2, 2006

### 0rthodontist

(Sorry for double-posting instead of editing, it's been 14 hours so I figured it's OK)
To clarify if necessary, the procedure I mean is, once you have picked a number x of elements, select the last lg x! digits of the remaining elements and let those digits be the number k. Run Integer-to-Permutation(k, x) and order the first x elements according to the permutation you derive. Then append to these x ordered packets any digits of the remaining elements that were not previously encoded in the ordering. To decode, select the first x packets and let the permutation they are arranged in be b. Run Permutation-to-Integer(b, x) and append the result to the encoded representation, and the result is the original set of objects.

A comment on your earlier post: why is it more realistic to say that an encoding of an number x is lg x + lg lg x instead of just lg x?

13. Apr 2, 2006

### Hurkyl

Staff Emeritus
I can't find my theory of computation book to look it up, but basically, if it takes K(x) bits to store x, and K(y) bits to store y, then for every positive constant c, we sometimes have K(x + y) > K(x) + K(y) + c. (and I think "sometimes" means "almost always" for large things) Incidentally, this assumes we're using a sufficiently space-"optimal" method.

The problem is that it's not so easy to say "insert a delimiter here" -- how do you tell the delimiter apart from the things its separating?

You could use a "magic" 8-bit string to separate the different numbers. But whenever the magic string actually appears in the number, you have to replace it with a 9-bit "escape" sequence, so that you don't think the number ended. On average, this method will multiply storage by (1+e) for some positive constant e. So, it would take

(1+e) lg x + 8

bits to be able to store your number x and indicate where it ends in memory. (You can tweak e and 8 by changing the size of the magic string)

You could do better by storing the number of bits in x, and then the binary representation of x. Since the lg x is small, we can afford the magic string method, and this approach requires

lg x + (1 + e) lg lg x + 8

bits to be able to store the number x, and indicate where it ends in memory.

Of course, I'm thinking the computer-sciency asymptotic way, and ignoring all the practicalities that might look like "Well, I'll never want to consider a number bigger than 10^100, and I'm going to potentially waste 16 bits per number, on average, so that I can word align my data, and..." which requires an entirely different approach to optimization.

Last edited: Apr 2, 2006