- #1

- 177

- 24

I am tackling a problem related to clustering, and something came up that got me quite baffled.

Suppose you have N distinct objects. Each of them is associated to a list of 'descriptors'.

The task is to cluster together objects that have as many descriptors as possible in common,

*with the important condition that all objects in a cluster must have at least one descriptor in common*.

Example:

N=5, M_i is the i-th object, D_i is the list of descriptors associated to M_i.

M_1; D_1 = {a,b}.

M_2; D_2 = {b,c}.

M_3; D_3 = {d,x}.

M_4; D_4 = {d,y}.

M_5; D_5 = {e,z}.

Of course we can have N=5 clusters: one per object.

But we can do better than that, because there are objects that have descriptors in common.

First, it is easy to spot that M_5 can only be in a cluster of its own, as it has no descriptors in common with any other object. So we can begin by removing M_5 and set it aside.

For the remaining objects, in this case it's easy to see that we can cluster M_1 + M_2 (b in common), and M_3 + M_4 (d in common). M_2 has no descriptor in common with either M_3 or M_4, so it could not be in their cluster, nor could M_1. Therefore we could have 3 clusters for this set.

[We could also have 4, by leaving 2 objects on their own].

So the problem in general is to reduce the number of clusters to give maximal 'meaning' to the initial set, as opposed to leaving it as a collection of disparate objects. [Shannon entropy may help, but that's another story].

I tried to do this using various techniques.

The one that seemed to work best, until I stumbled on the issue I'm describing, was the following.

I pivot the descriptors to get a matrix like:

D = a,b,c,d,e,x,y,z

M_1 = 1,1,0,0,0,0,0,0

M_2 = 0,1,1,0,0,0,0,0

M_3 = 0,0,0,1,0,1,0,0

...

Objects like M_5 can be identified by removing first all columns that sum to 1 (these descriptors are not in common between any pair of objects); then the rows that sum to 0 correspond to objects that are only 'like' themselves, e.g. M_5. These rows can be removed and set aside.

This leaves a reduced matrix like:

D = b,d

M_1 = 1,0

M_2 = 1,0

M_3 = 0,1

M_4 = 0,1

I then use R's function 'hclust' to get a hierarchical clustering based on the distance matrix of the above matrix, calculated by 'dist' (with method="binary" in my case).

The hclust output is like a tree that can be 'cut' at different levels, corresponding to the number of clusters. Cutting at level k produces a list of all objects, spread in the best possible way among k clusters.

It seemed normal to me to cut at different levels (of course in a case with many more objects, 1445 to be exact, with 604 descriptors, after matrix reduction) and see in each case the distribution of objects in the various clusters, e.g. by 'hist', to find the most 'meaningful' level.

And that's where I found (it should have been obvious to me) that I couldn't cut at

*any arbitrarily low level*.

There is a minimal number of clusters below which you end up including in at least one cluster an object that has no descriptors in common with any other (which violates the important condition I mentioned).

In the simple example above, if I wanted to cut at k=1, I would have to put together all objects, and there is at least one pair of objects with no descriptors in common in the overall set.

In more complicated cases you could have:

a,b,c

a,c,d

a,x,y

d,w,z

e,w,y

Here you can't go below 2, because you can cluster together the first 3 objects and the last 2, but no lower, or you will necessarily include an 'extraneous' object in one cluster.

But in a very large set with hundred of descriptors, where maybe the minimal number is 50 or more, how would you go about evaluating this?

So, more formally, my question is: how can I calculate a priori, or maybe after running hclust, I don't mind, the minimal number of clusters below which the condition of 'at least one descriptor in common in all objects in each cluster' is violated?

Sorry if this is obvious - I'm quite knackered after thinking this over very hard and not getting anywhere; any help would be greatly appreciated.

Thank you!

L