# Minimal number of clusters

• I
Hello,
I am tackling a problem related to clustering, and something came up that got me quite baffled.
Suppose you have N distinct objects. Each of them is associated to a list of 'descriptors'.
The task is to cluster together objects that have as many descriptors as possible in common, with the important condition that all objects in a cluster must have at least one descriptor in common.

Example:
N=5, M_i is the i-th object, D_i is the list of descriptors associated to M_i.
M_1; D_1 = {a,b}.
M_2; D_2 = {b,c}.
M_3; D_3 = {d,x}.
M_4; D_4 = {d,y}.
M_5; D_5 = {e,z}.

Of course we can have N=5 clusters: one per object.
But we can do better than that, because there are objects that have descriptors in common.
First, it is easy to spot that M_5 can only be in a cluster of its own, as it has no descriptors in common with any other object. So we can begin by removing M_5 and set it aside.
For the remaining objects, in this case it's easy to see that we can cluster M_1 + M_2 (b in common), and M_3 + M_4 (d in common). M_2 has no descriptor in common with either M_3 or M_4, so it could not be in their cluster, nor could M_1. Therefore we could have 3 clusters for this set.
[We could also have 4, by leaving 2 objects on their own].
So the problem in general is to reduce the number of clusters to give maximal 'meaning' to the initial set, as opposed to leaving it as a collection of disparate objects. [Shannon entropy may help, but that's another story].

I tried to do this using various techniques.
The one that seemed to work best, until I stumbled on the issue I'm describing, was the following.
I pivot the descriptors to get a matrix like:
D = a,b,c,d,e,x,y,z
M_1 = 1,1,0,0,0,0,0,0
M_2 = 0,1,1,0,0,0,0,0
M_3 = 0,0,0,1,0,1,0,0
...

Objects like M_5 can be identified by removing first all columns that sum to 1 (these descriptors are not in common between any pair of objects); then the rows that sum to 0 correspond to objects that are only 'like' themselves, e.g. M_5. These rows can be removed and set aside.
This leaves a reduced matrix like:

D = b,d
M_1 = 1,0
M_2 = 1,0
M_3 = 0,1
M_4 = 0,1

I then use R's function 'hclust' to get a hierarchical clustering based on the distance matrix of the above matrix, calculated by 'dist' (with method="binary" in my case).
The hclust output is like a tree that can be 'cut' at different levels, corresponding to the number of clusters. Cutting at level k produces a list of all objects, spread in the best possible way among k clusters.

It seemed normal to me to cut at different levels (of course in a case with many more objects, 1445 to be exact, with 604 descriptors, after matrix reduction) and see in each case the distribution of objects in the various clusters, e.g. by 'hist', to find the most 'meaningful' level.
And that's where I found (it should have been obvious to me) that I couldn't cut at any arbitrarily low level.
There is a minimal number of clusters below which you end up including in at least one cluster an object that has no descriptors in common with any other (which violates the important condition I mentioned).

In the simple example above, if I wanted to cut at k=1, I would have to put together all objects, and there is at least one pair of objects with no descriptors in common in the overall set.
In more complicated cases you could have:
a,b,c
a,c,d
a,x,y
d,w,z
e,w,y

Here you can't go below 2, because you can cluster together the first 3 objects and the last 2, but no lower, or you will necessarily include an 'extraneous' object in one cluster.
But in a very large set with hundred of descriptors, where maybe the minimal number is 50 or more, how would you go about evaluating this?

So, more formally, my question is: how can I calculate a priori, or maybe after running hclust, I don't mind, the minimal number of clusters below which the condition of 'at least one descriptor in common in all objects in each cluster' is violated?

Sorry if this is obvious - I'm quite knackered after thinking this over very hard and not getting anywhere; any help would be greatly appreciated.
Thank you!
L

andrewkirk
Homework Helper
Gold Member
Interesting problem.

It seems to me that most clustering algorithms use measures that are either based on numeric properties of the data, or which allow transitive closure (by which I mean that X and Z can be clustered together if X shares property a with Y and Y shares property b with Z).

In your case you do not want to allow transitive closure, and all your properties are categorical ('factor' in R-speak) rather than numeric. That seems to be quite different from the way clustering is generally done. I wonder whether it is better to try to do this from scratch, rather than use clustering algorithms in a way for which they are not designed.

But before giving up on hclust, one possibility is to write a program that runs it, then runs a loop that increments k each time by 1, starting at 1, and inspects the partition given by hclust at that level. If it matches your criterion it stops, and the answer is that value of k. Otherwise it increments k and loops back.

mfb
Mentor
Form sets of elements with a given descriptor, and the minimal number of clusters becomes the set cover problem, which is NP complete.

andrewkirk
Homework Helper
Gold Member
Following on from my previous post, if your data is in a data frame called dat, with the first column being an ID and every other column being a vector of logicals corresponding to one of the properties, with dat[i,j+1] being TRUE iff the ith data element has property j, and you have called hclust with the result stored in variable hc, the following code should find your minimum k:
Code:
num.props<-ncol(dat)-1
n<-nrow(dat)
k<-1
satisfied<-FALSE
while (!satisfied & k<=n){
x<-cutree(hc,k)
m<-length(unique(x))
satisfied<-TRUE
while (m>0 & satisfied){ # test whether all elements of cluster m share a common property
satisfied<-sum(apply(x[x==m],2,prod))
m<-m-1
}
}
However, given what @mfb has pointed out, you may be able to find an algorithm to solve the problem directly, by internet searching something like
'R-cran set cover problem'

lavoisier
Thank you both!
This is what came up when searching for what you suggested: https://stackoverflow.com/questions/39184257/set-cover-approximation-in-r
Which is good, because I'm already using lpSolve for other applications, in fact using a very clever trick that @andrewkirk showed me to assign a cost to the inclusion of additional sets.
The implementation in the link seems quite simple, because the focus is not on the assignment of a unique set to each item (which is what I want in my case: a unique cluster number to each object), but on the minimisation of the number of container sets.
What I mean is that if the situation were the following:
Code:
items s1 s2 s3 s4
1  1  0  0  0
2  1  1  0  0
3  1  0  1  0
4  0  1  1  1
5  0  0 *1* 1
set s3 could be chosen instead of s4 to produce an equally minimal solution, but item 3 would be present twice, so if we wanted to know 'from which set' to take item 3, we would still have 2 possibilities. Same for item 5.
At the moment I don't see how I could apply lpSolve to my clustering problem, without writing a very complex matrix where all object-descriptor pairs are fully expanded, and then it would become computationally unfeasible. But I may be wrong.

So for now I will assume that the above method tells me the minimal number of clusters I can possibly make, but then I still rely on hclust.
Then for the 'transitive closure' part, I am hoping hclust is clever enough to discriminate, given that it has a full distance matrix as input.
If the task of hclust is to put together sets that have minimal average distance between all pairs of objects, I would hope that, given the choice, it will preferentially include objects that have descriptors in common with all items already present in the cluster, rather than only with a few.
Like, if {a,b,c} and {a,d,e} are already in the cluster, I hope {a,x,y} will be included preferentially, rather than {b,w,z}, because the former has 1 descriptor in common with both pre-existing objects rather than with only one. Not sure if I'd have to define the distance function myself to achieve this. I'll need to think about it.

Or indeed, I can just run hclust and then check the cutree output at different k levels for consistency with my condition. It may be the most practical approach.

Correction to the above: item 5 would not be present twice, only item 3. Still, that means that finding the minimal set of 'sets' that cover all items is not sufficient in general to match each item to a specific set. Situations like the one in the first example, with s1 and s4 covering exclusively some items, with no overlap, are rare, at least in my data.

Update: I tried out the method described in the link on the same set as above.
It turned out that at least 238 descriptors (out of 604) were needed to cover all 1445 objects.
As I said above, however, there is still much overlap, or in other words, the rowSum after reducing the matrix to its lpSolve minimal solution is > 1 for most objects, meaning that the assignment of one single descriptor to each object is by far not unique (I would guess the number of possibilities is the product of all rowSums; and this is only for one of potentially several minimal solutions...).

After applying hclust to the binary distance matrix of the reduced descriptor matrix, in most clusters all objects had at least one descriptor in common, especially if I cut at high k. However, there were also many objects that could be clustered with other objects but were instead made into their own 1-object cluster.
I guess this is an effect of the large k. E.g. if I had A={a,b,c}, B={b,d,e}, C={b,x}, I could put all 3 objects in k=1 cluster (b in common), but if I forced k=2 I would necessarily separate one of these objects out. Maybe there is a way to calculate the 'maximal' number of clusters to avoid this issue.

Reducing k caused an increase in the number of clusters with no descriptor common to all objects. I would have expected this to happen only for k<238. Instead, even with k = 238, I had 1 cluster of this type popping up. Bizarrely, when I re-run hclust on the lpSolve solution matrix (which I thought was better than just cutree with k=238 on the general hclust output generated with the full matrix), I got a much worse result, namely 12 clusters with no descriptors in common to all objects.

So there is still something not working as it should. I suspect this issue can be avoided by a better distance matrix calculation and/or a different clustering method...