Quantifying randomness using clustering algorithms

1. Apr 27, 2014

HJ Farnsworth

Greetings,

I'm not sure if this site, or this area of the site, is the most likely place for me to get an answer to the question I am about to ask - so if anyone reads the question and doesn't know the answer, but knows of a more likely place for me to get an answer, please let me know, it would be very much appreciated!

I've started to gain an interest in clustering and clustering algorithms, but I am brand new to the subject. Below is a question that I immediately had on the subject as it pertains to determining whether a data set is random or not.

Let's say that I have a data set of size $N$ of completely random numbers in the interval $[0,1]$.

1. Is there a general formula for the probability of counting a given number $C$ of clusters for that sample size, i.e., a formula of the form $P(N,C)=...$?

I think that finding an answer to the above question is extremely unlikely, since as far as I have been able to tell, there is no universally accepted definition of what constitutes a cluster - so, the answer to the above question is completely dependent on the algorithm used count clusters. So, I will modify the above question in a couple of ways...

2a. Is there a general formula for the probability of counting a given number $C$ of clusters for a sample size of $N$, for a given algorithm $A$, i.e., a formula of the form $P_{A}(N,C)=...$?

2b. Are there formulas for counting a given number $C$ of clusters for a sample size of $N$, for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form $P_{A}(N,C)=...$?

If the answers to any of the above questions are yes, in what way are they generalizable to higher dimensions? I.e., the above question was phrased for data on a 1D interval $[0,1]$. But, if there was a 2D interval $[0,1]\times[0,1]$, then you could have clustering in the $x$-direction, the $y$-direction, or both directions simultaneously. Similarly for $D$ dimensions. So, my final version of the above questions are...

3a. Is there a general formula for the probability of counting a given number $C$ of clusters for a sample size of $N$, for a given dimensionality of the data $D$, and for a given algorithm $A$, i.e., a formula of the form $P_{A}(N,C,D)=...$?

3b. Are there formulas for counting a given number $C$ of clusters for a sample size of $N$, for a given dimensionality of the data $D$, and for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form $P_{A}(N,C,D)=...$?

Thank you very much for any help that you can give.

-HJ Farnsworth

2. Apr 27, 2014

mathman

For question 2a - it would depend on the precise definition of cluster. For 2b - I can't answer, since I am not familiar with the definition you are referring to.