Quantifying randomness using clustering algorithms

HJ Farnsworth · Apr 27, 2014

Greetings,

I'm not sure if this site, or this area of the site, is the most likely place for me to get an answer to the question I am about to ask - so if anyone reads the question and doesn't know the answer, but knows of a more likely place for me to get an answer, please let me know, it would be very much appreciated!

I've started to gain an interest in clustering and clustering algorithms, but I am brand new to the subject. Below is a question that I immediately had on the subject as it pertains to determining whether a data set is random or not.

Let's say that I have a data set of size [itex]N[/itex] of completely random numbers in the interval [itex][0,1][/itex].

1. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for that sample size, i.e., a formula of the form [itex]P(N,C)=...[/itex]?

I think that finding an answer to the above question is extremely unlikely, since as far as I have been able to tell, there is no universally accepted definition of what constitutes a cluster - so, the answer to the above question is completely dependent on the algorithm used count clusters. So, I will modify the above question in a couple of ways...

2a. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given algorithm [itex]A[/itex], i.e., a formula of the form [itex]P_{A}(N,C)=...[/itex]?

2b. Are there formulas for counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form [itex]P_{A}(N,C)=...[/itex]?

If the answers to any of the above questions are yes, in what way are they generalizable to higher dimensions? I.e., the above question was phrased for data on a 1D interval [itex][0,1][/itex]. But, if there was a 2D interval [itex][0,1]\times[0,1][/itex], then you could have clustering in the [itex]x[/itex]-direction, the [itex]y[/itex]-direction, or both directions simultaneously. Similarly for [itex]D[/itex] dimensions. So, my final version of the above questions are...

3a. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given dimensionality of the data [itex]D[/itex], and for a given algorithm [itex]A[/itex], i.e., a formula of the form [itex]P_{A}(N,C,D)=...[/itex]?

3b. Are there formulas for counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given dimensionality of the data [itex]D[/itex], and for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form [itex]P_{A}(N,C,D)=...[/itex]?

Thank you very much for any help that you can give.

-HJ Farnsworth

mathman · Apr 27, 2014

For question 2a - it would depend on the precise definition of cluster. For 2b - I can't answer, since I am not familiar with the definition you are referring to.

Quantifying randomness using clustering algorithms

1. What is the purpose of quantifying randomness using clustering algorithms?

2. How do clustering algorithms quantify randomness?

3. What types of clustering algorithms are commonly used for quantifying randomness?

4. Can clustering algorithms be used to determine if a dataset is truly random?

5. How can the results of clustering algorithms be interpreted in terms of randomness?

Similar threads

Hot Threads

Recent Insights