Quantifying randomness using clustering algorithms

HJ Farnsworth · Apr 27, 2014

Greetings,

I'm not sure if this site, or this area of the site, is the most likely place for me to get an answer to the question I am about to ask - so if anyone reads the question and doesn't know the answer, but knows of a more likely place for me to get an answer, please let me know, it would be very much appreciated!

I've started to gain an interest in clustering and clustering algorithms, but I am brand new to the subject. Below is a question that I immediately had on the subject as it pertains to determining whether a data set is random or not.

Let's say that I have a data set of size [itex]N[/itex] of completely random numbers in the interval [itex][0,1][/itex].

1. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for that sample size, i.e., a formula of the form [itex]P(N,C)=...[/itex]?

I think that finding an answer to the above question is extremely unlikely, since as far as I have been able to tell, there is no universally accepted definition of what constitutes a cluster - so, the answer to the above question is completely dependent on the algorithm used to count clusters. So, I will modify the above question in a couple of ways...

2a. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given algorithm [itex]A[/itex], i.e., a formula of the form [itex]P_{A}(N,C)=...[/itex]?

2b. Are there formulas for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form [itex]P_{A}(N,C)=...[/itex]?
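
As an illustration of what [itex]P_{A}(N,C)[/itex] means once an algorithm is fixed, here is a minimal Python sketch that estimates it by simulation rather than by formula. It assumes a deliberately simple, hypothetical rule for [itex]A[/itex] that is not from this thread: sort the points and start a new cluster wherever the gap between neighbours exceeds a chosen `threshold`. The function names, the threshold value, and the number of trials are all illustrative assumptions.

```python
import numpy as np


def count_gap_clusters(points, threshold):
    """Count 1D clusters under an assumed rule: a new cluster starts
    wherever the gap between consecutive sorted points exceeds `threshold`."""
    sorted_points = np.sort(np.asarray(points, dtype=float))
    if sorted_points.size == 0:
        return 0
    gaps = np.diff(sorted_points)
    return int(np.count_nonzero(gaps > threshold)) + 1


def estimate_cluster_count_distribution(n, threshold, trials=10_000, seed=0):
    """Monte Carlo estimate of P_A(N, C) for N = n uniform points on [0, 1).

    Returns an array `dist` of length n + 1, where dist[c] is the fraction
    of trials in which exactly c clusters were counted.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(n + 1, dtype=int)
    for _ in range(trials):
        sample = rng.random(n)  # n independent uniform draws
        counts[count_gap_clusters(sample, threshold)] += 1
    return counts / trials


if __name__ == "__main__":
    dist = estimate_cluster_count_distribution(n=20, threshold=0.05)
    for c, p in enumerate(dist):
        if p > 0:
            print(f"C = {c:2d}: estimated probability {p:.4f}")
```

For example, `estimate_cluster_count_distribution(20, 0.05)` returns an array whose entry `C` is the estimated probability of finding `C` clusters among 20 uniform points. A different rule or threshold gives a different distribution, which is why the question only has an answer once the algorithm is specified.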

If the answers to any of the above questions are yes, in what way are they generalizable to higher dimensions? I.e., the above questions were phrased for data on a 1D interval [itex][0,1][/itex]. But, if there was a 2D region [itex][0,1]\times[0,1][/itex], then you could have clustering in the [itex]x[/itex]-direction, the [itex]y[/itex]-direction, or both directions simultaneously. Similarly for [itex]D[/itex] dimensions. So, the final versions of my questions are...

3a. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given dimensionality of the data [itex]D[/itex], and for a given algorithm [itex]A[/itex], i.e., a formula of the form [itex]P_{A}(N,C,D)=...[/itex]?

3b. Are there formulas for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given dimensionality of the data [itex]D[/itex], and for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form [itex]P_{A}(N,C,D)=...[/itex]?
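
To make [itex]P_{A}(N,C,D)[/itex] concrete in [itex]D[/itex] dimensions, the following Python sketch estimates it by simulation under one assumed clustering rule. The rule is a hypothetical choice, not something stated in this thread: two points belong to the same cluster if they are connected by a chain of points in which each step is no longer than `threshold` (single-linkage clustering cut at a fixed distance). The function names and parameter values are illustrative only.

```python
import numpy as np


def count_linked_clusters(points, threshold):
    """Count clusters as connected components of the graph that links
    every pair of points whose Euclidean distance is <= `threshold`."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j  # merge the two clusters

    return len({find(i) for i in range(n)})


def estimate_distribution(n, dim, threshold, trials=2_000, seed=0):
    """Monte Carlo estimate of P_A(N, C, D) for n uniform points in the
    unit cube [0, 1)^dim. Entry c of the result is the fraction of trials
    that produced exactly c clusters."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n + 1, dtype=int)
    for _ in range(trials):
        sample = rng.random((n, dim))
        counts[count_linked_clusters(sample, threshold)] += 1
    return counts / trials


if __name__ == "__main__":
    dist = estimate_distribution(n=20, dim=2, threshold=0.1)
    for c, p in enumerate(dist):
        if p > 0:
            print(f"C = {c:2d}: estimated probability {p:.4f}")
```

With `dim=1` this rule reduces to splitting the sorted points wherever a gap exceeds the threshold, so the same sketch covers the 1D case. The estimate depends entirely on the assumed rule and threshold; it is a way to explore the distribution numerically, not a closed-form answer.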

Thank you very much for any help that you can give.

-HJ Farnsworth

mathman · Apr 27, 2014

For question 2a - it would depend on the precise definition of cluster. For 2b - I can't answer, since I am not familiar with the definition you are referring to.
