Quantifying randomness using clustering algorithms

Click For Summary
The discussion centers on the challenge of quantifying randomness in data sets using clustering algorithms. The original poster seeks formulas to determine the probability of counting a specific number of clusters within a random data set, considering different algorithms and dimensionalities. They acknowledge the lack of a universally accepted definition of a cluster, which complicates the formulation of such probabilities. Additionally, they inquire about the generalizability of these formulas to higher dimensions. The conversation highlights the complexities involved in defining clusters and the need for algorithm-specific approaches.
HJ Farnsworth
Messages
126
Reaction score
1
Greetings,

I'm not sure if this site, or this area of the site, is the most likely place for me to get an answer to the question I am about to ask - so if anyone reads the question and doesn't know the answer, but knows of a more likely place for me to get an answer, please let me know, it would be very much appreciated!

I've started to gain an interest in clustering and clustering algorithms, but I am brand new to the subject. Below is a question that I immediately had on the subject as it pertains to determining whether a data set is random or not.

Let's say that I have a data set of size N of completely random numbers in the interval [0,1].

1. Is there a general formula for the probability of counting a given number C of clusters for that sample size, i.e., a formula of the form P(N,C)=...?

I think that finding an answer to the above question is extremely unlikely, since as far as I have been able to tell, there is no universally accepted definition of what constitutes a cluster - so, the answer to the above question is completely dependent on the algorithm used count clusters. So, I will modify the above question in a couple of ways...

2a. Is there a general formula for the probability of counting a given number C of clusters for a sample size of N, for a given algorithm A, i.e., a formula of the form P_{A}(N,C)=...?

2b. Are there formulas for counting a given number C of clusters for a sample size of N, for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form P_{A}(N,C)=...?

If the answers to any of the above questions are yes, in what way are they generalizable to higher dimensions? I.e., the above question was phrased for data on a 1D interval [0,1]. But, if there was a 2D interval [0,1]\times[0,1], then you could have clustering in the x-direction, the y-direction, or both directions simultaneously. Similarly for D dimensions. So, my final version of the above questions are...

3a. Is there a general formula for the probability of counting a given number C of clusters for a sample size of N, for a given dimensionality of the data D, and for a given algorithm A, i.e., a formula of the form P_{A}(N,C,D)=...?

3b. Are there formulas for counting a given number C of clusters for a sample size of N, for a given dimensionality of the data D, and for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form P_{A}(N,C,D)=...?

Thank you very much for any help that you can give.

-HJ Farnsworth
 
Physics news on Phys.org
For question 2a - it would depend on the precise definition of cluster. For 2b - I can't answer, since I am not familiar with the definition you are referring to.
 
The standard _A " operator" maps a Null Hypothesis Ho into a decision set { Do not reject:=1 and reject :=0}. In this sense ( HA)_A , makes no sense. Since H0, HA aren't exhaustive, can we find an alternative operator, _A' , so that ( H_A)_A' makes sense? Isn't Pearson Neyman related to this? Hope I'm making sense. Edit: I was motivated by a superficial similarity of the idea with double transposition of matrices M, with ## (M^{T})^{T}=M##, and just wanted to see if it made sense to talk...

Similar threads

  • · Replies 5 ·
Replies
5
Views
2K
  • · Replies 6 ·
Replies
6
Views
2K
  • · Replies 2 ·
Replies
2
Views
1K
  • · Replies 7 ·
Replies
7
Views
3K
  • · Replies 1 ·
Replies
1
Views
2K
  • · Replies 3 ·
Replies
3
Views
3K
  • · Replies 0 ·
Replies
0
Views
2K
  • · Replies 11 ·
Replies
11
Views
3K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 16 ·
Replies
16
Views
2K