Quantifying randomness using clustering algorithms

  • Context: Graduate 
  • Thread starter Thread starter HJ Farnsworth
  • Start date Start date
  • Tags Tags
    Algorithms Randomness
Click For Summary
SUMMARY

This discussion focuses on quantifying randomness through clustering algorithms, specifically addressing the probability of counting clusters in random datasets. The user, HJ Farnsworth, inquires about general formulas for calculating the probability of clusters based on sample size (N), cluster count (C), and algorithm (A), particularly for k-means clustering. The complexity increases with dimensionality, prompting questions about generalizability to higher dimensions. The lack of a universally accepted definition of a cluster complicates the formulation of these probabilities.

PREREQUISITES
  • Understanding of clustering algorithms, particularly k-means clustering
  • Familiarity with probability theory and statistical formulas
  • Knowledge of dimensionality in data analysis
  • Basic concepts of random number generation in the interval [0,1]
NEXT STEPS
  • Research the mathematical foundations of clustering algorithms, focusing on k-means clustering
  • Explore probability distributions related to clustering in random datasets
  • Investigate dimensionality reduction techniques and their impact on clustering
  • Study existing literature on cluster definitions and their implications in data analysis
USEFUL FOR

Data scientists, statisticians, and researchers interested in clustering methodologies and the mathematical quantification of randomness in datasets.

HJ Farnsworth
Messages
126
Reaction score
1
Greetings,

I'm not sure if this site, or this area of the site, is the most likely place for me to get an answer to the question I am about to ask - so if anyone reads the question and doesn't know the answer, but knows of a more likely place for me to get an answer, please let me know, it would be very much appreciated!

I've started to gain an interest in clustering and clustering algorithms, but I am brand new to the subject. Below is a question that I immediately had on the subject as it pertains to determining whether a data set is random or not.

Let's say that I have a data set of size N of completely random numbers in the interval [0,1].

1. Is there a general formula for the probability of counting a given number C of clusters for that sample size, i.e., a formula of the form P(N,C)=...?

I think that finding an answer to the above question is extremely unlikely, since as far as I have been able to tell, there is no universally accepted definition of what constitutes a cluster - so, the answer to the above question is completely dependent on the algorithm used count clusters. So, I will modify the above question in a couple of ways...

2a. Is there a general formula for the probability of counting a given number C of clusters for a sample size of N, for a given algorithm A, i.e., a formula of the form P_{A}(N,C)=...?

2b. Are there formulas for counting a given number C of clusters for a sample size of N, for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form P_{A}(N,C)=...?

If the answers to any of the above questions are yes, in what way are they generalizable to higher dimensions? I.e., the above question was phrased for data on a 1D interval [0,1]. But, if there was a 2D interval [0,1]\times[0,1], then you could have clustering in the x-direction, the y-direction, or both directions simultaneously. Similarly for D dimensions. So, my final version of the above questions are...

3a. Is there a general formula for the probability of counting a given number C of clusters for a sample size of N, for a given dimensionality of the data D, and for a given algorithm A, i.e., a formula of the form P_{A}(N,C,D)=...?

3b. Are there formulas for counting a given number C of clusters for a sample size of N, for a given dimensionality of the data D, and for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form P_{A}(N,C,D)=...?

Thank you very much for any help that you can give.

-HJ Farnsworth
 
Physics news on Phys.org
For question 2a - it would depend on the precise definition of cluster. For 2b - I can't answer, since I am not familiar with the definition you are referring to.
 

Similar threads

  • · Replies 3 ·
Replies
3
Views
785
  • · Replies 5 ·
Replies
5
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 11 ·
Replies
11
Views
3K
  • · Replies 6 ·
Replies
6
Views
2K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 18 ·
Replies
18
Views
4K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 7 ·
Replies
7
Views
3K
  • · Replies 1 ·
Replies
1
Views
3K