Quantifying randomness using clustering algorithms

In summary, the poster is new to clustering algorithms and asks whether there is a formula for the probability of counting a given number of clusters in random data, as a function of sample size, dimensionality of the data, and the algorithm used. They acknowledge that the answer may depend on the definition of a cluster and thank the potential responders in advance.
  • #1
HJ Farnsworth
Greetings,

I'm not sure if this site, or this area of the site, is the most likely place for me to get an answer to the question I am about to ask - so if anyone reads the question and doesn't know the answer, but knows of a more likely place for me to get an answer, please let me know, it would be very much appreciated!

I've started to gain an interest in clustering and clustering algorithms, but I am brand new to the subject. Below is a question that I immediately had on the subject as it pertains to determining whether a data set is random or not.

Let's say that I have a data set of size [itex]N[/itex] of completely random numbers in the interval [itex][0,1][/itex].

1. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for that sample size, i.e., a formula of the form [itex]P(N,C)=...[/itex]?

I think that finding an answer to the above question is extremely unlikely, since as far as I have been able to tell, there is no universally accepted definition of what constitutes a cluster - so, the answer to the above question is completely dependent on the algorithm used to count clusters. So, I will modify the above question in a couple of ways...

2a. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given algorithm [itex]A[/itex], i.e., a formula of the form [itex]P_{A}(N,C)=...[/itex]?

2b. Are there formulas for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form [itex]P_{A}(N,C)=...[/itex]?

If the answer to any of the above questions is yes, in what way do the results generalize to higher dimensions? The questions above were phrased for data on a 1D interval [itex][0,1][/itex]. But on a 2D region [itex][0,1]\times[0,1][/itex], you could have clustering in the [itex]x[/itex]-direction, the [itex]y[/itex]-direction, or both directions simultaneously, and similarly for [itex]D[/itex] dimensions. So, the final versions of my questions are...

3a. Is there a general formula for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given dimensionality of the data [itex]D[/itex], and for a given algorithm [itex]A[/itex], i.e., a formula of the form [itex]P_{A}(N,C,D)=...[/itex]?

3b. Are there formulas for the probability of counting a given number [itex]C[/itex] of clusters for a sample size of [itex]N[/itex], for a given dimensionality of the data [itex]D[/itex], and for some commonly used clustering algorithms (e.g., k-means clustering), i.e., formulas of the form [itex]P_{A}(N,C,D)=...[/itex]?

Thank you very much for any help that you can give.

-HJ Farnsworth
 
  • #2
For question 2a - it would depend on the precise definition of cluster. For 2b - I can't answer, since I am not familiar with the definition you are referring to.
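
To make the dependence on the definition concrete, here is a minimal sketch under one assumed definition, which is not from the thread: in 1D, sort the points and start a new cluster wherever the gap between neighbours exceeds a threshold [itex]r[/itex]. Under that rule [itex]C[/itex] is one plus the number of interior gaps larger than [itex]r[/itex]. The standard result for spacings of [itex]N[/itex] uniform points on [itex][0,1][/itex] says that any [itex]j[/itex] chosen gaps all exceed [itex]r[/itex] with probability [itex](1-jr)_{+}^{N}[/itex], and inclusion-exclusion then gives [itex]P_{A}(N,C)[/itex]. The function names and the parameter r are hypothetical, and the simulation is included only as a cross-check on the formula.

```python
import random
from collections import Counter
from math import comb


def count_gap_clusters(points, r):
    """Count 1D clusters: a new cluster starts wherever the gap between
    sorted neighbours exceeds r (an assumed definition of 'cluster')."""
    xs = sorted(points)
    return 1 + sum(1 for a, b in zip(xs, xs[1:]) if b - a > r)


def simulate_cluster_pmf(n, r, trials=20000, seed=0):
    """Monte Carlo estimate of P_A(N, C) for n uniform points on [0, 1]."""
    rng = random.Random(seed)
    counts = Counter(
        count_gap_clusters([rng.random() for _ in range(n)], r)
        for _ in range(trials)
    )
    return {c: counts[c] / trials for c in sorted(counts)}


def exact_cluster_pmf(n, r):
    """Exact P_A(N, C) for the gap rule, via inclusion-exclusion over the
    n - 1 interior gaps. Intended for small n: the alternating sum loses
    floating-point accuracy as n grows."""
    m = n - 1  # number of interior gaps

    def p_all_exceed(j):
        # Probability that j specified gaps all exceed r.
        return max(0.0, 1.0 - j * r) ** n

    pmf = {}
    for k in range(m + 1):  # k = number of interior gaps larger than r
        alternating = sum(
            (-1) ** i * comb(m - k, i) * p_all_exceed(k + i)
            for i in range(m - k + 1)
        )
        pmf[k + 1] = comb(m, k) * alternating  # C = k + 1 clusters
    return pmf
```

For example, `exact_cluster_pmf(8, 0.1)` and `simulate_cluster_pmf(8, 0.1)` should agree to within sampling noise. Algorithms such as k-means take the number of clusters as an input, so a distribution over [itex]C[/itex] only makes sense for them once a rule for choosing that number is added.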
 

1. What is the purpose of quantifying randomness using clustering algorithms?

The purpose of quantifying randomness using clustering algorithms is to identify patterns and relationships within a dataset that may appear random at first glance. By using clustering algorithms, scientists can group data points together based on their similarities, allowing for a better understanding of the underlying structure of the data.

2. How do clustering algorithms quantify randomness?

Clustering algorithms use mathematical techniques to group data points into clusters based on their similarities. These algorithms typically measure the distance between data points and use this information to create clusters that are as homogeneous as possible. The number and distribution of these clusters can then be compared with what the same algorithm would be expected to produce on random data.
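
As an illustration of distance-based grouping, the sketch below assumes one simple rule, chosen here only for illustration: two points belong to the same cluster if a chain of points connects them with every step of length at most r (single linkage cut at a fixed distance). It works in any dimension [itex]D[/itex]. The helper names and the threshold r are hypothetical.

```python
import math
import random


def count_linked_clusters(points, r):
    """Count clusters when points closer than or equal to r are linked.
    Uses a small union-find over all pairs, so cost grows as O(n^2)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= r:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j  # merge the two clusters

    return len({find(i) for i in range(len(points))})


def uniform_points(n, d, rng):
    """Draw n points uniformly from the unit cube [0, 1]^d."""
    return [tuple(rng.random() for _ in range(d)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (1, 2, 3):
        pts = uniform_points(100, d, rng)
        print(d, count_linked_clusters(pts, 0.1))
```

With a fixed r the count depends strongly on [itex]D[/itex], because the same number of points spreads more thinly in higher dimensions and fewer pairs fall within the linking distance.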

3. What types of clustering algorithms are commonly used for quantifying randomness?

There are various types of clustering algorithms that can be used to quantify randomness, including k-means, hierarchical clustering, and density-based clustering. Each algorithm has its own strengths and weaknesses, and the choice of which one to use depends on the specific dataset and research goals.

4. Can clustering algorithms be used to determine if a dataset is truly random?

Clustering algorithms can provide insights into the structure and patterns of a dataset, but they cannot determine if a dataset is truly random. The randomness of a dataset is a statistical property that can only be determined through rigorous testing and analysis.
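
One form such testing can take is a Monte Carlo comparison against uniform data. The sketch below is a minimal illustration, assuming 1D data in [itex][0,1][/itex] and a gap-threshold cluster count as the test statistic. Both choices, the function names, and the threshold r are assumptions made for the example, not part of the discussion above.

```python
import random


def count_gap_clusters(points, r):
    """Count 1D clusters: a new cluster starts wherever the gap between
    sorted neighbours exceeds r."""
    xs = sorted(points)
    return 1 + sum(1 for a, b in zip(xs, xs[1:]) if b - a > r)


def monte_carlo_p_value(data, r, trials=5000, seed=0):
    """Compare the observed cluster count with counts from uniform samples
    of the same size. Returns (observed_count, two_sided_p_value)."""
    rng = random.Random(seed)
    n = len(data)
    observed = count_gap_clusters(data, r)

    low = high = 0
    for _ in range(trials):
        c = count_gap_clusters([rng.random() for _ in range(n)], r)
        if c <= observed:
            low += 1
        if c >= observed:
            high += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_low = (low + 1) / (trials + 1)
    p_high = (high + 1) / (trials + 1)
    return observed, min(1.0, 2 * min(p_low, p_high))


if __name__ == "__main__":
    # Three tight groups of ten points each: far fewer clusters than
    # uniform data of the same size would typically give.
    clumped = [c + 0.001 * i for c in (0.2, 0.5, 0.8) for i in range(10)]
    print(monte_carlo_p_value(clumped, r=0.02))
```

A small p-value says the cluster count is unusual for uniform data. A large one only says that this particular statistic found nothing unusual, which is weaker than showing the data are random.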

5. How can the results of clustering algorithms be interpreted in terms of randomness?

The results of clustering algorithms can be interpreted in terms of randomness by looking at the number and distribution of clusters. As a rough heuristic, a dataset with a high level of randomness may have a large number of clusters with no apparent pattern, while a dataset with low randomness may have fewer, more clearly defined clusters. This is not conclusive on its own, since an algorithm can report clusters even in purely random data. Additionally, the size and shape of the clusters can provide insight into the underlying structure and potential relationships within the dataset.
