Could someone explain to me how this clustering algorithm works?

SUMMARY

This discussion focuses on the hierarchical clustering algorithm using MATLAB's Statistics and Machine Learning Toolbox. The process involves calculating Euclidean distances with the pdist function, followed by hierarchical clustering using linkage and cluster functions. The conversation highlights the importance of understanding the underlying data structure, as the effectiveness of clustering depends on how well the algorithm fits the data. It emphasizes that while classification techniques may not be perfect, they can still provide valuable insights when applied appropriately.

PREREQUISITES
  • Understanding of hierarchical clustering techniques
  • Familiarity with MATLAB and its Statistics and Machine Learning Toolbox
  • Knowledge of Euclidean distance calculations
  • Basic concepts of data classification
NEXT STEPS
  • Explore MATLAB's linkage function documentation for advanced clustering options
  • Learn about different distance metrics in clustering, such as Manhattan and cosine distances
  • Investigate the implications of clustering results on data interpretation and model selection
  • Study the limitations and advantages of various classification algorithms
USEFUL FOR

Data scientists, statisticians, and machine learning practitioners interested in clustering methodologies and their applications in data analysis.

Jamin2112
So MathWorks.com shows this as an example:

d = pdist(meas);
Z = linkage(d);
c = cluster(Z,'maxclust',3:5);

http://www.mathworks.com/help/stats/cluster.html.

I'm confused about why the routine gives any useful information. First it returns the Euclidean distances between values in some array meas. Then it performs hierarchical clustering on those distances. How is that useful? If I had a vector (0, 1, 50, 99, 100), then the distances are |0-1|=1, |0-50|=50, |0-99|=99, |0-100|=100, |1-50|=49, |1-99|=98, |1-100|=99, |50-99|=49, |50-100|=50, |99-100|=1. So I'm then clustering the values 1, 50, 99, 100, 49, 98, 99, 49, 50, 1. If I tell it to form a max of 3 clusters, the clusters will probably be (1, 1), (49, 49, 50), and (98, 99, 99, 100). The first cluster corresponds to the distances between 0 and 1 and between 99 and 100. So that means I'm clustering the values 0, 1, 99, and 100 together.
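As a sanity check on the arithmetic above, here is a sketch (in Python rather than MATLAB; SciPy's pdist mirrors MATLAB's pdist and returns the same condensed vector of pairwise distances):

```python
import numpy as np
from scipy.spatial.distance import pdist

# The five values, one observation per row (pdist expects a 2-D array).
x = np.array([[0.0], [1.0], [50.0], [99.0], [100.0]])

# Condensed vector of the 10 pairwise Euclidean distances, in the
# order (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
d = pdist(x)
print(d)  # [  1.  50.  99. 100.  49.  98.  99.  49.  50.   1.]
```

These are exactly the ten distances listed above; the key point is that linkage treats this vector as pairwise distances *between the original five points*, not as a new data set of ten values to cluster.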

Or am I totally not understanding this?
 
Hey Jamin2112.

The name of the game is classification.

If hierarchical classification fits your data well, that can point you toward a model in which the data points really are best represented in a hierarchical manner.

Basically, each kind of classification has its advantages and disadvantages depending on how it actually classifies the data.

Each algorithm will classify well the data it "expects" to see, and do a poor job when the data isn't as "expected".

Hierarchical classification is also a natural way to classify general data, and it's a lot easier to understand than some other techniques, which is why it's often used.

If a technique classifies data according to an underlying idea that is too complex, it may become useless for all practical purposes.

Classification techniques don't have to be exact; in many applications they just have to be "good enough" (although some applications require better than "good enough").
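On the mechanical question: the clusters are formed over the original observations, not over the distance values themselves. A sketch of the same three-step pipeline, using SciPy's equivalents of MATLAB's pdist/linkage/cluster (both default to single linkage), applied to the example vector from the question:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# The five original observations, one per row.
meas = np.array([[0.0], [1.0], [50.0], [99.0], [100.0]])

# Condensed vector of pairwise distances. This is the INPUT to linkage,
# not the thing being clustered.
d = pdist(meas)

# Build the merge tree over the original five points; like MATLAB's
# linkage, the default method is single linkage.
Z = linkage(d)

# Cut the tree into at most 3 clusters; labels[i] is the cluster id
# assigned to the i-th original observation.
labels = fcluster(Z, t=3, criterion='maxclust')
```

Here 0 and 1 end up in one cluster, 99 and 100 in another, and 50 in its own: the nearest pairs (distance 1) merge first, and cutting at three clusters stops before the distance-49 merges happen. So the distances drive the merging, but the cluster labels attach to the original values.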
 
