Could someone explain to me how this clustering algorithm works?

In summary, this thread discusses the pdist and linkage functions used for hierarchical clustering of a dataset. The technique groups data points into a hierarchy, which can help identify a plausible model for the data. Its usefulness depends on whether the data actually have the structure the algorithm expects. It is widely used because it is simple and easy to understand, though it is not always the most accurate method.
  • #1
Jamin2112
So MathWorks.com shows this as an example:

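load fisheriris                 % sample data set that provides the matrix meas
d = pdist(meas);                % pairwise Euclidean distances between the rows of meas
Z = linkage(d);                 % hierarchical cluster tree (single linkage by default)
c = cluster(Z,'maxclust',3:5);  % cluster labels per row, one column each for 3, 4, 5 clusters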


http://www.mathworks.com/help/stats/cluster.html.

I'm confused about why the routine gives any useful information. First it computes the Euclidean distances between the values in some array meas. Then it performs hierarchical clustering on those distances. How is that useful? If I had the vector (0, 1, 50, 99, 100), the distances would be |0-1| = 1, |0-50| = 50, |0-99| = 99, |0-100| = 100, |1-50| = 49, |1-99| = 98, |1-100| = 99, |50-99| = 49, |50-100| = 50, |99-100| = 1. So I'm then clustering the values 1, 50, 99, 100, 49, 98, 99, 49, 50, 1. If I tell it to form a maximum of 3 clusters, the clusters will probably be (1, 1), (49, 49, 50, 50), and (98, 99, 99, 100). The first cluster corresponds to the distances between 0 and 1 and between 99 and 100, so that means I'm clustering the values 0, 1, 99, and 100 together.

Or am I totally not understanding this?
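
One way to test this reading is to trace the three calls by hand on the example vector. The sketch below is plain Python rather than MATLAB and only approximates the toolbox routines: it assumes single linkage (the documented default of linkage) and treats the pairwise distances as a lookup for deciding which observations to merge, not as the values being clustered. The names pairwise_distances and single_linkage are made up for illustration.

```python
from itertools import combinations


def pairwise_distances(values):
    """Return {(i, j): |values[i] - values[j]|} for every pair i < j."""
    return {(i, j): abs(values[i] - values[j])
            for i, j in combinations(range(len(values)), 2)}


def single_linkage(values, max_clusters):
    """Merge the two closest clusters until only max_clusters remain."""
    dist = pairwise_distances(values)
    # Start with every observation in its own cluster (clusters hold indices).
    clusters = [[i] for i in range(len(values))]

    def gap(a, b):
        # Single linkage: distance between the closest members of a and b.
        return min(dist[min(i, j), max(i, j)] for i in a for j in b)

    while len(clusters) > max_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    # Report the original values, not the distances, grouped by cluster.
    return sorted([values[i] for i in c] for c in clusters)


points = [0, 1, 50, 99, 100]
print(sorted(pairwise_distances(points).values()))
print(single_linkage(points, 3))
```

Under those assumptions the five values split into (0, 1), (50), and (99, 100). The cluster labels refer to the original points; the ten distances only determine the order in which points are merged.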
 
  • #2
Hey Jamin2112.

The name of the game is classification.

If you get a good fit for hierarchical classifications then this can help you in identifying a possible model where some data points are best represented in a hierarchical manner.

Basically, each kind of classification has its advantages and disadvantages, depending on how it actually classifies the data.

Each algorithm will do well on the kind of data it "expects" and do a bad job if the data isn't as "expected".

Also, hierarchical classification is a natural way to classify general data, and it's a lot easier to understand than some other techniques, which is why it's often done.

If a technique classifies things using an underlying idea that is too complex, it may become useless for all practical purposes.

In many applications, classification techniques don't have to be exact; they just have to be "good enough" (although some applications require better than that).
 

1. What is clustering and why is it used?

Clustering is a machine learning technique used to group similar data points together in order to find patterns and relationships within a dataset. It is used to identify similarities and differences among data points and to discover hidden structures in the data.

2. How does a clustering algorithm work?

How a clustering algorithm works depends on the method. K-Means, for example, starts from a set of initial cluster centers, known as centroids (often chosen from the data points). Each data point is assigned to its nearest centroid, and each centroid is then moved to the mean of the points assigned to it. These two steps repeat until the centroids no longer move or until a specified number of iterations is reached. Hierarchical clustering, the method discussed in this thread, instead starts with every point in its own cluster and repeatedly merges the two closest clusters.
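
The centroid-based procedure described above corresponds to K-Means. A minimal sketch on scalar data, assuming the starting centroids are supplied by the caller (real implementations usually pick them randomly or with a seeding heuristic); the function name k_means_1d is hypothetical:

```python
def k_means_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm on scalars; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda k: abs(v - centroids[k]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[k]
                   for k, c in enumerate(clusters)]
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters


centroids, clusters = k_means_1d([0, 1, 50, 99, 100], centroids=[0, 50, 100])
print(centroids)
print(clusters)
```

With these starting centroids the run settles after one update, grouping the values as (0, 1), (50), and (99, 100). A different initialization can converge to a different grouping, which is one reason K-Means is often run several times.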

3. What are some common types of clustering algorithms?

Some common types of clustering algorithms include K-Means, Hierarchical Clustering, and Density-based Spatial Clustering of Applications with Noise (DBSCAN). Each algorithm has its own strengths and weaknesses and may be more suitable for certain types of data.

4. How do you determine the number of clusters in a dataset?

Determining the number of clusters in a dataset is an important step in clustering. Common methods include the Elbow method, which looks for the point where adding more clusters stops reducing within-cluster variation appreciably, and the Silhouette method, which measures how much closer each point is to its own cluster than to the nearest other cluster. Both use a quality metric to compare candidate numbers of clusters for a given dataset.
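
As an illustration of the Silhouette method, the sketch below scores three candidate groupings of the vector (0, 1, 50, 99, 100). The groupings are written out by hand as an assumption rather than produced by a clustering run, the function name mean_silhouette is hypothetical, and the code assumes at least two clusters and no cluster made up entirely of identical values.

```python
def mean_silhouette(clusters):
    """Average silhouette width for a list of clusters of scalar values."""
    scores = []
    for idx, own in enumerate(clusters):
        others = [c for k, c in enumerate(clusters) if k != idx]
        for pos, v in enumerate(own):
            if len(own) == 1:
                scores.append(0.0)  # convention: a singleton scores 0
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(abs(v - w) for p, w in enumerate(own) if p != pos) / (len(own) - 1)
            # b: mean distance to the nearest neighbouring cluster.
            b = min(sum(abs(v - w) for w in c) / len(c) for c in others)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hand-written candidate groupings, keyed by number of clusters.
candidates = {
    2: [[0, 1], [50, 99, 100]],
    3: [[0, 1], [50], [99, 100]],
    4: [[0], [1], [50], [99, 100]],
}
scores = {k: mean_silhouette(c) for k, c in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

For these candidates the three-cluster grouping gets the highest average score, which matches the intuition that 50 sits apart from both tight pairs.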

5. What are some applications of clustering in real-world scenarios?

Clustering has a wide range of applications in various industries. It is commonly used in customer segmentation, market research, image recognition, and anomaly detection. It can also be used in recommender systems, fraud detection, and gene expression analysis.
