Hierarchical Clustering: Ward linkage

In summary, the Ward linkage criterion in agglomerative hierarchical clustering is based on the within-group variance, which amounts to a weighted squared distance between cluster centres, so one might expect the result to be independent of the metric used to compute distances between individual elements. In practice, however, the hclust function in R gives different results for different distance metrics. The reason is that the variance of the distances does depend on the metric: the variance of a random variable representing a distance changes when the metric changes, and even a z-score, which is invariant under a change of units, is not preserved when switching between different distance metrics.
  • #1
eoghan
Hi there!

The Ward linkage method in agglomerative hierarchical clustering computes the distance between two clusters from the within-group variance, which amounts to a weighted squared distance between the cluster centres. I would therefore expect Ward linkage not to depend on the metric (Euclidean, Manhattan, squared Euclidean, ...) used to compute the distances between individual elements, because in the end the linkage criterion is based on the variance of the clusters, which has a definite formula independent of the chosen metric. Nonetheless, if I try the hclust function in R, I obtain different results depending on the distance metric. Why?
Thank you
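The behaviour the poster describes can be reproduced outside R as well. As a hedged illustration (using Python with SciPy rather than R's hclust, and made-up random data), SciPy's `linkage` accepts a precomputed condensed distance matrix, so we can feed it Ward linkage on Euclidean versus Manhattan distances and compare the merge heights directly:

```python
# Illustration (SciPy, not R's hclust): Ward linkage applied to pairwise
# distances computed under two different metrics. Note SciPy documents that
# 'ward' is only properly defined for Euclidean distances, which is exactly
# the point at issue in this thread.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # made-up data for the demonstration

# Condensed distance vectors under two metrics
Z_euc = linkage(pdist(X, metric="euclidean"), method="ward")
Z_man = linkage(pdist(X, metric="cityblock"), method="ward")

# Column 2 of the linkage matrix holds the merge heights.
print(np.allclose(Z_euc[:, 2], Z_man[:, 2]))  # False: the metric changes the tree
```

The two linkage matrices differ, confirming that the input metric does propagate into the Ward merge costs.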
 
  • #2
eoghan said:
the variance of the clusters which has a definite formula independently from the chosen metric.

Why do you think the variance of the clusters is independent of the metric? What formula are you talking about?

The variance of a random variable representing a distance won't remain the same number if you change the units of measure. A quantity such as the z-score of a random variable representing a distance would remain numerically the same if the units of distance are changed, but it wouldn't necessarily remain the same if you switch from Euclidean distance to Manhattan distance.

I notice the Wikipedia article http://en.wikipedia.org/wiki/Ward's_method has a caution about using the correct arguments in the R programming language.
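The z-score point above can be checked numerically. A minimal sketch (Python with made-up data; the helper `zscores` is defined here for illustration): rescaling all distances by a constant, as a change of units would, leaves z-scores untouched, while switching from Euclidean to Manhattan distance does not.

```python
# Sanity check: z-scores of distances are invariant under a change of units
# (a constant rescaling) but not under a change of metric.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))  # made-up data

def zscores(d):
    """Standardize a vector of distances to zero mean, unit variance."""
    return (d - d.mean()) / d.std()

d_euc = pdist(X, metric="euclidean")
d_man = pdist(X, metric="cityblock")

# Change of units (e.g. metres -> centimetres): z-scores are identical.
print(np.allclose(zscores(d_euc), zscores(100.0 * d_euc)))  # True
# Change of metric: z-scores differ.
print(np.allclose(zscores(d_euc), zscores(d_man)))  # False
```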
 

1. What is hierarchical clustering?

Hierarchical clustering is a type of unsupervised machine learning algorithm that is used to group data points into clusters based on their similarities. It involves creating a hierarchy of clusters, where the most similar data points are grouped together in the same cluster and the least similar data points are in separate clusters.

2. What is Ward linkage in hierarchical clustering?

Ward linkage is a method used to measure the distance between clusters in hierarchical clustering. It is also known as the minimum variance method and it aims to minimize the total within-cluster variance. This means that it tries to group data points that are similar to each other and have low variance within the cluster.

3. How does Ward linkage work?

Ward linkage starts with each data point in its own individual cluster. Then, at each step, it considers every pair of clusters and merges the pair whose merger yields the minimum increase in total within-cluster variance. This process is repeated until all data points are merged into a single cluster; a desired number of clusters is then obtained by cutting the resulting tree at an appropriate height.
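The "increase in within-cluster variance" has a closed form: merging clusters $A$ and $B$ raises the total within-cluster sum of squares by $\frac{|A||B|}{|A|+|B|}\,\lVert \bar{a}-\bar{b}\rVert^2$, where $\bar{a},\bar{b}$ are the centroids. A short check of this identity on made-up data (Python; `sse` is a helper defined here for the demonstration):

```python
# Verifies the Ward merge-cost identity:
#   SSE(A ∪ B) - SSE(A) - SSE(B) = (|A||B| / (|A|+|B|)) * ||mean(A) - mean(B)||^2
import numpy as np

def sse(C):
    """Within-cluster sum of squared deviations from the centroid."""
    return ((C - C.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 2))            # made-up cluster of 5 points
B = rng.normal(loc=3.0, size=(8, 2))   # made-up cluster of 8 points

direct = sse(np.vstack([A, B])) - sse(A) - sse(B)
na, nb = len(A), len(B)
formula = na * nb / (na + nb) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

print(np.isclose(direct, formula))  # True
```

The weighted squared centroid distance is exactly the quantity Ward linkage minimises at each merge, which is why it is also called the minimum variance method.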

4. What are the advantages of using Ward linkage in hierarchical clustering?

One advantage of using Ward linkage is that it leads to well-defined, compact clusters. This means that the clusters formed will have low within-cluster variance and the data points within each cluster will be similar to each other. Additionally, Ward linkage is less sensitive to outliers compared to other linkage methods.

5. How do you interpret the results of hierarchical clustering using Ward linkage?

The results of hierarchical clustering with Ward linkage are usually visualized as a dendrogram, which displays the hierarchy of merges. The height at which two branches join indicates the distance (the Ward merge cost) between the clusters being merged: the lower the join, the more similar the clusters. A number of clusters can be chosen by cutting the dendrogram with a horizontal line, commonly placed across the tallest vertical stretch that is not interrupted by any merge.
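A minimal sketch of cutting a Ward tree into a chosen number of clusters (using SciPy's `linkage` and `fcluster`; the two-blob data is made up for the example):

```python
# Build a Ward dendrogram on two well-separated blobs, then cut the tree
# into two flat clusters with fcluster.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),   # blob around (0, 0)
               rng.normal(5.0, 0.3, size=(10, 2))])  # blob around (5, 5)

Z = linkage(X, method="ward")                     # Euclidean observations
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters

print(len(set(labels)))  # 2
```

With well-separated blobs like these, the cut recovers the two groups; `scipy.cluster.hierarchy.dendrogram(Z)` would draw the corresponding tree.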
