By Teja Kodali
Hello everyone! In this post, I will show you how to do hierarchical clustering in R. We will use the iris dataset again, like we did for k-means clustering.
What is hierarchical clustering?
If you recall from the post about k-means clustering, that method requires us to specify the number of clusters, and finding the optimal number can often be hard. Hierarchical clustering is an alternative approach that builds a hierarchy from the bottom up and doesn’t require us to specify the number of clusters beforehand.
The algorithm works as follows:
- Put each data point in its own cluster.
- Identify the closest two clusters and combine them into one cluster.
- Repeat the above step till all the data points are in a single cluster.
Once this is done, the result is usually represented as a dendrogram, a tree-like diagram that records the order in which clusters were merged.
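To make the steps above concrete, here is a minimal sketch of the bottom-up procedure in plain R. The five points are made up for illustration, and complete linkage (the largest point-to-point distance between two clusters) is assumed as the measure of closeness. This is a naive version meant for reading, not a replacement for R's built-in hclust.

```r
# Naive agglomerative clustering on a tiny made-up data set (illustration only).
points <- matrix(c(1.0, 1.0,
                   1.2, 1.1,
                   5.0, 5.0,
                   5.1, 4.9,
                   9.0, 5.0),
                 ncol = 2, byrow = TRUE)

d <- as.matrix(dist(points))   # pairwise Euclidean distances between points

# Step 1: put each data point in its own cluster (stored as row indices).
clusters <- as.list(seq_len(nrow(points)))
merge_heights <- numeric(0)    # distance at which each merge happens

# Steps 2 and 3: merge the closest two clusters until only one remains.
while (length(clusters) > 1) {
  best <- c(1, 2)
  best_dist <- Inf
  for (i in seq_len(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      # Assumed closeness measure: complete linkage.
      dij <- max(d[clusters[[i]], clusters[[j]]])
      if (dij < best_dist) {
        best_dist <- dij
        best <- c(i, j)
      }
    }
  }
  merged <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters <- c(clusters[-best], list(merged))
  merge_heights <- c(merge_heights, best_dist)
}

print(merge_heights)
```

The values in merge_heights are the heights at which branches would join in a dendrogram of these points.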
There are a few ways to determine how close two clusters are:
- Complete linkage clustering: Find the maximum possible distance between points belonging to two different clusters.
- Single linkage clustering: Find the minimum possible distance between points belonging to two different clusters.
- Mean linkage clustering (also called average linkage): Find all possible pairwise distances for points belonging to two different clusters and then calculate the average.
- Centroid linkage clustering: Find the centroid of each cluster and calculate the distance between centroids of two clusters.
Complete linkage and mean linkage clustering are the ones used most often.
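The four definitions above can be checked by hand on a small example. The sketch below uses two made-up clusters of two points each (a and b are illustrative names, not from this post) and computes each linkage distance directly from the pairwise Euclidean distances.

```r
# Two small made-up clusters in 2-D (illustration only).
a <- matrix(c(0, 0,
              0, 1), ncol = 2, byrow = TRUE)
b <- matrix(c(3, 0,
              4, 0), ncol = 2, byrow = TRUE)

# Distances between every point in a (rows) and every point in b (columns).
all_dist <- as.matrix(dist(rbind(a, b)))
pairwise <- all_dist[seq_len(nrow(a)), nrow(a) + seq_len(nrow(b))]

complete_link <- max(pairwise)    # complete linkage: largest pairwise distance
single_link   <- min(pairwise)    # single linkage: smallest pairwise distance
mean_link     <- mean(pairwise)   # mean linkage: average pairwise distance

# Centroid linkage: distance between the two cluster centroids.
centroid_link <- sqrt(sum((colMeans(a) - colMeans(b))^2))

print(c(complete = complete_link, single = single_link,
        mean = mean_link, centroid = centroid_link))
```

In hclust, these choices correspond to the method values "complete", "single", "average" and "centroid".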
Clustering
In my post on k-means clustering, we saw that the iris dataset contains 3 different species of flowers.
Let us see how well the hierarchical clustering algorithm can do. We can use hclust for this. hclust requires us to provide the data in the form of a distance matrix. We can do this by using dist. By default, the complete linkage method is used.
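A minimal sketch of this step might look as follows. It assumes the two petal measurements (columns 3 and 4 of iris) are used as features; that column choice and the name petal_dist are assumptions for illustration. The iris dataset ships with base R, so no extra packages are needed.

```r
# Assumption: cluster on Petal.Length and Petal.Width (columns 3 and 4).
petal_dist <- dist(iris[, 3:4])   # Euclidean distance matrix between flowers
clusters <- hclust(petal_dist)    # method defaults to "complete" linkage
plot(clusters)                    # draw the dendrogram
```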
Storing the result in clusters and plotting it generates the following dendrogram:
We can see from the figure that the best choices for total number of clusters are either ...
Source: r-bloggers.com