Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined as groups of data objects that are more similar to other objects in their own cluster than they are to data objects in other clusters. In practice, clustering helps identify two qualities of data: meaningfulness and usefulness.
Meaningful clusters expand domain knowledge. For example, in the medical field, researchers applied clustering to gene expression experiments. The clustering results identified groups of patients who respond differently to medical treatments.
Useful clusters, on the other hand, serve as an intermediate step in a data pipeline. For example, businesses use clustering for customer segmentation. The clustering results segment customers into groups with similar purchase histories, which businesses can then use to create targeted advertising campaigns.
Hierarchical clustering (HC) uses two techniques: agglomerative and divisive. In HC, the number of clusters K can be fixed in advance, just as in K-means; n denotes the number of data points, with n > K.
Agglomerative HC starts from n clusters, one per data point, and repeatedly merges the most similar clusters until K clusters remain. Divisive HC starts from a single cluster containing all the data and splits it according to similarity until K clusters are obtained.
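The agglomerative procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `cluster_distance` and `agglomerative` are hypothetical, the points are made up, and the rule for comparing two clusters (smallest Euclidean distance between any pair of points) is an assumption chosen only to keep the example short.

```python
import math


def cluster_distance(c1, c2):
    # Assumed rule: smallest Euclidean distance between any cross-cluster pair.
    return min(math.dist(a, b) for a in c1 for b in c2)


def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # start from n singleton clusters
    while len(clusters) > k:
        # Find the indices (i, j) of the two closest clusters.
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i and drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Illustrative data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, 3)
print(result)
```

With K = 3, the two tight pairs are merged first and the isolated point stays in its own cluster. The search over all cluster pairs at every step makes this sketch slow for large n; it is meant only to show the merging logic.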
Similarity here is measured by the distance between points, and it is the crucial criterion for deciding which clusters to merge or split. The similarity between two clusters can be computed in several ways:
Min: Given two clusters C1 and C2, with point a belonging to C1 and point b belonging to C2, the similarity between C1 and C2 is equal to the minimum distance over all pairs (a, b).
Max: The similarity between C1 and C2 is equal to the maximum distance over all pairs (a, b).
Average: The distance is computed for every pair (a, b), and the average of these distances is the similarity between C1 and C2.
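The three approaches above can be compared on a small example. The sketch below assumes Euclidean distance and uses made-up points; the function names (`pairwise`, `min_linkage`, `max_linkage`, `average_linkage`) are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def pairwise(c1, c2):
    # Euclidean distance for every pair (a, b) with a in c1 and b in c2.
    return [math.dist(a, b) for a, b in product(c1, c2)]


def min_linkage(c1, c2):
    return min(pairwise(c1, c2))


def max_linkage(c1, c2):
    return max(pairwise(c1, c2))


def average_linkage(c1, c2):
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)


# Illustrative clusters: the corners of a 4-by-3 rectangle, split by side.
c1 = [(0.0, 0.0), (0.0, 3.0)]
c2 = [(4.0, 0.0), (4.0, 3.0)]

print(min_linkage(c1, c2))      # 4.0 (the two horizontal sides)
print(max_linkage(c1, c2))      # 5.0 (the two diagonals)
print(average_linkage(c1, c2))  # 4.5 (mean of 4, 5, 5, 4)
```

The same two clusters get three different scores, which is why the choice of approach can change which clusters are merged or split first.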