HACAM: Hierarchical Agglomerative Clustering Around Medoids - and its Limitations

Partitioning Around Medoids (PAM) is a popular and flexible clustering method. Also known as 𝑘-medoids clustering, and originally conceived for the 𝐿1-norm, it can be used to cluster data into 𝑘 partitions with respect to an arbitrary distance or similarity measure. The ability to work with any distance makes this method more widely applicable than, for example, 𝑘-means clustering. Similar to 𝑘-means, a challenge when using PAM is the need to choose the number of clusters, 𝑘, before running the algorithm. In many cases, the “optimal” 𝑘 will not be known beforehand, and the user may need to run PAM several times with different values of 𝑘 and rely on additional heuristics to pick the “best” result. We introduce the algorithm Hierarchical Agglomerative Clustering Around Medoids (HACAM), which combines ideas from classic hierarchical agglomerative clustering (HAC) with the medoid-based representation of PAM: points are clustered around medoids. In our approach, each subtree of the dendrogram has a representative point, its medoid: the point with the smallest average distance to all other points in the subtree. In contrast to the arithmetic mean, this makes no assumptions about the data representation or the distance function used. Unfortunately, we also show that the requirement to produce a hierarchical result limits cluster quality, since the optimal result for a particular number of clusters 𝑘 need not be consistent with the optimal result for 𝑘+1 clusters. Hence, if a range of interesting values of 𝑘 is known beforehand, existing algorithms such as FasterPAM remain preferable.
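To make the medoid-based agglomeration concrete, here is a minimal sketch, assuming a precomputed pairwise distance matrix and a simple "medoid linkage" (repeatedly merge the two clusters whose medoids are closest). This is an illustrative toy, not the authors' actual HACAM algorithm; the function names `medoid` and `medoid_linkage_sketch` and the choice of linkage are assumptions for demonstration.

```python
import numpy as np

def medoid(D, members):
    """Return the medoid of `members`: the point with the smallest
    average (equivalently, total) distance to all others in the cluster."""
    sub = D[np.ix_(members, members)]
    return members[int(np.argmin(sub.sum(axis=1)))]

def medoid_linkage_sketch(D):
    """Agglomerative clustering over an n-by-n distance matrix D.
    Each cluster is represented by its medoid; at every step the two
    clusters with the closest medoids are merged, until one remains.
    Returns the list of merges (pairs of member lists), i.e. the dendrogram."""
    n = D.shape[0]
    clusters = [[i] for i in range(n)]   # start with singleton clusters
    medoids = list(range(n))             # medoid of a singleton is the point itself
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters whose medoids are closest
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[medoids[a], medoids[b]]
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b])))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        # recompute medoids; only the merged cluster's medoid can change
        medoids = [medoid(D, c) for c in clusters]
    return merges
```

Because only pairwise distances are used, the same sketch works for any dissimilarity, not just Euclidean distance; this is exactly the flexibility the abstract attributes to medoid-based methods over the arithmetic mean.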
