![]() |
|
Parallel Algorithms for Hierarchical Clustering
Clark F. Olson Technical report UCB/CSD-94-786, Computer Science Division, University of California at Berkeley, 1994. Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. O(n^2) algorithms, where n is the number of points to cluster, have long been known for this problem [Sibson, 1973; Defays, 1977; Day and Edelsbrunner, 1984]. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe O(n) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an n node CRCW PRAM and O(n\log n) algorithms for these metrics (except average link and complete link) on n/log n node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering. |