Data Clustering

by levien on do 10 juli 2008 // Posted in misc // under

I've written a fast perl/PDL implementation of UPGMA data clustering for very large datasets. The problem is that existing clustering packages have difficulty handling datasets with more than a few thousand data points. Especially the distance matrices tend to become a problem. For example, clustering the outcome of a 300x300 grid-based simulation (90,000 data points) would require a (non-sparse) distance matrix of 8.1 billion entries. This would use over 30 Gb of memory when stored as 4-byte floating point values.

I needed to cluster a lot of such datasets, so to make this manageable I implemented a simple but fast UPGMA clustering algorithm in PDL (the Perl Data Language). To conserve memory, it doesn't store a full hierarchical clustering tree, but rather partitions the data into clusters based on a pre-defined minimum distance between the cluster centroids. Moreover, to handle large data sets it can compute the clusters based on a random sample of the data points, and then assign all other points to the nearest cluster. I'm not sure if this procedure is statistically sound, but it's fast and seems to work very well in practice.

The Perl-code is available under an LGPL-license from my github repository. It should really be cleaned up and modified a bit to make it more readable for others, or better, be rewritten in Python with Pandas... See the README in the git-repository for more details on the implementation and its problems.

If you want to cluster smaller datasets (up to a few thousand points), you're probably better off using more established packages such as Gnu R, Matlab or the open-source C Clustering Library by Michiel de Hoon et al.

  • Clustering in R
  • Cluster 3.0 and the C clustering library:\~mdehoon/software/cluster/
  • The Algorithm::Cluster module for Perl:\~mdehoon/Algorithm-Cluster-1.39/perl/
  • Pycluster for Python:\~mdehoon/software/cluster/software.htm#pycluster