// -*- c++ -*-

/*!
@mainpage

\image html kmedoids.png

<h2>Introduction</h2>

The Muster library provides implementations of sequential and parallel K-Medoids
clustering algorithms. It is intended as a general framework for parallel
cluster analysis, particularly for performance data analysis on systems with
very large numbers of processes.

The parallel implementations in Muster are designed to perform well even
in environments where the data to be clustered is entirely distributed. For
example, many performance tools need to analyze one data element from each
process in a system. To analyze this data efficiently, clustering algorithms
that move as little data as possible are required. In Muster, we exploit
sampled clustering algorithms to achieve this efficiency.

The parallel algorithms in Muster are implemented using the Message Passing
Interface (MPI), making them suitable for use on many of the world's largest
supercomputers. They should, however, also run efficiently on your laptop.

<h2>Getting Started</h2>

<h3>Partitions</h3>
The algorithms in Muster are <i>partitioning algorithms</i>; that is, they divide
a data set into a set of groups, or <i>clusters</i>, of data objects. Together,
these groups are called a <i>clustering</i> or a <i>partitioning</i> of the data.

There are two classes that represent clusterings:
<dl>
<dt>\link cluster::partition partition\endlink</dt>
<dd>This class represents a partitioning of a data set. It stores the clusters in the
data set along with <i>representatives</i> from each of the clusters. It also stores,
for every object in the data set, the cluster to which that object has been assigned.

A \link cluster::partition partition\endlink is entirely local to the process that owns it. It exists
in one memory space and contains data about <i>all</i> objects in the data set it describes.
</dd>

<dt>\link cluster::par_partition par_partition\endlink</dt>
<dd>This class is similar to \link cluster::partition partition\endlink, but it is a distributed
data structure. A \link cluster::par_partition par_partition\endlink object represents the
results of a parallel clustering algorithm.
Instead of storing the cluster assignments of <i>all</i> objects in the data set, each
par_partition object stores only the cluster membership of objects that are local to the
calling process.
</dd>
</dl>

Note that par_partition does not inherit from partition, because the two classes are structurally
different. One is a local, centralized data structure whose state describes all data in the
set, while the other is a distributed structure that represents only part of the full data set.

If you need to, you can convert a par_partition to a partition with the
\link cluster::par_partition::gather() par_partition::gather()\endlink method, but the two
classes are not interchangeable.

<h3>Clustering Algorithms</h3>
Classes for clustering algorithms derive from either \link cluster::partition partition\endlink
or \link cluster::par_partition par_partition\endlink. The algorithms themselves are methods on these derived
classes, and they store their output in the class. This allows all (or at least most) of
the state for the algorithms and their output to be encapsulated in a single class.

The algorithms are template member functions on the derived classes, so you can call them
on any type of object with any distance metric. Because they are template
functions, you don't need to explicitly indicate the types of the things you pass to the
clustering algorithms; the types are inferred from the algorithms' parameters.

There are two classes you should be concerned with:

<dl>
<dt>\link cluster::kmedoids kmedoids\endlink</dt>
<dd>This class inherits from \link cluster::partition partition\endlink and contains implementations
of the PAM and CLARA sequential clustering algorithms (Kaufman and Rousseeuw, 1990).
PAM is an \f$O(n^2)\f$, optimal K-medoids algorithm, and CLARA is an \f$O(n)\f$ algorithm
that executes multiple sampled trials of PAM. These algorithms are implemented in the
\link cluster::kmedoids::pam() pam()\endlink and \link cluster::kmedoids::clara() clara()\endlink methods.

PAM and CLARA both require that the caller supply \f$k\f$, the number of clusters, as a
parameter to the algorithm. We have adopted a technique used by the
X-Means algorithm (Pelleg and Moore, 2000) to choose an "ideal" \f$k\f$ from many clustering
trials using the \link bic.h Bayesian Information Criterion (BIC)\endlink. Instead of supplying a
specific \f$k\f$, the caller supplies a range of values for \f$k\f$, and the algorithms
use the BIC to select the best fit from the range.

The BIC variants are implemented in
\link cluster::kmedoids::xpam() xpam()\endlink and \link cluster::kmedoids::xclara() xclara()\endlink.
They will be slower than the fixed-k versions, as they may require that many runs of PAM or CLARA
be tested to find the best \f$k\f$.
</dd>

<dt>\link cluster::par_kmedoids par_kmedoids\endlink</dt>
<dd>This class inherits from \link cluster::par_partition par_partition\endlink and implements the CAPEK
parallel clustering algorithm. Functionally, CAPEK is similar to CLARA, but it is
distributed and runs in \f$O(\frac{n}{P}\log(P))\f$ time for \f$n\f$ data objects and \f$P\f$
processes. If \f$n = P\f$, that is, if there is only one input data element per process,
CAPEK runs in \f$O(\log(P))\f$ time.

The fixed-k version of CAPEK is implemented in \link cluster::par_kmedoids::capek() capek()\endlink,
and a variant that uses the BIC to select the best \f$k\f$ is in
\link cluster::par_kmedoids::xcapek() xcapek()\endlink.
</dd>
</dl>
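As a rough sketch, clustering with these classes looks something like the code below. The
<code>point</code> type and <code>euclidean_distance</code> functor are hypothetical stand-ins
for your own data type and dissimilarity measure, and the parameter lists shown for
\link cluster::kmedoids::clara() clara()\endlink, \link cluster::par_kmedoids::xcapek() xcapek()\endlink,
and \link cluster::par_partition::gather() gather()\endlink are assumptions; consult the class
documentation for the exact signatures.

\code
#include <cmath>
#include <vector>
#include <mpi.h>

#include "kmedoids.h"       // assumed header names
#include "par_kmedoids.h"

// Hypothetical data type and dissimilarity functor, for illustration only.
struct point {
  double x, y;
};

struct euclidean_distance {
  double operator()(const point& a, const point& b) const {
    double dx = a.x - b.x, dy = a.y - b.y;
    return std::sqrt(dx * dx + dy * dy);
  }
};

// Sequential clustering: kmedoids is a partition, so the result of the
// call is stored in the object itself.
void cluster_locally(const std::vector<point>& objects) {
  cluster::kmedoids km;
  km.clara(objects, euclidean_distance(), 5);   // assumed: objects, metric, k
}

// Parallel clustering: each process passes only its local objects, and
// par_kmedoids (a par_partition) stores the assignments for those objects.
void cluster_in_parallel(const std::vector<point>& local_objects) {
  cluster::par_kmedoids parkm(MPI_COMM_WORLD);  // assumed constructor
  parkm.xcapek(local_objects, euclidean_distance(), 10, 2);  // assumed: objects,
                                                             // metric, max k,
                                                             // dimensionality
  // If a single, global view of the clustering is needed, gather the
  // distributed result into a local partition on one process.
  cluster::partition global;
  parkm.gather(global, 0);                      // assumed: destination, root
}
\endcode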
<h3>Dissimilarity Functions and Matrices</h3>
Most of the algorithms here require some sort of dissimilarity metric to run. As with
the STL, you can use any callable object (function or functor) as a distance function.
See the documentation for \link cluster::par_kmedoids::xcapek() xcapek()\endlink for an
example of a dissimilarity functor.

The PAM algorithm, which is the basis for all the algorithms in this package, requires a
precomputed dissimilarity matrix in order to run efficiently. Given a set of \f$n\f$
objects, a dissimilarity matrix is a triangular \f$n \times n\f$ matrix \f$D\f$ where
each element \f$D_{ij}\f$ represents the distance between the \f$i^{th}\f$ and \f$j^{th}\f$
objects in the data set. It takes \f$O(n^2)\f$ time to compute such a matrix.

We use the <code>boost::symmetric_matrix</code> class to represent dissimilarity matrices. This class
provides a packed representation of an upper-triangular matrix, making it an efficient way to
store a dissimilarity matrix for clustering. For convenience, we have typedef'd
<code>boost::symmetric_matrix</code> to \link cluster::dissimilarity_matrix\endlink.
To construct a dissimilarity matrix, use \link cluster::build_dissimilarity_matrix()\endlink.

PAM is the only algorithm in the package that requires the user to pass in the matrix
explicitly. This is for efficiency reasons: a user (or another algorithm) may want to call
PAM many times using the same dissimilarity matrix, and with this design, the user can build
the matrix once and then call PAM repeatedly without paying the (potentially very high) cost of
rebuilding it each time.
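For example, reusing the hypothetical <code>point</code> type and <code>euclidean_distance</code>
functor from the sketch above, and assuming the parameter order shown for
\link cluster::build_dissimilarity_matrix() build_dissimilarity_matrix()\endlink and
\link cluster::kmedoids::pam() pam()\endlink (see their documentation for the exact signatures):

\code
// Reuses the point type and euclidean_distance functor from the earlier sketch.
void pam_example(const std::vector<point>& objects) {
  // Build the O(n^2) dissimilarity matrix once...
  cluster::dissimilarity_matrix distance;
  cluster::build_dissimilarity_matrix(objects, euclidean_distance(), distance);

  // ...then run PAM as many times as needed without rebuilding it.
  cluster::kmedoids km;
  km.pam(distance, 4);   // assumed: precomputed matrix, k
  km.pam(distance, 8);
}
\endcode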
<h2>Author</h2>
Muster was implemented by Todd Gamblin at
<a href="http://www.llnl.gov">Lawrence Livermore National Laboratory</a>.

Please send questions, bug reports, or suggestions
to <a href="mailto:tgamblin@llnl.gov">tgamblin@llnl.gov</a>.


<h2>References</h2>
For more details on the algorithms implemented in Muster, you can consult the following
sources:

<ol>
<li>For more on CAPEK:
<p>
Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed.
<a href="http://www.cs.unc.edu/~tgamblin/pubs/scalable-cluster-ics10.pdf">
<b>Clustering Performance Data Efficiently at Massive Scales</b></a>.
<i>Proceedings of the International Conference on Supercomputing (ICS'10)</i>,
Tsukuba, Japan, June 1-4, 2010.

<li>For more on X-Means and the Bayesian Information Criterion:
<p>
Dan Pelleg and Andrew Moore. <a href="http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf">
<b>X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters</b></a>.
<i>Proceedings of the Seventeenth International Conference on Machine Learning</i>,
San Francisco, CA, June 29-July 2, 2000, pp. 727-734.

<li>For more on PAM and CLARA:
<p>
Leonard Kaufman and Peter J. Rousseeuw.
<b><a href="http://www.amazon.com/Finding-Groups-Data-Introduction-Probability/dp/0471735787">Finding Groups in Data: An Introduction to Cluster Analysis</a></b>. John Wiley & Sons, Inc., New York, 1990.

</ol>
*/