// -*- c++ -*-

/*!
@mainpage

\image html kmedoids.png

<h2>Introduction</h2>

The Muster library provides implementations of sequential and parallel K-Medoids
clustering algorithms. It is intended as a general framework for parallel
cluster analysis, particularly for performance data analysis on systems with
very large numbers of processes.

The parallel implementations in Muster are designed to perform well even
in environments where the data to be clustered is entirely distributed. For
example, many performance tools need to analyze one data element from each
process in a system. To analyze this data efficiently, clustering algorithms
that move as little data as possible are required. In Muster, we exploit
sampled clustering algorithms to achieve this efficiency.

The parallel algorithms in Muster are implemented using the Message Passing
Interface (MPI), making them suitable for use on many of the world's largest
supercomputers. They should, however, also run efficiently on your laptop.

<h2>Getting Started</h2>

<h3>Partitions</h3>
The algorithms in Muster are <i>partitioning algorithms</i>; that is, they divide
a data set into a set of groups, or <i>clusters</i>, of data objects. Together,
these groups are called a <i>clustering</i> or a <i>partitioning</i> of the data.

There are two classes that represent clusterings:
<dl>
<dt>\link cluster::partition partition\endlink</dt>
<dd>This class represents a partitioning of a data set. It stores the clusters in the
data set along with <i>representatives</i> from each of the clusters. It also stores,
for every object in the data set, the cluster to which that object has been assigned.

A \link cluster::partition partition\endlink is entirely local to the process that owns it. It exists
in one memory space and contains data about <i>all</i> objects in the data set it describes.
</dd>

<dt>\link cluster::par_partition par_partition\endlink</dt>
<dd>This class is similar to \link cluster::partition partition\endlink, but it is a distributed
data structure. A \link cluster::par_partition par_partition\endlink object represents the
results of a parallel clustering algorithm.
Instead of storing the cluster assignments of <i>all</i> objects in the data set, each
par_partition object stores only the cluster membership of objects that are local to the
calling process.
</dd>
</dl>

Note that par_partition does not inherit from partition, because the two classes are structurally
different. One is a local, centralized data structure whose state describes all data in the
set, while the other is a distributed structure that represents only part of the full data set.

If you need to, you can convert a par_partition to a partition with the
\link cluster::par_partition::gather() par_partition::gather()\endlink method, but the two
classes are not interchangeable.

<h3>Clustering Algorithms</h3>
Classes for clustering algorithms derive from either \link cluster::partition partition\endlink
or \link cluster::par_partition par_partition\endlink. The algorithms themselves are methods on these derived
classes, and they store their output in the class. This allows all (or at least most) of
the state for the algorithms and their output to be encapsulated in a single class.

The algorithms are template member functions on the derived classes, so you can call them
on any type of object with any distance metric. Because they are template
functions, you don't need to explicitly indicate the types of the things you pass to the
clustering algorithms; the types are inferred from the algorithms' parameters.

There are two classes you should be concerned with:

<dl>
<dt>\link cluster::kmedoids kmedoids\endlink</dt>
<dd>This class inherits from \link cluster::partition partition\endlink and contains implementations
of the PAM and CLARA sequential clustering algorithms (Kaufman and Rousseeuw, 1990).
PAM is an \f$O(n^2)\f$, optimal K-medoids algorithm, and CLARA is an \f$O(n)\f$ algorithm
that executes multiple sampled trials of PAM. These algorithms are implemented in the
\link cluster::kmedoids::pam() pam()\endlink and \link cluster::kmedoids::clara() clara()\endlink methods.

PAM and CLARA both require that the caller supply \f$k\f$, the number of clusters, as a
parameter to the algorithm. We have adopted a technique used by the
X-Means algorithm (Pelleg and Moore, 2000) to choose an "ideal" \f$k\f$ from many clustering
trials using the \link bic.h Bayesian Information Criterion (BIC)\endlink. Instead of supplying a
specific \f$k\f$, the caller supplies a range of values for \f$k\f$, and the algorithms
use the BIC to select the best fit from the range.

The BIC variants are implemented in
\link cluster::kmedoids::xpam() xpam()\endlink and \link cluster::kmedoids::xclara() xclara()\endlink.
They will be slower than the fixed-k versions, as they may require that many runs of PAM or CLARA
be tested to find the best \f$k\f$.
</dd>

<dt>\link cluster::par_kmedoids par_kmedoids\endlink</dt>
<dd>This class inherits from \link cluster::par_partition par_partition\endlink and implements the CAPEK
parallel clustering algorithm. Functionally, CAPEK is similar to CLARA, but it is
distributed and runs in \f$O(\frac{n}{P}\log(P))\f$ time for \f$n\f$ data objects and \f$P\f$
processes. If \f$n = P\f$, that is, if there is only one input data element per process,
CAPEK runs in \f$O(\log(P))\f$ time.

The fixed-k version of CAPEK is implemented in \link cluster::par_kmedoids::capek() capek()\endlink,
and a variant that uses the BIC to select the best \f$k\f$ is in
\link cluster::par_kmedoids::xcapek() xcapek()\endlink.
</dd>
</dl>
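As a rough sketch, clustering with these classes looks something like the code below. The
<code>point</code> type and <code>euclidean_distance</code> functor are hypothetical stand-ins
for your own data type and dissimilarity measure, and the parameter lists shown for
\link cluster::kmedoids::clara() clara()\endlink, \link cluster::par_kmedoids::xcapek() xcapek()\endlink,
and \link cluster::par_partition::gather() gather()\endlink are assumptions; consult the class
documentation for the exact signatures.

\code
#include <cmath>
#include <vector>
#include <mpi.h>

#include "kmedoids.h"       // assumed header names
#include "par_kmedoids.h"

// Hypothetical data type and dissimilarity functor, for illustration only.
struct point {
  double x, y;
};

struct euclidean_distance {
  double operator()(const point& a, const point& b) const {
    double dx = a.x - b.x, dy = a.y - b.y;
    return std::sqrt(dx * dx + dy * dy);
  }
};

// Sequential clustering: kmedoids is a partition, so the result of the
// call is stored in the object itself.
void cluster_locally(const std::vector<point>& objects) {
  cluster::kmedoids km;
  km.clara(objects, euclidean_distance(), 5);   // assumed: objects, metric, k
}

// Parallel clustering: each process passes only its local objects, and
// par_kmedoids (a par_partition) stores the assignments for those objects.
void cluster_in_parallel(const std::vector<point>& local_objects) {
  cluster::par_kmedoids parkm(MPI_COMM_WORLD);  // assumed constructor
  parkm.xcapek(local_objects, euclidean_distance(), 10, 2);  // assumed: objects,
                                                             // metric, max k,
                                                             // dimensionality
  // If a single, global view of the clustering is needed, gather the
  // distributed result into a local partition on one process.
  cluster::partition global;
  parkm.gather(global, 0);                      // assumed: destination, root
}
\endcode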
<h3>Dissimilarity Functions and Matrices</h3>
Most of the algorithms here require some sort of dissimilarity metric to run. As with
the STL, you can use any callable object (function or functor) as a distance function.
See the documentation for \link cluster::par_kmedoids::xcapek() xcapek()\endlink for an
example of a dissimilarity functor.

The PAM algorithm, which is the basis for all the algorithms in this package, requires a
precomputed dissimilarity matrix in order to run efficiently. Given a set of \f$n\f$
objects, a dissimilarity matrix is a triangular \f$n \times n\f$ matrix \f$D\f$ where
each element \f$D_{ij}\f$ represents the distance between the \f$i^{th}\f$ and \f$j^{th}\f$
objects in the data set. It takes \f$O(n^2)\f$ time to compute such a matrix.

We use the <code>boost::symmetric_matrix</code> class to represent dissimilarity matrices. This class
provides a packed representation of an upper-triangular matrix, making it an efficient way to
store a dissimilarity matrix for clustering. For convenience, we have typedef'd
<code>boost::symmetric_matrix</code> to \link cluster::dissimilarity_matrix\endlink.
To construct a dissimilarity matrix, use \link cluster::build_dissimilarity_matrix()\endlink.

PAM is the only algorithm in the package that requires the user to pass in the matrix
explicitly. This is for efficiency reasons: a user (or another algorithm) may want to call
PAM many times using the same dissimilarity matrix, and with this design, the user can build
the matrix once and then call PAM repeatedly without paying the (potentially very high) cost of
rebuilding it each time.
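For example, reusing the hypothetical <code>point</code> type and <code>euclidean_distance</code>
functor from the sketch above, and assuming the parameter order shown for
\link cluster::build_dissimilarity_matrix() build_dissimilarity_matrix()\endlink and
\link cluster::kmedoids::pam() pam()\endlink (see their documentation for the exact signatures):

\code
// Reuses the point type and euclidean_distance functor from the earlier sketch.
void pam_example(const std::vector<point>& objects) {
  // Build the O(n^2) dissimilarity matrix once...
  cluster::dissimilarity_matrix distance;
  cluster::build_dissimilarity_matrix(objects, euclidean_distance(), distance);

  // ...then run PAM as many times as needed without rebuilding it.
  cluster::kmedoids km;
  km.pam(distance, 4);   // assumed: precomputed matrix, k
  km.pam(distance, 8);
}
\endcode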
<h2>Author</h2>
Muster was implemented by Todd Gamblin at
<a href="http://www.llnl.gov">Lawrence Livermore National Laboratory</a>.

Please send questions, bug reports, or suggestions
to <a href="mailto:tgamblin@llnl.gov">tgamblin@llnl.gov</a>.


<h2>References</h2>
For more details on the algorithms implemented in Muster, you can consult the following
sources:

<ol>
<li>For more on CAPEK:
<p>
Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed.
<a href="http://www.cs.unc.edu/~tgamblin/pubs/scalable-cluster-ics10.pdf">
<b>Clustering Performance Data Efficiently at Massive Scales</b></a>.
<i>Proceedings of the International Conference on Supercomputing (ICS'10)</i>,
Tsukuba, Japan, June 1-4, 2010.

<li>For more on X-Means and the Bayesian Information Criterion:
<p>
Dan Pelleg and Andrew Moore. <a href="http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf">
<b>X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters</b></a>.
<i>Proceedings of the Seventeenth International Conference on Machine Learning</i>,
San Francisco, CA, June 29-July 2, 2000, pp. 727-734.

<li>For more on PAM and CLARA:
<p>
Leonard Kaufman and Peter J. Rousseeuw.
<b><a href="http://www.amazon.com/Finding-Groups-Data-Introduction-Probability/dp/0471735787">Finding Groups in Data: An Introduction to Cluster Analysis</a></b>. John Wiley & Sons, Inc., New York, 1990.

</ol>
*/