Public Member Functions | Protected Types | Protected Member Functions | Protected Attributes

par_kmedoids Class Reference

This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More...

#include <par_kmedoids.h>

Inheritance diagram for par_kmedoids:
Inheritance graph
[legend]
Collaboration diagram for par_kmedoids:
Collaboration graph
[legend]

List of all members.

Public Member Functions

 par_kmedoids (MPI_Comm comm=MPI_COMM_WORLD)
 Constructs a parallel kmedoids object and seeds its random number generator.
virtual ~par_kmedoids ()
double average_dissimilarity ()
 Get the average dissimilarity of objects w/their medoids for the last run.
double bic_score ()
 BIC score for selected clustering.
void set_max_reps (size_t reps)
 Sets max_reps, Max number of times to run PAM with each sampled dataset.
size_t get_max_reps ()
 Max number of times to run PAM with each sampled dataset.
void set_init_size (size_t size)
 Sets init_size, baseline size for samples, added to 2*k.
size_t get_init_size ()
 Baseline size for samples, added to 2*k.
void set_epsilon (double epsilon)
 Set tolerance for convergence.
template<class T , class D >
void run_pam_trials (trial_generator &trials, const std::vector< T > &objects, D dmetric, std::vector< typename id_pair< T >::vector > &all_medoids, MPI_Comm comm)
 Farms out trials of PAM to worker processes then collects medoids from all trials to all processors.
template<class T , class D >
void capek (const std::vector< T > &objects, D dmetric, size_t k, std::vector< T > *medoids=NULL)
 This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK).
template<class T , class D >
double xcapek (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality, std::vector< T > *medoids=NULL)
 K-agnostic version of capek().
const Timerget_timer ()
 Get the Timer with info on the last run of either capek() or xcapek().

Protected Types

typedef boost::mt19937 random_t
 Type for random number generator used here.

Protected Member Functions

void seed_random_uniform (MPI_Comm comm)
 Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0.
template<class T >
void pack_vector (MPI_Comm comm, const std::vector< T > &packables, std::vector< char > &buffer)
 This packs a vector of packable objects on one process.
template<class T >
void unpack_vector (MPI_Comm comm, const std::vector< char > &buffer, std::vector< T > &packables)
 This unpacks a vector of packable objects packed by pack_vector().
template<typename T , typename D >
std::pair< double, size_t > closest_medoid (const T &object, object_id oid, const std::vector< id_pair< T > > &medoids, D dmetric)
 Find the closest object in the medoids vector to the object passed in.

Protected Attributes

random_t random
 Random number distribution to be used for samples.
double total_dissimilarity
 Total dissimilarity bt/w objects and medoids for last clustering.
double best_bic_score
 BIC score for the clustering found.
size_t init_size
 baseline size for samples
size_t max_reps
 Max repetitions of trials for a particular k.
double epsilon
 Tolerance for convergence tests in kmedoids PAM runs.
Timer timer
 Performance timer.

Detailed Description

This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms.

For more information on these algorithms, see this paper:

Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed. Clustering Performance Data Efficiently at Massive Scales. Proceedings of the International Conference on Supercomputing (ICS'10), Tsukuba, Japan, June 1-4, 2010.

Example usage:

 // This is a theoretical point struct to be clustered.
 struct point {
     double x, y;
 };
 
 // Euclidean distance functor to use for clustering.
 struct euclidean_distance {
     double operator()(const point& lhs, const point& rhs) {
         double a = lhs.x - rhs.x;
         double b = lhs.y - rhs.y;
         return sqrt(a*a + b*b);
     }
 };

 vector<point> points;
 // ... Each process puts some local points in the vector ...
 
 par_kmedoids parkm(MPI_COMM_WORLD);
 vector<point> medoids;   // storage for reprsentatives
 
 // Run xcapek(), finding a max of 20 clusters among the 2d points.
 parkm.xcapek(points, euclidean_distance(), 20, 2, &medoids);

After this runs, these members of parkm are interesting:

Together, these define the clustering that the algorithm found. See par_partition for an explanation of how distributed partitions like this are represented.

The medoids vector contains actual copies of the representatives for each cluster. The copies correspond to the object ids in parkm.medoid_ids. Supplying the medoids vector like this is optional, so if you don't need copies of the representatives, you can omit it from the call.

Definition at line 112 of file par_kmedoids.h.


Member Typedef Documentation

typedef boost::mt19937 random_t [protected]

Type for random number generator used here.

Definition at line 530 of file par_kmedoids.h.


Constructor & Destructor Documentation

par_kmedoids ( MPI_Comm  comm = MPI_COMM_WORLD )

Constructs a parallel kmedoids object and seeds its random number generator.

This is a collective operation, and needs to be called by all processes.

par_kmedoids assumes that there is one object to be clustered per process.

Definition at line 47 of file par_kmedoids.cpp.

virtual ~par_kmedoids (  ) [inline, virtual]

Definition at line 122 of file par_kmedoids.h.


Member Function Documentation

double average_dissimilarity (  )

Get the average dissimilarity of objects w/their medoids for the last run.

Definition at line 62 of file par_kmedoids.cpp.

double bic_score (  )

BIC score for selected clustering.

Definition at line 68 of file par_kmedoids.cpp.

void capek ( const std::vector< T > &  objects,
dmetric,
size_t  k,
std::vector< T > *  medoids = NULL 
) [inline]

This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK).

Assumes that objects to be clustered are fully distributed across parallel process, with the same number of objects per process.

Template Parameters:
TType of objects to be clustered. T must support the following operations:

  • int packed_size(MPI_Comm comm) const
  • void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const
  • static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm)
DDissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's.
Parameters:
[in]objectsLocal objects to cluster (ASSUME: currently must be same number per process!)
[in]dmetricDistance metric to build dissimilarity matrices with
[in]kNumber of clusters to find.
[out]medoidsOptional output vector where global medoids will be stored along with their source ranks.

CAPEK will run trials insatances of PAM on separate processors for each k in 1..max_k using the run_pam_trials() routine. Each trial aggregates sample_size objects distributed over all processes in the system.

See also:
xcapek() for a K-agnostic version of this algorithm.

Definition at line 296 of file par_kmedoids.h.

std::pair<double, size_t> closest_medoid ( const T &  object,
object_id  oid,
const std::vector< id_pair< T > > &  medoids,
dmetric 
) [inline, protected]

Find the closest object in the medoids vector to the object passed in.

Returns a pair of the closest medoid's id and its distance from the object.

Parameters:
[in]objectObject to compare to medoids.
[in]oidID of the object (need this so medoids prefer themselves as their own medoids).
[in]medoidsVector of medoids to find the closest from.
[in]dmetricDistance metric to assess closeness with.

Definition at line 600 of file par_kmedoids.h.

size_t get_init_size (  ) [inline]

Baseline size for samples, added to 2*k.

Definition at line 150 of file par_kmedoids.h.

size_t get_max_reps (  ) [inline]

Max number of times to run PAM with each sampled dataset.

Definition at line 139 of file par_kmedoids.h.

const Timer& get_timer (  ) [inline]

Get the Timer with info on the last run of either capek() or xcapek().

Definition at line 527 of file par_kmedoids.h.

void pack_vector ( MPI_Comm  comm,
const std::vector< T > &  packables,
std::vector< char > &  buffer 
) [inline, protected]

This packs a vector of packable objects on one process.

To be packable, T must support these operations:

  • int packed_size(MPI_Comm comm) const
  • void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const
  • static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm) Each T in the input vector is packed and put on the back of

Definition at line 556 of file par_kmedoids.h.

void run_pam_trials ( trial_generator trials,
const std::vector< T > &  objects,
dmetric,
std::vector< typename id_pair< T >::vector > &  all_medoids,
MPI_Comm  comm 
) [inline]

Farms out trials of PAM to worker processes then collects medoids from all trials to all processors.

Puts resulting medoids in all_medoids when done.

Definition at line 163 of file par_kmedoids.h.

void seed_random_uniform ( MPI_Comm  comm ) [protected]

Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0.

Definition at line 72 of file par_kmedoids.cpp.

void set_epsilon ( double  epsilon )

Set tolerance for convergence.

This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.

Definition at line 57 of file par_kmedoids.cpp.

void set_init_size ( size_t  size ) [inline]

Sets init_size, baseline size for samples, added to 2*k.

Defaults to 40, per Kaufman and Rousseeuw.

Definition at line 145 of file par_kmedoids.h.

void set_max_reps ( size_t  reps ) [inline]

Sets max_reps, Max number of times to run PAM with each sampled dataset.

Default is 5, per Kaufman and Rousseeuw.

Definition at line 134 of file par_kmedoids.h.

void unpack_vector ( MPI_Comm  comm,
const std::vector< char > &  buffer,
std::vector< T > &  packables 
) [inline, protected]

This unpacks a vector of packable objects packed by pack_vector().

Definition at line 578 of file par_kmedoids.h.

double xcapek ( const std::vector< T > &  objects,
dmetric,
size_t  max_k,
size_t  dimensionality,
std::vector< T > *  medoids = NULL 
) [inline]

K-agnostic version of capek().

This version attempts to guess the best K for the data using the Bayesian Information Criterion (BIC) described in bic.h. Evaluation of the BIC is parallelized using global reduction operations.

Like capek(), this uses run_pam_trials() to farm out trials of the PAM clustering algorithm, but it requires more trials than capek(). In particular, it will run (sum(1..max_k) * trials) total trials in parallel on MPI worker processes.

Template Parameters:
TType of objects to be clustered. T must support the following operations:

  • int packed_size(MPI_Comm comm) const
  • void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const
  • static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm)
DDissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's.
Parameters:
[in]objectsLocal objects to cluster (ASSUME: currently must be same number per process!)
[in]dmetricDistance metric to build dissimilarity matrices with
[in]max_kMax number of clusters to find.
[in]dimensionalityDimensionality of objects, used by BIC.
[out]medoidsOptional output vector where global medoids will be stored along with their source ranks.
Returns:
The best BIC value found, that is, the BIC value of the final clustering.

Definition at line 409 of file par_kmedoids.h.


Member Data Documentation

double best_bic_score [protected]

BIC score for the clustering found.

Definition at line 534 of file par_kmedoids.h.

double epsilon [protected]

Tolerance for convergence tests in kmedoids PAM runs.

Definition at line 537 of file par_kmedoids.h.

size_t init_size [protected]

baseline size for samples

Definition at line 535 of file par_kmedoids.h.

size_t max_reps [protected]

Max repetitions of trials for a particular k.

Definition at line 536 of file par_kmedoids.h.

random_t random [protected]

Random number distribution to be used for samples.

Definition at line 531 of file par_kmedoids.h.

Timer timer [protected]

Performance timer.

Definition at line 539 of file par_kmedoids.h.

double total_dissimilarity [protected]

Total dissimilarity bt/w objects and medoids for last clustering.

Definition at line 533 of file par_kmedoids.h.


The documentation for this class was generated from the following files:
Muster. Copyright © 2010, Lawrence Livermore National Laboratory, LLNL-CODE-433662.
Distribution of Muster and its documentation is subject to terms of the Muster LICENSE.
Generated on Mon Dec 20 2010 using Doxygen 1.7.2