This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More...

#include <par_kmedoids.h>

Inheritance diagram for par_kmedoids:

Collaboration diagram for par_kmedoids:

Public Member Functions
	par_kmedoids (MPI_Comm comm=MPI_COMM_WORLD)
	Constructs a parallel kmedoids object and seeds its random number generator.
virtual	~par_kmedoids ()
double	average_dissimilarity ()
	Get the average dissimilarity of objects w/their medoids for the last run.
double	bic_score ()
	BIC score for selected clustering.
void	set_max_reps (size_t reps)
	Sets max_reps, Max number of times to run PAM with each sampled dataset.
size_t	get_max_reps ()
	Max number of times to run PAM with each sampled dataset.
void	set_init_size (size_t size)
	Sets init_size, baseline size for samples, added to 2*k.
size_t	get_init_size ()
	Baseline size for samples, added to 2*k.
void	set_epsilon (double epsilon)
	Set tolerance for convergence.
template<class T , class D >
void	run_pam_trials (trial_generator &trials, const std::vector< T > &objects, D dmetric, std::vector< typename id_pair< T >::vector > &all_medoids, MPI_Comm comm)
	Farms out trials of PAM to worker processes then collects medoids from all trials to all processors.
template<class T , class D >
void	capek (const std::vector< T > &objects, D dmetric, size_t k, std::vector< T > *medoids=NULL)
	This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK).
template<class T , class D >
double	xcapek (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality, std::vector< T > *medoids=NULL)
	K-agnostic version of capek().
const Timer &	get_timer ()
	Get the Timer with info on the last run of either capek() or xcapek().
Protected Types
typedef boost::mt19937	random_t
	Type for random number generator used here.
Protected Member Functions
void	seed_random_uniform (MPI_Comm comm)
	Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0.
template<class T >
void	pack_vector (MPI_Comm comm, const std::vector< T > &packables, std::vector< char > &buffer)
	This packs a vector of packable objects on one process.
template<class T >
void	unpack_vector (MPI_Comm comm, const std::vector< char > &buffer, std::vector< T > &packables)
	This unpacks a vector of packable objects packed by pack_vector().
template<typename T , typename D >
std::pair< double, size_t >	closest_medoid (const T &object, object_id oid, const std::vector< id_pair< T > > &medoids, D dmetric)
	Find the closest object in the medoids vector to the object passed in.
Protected Attributes
random_t	random
	Random number distribution to be used for samples.
double	total_dissimilarity
	Total dissimilarity bt/w objects and medoids for last clustering.
double	best_bic_score
	BIC score for the clustering found.
size_t	init_size
	baseline size for samples
size_t	max_reps
	Max repetitions of trials for a particular k.
double	epsilon
	Tolerance for convergence tests in kmedoids PAM runs.
Timer	timer
	Performance timer.

Detailed Description

This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms.

For more information on these algorithms, see this paper:

: Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed. Clustering Performance Data Efficiently at Massive Scales. Proceedings of the International Conference on Supercomputing (ICS'10), Tsukuba, Japan, June 1-4, 2010.

Example usage:

 // This is a theoretical point struct to be clustered.
 struct point {
     double x, y;
 };
 
 // Euclidean distance functor to use for clustering.
 struct euclidean_distance {
     double operator()(const point& lhs, const point& rhs) {
         double a = lhs.x - rhs.x;
         double b = lhs.y - rhs.y;
         return sqrt(a*a + b*b);
     }
 };

 vector<point> points;
 // ... Each process puts some local points in the vector ...
 
 par_kmedoids parkm(MPI_COMM_WORLD);
 vector<point> medoids;   // storage for reprsentatives
 
 // Run xcapek(), finding a max of 20 clusters among the 2d points.
 parkm.xcapek(points, euclidean_distance(), 20, 2, &medoids);

After this runs, these members of parkm are interesting:

parkm.cluster_ids: A vector of cluster ids for the local objects in points
parkm.medoid_ids: A vector of object ids for all the cluster representatives

Together, these define the clustering that the algorithm found. See par_partition for an explanation of how distributed partitions like this are represented.

The medoids vector contains actual copies of the representatives for each cluster. The copies correspond to the object ids in parkm.medoid_ids. Supplying the medoids vector like this is optional, so if you don't need copies of the representatives, you can omit it from the call.

Definition at line 112 of file par_kmedoids.h.

Member Typedef Documentation

typedef boost::mt19937 random_t [protected]

Type for random number generator used here.

Definition at line 530 of file par_kmedoids.h.

Constructor & Destructor Documentation

par_kmedoids ( MPI_Comm comm = MPI_COMM_WORLD )

Constructs a parallel kmedoids object and seeds its random number generator.

This is a collective operation, and needs to be called by all processes.

par_kmedoids assumes that there is one object to be clustered per process.

Definition at line 47 of file par_kmedoids.cpp.

virtual ~par_kmedoids ( ) [inline, virtual]

Definition at line 122 of file par_kmedoids.h.

Member Function Documentation

double average_dissimilarity ( )

Get the average dissimilarity of objects w/their medoids for the last run.

Definition at line 62 of file par_kmedoids.cpp.

double bic_score ( )

BIC score for selected clustering.

Definition at line 68 of file par_kmedoids.cpp.

void capek	(	const std::vector< T > &	objects,
		D	dmetric,
		size_t	k,
		std::vector< T > *	medoids = `NULL`
	)		`[inline]`

This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK).

Assumes that objects to be clustered are fully distributed across parallel process, with the same number of objects per process.

Template Parameters:

T

Type of objects to be clustered. T must support the following operations:

int packed_size(MPI_Comm comm) const
void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const
static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm)

D Dissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's.

Parameters:

[in]	objects	Local objects to cluster (ASSUME: currently must be same number per process!)
[in]	dmetric	Distance metric to build dissimilarity matrices with
[in]	k	Number of clusters to find.
[out]	medoids	Optional output vector where global medoids will be stored along with their source ranks.

CAPEK will run trials insatances of PAM on separate processors for each k in 1..max_k using the run_pam_trials() routine. Each trial aggregates sample_size objects distributed over all processes in the system.

See also:: xcapek() for a K-agnostic version of this algorithm.

Definition at line 296 of file par_kmedoids.h.

std::pair<double, size_t> closest_medoid	(	const T &	object,
		object_id	oid,
		const std::vector< id_pair< T > > &	medoids,
		D	dmetric
	)		`[inline, protected]`

Find the closest object in the medoids vector to the object passed in.

Returns a pair of the closest medoid's id and its distance from the object.

Parameters:

[in]	object	Object to compare to medoids.
[in]	oid	ID of the object (need this so medoids prefer themselves as their own medoids).
[in]	medoids	Vector of medoids to find the closest from.
[in]	dmetric	Distance metric to assess closeness with.

Definition at line 600 of file par_kmedoids.h.

size_t get_init_size ( ) [inline]

Baseline size for samples, added to 2*k.

Definition at line 150 of file par_kmedoids.h.

size_t get_max_reps ( ) [inline]

Max number of times to run PAM with each sampled dataset.

Definition at line 139 of file par_kmedoids.h.

const Timer& get_timer ( ) [inline]

Get the Timer with info on the last run of either capek() or xcapek().

Definition at line 527 of file par_kmedoids.h.

void pack_vector	(	MPI_Comm	comm,
		const std::vector< T > &	packables,
		std::vector< char > &	buffer
	)		`[inline, protected]`

This packs a vector of packable objects on one process.

To be packable, T must support these operations:

int packed_size(MPI_Comm comm) const
void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const
static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm) Each T in the input vector is packed and put on the back of

Definition at line 556 of file par_kmedoids.h.

void run_pam_trials	(	trial_generator &	trials,
		const std::vector< T > &	objects,
		D	dmetric,
		std::vector< typename id_pair< T >::vector > &	all_medoids,
		MPI_Comm	comm
	)		`[inline]`

Farms out trials of PAM to worker processes then collects medoids from all trials to all processors.

Puts resulting medoids in all_medoids when done.

Definition at line 163 of file par_kmedoids.h.

void seed_random_uniform ( MPI_Comm comm ) [protected]

Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0.

Definition at line 72 of file par_kmedoids.cpp.

void set_epsilon ( double epsilon )

Set tolerance for convergence.

This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.

Definition at line 57 of file par_kmedoids.cpp.

void set_init_size ( size_t size ) [inline]

Sets init_size, baseline size for samples, added to 2*k.

Defaults to 40, per Kaufman and Rousseeuw.

Definition at line 145 of file par_kmedoids.h.

void set_max_reps ( size_t reps ) [inline]

Sets max_reps, Max number of times to run PAM with each sampled dataset.

Default is 5, per Kaufman and Rousseeuw.

Definition at line 134 of file par_kmedoids.h.

void unpack_vector	(	MPI_Comm	comm,
		const std::vector< char > &	buffer,
		std::vector< T > &	packables
	)		`[inline, protected]`

This unpacks a vector of packable objects packed by pack_vector().

Definition at line 578 of file par_kmedoids.h.

double xcapek	(	const std::vector< T > &	objects,
		D	dmetric,
		size_t	max_k,
		size_t	dimensionality,
		std::vector< T > *	medoids = `NULL`
	)		`[inline]`

K-agnostic version of capek().

This version attempts to guess the best K for the data using the Bayesian Information Criterion (BIC) described in bic.h. Evaluation of the BIC is parallelized using global reduction operations.

Like capek(), this uses run_pam_trials() to farm out trials of the PAM clustering algorithm, but it requires more trials than capek(). In particular, it will run (sum(1..max_k) * trials) total trials in parallel on MPI worker processes.

Template Parameters:

T

Type of objects to be clustered. T must support the following operations:

int packed_size(MPI_Comm comm) const
void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const
static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm)

D Dissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's.

Parameters:

[in]	objects	Local objects to cluster (ASSUME: currently must be same number per process!)
[in]	dmetric	Distance metric to build dissimilarity matrices with
[in]	max_k	Max number of clusters to find.
[in]	dimensionality	Dimensionality of objects, used by BIC.
[out]	medoids	Optional output vector where global medoids will be stored along with their source ranks.

Returns:: The best BIC value found, that is, the BIC value of the final clustering.

Definition at line 409 of file par_kmedoids.h.

Member Data Documentation

double best_bic_score [protected]

BIC score for the clustering found.

Definition at line 534 of file par_kmedoids.h.

double epsilon [protected]

Tolerance for convergence tests in kmedoids PAM runs.

Definition at line 537 of file par_kmedoids.h.

size_t init_size [protected]

baseline size for samples

Definition at line 535 of file par_kmedoids.h.

size_t max_reps [protected]

Max repetitions of trials for a particular k.

Definition at line 536 of file par_kmedoids.h.

random_t random [protected]

Random number distribution to be used for samples.

Definition at line 531 of file par_kmedoids.h.

Timer timer [protected]

Performance timer.

Definition at line 539 of file par_kmedoids.h.

double total_dissimilarity [protected]

Total dissimilarity bt/w objects and medoids for last clustering.

Definition at line 533 of file par_kmedoids.h.

The documentation for this class was generated from the following files:

src/par_kmedoids.h
src/par_kmedoids.cpp

par_kmedoids Class Reference

Public Member Functions

Protected Types

Protected Member Functions

Protected Attributes

Detailed Description

Member Typedef Documentation

Constructor & Destructor Documentation

Member Function Documentation

Member Data Documentation