This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More...
#include <par_kmedoids.h>
Public Member Functions | |
par_kmedoids (MPI_Comm comm=MPI_COMM_WORLD) | |
Constructs a parallel kmedoids object and seeds its random number generator. | |
virtual | ~par_kmedoids () |
double | average_dissimilarity () |
Get the average dissimilarity of objects w/their medoids for the last run. | |
double | bic_score () |
BIC score for selected clustering. | |
void | set_max_reps (size_t reps) |
Sets max_reps, Max number of times to run PAM with each sampled dataset. | |
size_t | get_max_reps () |
Max number of times to run PAM with each sampled dataset. | |
void | set_init_size (size_t size) |
Sets init_size, baseline size for samples, added to 2*k. | |
size_t | get_init_size () |
Baseline size for samples, added to 2*k. | |
void | set_epsilon (double epsilon) |
Set tolerance for convergence. | |
template<class T , class D > | |
void | run_pam_trials (trial_generator &trials, const std::vector< T > &objects, D dmetric, std::vector< typename id_pair< T >::vector > &all_medoids, MPI_Comm comm) |
Farms out trials of PAM to worker processes then collects medoids from all trials to all processors. | |
template<class T , class D > | |
void | capek (const std::vector< T > &objects, D dmetric, size_t k, std::vector< T > *medoids=NULL) |
This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK). | |
template<class T , class D > | |
double | xcapek (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality, std::vector< T > *medoids=NULL) |
K-agnostic version of capek(). | |
const Timer & | get_timer () |
Get the Timer with info on the last run of either capek() or xcapek(). | |
Protected Types | |
typedef boost::mt19937 | random_t |
Type for random number generator used here. | |
Protected Member Functions | |
void | seed_random_uniform (MPI_Comm comm) |
Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0. | |
template<class T > | |
void | pack_vector (MPI_Comm comm, const std::vector< T > &packables, std::vector< char > &buffer) |
This packs a vector of packable objects on one process. | |
template<class T > | |
void | unpack_vector (MPI_Comm comm, const std::vector< char > &buffer, std::vector< T > &packables) |
This unpacks a vector of packable objects packed by pack_vector(). | |
template<typename T , typename D > | |
std::pair< double, size_t > | closest_medoid (const T &object, object_id oid, const std::vector< id_pair< T > > &medoids, D dmetric) |
Find the closest object in the medoids vector to the object passed in. | |
Protected Attributes | |
random_t | random |
Random number distribution to be used for samples. | |
double | total_dissimilarity |
Total dissimilarity bt/w objects and medoids for last clustering. | |
double | best_bic_score |
BIC score for the clustering found. | |
size_t | init_size |
baseline size for samples | |
size_t | max_reps |
Max repetitions of trials for a particular k. | |
double | epsilon |
Tolerance for convergence tests in kmedoids PAM runs. | |
Timer | timer |
Performance timer. |
This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms.
For more information on these algorithms, see this paper:
Example usage:
// This is a theoretical point struct to be clustered. struct point { double x, y; }; // Euclidean distance functor to use for clustering. struct euclidean_distance { double operator()(const point& lhs, const point& rhs) { double a = lhs.x - rhs.x; double b = lhs.y - rhs.y; return sqrt(a*a + b*b); } }; vector<point> points; // ... Each process puts some local points in the vector ... par_kmedoids parkm(MPI_COMM_WORLD); vector<point> medoids; // storage for reprsentatives // Run xcapek(), finding a max of 20 clusters among the 2d points. parkm.xcapek(points, euclidean_distance(), 20, 2, &medoids);
After this runs, these members of parkm
are interesting:
parkm.cluster_ids
: A vector of cluster ids for the local objects in points
parkm.medoid_ids
: A vector of object ids for all the cluster representativesTogether, these define the clustering that the algorithm found. See par_partition for an explanation of how distributed partitions like this are represented.
The medoids
vector contains actual copies of the representatives for each cluster. The copies correspond to the object ids in parkm.medoid_ids
. Supplying the medoids
vector like this is optional, so if you don't need copies of the representatives, you can omit it from the call.
Definition at line 112 of file par_kmedoids.h.
typedef boost::mt19937 random_t [protected] |
Type for random number generator used here.
Definition at line 530 of file par_kmedoids.h.
par_kmedoids | ( | MPI_Comm | comm = MPI_COMM_WORLD ) |
Constructs a parallel kmedoids object and seeds its random number generator.
This is a collective operation, and needs to be called by all processes.
par_kmedoids assumes that there is one object to be clustered per process.
Definition at line 47 of file par_kmedoids.cpp.
virtual ~par_kmedoids | ( | ) | [inline, virtual] |
Definition at line 122 of file par_kmedoids.h.
double average_dissimilarity | ( | ) |
Get the average dissimilarity of objects w/their medoids for the last run.
Definition at line 62 of file par_kmedoids.cpp.
double bic_score | ( | ) |
BIC score for selected clustering.
Definition at line 68 of file par_kmedoids.cpp.
void capek | ( | const std::vector< T > & | objects, |
D | dmetric, | ||
size_t | k, | ||
std::vector< T > * | medoids = NULL |
||
) | [inline] |
This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK).
Assumes that objects to be clustered are fully distributed across parallel process, with the same number of objects per process.
T | Type of objects to be clustered. T must support the following operations:
|
D | Dissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's. |
[in] | objects | Local objects to cluster (ASSUME: currently must be same number per process!) |
[in] | dmetric | Distance metric to build dissimilarity matrices with |
[in] | k | Number of clusters to find. |
[out] | medoids | Optional output vector where global medoids will be stored along with their source ranks. |
CAPEK will run trials insatances of PAM on separate processors for each k in 1..max_k using the run_pam_trials() routine. Each trial aggregates sample_size objects distributed over all processes in the system.
Definition at line 296 of file par_kmedoids.h.
std::pair<double, size_t> closest_medoid | ( | const T & | object, |
object_id | oid, | ||
const std::vector< id_pair< T > > & | medoids, | ||
D | dmetric | ||
) | [inline, protected] |
Find the closest object in the medoids vector to the object passed in.
Returns a pair of the closest medoid's id and its distance from the object.
[in] | object | Object to compare to medoids. |
[in] | oid | ID of the object (need this so medoids prefer themselves as their own medoids). |
[in] | medoids | Vector of medoids to find the closest from. |
[in] | dmetric | Distance metric to assess closeness with. |
Definition at line 600 of file par_kmedoids.h.
size_t get_init_size | ( | ) | [inline] |
Baseline size for samples, added to 2*k.
Definition at line 150 of file par_kmedoids.h.
size_t get_max_reps | ( | ) | [inline] |
Max number of times to run PAM with each sampled dataset.
Definition at line 139 of file par_kmedoids.h.
const Timer& get_timer | ( | ) | [inline] |
Get the Timer with info on the last run of either capek() or xcapek().
Definition at line 527 of file par_kmedoids.h.
void pack_vector | ( | MPI_Comm | comm, |
const std::vector< T > & | packables, | ||
std::vector< char > & | buffer | ||
) | [inline, protected] |
This packs a vector of packable objects on one process.
To be packable, T must support these operations:
Definition at line 556 of file par_kmedoids.h.
void run_pam_trials | ( | trial_generator & | trials, |
const std::vector< T > & | objects, | ||
D | dmetric, | ||
std::vector< typename id_pair< T >::vector > & | all_medoids, | ||
MPI_Comm | comm | ||
) | [inline] |
Farms out trials of PAM to worker processes then collects medoids from all trials to all processors.
Puts resulting medoids in all_medoids when done.
Definition at line 163 of file par_kmedoids.h.
void seed_random_uniform | ( | MPI_Comm | comm ) | [protected] |
Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0.
Definition at line 72 of file par_kmedoids.cpp.
void set_epsilon | ( | double | epsilon ) |
Set tolerance for convergence.
This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.
Definition at line 57 of file par_kmedoids.cpp.
void set_init_size | ( | size_t | size ) | [inline] |
Sets init_size, baseline size for samples, added to 2*k.
Defaults to 40, per Kaufman and Rousseeuw.
Definition at line 145 of file par_kmedoids.h.
void set_max_reps | ( | size_t | reps ) | [inline] |
Sets max_reps, Max number of times to run PAM with each sampled dataset.
Default is 5, per Kaufman and Rousseeuw.
Definition at line 134 of file par_kmedoids.h.
void unpack_vector | ( | MPI_Comm | comm, |
const std::vector< char > & | buffer, | ||
std::vector< T > & | packables | ||
) | [inline, protected] |
This unpacks a vector of packable objects packed by pack_vector().
Definition at line 578 of file par_kmedoids.h.
double xcapek | ( | const std::vector< T > & | objects, |
D | dmetric, | ||
size_t | max_k, | ||
size_t | dimensionality, | ||
std::vector< T > * | medoids = NULL |
||
) | [inline] |
K-agnostic version of capek().
This version attempts to guess the best K for the data using the Bayesian Information Criterion (BIC) described in bic.h. Evaluation of the BIC is parallelized using global reduction operations.
Like capek(), this uses run_pam_trials() to farm out trials of the PAM clustering algorithm, but it requires more trials than capek(). In particular, it will run (sum(1..max_k) * trials) total trials in parallel on MPI worker processes.
T | Type of objects to be clustered. T must support the following operations:
|
D | Dissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's. |
[in] | objects | Local objects to cluster (ASSUME: currently must be same number per process!) |
[in] | dmetric | Distance metric to build dissimilarity matrices with |
[in] | max_k | Max number of clusters to find. |
[in] | dimensionality | Dimensionality of objects, used by BIC. |
[out] | medoids | Optional output vector where global medoids will be stored along with their source ranks. |
Definition at line 409 of file par_kmedoids.h.
double best_bic_score [protected] |
BIC score for the clustering found.
Definition at line 534 of file par_kmedoids.h.
double epsilon [protected] |
Tolerance for convergence tests in kmedoids PAM runs.
Definition at line 537 of file par_kmedoids.h.
size_t init_size [protected] |
baseline size for samples
Definition at line 535 of file par_kmedoids.h.
size_t max_reps [protected] |
Max repetitions of trials for a particular k.
Definition at line 536 of file par_kmedoids.h.
Random number distribution to be used for samples.
Definition at line 531 of file par_kmedoids.h.
Performance timer.
Definition at line 539 of file par_kmedoids.h.
double total_dissimilarity [protected] |
Total dissimilarity bt/w objects and medoids for last clustering.
Definition at line 533 of file par_kmedoids.h.