Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw. More...
#include <kmedoids.h>
Public Member Functions | |
kmedoids (size_t num_objects=0) | |
Constructor. | |
~kmedoids () | |
double | average_dissimilarity () const |
Get the average dissimilarity of objects w/their medoids for the last run. | |
void | set_sort_medoids (bool sort_medoids) |
Set whether medoids will be sorted by object id after clustering is complete. | |
void | set_epsilon (double epsilon) |
Set tolerance for convergence. | |
void | pam (const dissimilarity_matrix &distance, size_t k, const object_id *initial_medoids=NULL) |
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw. | |
double | xpam (const dissimilarity_matrix &distance, size_t max_k, size_t dimensionality) |
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw. | |
template<class T , class D > | |
void | clara (const std::vector< T > &objects, D dmetric, size_t k) |
CLARA clustering algorithm, as per Kaufman and Rousseuw and R. | |
template<class T , class D > | |
void | center_medoids (const std::vector< T > &objects, D distance) |
Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster. | |
template<class T , class D > | |
double | xclara (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality) |
K-Agnostic version of CLARA. | |
void | set_init_size (size_t sz) |
void | set_max_reps (size_t r) |
void | set_xcallback (void(*)(const partition &part, double bic)) |
Set callback function for XPAM and XCLARA. default is none. | |
Protected Types | |
typedef boost::mt19937 | random_type |
typedef boost::random_number_generator < random_type, unsigned long > | rng_type |
Randomness source for this algorithm. | |
Protected Member Functions | |
void | init_medoids (size_t k, const dissimilarity_matrix &distance) |
KR BUILD algorithm for assigning initial medoids to a partition. | |
double | cost (medoid_id i, object_id h, const dissimilarity_matrix &distance) const |
Total cost of swapping object h with medoid i. | |
template<class DM > | |
double | assign_objects_to_clusters (DM distance) |
Assign each object to the cluster with the closest medoid. | |
Protected Attributes | |
random_type | random |
Type for RNG used in this algorithm. | |
rng_type | rng |
std::vector< medoid_id > | sec_nearest |
double | total_dissimilarity |
Index of second closest medoids. Used by PAM. | |
bool | sort_medoids |
Total dissimilarity bt/w objects and their medoid. | |
double | epsilon |
Whether medoids should be canonically sorted by object id. | |
size_t | init_size |
Normalized sensitivity for convergence. | |
size_t | max_reps |
initial sample size (before 2*k) | |
void(* | xcallback )(const partition &part, double bic) |
initial sample size (before 2*k) |
Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw.
Definition at line 60 of file kmedoids.h.
typedef boost::mt19937 random_type [protected] |
Definition at line 262 of file kmedoids.h.
typedef boost::random_number_generator<random_type, unsigned long> rng_type [protected] |
Randomness source for this algorithm.
Adaptor for STL algorithms.
Definition at line 266 of file kmedoids.h.
kmedoids | ( | size_t | num_objects = 0 ) |
Constructor.
Can optionally specify number of objects to be clustered. and this will start out with all of them in one cluster.
The random number generator associated with this kmedoids object is seeded with the time in microseconds since the epoch.
Definition at line 55 of file kmedoids.cpp.
~kmedoids | ( | ) |
Definition at line 68 of file kmedoids.cpp.
double assign_objects_to_clusters | ( | DM | distance ) | [inline, protected] |
Assign each object to the cluster with the closest medoid.
distance | a callable object that computes distances between indices, as a distance matrix would. Algorithms are free to use real distance matrices (as in PAM) or to compute lazily (as in CLARA medoid assignment). |
Definition at line 296 of file kmedoids.h.
double average_dissimilarity | ( | ) | const |
Get the average dissimilarity of objects w/their medoids for the last run.
Definition at line 70 of file kmedoids.cpp.
void center_medoids | ( | const std::vector< T > & | objects, |
D | distance | ||
) | [inline] |
Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster.
This is O(n) and can give better representatives for CLARA clusterings. This is needed to apply gaussian-model BIC criteria to clusterings.
T | To use this function, T needs to support algebraic operations. Specifically, T must support enough to construct a mean:
|
Definition at line 195 of file kmedoids.h.
void clara | ( | const std::vector< T > & | objects, |
D | dmetric, | ||
size_t | k | ||
) | [inline] |
CLARA clustering algorithm, as per Kaufman and Rousseuw and R.
Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining."
T | Type of objects to be clustered. |
D | Dissimilarity metric type. D should be callable on (T, T) and should return a double. |
objects | Objects to cluster |
dmetric | Distance metric to build dissimilarity matrices with |
k | Number of clusters to partition |
Definition at line 130 of file kmedoids.h.
double cost | ( | medoid_id | i, |
object_id | h, | ||
const dissimilarity_matrix & | distance | ||
) | const [protected] |
Total cost of swapping object h with medoid i.
Sums costs of this exchagne for all objects j.
Definition at line 131 of file kmedoids.cpp.
void init_medoids | ( | size_t | k, |
const dissimilarity_matrix & | distance | ||
) | [protected] |
KR BUILD algorithm for assigning initial medoids to a partition.
Definition at line 86 of file kmedoids.cpp.
void pam | ( | const dissimilarity_matrix & | distance, |
size_t | k, | ||
const object_id * | initial_medoids = NULL |
||
) |
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.
distance | dissimilarity matrix for all objects to cluster |
k | number of clusters to produce |
initial_medoids | Optionally supply k initial object ids to be used as initial medoids. |
Definition at line 157 of file kmedoids.cpp.
void set_epsilon | ( | double | epsilon ) |
Set tolerance for convergence.
This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.
Definition at line 78 of file kmedoids.cpp.
void set_init_size | ( | size_t | sz ) | [inline] |
Definition at line 254 of file kmedoids.h.
void set_max_reps | ( | size_t | r ) | [inline] |
Definition at line 255 of file kmedoids.h.
void set_sort_medoids | ( | bool | sort_medoids ) |
Set whether medoids will be sorted by object id after clustering is complete.
Defaults to true.
Definition at line 74 of file kmedoids.cpp.
void set_xcallback | ( | void(*)(const partition &part, double bic) | xpc ) |
Set callback function for XPAM and XCLARA. default is none.
Definition at line 82 of file kmedoids.cpp.
double xclara | ( | const std::vector< T > & | objects, |
D | dmetric, | ||
size_t | max_k, | ||
size_t | dimensionality | ||
) | [inline] |
K-Agnostic version of CLARA.
This uses the BIC criterion as described in bic.h to run clara() a number of times and to select a best run of clara() from the trials. This will be slower than regular clara(). In particular, it's O(n*max_k).
[in] | objects | Objects to cluster |
[in] | dmetric | Distance metric to build dissimilarity matrices with |
[in] | max_k | Max number of clusters to find. |
[in] | dimensionality | Dimensionality of objects, used by BIC. |
Definition at line 235 of file kmedoids.h.
double xpam | ( | const dissimilarity_matrix & | distance, |
size_t | max_k, | ||
size_t | dimensionality | ||
) |
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.
Runs PAM from 1 to max_k and selects the best k using the bayesian information criterion. Sets this partition to the best partition found using PAM from 1 to k.
Based on X-Means, see Pelleg & Moore, 2000.
distance | dissimilarity matrix for all objects to cluster |
max_k | Upper limit on number of clusters to find. |
dimensionality | Number of dimensions in clustered data, for BIC. |
Definition at line 218 of file kmedoids.cpp.
double epsilon [protected] |
Whether medoids should be canonically sorted by object id.
Definition at line 272 of file kmedoids.h.
size_t init_size [protected] |
Normalized sensitivity for convergence.
Definition at line 273 of file kmedoids.h.
size_t max_reps [protected] |
initial sample size (before 2*k)
Definition at line 274 of file kmedoids.h.
random_type random [protected] |
Type for RNG used in this algorithm.
Definition at line 263 of file kmedoids.h.
Definition at line 267 of file kmedoids.h.
std::vector<medoid_id> sec_nearest [protected] |
Definition at line 269 of file kmedoids.h.
bool sort_medoids [protected] |
Total dissimilarity bt/w objects and their medoid.
Definition at line 271 of file kmedoids.h.
double total_dissimilarity [protected] |
Index of second closest medoids. Used by PAM.
Definition at line 270 of file kmedoids.h.
initial sample size (before 2*k)
Callback for each iteration of xpam. is called with the current clustering and its BIC score.
Definition at line 278 of file kmedoids.h.