Public Member Functions | Protected Types | Protected Member Functions | Protected Attributes

kmedoids Class Reference

Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw. More...

#include <kmedoids.h>

Inheritance diagram for kmedoids:
Inheritance graph
[legend]
Collaboration diagram for kmedoids:
Collaboration graph
[legend]

List of all members.

Public Member Functions

 kmedoids (size_t num_objects=0)
 Constructor.
 ~kmedoids ()
double average_dissimilarity () const
 Get the average dissimilarity of objects w/their medoids for the last run.
void set_sort_medoids (bool sort_medoids)
 Set whether medoids will be sorted by object id after clustering is complete.
void set_epsilon (double epsilon)
 Set tolerance for convergence.
void pam (const dissimilarity_matrix &distance, size_t k, const object_id *initial_medoids=NULL)
 Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.
double xpam (const dissimilarity_matrix &distance, size_t max_k, size_t dimensionality)
 Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.
template<class T , class D >
void clara (const std::vector< T > &objects, D dmetric, size_t k)
 CLARA clustering algorithm, as per Kaufman and Rousseuw and R.
template<class T , class D >
void center_medoids (const std::vector< T > &objects, D distance)
 Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster.
template<class T , class D >
double xclara (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality)
 K-Agnostic version of CLARA.
void set_init_size (size_t sz)
void set_max_reps (size_t r)
void set_xcallback (void(*)(const partition &part, double bic))
 Set callback function for XPAM and XCLARA. default is none.

Protected Types

typedef boost::mt19937 random_type
typedef
boost::random_number_generator
< random_type, unsigned long > 
rng_type
 Randomness source for this algorithm.

Protected Member Functions

void init_medoids (size_t k, const dissimilarity_matrix &distance)
 KR BUILD algorithm for assigning initial medoids to a partition.
double cost (medoid_id i, object_id h, const dissimilarity_matrix &distance) const
 Total cost of swapping object h with medoid i.
template<class DM >
double assign_objects_to_clusters (DM distance)
 Assign each object to the cluster with the closest medoid.

Protected Attributes

random_type random
 Type for RNG used in this algorithm.
rng_type rng
std::vector< medoid_idsec_nearest
double total_dissimilarity
 Index of second closest medoids. Used by PAM.
bool sort_medoids
 Total dissimilarity bt/w objects and their medoid.
double epsilon
 Whether medoids should be canonically sorted by object id.
size_t init_size
 Normalized sensitivity for convergence.
size_t max_reps
 initial sample size (before 2*k)
void(* xcallback )(const partition &part, double bic)
 initial sample size (before 2*k)

Detailed Description

Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw.

Definition at line 60 of file kmedoids.h.


Member Typedef Documentation

typedef boost::mt19937 random_type [protected]

Definition at line 262 of file kmedoids.h.

typedef boost::random_number_generator<random_type, unsigned long> rng_type [protected]

Randomness source for this algorithm.

Adaptor for STL algorithms.

Definition at line 266 of file kmedoids.h.


Constructor & Destructor Documentation

kmedoids ( size_t  num_objects = 0 )

Constructor.

Can optionally specify number of objects to be clustered. and this will start out with all of them in one cluster.

The random number generator associated with this kmedoids object is seeded with the time in microseconds since the epoch.

Definition at line 55 of file kmedoids.cpp.

~kmedoids (  )

Definition at line 68 of file kmedoids.cpp.


Member Function Documentation

double assign_objects_to_clusters ( DM  distance ) [inline, protected]

Assign each object to the cluster with the closest medoid.

Returns:
Total dissimilarity of objects w/their medoids.
Parameters:
distancea callable object that computes distances between indices, as a distance matrix would. Algorithms are free to use real distance matrices (as in PAM) or to compute lazily (as in CLARA medoid assignment).

Definition at line 296 of file kmedoids.h.

double average_dissimilarity (  ) const

Get the average dissimilarity of objects w/their medoids for the last run.

Definition at line 70 of file kmedoids.cpp.

void center_medoids ( const std::vector< T > &  objects,
distance 
) [inline]

Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster.

This is O(n) and can give better representatives for CLARA clusterings. This is needed to apply gaussian-model BIC criteria to clusterings.

Template Parameters:
TTo use this function, T needs to support algebraic operations.
Specifically, T must support enough to construct a mean:

  • addition T + T = T
  • scalar division T / c = T

Definition at line 195 of file kmedoids.h.

void clara ( const std::vector< T > &  objects,
dmetric,
size_t  k 
) [inline]

CLARA clustering algorithm, as per Kaufman and Rousseuw and R.

Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining."

Template Parameters:
TType of objects to be clustered.
DDissimilarity metric type. D should be callable on (T, T) and should return a double.
Parameters:
objectsObjects to cluster
dmetricDistance metric to build dissimilarity matrices with
kNumber of clusters to partition

Definition at line 130 of file kmedoids.h.

double cost ( medoid_id  i,
object_id  h,
const dissimilarity_matrix distance 
) const [protected]

Total cost of swapping object h with medoid i.

Sums costs of this exchagne for all objects j.

Definition at line 131 of file kmedoids.cpp.

void init_medoids ( size_t  k,
const dissimilarity_matrix distance 
) [protected]

KR BUILD algorithm for assigning initial medoids to a partition.

Definition at line 86 of file kmedoids.cpp.

void pam ( const dissimilarity_matrix distance,
size_t  k,
const object_id initial_medoids = NULL 
)

Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.

Parameters:
distancedissimilarity matrix for all objects to cluster
knumber of clusters to produce
initial_medoidsOptionally supply k initial object ids to be used as initial medoids.
See also:
build_dissimilarity_matrix(), a function to automatically construct a dissimilarity matrix given a vector of objects and a distance function.

Definition at line 157 of file kmedoids.cpp.

void set_epsilon ( double  epsilon )

Set tolerance for convergence.

This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.

Definition at line 78 of file kmedoids.cpp.

void set_init_size ( size_t  sz ) [inline]

Definition at line 254 of file kmedoids.h.

void set_max_reps ( size_t  r ) [inline]

Definition at line 255 of file kmedoids.h.

void set_sort_medoids ( bool  sort_medoids )

Set whether medoids will be sorted by object id after clustering is complete.

Defaults to true.

Definition at line 74 of file kmedoids.cpp.

void set_xcallback ( void(*)(const partition &part, double bic)  xpc )

Set callback function for XPAM and XCLARA. default is none.

Definition at line 82 of file kmedoids.cpp.

double xclara ( const std::vector< T > &  objects,
dmetric,
size_t  max_k,
size_t  dimensionality 
) [inline]

K-Agnostic version of CLARA.

This uses the BIC criterion as described in bic.h to run clara() a number of times and to select a best run of clara() from the trials. This will be slower than regular clara(). In particular, it's O(n*max_k).

Parameters:
[in]objectsObjects to cluster
[in]dmetricDistance metric to build dissimilarity matrices with
[in]max_kMax number of clusters to find.
[in]dimensionalityDimensionality of objects, used by BIC.

Definition at line 235 of file kmedoids.h.

double xpam ( const dissimilarity_matrix distance,
size_t  max_k,
size_t  dimensionality 
)

Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.

Runs PAM from 1 to max_k and selects the best k using the bayesian information criterion. Sets this partition to the best partition found using PAM from 1 to k.

Based on X-Means, see Pelleg & Moore, 2000.

Parameters:
distancedissimilarity matrix for all objects to cluster
max_kUpper limit on number of clusters to find.
dimensionalityNumber of dimensions in clustered data, for BIC.
Returns:
the best BIC value found (the bic value of the final partitioning).
See also:
build_dissimilarity_matrix(), a function to automatically construct a dissimilarity matrix given a vector of objects and a distance function.

Definition at line 218 of file kmedoids.cpp.


Member Data Documentation

double epsilon [protected]

Whether medoids should be canonically sorted by object id.

Definition at line 272 of file kmedoids.h.

size_t init_size [protected]

Normalized sensitivity for convergence.

Definition at line 273 of file kmedoids.h.

size_t max_reps [protected]

initial sample size (before 2*k)

Definition at line 274 of file kmedoids.h.

random_type random [protected]

Type for RNG used in this algorithm.

Definition at line 263 of file kmedoids.h.

rng_type rng [protected]

Definition at line 267 of file kmedoids.h.

std::vector<medoid_id> sec_nearest [protected]

Definition at line 269 of file kmedoids.h.

bool sort_medoids [protected]

Total dissimilarity bt/w objects and their medoid.

Definition at line 271 of file kmedoids.h.

double total_dissimilarity [protected]

Index of second closest medoids. Used by PAM.

Definition at line 270 of file kmedoids.h.

void(* xcallback)(const partition &part, double bic) [protected]

initial sample size (before 2*k)

Callback for each iteration of xpam. is called with the current clustering and its BIC score.

Definition at line 278 of file kmedoids.h.


The documentation for this class was generated from the following files:
Muster. Copyright © 2010, Lawrence Livermore National Laboratory, LLNL-CODE-433662.
Distribution of Muster and its documentation is subject to terms of the Muster LICENSE.
Generated on Mon Dec 20 2010 using Doxygen 1.7.2