Namespace for everything in the cluster library. More...
Classes | |
struct | counter_iterator |
Counting output iterator that records how many times an output iterator was assigned to, but ignores the value stored. More... | |
struct | matrix_distance |
Adaptor for passing a matrix by reference to template functions that take a callable distance function. More... | |
struct | lazy_distance_functor |
Functor for computing distance lazily from an object array and a distance metric. More... | |
struct | id_pair |
MPI-packable struct for an MPI-packable type plus its object id. More... | |
class | kmedoids |
Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw. More... | |
class | multi_gather |
Asynchronous, some-to-some gather operation used by parallel clustering algorithms to simultaneously send members of sample sets to a set of distributed worker processes. More... | |
class | par_kmedoids |
This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More... | |
struct | par_partition |
par_partition represents a partitioning of a distributed data set. More... | |
struct | partition |
This represents a partitioning of a data set. More... | |
struct | trial |
This struct represents parameters for a single trial run of kmedoids. More... | |
class | trial_generator |
Class to generate a set of trials for clustering. More... | |
Typedefs | |
typedef boost::numeric::ublas::symmetric_matrix < double > | dissimilarity_matrix |
Packed repersentation of symmetric dissimilarity matrix. | |
typedef size_t | medoid_id |
More descriptive type for medoid index. | |
typedef size_t | object_id |
More descriptive type for object index. | |
typedef std::vector< std::set < object_id > > | cluster_list |
Explicit representation of a clustering. | |
Functions | |
template<typename D > | |
double | bic (const partition &p, D distance, size_t M) |
Directly computes the BIC from a partition object based on the cluster centroids and the number of clusters. | |
template<typename SizeIterator , typename DissimIterator > | |
double | bic (size_t k, SizeIterator cluster_sizes, DissimIterator sum2_dissim, size_t dimensionality) |
This version of the BIC assumes some precomputed information. | |
template<class T > | |
counter_iterator< T > | counter (T &ref) |
Adaptor for creating type-inferred counters. | |
template<class T , class D > | |
void | build_dissimilarity_matrix (const std::vector< T > &objects, D dissimilarity, dissimilarity_matrix &mat) |
Computes a dissimilarity matrix from a vector of objects. | |
template<class T , class D > | |
void | build_dissimilarity_matrix (const std::vector< T > &objects, const std::vector< size_t > &subset, D dissimilarity, dissimilarity_matrix &mat) |
Computes a dissimilarity matrix from a subset of a vector of objects. | |
template<class T , class D > | |
lazy_distance_functor< T, D > | lazy_distance (const std::vector< T > &objs, D dist) |
Type-inferred syntactic sugar for constructing lazy_distance_functor. | |
template<class T > | |
id_pair< T > | make_id_pair (const T &elt, int id) |
Helper function for making arbitrary id_pairs with type inference. | |
template<class T > | |
std::ostream & | operator<< (std::ostream &out, const id_pair< T > &p) |
Print out an id_pair as a tuple of its element and its source rank. | |
std::ostream & | operator<< (std::ostream &out, const par_partition &par) |
Right now this just uses parition::operator<<() by making a partition with this par_partition's cluster_ids and medoid_ids vectors and outputting it. | |
ostream & | operator<< (ostream &out, const cluster_list &clusters) |
double | mirkin_distance (const cluster_list &c1, const cluster_list &c2) |
Mirkin distance bt/w two clusterings. | |
std::ostream & | operator<< (std::ostream &out, const partition &km) |
For convenience. | |
double | mirkin_distance (const partition &c1, const partition &c2) |
Convenience overload for comparing partition objects directly. | |
void | expand (cluster_list &list, size_t level=1) |
Expand a cluster_list by l levels. | |
std::ostream & | operator<< (std::ostream &out, const partition::member_writer &mw) |
std::ostream & | operator<< (std::ostream &out, const cluster_list &list) |
Prints out nicely formatted clustering. | |
template<typename D > | |
double | total_dissimilarity (const partition &p, D dist) |
Compute the total dissimilarity between all objects and their medoids. | |
template<typename D > | |
double | total_dissimilarity (const partition &p, D dist, medoid_id m) |
Compute the total dissimilarity between all objects in a particular cluster and its medoid. | |
template<typename D > | |
double | total_squared_dissimilarity (const partition &p, D dist) |
Compute the total squared dissimilarity between all objects and their medoids. | |
template<typename D > | |
double | total_squared_dissimilarity (const partition &p, D dist, medoid_id m) |
Compute the total squared dissimilarity between all objects in a particular cluster and its medoid. | |
template<class OutputIterator , class Random > | |
void | random_subset (size_t numElements, size_t sample_size, OutputIterator out, Random &random) |
This is Knuth's algorithm R for taking a sample of indices from 0 to numElements. | |
long | get_time_seed () |
Returns a reasonably distributed seed for random number generators. |
Namespace for everything in the cluster library.
typedef std::vector< std::set<object_id> > cluster_list |
Explicit representation of a clustering.
Instead of a vecto of representative ids, this has k sets of object_ids indicating which objects are in a particular cluster. You can convert a partition to a cluster_list with to_cluster_list().
Definition at line 60 of file partition.h.
typedef boost::numeric::ublas::symmetric_matrix<double> dissimilarity_matrix |
Packed repersentation of symmetric dissimilarity matrix.
Definition at line 48 of file dissimilarity.h.
typedef size_t medoid_id |
More descriptive type for medoid index.
Definition at line 51 of file partition.h.
typedef size_t object_id |
More descriptive type for object index.
Definition at line 52 of file partition.h.
double cluster::bic | ( | const partition & | p, |
D | distance, | ||
size_t | M | ||
) |
Directly computes the BIC from a partition object based on the cluster centroids and the number of clusters.
[in] | p | A partition object describing the clustering to be evaluated. |
[in] | distance | A distance function callable on two indices from the partition p. |
[in] | M | Dimensionality parameter -- degrees of freedom in the input dataset. |
double cluster::bic | ( | size_t | k, |
SizeIterator | cluster_sizes, | ||
DissimIterator | sum2_dissim, | ||
size_t | dimensionality | ||
) |
This version of the BIC assumes some precomputed information.
This is useful for parallel implementations, where it is more efficient to compute some global sums as a reduction rather than aggregating a full partition to one process. Here, we assume that the sizes of the distributed clusters as well as the total squared intra-cluster dissimilarity (between each object and its medoid) is known.
[in] | k | Number of clusters in the clustering. Same as k from k-means or k-medoids. |
[in] | cluster_sizes | Start of range of k sizes. *cluster_sizes .. *(cluster_sizes + k) should be the sizes of clusters 1 to k |
[in] | sum2_dissim | Sum of squared dissimilarities of each object w.r.t. its nearest medoid. |
[in] | dimensionality | Dimensionality of clustered data. e.g., 2 for 2-dimensional points. |
void cluster::build_dissimilarity_matrix | ( | const std::vector< T > & | objects, |
D | dissimilarity, | ||
dissimilarity_matrix & | mat | ||
) |
Computes a dissimilarity matrix from a vector of objects.
[in] | objects | Vector of any type T. |
[in] | dissimilarity | A dissimilarity measure that gives the distance between two T's. Needs to be callable on (T, T). |
[out] | mat | Output parameter. Dissimiliarity matrix is stored here. |
Definition at line 59 of file dissimilarity.h.
void cluster::build_dissimilarity_matrix | ( | const std::vector< T > & | objects, |
const std::vector< size_t > & | subset, | ||
D | dissimilarity, | ||
dissimilarity_matrix & | mat | ||
) |
Computes a dissimilarity matrix from a subset of a vector of objects.
objects | Vector of any type T. |
subset | Indirection vector. Contains indices into objects for elements to be compared. |
dissimilarity | A dissimilarity measure that gives the distance between two T's. Needs to be callable(T, T). |
mat | Output parameter. Dissimiliarity matrix is stored here. |
Definition at line 83 of file dissimilarity.h.
counter_iterator<T> cluster::counter | ( | T & | ref ) |
Adaptor for creating type-inferred counters.
Example Usage:
#include <algorithm> // construct two sets set<int> s1, s2; // insert some things so that their intersection has 2 ints. s1.insert(1); s1.insert(2); s1.insert(3); s2.insert(2); s2.insert(3); s2.insert(4); // Compute intersection, but throw away the values and just // store the size in the count variable. size_t count; set_intersection(s1.begin(), s1.end(), s2.begin(), s2.end(), counter(count)); // now count == 2, since 2 items were inserted // by <code>set_intersection.
void expand | ( | cluster_list & | list, |
size_t | level = 1 |
||
) |
Expand a cluster_list by l levels.
That is, replace each index i in the cluster_list with indices in [2^l * i ... 2^l * (i+1) - 1]
Definition at line 170 of file partition.cpp.
long cluster::get_time_seed | ( | ) | [inline] |
lazy_distance_functor<T,D> cluster::lazy_distance | ( | const std::vector< T > & | objs, |
D | dist | ||
) |
Type-inferred syntactic sugar for constructing lazy_distance_functor.
Definition at line 124 of file dissimilarity.h.
id_pair<T> cluster::make_id_pair | ( | const T & | elt, |
int | id | ||
) |
double mirkin_distance | ( | const cluster_list & | c1, |
const cluster_list & | c2 | ||
) |
Mirkin distance bt/w two clusterings.
Definition at line 125 of file partition.cpp.
double mirkin_distance | ( | const partition & | c1, |
const partition & | c2 | ||
) |
Convenience overload for comparing partition objects directly.
Definition at line 162 of file partition.cpp.
std::ostream& cluster::operator<< | ( | std::ostream & | out, |
const id_pair< T > & | p | ||
) |
std::ostream& cluster::operator<< | ( | std::ostream & | out, |
const partition::member_writer & | mw | ||
) | [inline] |
Definition at line 141 of file partition.h.
std::ostream & operator<< | ( | std::ostream & | out, |
const par_partition & | par | ||
) |
Right now this just uses parition::operator<<() by making a partition with this par_partition's cluster_ids and medoid_ids vectors and outputting it.
Definition at line 110 of file par_partition.cpp.
ostream& cluster::operator<< | ( | ostream & | out, |
const cluster_list & | clusters | ||
) |
Definition at line 113 of file partition.cpp.
std::ostream & operator<< | ( | std::ostream & | out, |
const partition & | p | ||
) |
For convenience.
Definition at line 154 of file partition.cpp.
std::ostream& cluster::operator<< | ( | std::ostream & | out, |
const cluster_list & | list | ||
) |
Prints out nicely formatted clustering.
void cluster::random_subset | ( | size_t | numElements, |
size_t | sample_size, | ||
OutputIterator | out, | ||
Random & | random | ||
) |
This is Knuth's algorithm R for taking a sample of indices from 0 to numElements.
We sample size indices from this (the superset) and put them in the subset's mapping.
numElements | total number of elements to select from |
sample_size | number of elements to select |
out | destination for selected elements |
random | model of STL Random Number Generator. must be callable as random(N), returning a random number in [0,N). |
double cluster::total_dissimilarity | ( | const partition & | p, |
D | dist, | ||
medoid_id | m | ||
) |
Compute the total dissimilarity between all objects in a particular cluster and its medoid.
Definition at line 182 of file partition.h.
double cluster::total_dissimilarity | ( | const partition & | p, |
D | dist | ||
) |
Compute the total dissimilarity between all objects and their medoids.
Definition at line 170 of file partition.h.
double cluster::total_squared_dissimilarity | ( | const partition & | p, |
D | dist, | ||
medoid_id | m | ||
) |
Compute the total squared dissimilarity between all objects in a particular cluster and its medoid.
Definition at line 210 of file partition.h.
double cluster::total_squared_dissimilarity | ( | const partition & | p, |
D | dist | ||
) |
Compute the total squared dissimilarity between all objects and their medoids.
Definition at line 196 of file partition.h.