Classes | Typedefs | Functions

cluster Namespace Reference

Namespace for everything in the cluster library. More...

Classes

struct  counter_iterator
 Counting output iterator that records how many times an output iterator was assigned to, but ignores the value stored. More...
struct  matrix_distance
 Adaptor for passing a matrix by reference to template functions that take a callable distance function. More...
struct  lazy_distance_functor
 Functor for computing distance lazily from an object array and a distance metric. More...
struct  id_pair
 MPI-packable struct for an MPI-packable type plus its object id. More...
class  kmedoids
 Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw. More...
class  multi_gather
 Asynchronous, some-to-some gather operation used by parallel clustering algorithms to simultaneously send members of sample sets to a set of distributed worker processes. More...
class  par_kmedoids
 This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More...
struct  par_partition
 par_partition represents a partitioning of a distributed data set. More...
struct  partition
 This represents a partitioning of a data set. More...
struct  trial
 This struct represents parameters for a single trial run of kmedoids. More...
class  trial_generator
 Class to generate a set of trials for clustering. More...

Typedefs

typedef
boost::numeric::ublas::symmetric_matrix
< double > 
dissimilarity_matrix
 Packed repersentation of symmetric dissimilarity matrix.
typedef size_t medoid_id
 More descriptive type for medoid index.
typedef size_t object_id
 More descriptive type for object index.
typedef std::vector< std::set
< object_id > > 
cluster_list
 Explicit representation of a clustering.

Functions

template<typename D >
double bic (const partition &p, D distance, size_t M)
 Directly computes the BIC from a partition object based on the cluster centroids and the number of clusters.
template<typename SizeIterator , typename DissimIterator >
double bic (size_t k, SizeIterator cluster_sizes, DissimIterator sum2_dissim, size_t dimensionality)
 This version of the BIC assumes some precomputed information.
template<class T >
counter_iterator< T > counter (T &ref)
 Adaptor for creating type-inferred counters.
template<class T , class D >
void build_dissimilarity_matrix (const std::vector< T > &objects, D dissimilarity, dissimilarity_matrix &mat)
 Computes a dissimilarity matrix from a vector of objects.
template<class T , class D >
void build_dissimilarity_matrix (const std::vector< T > &objects, const std::vector< size_t > &subset, D dissimilarity, dissimilarity_matrix &mat)
 Computes a dissimilarity matrix from a subset of a vector of objects.
template<class T , class D >
lazy_distance_functor< T, D > lazy_distance (const std::vector< T > &objs, D dist)
 Type-inferred syntactic sugar for constructing lazy_distance_functor.
template<class T >
id_pair< T > make_id_pair (const T &elt, int id)
 Helper function for making arbitrary id_pairs with type inference.
template<class T >
std::ostream & operator<< (std::ostream &out, const id_pair< T > &p)
 Print out an id_pair as a tuple of its element and its source rank.
std::ostream & operator<< (std::ostream &out, const par_partition &par)
 Right now this just uses parition::operator<<() by making a partition with this par_partition's cluster_ids and medoid_ids vectors and outputting it.
ostream & operator<< (ostream &out, const cluster_list &clusters)
double mirkin_distance (const cluster_list &c1, const cluster_list &c2)
 Mirkin distance bt/w two clusterings.
std::ostream & operator<< (std::ostream &out, const partition &km)
 For convenience.
double mirkin_distance (const partition &c1, const partition &c2)
 Convenience overload for comparing partition objects directly.
void expand (cluster_list &list, size_t level=1)
 Expand a cluster_list by l levels.
std::ostream & operator<< (std::ostream &out, const partition::member_writer &mw)
std::ostream & operator<< (std::ostream &out, const cluster_list &list)
 Prints out nicely formatted clustering.
template<typename D >
double total_dissimilarity (const partition &p, D dist)
 Compute the total dissimilarity between all objects and their medoids.
template<typename D >
double total_dissimilarity (const partition &p, D dist, medoid_id m)
 Compute the total dissimilarity between all objects in a particular cluster and its medoid.
template<typename D >
double total_squared_dissimilarity (const partition &p, D dist)
 Compute the total squared dissimilarity between all objects and their medoids.
template<typename D >
double total_squared_dissimilarity (const partition &p, D dist, medoid_id m)
 Compute the total squared dissimilarity between all objects in a particular cluster and its medoid.
template<class OutputIterator , class Random >
void random_subset (size_t numElements, size_t sample_size, OutputIterator out, Random &random)
 This is Knuth's algorithm R for taking a sample of indices from 0 to numElements.
long get_time_seed ()
 Returns a reasonably distributed seed for random number generators.

Detailed Description

Namespace for everything in the cluster library.


Typedef Documentation

typedef std::vector< std::set<object_id> > cluster_list

Explicit representation of a clustering.

Instead of a vecto of representative ids, this has k sets of object_ids indicating which objects are in a particular cluster. You can convert a partition to a cluster_list with to_cluster_list().

Definition at line 60 of file partition.h.

typedef boost::numeric::ublas::symmetric_matrix<double> dissimilarity_matrix

Packed repersentation of symmetric dissimilarity matrix.

Definition at line 48 of file dissimilarity.h.

typedef size_t medoid_id

More descriptive type for medoid index.

Definition at line 51 of file partition.h.

typedef size_t object_id

More descriptive type for object index.

Definition at line 52 of file partition.h.


Function Documentation

double cluster::bic ( const partition &  p,
distance,
size_t  M 
)

Directly computes the BIC from a partition object based on the cluster centroids and the number of clusters.

Parameters:
[in]pA partition object describing the clustering to be evaluated.
[in]distanceA distance function callable on two indices from the partition p.
[in]MDimensionality parameter -- degrees of freedom in the input dataset.
Returns:
A valid BIC score based on the input parameters. Higher numbers indicate better fit.

Definition at line 76 of file bic.h.

double cluster::bic ( size_t  k,
SizeIterator  cluster_sizes,
DissimIterator  sum2_dissim,
size_t  dimensionality 
)

This version of the BIC assumes some precomputed information.

This is useful for parallel implementations, where it is more efficient to compute some global sums as a reduction rather than aggregating a full partition to one process. Here, we assume that the sizes of the distributed clusters as well as the total squared intra-cluster dissimilarity (between each object and its medoid) is known.

Parameters:
[in]kNumber of clusters in the clustering. Same as k from k-means or k-medoids.
[in]cluster_sizesStart of range of k sizes. *cluster_sizes .. *(cluster_sizes + k) should be the sizes of clusters 1 to k
[in]sum2_dissimSum of squared dissimilarities of each object w.r.t. its nearest medoid.
[in]dimensionalityDimensionality of clustered data. e.g., 2 for 2-dimensional points.

Definition at line 123 of file bic.h.

void cluster::build_dissimilarity_matrix ( const std::vector< T > &  objects,
dissimilarity,
dissimilarity_matrix &  mat 
)

Computes a dissimilarity matrix from a vector of objects.

Parameters:
[in]objectsVector of any type T.
[in]dissimilarityA dissimilarity measure that gives the distance between two T's. Needs to be callable on (T, T).
[out]matOutput parameter. Dissimiliarity matrix is stored here.

Definition at line 59 of file dissimilarity.h.

void cluster::build_dissimilarity_matrix ( const std::vector< T > &  objects,
const std::vector< size_t > &  subset,
dissimilarity,
dissimilarity_matrix &  mat 
)

Computes a dissimilarity matrix from a subset of a vector of objects.

Parameters:
objectsVector of any type T.
subsetIndirection vector. Contains indices into objects for elements to be compared.
dissimilarityA dissimilarity measure that gives the distance between two T's. Needs to be callable(T, T).
matOutput parameter. Dissimiliarity matrix is stored here.

Definition at line 83 of file dissimilarity.h.

counter_iterator<T> cluster::counter ( T &  ref )

Adaptor for creating type-inferred counters.

Example Usage:

 #include <algorithm>
 
 // construct two sets
 set<int> s1, s2;
 
 // insert some things so that their intersection has 2 ints.
 s1.insert(1); s1.insert(2); s1.insert(3);
 s2.insert(2); s2.insert(3); s2.insert(4);
 
 // Compute intersection, but throw away the values and just 
 // store the size in the count variable.
 size_t count;
 set_intersection(s1.begin(), s1.end(), 
                  s2.begin(), s2.end(), counter(count));

 // now count == 2, since 2 items were inserted 
 // by <code>set_intersection.

Definition at line 105 of file counter.h.

void expand ( cluster_list &  list,
size_t  level = 1 
)

Expand a cluster_list by l levels.

That is, replace each index i in the cluster_list with indices in [2^l * i ... 2^l * (i+1) - 1]

Definition at line 170 of file partition.cpp.

long cluster::get_time_seed (  ) [inline]

Returns a reasonably distributed seed for random number generators.

Based on the product of the seconds and usec in gettimeofday().

Definition at line 78 of file random.h.

lazy_distance_functor<T,D> cluster::lazy_distance ( const std::vector< T > &  objs,
dist 
)

Type-inferred syntactic sugar for constructing lazy_distance_functor.

Definition at line 124 of file dissimilarity.h.

id_pair<T> cluster::make_id_pair ( const T &  elt,
int  id 
)

Helper function for making arbitrary id_pairs with type inference.

Definition at line 86 of file id_pair.h.

double mirkin_distance ( const cluster_list &  c1,
const cluster_list &  c2 
)

Mirkin distance bt/w two clusterings.

Definition at line 125 of file partition.cpp.

double mirkin_distance ( const partition &  c1,
const partition &  c2 
)

Convenience overload for comparing partition objects directly.

Definition at line 162 of file partition.cpp.

std::ostream& cluster::operator<< ( std::ostream &  out,
const id_pair< T > &  p 
)

Print out an id_pair as a tuple of its element and its source rank.

Template Parameters:
Tinferred from the id_pair<T> this is called on. Must support operator<<.
Parameters:
outOutput stream to write the id_pair p onto
pAn id_pair<T>. T must support operator<<.

Definition at line 99 of file id_pair.h.

std::ostream& cluster::operator<< ( std::ostream &  out,
const partition::member_writer &  mw 
) [inline]

Definition at line 141 of file partition.h.

std::ostream & operator<< ( std::ostream &  out,
const par_partition &  par 
)

Right now this just uses parition::operator<<() by making a partition with this par_partition's cluster_ids and medoid_ids vectors and outputting it.

Definition at line 110 of file par_partition.cpp.

ostream& cluster::operator<< ( ostream &  out,
const cluster_list &  clusters 
)

Definition at line 113 of file partition.cpp.

std::ostream & operator<< ( std::ostream &  out,
const partition &  p 
)

For convenience.

Definition at line 154 of file partition.cpp.

std::ostream& cluster::operator<< ( std::ostream &  out,
const cluster_list &  list 
)

Prints out nicely formatted clustering.

void cluster::random_subset ( size_t  numElements,
size_t  sample_size,
OutputIterator  out,
Random &  random 
)

This is Knuth's algorithm R for taking a sample of indices from 0 to numElements.

We sample size indices from this (the superset) and put them in the subset's mapping.

Parameters:
numElementstotal number of elements to select from
sample_sizenumber of elements to select
outdestination for selected elements
randommodel of STL Random Number Generator. must be callable as random(N), returning a random number in [0,N).

Definition at line 57 of file random.h.

double cluster::total_dissimilarity ( const partition &  p,
dist,
medoid_id  m 
)

Compute the total dissimilarity between all objects in a particular cluster and its medoid.

Definition at line 182 of file partition.h.

double cluster::total_dissimilarity ( const partition &  p,
dist 
)

Compute the total dissimilarity between all objects and their medoids.

Definition at line 170 of file partition.h.

double cluster::total_squared_dissimilarity ( const partition &  p,
dist,
medoid_id  m 
)

Compute the total squared dissimilarity between all objects in a particular cluster and its medoid.

Definition at line 210 of file partition.h.

double cluster::total_squared_dissimilarity ( const partition &  p,
dist 
)

Compute the total squared dissimilarity between all objects and their medoids.

Definition at line 196 of file partition.h.

Muster. Copyright © 2010, Lawrence Livermore National Laboratory, LLNL-CODE-433662.
Distribution of Muster and its documentation is subject to terms of the Muster LICENSE.
Generated on Mon Dec 20 2010 using Doxygen 1.7.2