Namespace for everything in the cluster library. More...

Classes
struct	counter_iterator
	Counting output iterator that records how many times an output iterator was assigned to, but ignores the value stored. More...
struct	matrix_distance
	Adaptor for passing a matrix by reference to template functions that take a callable distance function. More...
struct	lazy_distance_functor
	Functor for computing distance lazily from an object array and a distance metric. More...
struct	id_pair
	MPI-packable struct for an MPI-packable type plus its object id. More...
class	kmedoids
	Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw. More...
class	multi_gather
	Asynchronous, some-to-some gather operation used by parallel clustering algorithms to simultaneously send members of sample sets to a set of distributed worker processes. More...
class	par_kmedoids
	This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More...
struct	par_partition
	par_partition represents a partitioning of a distributed data set. More...
struct	partition
	This represents a partitioning of a data set. More...
struct	trial
	This struct represents parameters for a single trial run of kmedoids. More...
class	trial_generator
	Class to generate a set of trials for clustering. More...
Typedefs
typedef boost::numeric::ublas::symmetric_matrix < double >	dissimilarity_matrix
	Packed repersentation of symmetric dissimilarity matrix.
typedef size_t	medoid_id
	More descriptive type for medoid index.
typedef size_t	object_id
	More descriptive type for object index.
typedef std::vector< std::set < object_id > >	cluster_list
	Explicit representation of a clustering.
Functions
template<typename D >
double	bic (const partition &p, D distance, size_t M)
	Directly computes the BIC from a partition object based on the cluster centroids and the number of clusters.
template<typename SizeIterator , typename DissimIterator >
double	bic (size_t k, SizeIterator cluster_sizes, DissimIterator sum2_dissim, size_t dimensionality)
	This version of the BIC assumes some precomputed information.
template<class T >
counter_iterator< T >	counter (T &ref)
	Adaptor for creating type-inferred counters.
template<class T , class D >
void	build_dissimilarity_matrix (const std::vector< T > &objects, D dissimilarity, dissimilarity_matrix &mat)
	Computes a dissimilarity matrix from a vector of objects.
template<class T , class D >
void	build_dissimilarity_matrix (const std::vector< T > &objects, const std::vector< size_t > &subset, D dissimilarity, dissimilarity_matrix &mat)
	Computes a dissimilarity matrix from a subset of a vector of objects.
template<class T , class D >
lazy_distance_functor< T, D >	lazy_distance (const std::vector< T > &objs, D dist)
	Type-inferred syntactic sugar for constructing lazy_distance_functor.
template<class T >
id_pair< T >	make_id_pair (const T &elt, int id)
	Helper function for making arbitrary id_pairs with type inference.
template<class T >
std::ostream &	operator<< (std::ostream &out, const id_pair< T > &p)
	Print out an id_pair as a tuple of its element and its source rank.
std::ostream &	operator<< (std::ostream &out, const par_partition &par)
	Right now this just uses parition::operator<<() by making a partition with this par_partition's cluster_ids and medoid_ids vectors and outputting it.
ostream &	operator<< (ostream &out, const cluster_list &clusters)
double	mirkin_distance (const cluster_list &c1, const cluster_list &c2)
	Mirkin distance bt/w two clusterings.
std::ostream &	operator<< (std::ostream &out, const partition &km)
	For convenience.
double	mirkin_distance (const partition &c1, const partition &c2)
	Convenience overload for comparing partition objects directly.
void	expand (cluster_list &list, size_t level=1)
	Expand a cluster_list by l levels.
std::ostream &	operator<< (std::ostream &out, const partition::member_writer &mw)
std::ostream &	operator<< (std::ostream &out, const cluster_list &list)
	Prints out nicely formatted clustering.
template<typename D >
double	total_dissimilarity (const partition &p, D dist)
	Compute the total dissimilarity between all objects and their medoids.
template<typename D >
double	total_dissimilarity (const partition &p, D dist, medoid_id m)
	Compute the total dissimilarity between all objects in a particular cluster and its medoid.
template<typename D >
double	total_squared_dissimilarity (const partition &p, D dist)
	Compute the total squared dissimilarity between all objects and their medoids.
template<typename D >
double	total_squared_dissimilarity (const partition &p, D dist, medoid_id m)
	Compute the total squared dissimilarity between all objects in a particular cluster and its medoid.
template<class OutputIterator , class Random >
void	random_subset (size_t numElements, size_t sample_size, OutputIterator out, Random &random)
	This is Knuth's algorithm R for taking a sample of indices from 0 to numElements.
long	get_time_seed ()
	Returns a reasonably distributed seed for random number generators.

Detailed Description

Namespace for everything in the cluster library.

Typedef Documentation

typedef std::vector< std::set<object_id> > cluster_list

Explicit representation of a clustering.

Instead of a vecto of representative ids, this has k sets of object_ids indicating which objects are in a particular cluster. You can convert a partition to a cluster_list with to_cluster_list().

Definition at line 60 of file partition.h.

typedef boost::numeric::ublas::symmetric_matrix<double> dissimilarity_matrix

Packed repersentation of symmetric dissimilarity matrix.

Definition at line 48 of file dissimilarity.h.

typedef size_t medoid_id

More descriptive type for medoid index.

Definition at line 51 of file partition.h.

typedef size_t object_id

More descriptive type for object index.

Definition at line 52 of file partition.h.

Function Documentation

double cluster::bic	(	const partition &	p,
		D	distance,
		size_t	M
	)

Directly computes the BIC from a partition object based on the cluster centroids and the number of clusters.

Parameters:

[in]	p	A partition object describing the clustering to be evaluated.
[in]	distance	A distance function callable on two indices from the partition p.
[in]	M	Dimensionality parameter -- degrees of freedom in the input dataset.

Returns:: A valid BIC score based on the input parameters. Higher numbers indicate better fit.

Definition at line 76 of file bic.h.

double cluster::bic	(	size_t	k,
		SizeIterator	cluster_sizes,
		DissimIterator	sum2_dissim,
		size_t	dimensionality
	)

This version of the BIC assumes some precomputed information.

This is useful for parallel implementations, where it is more efficient to compute some global sums as a reduction rather than aggregating a full partition to one process. Here, we assume that the sizes of the distributed clusters as well as the total squared intra-cluster dissimilarity (between each object and its medoid) is known.

Parameters:

[in]	k	Number of clusters in the clustering. Same as k from k-means or k-medoids.
[in]	cluster_sizes	Start of range of k sizes. `cluster_sizes .. (cluster_sizes + k)` should be the sizes of clusters 1 to k
[in]	sum2_dissim	Sum of squared dissimilarities of each object w.r.t. its nearest medoid.
[in]	dimensionality	Dimensionality of clustered data. e.g., 2 for 2-dimensional points.

Definition at line 123 of file bic.h.

void cluster::build_dissimilarity_matrix	(	const std::vector< T > &	objects,
		D	dissimilarity,
		dissimilarity_matrix &	mat
	)

Computes a dissimilarity matrix from a vector of objects.

Parameters:

[in]	objects	Vector of any type T.
[in]	dissimilarity	A dissimilarity measure that gives the distance between two T's. Needs to be callable on (T, T).
[out]	mat	Output parameter. Dissimiliarity matrix is stored here.

Definition at line 59 of file dissimilarity.h.

void cluster::build_dissimilarity_matrix	(	const std::vector< T > &	objects,
		const std::vector< size_t > &	subset,
		D	dissimilarity,
		dissimilarity_matrix &	mat
	)

Computes a dissimilarity matrix from a subset of a vector of objects.

Parameters:

objects	Vector of any type T.
subset	Indirection vector. Contains indices into objects for elements to be compared.
dissimilarity	A dissimilarity measure that gives the distance between two T's. Needs to be callable(T, T).
mat	Output parameter. Dissimiliarity matrix is stored here.

Definition at line 83 of file dissimilarity.h.

counter_iterator<T> cluster::counter ( T & ref )

Adaptor for creating type-inferred counters.

Example Usage:

 #include <algorithm>
 
 // construct two sets
 set<int> s1, s2;
 
 // insert some things so that their intersection has 2 ints.
 s1.insert(1); s1.insert(2); s1.insert(3);
 s2.insert(2); s2.insert(3); s2.insert(4);
 
 // Compute intersection, but throw away the values and just 
 // store the size in the count variable.
 size_t count;
 set_intersection(s1.begin(), s1.end(), 
                  s2.begin(), s2.end(), counter(count));

 // now count == 2, since 2 items were inserted 
 // by <code>set_intersection.

Definition at line 105 of file counter.h.

void expand	(	cluster_list &	list,
		size_t	level = `1`
	)

Expand a cluster_list by l levels.

That is, replace each index i in the cluster_list with indices in [2^l * i ... 2^l * (i+1) - 1]

Definition at line 170 of file partition.cpp.

long cluster::get_time_seed ( ) [inline]

Returns a reasonably distributed seed for random number generators.

Based on the product of the seconds and usec in gettimeofday().

Definition at line 78 of file random.h.

lazy_distance_functor<T,D> cluster::lazy_distance	(	const std::vector< T > &	objs,
		D	dist
	)

Type-inferred syntactic sugar for constructing lazy_distance_functor.

Definition at line 124 of file dissimilarity.h.

id_pair<T> cluster::make_id_pair	(	const T &	elt,
		int	id
	)

Helper function for making arbitrary id_pairs with type inference.

Definition at line 86 of file id_pair.h.

double mirkin_distance	(	const cluster_list &	c1,
		const cluster_list &	c2
	)

Mirkin distance bt/w two clusterings.

Definition at line 125 of file partition.cpp.

double mirkin_distance	(	const partition &	c1,
		const partition &	c2
	)

Convenience overload for comparing partition objects directly.

Definition at line 162 of file partition.cpp.

std::ostream& cluster::operator<<	(	std::ostream &	out,
		const id_pair< T > &	p
	)

Print out an id_pair as a tuple of its element and its source rank.

Template Parameters:

T	inferred from the id_pair<T> this is called on. Must support operator<<.

Parameters:

out	Output stream to write the id_pair p onto
p	An id_pair<T>. T must support operator<<.

Definition at line 99 of file id_pair.h.

std::ostream& cluster::operator<<	(	std::ostream &	out,
		const partition::member_writer &	mw
	)		`[inline]`

Definition at line 141 of file partition.h.

std::ostream & operator<<	(	std::ostream &	out,
		const par_partition &	par
	)

Right now this just uses parition::operator<<() by making a partition with this par_partition's cluster_ids and medoid_ids vectors and outputting it.

Definition at line 110 of file par_partition.cpp.

ostream& cluster::operator<<	(	ostream &	out,
		const cluster_list &	clusters
	)

Definition at line 113 of file partition.cpp.

std::ostream & operator<<	(	std::ostream &	out,
		const partition &	p
	)

For convenience.

Definition at line 154 of file partition.cpp.

std::ostream& cluster::operator<<	(	std::ostream &	out,
		const cluster_list &	list
	)

Prints out nicely formatted clustering.

void cluster::random_subset	(	size_t	numElements,
		size_t	sample_size,
		OutputIterator	out,
		Random &	random
	)

This is Knuth's algorithm R for taking a sample of indices from 0 to numElements.

We sample size indices from this (the superset) and put them in the subset's mapping.

Parameters:

numElements	total number of elements to select from
sample_size	number of elements to select
out	destination for selected elements
random	model of STL Random Number Generator. must be callable as random(N), returning a random number in [0,N).

Definition at line 57 of file random.h.

double cluster::total_dissimilarity	(	const partition &	p,
		D	dist,
		medoid_id	m
	)

Compute the total dissimilarity between all objects in a particular cluster and its medoid.

Definition at line 182 of file partition.h.

double cluster::total_dissimilarity	(	const partition &	p,
		D	dist
	)

Compute the total dissimilarity between all objects and their medoids.

Definition at line 170 of file partition.h.

double cluster::total_squared_dissimilarity	(	const partition &	p,
		D	dist,
		medoid_id	m
	)

Compute the total squared dissimilarity between all objects in a particular cluster and its medoid.

Definition at line 210 of file partition.h.

double cluster::total_squared_dissimilarity	(	const partition &	p,
		D	dist
	)

Compute the total squared dissimilarity between all objects and their medoids.

Definition at line 196 of file partition.h.

cluster Namespace Reference

Classes

Typedefs

Functions

Detailed Description

Typedef Documentation

Function Documentation