**3. SURVEY OF CLUSTERING-VALIDATION TECHNIQUES **

The data-mining literature provides a range of different validation techniques, with the main line of distinction drawn between external and internal validation measures (Halkidi *et al*., 2001). These two groups of techniques differ fundamentally in focus, and find application in distinct experimental settings. External validation measures comprise all those methods that evaluate a clustering result based on knowledge of the correct class labels. This is useful because it permits an entirely objective evaluation and comparison of clustering algorithms on benchmark data, for which the class labels are known to correspond to the true cluster structure. In cases where no class labelling is available, or the available labels are dubious, an evaluation based on internal validation measures becomes appropriate. Internal validation techniques do not use additional knowledge in the form of class labels, but base their quality estimate on the information intrinsic to the data alone. Specifically, they attempt to measure how well a given partitioning corresponds to the natural cluster structure of the data.

The following survey of validation techniques is limited to those for crisp partitionings, that is, partitions in which each data item is assigned exactly one label (cf., for example, Pal and Bezdek, 1995 for more information regarding the evaluation of fuzzy partitionings). Mathematical definitions for selected validation techniques are provided in the Supplementary Material.

**3.1 External measures**

**3.1.1 Type 1: Unary measures**

Standard external evaluation measures take a single clustering result as the input, and compare it with a known set of class labels (the ‘ground truth’ or ‘gold standard’) to assess the degree of consensus between the two. Traditionally, the gold standard would be complete and unique, in the sense that exactly one class label is provided for every data item, and that the label is unequivocally defined. A partitioning can then be evaluated with regard both to the purity of individual clusters and to the completeness of clusters. Here, purity denotes the fraction of the cluster taken up by its predominant class label, whereas completeness denotes the fraction of items in this predominant class that is grouped in the cluster at hand. Clearly, each of these aspects on its own provides only limited information, and trivial solutions exist for both: a partitioning consisting of singleton clusters scores maximally under purity, and a one-cluster solution scores maximally under completeness. In order to obtain an objective assessment of a partition's accordance with the gold standard, it is therefore important to take both purity and completeness into account. Comprehensive measures like the *F*-measure (see Supplementary Material) (van Rijsbergen, 1979) provide a principled way to evaluate both and are therefore preferable over simpler techniques.
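
To make these notions concrete, the following minimal Python sketch (illustrative only, not taken from the cited work) computes purity, completeness and their harmonic mean for a single cluster; the full *F*-measure of van Rijsbergen (1979) aggregates such per-cluster scores over the whole partitioning.

```python
from collections import Counter

def purity_and_completeness(cluster, labels):
    # cluster: indices of the items in one cluster
    # labels: gold-standard class label of every item in the dataset
    counts = Counter(labels[i] for i in cluster)
    majority, hits = counts.most_common(1)[0]
    purity = hits / len(cluster)                 # fraction of cluster in majority class
    completeness = hits / labels.count(majority) # fraction of majority class in cluster
    return purity, completeness

def f_score(purity, completeness):
    # harmonic mean of purity (precision) and completeness (recall)
    if purity + completeness == 0:
        return 0.0
    return 2 * purity * completeness / (purity + completeness)
```

Note how the trivial solutions behave: a singleton cluster always reaches purity 1.0, while its completeness can be arbitrarily low; only the combined score penalizes both degenerate cases.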

Note that techniques like the *F*-measure provide a means to assess the quality of a clustering result at the level of the entire partitioning, and not for individual clusters only. In principle, such measures can also be adapted for use with a ‘partial labelling’ (i.e. for use in a setting where only incomplete labelling information is available—such as functional annotation for a fraction of genes in a microarray experiment) by applying the measure to the labelled data and their respective cluster assignments only. This can provide a more comprehensive way of assessing clustering quality than does the computation of corrected significance levels of the ‘enrichment’ (Gat-Viks *et al*., 2003; Tavazoie *et al*., 1999; Toronen, 2004) of individual selected clusters.

**3.1.2 Type 2: Binary measures**

In addition to measures based on purity and completeness, the data-mining literature also provides a number of indices that assess the consensus between a partitioning and the gold standard based on the contingency table of the pairwise assignment of data items. Most of these indices are symmetric, and are therefore equally well suited for use as binary measures, that is, for assessing the similarity of two different clustering results.

Probably the best known such index is the Rand Index (see Supplementary Material) (Rand, 1971), which determines the similarity between two partitions as a function of positive and negative agreements in pairwise cluster assignments. A number of variations of the Rand Index exist, in particular the adjusted Rand Index (Hubert, 1985), which introduces a statistically induced normalization in order to yield values close to zero for random partitions. Another related index is the Jaccard coefficient (Jaccard, 1908), which applies a somewhat stricter definition of correspondence in which only positive agreements are rewarded. Note that not all indices based on contingency tables are symmetric. The Minkowski Score (see Supplementary Material) (Jardine and Sibson, 1971), for example, is asymmetric [i.e. *M*(*U*,*V*) ≠ *M*(*V*,*U*) for two partitionings *U* and *V*] and therefore less suited for assessing the similarity between clustering results.
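
These pairwise-agreement indices can be illustrated with a short Python sketch (illustrative only): counting the four entries of the pairwise contingency table, i.e. pairs grouped together in both partitions (a), in the first only (b), in the second only (c), or in neither (d), directly yields the Rand Index and the Jaccard coefficient.

```python
from itertools import combinations

def pair_counts(u, v):
    # u, v: flat cluster-label sequences over the same set of items
    a = b = c = d = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        if same_u and same_v:
            a += 1          # positive agreement
        elif same_u:
            b += 1          # grouped in u only
        elif same_v:
            c += 1          # grouped in v only
        else:
            d += 1          # negative agreement
    return a, b, c, d

def rand_index(u, v):
    # rewards both positive and negative agreements
    a, b, c, d = pair_counts(u, v)
    return (a + d) / (a + b + c + d)

def jaccard(u, v):
    # stricter: only positive agreements count
    a, b, c, _ = pair_counts(u, v)
    return a / (a + b + c)
```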

**3.2 Internal measures**

Internal measures take a clustering and the underlying dataset as the input, and use information intrinsic to the data to assess the quality of the clustering. Using the same categorization as for clustering methods (see Section 2.1), the first three types of internal measures can be grouped according to the particular notion of clustering quality that they employ.

**3.2.1 Type 1: Compactness**

A first group comprises validation measures assessing cluster compactness or homogeneity, with intra-cluster variance (see Supplementary Material) (also known as the sum-of-squared-errors or minimum-variance criterion, the measure locally optimized by the *k*-means algorithm) as their most popular representative. Numerous variants of measuring intra-cluster homogeneity are possible, such as the assessment of average or maximum pairwise intra-cluster distances, average or maximum centroid-based similarities, or the use of graph-based approaches (Bezdek and Pal, 1998).
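
As an illustration, the intra-cluster variance can be computed as the summed squared distance of each item to its cluster centroid. The following Python sketch (illustrative only) assumes data items given as numeric tuples.

```python
def intra_cluster_variance(data, labels):
    # sum of squared distances from each point to its cluster centroid (lower is better)
    clusters = {}
    for point, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(point)
    total = 0.0
    for points in clusters.values():
        dim = len(points[0])
        centroid = [sum(p[k] for p in points) / len(points) for k in range(dim)]
        total += sum(sum((p[k] - centroid[k]) ** 2 for k in range(dim))
                     for p in points)
    return total
```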

**3.2.2 Type 2: Connectedness**

The second type of internal validation technique attempts to assess how well a given partitioning agrees with the concept of connectedness, i.e. to what degree a partitioning observes local densities and groups data items together with their nearest neighbours in the data space. Representatives include *k*-nearest neighbour consistency (Ding and He, 2004) and connectivity (see Supplementary Material) (Handl and Knowles, 2005), which both count violations of nearest neighbour relationships.
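
The connectivity measure can be sketched as follows (an illustrative Python re-implementation in the spirit of Handl and Knowles, 2005, not the authors' code): each of an item's *j*-th nearest neighbours that is placed in a different cluster incurs a penalty of 1/*j*, so lower values indicate better-connected partitions.

```python
def connectivity(data, labels, n_neighbours=3):
    # penalise, with weight 1/j, every j-th nearest neighbour
    # that is assigned to a different cluster than the item itself
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    total = 0.0
    for i, p in enumerate(data):
        # indices of all other items, ordered by distance to item i
        order = sorted((k for k in range(len(data)) if k != i),
                       key=lambda k: dist2(p, data[k]))
        for j, k in enumerate(order[:n_neighbours], start=1):
            if labels[k] != labels[i]:
                total += 1.0 / j
    return total  # 0.0 means every item shares its cluster with its neighbours
```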

**3.2.3 Type 3: Separation**

The third group includes all those measures that quantify the degree of separation between individual clusters. For example, an overall rating for a partitioning can be defined as the average weighted inter-cluster distance, where the distance between individual clusters can be computed as the distance between cluster centroids, or as the minimum distance between data items belonging to different clusters. Alternatively, cluster separation may be assessed as the minimum separation observed between any pair of clusters in the partitioning.
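
For illustration, the minimum-distance variant of separation can be sketched in a few lines of Python (illustrative only; higher values indicate better-separated clusters):

```python
import math
from itertools import combinations

def min_separation(data, labels):
    # smallest distance between two data items belonging to different clusters
    best = math.inf
    for (p, lp), (q, lq) in combinations(list(zip(data, labels)), 2):
        if lp != lq:
            best = min(best, math.dist(p, q))
    return best
```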

**3.2.4 Type 4: Combinations**

The literature provides a number of enhanced approaches that combine measures of the above different types. In this respect, combinations of type one and type three are particularly popular, as the two classes of measures exhibit opposing trends: while intra-cluster homogeneity improves with an increasing number of clusters, the distance between clusters tends to deteriorate. Several techniques therefore assess both intra-cluster homogeneity and inter-cluster separation, and compute a final score as the linear or non-linear combination of the two measures. An example of a linear combination is the SD-validity Index (see Supplementary Material) (Halkidi *et al*., 2001);^{2} well-known examples of non-linear combinations are the Dunn Index (see Supplementary Material) (Dunn, 1974), Dunn-like Indices (Bezdek and Pal, 1998), the Davies–Bouldin Index (Davies and Bouldin, 1979) or the Silhouette Width (see Supplementary Material) (Rousseeuw, 1987).
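
As one concrete example of a non-linear combination, the Silhouette Width can be sketched in plain Python (an illustrative re-implementation, not reference code; it assumes at least two clusters and no degenerate single-cluster input):

```python
import math

def silhouette_width(data, labels):
    # mean over items of (b - a) / max(a, b), where a is the mean distance to the
    # item's own cluster and b the mean distance to the closest other cluster;
    # values near +1 indicate compact, well-separated clusters
    scores = []
    for i, (p, lab) in enumerate(zip(data, labels)):
        by_cluster = {}
        for j, (q, lq) in enumerate(zip(data, labels)):
            if j != i:
                by_cluster.setdefault(lq, []).append(math.dist(p, q))
        own = by_cluster.get(lab)
        if not own:                 # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(ds) / len(ds) for lq, ds in by_cluster.items() if lq != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```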

While the above methods are relatively popular, the linear or non-linear combination of the measures inevitably results in a certain information loss (Fig. 3), and can therefore lead to incorrect conclusions. An alternative and more principled way of evaluating *N* measures simultaneously is the evaluation of the resulting *N*-tuples with respect to Pareto optimality (Pareto, 1971): a clustering result is judged to dominate (be superior to) another partitioning, if it is equal or better under all measures, and is strictly better under at least one measure. A recent study using Pareto optimality for clustering and validation with regard to a type one and a type two measure can be found in Handl and Knowles, 2005.
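
The dominance relation underlying Pareto optimality is straightforward to state in code. The sketch below (illustrative only) treats each clustering result as a tuple of validation scores, with a per-measure flag indicating whether that measure is to be maximised or minimised:

```python
def dominates(u, v, maximise=None):
    # u, v: tuples of N validation scores for two partitionings;
    # maximise[k] is True if measure k is to be maximised, False if minimised
    if maximise is None:
        maximise = [True] * len(u)
    at_least_as_good = all((a >= b) if mx else (a <= b)
                           for a, b, mx in zip(u, v, maximise))
    strictly_better = any((a > b) if mx else (a < b)
                          for a, b, mx in zip(u, v, maximise))
    return at_least_as_good and strictly_better
```

A partitioning that is not dominated by any other candidate lies on the Pareto front; no aggregation weights are needed, which is precisely how the information loss of linear or non-linear combinations is avoided.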

**3.2.5 Type 5: Predictive power/stability**

Validation techniques assessing the predictive power or stability of a partitioning form a special class of internal validation measures. They are clearly not external, since they do not make use of label information. However, they are quite different from traditional internal measures in that their use requires additional access to the clustering algorithm. Measures of this type repeatedly re-sample or perturb the original dataset, and re-cluster the resulting data. The consistency of the corresponding results provides an estimate of the significance of the clusters obtained from the original dataset.

The methods described in (Ben-Hur *et al*., 2002; Bittner *et al*., 2000; Breckenridge, 1989; Fridlyand and Dudoit, 2001; Kerr and Churchill, 2001; Lange *et al*., 2004; Levine and Domany, 2001; Li and Wong, 2001; McShane *et al*., 2002; Tibshirani *et al*., 2001a) employ the concept of self-consistency, that is, the idea that a clustering algorithm should produce consistent results when applied to data sampled from the same source. In order to assess the degree of stability of a partitioning, several papers (Ben-Hur *et al*., 2002; Levine and Domany, 2001) repeatedly draw overlapping subsamples of the same dataset (the individual subsamples are drawn without replacement). Each subsample is clustered individually, and the resulting partitions are compared by applying an external validation index to the partial partitions obtained for the overlapping shared set of points. A slightly different approach has been taken in (Breckenridge, 1989; Fridlyand and Dudoit, 2001; Lange *et al*., 2004; Tibshirani *et al*., 2001a). Here, the data are split repeatedly into a training and a test set (typically of equal size and with no overlap), and both sets are clustered. The partitioning on the training set is then employed to derive a classifier that predicts class labels for the test set. The disagreement between this prediction and the partitioning on the test set can then be computed using an external binary validation index. Obviously, the classifier used for prediction has a significant impact on the performance of this method and should comply with the modelling assumptions made by the clustering algorithm. Lange *et al*. (2004) recommend the use of a nearest-neighbour classifier for single link, and of centroid-based classifiers for algorithms such as *k*-means that assume spherically shaped clusters.
Finally, the stability of a clustering result can also be assessed by comparing the partitions obtained for perturbed data (Bittner *et al*., 2000; Kerr and Churchill, 2001; Li and Wong, 2001). For this purpose, a number of bootstrap datasets are generated from the original data: using a simple error model (Bittner *et al*., 2000) or more advanced methods such as ANOVA (Kerr and Churchill, 2001), a noise component is added to each data item. The resulting datasets (in which data items are slightly perturbed with respect to their original position) are subjected to a cluster analysis. The partitions obtained can then be directly compared using external binary indices (i.e. by comparing the cluster assignments for data vectors derived from the same original data point).

**3.2.6 Type 6: Compliance between a partitioning and distance information**

An alternative way of assessing clustering quality is to estimate directly the degree to which distance information in the original data is preserved in a partitioning. For this purpose, a partitioning is represented by means of its cophenetic matrix *C* (Romesburg, 1984), where *C* is a symmetric matrix of size *N* × *N* and *N* is the size of the dataset. In a crisp partitioning, the cophenetic matrix contains only zeros and ones, with each entry *C*(*i*,*j*) indicating whether the two elements *i* and *j* have been assigned to the same cluster or not. For the evaluation of a hierarchical clustering, the cophenetic matrix can also be constructed to reflect the structure of the dendrogram. Here, an entry *C*(*i*,*j*) represents the level within the dendrogram at which the two data items *i* and *j* are first assigned to the same cluster.

The cophenetic matrix can then be compared with the original dissimilarity matrix using Hubert's Γ Statistic (essentially the dot product between the two matrices), the Normalized Γ Statistic, or a measure of correlation such as the Pearson correlation (Edwards, 1967) (in cases where the prime emphasis is on the preservation of absolute distance values) or the Spearman rank correlation (Lehmann and D'Abrera, 1998) (in cases where the prime emphasis is on the preservation of distance orderings). The correlation between the two matrices is commonly referred to as cophenetic correlation, matrix correlation or standardized Mantel Statistic (Halkidi *et al*., 2001). As an aside, cophenetic correlation can also be used as a binary index to assess the preservation of distances under different distance functions and within different feature spaces, or to compare the dendrograms obtained for different algorithms.
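
For a crisp partitioning, the construction of the cophenetic matrix and its comparison with a dissimilarity matrix can be sketched as follows (illustrative Python only). Note the sign convention: correlating co-membership, where 1 means ‘same cluster’, against *dissimilarities* yields strongly negative values for a good partitioning.

```python
def cophenetic_matrix(labels):
    # C[i][j] = 1 if items i and j share a cluster, else 0 (crisp partitioning)
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def matrix_correlation(C, D):
    # Pearson correlation over the upper triangles of two symmetric matrices
    n = len(C)
    xs = [C[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [D[i][j] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```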

**3.2.7 Type 7: Specialized measures for highly correlated data**

This last category of internal validation measures includes a number of techniques that explicitly exploit redundancies and correlations, such as those inherent to post-genomic data. The first of these, the figure of merit (Yeung *et al*., 2001a), is motivated by the jackknife (Efron and Tibshirani, 1993) approach. For a dataset with *D* features, the figure of merit of Yeung *et al*. (2001a) requires the computation of *D* partitions, each of them based on only *D* – 1 of the *D* features. For each partitioning, its figure of merit is then computed as the average intra-cluster variance within the unused feature, and the aggregation of these values provides an estimate of the overall performance of the algorithm. Datta and Datta (2003) extend this approach to the computation of a figure of merit by means of different internal validity indices—specifically, one measure of pairwise co-assignment, one of cluster separation and one of cluster compactness.
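
The leave-one-feature-out computation can be sketched as follows. This is an illustrative Python version in the spirit of the figure of merit, not the authors' implementation; `cluster_fn` is a placeholder for the clustering algorithm under evaluation, and the score per left-out feature is the root mean intra-cluster deviation in that feature.

```python
def figure_of_merit(data, cluster_fn):
    # data: list of D-dimensional tuples; cluster_fn: items -> cluster labels.
    # For each feature d, cluster on the remaining D-1 features and measure
    # the intra-cluster scatter in the left-out feature d; lower is better.
    n, dims = len(data), len(data[0])
    total = 0.0
    for d in range(dims):
        reduced = [tuple(v for k, v in enumerate(row) if k != d) for row in data]
        labels = cluster_fn(reduced)
        clusters = {}
        for i, lab in enumerate(labels):
            clusters.setdefault(lab, []).append(data[i][d])
        sse = 0.0
        for vals in clusters.values():
            mean = sum(vals) / len(vals)
            sse += sum((v - mean) ** 2 for v in vals)
        total += (sse / n) ** 0.5
    return total
```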

The second approach, overabundance analysis (Ben-Dor *et al*., 2002; Bittner *et al*., 2000), assesses the frequency of discriminatory variables for a given partitioning, that is, it identifies those variables that show significant differences between the identified clusters. The observed frequencies are compared with a null model to assess the significance of the partitioning.

Clearly, both of the above approaches are only applicable to datasets with correlated (dependent) variables, but this is likely to be true for most types of post-genomic data (such as gene expression data and protein and metabolome profiles).

**3.3 Number of clusters**

Most of the internal measures discussed above can be used to estimate the number of clusters in a dataset. This usually involves the computation of clustering results for a range of different numbers of clusters, and a subsequent plot of the performance under the internal measure as a function of the number of clusters. If both the clustering algorithm employed and the internal measure are adequate for the dataset under consideration, the best number of clusters can often be identified as a ‘knee’ in the resulting performance curve; see, e.g., the Gap Statistic (Tibshirani *et al*., 2001b) for a formalized approach. This type of use in model selection has been the most common application of internal validation measures in bioinformatics (Bolshakova and Azuaje, 2003; Bolshakova *et al*., 2005; Fridlyand and Dudoit, 2001; Lange *et al*., 2004; Tibshirani *et al*., 2001b), yet their broader applicability to the validation of cluster quality has frequently been neglected.
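
A crude version of such a knee heuristic can be sketched in a few lines (illustrative only; the Gap Statistic is considerably more principled): for an error-like curve that decreases with the number of clusters, pick the point at which the improvement drops most sharply, i.e. the maximum second difference.

```python
def knee_point(ks, scores):
    # ks: candidate numbers of clusters (ascending);
    # scores: an error-like internal measure (e.g. intra-cluster variance)
    # that decreases with k. Returns the k with the largest drop in improvement.
    best_k, best_gain = None, float('-inf')
    for i in range(1, len(ks) - 1):
        gain = (scores[i - 1] - scores[i]) - (scores[i] - scores[i + 1])
        if gain > best_gain:
            best_k, best_gain = ks[i], gain
    return best_k
```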

**3.4 Statistical tests of clustering tendency**

The previous sections have been concerned with the validation of a clustering result obtained on a given dataset. However, when in doubt about the quality of a raw dataset, it may be useful to apply statistical tests to examine the clustering tendency of the data prior to conducting a cluster analysis (McShane *et al*., 2002). For this purpose, the distribution of nearest neighbour distances can be examined, and compared with that under a suitable null model. However, the sparseness of data in high dimensions may lead to instabilities, and it is therefore recommended to apply this type of test to low-dimensional data only. An example is provided by McShane *et al*. (2002) where the data are subjected to a principal components analysis and projected to the first three principal components. The principal components can further be used to generate an appropriate null model, in the above case a three-dimensional Gaussian distribution with means and standard deviations in each dimension estimated from the original data.
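
Such a test can be sketched as a small Monte Carlo procedure (illustrative Python, not the method of McShane *et al*.; for simplicity, a uniform null model over the data's bounding box replaces the PCA-based Gaussian null described above):

```python
import math
import random

def nn_distances(points):
    # distance from each point to its nearest neighbour
    return [min(math.dist(points[i], points[j])
                for j in range(len(points)) if j != i)
            for i in range(len(points))]

def clustering_tendency(points, n_null=50, seed=0):
    # Fraction of null datasets (uniform over the bounding box) whose mean
    # nearest-neighbour distance is at most the observed one; low values
    # suggest that the data are more tightly clumped than random data.
    rng = random.Random(seed)
    dims = len(points[0])
    lo = [min(p[k] for p in points) for k in range(dims)]
    hi = [max(p[k] for p in points) for k in range(dims)]
    observed = sum(nn_distances(points)) / len(points)
    smaller = 0
    for _ in range(n_null):
        null = [tuple(rng.uniform(lo[k], hi[k]) for k in range(dims))
                for _ in range(len(points))]
        if sum(nn_distances(null)) / len(null) <= observed:
            smaller += 1
    return smaller / n_null
```

As the section notes, this kind of test is only reliable in low dimensions, where nearest-neighbour distances remain informative.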