6. SAMPLE APPLICATION
In this final section, a brief example of a cluster analysis on gene expression data is given. It demonstrates how validation measures can provide insight into the structure of a dataset and can be used to assess the performance of individual clustering algorithms. The dataset employed is Golub et al.'s (1999) Leukemia dataset (http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi). Because the aim is to conduct an unsupervised analysis, the genes used for the clustering are also selected in a completely unsupervised fashion, that is, without reference to the known class labels.
The data are subjected to a series of standard pre-processing steps: lower and upper threshold values (raw expression values of 100 and 16 000, respectively) are applied, the 100 genes with the largest variation across samples are selected, and the expression values of these genes are log-transformed. The resulting dataset of size 38 x 100 (samples x genes) is then clustered under Euclidean distance. The corresponding validation results are presented in Figures 10 and 11. Taken together, the evidence accumulated across the validation techniques employed indicates that the three-cluster solution found by k-means, SOM, SOTA and average link is of high quality. This three-cluster solution corresponds to an almost perfect separation of the samples of acute leukaemias into those arising from myeloid precursors (AML), and two sub-classes arising from lymphoid precursors (T-lineage ALL and B-lineage ALL).
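The pre-processing steps described above can be illustrated with a minimal sketch. It is not the code used to produce the reported results, and it rests on several assumptions that the text does not specify: "variation" is measured as the variance of the thresholded values, the logarithm is taken to base 10, and the input is a samples x genes array of raw expression values. The function name preprocess_expression and its parameters are hypothetical.

```python
import numpy as np


def preprocess_expression(raw, lower=100.0, upper=16000.0, n_genes=100):
    """Threshold, filter and log-transform a samples x genes matrix.

    Assumptions (not stated in the text): variation is measured as the
    variance of the thresholded values, and the log is taken to base 10.
    """
    raw = np.asarray(raw, dtype=float)

    # Step 1: apply the lower and upper thresholds to the raw values.
    clipped = np.clip(raw, lower, upper)

    # Step 2: keep the n_genes genes (columns) with the largest variance
    # across samples. No class labels are used, so the selection is
    # unsupervised.
    variances = clipped.var(axis=0)
    n_keep = min(n_genes, clipped.shape[1])
    selected = np.sort(np.argsort(variances)[::-1][:n_keep])

    # Step 3: log-transform the expression values of the selected genes.
    transformed = np.log10(clipped[:, selected])
    return transformed, selected


if __name__ == "__main__":
    # Synthetic stand-in for the real data: 38 samples, 500 genes.
    rng = np.random.default_rng(0)
    raw = rng.uniform(0.0, 20000.0, size=(38, 500))
    X, genes = preprocess_expression(raw)
    print(X.shape)
```

The returned matrix can be passed directly to any clustering routine that operates on Euclidean distance. The indices of the selected genes are returned alongside it so that clusters can later be traced back to the original gene identifiers.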