As illustrated for yeast cell cycle data, differences among
clustering algorithms and individual dataset structures make
it difficult, and limiting, to simply select one clustering
result and expect it to produce a fully informative data model.
In the absence of ways to make objective comparisons or to mine
comparisons, it has until now been exceedingly difficult to
tell by inspection whether one clustering is significantly ‘better’
than another or to dissect differences between results in a
systematic manner. The mathematical, computational and visualization
tools that collectively comprise CompClust allow one to run
diverse unsupervised and supervised clustering algorithms, compare
the results using unbiased quantitative tools and then dissect
similarities and differences between specific clusters and between
entire clusterings. Specifically, we showed that LA and NMI
metrics, ROC analysis, PCA projections and interactive confusion
array analysis can be combined to provide a powerful comparative analysis.
By coupling the resulting comparative analyses with a flexible visualization system within CompClust and, especially, by using confusion arrays to organize comparisons, it became relatively straightforward to identify both global and local trends in expression patterns and to find out the features that are fragile to algorithm choice or to other variations. The tools were also useful for investigating substructure within individual gene clusters and for seeing how a cluster from one analysis relates to a cluster from another analysis. CompClust, including all source code, and associated tutorials are available at http://woldlab.caltech.edu/compClust/. The principal capabilities presented here can be used through the GUI of CompClustTK. The tools are introduced by tutorials that use the cell cycle examples presented here. Much richer and almost infinitely varied interactive interrogations can be performed using the command line version of CompClust, which is available for download. And while this demonstration is centered on clustering of large-scale expression data from microarrays, it will be applicable to clusterings of protein interaction measurements, protein–DNA interactions and other large-scale data types.
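As an illustration of the quantities discussed above, the following minimal Python sketch builds a confusion array for two clusterings of the same items and derives an NMI score and an LA score from it. It is not the CompClust implementation: the function names are hypothetical, the NMI is normalized by sqrt(H(A)·H(B)), which is one common convention and may differ from the normalization defined in Methods, and the LA score is taken here to be the fraction of items covered by the best one-to-one pairing of clusters, found by brute force rather than by a dedicated assignment solver.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def confusion_array(labels_a, labels_b):
    """Count the items that fall in each (cluster of A, cluster of B) cell."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    return Counter(zip(labels_a, labels_b))


def entropy(sizes, n):
    """Shannon entropy (in nats) of a partition, given its cluster sizes."""
    return -sum((c / n) * log(c / n) for c in sizes if c > 0)


def nmi(labels_a, labels_b):
    """Mutual information divided by sqrt(H(A) * H(B)) (assumed normalization)."""
    n = len(labels_a)
    joint = confusion_array(labels_a, labels_b)
    size_a, size_b = Counter(labels_a), Counter(labels_b)
    mutual = sum(
        (c / n) * log(c * n / (size_a[a] * size_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = entropy(size_a.values(), n)
    h_b = entropy(size_b.values(), n)
    if h_a == 0 or h_b == 0:
        # A single-cluster partition has zero entropy; avoid dividing by zero.
        return 1.0 if h_a == h_b else 0.0
    return mutual / sqrt(h_a * h_b)


def linear_assignment_score(labels_a, labels_b):
    """Fraction of items covered by the best one-to-one cluster pairing.

    Brute force over permutations, so only suitable for a handful of clusters.
    """
    joint = confusion_array(labels_a, labels_b)
    clusters_a = sorted(set(labels_a))
    clusters_b = sorted(set(labels_b))
    # Permute the longer list so each cluster in the shorter one gets a partner.
    if len(clusters_a) > len(clusters_b):
        clusters_a, clusters_b = clusters_b, clusters_a
        joint = Counter({(b, a): c for (a, b), c in joint.items()})
    best = max(
        sum(joint[(a, b)] for a, b in zip(clusters_a, perm))
        for perm in permutations(clusters_b, len(clusters_a))
    )
    return best / len(labels_a)
```

Two clusterings that differ only in how their clusters are named score 1 on both measures, whereas clusterings that split the items differently leave counts in the off-diagonal cells of the confusion array, and those cells are the natural starting point for dissecting the disagreement.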
Comparative analysis showed that for the Affymetrix dataset (1), EM MoDG and the Cho heuristic found a basic data structure dominated by, and consistent with, the major phases of the cell cycle (Figure 5). This presented an apparent paradox, since the overall cell cycle phase structure was highlighted similarly by both algorithms, while the assignment of specific gene vectors to individual clusters was quite different, as shown by the LA and NMI scores (Figure 1). Further investigation of cluster relationships in the context of confusion arrays, local ROC analysis, and ROC curve structure helped to resolve the paradox. In specific cases, ambiguity was simply a data quality issue for particular gene data vectors, and highlighting these affords a biologist the opportunity to trim gene lists based on expert knowledge. In other cases, differences between algorithms portrayed correctly the fact that phases of the cell cycle are not crisply separated with respect to both mRNA synthesis and decay. This leads to a continuum of time-course profiles, especially around S phase. This knowledge of ‘fuzzy kinetic boundaries’ is important for future uses of gene categories for subsequent gene network modeling. In yet other specific parts of the clustering, the differences between algorithms focused attention on different and valid ways of parsing the data, as in the case of genes regulated strongly by Ace2/Swi5 in G1, which were separable by one algorithm but not by another.
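The idea of using ROC analysis to judge how crisply a cluster is separated from the rest of the data can be sketched as follows. This is an assumption-laden illustration rather than the procedure defined in Methods: it assumes that every expression vector is ranked by its Euclidean distance to the mean profile of the cluster in question, that cluster members are the positives, and that the area under the resulting ROC curve is the summary statistic. The function names are hypothetical.

```python
from math import dist  # available in Python 3.8 and later


def centroid(vectors):
    """Mean profile of a list of equal-length expression vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_roc_auc(vectors, labels, cluster):
    """Area under the ROC curve for one cluster against all other items.

    Items are scored by distance to the cluster's mean profile (assumed
    ranking). The area equals the probability that a randomly chosen member
    lies closer to that mean than a randomly chosen non-member, with ties
    counted as one half.
    """
    members = [v for v, lab in zip(vectors, labels) if lab == cluster]
    others = [v for v, lab in zip(vectors, labels) if lab != cluster]
    if not members or not others:
        raise ValueError("need at least one member and one non-member")
    center = centroid(members)
    member_d = [dist(v, center) for v in members]
    other_d = [dist(v, center) for v in others]
    wins = sum((m < o) + 0.5 * (m == o) for m in member_d for o in other_d)
    return wins / (len(member_d) * len(other_d))
```

Under these assumptions, an area close to 1 indicates a cluster whose members all lie nearer to its mean than any outsider does, while lower values flag overlap with neighbouring clusters of the kind described above for profiles around S phase.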
Inference of transcription modules
A key capability of the CompClust computational framework is that it provides the means to integrate many different kinds of data via linking properties (see Methods). This then allows the biologist to detect, organize and further mine relationships. The manner in which relationships, such as a direct connection between a transcription factor and one of its target genes, can be defined and vetted (by other data) is flexible, so that users can specify significance thresholds and apply diverse comparative metrics of their choice. They may also export CompClust data for further automated modeling, e.g. artificial neural networks [(9); C. Hart, E. Mjolsness, B. Wold, manuscript in preparation]. By using CompClust in this way, we were easily able to capture all known regulatory modules governing yeast G1 transcription and to relate these regulatory connections to specific expression clustering patterns.
Funding of this work and its open access publication was from the NCI, the NIH, NASA, the Department of Energy, and the LK Whittier Foundation. The authors thank Prof. Joe Hacia, Drs Jose Luis Riechmann and Brian Williams for helpful comments on the manuscript.
Conflict of interest statement. None declared.