Division of Biology, California Institute of Technology Pasadena, CA 91125, USA 1Jet Propulsion Laboratory, Machine Learning Systems Group Pasadena, CA 91109, USA 2Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA 92697, USA 3School of Information and Computer Science, University of California Irvine, Irvine, CA 92697, USA
Nucleic Acids Research 2005 33(8):2580-2594. Open Access Article.
Analysis of large-scale gene expression studies usually beginswith gene clustering. A ubiquitous problem is that differentalgorithms applied to the same data inevitably give differentresults, and the differences are often substantial, involvinga quarter or more of the genes analyzed. This raises a seriesof important but nettlesome questions: How are different clusteringresults related to each other and to the underlying data structure?Is one clustering objectively superior to another? Which differences,if any, are likely candidates to be biologically important?A systematic and quantitative way to address these questionsis needed, together with an effective way to integrate and leverageexpression results with other kinds of large-scale data andannotations. We developed a mathematical and computational frameworkto help quantify, compare, visualize and interactively mineclusterings. We show that by coupling confusion matrices withappropriate metrics (linear assignment and normalized mutualinformation scores), one can quantify and map differences betweenclusterings. A version of receiver operator characteristic analysisproved effective for quantifying and visualizing cluster qualityand overlap. These methods, plus a flexible library of clusteringalgorithms, can be called from a new expandable set of softwaretools called CompClust 1.0 (http://woldlab.caltech.edu/compClust/).CompClust also makes it possible to relate expression clusteringpatterns to DNA sequence motif occurrences, protein–DNAinteraction measurements and various kinds of functional annotations.Test analyses used yeast cell cycle data and revealed data structurenot obvious under all algorithms. These results were then integratedwith transcription motif and global protein–DNA interactiondata to identify G1 regulatory modules.
The sources of difference between clustering algorithm outputsare many and varied, and the biological implications are alsodiverse, as illustrated below for cell cycle data. The generalchallenge is to detect, measure, evaluate and mine the commonalitiesand differences. Specifically, the biologist usually wants tofirst know whether one clustering is objectively and significantly‘better’ than another, and just how big the differenceis. If two clusterings are of similar overall quality, yet differsubstantially from each other as is often observed, then whatspecific gene cluster or samples harbor the greatest differences,or are they evenly distributed across the data? At a finer levelstill, which genes are being assigned to different clustersand why? Importantly, do the distinctions between clusteringshighlight properties of biological importance?
To begin to answer such questions, we first needed a way tomake systematic quantitative comparisons and then we neededways to effectively mine the resulting comparisons. We use confusionmatrices as the common tool for these comparisons (see belowand Methods). A confusion matrix effectively summarizes pairwiseintersections between clusters derived from two clustering results.These similarities are quantified by applying scoring functionsto the confusion matrix. In this work, we use two differentscoring functions for this purpose: (i) normalized mutual information(NMI), which measures the amount of information shared betweenthe two clustering results (7) and; (ii) a linear assignment(LA) method, which quantifies the similarity of two clusteringsby finding the optimal pairing of clusters between two clusteringresults and measuring the degree of agreement across this pairing(8,9). Previous studies have used metrics for evaluating thetotal number of data point pairs grouped together between twodifferent clusterings to begin to address the need for quantifyingoverall differences (10–13). Ben-Hur et al. (13) usedthis to help determine an optimal number of clusters (K) andto assess the overall validity of a clustering. These priortechniques did not, however, offer the capacity to isolate andinspect the similarities and differences between two differentclusterings, nor did they provide an interactive interface forbiology users that would permit them to usefully capture thecomparative differences and similarities. We also introducea new application of receiver operator characteristic (ROC)analysis (14,15). As we use it here, ROC enables one to quantifythe distinctness of a given cluster relative to another clusteror relative to all non-cluster members. Implemented in thisfashion, ROC provides another measure of local cluster qualityand shape, and provides another tool for quantitatively dissectinga cluster. Though the methods and tools were worked out forclusterings of large-scale gene expression data, they are applicableto clusterings of other kinds of large-scale data as well.
We have integrated the algorithms and comparative tools intoan interactive analysis package collectively called CompClust1.0. CompClust enables a user to organize, interrogate and visualizethe comparisons. In addition to comparative cluster analysis,an important feature of this software is that it establishesand maintains links between the outputs of clustering analysesand the primary expression data, and, critically, with all otherdesired annotations. In the sense used here, ‘annotations’include other kinds of primary and metadata of diverse types.This gives a biologist crucial flexibility in data mining andpermits analyses that integrate results from other kinds ofexperiments, such as global protein–DNA interactions (ChIP/Array),protein–protein interactions, comparative genome analysisor information from gene ontologies.CompClust methods and tools are agnostic about the kinds ofmicroarray data (ratiometric, Affymetrix, etc.) and types ofclustering algorithms used. We demonstrate the tools by analyzingtwo different sets of yeast cell cycle expression data representingboth major data platforms, clustered by four different methods:a statistical clustering algorithm [Expectation Maximizationof a Mixture of Diagonal Gaussian distributions (EM MoDGs)](this work), a human-driven heuristic (1), a Fourier transformalgorithm designed to take advantage of a periodic time-coursepatterns (16) and an agglomerative version of the Xclust phylogeneticordering algorithm [Eisen et al. (2) modified in this work].We show that gene groups derived from these comparative analysescan be integrated with data on evolutionarily conserved transcriptionfactor binding sites to infer regulatory modules. These resultsbegin to illustrate how a more quantitative and nuanced understandingof both global and local features in the data can be achieved,and how these can be linked with diverse kinds of data typesto infer connectivity between regulators and their target genemodules.
CompClust is primarily accessible through an application programminginterface (API) and, with the use of Python's exposed interpreter,this provides a very rich command line interface (CLI). Themajor capabilities illustrated in this paper are accessiblethrough CompClustTK, a set of simple graphical user interfaces(GUIs) to offer a convenient starting point without learningPython commands. These GUIs will permit users to perform themajor classes of analyses shown, though we note that these compriseonly a fraction of CompClust capabilities. The flexibility anddiversity of analysis paths is too great to anticipate themall or commit them to GUI forms. This limitation can be overcomeby using the Python command line environment. Python commandscan be learned at the level needed in a relatively short time(a few weeks of part time effort) by users who do not have priorprogramming experience. The benefit is access to remarkableflexibility in interrogating datasets. This is a much bettermatch to the diversity of questions and comparisons that biologistsusually want to make than in any GUI-based system.
The choice to implement CompClust in Python over other languageswas made for several reasons which, considered in aggregate,argue it is the best available language to support the capabilitiesand analysis goals of CompClust: (i) using Python's exposedinterpreter, our API becomes immediately useful for analysiswithout the construction of a complex GUI. The exposed interpreteralso speeds the development time. (ii) Python's syntax is fairlystraightforward and easy to learn for even non-programmers.(iii) It is freely available and distributable under an open-sourcelicense. (iv) Python has an extensive and standard library andin addition third party extensions, including the Numeric packagewhich provides powerful numeric analysis routines. (v) Pythonis also platform neutral and runs on the majority of systems,including unix/linux, Microsoft Windows and the Mac OS.
Pairwise comparison of clusterings (partitions) using confusion arrays and matrices
Confusion arrays and matrices were used to make pairwise comparisonsbetween different clusterings (mathematical partitions). A setof metrics were then applied to the confusion matrix to measurethe nature and degree of similarity between two dataset partitions.Briefly, a confusion matrix is the matrix of cardinalities ofall pairwise intersections between two partitions, where a partitionof a dataset is defined as a set of disjoint subsets whose unioncontains all elements of the dataset. We define a confusionarray simply as an array of all pairwise intersections betweentwo partitions of a dataset. The cardinalities of these intersectionsets form the confusion matrix, whose elements are given byEquation 1:
The LA value for a confusion matrix is calculated between twopartitions (clusterings) and by generating an optimal pairingso that there is, at most, a one-to-one pairing between everyclass in partitions, and this pairing is calculated by optimizingthe objective function in Equation 2, using the constraintsgiven in Equation 3, thus defining a linear assignment problem.Next, the maximum-cardinality bipartite matching of maximumweights algorithm (Gabow, 1973) was implemented for the optimization.After finding the optimal pairing, the LA score is simply theproportion of vectors (e.g. gene expression trajectories orconditions) included in the optimally paired clusters (Equation 4).It is important to note that LA, unlike NMI, is a symmetricscore so that LA(A,B) = LA(B,A). In addition to quantifyingthe degree of similarity or difference between two partitions,the adjacency matrix (Equation 3) also provides a way to identifypairs of clusters that are globally most similar to each otherbetween two partitions of the data. As illustrated for clusteringsof yeast cell cycle regulated genes, this is especially usefulfor interactive examination of two clusterings.
Normalized mutual information
The NMI index (7) quantifies how much information is lost, onaverage, when one clustering is regenerated from a second classification(Equation 5). A noteworthy difference from LA is that NMI isasymmetric.
EM MoDG clustering
EM MoDG was implemented with a diagonal covariance matrix modelbecause the number of samples in the (1) cell cycle datasetwas too small to fit a statistically valid full covariance matrixto each cluster (17). In order to ensure a near optimal initialization,each EM MoDG result was a result of selection of the best of30 runs, each initialized by placing the initial cluster centroidson K randomly selected data points. The run with best fit tothe data (i.e. had the lowest log-likelihood score) was usedfor the final clustering. Multiple best-of-30 runs were performedto verify that the quantitative measures and gene lists resultsreported here did not vary significantly. The EM MoDG code usedhere was developed by the NASA/JPL Machine Learning SystemsGroup.
We agglomerate the hierarchical tree returned by Xclust (18)based on a maximal cluster size threshold. Starting from theroot, any subtree within the tree with less than the maximalcluster size threshold is agglomerated into a cluster. In orderto work with the familiar parameter K (number of clusters),we iteratively find the size threshold that will return as closeto K clusters as possible. In practice, this simple heuristicworks best when K is over specified by 2–4 times the expectednumber of clusters because it will generate several very small(often singleton) clusters that are outliers to core major clustersin the data.
Each microarray dataset was obtained from the cited authors.For the Cho et al. (1) data, we removed any gene that did notshow a sustained absolute expression level of at least 8 for30 consecutive minutes. For each gene vector, we then dividedeach time point measurement by the median expression value forthe gene. For the data of Spellman et al. (16), we linearlyinterpolated missing values using the available adjacent timepoints. For both datasets, we log2 transformed the resultinggene expression matrices. The datasets were then annotated withthe original clustering results as published.
Motif conserved enrichment score (MCS)
For each motif, we translated the IUPAC consensus (Swi5/Ace2:KGCTGR; MCB: ACGCGT; SCB: CACGAAA) into a position weight matrix(PWM) where the probabilities or frequencies in the PWM is determinedby the degeneracy of the IUPAC symbol. We calculate a log-oddsratio for the PWM occurring at every position in the 1 kb upstreamas described in Equation 10
Mathematical tools for organizing and quantifying microarray clusterings: confusion matrices and comparative metrics
A confusion matrix can effectively summarize all pairwise intersectionsbetween all clusters from any two clusterings of the same data.Confusion matrices, as used in this work, are defined as thematrix of cardinalities of all pairwise intersections betweentwo different expression clusterings (see Methods). Once theconfusion matrix is constructed, we can then apply differentscoring functions to the confusion matrix to quantify similarity:(i) NMI measures the amount of information shared between twoclusterings (7); and (ii) LA optimizes the number of data vectorsin clusters that correspond to each other, thereby identifyingthe optimal pairing of clusters. LA also reports the percentageof data vectors contained within those clusters, and this canbe used to assess similarity of results globally over the entiredataset and locally on a cluster pair by cluster pair basis(8,9) (for mathematical descriptions of confusion matrices,NMI and LA, see Methods). The combined use of LA and NMI metricscan provide a biologist with immediate insight into the magnitudeand nature of global differences between two microarray clusteringsby capitalizing on the fact that NMI is asymmetric and LA issymmetric (see Table 1 for details). Specifically, this discriminatesinstances in which two clusterings are very similar to eachother from a comparison in which one clustering is differentfrom the other, but is essentially a refinement of the first.Both of these can be discriminated from the third relationship,in which two clusterings deliver fundamentally different viewsof the data structure.
Confusion arrays organize and display comparative analyses.Given two different clusterings of a dataset and a global evaluationof their similarity via NMI and LA, we then needed a methodto systematically compare clusters in a manner that is moreeffective and intuitive than mere inspection of gene lists.To do this, we define the confusion array, which is a directextension of a formal confusion matrix. For two different clusterings,each cell in the confusion array contains the intersection setbetween the two parent clusters (as opposed to the cardinalityof this set, as in a confusion matrix; see Methods). In thecontext of the CompClust system, the confusion array cells canthen be interactively mined. Confusion arrays for two differentclusterings, one using an Affymetrix yeast cell cycle dataset(1) and the other using a deposition ratiometric dataset (16),are shown in Figures 1 and 2, and are analyzed further below.
Understanding cluster relatedness
ROC measures cluster overlap. Whatever algorithm has been usedto cluster data, it is useful to find out how distinct eachcluster is from all the others and how distinct any particularcluster is from another specific cluster. This is especiallypertinent when membership in a cluster will be translated intoa gene list that ultimately becomes a functional annotationor defines genes that will be input into higher-order analyses.To address this issue, we applied classical ROC analysis (seeMethods). In this context, cluster assignment is used as the‘diagnosis’ and the distance of each expressionvector from the cluster mean vector is the ‘decision criterion’.The corresponding ROC curve plots the proportion of clustermembers versus the proportion of non-cluster members as thedistance from the cluster centroid increases (Figure 3). Thiscan be interpreted geometrically as expansion of a hyperspherefrom the cluster centroid until all members of the cluster areenclosed. Thus, when one cluster is completely separate fromall other data, all of its members are closer to the clustercenter than all non-members and the area under the ROC curveis 1.0 (Figure 3B). When a cluster is not fully separable fromthe remainder of the data, the ROC curve rises more slowly andthe area under the ROC curve is <1.0. In the limit, whenthe two classes are perfectly mixed, the ROC curve closely followsX = Y, and the area under the curve drops to 0.5 (Figure 3D).The shape of the ROC curve also contains additional informationabout how cluster overlap is distributed, and this informationcan be used by the biologist to choose useful data mining cut-offsthat mark discontinuities and cluster substructure (see belowand Figure 4). It can also be used interactively within CompClustto explore and select data vectors (genes) that are closer ormore distant from the cluster center. Selection of vectors notassigned to the cluster, yet positioned at overlapping distancesfrom its center, is also possible and is often instructive (Figure 4and text below).
Comparing clusterings of yeast cell cycle microarray datasets
We performed comparative analyses on clustering results fromtwo different yeast microarray time-course datasets (one Affymetrixand one ratiometric), each composed of genes that are differentiallyexpressed over the cell cycle (1,16). These comparisons providea useful perspective because the gene classification resultsfrom the original gene clusterings have been introduced as geneannotations in widely used databases [Incytes' YPD (19), SGD(http://yeastgenome.org)] and have been mined or used as startingpoints in many subsequent works. We generated a new clusteringfor each dataset, in each instance selecting an algorithm thatdiffers substantially from the one used in the original publicationbut one that should also be entirely appropriate for the dataset.For the Cho et al. (1) dataset, we used EM MoDGs (17), whichis an unsupervised method that searches for the best statisticalfit to the data modeled as a mixture of Gaussian distributions.The heuristic used in the original report (1) is a supervisedmethod based on biologist's knowledge of cell cycle phases.The heuristic focused on the time of peak expression for eachgene trajectory to guide assignment of each gene to one of thefive time domains associated with Early G1, Late G1, S, G2 andM phases of the cell cycle. For the second dataset (16), weperformed an agglomerative phylogenic hierarchical clusteringof the tsCDC15-mutant synchronized data. This algorithm is basedon the widely used Xclust phylogenetic ordering algorithm (2),onto which we grafted an agglomeration step designed to establishobjective boundaries in the tree (see Methods). This effectivelyturns an ordering algorithm into a clustering algorithm, inwhich group boundaries are imposed computationally rather thanby a user's pattern recognition skills, as is performed withXclust by itself. This result was compared with the result reportedby Spellman et al. (16), in which they used a Fourier transform-basedalgorithm, which was designed to take maximal advantage of thetime-course pattern.Global similarity measures
Comparison of the two clusterings of Affymetrix data from Choet al. (1) gave a global LA score of 0.63 and NMI scores of0.52 and 0.50, immediately indicating that EM MoDG and the heuristicclassification have produced substantially different results.The LA value of 0.63 says that the optimal pairing of clustersstill classifies 37% of the genes differently between the twoalgorithms. ROC curves and ROC areas were generated for eachcluster (Figure 5). Viewed in aggregate, this ROC analysis showedthat clusters from EM MoDG are all better separated from eachother than are any clusters from the original Cho et al. (1)heuristic. Thus, the ROC indices for EM MoDG are all 0.96 orabove, and four of the five clusters are >0.98. In contrast,the heuristic classification groups had ROC values as low as0.82 for S phase and none was better than 0.97 (M phase). Bythis criterion, we can argue that EM clustering is an objectivelysuperior representation of the underlying data structure.
How are these differences between clustering results distributedover the dataset? We used PCA (Principle Component Analysis)to determine whether the two clusterings were globally similaror different in the way they partitioned the dataspace. PCAprojects high-dimensional gene expression vectors (each dimensionhere corresponding to a different RNA sample) into a differentand low-dimensional space (usually two or three) (20), in whichthe new PCA dimensions have each been selected to explain themaximal amount of variance in the data. A common feature ofmicroarray datasets is that the first few principle componentsoften capture most of the variations in the data (here 64%).Using CompClust to view the cluster means in PCA space allowedus to assess relationships between clusters from the two algorithms.Relative positions of cluster means in the PCA display the cellcycle progression in a counterclockwise pattern that is quitesimilar for the two algorithms. The absolute positions of thecluster centers in PCA space differ, though not extravagantly,for most clusters. This is interesting because the coherencein overall structure would seem to contradict the rather highdissimilarities in cluster composition measured by the criteriaLA and NMI, and shown graphically in the confusion array (Figure 1).Considered together, the results argue that the overalldata structure, reflecting phases of the cell cycle, is robustand has been treated rather similarly by the two algorithms,even though 37% of individual gene expression vectors were assigneddifferently. This raises the question of which gene vectorshave been differentially assigned and what biological meaning,if any, should be attached to the differences. These questionsare addressed in the Discussion by examining specific gene groupsin the confusion array. Using the ratiometric data of Spellman et al. (16), a comparisonof the original Fourier-based algorithm versus agglomeratedXclust produced NMI and LA scores of 0.39, 0.41 and 0.60, respectively.These scores indicate that the two clusterings are even moredifferent in membership assignment, with 40% of genes fallingoutside the optimal LA pairing. Since both NMI and LA scoresare low, gene memberships for some clusters must be truly scrambledrather than being simple combinations of cluster unions andsubsets (see Methods and Table 1). PCA projection (Figure 6)showed that some major cluster centers from the two algorithmsare positioned very differently, both absolutely and relatively(note the yellow cluster corresponding to the Fourier S-phasegroup). The confusion array shows that XclustAgglom clustersoften combine genes that are members of adjacent Fourier clusters,though in some cases it joins vectors from non-adjacent groups(the confusion around S phase is complex). ROC curves and scoresalso indicate that XclustAgglom has performed a slightly betterjob of segregating data into discrete groups that reflect underlyingdata structure, while the Fourier analysis groups are less coherentand often seem to mix members of kinetically adjacent groupsas detailed in the confusion array (Figures 2 and 6). This maybe due, in part, to the use of a small number of then knowngenes to center landmark phases by the Fourier algorithm. Thefact that this phase assignment was a ‘somewhat arbitrary’step in the original analysis was pointed out by Spellman etal. (16).
High-resolution cluster comparisons
Confusion arrays can be used to explore in more detail the issuesraised by global analyses and to mine relationships betweenindividual clusters. This can then be used to make refined genelists based on either the expert opinion or on the applicationof computationally objective criteria. We applied LA to theconfusion matrix of the Cho heuristic and the EM MoDG resultsand produced the corresponding adjacency matrix (see Methodsand Equation 3). This delivered an objectively optimized pairingof EM cluster 1 with Cho ‘Early G1’, EM cluster2 with Cho ‘Late G1’ and so on, as shown in thearray visualization (Figure 1). Each cell in the confusion arraycontains the corresponding gene vectors and displays the calculatedmean vector for each intersect cell in the array. The confusion array highlighted relationships that were notclear from Figures 5 or 6. For example, in the Affymetrix data(1), both algorithms identified two gene classes within G1 (redand yellow, respectively, in the PCA analysis of Figure 5).However, the EM1 cluster shares only 67% of its content withthe Cho ‘Early G1’, and most remaining genes fallinto the Cho ‘Late G1’ cluster (Figure 7A). A straightforwardhypothesis is that the statistical EM algorithm simply couldnot justify dividing G1 vectors into early and Late G1 kineticgroups as the heuristic had done. The confusion array, however,makes it clear at a glance that a different data feature isdriving the G1 sub-groupings. EM1 cluster members are upregulatedonly in the second cycle, while EM2 genes are upregulated inboth cycles. The array also shows that the Cho Early G1 groupcontains a set of 10 genes that appear much more consistentwith a coherent M-phase group that corresponds to EM5.
Because the focus of the heuristic classification was mainlyon the second oscillation, it suppressed the distinction betweensingle cycle and two cycle G1 patterns, while ‘payingmore attention’ to fine-structure kinetic differencesof the second cycle. On the other hand, EM MoDG treats all featureswith equal weight across the time course and so centers clusterswithout prior guidance about their relationship to cell cyclephase. The confusion array intersect cells then effectivelyparsed the fine kinetic differences within EM1 by separating47 vectors that more closely resemble the Early G1 cluster from18 that are more like Late G1 cluster. Thus, the intersect cellscapture and dissect the two distinct ways used by the two algorithmsto parse G1 expression. Both appear valid and reveal in differentways, one based entirely on the kinetics of the second cycleand one focusing on a major difference in the expression thatis seen only in the first cycle following release from arrest.
Turning to the S-phase clusters, the comparative analysis highlightsa different kind of disagreement between the two algorithm outputs.Members from Cho ‘S-phase’ cluster overlap almostevenly with either EM2 (41%) or EM3 (49%). A simple biologicalinterpretation is that the kinetic boundary between Late G1and S-phase is not very crisp, irrespective of the algorithmused to try to define them. An alternate explanation is thatone algorithm is frankly superior to the other in defining coherentexpression groups. Examination of the ROC curves (Figure 5)and values (0.98 for EM versus 0.82 for S phase) provided anobjective measure of quality that argues the EM clustering ofS phase is superior.
Dissecting individual clusters using ROC
Further ROC-based analysis of the S-phase cluster from the Choet al. (1) classification is shown in Figure 4. The ROC curveshows how far from the cluster mean one needs to expand a hypersphereto include a given fraction of vectors from the cluster, andthe shape of the curve can be used to understand cluster substructure.Inspection of the ROC curve and the corresponding histogram(Figure 4A) identified a natural discontinuity separating thefirst 66% of genes that are nearer the cluster center from theremainder. For additional data mining, we therefore set a boundaryat 66% on the ROC curve and then inspected all gene vectorsfrom the entire cell cycle dataset that fall within that boundary.Approximately 20% of gene vectors inside this dataspace thresholdhad been assigned to other clusters. Figure 4B and C allowsinspection of gene trajectories that were either interior orexterior to the boundary. This is useful for reviewing and ‘pruning’lists of putatively co-expressed genes in an objective manner.
Integration with transcription factor motifs to identify regulatory modules
CompClust is designed to integrate different kinds of data bylinking each gene with other data, annotations and results ofmeta-analyses. There are many ways of using other datasets toidentify relationships between, for example, observed patternsof RNA co-expression and other data that help to answer thequestion: are similarly expressed genes co-regulated? A groupof genes that are co-expressed may also be co-regulated, butthis is far from assured. Co-expressed genes can instead arriveat the same expression pattern by the action of two (or more)different regulators. Conversely, genes that are co-regulatedby the same factor(s) at the transcriptional level may not displayidentical RNA expression patterns for a variety of reasons,including differential turnover rates. For these reasons, additionalkinds of data are needed to help determine the co-expressedgenes that are, in fact, transcriptionally co-regulated andto provide evidence for the identity of factor(s) driving co-regulation.Here, we show how the occurrence of evolutionarily persistenttranscription factor binding sites can be mapped informativelyonto the gene expression clusters from a confusion array topredict the structure of transcription modules.
The observation of two distinct sets of genes, one that peaksduring both the first and second cell cycles after release fromarrest, and another restricted to only the second oscillation(Figure 7), suggests that they might be regulated differentlyat the level of transcription. Prior work has led to the viewthat MCB and SCB sequence motifs bind Mbp1/Swi6 (MCF) or Swi4/Swi6(SCF) factor complexes to drive G1-specific transcription (21–23).Thus, many genes are believed to be selectively and specificallyexpressed in G1 owing to their membership in either MCF or SCFregulatory modules. The two modules are also thought to be partlydistinct from each other, with some genes apparently being stronglygoverned by either Swi4 or Mbp1 [(24,25) and reviewed in (26)].
We, therefore, calculated an MCS (see Methods) to quantify theconserved enrichment of a consensus site within 1 kb of thestart ATG in sequence data from the seven available yeast genomes(27,28). We then asked whether different intersecting cellswithin the confusion array are differentially and significantlyenriched for these known candidate motifs. The EM2/Late G1 intersectcell was highly enriched, above chance, for MCB and SCB. A totalof 79 of 113 genes (70%) were enriched for MCB compared withthe expectation of 13 such genes for randomly selected samplesof 113 yeast genes. A total of 18% are enriched for SCB sitescompared with an expectation of only 4% by chance (Figure 8A).CompClust's data linking capabilities were then used to visualizecorrelations with in vivo protein–DNA binding data forSwi4 and Mbp1 (29). The vast majority of genes with the abovethreshold MCB or SCB MCS scores also showed significant in vivobinding activity for either MCF or SCF.
The picture for the EM1/Early G1 intersect cell, whose genespeak only once during the time course, was surprisingly different.This group showed no significant enrichment for either MCB orSCB (Figure 8A). What other factor(s) could be responsible forthe EM1/Early G1 intersect pattern? We searched and found thatthe Swi5/Ace2 motif is enriched so that 30% are above threshold,a value twice that expected by chance. The highest Swi5 MCSscores correlated strongly with very intense expression in thesecond cycle. This, in turn, correlated well with in vivo factorbinding by both Swi5 and Ace2 taken from the chromatin immunoprecipitationdata of Lee et al. (29). Thus, 60% of the EM1/Early G1 Swi5/Ace2group had P-values <0.05 for Swi5 or Ace2 in the global chromatinimmunoprecipitation study of Lee et al. (29), and others wererelatively strong binders as shown in Figure 8. That most orall of these connections between Swi5/Ace2 and EM1 genes, inferredfrom three kinds of genome scale data are real is supportedby the fact that most previously identified targets of Ace2and/or Swi5 (30,31) that were in the original Cho cycling datasetare in this group.
By coupling the resulting comparative analyses with a flexiblevisualization system within CompClust and, especially, by usingconfusion arrays to organize comparisons, it became relativelystraightforward to identify both global and local trends inexpression patterns and to find out the features that are fragileto algorithm choice or to other variations. The tools were alsouseful for investigating substructure within individual geneclusters and for seeing how a cluster from one analysis relatesto a cluster from another analysis. CompClust, including allsource code, and associated tutorials are available at http://woldlab.caltech.edu/compClust/.The principal capabilities presented here can be used throughthe GUI of CompClustTK. The tools are introduced by tutorialsthat use the cell cycle examples presented here. Much richerand almost infinitely varied interactive interrogations canbe performed using the command line version of CompClust, whichis available for download. And while this demonstration is centeredon clustering of large-scale expression data from microarrays,it will be applicable to clusterings of protein interactionmeasurements, protein–DNA interactions and other large-scaledata types.
Comparative analysis showed that for the Affymetrix dataset(1), EM MoDG and the Cho heuristic found a basic data structuredominated by, and consistent with, the major phases of the cellcycle (Figure 5). This presented an apparent paradox, sincethe overall cell cycle phase structure was highlighted similarlyby both algorithms, while the assignment of specific gene vectorsto individual clusters was quite different, as shown by theLA and NMI scores (Figure 1). Further investigation of clusterrelationships in the context of confusion arrays, local ROCanalysis, and ROC curve structure helped to resolve the paradox.In specific cases, ambiguity was simply a data quality issuefor particular gene data vectors, and highlighting these affordsa biologist the opportunity to trim gene lists based on expertknowledge. In other cases, differences between algorithms portrayedcorrectly the fact that phases of the cell cycle are not crisplyseparated with respect to both mRNA synthesis and decay. Thisleads to a continuum of time-course profiles, especially aroundS phase. This knowledge of ‘fuzzy kinetic boundaries’is important for future uses of gene categories for subsequentgene network modeling. In yet other specific parts of the clustering,the differences between algorithms focused attention on differentand valid ways of parsing the data, as in the case of genesregulated strongly by Ace2/Swi5 in G1, which were separableby one algorithm but not by another.
Inference of transcription modules
A key capability of the CompClust computational framework isthat it provides the means to integrate many different kindsof data via linking properties (see Methods). This then allowsthe biologist to detect, organize and further mine relationships.The manner in which relationships, such as a direct connectionbetween a transcription factor and one of its target genes,can be defined and vetted (by other data) is flexible, so thatusers can specify significance thresholds and apply diversecomparative metrics of their choice. They may also export CompClustdata for further automated modeling, e.g. artificial neuralnetworks [(9); C. Hart, E. Mjolsness, B. Wold, manuscript inpreparation]. By using CompClust in this way, we were easilyable to capture all known regulatory modules governing yeastG1 transcription and to relate these regulatory connectionsto specific expression clustering patterns.
Funding of this work and its open access publication was fromthe NCI, the NIH, NASA, the Department of Energy, and the LKWhittier Foundation. The authors thank Prof. Joe Hacia, DrsJose Luis Riechmann and Brian Williams for helpful commentson the manuscript. Conflict of interest statement. None declared.
Given two clustering results A and B, for which both NMI(A,B), NMI(B,A) and LA(A,B) values are high (nearing the maximum value of 1.0), the two clusterings are very similar, and when all three are significantly lower, they are very different. But when NMI(A,B) is high, NMI(B,A) is low and LA is low, then it is likely that A is a refinement of B. In this case, many clusters in B have been broken into two or more clusters in A (possible combinations summarized in here) (1). The magnitude of dissimilarity that is important is defined by the user and may vary considerably with the dataset, although values <0.7 for both LA and NMI are usually viewed as quite different. Additional interpretation of differences measured by LA and NMI depends on more detailed analysis of the dissimilarities and their distribution over the dataset, as outlined above.
Comparing two clustering results using a confusion array. Shown in this
comparison is a supervised clustering result published in the original
study by Cho et al. (1) and results from running an
unsupervised clustering (EM MoDG, see Methods) on the same Affymetrix
microarray dataset profiling yeast gene expression through two cell
cycles. The confusion array is composed of a grid of summary plots.
Each summary plot displays the mean (blue color or solid line)
expression level of a group of genes as well as the standard deviation
(red color or dashed line). Summary plots with a white background
represent clusters from either the Cho et al. (1) clustering
result (along the right most column) or the EM MoDG clustering result
(along the top row); cluster names are in the lower right corner; and
the number of genes in each cluster is displayed in the upper left
corner. Summary plots with a colored background represent cells within
the confusion array (see Methods), where each cell represents the
intersection set of genes that are in common between the Cho et al.
(1) cluster and the EM MoDG result cluster. Again, the upper left hand
corner displays the number of genes within a confusion matrix cell. The
background of each plot is colored according to a heat-map (scale
below) that registers the proportionate number of genes in the cell
compared with the corresponding cluster in the EM MoDG result.
Intersection cells with dark outlines indicate the optimal pairings
between the two data partitions, as determined from the LA calculation
(Equation 2). Quantitative measures of overall similarity between the
two clustering results using both LA and NMI are displayed in the graph
title (see Methods).
(Click image to enlarge)
Comparing two clustering results on a ratiometric microarray dataset
using a confusion array. Shown in this comparison is a Fourier
clustering result published in the original study by Spellman et al.
(16) and results from running an unsupervised clustering (Xclustagglom,
see Methods) on the same ratiometric microarray dataset as the Fourier
analysis was run on. Details of the figure layout are discussed in the
legend of Figure 1. Here, the 5 Fourier clusters are shown along the
rows, while the 10 Xclustagglom clusters are displayed across the
(Click image to enlarge)
Example ROC curves to assess cluster overlap. An ROC curve (B and D,
left side) is drawn as a function of moving outward from a cluster
center and counting the proportion of cluster members (blue points)
encountered along the y-axis versus the proportion of non-cluster members (red points) encountered along the x-axis.
The collection of distances from every point within a cluster and every
point outside a cluster is binned and used to create the distance
histograms (B and D, right side). Shown in red is the distance
histogram for cluster members and cluster non-members are shown in
blue. Two extreme cases are exemplified in this figure. (A) Example expression data falling into two completely discrete clusters highlighted in red and blue. (B)
The corresponding ROC curve (left) and distance histograms (right) for
the sample data shown in (A). Note that since all cluster members are
encountered before any non-cluster members the area under the ROC curve
is 1.0. The distance histograms also show this perfect separation. (C) Example expression data falling into two completely overlapping clusters highlighted in red and blue. (D)
The corresponding ROC curve (left) and distance histograms (right) for
sample data shown in (B). Note that since cluster members and
non-cluster members are encountered at an equal rate as a function of
distance from the cluster center, the ROC curve approximates the line x = y
and the area under the ROC curve is 0.5. This overlap is also
highlighted in the distance histograms because the distributions of
distances for cluster members completely overlap with that of the
distribution of distances for non-cluster members.
(Click image to enlarge)
ROC analysis of the S-phase cluster of Cho et al. (1). (A)
ROC curve (left) shows the overlap between this cluster of 74 genes and
genes from all other clusters in the time-course analysis [383 genes in
total, selected by inspection by Cho et al. (1) for cycling
behavior]. The area under the ROC curve is 0.82. The area under the
curve highlighted in green demonstrates selection of genes from S phase
that overlap with other clusters least. At the shown distance
threshold, 66 genes from the Cho determined S-phase cluster are
selected, and the overlap with only non-S-phase genes. (A)
Right: correlation distance histograms illustrating the distribution of
distances to the center of the S-phase cluster for non-cluster members
(bottom/blue) and for all S-phase cluster members (top/red). (B)
Expression trajectories for the 74 genes in the S-phase cluster,
highlighting in green cluster members represented by the green
highlight in (A). (C) Expression trajectories for all genes
outside the S-phase cluster of the Cho clustering highlighting in green
non-cluster members represented by the green highlight in (A).
(Click image to enlarge)
PCA, ROC plots and trajectory summary views of clusters from the Cho
classification and an unsupervised clustering (EM MoDG) of an
Affymetrix yeast cell cycle time course (1). The top panel for each
clustering results shows cluster means projected into the top two
dimensions of the principle component space defined by the expression
data (capturing 64% of the variance). The area of the marker size for
each cluster is proportional to the number of genes in each cluster.
Below are ROC curves (left) and trajectory summaries (right) for each
cluster. The trajectory summaries display every gene's expression
profile within a cluster as a blue line with time along the x-axis and expression along the y-axis.
The red line within each trajectory summary represents the mean
expression level for the cluster. ROC area values are displayed within
the ROC curve for each cluster. The background colors for the
trajectory summaries and the PCA projection have been matched within
each clustering result. In addition, LA was used to find the optimal
mapping of clusters between the Cho classification and the EM MoDG
result and the colors have been set accordingly.
(Click image to enlarge)
PCA, ROC plots and trajectory summary views of clusters from the
Fourier classification and an unsupervised clustering (Xclustagglom)
results from the ratiometric yeast cell cycle time course (16). Details
of the figure layout are the same as for Figure 5. Only the six largest
clusters are shown in the Xclustagglom. Clusters that do not have an
optimal pairing by LA with a Fourier cluster are colored black. Note
that PCA summary calls attention to the low quality of the S/XC3
pairing, places it between XC5 and XC1.
(Click image to enlarge)
Selected confusion array cells from Figure 1 highlighting cluster
membership differences for genes with peak expression during the G1 and S phases of the cell cycle. The trajectory summaries display an expression profile for every gene with time along the x-axis and expression along the y-axis.
Blue trajectory summaries show parent clustering results (EM MoDG along
the columns, the Cho classification along the rows). Intersection cells
from the confusion array are shown in red. Mean vectors for each gene
set are shown in black with error bars proportional to the standard
deviation. The total number of gene expression vectors in each cell is
shown in parentheses. (A) G1 genes are subdivided
differently by the two algorithms. EM MoDG separates genes upregulated
only during the second phase of the cell cycle from those upregulated
during both the first and second cycles. The Cho classification
separates G1 based primarily on peak time in the second
cycle. Figure 8 illustrates these observed kinetic distinctions being a
result of these genes belonging to distinct regulatory modules. (B)
Detailed comparison from the confusion array of Figure 1 showing the
S-phase cluster of the Cho classification is subdivided nearly equally
among EM2, EM3 (optimal match by LA) clusters.
(Click image to enlarge)
Integrating expression data, regulatory motif conservation and protein–DNA binding information. (A) Binding site enrichment in genes from the four confusion matrix cells of Figure 7 that dissect genes in the G1
cell cycle phase. Shown in red are the observed number of genes with a
MCS score above threshold for each motif. Shown in blue are the number
of genes expected by chance, as computed by bootstrap simulations. The
total number of genes each cell contains is in the upper left. (B–D)
Heat-map displays showing expression data on the left, followed by MCS
scores for a specified motif, followed by in vivo protein–DNA
binding data for transcription factors implicated in binding to the
specified consensus. Color scales for each panel are at the bottom of
the figure. For the MCS scores, the color map ranges from 0 to the 99th
percentile to minimize the influence of extreme outliers on
interpretation. (B) Shown are 14 genes that fall within the EM1/Early G1
intersection cell and have a conserved enrichment in the presence of
the SWI5 consensus as measured by MCS scores (see Methods; Equations
4–9) (C) Shown are 79 genes that fall within EM2/Late G1 intersection cell and have a high MCS score for MCB. (D) Shown are 20 genes that fall within EM2/Late G1
intersection cell and have a high MCS score for SCB. In each heat-map
genes are ordered by decreasing MCS score. Significant correlation can
be seen between a high MCS score, protein–DNA binding and the expected
(Click image to enlarge)