Mathematical tools for organizing and quantifying microarray clusterings: confusion matrices and comparative metrics
A confusion matrix can effectively summarize all pairwise intersectionsbetween all clusters from any two clusterings of the same data.Confusion matrices, as used in this work, are defined as thematrix of cardinalities of all pairwise intersections betweentwo different expression clusterings (see Methods). Once theconfusion matrix is constructed, we can then apply differentscoring functions to the confusion matrix to quantify similarity:(i) NMI measures the amount of information shared between twoclusterings (7); and (ii) LA optimizes the number of data vectorsin clusters that correspond to each other, thereby identifyingthe optimal pairing of clusters. LA also reports the percentageof data vectors contained within those clusters, and this canbe used to assess similarity of results globally over the entiredataset and locally on a cluster pair by cluster pair basis(8,9) (for mathematical descriptions of confusion matrices,NMI and LA, see Methods). The combined use of LA and NMI metricscan provide a biologist with immediate insight into the magnitudeand nature of global differences between two microarray clusteringsby capitalizing on the fact that NMI is asymmetric and LA issymmetric (see Table 1 for details). Specifically, this discriminatesinstances in which two clusterings are very similar to eachother from a comparison in which one clustering is differentfrom the other, but is essentially a refinement of the first.Both of these can be discriminated from the third relationship,in which two clusterings deliver fundamentally different viewsof the data structure.
Confusion arrays organize and display comparative analyses.Given two different clusterings of a dataset and a global evaluationof their similarity via NMI and LA, we then needed a methodto systematically compare clusters in a manner that is moreeffective and intuitive than mere inspection of gene lists.To do this, we define the confusion array, which is a directextension of a formal confusion matrix. For two different clusterings,each cell in the confusion array contains the intersection setbetween the two parent clusters (as opposed to the cardinalityof this set, as in a confusion matrix; see Methods). In thecontext of the CompClust system, the confusion array cells canthen be interactively mined. Confusion arrays for two differentclusterings, one using an Affymetrix yeast cell cycle dataset(1) and the other using a deposition ratiometric dataset (16),are shown in Figures 1 and 2, and are analyzed further below.
Understanding cluster relatedness
ROC measures cluster overlap. Whatever algorithm has been usedto cluster data, it is useful to find out how distinct eachcluster is from all the others and how distinct any particularcluster is from another specific cluster. This is especiallypertinent when membership in a cluster will be translated intoa gene list that ultimately becomes a functional annotationor defines genes that will be input into higher-order analyses.To address this issue, we applied classical ROC analysis (seeMethods). In this context, cluster assignment is used as the‘diagnosis’ and the distance of each expressionvector from the cluster mean vector is the ‘decision criterion’.The corresponding ROC curve plots the proportion of clustermembers versus the proportion of non-cluster members as thedistance from the cluster centroid increases (Figure 3). Thiscan be interpreted geometrically as expansion of a hyperspherefrom the cluster centroid until all members of the cluster areenclosed. Thus, when one cluster is completely separate fromall other data, all of its members are closer to the clustercenter than all non-members and the area under the ROC curveis 1.0 (Figure 3B). When a cluster is not fully separable fromthe remainder of the data, the ROC curve rises more slowly andthe area under the ROC curve is <1.0. In the limit, whenthe two classes are perfectly mixed, the ROC curve closely followsX = Y, and the area under the curve drops to 0.5 (Figure 3D).The shape of the ROC curve also contains additional informationabout how cluster overlap is distributed, and this informationcan be used by the biologist to choose useful data mining cut-offsthat mark discontinuities and cluster substructure (see belowand Figure 4). It can also be used interactively within CompClustto explore and select data vectors (genes) that are closer ormore distant from the cluster center. Selection of vectors notassigned to the cluster, yet positioned at overlapping distancesfrom its center, is also possible and is often instructive (Figure 4and text below).
Comparing clusterings of yeast cell cycle microarray datasets
We performed comparative analyses on clustering results fromtwo different yeast microarray time-course datasets (one Affymetrixand one ratiometric), each composed of genes that are differentiallyexpressed over the cell cycle (1,16). These comparisons providea useful perspective because the gene classification resultsfrom the original gene clusterings have been introduced as geneannotations in widely used databases [Incytes' YPD (19), SGD(http://yeastgenome.org)] and have been mined or used as startingpoints in many subsequent works. We generated a new clusteringfor each dataset, in each instance selecting an algorithm thatdiffers substantially from the one used in the original publicationbut one that should also be entirely appropriate for the dataset.For the Cho et al. (1) dataset, we used EM MoDGs (17), whichis an unsupervised method that searches for the best statisticalfit to the data modeled as a mixture of Gaussian distributions.The heuristic used in the original report (1) is a supervisedmethod based on biologist's knowledge of cell cycle phases.The heuristic focused on the time of peak expression for eachgene trajectory to guide assignment of each gene to one of thefive time domains associated with Early G1, Late G1, S, G2 andM phases of the cell cycle. For the second dataset (16), weperformed an agglomerative phylogenic hierarchical clusteringof the tsCDC15-mutant synchronized data. This algorithm is basedon the widely used Xclust phylogenetic ordering algorithm (2),onto which we grafted an agglomeration step designed to establishobjective boundaries in the tree (see Methods). This effectivelyturns an ordering algorithm into a clustering algorithm, inwhich group boundaries are imposed computationally rather thanby a user's pattern recognition skills, as is performed withXclust by itself. This result was compared with the result reportedby Spellman et al. (16), in which they used a Fourier transform-basedalgorithm, which was designed to take maximal advantage of thetime-course pattern.Global similarity measures
Comparison of the two clusterings of Affymetrix data from Choet al. (1) gave a global LA score of 0.63 and NMI scores of0.52 and 0.50, immediately indicating that EM MoDG and the heuristicclassification have produced substantially different results.The LA value of 0.63 says that the optimal pairing of clustersstill classifies 37% of the genes differently between the twoalgorithms. ROC curves and ROC areas were generated for eachcluster (Figure 5). Viewed in aggregate, this ROC analysis showedthat clusters from EM MoDG are all better separated from eachother than are any clusters from the original Cho et al. (1)heuristic. Thus, the ROC indices for EM MoDG are all 0.96 orabove, and four of the five clusters are >0.98. In contrast,the heuristic classification groups had ROC values as low as0.82 for S phase and none was better than 0.97 (M phase). Bythis criterion, we can argue that EM clustering is an objectivelysuperior representation of the underlying data structure.
How are these differences between clustering results distributedover the dataset? We used PCA (Principle Component Analysis)to determine whether the two clusterings were globally similaror different in the way they partitioned the dataspace. PCAprojects high-dimensional gene expression vectors (each dimensionhere corresponding to a different RNA sample) into a differentand low-dimensional space (usually two or three) (20), in whichthe new PCA dimensions have each been selected to explain themaximal amount of variance in the data. A common feature ofmicroarray datasets is that the first few principle componentsoften capture most of the variations in the data (here 64%).Using CompClust to view the cluster means in PCA space allowedus to assess relationships between clusters from the two algorithms.Relative positions of cluster means in the PCA display the cellcycle progression in a counterclockwise pattern that is quitesimilar for the two algorithms. The absolute positions of thecluster centers in PCA space differ, though not extravagantly,for most clusters. This is interesting because the coherencein overall structure would seem to contradict the rather highdissimilarities in cluster composition measured by the criteriaLA and NMI, and shown graphically in the confusion array (Figure 1).Considered together, the results argue that the overalldata structure, reflecting phases of the cell cycle, is robustand has been treated rather similarly by the two algorithms,even though 37% of individual gene expression vectors were assigneddifferently. This raises the question of which gene vectorshave been differentially assigned and what biological meaning,if any, should be attached to the differences. These questionsare addressed in the Discussion by examining specific gene groupsin the confusion array.
Using the ratiometric data of Spellman et al. (16), a comparisonof the original Fourier-based algorithm versus agglomeratedXclust produced NMI and LA scores of 0.39, 0.41 and 0.60, respectively.These scores indicate that the two clusterings are even moredifferent in membership assignment, with 40% of genes fallingoutside the optimal LA pairing. Since both NMI and LA scoresare low, gene memberships for some clusters must be truly scrambledrather than being simple combinations of cluster unions andsubsets (see Methods and Table 1). PCA projection (Figure 6)showed that some major cluster centers from the two algorithmsare positioned very differently, both absolutely and relatively(note the yellow cluster corresponding to the Fourier S-phasegroup). The confusion array shows that XclustAgglom clustersoften combine genes that are members of adjacent Fourier clusters,though in some cases it joins vectors from non-adjacent groups(the confusion around S phase is complex). ROC curves and scoresalso indicate that XclustAgglom has performed a slightly betterjob of segregating data into discrete groups that reflect underlyingdata structure, while the Fourier analysis groups are less coherentand often seem to mix members of kinetically adjacent groupsas detailed in the confusion array (Figures 2 and 6). This maybe due, in part, to the use of a small number of then knowngenes to center landmark phases by the Fourier algorithm. Thefact that this phase assignment was a ‘somewhat arbitrary’step in the original analysis was pointed out by Spellman etal. (16).
High-resolution cluster comparisons
Confusion arrays can be used to explore in more detail the issuesraised by global analyses and to mine relationships betweenindividual clusters. This can then be used to make refined genelists based on either the expert opinion or on the applicationof computationally objective criteria. We applied LA to theconfusion matrix of the Cho heuristic and the EM MoDG resultsand produced the corresponding adjacency matrix (see Methodsand Equation 3). This delivered an objectively optimized pairingof EM cluster 1 with Cho ‘Early G1’, EM cluster2 with Cho ‘Late G1’ and so on, as shown in thearray visualization (Figure 1). Each cell in the confusion arraycontains the corresponding gene vectors and displays the calculatedmean vector for each intersect cell in the array.
The confusion array highlighted relationships that were notclear from Figures 5 or 6. For example, in the Affymetrix data(1), both algorithms identified two gene classes within G1 (redand yellow, respectively, in the PCA analysis of Figure 5).However, the EM1 cluster shares only 67% of its content withthe Cho ‘Early G1’, and most remaining genes fallinto the Cho ‘Late G1’ cluster (Figure 7A). A straightforwardhypothesis is that the statistical EM algorithm simply couldnot justify dividing G1 vectors into early and Late G1 kineticgroups as the heuristic had done. The confusion array, however,makes it clear at a glance that a different data feature isdriving the G1 sub-groupings. EM1 cluster members are upregulatedonly in the second cycle, while EM2 genes are upregulated inboth cycles. The array also shows that the Cho Early G1 groupcontains a set of 10 genes that appear much more consistentwith a coherent M-phase group that corresponds to EM5.
Because the focus of the heuristic classification was mainlyon the second oscillation, it suppressed the distinction betweensingle cycle and two cycle G1 patterns, while ‘payingmore attention’ to fine-structure kinetic differencesof the second cycle. On the other hand, EM MoDG treats all featureswith equal weight across the time course and so centers clusterswithout prior guidance about their relationship to cell cyclephase. The confusion array intersect cells then effectivelyparsed the fine kinetic differences within EM1 by separating47 vectors that more closely resemble the Early G1 cluster from18 that are more like Late G1 cluster. Thus, the intersect cellscapture and dissect the two distinct ways used by the two algorithmsto parse G1 expression. Both appear valid and reveal in differentways, one based entirely on the kinetics of the second cycleand one focusing on a major difference in the expression thatis seen only in the first cycle following release from arrest.
Turning to the S-phase clusters, the comparative analysis highlightsa different kind of disagreement between the two algorithm outputs.Members from Cho ‘S-phase’ cluster overlap almostevenly with either EM2 (41%) or EM3 (49%). A simple biologicalinterpretation is that the kinetic boundary between Late G1and S-phase is not very crisp, irrespective of the algorithmused to try to define them. An alternate explanation is thatone algorithm is frankly superior to the other in defining coherentexpression groups. Examination of the ROC curves (Figure 5)and values (0.98 for EM versus 0.82 for S phase) provided anobjective measure of quality that argues the EM clustering ofS phase is superior.
Dissecting individual clusters using ROC
Further ROC-based analysis of the S-phase cluster from the Choet al. (1) classification is shown in Figure 4. The ROC curveshows how far from the cluster mean one needs to expand a hypersphereto include a given fraction of vectors from the cluster, andthe shape of the curve can be used to understand cluster substructure.Inspection of the ROC curve and the corresponding histogram(Figure 4A) identified a natural discontinuity separating thefirst 66% of genes that are nearer the cluster center from theremainder. For additional data mining, we therefore set a boundaryat 66% on the ROC curve and then inspected all gene vectorsfrom the entire cell cycle dataset that fall within that boundary.Approximately 20% of gene vectors inside this dataspace thresholdhad been assigned to other clusters. Figure 4B and C allowsinspection of gene trajectories that were either interior orexterior to the boundary. This is useful for reviewing and ‘pruning’lists of putatively co-expressed genes in an objective manner.
Integration with transcription factor motifs to identify regulatory modules
CompClust is designed to integrate different kinds of data bylinking each gene with other data, annotations and results ofmeta-analyses. There are many ways of using other datasets toidentify relationships between, for example, observed patternsof RNA co-expression and other data that help to answer thequestion: are similarly expressed genes co-regulated? A groupof genes that are co-expressed may also be co-regulated, butthis is far from assured. Co-expressed genes can instead arriveat the same expression pattern by the action of two (or more)different regulators. Conversely, genes that are co-regulatedby the same factor(s) at the transcriptional level may not displayidentical RNA expression patterns for a variety of reasons,including differential turnover rates. For these reasons, additionalkinds of data are needed to help determine the co-expressedgenes that are, in fact, transcriptionally co-regulated andto provide evidence for the identity of factor(s) driving co-regulation.Here, we show how the occurrence of evolutionarily persistenttranscription factor binding sites can be mapped informativelyonto the gene expression clusters from a confusion array topredict the structure of transcription modules.
The observation of two distinct sets of genes, one that peaksduring both the first and second cell cycles after release fromarrest, and another restricted to only the second oscillation(Figure 7), suggests that they might be regulated differentlyat the level of transcription. Prior work has led to the viewthat MCB and SCB sequence motifs bind Mbp1/Swi6 (MCF) or Swi4/Swi6(SCF) factor complexes to drive G1-specific transcription (21–23).Thus, many genes are believed to be selectively and specificallyexpressed in G1 owing to their membership in either MCF or SCFregulatory modules. The two modules are also thought to be partlydistinct from each other, with some genes apparently being stronglygoverned by either Swi4 or Mbp1 [(24,25) and reviewed in (26)].
We, therefore, calculated an MCS (see Methods) to quantify theconserved enrichment of a consensus site within 1 kb of thestart ATG in sequence data from the seven available yeast genomes(27,28). We then asked whether different intersecting cellswithin the confusion array are differentially and significantlyenriched for these known candidate motifs. The EM2/Late G1 intersectcell was highly enriched, above chance, for MCB and SCB. A totalof 79 of 113 genes (70%) were enriched for MCB compared withthe expectation of 13 such genes for randomly selected samplesof 113 yeast genes. A total of 18% are enriched for SCB sitescompared with an expectation of only 4% by chance (Figure 8A).CompClust's data linking capabilities were then used to visualizecorrelations with in vivo protein–DNA binding data forSwi4 and Mbp1 (29). The vast majority of genes with the abovethreshold MCB or SCB MCS scores also showed significant in vivobinding activity for either MCF or SCF.
The picture for the EM1/Early G1 intersect cell, whose genespeak only once during the time course, was surprisingly different.This group showed no significant enrichment for either MCB orSCB (Figure 8A). What other factor(s) could be responsible forthe EM1/Early G1 intersect pattern? We searched and found thatthe Swi5/Ace2 motif is enriched so that 30% are above threshold,a value twice that expected by chance. The highest Swi5 MCSscores correlated strongly with very intense expression in thesecond cycle. This, in turn, correlated well with in vivo factorbinding by both Swi5 and Ace2 taken from the chromatin immunoprecipitationdata of Lee et al. (29). Thus, 60% of the EM1/Early G1 Swi5/Ace2group had P-values <0.05 for Swi5 or Ace2 in the global chromatinimmunoprecipitation study of Lee et al. (29), and others wererelatively strong binders as shown in Figure 8. That most orall of these connections between Swi5/Ace2 and EM1 genes, inferredfrom three kinds of genome scale data are real is supportedby the fact that most previously identified targets of Ace2and/or Swi5 (30,31) that were in the original Cho cycling datasetare in this group.