**Mathematical tools for organizing and quantifying microarray clusterings: confusion matrices and comparative metrics**

A confusion matrix can effectively summarize all pairwise intersections^{}between all clusters from any two clusterings of the same data.^{}Confusion matrices, as used in this work, are defined as the^{}matrix of cardinalities of all pairwise intersections between^{}two different expression clusterings (see Methods). Once the^{}confusion matrix is constructed, we can then apply different^{}scoring functions to the confusion matrix to quantify similarity:^{}(i) NMI measures the amount of information shared between two^{}clusterings (7); and (ii) LA optimizes the number of data vectors^{}in clusters that correspond to each other, thereby identifying^{}the optimal pairing of clusters. LA also reports the percentage^{}of data vectors contained within those clusters, and this can^{}be used to assess similarity of results globally over the entire^{}dataset and locally on a cluster pair by cluster pair basis^{}(8,9) (for mathematical descriptions of confusion matrices,^{}NMI and LA, see Methods). The combined use of LA and NMI metrics^{}can provide a biologist with immediate insight into the magnitude^{}and nature of global differences between two microarray clusterings^{}by capitalizing on the fact that NMI is asymmetric and LA is^{}symmetric (see Table 1 for details). Specifically, this discriminates^{}instances in which two clusterings are very similar to each^{}other from a comparison in which one clustering is different^{}from the other, but is essentially a refinement of the first.^{}Both of these can be discriminated from the third relationship,^{}in which two clusterings deliver fundamentally different views^{}of the data structure.

Confusion arrays organize and display comparative analyses.^{}Given two different clusterings of a dataset and a global evaluation^{}of their similarity via NMI and LA, we then needed a method^{}to systematically compare clusters in a manner that is more^{}effective and intuitive than mere inspection of gene lists.^{}To do this, we define the confusion array, which is a direct^{}extension of a formal confusion matrix. For two different clusterings,^{}each cell in the confusion array contains the intersection set^{}between the two parent clusters (as opposed to the cardinality^{}of this set, as in a confusion matrix; see Methods). In the^{}context of the CompClust system, the confusion array cells can^{}then be interactively mined. Confusion arrays for two different^{}clusterings, one using an Affymetrix yeast cell cycle dataset^{}(1) and the other using a deposition ratiometric dataset (16),^{}are shown in Figures 1 and 2, and are analyzed further below.^{}

**Understanding cluster relatedness**

ROC measures cluster overlap. Whatever algorithm has been used^{}to cluster data, it is useful to find out how distinct each^{}cluster is from all the others and how distinct any particular^{}cluster is from another specific cluster. This is especially^{}pertinent when membership in a cluster will be translated into^{}a gene list that ultimately becomes a functional annotation^{}or defines genes that will be input into higher-order analyses.^{}To address this issue, we applied classical ROC analysis (see^{}Methods). In this context, cluster assignment is used as the^{}‘diagnosis’ and the distance of each expression^{}vector from the cluster mean vector is the ‘decision criterion’.^{}The corresponding ROC curve plots the proportion of cluster^{}members versus the proportion of non-cluster members as the^{}distance from the cluster centroid increases (Figure 3). This^{}can be interpreted geometrically as expansion of a hypersphere^{}from the cluster centroid until all members of the cluster are^{}enclosed. Thus, when one cluster is completely separate from^{}all other data, all of its members are closer to the cluster^{}center than all non-members and the area under the ROC curve^{}is 1.0 (Figure 3B). When a cluster is not fully separable from^{}the remainder of the data, the ROC curve rises more slowly and^{}the area under the ROC curve is <1.0. In the limit, when^{}the two classes are perfectly mixed, the ROC curve closely follows^{}*X* = *Y*, and the area under the curve drops to 0.5 (Figure 3D).^{}The shape of the ROC curve also contains additional information^{}about how cluster overlap is distributed, and this information^{}can be used by the biologist to choose useful data mining cut-offs^{}that mark discontinuities and cluster substructure (see below^{}and Figure 4). It can also be used interactively within CompClust^{}to explore and select data vectors (genes) that are closer or^{}more distant from the cluster center. Selection of vectors not^{}assigned to the cluster, yet positioned at overlapping distances^{}from its center, is also possible and is often instructive (Figure 4^{}and text below).

**Comparing clusterings of yeast cell cycle microarray datasets**

We performed comparative analyses on clustering results from^{}two different yeast microarray time-course datasets (one Affymetrix^{}and one ratiometric), each composed of genes that are differentially^{}expressed over the cell cycle (1,16). These comparisons provide^{}a useful perspective because the gene classification results^{}from the original gene clusterings have been introduced as gene^{}annotations in widely used databases [Incytes' YPD (19), SGD^{}(http://yeastgenome.org)] and have been mined or used as starting^{}points in many subsequent works. We generated a new clustering^{}for each dataset, in each instance selecting an algorithm that^{}differs substantially from the one used in the original publication^{}but one that should also be entirely appropriate for the dataset.^{}For the Cho *et al*. (1) dataset, we used EM MoDGs (17), which^{}is an unsupervised method that searches for the best statistical^{}fit to the data modeled as a mixture of Gaussian distributions.^{}The heuristic used in the original report (1) is a supervised^{}method based on biologist's knowledge of cell cycle phases.^{}The heuristic focused on the time of peak expression for each^{}gene trajectory to guide assignment of each gene to one of the^{}five time domains associated with Early G_{1}, Late G_{1}, S, G_{2} and^{}M phases of the cell cycle. For the second dataset (16), we^{}performed an agglomerative phylogenic hierarchical clustering^{}of the tsCDC15-mutant synchronized data. This algorithm is based^{}on the widely used Xclust phylogenetic ordering algorithm (2),^{}onto which we grafted an agglomeration step designed to establish^{}objective boundaries in the tree (see Methods). This effectively^{}turns an ordering algorithm into a clustering algorithm, in^{}which group boundaries are imposed computationally rather than^{}by a user's pattern recognition skills, as is performed with^{}Xclust by itself. This result was compared with the result reported^{}by Spellman *et al*. (16), in which they used a Fourier transform-based^{}algorithm, which was designed to take maximal advantage of the^{}time-course pattern.^{}**Global similarity measures**

Comparison of the two clusterings of Affymetrix data from Cho^{}*et al*. (1) gave a global LA score of 0.63 and NMI scores of^{}0.52 and 0.50, immediately indicating that EM MoDG and the heuristic^{}classification have produced substantially different results.^{}The LA value of 0.63 says that the optimal pairing of clusters^{}still classifies 37% of the genes differently between the two^{}algorithms. ROC curves and ROC areas were generated for each^{}cluster (Figure 5). Viewed in aggregate, this ROC analysis showed^{}that clusters from EM MoDG are all better separated from each^{}other than are any clusters from the original Cho *et al*. (1)^{}heuristic. Thus, the ROC indices for EM MoDG are all 0.96 or^{}above, and four of the five clusters are >0.98. In contrast,^{}the heuristic classification groups had ROC values as low as^{}0.82 for S phase and none was better than 0.97 (M phase). By^{}this criterion, we can argue that EM clustering is an objectively^{}superior representation of the underlying data structure.

How are these differences between clustering results distributed^{}over the dataset? We used PCA (Principle Component Analysis)^{}to determine whether the two clusterings were globally similar^{}or different in the way they partitioned the dataspace. PCA^{}projects high-dimensional gene expression vectors (each dimension^{}here corresponding to a different RNA sample) into a different^{}and low-dimensional space (usually two or three) (20), in which^{}the new PCA dimensions have each been selected to explain the^{}maximal amount of variance in the data. A common feature of^{}microarray datasets is that the first few principle components^{}often capture most of the variations in the data (here 64%).^{}Using CompClust to view the cluster means in PCA space allowed^{}us to assess relationships between clusters from the two algorithms.^{}Relative positions of cluster means in the PCA display the cell^{}cycle progression in a counterclockwise pattern that is quite^{}similar for the two algorithms. The absolute positions of the^{}cluster centers in PCA space differ, though not extravagantly,^{}for most clusters. This is interesting because the coherence^{}in overall structure would seem to contradict the rather high^{}dissimilarities in cluster composition measured by the criteria^{}LA and NMI, and shown graphically in the confusion array (Figure 1).^{}Considered together, the results argue that the overall^{}data structure, reflecting phases of the cell cycle, is robust^{}and has been treated rather similarly by the two algorithms,^{}even though 37% of individual gene expression vectors were assigned^{}differently. This raises the question of which gene vectors^{}have been differentially assigned and what biological meaning,^{}if any, should be attached to the differences. These questions^{}are addressed in the Discussion by examining specific gene groups^{}in the confusion array.^{}
Using the ratiometric data of Spellman *et al*. (16), a comparison^{}of the original Fourier-based algorithm versus agglomerated^{}Xclust produced NMI and LA scores of 0.39, 0.41 and 0.60, respectively.^{}These scores indicate that the two clusterings are even more^{}different in membership assignment, with 40% of genes falling^{}outside the optimal LA pairing. Since both NMI and LA scores^{}are low, gene memberships for some clusters must be truly scrambled^{}rather than being simple combinations of cluster unions and^{}subsets (see Methods and Table 1). PCA projection (Figure 6)^{}showed that some major cluster centers from the two algorithms^{}are positioned very differently, both absolutely and relatively^{}(note the yellow cluster corresponding to the Fourier S-phase^{}group). The confusion array shows that XclustAgglom clusters^{}often combine genes that are members of adjacent Fourier clusters,^{}though in some cases it joins vectors from non-adjacent groups^{}(the confusion around S phase is complex). ROC curves and scores^{}also indicate that XclustAgglom has performed a slightly better^{}job of segregating data into discrete groups that reflect underlying^{}data structure, while the Fourier analysis groups are less coherent^{}and often seem to mix members of kinetically adjacent groups^{}as detailed in the confusion array (Figures 2 and 6). This may^{}be due, in part, to the use of a small number of then known^{}genes to center landmark phases by the Fourier algorithm. The^{}fact that this phase assignment was a ‘somewhat arbitrary’^{}step in the original analysis was pointed out by Spellman *et*^{}al. (16).

**High-resolution cluster comparisons**

Confusion arrays can be used to explore in more detail the issues^{}raised by global analyses and to mine relationships between^{}individual clusters. This can then be used to make refined gene^{}lists based on either the expert opinion or on the application^{}of computationally objective criteria. We applied LA to the^{}confusion matrix of the Cho heuristic and the EM MoDG results^{}and produced the corresponding adjacency matrix (see Methods^{}and Equation 3). This delivered an objectively optimized pairing^{}of EM cluster 1 with Cho ‘Early G_{1}’, EM cluster^{}2 with Cho ‘Late G_{1}’ and so on, as shown in the^{}array visualization (Figure 1). Each cell in the confusion array^{}contains the corresponding gene vectors and displays the calculated^{}mean vector for each intersect cell in the array.^{}
The confusion array highlighted relationships that were not^{}clear from Figures 5 or 6. For example, in the Affymetrix data^{}(1), both algorithms identified two gene classes within G_{1} (red^{}and yellow, respectively, in the PCA analysis of Figure 5).^{}However, the EM1 cluster shares only 67% of its content with^{}the Cho ‘Early G_{1}’, and most remaining genes fall^{}into the Cho ‘Late G_{1}’ cluster (Figure 7A). A straightforward^{}hypothesis is that the statistical EM algorithm simply could^{}not justify dividing G_{1} vectors into early and Late G_{1} kinetic^{}groups as the heuristic had done. The confusion array, however,^{}makes it clear at a glance that a different data feature is^{}driving the G_{1} sub-groupings. EM1 cluster members are upregulated^{}only in the second cycle, while EM2 genes are upregulated in^{}both cycles. The array also shows that the Cho Early G_{1} group^{}contains a set of 10 genes that appear much more consistent^{}with a coherent M-phase group that corresponds to EM5.

Because the focus of the heuristic classification was mainly^{}on the second oscillation, it suppressed the distinction between^{}single cycle and two cycle G_{1} patterns, while ‘paying^{}more attention’ to fine-structure kinetic differences^{}of the second cycle. On the other hand, EM MoDG treats all features^{}with equal weight across the time course and so centers clusters^{}without prior guidance about their relationship to cell cycle^{}phase. The confusion array intersect cells then effectively^{}parsed the fine kinetic differences within EM1 by separating^{}47 vectors that more closely resemble the Early G_{1} cluster from^{}18 that are more like Late G_{1} cluster. Thus, the intersect cells^{}capture and dissect the two distinct ways used by the two algorithms^{}to parse G_{1} expression. Both appear valid and reveal in different^{}ways, one based entirely on the kinetics of the second cycle^{}and one focusing on a major difference in the expression that^{}is seen only in the first cycle following release from arrest.^{}

Turning to the S-phase clusters, the comparative analysis highlights^{}a different kind of disagreement between the two algorithm outputs.^{}Members from Cho ‘S-phase’ cluster overlap almost^{}evenly with either EM2 (41%) or EM3 (49%). A simple biological^{}interpretation is that the kinetic boundary between Late G_{1}^{}and S-phase is not very crisp, irrespective of the algorithm^{}used to try to define them. An alternate explanation is that^{}one algorithm is frankly superior to the other in defining coherent^{}expression groups. Examination of the ROC curves (Figure 5)^{}and values (0.98 for EM versus 0.82 for S phase) provided an^{}objective measure of quality that argues the EM clustering of^{}S phase is superior.^{}

**Dissecting individual clusters using ROC**

Further ROC-based analysis of the S-phase cluster from the Cho^{}*et al*. (1) classification is shown in Figure 4. The ROC curve^{}shows how far from the cluster mean one needs to expand a hypersphere^{}to include a given fraction of vectors from the cluster, and^{}the shape of the curve can be used to understand cluster substructure.^{}Inspection of the ROC curve and the corresponding histogram^{}(Figure 4A) identified a natural discontinuity separating the^{}first 66% of genes that are nearer the cluster center from the^{}remainder. For additional data mining, we therefore set a boundary^{}at 66% on the ROC curve and then inspected all gene vectors^{}from the entire cell cycle dataset that fall within that boundary.^{}Approximately 20% of gene vectors inside this dataspace threshold^{}had been assigned to other clusters. Figure 4B and C allows^{}inspection of gene trajectories that were either interior or^{}exterior to the boundary. This is useful for reviewing and ‘pruning’^{}lists of putatively co-expressed genes in an objective manner.^{}

**Integration with transcription factor motifs to identify regulatory modules**

CompClust is designed to integrate different kinds of data by^{}linking each gene with other data, annotations and results of^{}meta-analyses. There are many ways of using other datasets to^{}identify relationships between, for example, observed patterns^{}of RNA co-expression and other data that help to answer the^{}question: are similarly expressed genes co-regulated? A group^{}of genes that are co-expressed may also be co-regulated, but^{}this is far from assured. Co-expressed genes can instead arrive^{}at the same expression pattern by the action of two (or more)^{}different regulators. Conversely, genes that are co-regulated^{}by the same factor(s) at the transcriptional level may not display^{}identical RNA expression patterns for a variety of reasons,^{}including differential turnover rates. For these reasons, additional^{}kinds of data are needed to help determine the co-expressed^{}genes that are, in fact, transcriptionally co-regulated and^{}to provide evidence for the identity of factor(s) driving co-regulation.^{}Here, we show how the occurrence of evolutionarily persistent^{}transcription factor binding sites can be mapped informatively^{}onto the gene expression clusters from a confusion array to^{}predict the structure of transcription modules.^{}

The observation of two distinct sets of genes, one that peaks^{}during both the first and second cell cycles after release from^{}arrest, and another restricted to only the second oscillation^{}(Figure 7), suggests that they might be regulated differently^{}at the level of transcription. Prior work has led to the view^{}that MCB and SCB sequence motifs bind Mbp1/Swi6 (MCF) or Swi4/Swi6^{}(SCF) factor complexes to drive G_{1}-specific transcription (21–23).^{}Thus, many genes are believed to be selectively and specifically^{}expressed in G_{1} owing to their membership in either MCF or SCF^{}regulatory modules. The two modules are also thought to be partly^{}distinct from each other, with some genes apparently being strongly^{}governed by either Swi4 or Mbp1 [(24,25) and reviewed in (26)].^{}

We, therefore, calculated an MCS (see Methods) to quantify the^{}conserved enrichment of a consensus site within 1 kb of the^{}start ATG in sequence data from the seven available yeast genomes^{}(27,28). We then asked whether different intersecting cells^{}within the confusion array are differentially and significantly^{}enriched for these known candidate motifs. The EM2/Late G_{1} intersect^{}cell was highly enriched, above chance, for MCB and SCB. A total^{}of 79 of 113 genes (70%) were enriched for MCB compared with^{}the expectation of 13 such genes for randomly selected samples^{}of 113 yeast genes. A total of 18% are enriched for SCB sites^{}compared with an expectation of only 4% by chance (Figure 8A).^{}CompClust's data linking capabilities were then used to visualize^{}correlations with *in vivo* protein–DNA binding data for^{}Swi4 and Mbp1 (29). The vast majority of genes with the above^{}threshold MCB or SCB MCS scores also showed significant *in vivo*^{}binding activity for either MCF or SCF.

The picture for the EM1/Early G_{1} intersect cell, whose genes^{}peak only once during the time course, was surprisingly different.^{}This group showed no significant enrichment for either MCB or^{}SCB (Figure 8A). What other factor(s) could be responsible for^{}the EM1/Early G_{1} intersect pattern? We searched and found that^{}the Swi5/Ace2 motif is enriched so that 30% are above threshold,^{}a value twice that expected by chance. The highest Swi5 MCS^{}scores correlated strongly with very intense expression in the^{}second cycle. This, in turn, correlated well with *in vivo* factor^{}binding by both Swi5 and Ace2 taken from the chromatin immunoprecipitation^{}data of Lee *et al*. (29). Thus, 60% of the EM1/Early G_{1} Swi5/Ace2^{}group had *P*-values <0.05 for Swi5 or Ace2 in the global chromatin^{}immunoprecipitation study of Lee *et al*. (29), and others were^{}relatively strong binders as shown in Figure 8. That most or^{}all of these connections between Swi5/Ace2 and EM1 genes, inferred^{}from three kinds of genome scale data are real is supported^{}by the fact that most previously identified targets of Ace2^{}and/or Swi5 (30,31) that were in the original Cho cycling dataset^{}are in this group.