Division of Biology, California Institute of Technology Pasadena, CA 91125, USA ^{1}Jet Propulsion Laboratory, Machine Learning Systems Group Pasadena, CA 91109, USA ^{2}Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA 92697, USA ^{3}School of Information and Computer Science, University of California Irvine, Irvine, CA 92697, USA
Nucleic Acids Research 2005 33(8):25802594. Open Access Article.
Analysis of largescale gene expression studies usually begins^{}with gene clustering. A ubiquitous problem is that different^{}algorithms applied to the same data inevitably give different^{}results, and the differences are often substantial, involving^{}a quarter or more of the genes analyzed. This raises a series^{}of important but nettlesome questions: How are different clustering^{}results related to each other and to the underlying data structure?^{}Is one clustering objectively superior to another? Which differences,^{}if any, are likely candidates to be biologically important?^{}A systematic and quantitative way to address these questions^{}is needed, together with an effective way to integrate and leverage^{}expression results with other kinds of largescale data and^{}annotations. We developed a mathematical and computational framework^{}to help quantify, compare, visualize and interactively mine^{}clusterings. We show that by coupling confusion matrices with^{}appropriate metrics (linear assignment and normalized mutual^{}information scores), one can quantify and map differences between^{}clusterings. A version of receiver operator characteristic analysis^{}proved effective for quantifying and visualizing cluster quality^{}and overlap. These methods, plus a flexible library of clustering^{}algorithms, can be called from a new expandable set of software^{}tools called CompClust 1.0 (http://woldlab.caltech.edu/compClust/).^{}CompClust also makes it possible to relate expression clustering^{}patterns to DNA sequence motif occurrences, protein–DNA^{}interaction measurements and various kinds of functional annotations.^{}Test analyses used yeast cell cycle data and revealed data structure^{}not obvious under all algorithms. These results were then integrated^{}with transcription motif and global protein–DNA interaction^{}data to identify G_{1} regulatory modules.
The sources of difference between clustering algorithm outputs^{}are many and varied, and the biological implications are also^{}diverse, as illustrated below for cell cycle data. The general^{}challenge is to detect, measure, evaluate and mine the commonalities^{}and differences. Specifically, the biologist usually wants to^{}first know whether one clustering is objectively and significantly^{}‘better’ than another, and just how big the difference^{}is. If two clusterings are of similar overall quality, yet differ^{}substantially from each other as is often observed, then what^{}specific gene cluster or samples harbor the greatest differences,^{}or are they evenly distributed across the data? At a finer level^{}still, which genes are being assigned to different clusters^{}and why? Importantly, do the distinctions between clusterings^{}highlight properties of biological importance?^{}
To begin to answer such questions, we first needed a way to^{}make systematic quantitative comparisons and then we needed^{}ways to effectively mine the resulting comparisons. We use confusion^{}matrices as the common tool for these comparisons (see below^{}and Methods). A confusion matrix effectively summarizes pairwise^{}intersections between clusters derived from two clustering results.^{}These similarities are quantified by applying scoring functions^{}to the confusion matrix. In this work, we use two different^{}scoring functions for this purpose: (i) normalized mutual information^{}(NMI), which measures the amount of information shared between^{}the two clustering results (7) and; (ii) a linear assignment^{}(LA) method, which quantifies the similarity of two clusterings^{}by finding the optimal pairing of clusters between two clustering^{}results and measuring the degree of agreement across this pairing^{}(8,9). Previous studies have used metrics for evaluating the^{}total number of data point pairs grouped together between two^{}different clusterings to begin to address the need for quantifying^{}overall differences (10–13). BenHur et al. (13) used^{}this to help determine an optimal number of clusters (K) and^{}to assess the overall validity of a clustering. These prior^{}techniques did not, however, offer the capacity to isolate and^{}inspect the similarities and differences between two different^{}clusterings, nor did they provide an interactive interface for^{}biology users that would permit them to usefully capture the^{}comparative differences and similarities. We also introduce^{}a new application of receiver operator characteristic (ROC)^{}analysis (14,15). As we use it here, ROC enables one to quantify^{}the distinctness of a given cluster relative to another cluster^{}or relative to all noncluster members. Implemented in this^{}fashion, ROC provides another measure of local cluster quality^{}and shape, and provides another tool for quantitatively dissecting^{}a cluster. Though the methods and tools were worked out for^{}clusterings of largescale gene expression data, they are applicable^{}to clusterings of other kinds of largescale data as well.^{}
We have integrated the algorithms and comparative tools into^{}an interactive analysis package collectively called CompClust^{}1.0. CompClust enables a user to organize, interrogate and visualize^{}the comparisons. In addition to comparative cluster analysis,^{}an important feature of this software is that it establishes^{}and maintains links between the outputs of clustering analyses^{}and the primary expression data, and, critically, with all other^{}desired annotations. In the sense used here, ‘annotations’^{}include other kinds of primary and metadata of diverse types.^{}This gives a biologist crucial flexibility in data mining and^{}permits analyses that integrate results from other kinds of^{}experiments, such as global protein–DNA interactions (ChIP/Array),^{}protein–protein interactions, comparative genome analysis^{}or information from gene ontologies.^{}
CompClust methods and tools are agnostic about the kinds of^{}microarray data (ratiometric, Affymetrix, etc.) and types of^{}clustering algorithms used. We demonstrate the tools by analyzing^{}two different sets of yeast cell cycle expression data representing^{}both major data platforms, clustered by four different methods:^{}a statistical clustering algorithm [Expectation Maximization^{}of a Mixture of Diagonal Gaussian distributions (EM MoDGs)]^{}(this work), a humandriven heuristic (1), a Fourier transform^{}algorithm designed to take advantage of a periodic timecourse^{}patterns (16) and an agglomerative version of the Xclust phylogenetic^{}ordering algorithm [Eisen et al. (2) modified in this work].^{}We show that gene groups derived from these comparative analyses^{}can be integrated with data on evolutionarily conserved transcription^{}factor binding sites to infer regulatory modules. These results^{}begin to illustrate how a more quantitative and nuanced understanding^{}of both global and local features in the data can be achieved,^{}and how these can be linked with diverse kinds of data types^{}to infer connectivity between regulators and their target gene^{}modules.CompClust is primarily accessible through an application programming^{}interface (API) and, with the use of Python's exposed interpreter,^{}this provides a very rich command line interface (CLI). The^{}major capabilities illustrated in this paper are accessible^{}through CompClustTK, a set of simple graphical user interfaces^{}(GUIs) to offer a convenient starting point without learning^{}Python commands. These GUIs will permit users to perform the^{}major classes of analyses shown, though we note that these comprise^{}only a fraction of CompClust capabilities. The flexibility and^{}diversity of analysis paths is too great to anticipate them^{}all or commit them to GUI forms. This limitation can be overcome^{}by using the Python command line environment. Python commands^{}can be learned at the level needed in a relatively short time^{}(a few weeks of part time effort) by users who do not have prior^{}programming experience. The benefit is access to remarkable^{}flexibility in interrogating datasets. This is a much better^{}match to the diversity of questions and comparisons that biologists^{}usually want to make than in any GUIbased system.^{}
The choice to implement CompClust in Python over other languages^{}was made for several reasons which, considered in aggregate,^{}argue it is the best available language to support the capabilities^{}and analysis goals of CompClust: (i) using Python's exposed^{}interpreter, our API becomes immediately useful for analysis^{}without the construction of a complex GUI. The exposed interpreter^{}also speeds the development time. (ii) Python's syntax is fairly^{}straightforward and easy to learn for even nonprogrammers.^{}(iii) It is freely available and distributable under an opensource^{}license. (iv) Python has an extensive and standard library and^{}in addition third party extensions, including the Numeric package^{}which provides powerful numeric analysis routines. (v) Python^{}is also platform neutral and runs on the majority of systems,^{}including unix/linux, Microsoft Windows and the Mac OS.^{}
Pairwise comparison of clusterings (partitions) using confusion arrays and matrices
Confusion arrays and matrices were used to make pairwise comparisons^{}between different clusterings (mathematical partitions). A set^{}of metrics were then applied to the confusion matrix to measure^{}the nature and degree of similarity between two dataset partitions.^{}Briefly, a confusion matrix is the matrix of cardinalities of^{}all pairwise intersections between two partitions, where a partition^{}of a dataset is defined as a set of disjoint subsets whose union^{}contains all elements of the dataset. We define a confusion^{}array simply as an array of all pairwise intersections between^{}two partitions of a dataset. The cardinalities of these intersection^{}sets form the confusion matrix, whose elements are given by^{}Equation 1:
^{}
(1) 
Linear assignment
The LA value for a confusion matrix is calculated between two^{}partitions (clusterings) and by generating an optimal pairing^{}so that there is, at most, a onetoone pairing between every^{}class in partitions, and this pairing is calculated by optimizing^{}the objective function in Equation 2, using the constraints^{}given in Equation 3, thus defining a linear assignment problem.^{}Next, the maximumcardinality bipartite matching of maximum^{}weights algorithm (Gabow, 1973) was implemented for the optimization.^{}After finding the optimal pairing, the LA score is simply the^{}proportion of vectors (e.g. gene expression trajectories or^{}conditions) included in the optimally paired clusters (Equation 4).^{}It is important to note that LA, unlike NMI, is a symmetric^{}score so that LA(A,B) = LA(B,A). In addition to quantifying^{}the degree of similarity or difference between two partitions,^{}the adjacency matrix (Equation 3) also provides a way to identify^{}pairs of clusters that are globally most similar to each other^{}between two partitions of the data. As illustrated for clusterings^{}of yeast cell cycle regulated genes, this is especially useful^{}for interactive examination of two clusterings.
^{}
(2) 
(3) 
(4) 
Normalized mutual information
The NMI index (7) quantifies how much information is lost, on^{}average, when one clustering is regenerated from a second classification^{}(Equation 5). A noteworthy difference from LA is that NMI is^{}asymmetric.
^{}
(5) 
(6) 
(7) 
(8) 
(9) 
EM MoDG clustering
EM MoDG was implemented with a diagonal covariance matrix model^{}because the number of samples in the (1) cell cycle dataset^{}was too small to fit a statistically valid full covariance matrix^{}to each cluster (17). In order to ensure a near optimal initialization,^{}each EM MoDG result was a result of selection of the best of^{}30 runs, each initialized by placing the initial cluster centroids^{}on K randomly selected data points. The run with best fit to^{}the data (i.e. had the lowest loglikelihood score) was used^{}for the final clustering. Multiple bestof30 runs were performed^{}to verify that the quantitative measures and gene lists results^{}reported here did not vary significantly. The EM MoDG code used^{}here was developed by the NASA/JPL Machine Learning Systems^{}Group.^{}
XclustAgglom
We agglomerate the hierarchical tree returned by Xclust (18)^{}based on a maximal cluster size threshold. Starting from the^{}root, any subtree within the tree with less than the maximal^{}cluster size threshold is agglomerated into a cluster. In order^{}to work with the familiar parameter K (number of clusters),^{}we iteratively find the size threshold that will return as close^{}to K clusters as possible. In practice, this simple heuristic^{}works best when K is over specified by 2–4 times the expected^{}number of clusters because it will generate several very small^{}(often singleton) clusters that are outliers to core major clusters^{}in the data.^{}
Data preprocessing
Each microarray dataset was obtained from the cited authors.^{}For the Cho et al. (1) data, we removed any gene that did not^{}show a sustained absolute expression level of at least 8 for^{}30 consecutive minutes. For each gene vector, we then divided^{}each time point measurement by the median expression value for^{}the gene. For the data of Spellman et al. (16), we linearly^{}interpolated missing values using the available adjacent time^{}points. For both datasets, we log_{2} transformed the resulting^{}gene expression matrices. The datasets were then annotated with^{}the original clustering results as published.^{}
Motif conserved enrichment score (MCS)
For each motif, we translated the IUPAC consensus (Swi5/Ace2:^{}KGCTGR; MCB: ACGCGT; SCB: CACGAAA) into a position weight matrix^{}(PWM) where the probabilities or frequencies in the PWM is determined^{}by the degeneracy of the IUPAC symbol. We calculate a logodds^{}ratio for the PWM occurring at every position in the 1 kb upstream^{}as described in Equation 10
^{}
(10) 
Mathematical tools for organizing and quantifying microarray clusterings: confusion matrices and comparative metrics
A confusion matrix can effectively summarize all pairwise intersections^{}between all clusters from any two clusterings of the same data.^{}Confusion matrices, as used in this work, are defined as the^{}matrix of cardinalities of all pairwise intersections between^{}two different expression clusterings (see Methods). Once the^{}confusion matrix is constructed, we can then apply different^{}scoring functions to the confusion matrix to quantify similarity:^{}(i) NMI measures the amount of information shared between two^{}clusterings (7); and (ii) LA optimizes the number of data vectors^{}in clusters that correspond to each other, thereby identifying^{}the optimal pairing of clusters. LA also reports the percentage^{}of data vectors contained within those clusters, and this can^{}be used to assess similarity of results globally over the entire^{}dataset and locally on a cluster pair by cluster pair basis^{}(8,9) (for mathematical descriptions of confusion matrices,^{}NMI and LA, see Methods). The combined use of LA and NMI metrics^{}can provide a biologist with immediate insight into the magnitude^{}and nature of global differences between two microarray clusterings^{}by capitalizing on the fact that NMI is asymmetric and LA is^{}symmetric (see Table 1 for details). Specifically, this discriminates^{}instances in which two clusterings are very similar to each^{}other from a comparison in which one clustering is different^{}from the other, but is essentially a refinement of the first.^{}Both of these can be discriminated from the third relationship,^{}in which two clusterings deliver fundamentally different views^{}of the data structure.
Confusion arrays organize and display comparative analyses.^{}Given two different clusterings of a dataset and a global evaluation^{}of their similarity via NMI and LA, we then needed a method^{}to systematically compare clusters in a manner that is more^{}effective and intuitive than mere inspection of gene lists.^{}To do this, we define the confusion array, which is a direct^{}extension of a formal confusion matrix. For two different clusterings,^{}each cell in the confusion array contains the intersection set^{}between the two parent clusters (as opposed to the cardinality^{}of this set, as in a confusion matrix; see Methods). In the^{}context of the CompClust system, the confusion array cells can^{}then be interactively mined. Confusion arrays for two different^{}clusterings, one using an Affymetrix yeast cell cycle dataset^{}(1) and the other using a deposition ratiometric dataset (16),^{}are shown in Figures 1 and 2, and are analyzed further below.^{}
Understanding cluster relatedness
ROC measures cluster overlap. Whatever algorithm has been used^{}to cluster data, it is useful to find out how distinct each^{}cluster is from all the others and how distinct any particular^{}cluster is from another specific cluster. This is especially^{}pertinent when membership in a cluster will be translated into^{}a gene list that ultimately becomes a functional annotation^{}or defines genes that will be input into higherorder analyses.^{}To address this issue, we applied classical ROC analysis (see^{}Methods). In this context, cluster assignment is used as the^{}‘diagnosis’ and the distance of each expression^{}vector from the cluster mean vector is the ‘decision criterion’.^{}The corresponding ROC curve plots the proportion of cluster^{}members versus the proportion of noncluster members as the^{}distance from the cluster centroid increases (Figure 3). This^{}can be interpreted geometrically as expansion of a hypersphere^{}from the cluster centroid until all members of the cluster are^{}enclosed. Thus, when one cluster is completely separate from^{}all other data, all of its members are closer to the cluster^{}center than all nonmembers and the area under the ROC curve^{}is 1.0 (Figure 3B). When a cluster is not fully separable from^{}the remainder of the data, the ROC curve rises more slowly and^{}the area under the ROC curve is <1.0. In the limit, when^{}the two classes are perfectly mixed, the ROC curve closely follows^{}X = Y, and the area under the curve drops to 0.5 (Figure 3D).^{}The shape of the ROC curve also contains additional information^{}about how cluster overlap is distributed, and this information^{}can be used by the biologist to choose useful data mining cutoffs^{}that mark discontinuities and cluster substructure (see below^{}and Figure 4). It can also be used interactively within CompClust^{}to explore and select data vectors (genes) that are closer or^{}more distant from the cluster center. Selection of vectors not^{}assigned to the cluster, yet positioned at overlapping distances^{}from its center, is also possible and is often instructive (Figure 4^{}and text below).
Comparing clusterings of yeast cell cycle microarray datasets
We performed comparative analyses on clustering results from^{}two different yeast microarray timecourse datasets (one Affymetrix^{}and one ratiometric), each composed of genes that are differentially^{}expressed over the cell cycle (1,16). These comparisons provide^{}a useful perspective because the gene classification results^{}from the original gene clusterings have been introduced as gene^{}annotations in widely used databases [Incytes' YPD (19), SGD^{}(http://yeastgenome.org)] and have been mined or used as starting^{}points in many subsequent works. We generated a new clustering^{}for each dataset, in each instance selecting an algorithm that^{}differs substantially from the one used in the original publication^{}but one that should also be entirely appropriate for the dataset.^{}For the Cho et al. (1) dataset, we used EM MoDGs (17), which^{}is an unsupervised method that searches for the best statistical^{}fit to the data modeled as a mixture of Gaussian distributions.^{}The heuristic used in the original report (1) is a supervised^{}method based on biologist's knowledge of cell cycle phases.^{}The heuristic focused on the time of peak expression for each^{}gene trajectory to guide assignment of each gene to one of the^{}five time domains associated with Early G_{1}, Late G_{1}, S, G_{2} and^{}M phases of the cell cycle. For the second dataset (16), we^{}performed an agglomerative phylogenic hierarchical clustering^{}of the tsCDC15mutant synchronized data. This algorithm is based^{}on the widely used Xclust phylogenetic ordering algorithm (2),^{}onto which we grafted an agglomeration step designed to establish^{}objective boundaries in the tree (see Methods). This effectively^{}turns an ordering algorithm into a clustering algorithm, in^{}which group boundaries are imposed computationally rather than^{}by a user's pattern recognition skills, as is performed with^{}Xclust by itself. This result was compared with the result reported^{}by Spellman et al. (16), in which they used a Fourier transformbased^{}algorithm, which was designed to take maximal advantage of the^{}timecourse pattern.^{}Global similarity measures
Comparison of the two clusterings of Affymetrix data from Cho^{}et al. (1) gave a global LA score of 0.63 and NMI scores of^{}0.52 and 0.50, immediately indicating that EM MoDG and the heuristic^{}classification have produced substantially different results.^{}The LA value of 0.63 says that the optimal pairing of clusters^{}still classifies 37% of the genes differently between the two^{}algorithms. ROC curves and ROC areas were generated for each^{}cluster (Figure 5). Viewed in aggregate, this ROC analysis showed^{}that clusters from EM MoDG are all better separated from each^{}other than are any clusters from the original Cho et al. (1)^{}heuristic. Thus, the ROC indices for EM MoDG are all 0.96 or^{}above, and four of the five clusters are >0.98. In contrast,^{}the heuristic classification groups had ROC values as low as^{}0.82 for S phase and none was better than 0.97 (M phase). By^{}this criterion, we can argue that EM clustering is an objectively^{}superior representation of the underlying data structure.
How are these differences between clustering results distributed^{}over the dataset? We used PCA (Principle Component Analysis)^{}to determine whether the two clusterings were globally similar^{}or different in the way they partitioned the dataspace. PCA^{}projects highdimensional gene expression vectors (each dimension^{}here corresponding to a different RNA sample) into a different^{}and lowdimensional space (usually two or three) (20), in which^{}the new PCA dimensions have each been selected to explain the^{}maximal amount of variance in the data. A common feature of^{}microarray datasets is that the first few principle components^{}often capture most of the variations in the data (here 64%).^{}Using CompClust to view the cluster means in PCA space allowed^{}us to assess relationships between clusters from the two algorithms.^{}Relative positions of cluster means in the PCA display the cell^{}cycle progression in a counterclockwise pattern that is quite^{}similar for the two algorithms. The absolute positions of the^{}cluster centers in PCA space differ, though not extravagantly,^{}for most clusters. This is interesting because the coherence^{}in overall structure would seem to contradict the rather high^{}dissimilarities in cluster composition measured by the criteria^{}LA and NMI, and shown graphically in the confusion array (Figure 1).^{}Considered together, the results argue that the overall^{}data structure, reflecting phases of the cell cycle, is robust^{}and has been treated rather similarly by the two algorithms,^{}even though 37% of individual gene expression vectors were assigned^{}differently. This raises the question of which gene vectors^{}have been differentially assigned and what biological meaning,^{}if any, should be attached to the differences. These questions^{}are addressed in the Discussion by examining specific gene groups^{}in the confusion array.^{} Using the ratiometric data of Spellman et al. (16), a comparison^{}of the original Fourierbased algorithm versus agglomerated^{}Xclust produced NMI and LA scores of 0.39, 0.41 and 0.60, respectively.^{}These scores indicate that the two clusterings are even more^{}different in membership assignment, with 40% of genes falling^{}outside the optimal LA pairing. Since both NMI and LA scores^{}are low, gene memberships for some clusters must be truly scrambled^{}rather than being simple combinations of cluster unions and^{}subsets (see Methods and Table 1). PCA projection (Figure 6)^{}showed that some major cluster centers from the two algorithms^{}are positioned very differently, both absolutely and relatively^{}(note the yellow cluster corresponding to the Fourier Sphase^{}group). The confusion array shows that XclustAgglom clusters^{}often combine genes that are members of adjacent Fourier clusters,^{}though in some cases it joins vectors from nonadjacent groups^{}(the confusion around S phase is complex). ROC curves and scores^{}also indicate that XclustAgglom has performed a slightly better^{}job of segregating data into discrete groups that reflect underlying^{}data structure, while the Fourier analysis groups are less coherent^{}and often seem to mix members of kinetically adjacent groups^{}as detailed in the confusion array (Figures 2 and 6). This may^{}be due, in part, to the use of a small number of then known^{}genes to center landmark phases by the Fourier algorithm. The^{}fact that this phase assignment was a ‘somewhat arbitrary’^{}step in the original analysis was pointed out by Spellman et^{}al. (16).
Highresolution cluster comparisons
Confusion arrays can be used to explore in more detail the issues^{}raised by global analyses and to mine relationships between^{}individual clusters. This can then be used to make refined gene^{}lists based on either the expert opinion or on the application^{}of computationally objective criteria. We applied LA to the^{}confusion matrix of the Cho heuristic and the EM MoDG results^{}and produced the corresponding adjacency matrix (see Methods^{}and Equation 3). This delivered an objectively optimized pairing^{}of EM cluster 1 with Cho ‘Early G_{1}’, EM cluster^{}2 with Cho ‘Late G_{1}’ and so on, as shown in the^{}array visualization (Figure 1). Each cell in the confusion array^{}contains the corresponding gene vectors and displays the calculated^{}mean vector for each intersect cell in the array.^{}
The confusion array highlighted relationships that were not^{}clear from Figures 5 or 6. For example, in the Affymetrix data^{}(1), both algorithms identified two gene classes within G_{1} (red^{}and yellow, respectively, in the PCA analysis of Figure 5).^{}However, the EM1 cluster shares only 67% of its content with^{}the Cho ‘Early G_{1}’, and most remaining genes fall^{}into the Cho ‘Late G_{1}’ cluster (Figure 7A). A straightforward^{}hypothesis is that the statistical EM algorithm simply could^{}not justify dividing G_{1} vectors into early and Late G_{1} kinetic^{}groups as the heuristic had done. The confusion array, however,^{}makes it clear at a glance that a different data feature is^{}driving the G_{1} subgroupings. EM1 cluster members are upregulated^{}only in the second cycle, while EM2 genes are upregulated in^{}both cycles. The array also shows that the Cho Early G_{1} group^{}contains a set of 10 genes that appear much more consistent^{}with a coherent Mphase group that corresponds to EM5.
Because the focus of the heuristic classification was mainly^{}on the second oscillation, it suppressed the distinction between^{}single cycle and two cycle G_{1} patterns, while ‘paying^{}more attention’ to finestructure kinetic differences^{}of the second cycle. On the other hand, EM MoDG treats all features^{}with equal weight across the time course and so centers clusters^{}without prior guidance about their relationship to cell cycle^{}phase. The confusion array intersect cells then effectively^{}parsed the fine kinetic differences within EM1 by separating^{}47 vectors that more closely resemble the Early G_{1} cluster from^{}18 that are more like Late G_{1} cluster. Thus, the intersect cells^{}capture and dissect the two distinct ways used by the two algorithms^{}to parse G_{1} expression. Both appear valid and reveal in different^{}ways, one based entirely on the kinetics of the second cycle^{}and one focusing on a major difference in the expression that^{}is seen only in the first cycle following release from arrest.^{}
Turning to the Sphase clusters, the comparative analysis highlights^{}a different kind of disagreement between the two algorithm outputs.^{}Members from Cho ‘Sphase’ cluster overlap almost^{}evenly with either EM2 (41%) or EM3 (49%). A simple biological^{}interpretation is that the kinetic boundary between Late G_{1}^{}and Sphase is not very crisp, irrespective of the algorithm^{}used to try to define them. An alternate explanation is that^{}one algorithm is frankly superior to the other in defining coherent^{}expression groups. Examination of the ROC curves (Figure 5)^{}and values (0.98 for EM versus 0.82 for S phase) provided an^{}objective measure of quality that argues the EM clustering of^{}S phase is superior.^{}
Dissecting individual clusters using ROC
Further ROCbased analysis of the Sphase cluster from the Cho^{}et al. (1) classification is shown in Figure 4. The ROC curve^{}shows how far from the cluster mean one needs to expand a hypersphere^{}to include a given fraction of vectors from the cluster, and^{}the shape of the curve can be used to understand cluster substructure.^{}Inspection of the ROC curve and the corresponding histogram^{}(Figure 4A) identified a natural discontinuity separating the^{}first 66% of genes that are nearer the cluster center from the^{}remainder. For additional data mining, we therefore set a boundary^{}at 66% on the ROC curve and then inspected all gene vectors^{}from the entire cell cycle dataset that fall within that boundary.^{}Approximately 20% of gene vectors inside this dataspace threshold^{}had been assigned to other clusters. Figure 4B and C allows^{}inspection of gene trajectories that were either interior or^{}exterior to the boundary. This is useful for reviewing and ‘pruning’^{}lists of putatively coexpressed genes in an objective manner.^{}
Integration with transcription factor motifs to identify regulatory modules
CompClust is designed to integrate different kinds of data by^{}linking each gene with other data, annotations and results of^{}metaanalyses. There are many ways of using other datasets to^{}identify relationships between, for example, observed patterns^{}of RNA coexpression and other data that help to answer the^{}question: are similarly expressed genes coregulated? A group^{}of genes that are coexpressed may also be coregulated, but^{}this is far from assured. Coexpressed genes can instead arrive^{}at the same expression pattern by the action of two (or more)^{}different regulators. Conversely, genes that are coregulated^{}by the same factor(s) at the transcriptional level may not display^{}identical RNA expression patterns for a variety of reasons,^{}including differential turnover rates. For these reasons, additional^{}kinds of data are needed to help determine the coexpressed^{}genes that are, in fact, transcriptionally coregulated and^{}to provide evidence for the identity of factor(s) driving coregulation.^{}Here, we show how the occurrence of evolutionarily persistent^{}transcription factor binding sites can be mapped informatively^{}onto the gene expression clusters from a confusion array to^{}predict the structure of transcription modules.^{}
The observation of two distinct sets of genes, one that peaks^{}during both the first and second cell cycles after release from^{}arrest, and another restricted to only the second oscillation^{}(Figure 7), suggests that they might be regulated differently^{}at the level of transcription. Prior work has led to the view^{}that MCB and SCB sequence motifs bind Mbp1/Swi6 (MCF) or Swi4/Swi6^{}(SCF) factor complexes to drive G_{1}specific transcription (21–23).^{}Thus, many genes are believed to be selectively and specifically^{}expressed in G_{1} owing to their membership in either MCF or SCF^{}regulatory modules. The two modules are also thought to be partly^{}distinct from each other, with some genes apparently being strongly^{}governed by either Swi4 or Mbp1 [(24,25) and reviewed in (26)].^{}
We, therefore, calculated an MCS (see Methods) to quantify the^{}conserved enrichment of a consensus site within 1 kb of the^{}start ATG in sequence data from the seven available yeast genomes^{}(27,28). We then asked whether different intersecting cells^{}within the confusion array are differentially and significantly^{}enriched for these known candidate motifs. The EM2/Late G_{1} intersect^{}cell was highly enriched, above chance, for MCB and SCB. A total^{}of 79 of 113 genes (70%) were enriched for MCB compared with^{}the expectation of 13 such genes for randomly selected samples^{}of 113 yeast genes. A total of 18% are enriched for SCB sites^{}compared with an expectation of only 4% by chance (Figure 8A).^{}CompClust's data linking capabilities were then used to visualize^{}correlations with in vivo protein–DNA binding data for^{}Swi4 and Mbp1 (29). The vast majority of genes with the above^{}threshold MCB or SCB MCS scores also showed significant in vivo^{}binding activity for either MCF or SCF.
The picture for the EM1/Early G_{1} intersect cell, whose genes^{}peak only once during the time course, was surprisingly different.^{}This group showed no significant enrichment for either MCB or^{}SCB (Figure 8A). What other factor(s) could be responsible for^{}the EM1/Early G_{1} intersect pattern? We searched and found that^{}the Swi5/Ace2 motif is enriched so that 30% are above threshold,^{}a value twice that expected by chance. The highest Swi5 MCS^{}scores correlated strongly with very intense expression in the^{}second cycle. This, in turn, correlated well with in vivo factor^{}binding by both Swi5 and Ace2 taken from the chromatin immunoprecipitation^{}data of Lee et al. (29). Thus, 60% of the EM1/Early G_{1} Swi5/Ace2^{}group had Pvalues <0.05 for Swi5 or Ace2 in the global chromatin^{}immunoprecipitation study of Lee et al. (29), and others were^{}relatively strong binders as shown in Figure 8. That most or^{}all of these connections between Swi5/Ace2 and EM1 genes, inferred^{}from three kinds of genome scale data are real is supported^{}by the fact that most previously identified targets of Ace2^{}and/or Swi5 (30,31) that were in the original Cho cycling dataset^{}are in this group.
By coupling the resulting comparative analyses with a flexible^{}visualization system within CompClust and, especially, by using^{}confusion arrays to organize comparisons, it became relatively^{}straightforward to identify both global and local trends in^{}expression patterns and to find out the features that are fragile^{}to algorithm choice or to other variations. The tools were also^{}useful for investigating substructure within individual gene^{}clusters and for seeing how a cluster from one analysis relates^{}to a cluster from another analysis. CompClust, including all^{}source code, and associated tutorials are available at http://woldlab.caltech.edu/compClust/.^{}The principal capabilities presented here can be used through^{}the GUI of CompClustTK. The tools are introduced by tutorials^{}that use the cell cycle examples presented here. Much richer^{}and almost infinitely varied interactive interrogations can^{}be performed using the command line version of CompClust, which^{}is available for download. And while this demonstration is centered^{}on clustering of largescale expression data from microarrays,^{}it will be applicable to clusterings of protein interaction^{}measurements, protein–DNA interactions and other largescale^{}data types.^{}
Comparative analysis showed that for the Affymetrix dataset^{}(1), EM MoDG and the Cho heuristic found a basic data structure^{}dominated by, and consistent with, the major phases of the cell^{}cycle (Figure 5). This presented an apparent paradox, since^{}the overall cell cycle phase structure was highlighted similarly^{}by both algorithms, while the assignment of specific gene vectors^{}to individual clusters was quite different, as shown by the^{}LA and NMI scores (Figure 1). Further investigation of cluster^{}relationships in the context of confusion arrays, local ROC^{}analysis, and ROC curve structure helped to resolve the paradox.^{}In specific cases, ambiguity was simply a data quality issue^{}for particular gene data vectors, and highlighting these affords^{}a biologist the opportunity to trim gene lists based on expert^{}knowledge. In other cases, differences between algorithms portrayed^{}correctly the fact that phases of the cell cycle are not crisply^{}separated with respect to both mRNA synthesis and decay. This^{}leads to a continuum of timecourse profiles, especially around^{}S phase. This knowledge of ‘fuzzy kinetic boundaries’^{}is important for future uses of gene categories for subsequent^{}gene network modeling. In yet other specific parts of the clustering,^{}the differences between algorithms focused attention on different^{}and valid ways of parsing the data, as in the case of genes^{}regulated strongly by Ace2/Swi5 in G_{1}, which were separable^{}by one algorithm but not by another.^{}
Inference of transcription modules
A key capability of the CompClust computational framework is^{}that it provides the means to integrate many different kinds^{}of data via linking properties (see Methods). This then allows^{}the biologist to detect, organize and further mine relationships.^{}The manner in which relationships, such as a direct connection^{}between a transcription factor and one of its target genes,^{}can be defined and vetted (by other data) is flexible, so that^{}users can specify significance thresholds and apply diverse^{}comparative metrics of their choice. They may also export CompClust^{}data for further automated modeling, e.g. artificial neural^{}networks [(9); C. Hart, E. Mjolsness, B. Wold, manuscript in^{}preparation]. By using CompClust in this way, we were easily^{}able to capture all known regulatory modules governing yeast^{}G_{1} transcription and to relate these regulatory connections^{}to specific expression clustering patterns.
Acknowledgements
Funding of this work and its open access publication was from^{}the NCI, the NIH, NASA, the Department of Energy, and the LK^{}Whittier Foundation. The authors thank Prof. Joe Hacia, Drs^{}Jose Luis Riechmann and Brian Williams for helpful comments^{}on the manuscript.^{} Conflict of interest statement. None declared.

Given two clustering results A and B, for which both NMI(A,B), NMI(B,A) and LA(A,B) values are high (nearing the maximum value of 1.0), the two clusterings are very similar, and when all three are significantly lower, they are very different. But when NMI(A,B) is high, NMI(B,A) is low and LA is low, then it is likely that A is a refinement of B. In this case, many clusters in B have been broken into two or more clusters in A (possible combinations summarized in here) (1). The magnitude of dissimilarity that is important is defined by the user and may vary considerably with the dataset, although values <0.7 for both LA and NMI are usually viewed as quite different. Additional interpretation of differences measured by LA and NMI depends on more detailed analysis of the dissimilarities and their distribution over the dataset, as outlined above.