CompClust
The maturation of additional largescale data types (global
^{}chromatin immunoprecipitation assays, more complete and highly
^{}articulated protein–protein interaction maps, Gene Ontology
^{}categories, evolutionarily conserved sequence features and other
^{}covariates) shifts the emphasis from analyzing and mining expression
^{}data alone to integrating disparate data types. A key feature
^{}of any system designed for integration is the ability to provide
^{}a manytomany mapping of labels to data features and data features
^{}to other data features in a global way. CompClust provides these
^{}capabilities by maintaining and tracking linkages of multiple
^{}arbitrary annotations and covariates with data features through
^{}almost any data transformation, merger, selection or aggregation.
^{}In addition, many supervised and unsupervised machine learning
^{}algorithms are easily accessible within CompClust.
^{}
CompClust is primarily accessible through an application programming^{}interface (API) and, with the use of Python's exposed interpreter,^{}this provides a very rich command line interface (CLI). The^{}major capabilities illustrated in this paper are accessible^{}through CompClustTK, a set of simple graphical user interfaces^{}(GUIs) to offer a convenient starting point without learning^{}Python commands. These GUIs will permit users to perform the^{}major classes of analyses shown, though we note that these comprise^{}only a fraction of CompClust capabilities. The flexibility and^{}diversity of analysis paths is too great to anticipate them^{}all or commit them to GUI forms. This limitation can be overcome^{}by using the Python command line environment. Python commands^{}can be learned at the level needed in a relatively short time^{}(a few weeks of part time effort) by users who do not have prior^{}programming experience. The benefit is access to remarkable^{}flexibility in interrogating datasets. This is a much better^{}match to the diversity of questions and comparisons that biologists^{}usually want to make than in any GUIbased system.^{}
The choice to implement CompClust in Python over other languages^{}was made for several reasons which, considered in aggregate,^{}argue it is the best available language to support the capabilities^{}and analysis goals of CompClust: (i) using Python's exposed^{}interpreter, our API becomes immediately useful for analysis^{}without the construction of a complex GUI. The exposed interpreter^{}also speeds the development time. (ii) Python's syntax is fairly^{}straightforward and easy to learn for even nonprogrammers.^{}(iii) It is freely available and distributable under an opensource^{}license. (iv) Python has an extensive and standard library and^{}in addition third party extensions, including the Numeric package^{}which provides powerful numeric analysis routines. (v) Python^{}is also platform neutral and runs on the majority of systems,^{}including unix/linux, Microsoft Windows and the Mac OS.^{}
Pairwise comparison of clusterings (partitions) using confusion arrays and matrices
Confusion arrays and matrices were used to make pairwise comparisons^{}between different clusterings (mathematical partitions). A set^{}of metrics were then applied to the confusion matrix to measure^{}the nature and degree of similarity between two dataset partitions.^{}Briefly, a confusion matrix is the matrix of cardinalities of^{}all pairwise intersections between two partitions, where a partition^{}of a dataset is defined as a set of disjoint subsets whose union^{}contains all elements of the dataset. We define a confusion^{}array simply as an array of all pairwise intersections between^{}two partitions of a dataset. The cardinalities of these intersection^{}sets form the confusion matrix, whose elements are given by^{}Equation 1:
^{}

(1) 
where
A_{i}: the data members
^{}of class
i in
A, and
B_{i}: the data members of class
j in
B.
^{}
Linear assignment
The LA value for a confusion matrix is calculated between two^{}partitions (clusterings) and by generating an optimal pairing^{}so that there is, at most, a onetoone pairing between every^{}class in partitions, and this pairing is calculated by optimizing^{}the objective function in Equation 2, using the constraints^{}given in Equation 3, thus defining a linear assignment problem.^{}Next, the maximumcardinality bipartite matching of maximum^{}weights algorithm (Gabow, 1973) was implemented for the optimization.^{}After finding the optimal pairing, the LA score is simply the^{}proportion of vectors (e.g. gene expression trajectories or^{}conditions) included in the optimally paired clusters (Equation 4).^{}It is important to note that LA, unlike NMI, is a symmetric^{}score so that LA(A,B) = LA(B,A). In addition to quantifying^{}the degree of similarity or difference between two partitions,^{}the adjacency matrix (Equation 3) also provides a way to identify^{}pairs of clusters that are globally most similar to each other^{}between two partitions of the data. As illustrated for clusterings^{}of yeast cell cycle regulated genes, this is especially useful^{}for interactive examination of two clusterings.
^{}

(2) 
where,
^{}

(3) 
Now,
^{}

(4) 
where
M is the adjacency matrix describing the
^{}pairing between
A and
B, and
C is the confusion matrix (
Equation 1).
^{}^{}
Normalized mutual information
The NMI index (7) quantifies how much information is lost, on^{}average, when one clustering is regenerated from a second classification^{}(Equation 5). A noteworthy difference from LA is that NMI is^{}asymmetric.
^{}

(5) 
where
I(
A,
B) is the
^{}shared information between the two partitions and it is normalized
^{}by the entropy of partition
A;
H(
A) is defined as:
^{}

(6) 
and
^{}

(7) 
and the jointinformation
^{}is:
^{}

(8) 
and
^{}

(9) 
^{}
EM MoDG clustering
EM MoDG was implemented with a diagonal covariance matrix model^{}because the number of samples in the (1) cell cycle dataset^{}was too small to fit a statistically valid full covariance matrix^{}to each cluster (17). In order to ensure a near optimal initialization,^{}each EM MoDG result was a result of selection of the best of^{}30 runs, each initialized by placing the initial cluster centroids^{}on K randomly selected data points. The run with best fit to^{}the data (i.e. had the lowest loglikelihood score) was used^{}for the final clustering. Multiple bestof30 runs were performed^{}to verify that the quantitative measures and gene lists results^{}reported here did not vary significantly. The EM MoDG code used^{}here was developed by the NASA/JPL Machine Learning Systems^{}Group.^{}
XclustAgglom
We agglomerate the hierarchical tree returned by Xclust (18)^{}based on a maximal cluster size threshold. Starting from the^{}root, any subtree within the tree with less than the maximal^{}cluster size threshold is agglomerated into a cluster. In order^{}to work with the familiar parameter K (number of clusters),^{}we iteratively find the size threshold that will return as close^{}to K clusters as possible. In practice, this simple heuristic^{}works best when K is over specified by 2–4 times the expected^{}number of clusters because it will generate several very small^{}(often singleton) clusters that are outliers to core major clusters^{}in the data.^{}
Data preprocessing
Each microarray dataset was obtained from the cited authors.^{}For the Cho et al. (1) data, we removed any gene that did not^{}show a sustained absolute expression level of at least 8 for^{}30 consecutive minutes. For each gene vector, we then divided^{}each time point measurement by the median expression value for^{}the gene. For the data of Spellman et al. (16), we linearly^{}interpolated missing values using the available adjacent time^{}points. For both datasets, we log_{2} transformed the resulting^{}gene expression matrices. The datasets were then annotated with^{}the original clustering results as published.^{}
Motif conserved enrichment score (MCS)
For each motif, we translated the IUPAC consensus (Swi5/Ace2:^{}KGCTGR; MCB: ACGCGT; SCB: CACGAAA) into a position weight matrix^{}(PWM) where the probabilities or frequencies in the PWM is determined^{}by the degeneracy of the IUPAC symbol. We calculate a logodds^{}ratio for the PWM occurring at every position in the 1 kb upstream^{}as described in Equation 10
^{}

(10) 
of
^{}each open reading frame (ORF) for each species available. We
^{}then sum the logodds ratio over all possible positions, where
^{}the logodds ratio is >7. The summed logodds ratios for
^{}each species is then averaged together to generate an ORFspecific
^{}motif enrichment score. In
Equation 10,
N is the total number
^{}of species compared,
W is the length of the motif,
p is the
^{}probability from the PWM of position
i being the nucleotide
^{}n, and
bg represents the probability of the window being generated
^{}from a background sequence model based on a secondorder hidden
^{}Markov model.