The maturation of additional large-scale data types (global
chromatin immunoprecipitation assays, more complete and highly
articulated protein–protein interaction maps, Gene Ontology
categories, evolutionarily conserved sequence features and other
covariates) shifts the emphasis from analyzing and mining expression
data alone to integrating disparate data types. A key feature
of any system designed for integration is the ability to provide
a many-to-many mapping of labels to data features and data features
to other data features in a global way. CompClust provides these
capabilities by maintaining and tracking linkages of multiple
arbitrary annotations and covariates with data features through
almost any data transformation, merger, selection or aggregation.
In addition, many supervised and unsupervised machine learning
algorithms are easily accessible within CompClust.
CompClust is primarily accessible through an application programminginterface (API) and, with the use of Python's exposed interpreter,this provides a very rich command line interface (CLI). Themajor capabilities illustrated in this paper are accessiblethrough CompClustTK, a set of simple graphical user interfaces(GUIs) to offer a convenient starting point without learningPython commands. These GUIs will permit users to perform themajor classes of analyses shown, though we note that these compriseonly a fraction of CompClust capabilities. The flexibility anddiversity of analysis paths is too great to anticipate themall or commit them to GUI forms. This limitation can be overcomeby using the Python command line environment. Python commandscan be learned at the level needed in a relatively short time(a few weeks of part time effort) by users who do not have priorprogramming experience. The benefit is access to remarkableflexibility in interrogating datasets. This is a much bettermatch to the diversity of questions and comparisons that biologistsusually want to make than in any GUI-based system.
The choice to implement CompClust in Python over other languageswas made for several reasons which, considered in aggregate,argue it is the best available language to support the capabilitiesand analysis goals of CompClust: (i) using Python's exposedinterpreter, our API becomes immediately useful for analysiswithout the construction of a complex GUI. The exposed interpreteralso speeds the development time. (ii) Python's syntax is fairlystraightforward and easy to learn for even non-programmers.(iii) It is freely available and distributable under an open-sourcelicense. (iv) Python has an extensive and standard library andin addition third party extensions, including the Numeric packagewhich provides powerful numeric analysis routines. (v) Pythonis also platform neutral and runs on the majority of systems,including unix/linux, Microsoft Windows and the Mac OS.
Pairwise comparison of clusterings (partitions) using confusion arrays and matrices
Confusion arrays and matrices were used to make pairwise comparisonsbetween different clusterings (mathematical partitions). A setof metrics were then applied to the confusion matrix to measurethe nature and degree of similarity between two dataset partitions.Briefly, a confusion matrix is the matrix of cardinalities ofall pairwise intersections between two partitions, where a partitionof a dataset is defined as a set of disjoint subsets whose unioncontains all elements of the dataset. We define a confusionarray simply as an array of all pairwise intersections betweentwo partitions of a dataset. The cardinalities of these intersectionsets form the confusion matrix, whose elements are given byEquation 1:
: the data members
of class i
, and Bi
: the data members of class j
The LA value for a confusion matrix is calculated between twopartitions (clusterings) and by generating an optimal pairingso that there is, at most, a one-to-one pairing between everyclass in partitions, and this pairing is calculated by optimizingthe objective function in Equation 2, using the constraintsgiven in Equation 3, thus defining a linear assignment problem.Next, the maximum-cardinality bipartite matching of maximumweights algorithm (Gabow, 1973) was implemented for the optimization.After finding the optimal pairing, the LA score is simply theproportion of vectors (e.g. gene expression trajectories orconditions) included in the optimally paired clusters (Equation 4).It is important to note that LA, unlike NMI, is a symmetricscore so that LA(A,B) = LA(B,A). In addition to quantifyingthe degree of similarity or difference between two partitions,the adjacency matrix (Equation 3) also provides a way to identifypairs of clusters that are globally most similar to each otherbetween two partitions of the data. As illustrated for clusteringsof yeast cell cycle regulated genes, this is especially usefulfor interactive examination of two clusterings.
is the adjacency matrix describing the
pairing between A
, and C
is the confusion matrix (Equation 1
Normalized mutual information
The NMI index (7) quantifies how much information is lost, onaverage, when one clustering is regenerated from a second classification(Equation 5). A noteworthy difference from LA is that NMI isasymmetric.
) is the
shared information between the two partitions and it is normalized
by the entropy of partition A
) is defined as:
and the joint-information
EM MoDG clustering
EM MoDG was implemented with a diagonal covariance matrix modelbecause the number of samples in the (1) cell cycle datasetwas too small to fit a statistically valid full covariance matrixto each cluster (17). In order to ensure a near optimal initialization,each EM MoDG result was a result of selection of the best of30 runs, each initialized by placing the initial cluster centroidson K randomly selected data points. The run with best fit tothe data (i.e. had the lowest log-likelihood score) was usedfor the final clustering. Multiple best-of-30 runs were performedto verify that the quantitative measures and gene lists resultsreported here did not vary significantly. The EM MoDG code usedhere was developed by the NASA/JPL Machine Learning SystemsGroup.
We agglomerate the hierarchical tree returned by Xclust (18)based on a maximal cluster size threshold. Starting from theroot, any subtree within the tree with less than the maximalcluster size threshold is agglomerated into a cluster. In orderto work with the familiar parameter K (number of clusters),we iteratively find the size threshold that will return as closeto K clusters as possible. In practice, this simple heuristicworks best when K is over specified by 2–4 times the expectednumber of clusters because it will generate several very small(often singleton) clusters that are outliers to core major clustersin the data.
Each microarray dataset was obtained from the cited authors.For the Cho et al. (1) data, we removed any gene that did notshow a sustained absolute expression level of at least 8 for30 consecutive minutes. For each gene vector, we then dividedeach time point measurement by the median expression value forthe gene. For the data of Spellman et al. (16), we linearlyinterpolated missing values using the available adjacent timepoints. For both datasets, we log2 transformed the resultinggene expression matrices. The datasets were then annotated withthe original clustering results as published.
Motif conserved enrichment score (MCS)
For each motif, we translated the IUPAC consensus (Swi5/Ace2:KGCTGR; MCB: ACGCGT; SCB: CACGAAA) into a position weight matrix(PWM) where the probabilities or frequencies in the PWM is determinedby the degeneracy of the IUPAC symbol. We calculate a log-oddsratio for the PWM occurring at every position in the 1 kb upstreamas described in Equation 10
each open reading frame (ORF) for each species available. We
then sum the log-odds ratio over all possible positions, where
the log-odds ratio is >7. The summed log-odds ratios for
each species is then averaged together to generate an ORF-specific
motif enrichment score. In Equation 10
is the total number
of species compared, W
is the length of the motif, p
probability from the PWM of position i
being the nucleotiden
, and bg
represents the probability of the window being generated
from a background sequence model based on a second-order hidden