
A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data
Methods
[Equation 1]
Linear assignment
The LA value for a confusion matrix is calculated between two partitions (clusterings) by generating an optimal pairing in which each class in one partition is paired with at most one class in the other. This pairing is calculated by optimizing the objective function in Equation 2 under the constraints given in Equation 3, which defines a linear assignment problem. The maximum-cardinality bipartite matching of maximum weights algorithm (Gabow, 1973) was implemented for the optimization. After finding the optimal pairing, the LA score is simply the proportion of vectors (e.g. gene expression trajectories or conditions) included in the optimally paired clusters (Equation 4). It is important to note that LA, unlike NMI, is a symmetric score, so that LA(A,B) = LA(B,A). In addition to quantifying the degree of similarity or difference between two partitions, the adjacency matrix (Equation 3) also provides a way to identify the pairs of clusters that are globally most similar to each other between two partitions of the data. As illustrated for clusterings of yeast cell cycle regulated genes, this is especially useful for interactive examination of two clusterings.
[Equation 2]
[Equation 3]
[Equation 4]
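The LA score described above can be sketched in a few lines. This is an illustration only: it finds the optimal one-to-one pairing by brute-force enumeration rather than with the Gabow (1973) matching algorithm used by the authors, so it is practical only for a handful of clusters. The function name `linear_assignment_score` and the label-list input format are assumptions made for this example.

```python
from itertools import permutations

def linear_assignment_score(labels_a, labels_b):
    """Return (LA score, optimal pairs) for two clusterings of the same items.

    Brute-force sketch: every one-to-one pairing of classes is tried, so the
    cost grows factorially with the number of clusters.
    """
    classes_a = sorted(set(labels_a))
    classes_b = sorted(set(labels_b))
    # Sparse confusion matrix: counts[(a, b)] = items in class a of A and class b of B.
    counts = {}
    for a, b in zip(labels_a, labels_b):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    # Fix the smaller class set and permute the larger one, so each class
    # takes part in at most one pair.
    swapped = len(classes_a) > len(classes_b)
    small, large = (classes_b, classes_a) if swapped else (classes_a, classes_b)
    best_total, best_pairs = -1, []
    for chosen in permutations(large, len(small)):
        # Pairs are always reported as (class in A, class in B).
        pairs = [(l, s) if swapped else (s, l) for s, l in zip(small, chosen)]
        total = sum(counts.get(p, 0) for p in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    # Proportion of items that fall inside the optimally paired clusters.
    return best_total / len(labels_a), best_pairs
```

For example, with `labels_a = [0, 0, 0, 1, 1, 1]` and `labels_b = ["x", "x", "y", "y", "y", "y"]`, five of the six items lie in the paired clusters (0 with "x", 1 with "y"), and swapping the two arguments gives the same score, consistent with the symmetry noted above.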
Normalized mutual information
The NMI index (7) quantifies how much information is lost, on average, when one clustering is regenerated from a second classification (Equation 5). A noteworthy difference from LA is that NMI is asymmetric.
[Equation 5]
[Equation 6]
[Equation 7]
[Equation 8]
[Equation 9]
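To make the asymmetry concrete, here is a minimal sketch of one asymmetric NMI. It assumes, and this is an assumption because the equations are not recoverable from this copy, that the mutual information I(A,B) is normalized by the entropy H(A) of the first clustering. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a clustering given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two clusterings of the same items."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (c / n) * math.log2((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in joint.items()
    )

def nmi(labels_a, labels_b):
    """Assumed normalization: I(A,B) / H(A), which makes the score asymmetric."""
    h_a = entropy(labels_a)
    return mutual_information(labels_a, labels_b) / h_a if h_a > 0 else 0.0
```

With `a = [0, 0, 1, 1]` and `b = [0, 1, 2, 3]`, knowing `b` fully determines `a`, so `nmi(a, b)` is 1.0, whereas `a` recovers only half of the information in `b`, so `nmi(b, a)` is 0.5.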
EM MoDG clustering
EM MoDG was implemented with a diagonal covariance matrix model because the number of samples in the cell cycle dataset (1) was too small to fit a statistically valid full covariance matrix to each cluster (17). To ensure a near-optimal initialization, each EM MoDG result was selected as the best of 30 runs, each initialized by placing the initial cluster centroids on K randomly selected data points. The run with the best fit to the data (i.e. the highest log-likelihood) was used for the final clustering. Multiple best-of-30 runs were performed to verify that the quantitative measures and gene list results reported here did not vary significantly. The EM MoDG code used here was developed by the NASA/JPL Machine Learning Systems Group.
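The best-of-30 restart scheme can be illustrated with a small pure-Python EM for a mixture of diagonal Gaussians. This is a sketch and not the NASA/JPL code: the fixed iteration count, the unit initial variances, the variance floor, and the names `em_diag` and `best_of_n` are all assumptions chosen for the example.

```python
import math
import random

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def em_diag(data, k, rng, n_iter=50, var_floor=1e-3):
    """One EM run; centroids start on k randomly selected data points."""
    n, dim = len(data), len(data[0])
    means = [list(p) for p in rng.sample(data, k)]
    variances = [[1.0] * dim for _ in range(k)]  # assumed starting variances
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibilities and total log-likelihood (log-sum-exp).
        resp, loglik = [], 0.0
        for x in data:
            logs = [math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                    for j in range(k)]
            top = max(logs)
            norm = top + math.log(sum(math.exp(v - top) for v in logs))
            loglik += norm
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: update weights, means and per-dimension variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            for d in range(dim):
                m = sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
                v = sum(r[j] * (x[d] - m) ** 2 for r, x in zip(resp, data)) / nj
                means[j][d], variances[j][d] = m, max(v, var_floor)
    # Hard labels and log-likelihood both come from the final E-step.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return loglik, labels

def best_of_n(data, k, n_runs=30, seed=0):
    """Keep the run with the highest log-likelihood out of n_runs restarts."""
    rng = random.Random(seed)
    return max((em_diag(data, k, rng) for _ in range(n_runs)),
               key=lambda run: run[0])
```

Calling `best_of_n(data, k)` on a list of equal-length numeric tuples returns the log-likelihood and hard cluster labels of the best run. Repeating the call with different seeds is one way to check, as the authors did, that results are stable across best-of-30 selections.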
XclustAgglom
We agglomerate the hierarchical tree returned by Xclust (18) based on a maximal cluster size threshold. Starting from the root, any subtree smaller than the maximal cluster size threshold is agglomerated into a cluster. In order to work with the familiar parameter K (number of clusters), we iteratively find the size threshold that returns as close to K clusters as possible. In practice, this simple heuristic works best when K is overspecified by 2–4 times the expected number of clusters, because it will generate several very small (often singleton) clusters that are outliers to the core major clusters in the data.
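The size-threshold cut can be sketched as follows. The nested-tuple tree representation, the function names, and the simple linear scan over thresholds are assumptions for illustration. Xclust's actual output format and the authors' search procedure are not described in this section.

```python
def leaves(node):
    """Collect the leaf items under a node of a nested-tuple tree."""
    if not isinstance(node, tuple):
        return [node]
    collected = []
    for child in node:
        collected.extend(leaves(child))
    return collected

def agglomerate(node, max_size):
    """Walk down from the root; any subtree with fewer than max_size leaves
    becomes one cluster (a single leaf is always its own cluster)."""
    members = leaves(node)
    if not isinstance(node, tuple) or len(members) < max_size:
        return [members]
    clusters = []
    for child in node:
        clusters.extend(agglomerate(child, max_size))
    return clusters

def closest_to_k(tree, k):
    """Scan thresholds and keep the first one whose cluster count is closest to k."""
    n = len(leaves(tree))
    best = None
    for size in range(2, n + 2):
        clusters = agglomerate(tree, size)
        if best is None or abs(len(clusters) - k) < abs(len(best[1]) - k):
            best = (size, clusters)
    return best
```

For the toy tree `((("a", "b"), "c"), (("d", "e"), ("f", ("g", "h"))))`, requesting K = 3 selects a size threshold of 4 and returns the clusters `[a, b, c]`, `[d, e]` and `[f, g, h]`.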
Data preprocessing
Each microarray dataset was obtained from the cited authors. For the Cho et al. (1) data, we removed any gene that did not show a sustained absolute expression level of at least 8 for 30 consecutive minutes. For each gene vector, we then divided each time point measurement by the median expression value for the gene. For the data of Spellman et al. (16), we linearly interpolated missing values using the available adjacent time points. For both datasets, we log2 transformed the resulting gene expression matrices. The datasets were then annotated with the original clustering results as published.
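The preprocessing steps above might look like the following per-gene helpers. This is a sketch under stated assumptions: "sustained for 30 consecutive minutes" is interpreted as an unbroken run of time points at or above the level that spans at least 30 minutes, missing values are represented as `None`, and all function names are hypothetical.

```python
import math
from statistics import median

def sustained(values, times, level=8.0, minutes=30.0):
    """True if expression stays >= level over a run of time points
    spanning at least `minutes` (assumed reading of the filter)."""
    start = None
    for t, v in zip(times, values):
        if v >= level:
            if start is None:
                start = t
            if t - start >= minutes:
                return True
        else:
            start = None
    return False

def median_normalize(values):
    """Divide each measurement by the gene's median (assumed to be non-zero)."""
    m = median(values)
    return [v / m for v in values]

def interpolate_missing(values, times):
    """Linearly interpolate None entries from the nearest available neighbours;
    gaps at either end of the series are left as None."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        prev = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
        nxt = next((j for j in range(i + 1, len(values)) if values[j] is not None), None)
        if prev is not None and nxt is not None:
            frac = (times[i] - times[prev]) / (times[nxt] - times[prev])
            filled[i] = values[prev] + frac * (values[nxt] - values[prev])
    return filled

def log2_transform(values):
    """Log2-transform positive values, passing None through unchanged."""
    return [math.log2(v) if v is not None else None for v in values]
```

A gene from the first dataset would pass through `sustained`, `median_normalize` and `log2_transform` in that order, and a gene from the second through `interpolate_missing` and then `log2_transform`.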
Motif conserved enrichment score (MCS)
For each motif, we translated the IUPAC consensus (Swi5/Ace2: KGCTGR; MCB: ACGCGT; SCB: CACGAAA) into a position weight matrix (PWM) in which the probabilities or frequencies are determined by the degeneracy of the IUPAC symbol. We calculate a log-odds ratio for the PWM occurring at every position in the 1 kb upstream region, as described in Equation 10.
[Equation 10]
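A minimal sketch of the consensus-to-PWM translation and the position-by-position log-odds scan follows. Because Equation 10 is not recoverable from this copy, the details here are assumptions rather than the authors' formula: each allowed base gets probability 1/degeneracy, a small pseudocount avoids taking the log of zero, and the background is uniform (0.25 per base). The function names are hypothetical, and sequences are assumed to be uppercase A, C, G, T only.

```python
import math

# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_pwm(consensus, pseudocount=0.01):
    """One column per symbol: probability 1/degeneracy for allowed bases,
    plus an assumed pseudocount, renormalized to sum to 1."""
    pwm = []
    for symbol in consensus:
        allowed = IUPAC[symbol]
        column = {b: (1.0 / len(allowed) if b in allowed else 0.0) + pseudocount
                  for b in "ACGT"}
        total = sum(column.values())
        pwm.append({b: p / total for b, p in column.items()})
    return pwm

def log_odds_scan(sequence, pwm, background=0.25):
    """Log2-odds score of the PWM against a uniform background at every
    start position of the sequence."""
    width = len(pwm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(math.log2(column[base] / background)
                          for column, base in zip(pwm, window)))
    return scores
```

Scanning a sequence such as `"TTACGCGTAA"` with the MCB matrix built from `ACGCGT` gives its highest score at the position where the exact consensus begins.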