A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data
Christopher E. Hart,
Benjamin J. Bornstein1,
Eric Mjolsness2,3 and
Barbara J. Wold*
Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA
1Jet Propulsion Laboratory, Machine Learning Systems Group, Pasadena, CA 91109, USA
2Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA 92697, USA
3School of Information and Computer Science, University of California Irvine, Irvine, CA 92697, USA
Nucleic Acids Research 2005 33(8):2580-2594. Open Access Article.
Analysis of large-scale gene expression studies usually begins with gene clustering. A ubiquitous problem is that different algorithms applied to the same data inevitably give different results, and the differences are often substantial, involving a quarter or more of the genes analyzed. This raises a series of important but nettlesome questions: How are different clustering results related to each other and to the underlying data structure? Is one clustering objectively superior to another? Which differences, if any, are likely candidates to be biologically important? A systematic and quantitative way to address these questions is needed, together with an effective way to integrate and leverage expression results with other kinds of large-scale data and annotations. We developed a mathematical and computational framework to help quantify, compare, visualize and interactively mine clusterings. We show that by coupling confusion matrices with appropriate metrics (linear assignment and normalized mutual information scores), one can quantify and map differences between clusterings. A version of receiver operator characteristic analysis proved effective for quantifying and visualizing cluster quality and overlap. These methods, plus a flexible library of clustering algorithms, can be called from a new expandable set of software tools called CompClust 1.0 (http://woldlab.caltech.edu/compClust/). CompClust also makes it possible to relate expression clustering patterns to DNA sequence motif occurrences, protein–DNA interaction measurements and various kinds of functional annotations. Test analyses used yeast cell cycle data and revealed data structure not obvious under all algorithms. These results were then integrated with transcription motif and global protein–DNA interaction data to identify G1 regulatory modules.
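As an illustration of the confusion-matrix comparison mentioned in the abstract, the following minimal Python sketch builds a confusion matrix from two clusterings of the same genes and derives a linear assignment score and a normalized mutual information (NMI) score from it. It is not the CompClust API: the function names are hypothetical, the linear assignment step uses a brute-force search that is practical only for a handful of clusters, and the NMI is normalized by the geometric mean of the two entropies, which is one common convention and may differ from the normalization used in the paper.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def confusion_matrix(a, b):
    """Count genes shared by each (cluster in a, cluster in b) pair.

    a and b are equal-length lists of cluster labels, one per gene.
    Rows follow the sorted labels of a, columns the sorted labels of b.
    """
    rows, cols = sorted(set(a)), sorted(set(b))
    counts = Counter(zip(a, b))
    return [[counts[(r, c)] for c in cols] for r in rows]


def linear_assignment_score(cm):
    """Fraction of genes on the best one-to-one pairing of clusters.

    Illustrative brute force over all permutations; a real analysis
    would use a polynomial-time assignment solver instead.
    """
    total = sum(map(sum, cm))
    n_rows, n_cols = len(cm), len(cm[0])
    size = max(n_rows, n_cols)
    # Pad with zeros to a square matrix so every permutation is a pairing.
    square = [
        [cm[i][j] if i < n_rows and j < n_cols else 0 for j in range(size)]
        for i in range(size)
    ]
    best = max(
        sum(square[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )
    return best / total


def normalized_mutual_information(cm):
    """Mutual information divided by sqrt(H(A) * H(B)) (assumed convention)."""
    total = sum(map(sum, cm))
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(col) for col in zip(*cm)]
    mi = sum(
        (x / total) * log(x * total / (row_tot[i] * col_tot[j]))
        for i, row in enumerate(cm)
        for j, x in enumerate(row)
        if x
    )
    h_a = -sum(t / total * log(t / total) for t in row_tot if t)
    h_b = -sum(t / total * log(t / total) for t in col_tot if t)
    denom = sqrt(h_a * h_b)
    # If either clustering has a single cluster, the score is undefined;
    # this sketch returns 0.0 in that case.
    return mi / denom if denom else 0.0


# Toy example: six genes clustered by two hypothetical algorithms.
clustering_a = [0, 0, 0, 1, 1, 1]
clustering_b = [1, 1, 0, 0, 0, 0]
cm = confusion_matrix(clustering_a, clustering_b)
la = linear_assignment_score(cm)
nmi = normalized_mutual_information(cm)
```

In the toy example one gene changes partners between the two clusterings, so the best pairing covers five of the six genes. Both scores reach 1 for identical partitions, and the NMI falls to 0 when the two clusterings are statistically independent.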