
A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data

Christopher E. Hart, Lucas Sharenbroich (1), Benjamin J. Bornstein (1), Diane Trout, Brandon King, Eric Mjolsness (2,3) and Barbara J. Wold*

Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA
(1) Jet Propulsion Laboratory, Machine Learning Systems Group, Pasadena, CA 91109, USA
(2) Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA 92697, USA
(3) School of Information and Computer Science, University of California Irvine, Irvine, CA 92697, USA

Nucleic Acids Research 2005 33(8):2580-2594. Open Access Article.

Abstract

Analysis of large-scale gene expression studies usually begins with gene clustering. A ubiquitous problem is that different algorithms applied to the same data inevitably give different results, and the differences are often substantial, involving a quarter or more of the genes analyzed. This raises a series of important but nettlesome questions: How are different clustering results related to each other and to the underlying data structure? Is one clustering objectively superior to another? Which differences, if any, are likely candidates to be biologically important? A systematic and quantitative way to address these questions is needed, together with an effective way to integrate and leverage expression results with other kinds of large-scale data and annotations. We developed a mathematical and computational framework to help quantify, compare, visualize and interactively mine clusterings. We show that by coupling confusion matrices with appropriate metrics (linear assignment and normalized mutual information scores), one can quantify and map differences between clusterings. A version of receiver operator characteristic analysis proved effective for quantifying and visualizing cluster quality and overlap. These methods, plus a flexible library of clustering algorithms, can be called from a new expandable set of software tools called CompClust 1.0 (http://woldlab.caltech.edu/compClust/). CompClust also makes it possible to relate expression clustering patterns to DNA sequence motif occurrences, protein–DNA interaction measurements and various kinds of functional annotations. Test analyses used yeast cell cycle data and revealed data structure not obvious under all algorithms. These results were then integrated with transcription motif and global protein–DNA interaction data to identify G1 regulatory modules.
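
To make the idea of coupling a confusion matrix with linear assignment and normalized mutual information (NMI) scores concrete, here is a minimal, illustrative sketch in plain Python. It is not the CompClust API: the function names, the toy labels, and the choice to normalize mutual information by the geometric mean of the two entropies are all assumptions made for this example, since the abstract does not specify the exact definitions used. The assignment step uses brute force over permutations, which is only practical for a handful of clusters; a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice for larger problems.

```python
from collections import Counter
from itertools import permutations
from math import log


def confusion_matrix(a, b):
    """Count items shared by each (cluster in a, cluster in b) pair."""
    rows, cols = sorted(set(a)), sorted(set(b))
    counts = Counter(zip(a, b))
    return [[counts[(r, c)] for c in cols] for r in rows]


def linear_assignment_score(matrix):
    """Fraction of items covered by the best one-to-one cluster pairing.

    Brute force (hypothetical helper): suitable for a few clusters only.
    """
    total = sum(map(sum, matrix))
    # Transpose if needed so that rows <= columns.
    if len(matrix) > len(matrix[0]):
        matrix = [list(col) for col in zip(*matrix)]
    n_rows, n_cols = len(matrix), len(matrix[0])
    best = max(
        sum(matrix[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(n_cols), n_rows)
    )
    return best / total


def normalized_mutual_information(matrix):
    """Mutual information divided by sqrt(H(rows) * H(cols)).

    The normalization is an assumption; other conventions exist.
    """
    total = sum(map(sum, matrix))
    row_p = [sum(row) / total for row in matrix]
    col_p = [sum(col) / total for col in zip(*matrix)]
    mutual = 0.0
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            if n:
                p = n / total
                mutual += p * log(p / (row_p[i] * col_p[j]))
    h_rows = -sum(p * log(p) for p in row_p if p)
    h_cols = -sum(p * log(p) for p in col_p if p)
    if h_rows == 0 or h_cols == 0:
        # Degenerate case: at least one clustering has a single cluster.
        return 1.0 if h_rows == h_cols else 0.0
    return mutual / (h_rows * h_cols) ** 0.5


# Toy labels for eight genes under two hypothetical clusterings.
labels_a = [0, 0, 0, 1, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 0, 0, 2, 2]
matrix = confusion_matrix(labels_a, labels_b)
print(matrix)                           # [[1, 2, 0], [3, 0, 0], [0, 0, 2]]
print(linear_assignment_score(matrix))  # 0.875
print(normalized_mutual_information(matrix))
```

In this toy case the best pairing matches seven of the eight genes, so one gene is assigned differently by the two clusterings. The off-diagonal cells of the confusion matrix show which cluster pair that disagreement falls into, which is the kind of mapping the abstract describes.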


