
- A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data

A key step in analyzing most large-scale gene expression studies is clustering or otherwise grouping gene expression data vectors and conditions (individual RNA samples or replicates) into sets that contain members more similar to each other than to the remainder of the data. To do this, biologists now have at their disposal a wide range of computational techniques including supervised and unsupervised machine learning algorithms and various heuristics, such as k-means, phylogenic-like hierarchical ordering and clustering, Expectation Maximization of Mixture models, self-organizing maps, support vector machines, Fourier analysis and more (1–6). Their purpose in all cases is to detect underlying relationships in the data, but different algorithms applied to a given dataset typically deliver only partly concordant results. As we show below, it is common to find 20–40% of genes from a high-quality dataset classified differently by two algorithms. These differences can be quite meaningful for a first-pass analysis, in which candidate genes will be selected based on their expression pattern for further detailed study. But clustering classifications are also increasingly important, not as results on their own, but as a key preprocessed input to higher level integrative modeling, such as gene network inference. Clustering results are also becoming important as gene annotations for interpreting entirely different kinds of data. For example, classification of genes as being ‘cell cycle regulated in G1 phase’ has become part of major databases based on a specific clustering. If such annotations are uncertain or simply incorrect, the uncertainty or errors then ramify through future uses of the data.

The sources of difference between clustering algorithm outputs are many and varied, and the biological implications are also diverse, as illustrated below for cell cycle data. The general challenge is to detect, measure, evaluate and mine the commonalities and differences. Specifically, the biologist usually wants to first know whether one clustering is objectively and significantly ‘better’ than another, and just how big the difference is. If two clusterings are of similar overall quality, yet differ substantially from each other as is often observed, then which specific gene clusters or samples harbor the greatest differences, or are the differences evenly distributed across the data? At a finer level still, which genes are being assigned to different clusters and why? Importantly, do the distinctions between clusterings highlight properties of biological importance?

To begin to answer such questions, we first needed a way to make systematic quantitative comparisons and then we needed ways to effectively mine the resulting comparisons. We use confusion matrices as the common tool for these comparisons (see below and Methods). A confusion matrix effectively summarizes pairwise intersections between clusters derived from two clustering results. These similarities are quantified by applying scoring functions to the confusion matrix. In this work, we use two different scoring functions for this purpose: (i) normalized mutual information (NMI), which measures the amount of information shared between the two clustering results (7); and (ii) a linear assignment (LA) method, which quantifies the similarity of two clusterings by finding the optimal pairing of clusters between two clustering results and measuring the degree of agreement across this pairing (8,9). Previous studies have used metrics for evaluating the total number of data point pairs grouped together between two different clusterings to begin to address the need for quantifying overall differences (10–13). Ben-Hur et al. (13) used this to help determine an optimal number of clusters (K) and to assess the overall validity of a clustering. These prior techniques did not, however, offer the capacity to isolate and inspect the similarities and differences between two different clusterings, nor did they provide an interactive interface for biology users that would permit them to usefully capture the comparative differences and similarities. We also introduce a new application of receiver operator characteristic (ROC) analysis (14,15). As we use it here, ROC enables one to quantify the distinctness of a given cluster relative to another cluster or relative to all non-cluster members. Implemented in this fashion, ROC provides another measure of local cluster quality and shape, and provides another tool for quantitatively dissecting a cluster.
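
The confusion matrix and the two scoring functions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CompClust implementation: the function names and the toy labels are hypothetical, the NMI is normalized by the geometric mean of the two entropies (one common convention, which may differ from the normalization used in this work), and the linear assignment step uses a brute-force search over cluster pairings, which is practical only for a handful of clusters.

```python
from collections import Counter
from itertools import permutations
from math import log


def confusion_matrix(labels_a, labels_b):
    """Count the items shared by every pair of clusters (one from A, one from B)."""
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    counts = Counter(zip(labels_a, labels_b))
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix


def nmi(matrix):
    """Mutual information divided by the geometric mean of the two entropies."""
    n = sum(map(sum, matrix))
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(col) for col in zip(*matrix)]
    mi = 0.0
    for i, row in enumerate(matrix):
        for j, n_ij in enumerate(row):
            if n_ij:
                mi += (n_ij / n) * log(n_ij * n / (row_tot[i] * col_tot[j]))
    h_a = -sum(t / n * log(t / n) for t in row_tot if t)
    h_b = -sum(t / n * log(t / n) for t in col_tot if t)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case (a clustering with a single cluster): assumed convention.
        return 1.0 if h_a == h_b else 0.0
    return mi / (h_a * h_b) ** 0.5


def linear_assignment_score(matrix):
    """Fraction of items that fall on the best one-to-one pairing of clusters."""
    size = max(len(matrix), len(matrix[0]))
    # Pad with empty clusters so that both clusterings have the same cluster count.
    padded = [list(row) + [0] * (size - len(row)) for row in matrix]
    padded += [[0] * size for _ in range(size - len(matrix))]
    n = sum(map(sum, matrix))
    best = max(
        sum(padded[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )
    return best / n


# Hypothetical toy data: eight genes labeled by two different clusterings.
labels_a = ["G1", "G1", "G1", "S", "S", "S", "M", "M"]
labels_b = [0, 0, 1, 1, 1, 1, 2, 2]

rows, cols, matrix = confusion_matrix(labels_a, labels_b)
print(rows, cols, matrix)
print("NMI:", nmi(matrix))
print("LA score:", linear_assignment_score(matrix))
```

Off-diagonal counts in the matrix (after the optimal pairing) identify the genes that the two clusterings place differently, which is the kind of local difference the scoring functions summarize. For more than a few clusters, a polynomial-time assignment solver such as the Hungarian algorithm would replace the brute-force search.
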
Though the methods and tools were worked out for clusterings of large-scale gene expression data, they are applicable to clusterings of other kinds of large-scale data as well.
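
The ROC-based measure of cluster distinctness can be illustrated in a similar way. The sketch below assumes that each expression vector is scored by its Euclidean distance to the mean vector of the cluster under study; this section does not specify the score, so that choice, the function names and the toy vectors are all assumptions made for illustration. The area under the ROC curve is computed through its equivalent rank form: the probability that a randomly chosen cluster member lies closer to the cluster mean than a randomly chosen non-member.

```python
from math import dist


def centroid(vectors):
    """Mean expression vector of a cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def roc_auc(member_scores, other_scores):
    """Area under the ROC curve, where lower scores indicate membership.

    Computed as the fraction of (member, non-member) pairs in which the
    member has the lower score; ties count as half.
    """
    wins = 0.0
    for m in member_scores:
        for o in other_scores:
            if m < o:
                wins += 1.0
            elif m == o:
                wins += 0.5
    return wins / (len(member_scores) * len(other_scores))


def cluster_distinctness(members, others):
    """ROC area for separating a cluster from other vectors by distance to its mean."""
    center = centroid(members)
    member_scores = [dist(center, v) for v in members]
    other_scores = [dist(center, v) for v in others]
    return roc_auc(member_scores, other_scores)


# Hypothetical toy expression vectors (three conditions per gene).
cluster = [[1.0, 2.0, 1.0], [1.2, 2.1, 0.9], [0.9, 1.8, 1.1]]
far_away = [[-1.0, -2.0, -1.0], [-1.1, -1.9, -0.8]]
overlapping = [[1.0, 2.0, 1.0], [-1.0, -2.0, -1.0]]

print("well separated:", cluster_distinctness(cluster, far_away))
print("overlapping:", cluster_distinctness(cluster, overlapping))
```

An area close to 1 indicates a cluster whose members are clearly closer to its mean than the comparison set is, while values near 0.5 indicate that the two sets are hard to tell apart by this score. The comparison set can be another single cluster or all non-members, matching the two uses described above.
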

We have integrated the algorithms and comparative tools into an interactive analysis package collectively called CompClust 1.0. CompClust enables a user to organize, interrogate and visualize the comparisons. In addition to comparative cluster analysis, an important feature of this software is that it establishes and maintains links between the outputs of clustering analyses and the primary expression data, and, critically, with all other desired annotations. In the sense used here, ‘annotations’ include other kinds of primary and metadata of diverse types. This gives a biologist crucial flexibility in data mining and permits analyses that integrate results from other kinds of experiments, such as global protein–DNA interactions (ChIP/Array), protein–protein interactions, comparative genome analysis or information from gene ontologies.

CompClust methods and tools are agnostic about the kinds of microarray data (ratiometric, Affymetrix, etc.) and types of clustering algorithms used. We demonstrate the tools by analyzing two different sets of yeast cell cycle expression data representing both major data platforms, clustered by four different methods: a statistical clustering algorithm [Expectation Maximization of a Mixture of Diagonal Gaussian distributions (EM MoDGs)] (this work), a human-driven heuristic (1), a Fourier transform algorithm designed to take advantage of periodic time-course patterns (16) and an agglomerative version of the Xclust phylogenetic ordering algorithm [Eisen et al. (2) modified in this work]. We show that gene groups derived from these comparative analyses can be integrated with data on evolutionarily conserved transcription factor binding sites to infer regulatory modules. These results begin to illustrate how a more quantitative and nuanced understanding of both global and local features in the data can be achieved, and how these can be linked with diverse kinds of data types to infer connectivity between regulators and their target gene modules.

Updated on: 3 Nov 2008