Figures
- A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data

Figure 1. Comparing two clustering results using a confusion array. Shown in this comparison is a supervised clustering result published in the original study by Cho et al. (1) and results from running an unsupervised clustering (EM MoDG, see Methods) on the same Affymetrix microarray dataset profiling yeast gene expression through two cell cycles. The confusion array is composed of a grid of summary plots. Each summary plot displays the mean (blue color or solid line) expression level of a group of genes as well as the standard deviation (red color or dashed line). Summary plots with a white background represent clusters from either the Cho et al. (1) clustering result (along the rightmost column) or the EM MoDG clustering result (along the top row); cluster names are in the lower right corner, and the number of genes in each cluster is displayed in the upper left corner. Summary plots with a colored background represent cells within the confusion array (see Methods), where each cell represents the intersection set of genes shared by a Cho et al. (1) cluster and an EM MoDG result cluster. Again, the upper left-hand corner displays the number of genes within a confusion matrix cell. The background of each plot is colored according to a heat-map (scale below) that registers the proportion of genes in the cell relative to the corresponding cluster in the EM MoDG result. Intersection cells with dark outlines indicate the optimal pairings between the two data partitions, as determined from the LA calculation (Equation 2). Quantitative measures of overall similarity between the two clustering results using both LA and NMI are displayed in the graph title (see Methods).
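The quantities behind the confusion array of Figure 1 (intersection counts, an optimal cluster pairing and an NMI score) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the brute-force search over permutations is a stand-in for a proper linear assignment (LA) solver and is practical only for a handful of clusters, the LA score is assumed to be the fraction of genes falling in paired cells, and the NMI normalization by the geometric mean of the two entropies is one common convention that may differ from the definition in Methods.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def confusion_matrix(labels_a, labels_b):
    """Count the genes shared by each pair of clusters (one cell per pair)."""
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    counts = Counter(zip(labels_a, labels_b))
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix


def best_pairing(matrix):
    """Brute-force stand-in for LA: pair clusters to maximize shared genes."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows <= n_cols:
        best = max(permutations(range(n_cols), n_rows),
                   key=lambda p: sum(matrix[i][p[i]] for i in range(n_rows)))
        return [(i, best[i]) for i in range(n_rows)]
    best = max(permutations(range(n_rows), n_cols),
               key=lambda p: sum(matrix[p[j]][j] for j in range(n_cols)))
    return [(best[j], j) for j in range(n_cols)]


def la_score(matrix):
    """Assumed LA score: fraction of all genes that fall in paired cells."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][j] for i, j in best_pairing(matrix)) / total


def normalized_mutual_information(matrix):
    """Assumed NMI: I(A;B) / sqrt(H(A) * H(B)), computed from cell counts."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    mutual = 0.0
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            if n:
                mutual += (n / total) * log(n * total / (row_sums[i] * col_sums[j]))
    h_a = -sum((s / total) * log(s / total) for s in row_sums if s)
    h_b = -sum((s / total) * log(s / total) for s in col_sums if s)
    if h_a == 0 or h_b == 0:
        return 0.0  # a single-cluster partition carries no information
    return mutual / sqrt(h_a * h_b)


# Toy labels: the same six genes partitioned identically under different names.
rows, cols, matrix = confusion_matrix(
    ["G1", "G1", "S", "S", "M", "M"],
    ["c2", "c2", "c1", "c1", "c3", "c3"],
)
pairs = [(rows[i], cols[j]) for i, j in best_pairing(matrix)]
```

In this toy case every gene sits in a paired cell, so both scores reach their maximum; two partitions that split the genes independently of each other would drive NMI toward zero.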

Figure 2. Comparing two clustering results on a ratiometric microarray dataset using a confusion array. Shown in this comparison is a Fourier clustering result published in the original study by Spellman et al. (16) and results from running an unsupervised clustering (Xclustagglom, see Methods) on the same ratiometric microarray dataset used for the Fourier analysis. Details of the figure layout are discussed in the legend of Figure 1. Here, the 5 Fourier clusters are shown along the rows, while the 10 Xclustagglom clusters are displayed across the columns.

Figure 3. Example ROC curves to assess cluster overlap. An ROC curve (B and D, left side) is drawn as a function of moving outward from a cluster center and counting the proportion of cluster members (blue points) encountered along the y-axis versus the proportion of non-cluster members (red points) encountered along the x-axis. The collection of distances from every point within a cluster and every point outside a cluster is binned and used to create the distance histograms (B and D, right side). The distance histogram for cluster members is shown in red and that for cluster non-members in blue. Two extreme cases are exemplified in this figure. (A) Example expression data falling into two completely discrete clusters highlighted in red and blue. (B) The corresponding ROC curve (left) and distance histograms (right) for the sample data shown in (A). Note that since all cluster members are encountered before any non-cluster members, the area under the ROC curve is 1.0. The distance histograms also show this perfect separation. (C) Example expression data falling into two completely overlapping clusters highlighted in red and blue. (D) The corresponding ROC curve (left) and distance histograms (right) for the sample data shown in (C). Note that since cluster members and non-cluster members are encountered at an equal rate as a function of distance from the cluster center, the ROC curve approximates the line x = y and the area under the ROC curve is 0.5. This overlap is also highlighted in the distance histograms, because the distribution of distances for cluster members completely overlaps with that for non-cluster members.
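The ROC construction described in the legend of Figure 3 can be sketched as follows. This is a minimal illustration, assuming the distances of every gene to the cluster center have already been computed with whatever metric is in use; the function names are hypothetical and the trapezoidal rule for the area is an implementation choice made here, not something stated in the legend.

```python
def roc_points(member_distances, nonmember_distances):
    """Sweep a distance threshold outward from the cluster center.

    At each threshold, x is the fraction of non-members already encountered
    and y is the fraction of cluster members already encountered.
    """
    n_in, n_out = len(member_distances), len(nonmember_distances)
    thresholds = sorted(set(member_distances) | set(nonmember_distances))
    points = [(0.0, 0.0)]
    for t in thresholds:
        x = sum(d <= t for d in nonmember_distances) / n_out
        y = sum(d <= t for d in member_distances) / n_in
        points.append((x, y))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Case (A)/(B): every member lies closer to the center than any non-member.
separated = roc_area(roc_points([0.1, 0.2, 0.3], [0.8, 0.9]))

# Case (C)/(D): members and non-members share the same distance distribution.
overlapping = roc_area(roc_points([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))
```

The two toy inputs reproduce the extreme cases in the legend: a fully separated cluster gives an area of 1.0, and a fully overlapping one gives an area of 0.5.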

Figure 4. ROC analysis of the S-phase cluster of Cho et al. (1). (A) ROC curve (left) shows the overlap between this cluster of 74 genes and genes from all other clusters in the time-course analysis [383 genes in total, selected by inspection by Cho et al. (1) for cycling behavior]. The area under the ROC curve is 0.82. The area under the curve highlighted in green demonstrates selection of the S-phase genes that overlap least with other clusters. At the shown distance threshold, 66 genes from the Cho-determined S-phase cluster are selected, and the overlap with only non-S-phase genes. (A) Right: correlation distance histograms illustrating the distribution of distances to the center of the S-phase cluster for non-cluster members (bottom/blue) and for all S-phase cluster members (top/red). (B) Expression trajectories for the 74 genes in the S-phase cluster, highlighting in green the cluster members represented by the green highlight in (A). (C) Expression trajectories for all genes outside the S-phase cluster of the Cho clustering, highlighting in green the non-cluster members represented by the green highlight in (A).

Figure 5. PCA, ROC plots and trajectory summary views of clusters from the Cho classification and an unsupervised clustering (EM MoDG) of an Affymetrix yeast cell cycle time course (1). The top panel for each clustering result shows cluster means projected into the top two dimensions of the principal component space defined by the expression data (capturing 64% of the variance). The area of the marker for each cluster is proportional to the number of genes in that cluster. Below are ROC curves (left) and trajectory summaries (right) for each cluster. The trajectory summaries display every gene's expression profile within a cluster as a blue line with time along the x-axis and expression along the y-axis. The red line within each trajectory summary represents the mean expression level for the cluster. ROC area values are displayed within the ROC curve for each cluster. The background colors for the trajectory summaries and the PCA projection have been matched within each clustering result. In addition, LA was used to find the optimal mapping of clusters between the Cho classification and the EM MoDG result, and the colors have been set accordingly.

Figure 6. PCA, ROC plots and trajectory summary views of clusters from the Fourier classification and an unsupervised clustering (Xclustagglom) result from the ratiometric yeast cell cycle time course (16). Details of the figure layout are the same as for Figure 5. Only the six largest clusters are shown for the Xclustagglom result. Clusters that do not have an optimal pairing by LA with a Fourier cluster are colored black. Note that the PCA summary calls attention to the low quality of the S/XC3 pairing, placing it between XC5 and XC1.

Figure 7. Selected confusion array cells from Figure 1 highlighting cluster membership differences for genes with peak expression during the G1 and S phases of the cell cycle. The trajectory summaries display an expression profile for every gene with time along the x-axis and expression along the y-axis. Blue trajectory summaries show parent clustering results (EM MoDG along the columns, the Cho classification along the rows). Intersection cells from the confusion array are shown in red. Mean vectors for each gene set are shown in black with error bars proportional to the standard deviation. The total number of gene expression vectors in each cell is shown in parentheses. (A) G1 genes are subdivided differently by the two algorithms. EM MoDG separates genes upregulated only during the second phase of the cell cycle from those upregulated during both the first and second cycles. The Cho classification separates G1 based primarily on peak time in the second cycle. Figure 8 illustrates that these observed kinetic distinctions result from these genes belonging to distinct regulatory modules. (B) Detailed comparison from the confusion array of Figure 1 showing that the S-phase cluster of the Cho classification is subdivided nearly equally between the EM2 and EM3 (optimal match by LA) clusters.

Figure 8. Integrating expression data, regulatory motif conservation and protein–DNA binding information. (A) Binding site enrichment in genes from the four confusion matrix cells of Figure 7 that dissect genes in the G1 cell cycle phase. Shown in red is the observed number of genes with an MCS score above threshold for each motif. Shown in blue is the number of genes expected by chance, as computed by bootstrap simulations. The total number of genes each cell contains is in the upper left. (B–D) Heat-map displays showing expression data on the left, followed by MCS scores for a specified motif, followed by in vivo protein–DNA binding data for transcription factors implicated in binding to the specified consensus. Color scales for each panel are at the bottom of the figure. For the MCS scores, the color map ranges from 0 to the 99th percentile to minimize the influence of extreme outliers on interpretation. (B) Shown are 14 genes that fall within the EM1/Early G1 intersection cell and have a conserved enrichment in the presence of the SWI5 consensus as measured by MCS scores (see Methods; Equations 4–9). (C) Shown are 79 genes that fall within the EM2/Late G1 intersection cell and have a high MCS score for MCB. (D) Shown are 20 genes that fall within the EM2/Late G1 intersection cell and have a high MCS score for SCB. In each heat-map, genes are ordered by decreasing MCS score. Significant correlation can be seen between a high MCS score, protein–DNA binding and the expected expression pattern.
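The chance expectation plotted in blue in panel (A) of Figure 8 can be approximated with a simple resampling scheme. The legend does not describe the bootstrap procedure, so everything below is an assumption made for illustration: gene sets of the same size as the cell are drawn with replacement from a genome-wide list of MCS scores, and the expected count is the average number of drawn scores above threshold. The function name, parameters and toy scores are all hypothetical.

```python
import random


def expected_count_by_bootstrap(all_scores, set_size, threshold,
                                n_rounds=1000, seed=0):
    """Average number of above-threshold scores in random gene sets.

    Assumed procedure: each round draws `set_size` scores with replacement
    from `all_scores` and counts how many exceed `threshold`.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0
    for _ in range(n_rounds):
        sample = rng.choices(all_scores, k=set_size)
        total += sum(score > threshold for score in sample)
    return total / n_rounds


# Toy background in which half of all genes score above the threshold.
background = [0.0] * 50 + [1.0] * 50
expected = expected_count_by_bootstrap(background, set_size=10, threshold=0.5)
```

An observed count well above this expectation for a given cell and motif is what the red bars in panel (A) are meant to convey; assessing significance would additionally require the spread of the bootstrap counts, which this sketch does not report.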
