|
Figure 1
Comparing two clustering results using a confusion array. Shown in this
comparison is a supervised clustering result published in the original
study by Cho et al. (1) and results from running an
unsupervised clustering (EM MoDG, see Methods) on the same Affymetrix
microarray dataset profiling yeast gene expression through two cell
cycles. The confusion array is composed of a grid of summary plots.
Each summary plot displays the mean (blue color or solid line)
expression level of a group of genes as well as the standard deviation
(red color or dashed line). Summary plots with a white background
represent clusters from either the Cho et al. (1) clustering
result (along the rightmost column) or the EM MoDG clustering result
(along the top row); cluster names are in the lower right corner; and
the number of genes in each cluster is displayed in the upper left
corner. Summary plots with a colored background represent cells within
the confusion array (see Methods), where each cell represents the
intersection set of genes that are in common between the Cho et al.
(1) cluster and the EM MoDG result cluster. Again, the upper left-hand
corner displays the number of genes within a confusion array cell. The
background of each plot is colored according to a heat-map (scale
below) that registers the proportion of genes in the cell
relative to the corresponding cluster in the EM MoDG result.
Intersection cells with dark outlines indicate the optimal pairings
between the two data partitions, as determined from the LA calculation
(Equation 2). Quantitative measures of overall similarity between the
two clustering results using both LA and NMI are displayed in the graph
title (see Methods).
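As an illustration of how the cells of such a confusion array can be tabulated, the following is a minimal sketch, not the authors' implementation. It assumes each clustering is given as a list of per-gene labels in the same gene order; the function names and the toy labels are hypothetical. Because Equation 2 is not reproduced in this legend, the pairing step simply assumes that the optimal one-to-one pairing is the one maximizing the total number of shared genes, found here by brute force (practical only for a handful of clusters).

```python
from collections import Counter
from itertools import permutations


def confusion_array(labels_a, labels_b):
    """Count the genes shared by each (row cluster, column cluster) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same genes")
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    counts = Counter(zip(labels_a, labels_b))
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix


def column_proportions(matrix):
    """Fraction of each column cluster's genes that falls in each cell."""
    col_totals = [sum(col) for col in zip(*matrix)]
    return [[cell / total for cell, total in zip(row, col_totals)]
            for row in matrix]


def best_pairing(matrix):
    """Brute-force one-to-one pairing maximizing total shared genes.

    Assumption: this objective stands in for the LA score of the paper.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows <= n_cols:
        candidates = (list(zip(range(n_rows), p))
                      for p in permutations(range(n_cols), n_rows))
    else:
        candidates = (list(zip(p, range(n_cols)))
                      for p in permutations(range(n_rows), n_cols))
    best_score, best_pairs = -1, []
    for pairs in candidates:
        score = sum(matrix[r][c] for r, c in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs, best_score


# Toy example: six genes labelled by two hypothetical clusterings.
row_labels = ["G1", "G1", "G1", "S", "S", "M"]
col_labels = ["EM1", "EM1", "EM2", "EM2", "EM2", "EM3"]

rows, cols, matrix = confusion_array(row_labels, col_labels)
props = column_proportions(matrix)      # values that would drive the heat-map
pairs, score = best_pairing(matrix)     # cells that would get dark outlines
```

In this sketch `matrix[i][j]` corresponds to the gene count shown in the upper left corner of a cell, and `props[i][j]` to the heat-map value for that cell.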
|
|
Figure 2
Comparing two clustering results on a ratiometric microarray dataset
using a confusion array. Shown in this comparison is a Fourier
clustering result published in the original study by Spellman et al.
(16) and results from running an unsupervised clustering (Xclustagglom,
see Methods) on the same ratiometric microarray dataset used for the Fourier
analysis. Details of the figure layout are discussed in the
legend of Figure 1. Here, the 5 Fourier clusters are shown along the
rows, while the 10 Xclustagglom clusters are displayed across the
columns.
|
|
Figure 3
Example ROC curves to assess cluster overlap. An ROC curve (B and D,
left side) is drawn as a function of moving outward from a cluster
center and counting the proportion of cluster members (blue points)
encountered along the y-axis versus the proportion of non-cluster members (red points) encountered along the x-axis.
The distances to the cluster center from every point within the cluster and every
point outside the cluster are binned and used to create the distance
histograms (B and D, right side). The distance histogram for cluster
members is shown in red, and that for non-cluster members is shown in
blue. Two extreme cases are exemplified in this figure. (A) Example expression data falling into two completely discrete clusters highlighted in red and blue. (B)
The corresponding ROC curve (left) and distance histograms (right) for
the sample data shown in (A). Note that since all cluster members are
encountered before any non-cluster members the area under the ROC curve
is 1.0. The distance histograms also show this perfect separation. (C) Example expression data falling into two completely overlapping clusters highlighted in red and blue. (D)
The corresponding ROC curve (left) and distance histograms (right) for
the sample data shown in (C). Note that since cluster members and
non-cluster members are encountered at an equal rate as a function of
distance from the cluster center, the ROC curve approximates the line x = y
and the area under the ROC curve is 0.5. This overlap is also
evident in the distance histograms, where the distribution of
distances for cluster members completely overlaps the
distribution of distances for non-cluster members.
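The construction described in this legend can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it takes precomputed distances to the cluster center as plain lists (the distance metric itself is defined in the paper's Methods and is not reproduced here), and the function names and toy values are hypothetical.

```python
def roc_points(member_dists, nonmember_dists):
    """Sweep a distance threshold outward from the cluster center.

    Returns (x, y) points, where x is the proportion of non-cluster members
    and y the proportion of cluster members within the threshold.
    """
    thresholds = sorted(set(member_dists) | set(nonmember_dists))
    points = [(0.0, 0.0)]
    for t in thresholds:
        y = sum(d <= t for d in member_dists) / len(member_dists)
        x = sum(d <= t for d in nonmember_dists) / len(nonmember_dists)
        points.append((x, y))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Case like (A)/(B): every member is closer to the center than any non-member.
separated = roc_area(roc_points([0.1, 0.2, 0.3], [0.8, 0.9, 1.0]))

# Case like (C)/(D): members and non-members have identical distance distributions.
overlapping = roc_area(roc_points([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]))
```

With these toy distances the first case gives an area of 1.0 and the second an area of 0.5, matching the two extremes described in the legend.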
|
|
Figure 4
ROC analysis of the S-phase cluster of Cho et al. (1). (A)
ROC curve (left) shows the overlap between this cluster of 74 genes and
genes from all other clusters in the time-course analysis [383 genes in
total, selected by inspection by Cho et al. (1) for cycling
behavior]. The area under the ROC curve is 0.82. The area under the
curve highlighted in green marks the selection of S-phase genes
that overlap least with other clusters. At the shown distance
threshold, 66 genes from the Cho-determined S-phase cluster are
selected, and the overlap with only non-S-phase genes. (A)
Right: correlation distance histograms illustrating the distribution of
distances to the center of the S-phase cluster for non-cluster members
(bottom/blue) and for all S-phase cluster members (top/red). (B)
Expression trajectories for the 74 genes in the S-phase cluster,
with the cluster members represented by the green
highlight in (A) shown in green. (C) Expression trajectories for all genes
outside the S-phase cluster of the Cho clustering, with the
non-cluster members represented by the green highlight in (A) shown in green.
|
|
Figure 5
PCA, ROC plots and trajectory summary views of clusters from the Cho
classification and an unsupervised clustering (EM MoDG) of an
Affymetrix yeast cell cycle time course (1). The top panel for each
clustering result shows cluster means projected into the top two
dimensions of the principal component space defined by the expression
data (capturing 64% of the variance). The marker area for
each cluster is proportional to the number of genes in that cluster.
Below are ROC curves (left) and trajectory summaries (right) for each
cluster. The trajectory summaries display every gene's expression
profile within a cluster as a blue line with time along the x-axis and expression along the y-axis.
The red line within each trajectory summary represents the mean
expression level for the cluster. ROC area values are displayed within
the ROC curve for each cluster. The background colors for the
trajectory summaries and the PCA projection have been matched within
each clustering result. In addition, LA was used to find the optimal
mapping of clusters between the Cho classification and the EM MoDG
result and the colors have been set accordingly.
|
|
Figure 6
PCA, ROC plots and trajectory summary views of clusters from the
Fourier classification and an unsupervised clustering (Xclustagglom)
of the ratiometric yeast cell cycle time course (16). Details
of the figure layout are the same as for Figure 5. Only the six largest
Xclustagglom clusters are shown. Clusters that do not have an
optimal pairing by LA with a Fourier cluster are colored black. Note
that the PCA summary calls attention to the low quality of the S/XC3
pairing, placing it between XC5 and XC1.
|
|
Figure 7
Selected confusion array cells from Figure 1 highlighting cluster
membership differences for genes with peak expression during the G1 and S phases of the cell cycle. The trajectory summaries display an expression profile for every gene with time along the x-axis and expression along the y-axis.
Blue trajectory summaries show parent clustering results (EM MoDG along
the columns, the Cho classification along the rows). Intersection cells
from the confusion array are shown in red. Mean vectors for each gene
set are shown in black with error bars proportional to the standard
deviation. The total number of gene expression vectors in each cell is
shown in parentheses. (A) G1 genes are subdivided
differently by the two algorithms. EM MoDG separates genes upregulated
only during the second phase of the cell cycle from those upregulated
during both the first and second cycles. The Cho classification
separates G1 genes based primarily on peak time in the second
cycle. Figure 8 illustrates that these observed kinetic distinctions
result from these genes belonging to distinct regulatory modules. (B)
Detailed comparison from the confusion array of Figure 1 showing that the
S-phase cluster of the Cho classification is subdivided nearly equally
between the EM2 and EM3 (optimal match by LA) clusters.
|
|
Figure 8
Integrating expression data, regulatory motif conservation and protein–DNA binding information. (A) Binding site enrichment in genes from the four confusion matrix cells of Figure 7 that dissect genes in the G1
cell cycle phase. Shown in red is the observed number of genes with an
MCS score above threshold for each motif. Shown in blue is the number
of genes expected by chance, as computed by bootstrap simulations. The
total number of genes in each cell is shown in the upper left. (B–D)
Heat-map displays showing expression data on the left, followed by MCS
scores for a specified motif, followed by in vivo protein–DNA
binding data for transcription factors implicated in binding to the
specified consensus. Color scales for each panel are at the bottom of
the figure. For the MCS scores, the color map ranges from 0 to the 99th
percentile to minimize the influence of extreme outliers on
interpretation. (B) Shown are 14 genes that fall within the EM1/Early G1
intersection cell and have a conserved enrichment in the presence of
the SWI5 consensus as measured by MCS scores (see Methods; Equations
4–9). (C) Shown are 79 genes that fall within the EM2/Late G1 intersection cell and have a high MCS score for MCB. (D) Shown are 20 genes that fall within the EM2/Late G1
intersection cell and have a high MCS score for SCB. In each heat-map,
genes are ordered by decreasing MCS score. Significant correlation can
be seen between a high MCS score, protein–DNA binding and the expected
expression pattern.
|