A common approach used to evaluate a normalization procedureis to compare correlation coefficient between replicate samples.We compared four normalization procedures, MAS5, RMA, GCRMAand Li–Wong, on gene expression measurements of 10 replicatesamples as well as on their permuted data files. The randomizeddata set plays the role of a negative control (null-hypothesis)such that any significant correlation measured on the permuteddataset could be deemed an artifact of the specific normalizationprocedure. Figure 2a shows the comparison of between-sampleSpearman rank correlation among the four normalization procedures.While all four procedures achieve correlation >0.9, GCRMAseems to produce higher overall correlation measures than theother methods, while MAS5 appears to produce the lowest overallcorrelation measures. This may be incorrectly interpreted toimply that GCRMA normalization outperforms the other methods.However, Figure 2b provides a completely different interpretationfor these observations. It shows that both RMA and especiallyGCRMA produce highly significant correlation measurements evenwhen applied to the randomized set. The conclusion is that thehigher overall correlation after normalization with these twomethods is an artifact and will likely skew the results of reverse-engineeringmethods.
In several methods for the reverse engineering of cellular networks,physical interactions—including protein–proteinand protein–DNA interactions—are inferred from thestatistical dependencies between gene expression profiles (Bassoet al., 2005; Ge et al., 2001). A correct estimate of the correlationstructure in the gene expression profile data is thus one ofthe most crucial ingredients of a successful reverse-engineeringalgorithm. To estimate the impact of normalization procedureson the correlation structure, we preprocessed a set of 254 Affymetrixarrays using the four normalization procedures and then computedcorrelation between all pairs of probe sets. Figure 3 providesa global view of the correlation structure in these data sets.In particular, out of 77 millions possible probe set pairs,there are 5.2 millions (6.7%) expression pairs with correlationcoefficient, ||, >0.75 in GCRMA-normalized data set. In contrast,MAS5-normalized data set contains only 0.04% or 33 000 pairsabove this cutoff value. This is an extraordinary differencewhich warrants further investigation, especially in light ofthe previous results from the analysis of a randomized set.Since most biological networks are known to be scale-free, orpossess a degree distribution that can be approximated by apower law (Barabasi and Oltvai, 2004), we varied threshold ofa relevance network (Butte and Kohane, 2000) and fit the globalnetwork connectivity to a power-law distribution. Pairwise mutualinformation (MI) of the network was estimated using a Gaussiankernel method (Margolin et al., 2006). Figure 4 compares R2-valueof the fitting in each of the four normalized data sets. Withthe exception of GCRMA, all other networks show a good fit ofscale-free distribution and the R2 reach a plateau above thresholdP-value of 1 x 10–10. While this is just a sanity checkrather than a fully quantitative result, it should be clearthat a normalization method resulting in a large deviation fromthe accepted model of topological connectivity in the cell shouldbe approached with caution.
We finally proceeded to assess whether higher correlation, aftera specific normalization procedure, would reflect a higher chanceof either functional or physical interaction between two genes.Two non-parametric gene-pair correlation measures were tested,including Spearman rank correlation and MI. Here we only reportresults for MI analysis, as both measurements produce consistentresults in all the tests we performed. MI was chosen as it arguablyprovides the best estimate of pairwise statistical dependencyin a non-linear setting. We first evaluated functional relationshipbetween genes by examining whether they share a common GO biologicalprocess annotation. Figure 5 compares the fraction of gene pairswith the same GO annotation in a cumulative equal-frequencyhistogram. MAS5 demonstrated the best result with 48% of thetop 10 000 pairs having a common GO biological process term,followed by RMA and Li–Wong, while GCRMA produces a fractionthat is comparable to the background level. We then computedlikelihood ratios of PPIs for the top correlated gene pairsusing the gold-standard PPI interaction sets described in theMethods section. Although this is a rather naïve methodthat directly correlates physical interaction with gene co-expression,the results should still be able to provide a fair comparisonamong the four normalization procedures given that the sameassumption is made for all of them. Figure 6 shows the plotsof the PPI likelihood ratio as a function of various rangesof gene–gene correlations. Note that the discrepancy ofperformance between the curves implies that the MI ranks ofthe gene pairs are not consistent among the four normalizationprocedures. Gene pairs found to be highly correlated in onedata set may not be significant in another data set. Our resultsshow that MAS5-normalized data provide by far the best platformfor inferring PPIs. To our surprise, yet consistently with allother tests in this article, data normalized by GCRMA dramaticallyscrambled the ranks of correlation among gene pairs, i.e. highlycorrelated gene pairs are equally likely to be a positive PPIor a negative PPI. This is likely the consequence of correlationartifacts resulting in the introduction of a large number ofgene pairs that are not truly correlated among the top mostcorrelated ones.
The results of this analysis are both surprising and concerning.GCRMA has been a popular procedure used to convert raw microarraydata into gene expression profiles and it was shown to outperformother normalization procedures in detecting differentially expressedgenes (Wu et al., 2004). However, we observed the opposite resultwhen correlation artifacts are considered. This does not justaffect reverse-engineering methods, but any other method thatrelies on an accurate measure of gene-pair expression profilecorrelation, such as hierarchical clustering among many others.
It is rather obvious that the dramatic under-performance ofGCRMA is due to its background adjustment step, since (a) itis the only step where GCRMA and RMA differ and (b) RMA hasbeen performing much better than GCRMA in our study, albeitnot as well as MAS5. The background adjustment in GCRMA consistsof three sequential steps: (1) optical background correction,(2) probe intensity adjustment through non-specific binding(NSB) utilizing affinity information and optical noise-adjustedMM intensities and (3) probe intensity adjustment through gene-specificbinding (GSB), where NSB-adjusted PM intensities are furthercorrected for the effect of PM probe affinities. In step (2),the default GCRMA procedure implemented in the R statisticalpackage truncates PM intensities to a minimum value, m, if theNSB-adjusted intensity values are less than m, and in step (3),these truncated PM probes are further adjusted for GSB.
We suggest that the GSB adjustment of truncated values is asignificant flaw in the design of the GCRMA normalization procedureand that it is directly responsible for the difference in performancebetween RMA and GCRMA. From a theoretical perspective, oncea probe intensity is truncated at m, it should be deemed uninformativeand further adjustments should be avoided. More specifically,if any two probes with similar affinities are truncated in thesame subset of samples, GSB adjustment could introduce correlationartifacts between the two probes. Figure 7 demonstrates a simplifiedscenario, where two probes with similar affinity, and intensitiesboth truncated to m, could gain a high correlation after GSBadjustment. The GCRMA procedure applies quantile normalizationafter the background adjustment steps, where all probe intensitiesare essentially transformed to ranks within each sample. WithoutGSB-adjustment, these two probes will both rank the lowest inall samples as their intensities are truncated. However, ifGSB adjustment is applied, they may switch ranks with otheruntruncated probes in some samples after the adjustment, andsuch rank switches can be highly correlated between the twoprobes owing to their similar probe affinity. One possible solutionhere is to decrease the value of m, in order to reduce the truncatedregions as well as the number of the affected probes. However,in practice, we found that the most effective way to reducethis problem is to avoid further GSB adjustment altogether onthe probes with truncated intensities. Note that any algorithmused to compute non-parametric statistical dependencies betweenprobe-pairs should randomize the rank of equal-value entriessuch that any probe pair with a constant expression level acrosssamples should not contribute to the correlation.
To test our speculations, we reimplemented the GCRMA procedurewithout adjusting GSB for uninformative probes—i.e. probesthat are truncated to m after NSB adjustment. To ensure thelowest intensity rank of these probes, any other probes withGSB-adjusted value less than m were also truncated at m. Finally,an infinitesimal amount of uniformly distributed noise was addedto truncated probes to avoid rank-order correlation issues.The new implementation successfully removed the artificial correlationinduced in the default version and, as shown in Figure 8, performedmuch better than the original GCRMA and almost at par with MAS5.