A common approach used to evaluate a normalization procedure^{}is to compare correlation coefficient between replicate samples.^{}We compared four normalization procedures, MAS5, RMA, GCRMA^{}and Li–Wong, on gene expression measurements of 10 replicate^{}samples as well as on their permuted data files. The randomized^{}data set plays the role of a negative control (null-hypothesis)^{}such that any significant correlation measured on the permuted^{}dataset could be deemed an artifact of the specific normalization^{}procedure. Figure 2a shows the comparison of between-sample^{}Spearman rank correlation among the four normalization procedures.^{}While all four procedures achieve correlation >0.9, GCRMA^{}seems to produce higher overall correlation measures than the^{}other methods, while MAS5 appears to produce the lowest overall^{}correlation measures. This may be incorrectly interpreted to^{}imply that GCRMA normalization outperforms the other methods.^{}However, Figure 2b provides a completely different interpretation^{}for these observations. It shows that both RMA and especially^{}GCRMA produce highly significant correlation measurements even^{}when applied to the randomized set. The conclusion is that the^{}higher overall correlation after normalization with these two^{}methods is an artifact and will likely skew the results of reverse-engineering^{}methods.

In several methods for the reverse engineering of cellular networks,^{}physical interactions—including protein–protein^{}and protein–DNA interactions—are inferred from the^{}statistical dependencies between gene expression profiles (Basso^{}*et al*., 2005; Ge *et al*., 2001). A correct estimate of the correlation^{}structure in the gene expression profile data is thus one of^{}the most crucial ingredients of a successful reverse-engineering^{}algorithm. To estimate the impact of normalization procedures^{}on the correlation structure, we preprocessed a set of 254 Affymetrix^{}arrays using the four normalization procedures and then computed^{}correlation between all pairs of probe sets. Figure 3 provides^{}a global view of the correlation structure in these data sets.^{}In particular, out of 77 millions possible probe set pairs,^{}there are 5.2 millions (6.7%) expression pairs with correlation^{}coefficient, ||, >0.75 in GCRMA-normalized data set. In contrast,^{}MAS5-normalized data set contains only 0.04% or 33 000 pairs^{}above this cutoff value. This is an extraordinary difference^{}which warrants further investigation, especially in light of^{}the previous results from the analysis of a randomized set.^{}Since most biological networks are known to be scale-free, or^{}possess a degree distribution that can be approximated by a^{}power law (Barabasi and Oltvai, 2004), we varied threshold of^{}a relevance network (Butte and Kohane, 2000) and fit the global^{}network connectivity to a power-law distribution. Pairwise mutual^{}information (MI) of the network was estimated using a Gaussian^{}kernel method (Margolin *et al*., 2006). Figure 4 compares *R*^{2}-value^{}of the fitting in each of the four normalized data sets. With^{}the exception of GCRMA, all other networks show a good fit of^{}scale-free distribution and the *R*^{2} reach a plateau above threshold^{}*P*-value of 1 x 10^{–10}. While this is just a sanity check^{}rather than a fully quantitative result, it should be clear^{}that a normalization method resulting in a large deviation from^{}the accepted model of topological connectivity in the cell should^{}be approached with caution.

We finally proceeded to assess whether higher correlation, after^{}a specific normalization procedure, would reflect a higher chance^{}of either functional or physical interaction between two genes.^{}Two non-parametric gene-pair correlation measures were tested,^{}including Spearman rank correlation and MI. Here we only report^{}results for MI analysis, as both measurements produce consistent^{}results in all the tests we performed. MI was chosen as it arguably^{}provides the best estimate of pairwise statistical dependency^{}in a non-linear setting. We first evaluated functional relationship^{}between genes by examining whether they share a common GO biological^{}process annotation. Figure 5 compares the fraction of gene pairs^{}with the same GO annotation in a cumulative equal-frequency^{}histogram. MAS5 demonstrated the best result with 48% of the^{}top 10 000 pairs having a common GO biological process term,^{}followed by RMA and Li–Wong, while GCRMA produces a fraction^{}that is comparable to the background level. We then computed^{}likelihood ratios of PPIs for the top correlated gene pairs^{}using the gold-standard PPI interaction sets described in the^{}Methods section. Although this is a rather naïve method^{}that directly correlates physical interaction with gene co-expression,^{}the results should still be able to provide a fair comparison^{}among the four normalization procedures given that the same^{}assumption is made for all of them. Figure 6 shows the plots^{}of the PPI likelihood ratio as a function of various ranges^{}of gene–gene correlations. Note that the discrepancy of^{}performance between the curves implies that the MI ranks of^{}the gene pairs are not consistent among the four normalization^{}procedures. Gene pairs found to be highly correlated in one^{}data set may not be significant in another data set. Our results^{}show that MAS5-normalized data provide by far the best platform^{}for inferring PPIs. To our surprise, yet consistently with all^{}other tests in this article, data normalized by GCRMA dramatically^{}scrambled the ranks of correlation among gene pairs, i.e. highly^{}correlated gene pairs are equally likely to be a positive PPI^{}or a negative PPI. This is likely the consequence of correlation^{}artifacts resulting in the introduction of a large number of^{}gene pairs that are not truly correlated among the top most^{}correlated ones.

The results of this analysis are both surprising and concerning.^{}GCRMA has been a popular procedure used to convert raw microarray^{}data into gene expression profiles and it was shown to outperform^{}other normalization procedures in detecting differentially expressed^{}genes (Wu *et al*., 2004). However, we observed the opposite result^{}when correlation artifacts are considered. This does not just^{}affect reverse-engineering methods, but any other method that^{}relies on an accurate measure of gene-pair expression profile^{}correlation, such as hierarchical clustering among many others.^{}

It is rather obvious that the dramatic under-performance of^{}GCRMA is due to its background adjustment step, since (a) it^{}is the only step where GCRMA and RMA differ and (b) RMA has^{}been performing much better than GCRMA in our study, albeit^{}not as well as MAS5. The background adjustment in GCRMA consists^{}of three sequential steps: (1) optical background correction,^{}(2) probe intensity adjustment through non-specific binding^{}(NSB) utilizing affinity information and optical noise-adjusted^{}MM intensities and (3) probe intensity adjustment through gene-specific^{}binding (GSB), where NSB-adjusted PM intensities are further^{}corrected for the effect of PM probe affinities. In step (2),^{}the default GCRMA procedure implemented in the R statistical^{}package truncates PM intensities to a minimum value, *m*, if the^{}NSB-adjusted intensity values are less than *m*, and in step (3),^{}these truncated PM probes are further adjusted for GSB.^{}

We suggest that the GSB adjustment of truncated values is a^{}significant flaw in the design of the GCRMA normalization procedure^{}and that it is directly responsible for the difference in performance^{}between RMA and GCRMA. From a theoretical perspective, once^{}a probe intensity is truncated at *m*, it should be deemed uninformative^{}and further adjustments should be avoided. More specifically,^{}if any two probes with similar affinities are truncated in the^{}same subset of samples, GSB adjustment could introduce correlation^{}artifacts between the two probes. Figure 7 demonstrates a simplified^{}scenario, where two probes with similar affinity, and intensities^{}both truncated to *m*, could gain a high correlation after GSB^{}adjustment. The GCRMA procedure applies quantile normalization^{}after the background adjustment steps, where all probe intensities^{}are essentially transformed to ranks within each sample. Without^{}GSB-adjustment, these two probes will both rank the lowest in^{}all samples as their intensities are truncated. However, if^{}GSB adjustment is applied, they may switch ranks with other^{}untruncated probes in some samples after the adjustment, and^{}such rank switches can be highly correlated between the two^{}probes owing to their similar probe affinity. One possible solution^{}here is to decrease the value of *m*, in order to reduce the truncated^{}regions as well as the number of the affected probes. However,^{}in practice, we found that the most effective way to reduce^{}this problem is to avoid further GSB adjustment altogether on^{}the probes with truncated intensities. Note that any algorithm^{}used to compute non-parametric statistical dependencies between^{}probe-pairs should randomize the rank of equal-value entries^{}such that any probe pair with a constant expression level across^{}samples should not contribute to the correlation.

To test our speculations, we reimplemented the GCRMA procedure^{}without adjusting GSB for uninformative probes—i.e. probes^{}that are truncated to *m* after NSB adjustment. To ensure the^{}lowest intensity rank of these probes, any other probes with^{}GSB-adjusted value less than *m* were also truncated at *m*. Finally,^{}an infinitesimal amount of uniformly distributed noise was added^{}to truncated probes to avoid rank-order correlation issues.^{}The new implementation successfully removed the artificial correlation^{}induced in the default version and, as shown in Figure 8, performed^{}much better than the original GCRMA and almost at par with MAS5.