The GCRMA and RMA normalization procedures for Affymetrix
GeneChip® technology have been adopted remarkably broadly in
the community because previous benchmarks demonstrated their
superiority over other methods. However, while these methods
perform well in differential expression analysis, we found that
they also introduce correlation artifacts in the data. This
seriously undermines their use, at least in their standard form,
upstream of reverse engineering algorithms or any other method
that relies on estimates of expression profile correlation. Thus,
our results raise concerns about the validity of many studies
whose results were derived from correlation measures computed
after these normalization procedures were applied. Specifically,
we suggest that the implementation of a specific step in GCRMA
(the GSB adjustment of truncated values) introduces artificial
correlation among the probesets. Unfortunately, according to
our analysis, these artifacts are not dataset specific and can
survive even after the use of additional probeset post-processing
filters such as those based on mean, SD and coefficient of variation.
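
As an illustration of the kind of post-processing filter mentioned
above, the following is a minimal sketch, not the procedure used in
this study. The function name `filter_probesets`, the dictionary
layout and all threshold values are assumptions made for the example;
the text does not state the cutoffs that were applied.

```python
from statistics import mean, stdev


def filter_probesets(expr, min_mean=0.0, min_sd=0.0, min_cv=0.0):
    """Keep probesets whose mean, SD and CV all reach the given thresholds.

    expr maps a probeset id to its expression values across arrays.
    The thresholds are hypothetical; they are not taken from the text.
    """
    kept = {}
    for probeset, values in expr.items():
        m = mean(values)
        sd = stdev(values)  # sample standard deviation
        # Coefficient of variation; treated as infinite when the mean is zero.
        cv = sd / abs(m) if m != 0 else float("inf")
        if m >= min_mean and sd >= min_sd and cv >= min_cv:
            kept[probeset] = values
    return kept


# Toy data (invented for illustration only).
toy = {
    "ps_flat": [5.0, 5.0, 5.0, 5.0],   # no variation across arrays
    "ps_low": [0.1, 0.2, 0.1, 0.2],    # low mean expression
    "ps_var": [4.0, 8.0, 6.0, 10.0],   # expressed and variable
}
filtered = filter_probesets(toy, min_mean=1.0, min_sd=0.5, min_cv=0.1)
```

With these toy thresholds only `ps_var` is retained. A filter of
this kind acts on each probeset separately, so it does not by itself
examine the correlation between pairs of probesets.
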
Results were completely consistent across four classes of tests,
including (a) a direct assessment of correlation artifacts from
replicate and randomized samples, (b) an evaluation of the global
topological properties of reverse engineered networks, (c) a
study of the functional clustering of correlated genes and (d)
a study of the relationship between gene-pair expression profile
correlation and membership in stable protein complexes. The
unequivocal result is that normalization with GCRMA substantially
reduces the ability to distinguish between actual and incorrect
functional and physical interactions. In particular, GCRMA is
likely to introduce an extraordinary number of false positives,
while MAS5 appears to perform optimally with respect to these
tests.
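
A direct assessment of the kind described in test (a) could be
sketched as follows. This is an assumed, simplified reading rather
than the protocol of the study: it takes expression profiles that
have already been normalized (for example from randomized input,
where no real co-expression is expected) and reports the fraction
of probeset pairs whose absolute Pearson correlation reaches a
cutoff. The names `pearson` and `fraction_correlated` and the
cutoff of 0.9 are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Constant profile: correlation is undefined, treated as 0 here.
        return 0.0
    return sxy / sqrt(sxx * syy)


def fraction_correlated(profiles, cutoff=0.9):
    """Fraction of profile pairs with |r| at or above the cutoff."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    hits = sum(1 for x, y in pairs if abs(pearson(x, y)) >= cutoff)
    return hits / len(pairs)


# Toy profiles (invented): three are perfectly (anti)correlated,
# the fourth correlates with the others at |r| = 0.8.
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
]
fraction = fraction_correlated(toy_profiles, cutoff=0.9)
```

Comparing this fraction across normalization procedures applied to
the same randomized input is one way to see whether a procedure
yields more highly correlated pairs than the others.
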
We conclude that the choice of normalization procedure strongly
affects the correlation structure in the data. Thus, choosing
the right normalization procedure is a key step towards the
inference of accurate cellular networks. Our comparative analysis
favors MAS5 in this context even though (or probably because)
it infers fewer interactions but with the highest functional
and physical interaction enrichment.
Finally, we suggest that a specific correction to the default
implementation of GCRMA in the R package appears to substantially
improve its performance, making it competitive with that of
MAS5. With this correction, we believe that GCRMA can be properly
utilized in the context of reverse engineering gene networks.