Microarray technologies have enabled researchers to monitor the expression levels of tens of thousands of genes simultaneously. However, the process of producing microarray data, from sample preparation to the final step of harvesting the data, involves multiple steps, some of which can be error-prone. Possible quality problems include poor RNA extraction, problems arising from the hybridization process, physical defects of the chips, and artifacts such as batch effects (see Zhang et al., 2004 and Brettschneider et al., 2007 for a more detailed discussion). As poor-quality arrays may seriously distort the preprocessing as well as the data analysis procedures, examining the quality of the arrays is a critical step before any subsequent analysis can be performed. In this article, the first question we consider is how to assess the quality of a large microarray dataset. That is, after we receive n microarray chips from a facility that produces microarray data, we need to assess their quality and, if necessary, identify those m chips that need to be rerun. The second question we consider is why each of the m chips is of unacceptable quality. Because of the resources involved (e.g. biological material, human time and production cost), answering this question is an important step in reducing the effort required for the rerun.
Despite the extensive research on microarray data, the development of microarray quality assessment methods is still in its early stages. The standard practice of inspecting each image file to detect quality problems in each array is time-consuming and difficult to apply in large studies. As a result, alternative automated quality assessment methods have been proposed. We briefly discuss some of these studies and refer the reader to Brettschneider et al. (2007) for a comprehensive review of the literature. For spotted arrays, several studies provide useful spot quality measures to examine features of each spot on the slides (Bylesjö et al., 2005; Hautaniemi et al., 2003; Sauer et al., 2005; Wang et al., 2001). In addition, Model et al. (2002) propose using multivariate statistical process control techniques based on the measurement values of cDNA arrays to detect problematic slides. For oligonucleotide arrays, quality control (QC) reports can be used to assess the quality of the arrays (Affymetrix, 2004; Wilson and Miller, 2005). Instead of relying on QC reports, Brettschneider et al. (2007) introduce several new quality measures based on probe level and probeset level information to assess the quality of Affymetrix GeneChips (Bolstad et al., 2005).
In this article, we develop a tool to identify quality problems based on the quality measures provided in any QC report. Although there is an open debate on whether the measures contained in the QC reports can identify quality problems, QC reports are commonly used to assess the quality of microarrays (Finkelstein, 2005; Landea et al., 2005). One common practice is to compare the values of each measure against ad hoc thresholds. Another practice is to account for the similarity of the measures across arrays and flag those arrays with one or more measures that substantially differ from those of the majority of the arrays. These methods can be implemented using software such as simpleaffy (Wilson and Miller, 2005) or GeneData Expressionist Refiner 5.0 (GeneData, Basel, Switzerland). The main drawback of these methods is that they are univariate, i.e. they ignore the correlation structure of the QC measures. As a result, these methods can only detect univariate outliers, i.e. observations that clearly depart from the bulk of the data in at least one dimension. They cannot detect structural outliers, i.e. observations that are not outliers in any single dimension, but are nonetheless outliers when multiple dimensions are considered (see Rousseeuw and Leroy, 1987 or Model et al., 2002 for a discussion of this issue). In our context, we are interested in flagging not only arrays that violate any of the univariate checks, but also those that are of poor quality only when multiple QC parameters are simultaneously taken into account.
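To make the distinction between univariate and structural outliers concrete, the following minimal sketch (Python with NumPy, on simulated data rather than real QC reports) constructs one array whose two QC measures are each unremarkable but jointly inconsistent with the correlation seen in the other arrays. The two measures, their correlation of 0.95, the |z| > 3 rule and the 0.999 chi-square cutoff are illustrative assumptions, not values taken from any QC report or software package.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 arrays, two strongly (positively) correlated QC measures.
cov_true = np.array([[1.0, 0.95], [0.95, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=200)

# A suspect array: each measure is moderate on its own, but the pair
# (high on the first, low on the second) contradicts the positive correlation.
suspect = np.array([1.5, -1.5])

center = X.mean(axis=0)

# Univariate check: z-score of each measure, flagged if any |z| exceeds 3.
z = (suspect - center) / X.std(axis=0, ddof=1)
univariate_flag = bool(np.any(np.abs(z) > 3.0))

# Multivariate check: squared distance that accounts for the correlation
# (the squared Mahalanobis distance), compared with a chi-square cutoff.
diff = suspect - center
S = np.cov(X, rowvar=False)
md2 = float(diff @ np.linalg.solve(S, diff))

# The chi-square quantile has the closed form -2*log(1 - q) for 2 degrees
# of freedom only; other dimensions would need a statistics library.
cutoff = -2.0 * math.log(1.0 - 0.999)
multivariate_flag = md2 > cutoff

print("z-scores:", np.round(z, 2), "univariate flag:", univariate_flag)
print("squared distance:", round(md2, 1), "multivariate flag:", multivariate_flag)
```

In this simulated setting the suspect array passes both univariate checks yet lies far from the bulk of the data once the correlation is taken into account, which is exactly the kind of array a threshold-per-measure rule would miss.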
We propose a multivariate quality assessment method for microarrays that is based on the similarity of quality measures across arrays, i.e. on the idea of outlier detection. Intuitively, the ‘distance’ between an array's quality attributes and those typical of the other arrays measures how dissimilar the quality of that array is from the quality of the rest. Arrays with unusually large distances can then be flagged as potentially of low quality. Accordingly, our method computes a single distance measure, the Mahalanobis distance (MD), to summarize the quality of each array. The use of this distance allows us to perform a multivariate analysis of the information in QC reports that takes the correlation structure of the quality measures into account. In addition, because robust estimators are used to identify the typical quality measures of good-quality arrays, the evaluation is not affected by the measures of outlying arrays. The method can be based on all the quality measures simultaneously, or on subsets of them, in which case it gives one distance value for each subset of parameters in the QC report. We show that the latter approach can be exploited to provide possible explanations of the source of the quality problems. In sum, we bring outlier detection methods widely used in statistics into the quality assessment of microarrays based on QC measures.
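The role of the robust estimators can be illustrated with a small sketch. This section does not specify which robust estimator is used, so the code below substitutes a deliberately simple stand-in: a few concentration steps (repeatedly re-estimating the center and covariance from the fraction of arrays with the smallest distances), started from the coordinate-wise median and MAD. It is not a full minimum covariance determinant implementation. The function names, the simulated data, the 75% retention fraction, the median-based calibration and the 0.999 cutoff are all assumptions made for illustration.

```python
import math
import numpy as np

def squared_md(X, center, cov):
    """Squared Mahalanobis distance of each row of X from `center`."""
    diff = X - center
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)

def robust_center_cov(X, keep_fraction=0.75, n_iter=20):
    """Hypothetical helper: trimmed center/covariance via concentration steps."""
    h = int(math.ceil(keep_fraction * X.shape[0]))
    # Robust starting point: coordinate-wise median and MAD-based scales.
    center = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - center), axis=0)
    cov = np.diag(mad ** 2)
    for _ in range(n_iter):
        # Keep the h arrays closest to the current center, then re-estimate.
        keep = np.argsort(squared_md(X, center, cov))[:h]
        center = X[keep].mean(axis=0)
        cov = np.cov(X[keep], rowvar=False)
    return center, cov

rng = np.random.default_rng(1)
# Simulated QC measures: 100 typical arrays plus 8 flawed ones (rows 100-107).
good = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=100)
bad = rng.multivariate_normal([6.0, -6.0], 0.25 * np.eye(2), size=8)
X = np.vstack([good, bad])

# Classical distances: the mean and covariance are distorted by the flawed arrays.
d2_classical = squared_md(X, X.mean(axis=0), np.cov(X, rowvar=False))

# Robust distances: estimates come from the central part of the data only.
center, cov = robust_center_cov(X)
d2_robust = squared_md(X, center, cov)
# The trimmed covariance is too small, so rescale the distances so that their
# median matches the chi-square median (closed form for 2 degrees of freedom).
d2_robust *= (-2.0 * math.log(0.5)) / np.median(d2_robust)

cutoff = -2.0 * math.log(1.0 - 0.999)  # chi-square 0.999 quantile, 2 df only
flagged = np.flatnonzero(d2_robust > cutoff)
print("flagged arrays (robust):", flagged)
print("flagged arrays (classical):", np.flatnonzero(d2_classical > cutoff))
```

Running the same function on a subset of columns would give one distance per group of QC parameters, which is the mechanism by which subset-wise distances can point to the source of a quality problem.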
The method is specifically designed to identify a small fraction of potentially flawed arrays within a large set of arrays. It is therefore suited to the common situation in microarray experiments in which a small number of the arrays in a batch are of low quality and need to be identified. The method is not appropriate when a large fraction of the arrays, or even the entire batch, may be flawed due to incorrect laboratory procedures, contaminated samples or other reasons. However, such events can be easily detected using the univariate methods discussed above.
In addition to having a clear statistical foundation, our method has several salient features. First, it takes into account the correlation structure of the quality parameters in the QC report. We show that a multivariate analysis gives substantially richer information than the analysis of each parameter in isolation. Second, it is flexible and useful for any platform, as it can be based on any QC report. We illustrate our method using two datasets of Affymetrix GeneChips and the QC reports generated by simpleaffy (Wilson and Miller, 2005) and GeneChip Operating Software (GCOS) (Affymetrix, 2005), respectively. However, all the ideas can be applied to other QC reports. Moreover, the user can choose how to group the different quality measures as well as the cutoff lines. Third, since our method is scale-invariant, the analysis does not change if different scales are used for the quality parameters. Last, once the QC reports are produced, our method is computationally lightweight, and it summarizes the large number of quality parameters in a way that can be easily visualized and interpreted, which is especially valuable in large microarray studies.
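The scale invariance noted above can be checked directly from the standard definition of the Mahalanobis distance. The short derivation below uses notation that is assumed here rather than taken from the text: x is the vector of QC measures of one array, and \hat\mu and \hat\Sigma are the estimated center and covariance matrix. It holds provided the estimators are affine equivariant, as the sample mean and sample covariance are.

```latex
% Assumed notation: x = QC measures of one array,
% \hat\mu, \hat\Sigma = estimated center and covariance matrix.
\[
  \mathrm{MD}(x) = \sqrt{(x - \hat\mu)^{\top} \hat\Sigma^{-1} (x - \hat\mu)} .
\]
% Change of scale: y = A x + b with A nonsingular
% (a diagonal A corresponds to changing the units of each measure).
% For affine equivariant estimators,
%   \hat\mu_y = A \hat\mu + b  and  \hat\Sigma_y = A \hat\Sigma A^{\top},
% so that y - \hat\mu_y = A (x - \hat\mu) and
\[
  (y - \hat\mu_y)^{\top} \hat\Sigma_y^{-1} (y - \hat\mu_y)
  = (x - \hat\mu)^{\top} A^{\top} (A^{\top})^{-1} \hat\Sigma^{-1} A^{-1} A \, (x - \hat\mu)
  = (x - \hat\mu)^{\top} \hat\Sigma^{-1} (x - \hat\mu) .
\]
```

The distance of each array, and hence the set of flagged arrays for a given cutoff, is therefore unchanged when a QC parameter is reported in different units.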