
Methods
$$ MD_i = \sqrt{(X_i - M)\, S^{-1}\, (X_i - M)^{T}} \qquad (1) $$
Since we want to accurately compute the MD between one array's quality measures and those of the other arrays, it is extremely important that outlying arrays do not contaminate our estimates of the center and correlation structure of all the arrays. If they did, the distances would diverge from their true values simply because the reference point is imprecisely estimated, and they would not be useful for flagging problematic arrays. Thus, our method relies on robust location (M) and scatter (S) estimators to compute the MD defined in Equation (1). In the Supplementary Material, we illustrate the relevance of using robust estimators to compute the MDs with a real data example. In this article, we use the S-estimator (Lopuhaä, 1989); however, any other robust multivariate location and scatter estimator can be used (e.g. the minimum volume ellipsoid or minimum covariance determinant estimators). To increase the finite-sample efficiency of these estimators and thus improve the approximation of the distribution of the MDs, we suggest using an estimator with a 25% breakdown point. This is particularly important in studies containing a small number of arrays, where robust estimators are more unstable and the distributional approximations are less accurate.
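The effect of contamination on Equation (1) can be sketched in a few lines of Python. The sketch is illustrative only and does not implement the S-estimator used in this article. As a stand-in it uses a crude trimmed estimator that repeatedly re-estimates the mean and covariance from the 75% of arrays closest to the current center. The function names, the simulated data and the trimming fraction are all assumptions made for the example.

```python
import numpy as np


def mahalanobis(X, M, S):
    """MD of each row of X from location M under scatter S, as in Equation (1)."""
    diff = X - M
    # Solve S w = diff^T rather than inverting S explicitly.
    w = np.linalg.solve(S, diff.T).T
    return np.sqrt(np.sum(diff * w, axis=1))


def trimmed_location_scatter(X, keep=0.75, n_iter=50):
    """Crude robust stand-in (NOT the S-estimator of the article).

    Starts from the coordinatewise median and MAD, then iterates: keep the
    fraction `keep` of rows with the smallest MDs and re-estimate the mean
    and covariance from them. The covariance is not rescaled for consistency,
    so the resulting MDs are somewhat inflated.
    """
    n = X.shape[0]
    h = int(np.ceil(keep * n))
    M = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - M), axis=0)
    S = np.diag(mad ** 2)
    for _ in range(n_iter):
        closest = np.argsort(mahalanobis(X, M, S))[:h]
        M_new = X[closest].mean(axis=0)
        S_new = np.cov(X[closest], rowvar=False)
        converged = np.allclose(M_new, M) and np.allclose(S_new, S)
        M, S = M_new, S_new
        if converged:
            break
    return M, S


# Simulated data (hypothetical): 30 arrays, 3 quality measures,
# with the first 3 arrays shifted to mimic damaged arrays.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X[:3] += 8.0

# Classical estimates are pulled toward the damaged arrays ...
md_classical = mahalanobis(X, X.mean(axis=0), np.cov(X, rowvar=False))
# ... whereas the trimmed estimates are computed from the bulk of the data.
md_robust = mahalanobis(X, *trimmed_location_scatter(X))
```

With the classical mean and covariance, no MD can exceed (n - 1)/sqrt(n), so the damaged arrays cannot stand out as clearly as they do under the trimmed estimates.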
The resulting MDs can be used to flag poor-quality arrays, as their MDs will be large relative to those of undamaged arrays, i.e. they will be far from the center of the normal arrays. Assuming that the quality measures X1, ..., Xp are jointly multivariate normal, the squared MDs have an approximate chi-squared distribution with p degrees of freedom. Thus, using the chi-squared distribution we can set a cutoff point to decide whether an array is likely to be defective. For example, let X be a 20 × 14 matrix containing the 14 numeric quality measures of the GCOS QC report for 20 arrays in a study. Then, M is a 14-dimensional row vector that estimates the center of X and S is a 14 × 14 matrix that estimates its covariance matrix. Finally, the squared MDs would be approximately distributed as a chi-squared with 14 degrees of freedom.
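As a minimal sketch of the cutoff rule, the chi-squared quantile can be taken from SciPy and compared with the MDs. The 0.99 level, the function names and the four MD values are assumptions chosen for illustration; the text does not prescribe a particular level.

```python
import numpy as np
from scipy.stats import chi2


def md_cutoff(p, level=0.99):
    """Cutoff on the MD scale: square root of the chi-squared(p) quantile."""
    return np.sqrt(chi2.ppf(level, df=p))


def flag_arrays(md, p, level=0.99):
    """Boolean mask of arrays whose MD exceeds the chi-squared cutoff."""
    return np.asarray(md) > md_cutoff(p, level)


# Example from the text: 14 quality measures, hence 14 degrees of freedom.
cutoff = md_cutoff(14)

# Hypothetical MDs for four arrays (made-up numbers, for illustration only).
flags = flag_arrays([3.1, 4.0, 7.5, 5.2], p=14)
```

Because the chi-squared approximation applies to the squared MDs, the cutoff is applied on the square-root scale here; comparing squared MDs with the quantile itself is equivalent.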
2.2 MDQC: different approaches
The most intuitive approach towards MDQC is to compute a single MD for each array based on all the quality measures in the QC report. However, this approach suffers from two drawbacks, one statistical and one conceptual. First, the low quality of an array may be reflected by extreme values in only a few of the measures in the report, while other measures may not significantly differ from those corresponding to the bulk of the arrays. Thus, it is possible that the combination of all the quality measures into a single MD ‘masks’ these outlying observations. Second, even when a single MD can be accurate in identifying poor quality arrays, it provides no information about the potential source of the quality problem. Thus, we recommend alternative approaches to address these issues.
Computing multiple MDs based on different groups with a reduced number of quality measures, instead of a single MD based on all of them, can help to ‘unmask’ outlying observations. As a result, the quality attributes of an array would be summarized by as many MDs as groups formed. In addition, QC reports usually contain more than one measure related to the same quality aspect of the array. Thus, grouping complementary measures according to the quality attribute they represent helps to identify possible reasons for the quality problems. We recommend forming these groups using the a priori grouping method, in which the groups are formed on the basis of an a priori interpretation of the quality measures in the report and according to the quality aspect they represent. To illustrate the use of this method, we now use the GCOS QC report for Affymetrix GeneChip arrays (Affymetrix, 2005). The QC measures in this report can be classified into four groups, according to whether they provide information on the quality of the chip and/or the sample, the chip, the sample and the RNA, respectively:
These groups contain valuable information about the possible sources of corruption. For example, if only the MD for Group 4 were abnormally high, then this would suggest that the array is defective due to poor RNA quality. However, the groups may sometimes provide less conclusive evidence about the source of the problem as a high MD in one group may manifest itself together with an abnormal MD in other groups. For example, a defective chip that should give large MDs in Group 2 may distort the expression of the housekeeping genes and thus also give large MDs in Group 3. Nevertheless, even in these cases the a priori approach usually allows the researcher to rule out at least some possible sources of corruption.
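The group-wise computation described above can be sketched as follows. The group names and column indices are hypothetical placeholders rather than the GCOS groups, and the default estimator is a simple median/MAD stand-in that ignores correlations, not the S-estimator used in this article; any function returning a location and scatter estimate could be passed in its place.

```python
import numpy as np


def median_mad(X):
    """Simple robust stand-in: coordinatewise median and a diagonal scatter
    built from the MAD. It ignores correlations and is NOT the S-estimator."""
    M = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - M), axis=0)
    return M, np.diag(mad ** 2)


def grouped_md(X, groups, estimator=median_mad):
    """Return {group name: MD of every array}, using only that group's columns."""
    out = {}
    for name, cols in groups.items():
        sub = X[:, cols]
        M, S = estimator(sub)
        diff = sub - M
        out[name] = np.sqrt(np.sum(diff * np.linalg.solve(S, diff.T).T, axis=1))
    return out


# Hypothetical layout: 20 arrays, 6 measures, three groups of two columns each.
groups = {"group_a": [0, 1], "group_b": [2, 3], "group_c": [4, 5]}
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6))
X[0, 4:] += 10.0  # array 0 is corrupted only in the measures of group_c

md_by_group = grouped_md(X, groups)

# The group with the largest MD for array 0 points to the likely source.
worst_group_for_array_0 = max(md_by_group, key=lambda g: md_by_group[g][0])
```

Each array ends up with one MD per group, so an abnormal value can be traced back to the quality aspect that the group represents.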
The MDQC method based on groups is versatile. The a priori approach described above can also be used on QC reports other than that provided by GCOS. This would result in different a priori groups based on the description of each measure in those reports. In addition, in the Supplementary Material we provide two data-driven methods to form the groups that serve as an alternative to the a priori approach. These are the clustering grouping method, which groups the quality measures using clustering analysis, and the loading PCA grouping method, which uses the loadings of a PCA to identify the quality measures that contain similar information. It is important to note that the groups formed using these approaches will vary from one dataset to another, and one may lose the interpretability of the groups provided by the a priori method.
We also propose an alternative approach to unmask low-quality arrays, which we refer to as the global PCA method. It uses PCA to create linear combinations of the original QC parameters, referred to as principal components (PCs), where the PCs retain most of the original variability in the data (Johnson and Wichern, 1999). Thus, the MD can be computed on a single group based on the reduced space of the first k PCs (k < p), which can help to ‘unmask’ outlying observations. It is important to note that, in this approach, the formed group does not contain a subset of the original quality measures sharing a common purpose as in the a priori groups. Thus, while this method can also flag low-quality arrays, it gives no indication of the source of the quality problem.
In a PCA, it is usually recommended to standardize the data by the mean and SD of each variable so that variables with a large variance do not dominate the first PCs (Johnson and Wichern, 1999). In addition, as the QC report may contain outlying measures associated with low-quality arrays, it is important to use a robust multivariate location and scatter estimator to standardize the variables. Thus, let X be an n × p matrix containing the quality parameters of each array in each row, and let (M, S) be the robust location and scatter estimates of X. Then, the standardized variables are given by
$$ Z_i = (X_i - M)\, V^{-1/2}, $$
for i = 1, ..., n, where V is a p × p diagonal matrix containing the robust variance estimates. If n > p, a robust PCA can be performed by deriving the PCs from robust location and covariance matrix estimates of Z, where Z is the n × p matrix with $Z_i$ in its rows (Croux and Haesbroeck, 2000). That is, the jth PC is given by $Y_j = Z e_j$, where $e_j$ is the eigenvector corresponding to the jth largest eigenvalue $\lambda_j$ of the covariance matrix of Z, for j = 1, ..., p.
If we use the same robust multivariate estimator to standardize the data, to derive the PCs and to estimate the location and covariance matrix of the PCs, then the estimated location and covariance matrix of the PCs become a zero vector, $0_p$, and a diagonal matrix with the eigenvalues $\lambda_j$ on its diagonal, $D_p = \mathrm{diag}\{\lambda_1, \ldots, \lambda_p\}$, respectively. As a result, the MD defined in Equation (1) reduces to a Euclidean distance weighted by the eigenvalues of the covariance matrix of Z (Johnson and Wichern, 1999), i.e.
$$ MD_i = \sqrt{\sum_{j=1}^{k} \frac{y_{ij}^{2}}{\lambda_j}} \qquad (2) $$
where $Y_i^{(k)} = (y_{i1}, \ldots, y_{ik})$ is the set of the first k PCs for the ith array.
In this article, we use the S-estimator with a 25% breakdown point in all the steps of the analysis; however, other robust estimators can be used. The scree plot (i.e. the plot of $\lambda_j$ in decreasing order versus j, for j = 1, ..., p) is used to determine the number k of principal components retained in the analysis, looking for the ‘elbow’ or first important bend in the line (Johnson and Wichern, 1999). As before, we flag an array as potentially low quality if its distance defined in Equation (2) is unusually high.
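The steps of the global PCA method can be sketched as below. The function name `pca_md` is hypothetical, and the sketch assumes that M and S come from an affine-equivariant estimator, so that the scatter of the standardized data is S rescaled by the robust variances. In the usage lines, the classical mean and covariance serve only as placeholders to check the algebra; the method itself calls for robust estimates.

```python
import numpy as np


def pca_md(X, M, S, k):
    """MD on the first k PCs (Equation 2), plus the eigenvalues for a scree plot.

    M and S are location and scatter estimates of X, assumed to come from
    whichever (affine-equivariant) robust estimator is in use.
    """
    v = np.diag(S)                      # robust variances (diagonal of V)
    Z = (X - M) / np.sqrt(v)            # standardized variables Z_i
    # Under affine equivariance, the scatter of Z is V^{-1/2} S V^{-1/2}.
    R = S / np.sqrt(np.outer(v, v))
    eigval, eigvec = np.linalg.eigh(R)  # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]    # reorder to decreasing eigenvalues
    eigval, eigvec = eigval[order], eigvec[:, order]
    Y = Z @ eigvec                      # principal components Y_j = Z e_j
    md = np.sqrt(np.sum(Y[:, :k] ** 2 / eigval[:k], axis=1))
    return md, eigval


# Simulated, correlated data (hypothetical): 25 arrays, 5 quality measures.
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 5)) @ rng.normal(size=(5, 5))

# Placeholder estimates (classical, NOT robust), used only to check the algebra.
M, S = X.mean(axis=0), np.cov(X, rowvar=False)

md_k2, eigval = pca_md(X, M, S, k=2)   # distance in the space of the first 2 PCs
md_full, _ = pca_md(X, M, S, k=5)      # keeping all PCs

# Equation (1) computed directly, for comparison with md_full.
diff = X - M
md_eq1 = np.sqrt(np.sum(diff * np.linalg.solve(S, diff.T).T, axis=1))
```

Keeping all p PCs reproduces the MD of Equation (1), which is a convenient check on an implementation; the returned eigenvalues, plotted in decreasing order, give the scree plot used to choose k.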