- Genetical genomics: use all data


Chesler et al.'s [12] experiment consists of a set of 35 BxD mouse recombinant inbred lines. Brain tissue from 100 pools of individuals were arrayed with Affymetrix U74Av2 arrays chips and a panel of 779 markers was genotyped. Each array experiment was made up with a pool of brain tissue (excluding olfactory bulb, retina or neurohypohysis) from three individuals of the same sex. The data set was downloaded from the GeneNetwork site [17].

Maximum likelihood techniques

Two main models were used to reanalyze the data. In model (1), the j-th expression level for i-th individual (i = 1, n) is modelled as


where μ is the general mean or any other fixed effects that may be included in the model, λ is a 0/1 indicator variable that identifies the genotype of the individual for the marker considered (i.e., λit = 1 if the i-th individual has genotype t, 0 otherwise), and ε is the residual. Note that we assume that marker density is very high so that we test each individual marker in turn instead of carrying out a QTL scan. This is done here for simplicity and computational speed as is straight forward to generalize (1) to other situations. Carlborg et al. [18] found no large differences between single marker and interval mapping in this context. In model 2 we allow, in addition, that cDNAs other than the one analysed can be included in the model as covariates, i.e.,


where δk is an indicator variable with value 1 if the cDNA level k is included in the model with covariate coefficient βk, 0 otherwise. Note that a most difficult problem can be choosing the adequate set of δ's. For this work, we scanned all cDNAs and we included in (2) only the most significant cDNA. Models (2) vs. (1) were compared with a likelihood ratio test, and P-values were computed assuming the usual Chi-squared approximation. Likelihood was maximized using an EM algorithm implemented in package Qxpak [19].

Support vector machines techniques

SVM techniques are a well known tool for classification and prediction [20]. The rationale for using SVM in this context was to use an alternative to maximum likelihood to identify the variables that best predict the trait (expression level) of interest. As in the previous section, suppose we have an n-dimensional real vector containing the expression level to be studied (y), and a collection of d-dimensional real vectors xi that contains the descriptive variables (i.e, all markers and the rest of cDNAs). The goal of SVM is to produce a predictive function for the expressions levels of each individual. Therefore, the input of a SVM can be collected in a set of pairs S = {(x1', y1),...,(xn', yn)}, while the output is a vector w* and a scalar b* such that the function h defined by

h(x) = w*' x + b*     (3)

is a prediction of the expression level y of an individual described by x. An important issue is to fix the criterion for measuring the quality of the prediction. In our case, the aim is to produce a function h (Eq. 3) such that the relative ordering of (h(x1),...,h(xm)) is as close as possible with the observed ordering of (y1,...,yn). For this purpose we used the loss function [21] that returns the number of pairs (i, j) whose predicted relative ordering (h(xi), h(xj)) is swapped with respect to its observed ordering of (yi, yj). Formally, the loss of h in the set is defined as the probability


where I{p(x)} is the function that returns 1 when the predicate p(x) is true, 0 otherwise. This loss function can be seen as a generalization of the complement of the Area Under a Receiver Operating Characteristic (ROC) curve, AUC for short. Hanley and McNeil [22] showed that the AUC is the probability of a correct ranking and thus AUC coincides with the value of the Wilcoxon-Mann-Whitney non parametric statistic. Here, we report

AUC = 100(1 - ΔSP(h, T))     (5)

as a measure of the goodness of the SVM prediction models. Technically, the SVM seeks w* and b* as the solution to the following convex quadratic optimization problem,


subject to

(w'xi + b) - yi ≤ ∈ + 07010903m05.gif and yi - (w'xi + b) ≤ ∈ + 07010903m06.gif,

where C is the regularization parameter and ξ are slack variables (ξ+, ξ- ≥ 0). Notice that the regression approach is similar to fitting the rank, although the function optimised by regression is not exactly a measure of the coherence between observed and predicted rankings. In fact, we could have chosen a SVM solution where the goal was to optimise the loss function AUC (4) directly, see e.g. Joachims [23,24]. However, the results achieved with regression were good enough and they are faster to obtain.

We used SVMlight software [9] to produce regressors that were evaluated with a 10 fold cross validation using the AUC loss function. This means that the whole dataset was randomly split into 10 partitions, each resulting in a training subset and a test subset. Equation (6) is used by SVM to produce a function h (Eq. 3) using the training subset, whereas the function h is evaluated (Eqns. 4 and 5) using the test subset. The performance estimation returned by the cross-validation method is the mean over all 10 partitions. The kernel used was linear, C was set to 1 (usually the default value in most SVM environments), and the parameter ∈ was set to 0.01, the default value in the implementation used.

The markers were dealt as discrete variables with each of the three values (the three genotypes) transformed into three Boolean attributes. Thus, when a marker was found to be among the 50 most relevant for a given trait, the three associated Boolean variables were included in the corresponding model. The algorithm used to discover relevancies was the so-called Recursive Feature Elimination (RFE) [10]; a simple yet efficient method when the kernel is linear. We run SVM for several cDNA levels considering as predictors either all variables (cDNAs and markers), only markers or only cDNA levels. We set a maximum of 50 variables to be included in the decission rule h.

rating: 0.00 from 0 votes | updated on: 30 Jun 2009 | views: 6555 |

Rate article:

excellent! bad…