**5**6 7 8 9 10

Rate article:

- Genetical genomics: use all data

Chesler et al.'s [12] experiment consists of a set of 35 BxD mouse recombinant inbred lines. Brain tissue from 100 pools of individuals were arrayed with Affymetrix U74Av2 arrays chips and a panel of 779 markers was genotyped. Each array experiment was made up with a pool of brain tissue (excluding olfactory bulb, retina or neurohypohysis) from three individuals of the same sex. The data set was downloaded from the GeneNetwork site [17].

Two main models were used to reanalyze the data. In model (1), the j-th expression level for i-th individual (i = 1, n) is modelled as

where μ is the general mean or any other fixed effects that may be included in the model, λ is a 0/1 indicator variable that identifies the genotype of the individual for the marker considered (i.e., λ_{it }=
1 if the i-th individual has genotype t, 0 otherwise), and ε is the
residual. Note that we assume that marker density is very high so that
we test each individual marker in turn instead of carrying out a QTL
scan. This is done here for simplicity and computational speed as is
straight forward to generalize (1) to other situations. Carlborg et al.
[18]
found no large differences between single marker and interval mapping
in this context. In model 2 we allow, in addition, that cDNAs other
than the one analysed can be included in the model as covariates, i.e.,

where δ_{k }is an indicator variable with value 1 if the cDNA level k is included in the model with covariate coefficient β_{k}, 0 otherwise. Note that a most difficult problem can be choosing the adequate set of δ's.
For this work, we scanned all cDNAs and we included in (2) only the
most significant cDNA. Models (2) vs. (1) were compared with a
likelihood ratio test, and P-values were computed assuming the usual
Chi-squared approximation. Likelihood was maximized using an EM
algorithm implemented in package Qxpak [19].

SVM techniques are a well known tool for classification and prediction [20].
The rationale for using SVM in this context was to use an alternative
to maximum likelihood to identify the variables that best predict the
trait (expression level) of interest. As in the previous section,
suppose we have an n-dimensional real vector containing the expression
level to be studied (**y**), and a collection of d-dimensional real vectors **x**_{i }that
contains the descriptive variables (i.e, all markers and the rest of
cDNAs). The goal of SVM is to produce a predictive function for the
expressions levels of each individual. Therefore, the input of a SVM
can be collected in a set of pairs **S **= {(**x**_{1}', y_{1}),...,(**x**_{n}', y_{n})}, while the output is a vector **w* **and a scalar b* such that the function h defined by

h(**x**) = **w***' **x **+ b* (3)

is a prediction of the expression level y of an individual described by **x**.
An important issue is to fix the criterion for measuring the quality of
the prediction. In our case, the aim is to produce a function h (Eq. 3)
such that the relative ordering of (h(**x**_{1}),...,h(**x**_{m})) is as close as possible with the observed ordering of (y_{1},...,y_{n}). For this purpose we used the loss function [21] that returns the number of pairs (i, j) whose predicted relative ordering (h(**x**_{i}), h(**x**_{j})) is swapped with respect to its observed ordering of (y_{i}, y_{j}). Formally, the loss of h in the set is defined as the probability

where I{p(x)} is the function that returns 1 when the predicate p(x) is true, 0 otherwise. This loss function can be seen as a generalization of the complement of the Area Under a Receiver Operating Characteristic (ROC) curve, AUC for short. Hanley and McNeil [22] showed that the AUC is the probability of a correct ranking and thus AUC coincides with the value of the Wilcoxon-Mann-Whitney non parametric statistic. Here, we report

AUC = 100(1 - Δ_{SP}(h, T)) (5)

as a measure of the goodness of the SVM prediction models. Technically, the SVM seeks **w* **and b* as the solution to the following convex quadratic optimization problem,

subject to

(**w'x**_{i }+ *b*) - *y*_{i }≤ ∈ + and *y*_{i }- (**w'x**_{i }+ *b*) ≤ ∈ + ,

where *C *is the regularization parameter and ξ are slack variables (ξ^{+}, ξ^{- }≥
0). Notice that the regression approach is similar to fitting the rank,
although the function optimised by regression is not exactly a measure
of the coherence between observed and predicted rankings. In fact, we
could have chosen a SVM solution where the goal was to optimise the
loss function AUC (4) directly, see e.g. Joachims [23,24]. However, the results achieved with regression were good enough and they are faster to obtain.

We used SVM^{light }software [9]
to produce regressors that were evaluated with a 10 fold cross
validation using the AUC loss function. This means that the whole
dataset was randomly split into 10 partitions, each resulting in a
training subset and a test subset. Equation (6) is used by SVM to
produce a function h (Eq. 3) using the training subset, whereas the
function h is evaluated (Eqns. 4 and 5) using the test subset. The
performance estimation returned by the cross-validation method is the
mean over all 10 partitions. The kernel used was linear, *C *was
set to 1 (usually the default value in most SVM environments), and the
parameter ∈ was set to 0.01, the default value in the implementation
used.

The markers were dealt as discrete variables with each of the three values (the three genotypes) transformed into three Boolean attributes. Thus, when a marker was found to be among the 50 most relevant for a given trait, the three associated Boolean variables were included in the corresponding model. The algorithm used to discover relevancies was the so-called Recursive Feature Elimination (RFE) [10]; a simple yet efficient method when the kernel is linear. We run SVM for several cDNA levels considering as predictors either all variables (cDNAs and markers), only markers or only cDNA levels. We set a maximum of 50 variables to be included in the decission rule h.

rating: **0.00** from **0** votes | updated on: **30 Jun 2009** | views: **6555** |