Variable relevance
Throughout this work, we compared the results for eQTL scan with two
models, a first model (model 1) where the only effects are a general
mean and the marker, and a second model (model 2) where external cDNA
levels were also included as covariate (see methods). Chesler et al. [12]
listed a series of ~70 highly significant eQTL (genome wise P-values
< 0.04) using a model 1 type strategy. One of the most significant
QTL affected the expression of the gene peroxiredoxin 2 (Prdx2), which plays a major role in protecting against oxidative stress. Figure 2
(top) shows the results with model (1), i.e., the classical approach.
As expected, we found a highly significant QTL (nominal P-value ~10-15),
in agreement with published results. However, when we scanned for
association not only the 779 markers genotyped but also the rest of
cDNA levels, the most associated variable was actually the transcript
level of gene Psma7 (P < 10-20). Interestingly,
this gene is involved in regulating the hypoxia-inducible
factor-1alpha, a transcription factor important for cellular responses
to oxygen tension. Next, we included Psma7 level as covariate in the eQTL model for Prdx2 and
we reanalysed the data with model (2), testing as before for the
associated significance of each marker in turn. The results are also in
Figure 2 (top). Although the profile is similar, the QTL is now far more significant (P-value = 10-23 now vs. 10-15 in the previous model). This observation strongly suggests that Psma7 plays an important role in modifying the expression of Prdx2, but also that its influence is purely environmental and probably under different genetic control; as a result, including Psma7 in the model removes a large part of the residual variation.
The opposite phenomenon was observed with the transcript level of Lin7c, one of the most highly connected gene in the brain [12]. In this case, the original eQTL had a nominal P-value ~10-5 with the usual model (model 1). This significance decreased when Sacm1l expression level was included using model (2) (P = 2 × 10-3). Sacm1l transcript level was the most significant factor associated with Lin7c expression (P < 10-53),
i.e., much more significant than any marker. It should also be noted
that the position of the maximum statistics was shifted, from marker D12Mit234 in chromosome 12 to D6Mit116 in chromosome 6. Interestingly, the expression level of Sacm1l had an eQTL also in the neighborhood of marker D12Mit234 with model 1. This likely occurs because both traits are highly correlated (ρ = 0.99). Schadt et al. [2] proposed to compare likelihoods P(x1|m) and P(x1|x2), where x1 and x2 are cDNA measures and m, marker genotypes, in order to disentangle whether m or x2 are causal to x1. Here, we compared P(xLin7c | xSacm1l) vs. P(xSacm1l | xLin7c)
but they were almost identical and thus we cannot resolve whether one
gene is causal to the other by using only statistical evidence, whereas
P(xLin7c | D12Mit234) was slightly more significant (P = 10-5) than P(xSacm1l | D12Mit234), P = 10-3. All this hints that Lin7c and Sacm1l are
mediated through the same causal effects but that it seems that their
association with D12Mit234 could be actually a false positive because
it is far less significant than other eQTL found in this same study.
Table 1
shows the associated P-levels and the goodness of fit statistics
provided by SVM (AUC, see methods) of marker and expression levels for
the ten most significant QTL reported in Chesler et al. [12]. The results corresponding to the most highly connected gene (Lin7c)
are also presented. This Table illustrates well the relevance of
external transcript levels as compared to markers. Note that the most
associated variable was a transcript level rather than a marker in six
out of ten genes, and this occurred for the most significant QTL, i.e.,
where the P-values for associated markers are most significant. Figure 3
displays the AUC obtained with the best 50 markers, the best 50
transcript levels, or the 50 best variables (including both markers and
gene levels) in the 67 genes that exhibited the most significant QTL.
Again, it can be clearly seen that transcript levels have a
significantly and consistently better predictive ability than markers.
For a random cDNA, we will expect that external transcript levels will
be even more relevant than for those with highly significant QTL. In
the case of the most highly connected gene, Lin7c, this is evident (Table 1).
Thus, both classical statistics and SVM methods suggest that, on
average, external transcript levels are better predictors of a given
cDNA level than markers.
The above considerations should not imply that the most significant variable is always an
external cDNA level for all genes, and thus that model (2) is to be
preferred over model (1). We observed that usual model (1) was to be
preferred in about 25% of the most significant QTL listed by Chesler et
al[12]. In Table 1, the most significant variable was a marker rather than a transcript level in four out of ten genes (Mela, Myoc, Cd59a, and Krtl-12).
We investigated whether modeling can nevertheless be improved in these
cases. To do that, we searched the next most significant effect among
all external cDNAs and the rest of markers, computing its P-value after
fitting the QTL. We observed that the next most significant variable
was a transcript which in turn was highly significant (P-values ranged
10-8 – 10-30). Note that this is surely an
underestimation of the influence because we were considering those
expression levels for which a marker (QTL) is the most significant
effect. The conclusion that we can draw, again, is that external cDNAs
are likely to be very important factors to be considered in genetical
genomics studies.
Reanalyzing hotspots
The observation of eQTL hotspots, i.e., genome regions that seem to
harbour a much higher number of QTL than expected by chance has been
largely debated in the literature [1,13-15].
This is a remarkable observation and is tempting to look for a
functional significance to these regions. It suggests the presence of
key regulatory, polymorphic motifs in the genome that can have a
profound influence on the genome transcription activity. In a previous
simulation study we observed that QTL hotspots appeared even when
genotypes and microarrays were shuffled, simply as a consequence of the
high correlation that exists between many expression levels [16].
It is important to notice, though, that we can use the correlation
between cDNA levels to our advantage in order to improve eQTL modelling
dramatically.
To investigate the nature of eQTL hotspots and the influence of
model choice, we chose nine genes that were reported by Chesler et al. [12] as influenced by the largest trans regulatory QTL hotspot, i.e., that close to marker D6Mit150 on murine chromosome 6. The nine genes chosen were Reln, Chrng, Slc6a1, Calm4, Mapk6, Adra2b, Mapk1, Gad1, and Htr4. As before, we fitted models (1) and (2). In all analyses with model (1), we found that the most significant marker was D6Mit254, the closest marker to D6Mit150 and also in chromosome 6, thus in agreement with published results. The P-values of the QTL with model (1) are in Table 2, they are in the order of 10-3 – 10-4.
Next, we scanned all transcript levels for each of the nine genes, the
most significant ones for each gene are also listed in the Table. Note
that the P-values are much smaller than the QTL (P-value < 10-30 in
all cases), which clearly shows that the nine expression levels studied
are much more influenced by external transcript levels than by any of
the polymorphisms genotyped. We also performed a QTL scan for these
selected external transcripts and we found that they mapped to regions
distinct from chromosome 6, and thus that they were not members of the
QTL hotspot (results not presented). Finally, we included the relevant
transcript as covariate for each of the nine genes and we repeated the
QTL analysis (model 2). Results are also in Table 2 (last two columns).
Two aspects are important. First, the QTL were more
significant now than with model (1); sometimes significance increased
dramatically, e.g., P-value changed from 10-4 to 10-21 for gene Reln.
This is a clear evidence that expression levels included in the model
remove noise and thus may increase power for QTL detection, as we
observed previously (Figure 2
top). We did not always observe this, in other instances we found that
adjusting for transcript levels decreased QTL significance (Figure 2
bottom). In these latter cases we can conclude that the QTL found with
the simple model is an artefact and that the QTL was truly affecting
other gene expression level. The second noticeable aspect in Table 2 is that the QTL location was shifted for some transcripts (Reln, Calm4, Adra2b, and Htr4).
Although we did not systematically search for all cDNA levels affected
by this QTL, it is clear that the number of the genes in the hotspot
has been reduced by ~50%. This is a challenging result and calls for
revisiting the significance of eQTL hotspots, as they are highly
dependent on the model used. From a purely statistical – rational –
point of view, a model that includes a highly correlated transcript
level is to be preferred over one that includes only the marker:
P-values in the order of 10-30 to 10-45 vs. ~10-4.