Overview
We have annotated 156 genes that have the potential to code for
proteins with the R&R Consensus. RR-2 genes were 65% of the total.
Genes are named in the order in which they were annotated. (See Methods
for further details.) Their locations on chromosomes are given in
Figure 2 and Additional Files 1, 2, 3, 4.
We recognized eight tandem arrays (highlighted in gray) in which genes
are typically spaced a few kb and never more than 20 kb apart. We also
recognized eight sequence clusters, highlighted in different colors,
and named by their order on chromosomes based on Ensembl genomic
coordinates, e.g. 2LA, 2LB, and 2LC. Detailed information supporting
each sequence cluster is provided in Cornman and Willis (MS in
preparation). The same highlighting scheme is used for Figure 2 and on all relevant Tables.
Additional file 1. Supplementary Table 1. Position of CPR genes (coding region only) on contigs arranged in their order on chromosomes.
Format: PDF
Size: 67KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 2. Supplementary Table 2. General information about CPR genes.
Format: PDF
Size: 54KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 2 Summary of An. gambiae genes/proteins with the R&R Consensus. aTandem arrays are shown in alternating shades of gray. The sequences from CPR17 to CPR62 are boxed because they are included in the 2La inversion [29]. bSequence clusters are highlighted in color. cClass for most sequences was determined using the tool at cuticleDB [17]. A ? indicates that manual assignment was necessary. dPeptides
indicates whether a unique peptide or a shared peptide was found for
the corresponding protein via a proteomics analysis of cuticle
preparations [12, He unpublished observations]. Proteins with data in
this column are authentic not putative cuticular proteins. Additional
details about the information in this Figure are in Additional Files 1, 2, 3, 4.
Validation of annotation
The 156 putative genes coding for proteins with the R&R
Consensus is the largest group validated to date for any species. EST
support from public sequence databases obtained at [18]
had to be evaluated carefully, requiring matches in unique
non-translated regions, because of the similarities within groups of
genes. When this was done, a combination of available EST sequences and
RACE and RT-PCR products sequenced by our laboratory confirmed 55% of
the sequences (Additional File 3).
A further 42% were confirmed as unique genes in the G3 strain by
obtaining single-copy kinetics with qPCR on genomic DNA and confirming
their expression with qRT-PCR (Additional File 3, [13]).
Most primers for this analysis were designed around the stop codon to
take advantage of the higher level of variation in the 3' UTR. Together
these data uniquely confirm the large majority of annotated genes.
Using the SignalP web tool [19],
we also confirmed that each predicted protein had a valid signal
peptide, an essential feature of secreted proteins. The presence of a
TATA box and INR (initiator element) in 93% of the genes (see below)
provided additional confirmation of the reality of these genes. Indeed,
these features were important in several cases where Ensembl had
predicted a single gene with multiple R&R Consensus regions, yet
manual annotation revealed that multiple genes were present. A
proteomics analysis based on head capsules and pupal cuticles left
behind after a molt and some batch prepared cuticles [[12],
He unpublished observations] revealed unique peptide(s) for 75 of these
genes and shared peptides for an additional 70, having excluded those
with only DGDVVK, a very common peptide. This information confirms 48%
of genes we annotated as authentic components of the cuticle, and if we
include proteins with shared peptides the proportion rises to 93%
(Figure 2, Additional File 3).
Finally, a combination of all of the supporting data has provided at
least some support for all 156 CPR proteins (Additional File 3).
Given that our experimental work was carried out on the G3 strain, not
the PEST strain that was the source of the annotated genome, we were
gratified by the support we found.
Did we identify all of the genes with the R&R Consensus? We only
included three genes on unplaced contigs (i.e., on the ''UNKN''
chromosome). We had information about these three genes that supported
their inclusion in our analysis. Fifteen of the 22 genes we ignored
were on contigs less than 1 kb, and the longest was on a contig of only
2966 nucleotides. We assumed that these represent alternative
haplotypes present in the PEST strain, which is reasonable given the
level of haplotype variation that has been observed [11].
In one case, we omitted a tract of five candidate genes on chromosome
2R that are nearly identical to, but a subset of, another set of genes (CPR115 – CPR123 and CPR154)
approximately 1 Mb away. The intergenic sequence in this tract was in
some regions completely unresolved whereas in other regions it was
virtually identical, preventing any direct verification of the tract by
PCR. To ascertain whether all of these candidate genes were likely to
be present in the PEST strain, we performed qPCR on genomic DNA with a
primer set that included intronic sequences and was complementary to a
subset of the candidate genes in question. We compared the data to
results for a known single copy gene (chitin synthase, AGAP001748),
which indicated that the actual number of targets was between five and
six and not eight as expected if all of the candidate genes in the
assembled genome were actually present (data not shown). This result
was consistent with only the annotated genes being present in PEST; an
identical result was also obtained with the G3 strain. The coordinates
of the omitted sequences (labelled DUPL A-E) are provided in Additional
File 1
for reference, but we believe there is insufficient evidence of their
validity to warrant their inclusion in the present analysis. Thus,
given the diverse data we used to annotate and confirm the CPR genes
and our exhaustive BLAST searches, it is unlikely that many genes were
missed or are artifacts of assembly or incorrect annotation. We also
identified two pseudogenes as shown in Additional File 1,
but we did not include them as named members of the gene family and
they were excluded from analyses of molecular evolution. Seventy-six
percent of the gene models are modifications or new genes relative to
the Ensembl data available at the time the annotations were done. See
Methods for details on retrieving all the sequences.
Orthologs
Further verification of the annotation comes from orthologs in D. melanogaster. The data on orthologs are summarized in Additional File 5
where obvious differences between pairs are highlighted. Such
comparisons reveal an important complication in the discovery of
distant orthologs, namely that high-scoring reciprocal BLAST hits may
be restricted to the R&R Consensus region with little conservation
of the flanking sequence. This creates ambiguity as to whether a gene
is a genuine ortholog or a product of parallel evolution or domain
shuffling. We present data in Additional File 5
on sequence identity in both the Consensus and the total protein and
have retained putative orthologs only in cases where the evidence was
compelling or interesting.
Additional file 5. Supplementary Table 5. Comparison of An. gambiae and D. melanogaster orthologs.
Format: PDF
Size: 62KB Download file
This file can be viewed with: Adobe Acrobat Reader
We found good orthologs for seven of the 15 proteins
that had over 300 amino acids in their mature form. This indicates that
there is something important about their structure that has been
preserved for about 250 myr [9].
There are instances where the regions flanking the Consensus are
quite different in an ortholog pair, such as AgamCPR132 and DmelCry, a D. melanogaster lens protein [20]. The An. gambiae protein has a stretch of 20 glutamines in a row, while the D. melanogaster protein
with almost the same percentage of glutamines has no cluster longer
than 7. Given the propensity of glutamines to form amyloid-like
structures, this suggests that the structure of the cuticle will be
somewhat different in the two species. The length of the longest
glutamine repeat was variable among the more than 50 cDNAs and ESTs
that have been deposited in public databases from different strains
(median = 23). Interestingly, while the long glutamine tracts in
Huntington's and other human diseases are based on the expansion of
just CAG repeats [21]; the long An. gambiae cluster
uses both CAG and CAA codons. Cry along with AgamCPR132 had numerous
repeats of RREE that may make a contribution to protein structure
comparable to a stretch of glutamines.
Another example is illustrated by AgamCPR152 and the D. melanogaster protein resilin (CG15920). Resilin is a cuticular protein that confers elastic properties to specific regions of cuticle [22].
While the extended Consensus region of the two has 76% identity, the
other regions of the protein are totally different. AgamCPR152 has 20%
histidine residues, resilin had only 1 histidine. Dmelresilin is 35%
glycine, AgamCPR152 only 12%. This is a clear case where the regions
flanking the Consensus are not related in the two species. So is there
a resilin gene in Anopheles? Lyons et al. [23] have made a recombinant protein with 16 units of a repeat (AQTPSSQYGAP) from an An. gambiae EST (BX61961) identified in a BLAST search using the Drosophila gene. When properly cross-linked, this protein has the same resilient elastic properties as a comparable multimer made with the D. melanogaster resilin repeat (GGRPSDSYGAPGGGN). The An. gambiae sequence
corresponds to the incompletely annotated gene AGAP002367. This gene
has 10 perfect copies of the repeat and three additional ones that
differ in one amino acid. It lacks, however, an R&R Consensus even
after probable errors in the original annotation were corrected and no
Consensus region lurks in the surrounding 10 kb. This may well be a
situation where domain shuffling has occurred.
Several other orthologs merit comment. While all other An. gambiae CPR
proteins have only a single R&R Consensus region, there are three
versions of it in AgamCPR144 and its orthologs DmelCpr73D and Tribolium GLEAN_16311. All three share other features. AgamCPR138 is a clear ortholog of the D. melanogaster protein
l(3)mbn. The latter is longer, but they share two interesting features,
the C-terminal position of the R&R Consensus and the presence of
two cysteine residues that are rarely found in mature CPRs. We
identified orthologs for two of the six An. gambiae genes
that lacked the aromatic triad; one had a diad, the other only a single
aromatic residue signifying the start of the R&R Consensus. Each
had an ortholog with the same atypical feature.
Of the six CPR genes on the X chromosome, we identified orthologs for five, but only AgamCPR129 has an ortholog that also resides on the X chromosome in D. melanogaster. The other four orthologs are on 3R and three unrelated genes with the R&R Consensus are on the D. melanogaster X chromosome. It was apparent from this limited number of orthologs, that genes that are on either 2L or 3L in An. gambiae reside on 3L in D. melanogaster.
Core promoter regions and polyA addition sites
For each gene, we summarized the presence of a TATA box, the
position and sequence of its INR if one could be identified, and other
features (Additional File 3).
Analysis of the core promoter region was possible for 153 genes; the
other three had their 5' region determined by RACE, since the genomic
sequence was populated by a string of "Ns", an intra-contig gap, in the
relevant regions. The vast majority, 126/153 (82%), of the CPR genes
have a conventional TATA box (TATAAA). An additional 19 have a variant
TATA box, including those in sequence cluster 2LC where 13/16 genes
appeared to use TATTTAA. In all 145 cases where a TATA box was used, we
were able to identify a putative INR. Cherbas and Cherbas [24] reported four common INRs in D. melanogaster, TCAGT, ACAGT, GCAGT, and TCATT. These sequences comprise 71% of the INRs we identified. Variants are shown in Additional File 3,
some of which (italicized) were confirmed by 5' RACE, others were
deduced by their distance downstream from the TATA box and the presence
of an adenine in the third position. For seven of the eight genes with
no known TATA box, we were able to identify INRs and putative DPEs
(downstream promoter elements) [25].
The conventional polyA addition site (AATAAA) was found in 84% of the
CPR genes within the first 500 nucleotides after the stop codon.
Alternative sites (AATACA, AATATA, AATTAA) have been reported from some
Bombyx cuticular protein genes [26,27].
We found the first two types in 14 additional genes. Hence all but 7%
of the 153 genes that had sufficient C-terminal sequence data available
had known polyA addition sites. The presence of appropriate
untranslated flanking regions is further evidence for the authenticity
of the genes we have annotated.
Patterns of gene architecture
As is evident from Additional Files 1 and 2, An. gambiae CPR genes show considerable variation in exon position and length. Complete coding sequences from genomic DNA
are obtainable for 154 of these proteins; two others, with "Ns" in the
genomic data, were obtained by sequencing RACE products. Eighteen
sequences had only a single exon, 11 of which were in sequence cluster
2LC; one sequence had five exons. The majority (55%) had two exons,
with three and four exons present in 25%, and 8%, respectively. A large
majority (86%) of genes with introns had an interruption in the region
coding for the signal peptide, resulting in short first exons of from
3–36 nucleotides long, excluding the 5'UTR. Both the median and mode
were 12. Short first introns are not unique to CPR genes but are an
obstacle to gene annotation. This is because several plausible first
exons may fit a well-supported gene model and because EST support for
first exons is sometimes lacking. We cloned and sequenced RACE products
from the G3 strain to verify some problematic first exons (see
Additional File 3), but we did not do so routinely for gene models that were well supported by similarity to related CPR genes.
Intron number is variable throughout the CPR phylogeny, suggesting
that intron loss or gain has occurred repeatedly. Overall, however,
there are general patterns that distinguish the exon structure of RR-1
and RR-2 genes. There are fewer RR-2 genes with more than two introns
than RR-1 genes (23% vs. 52%), and exon-intron boundaries are more
strongly biased toward phase 0 (i.e. between codons) in RR-2 genes (96%
of introns) than in RR-1 genes (63%). Furthermore, the interruption of
the Consensus region by an intron is less common in RR-2 genes (15%)
than in RR-1 genes (49%). Indels are very rare in the aligned RR-2
Consensus region and span at most two codons, whereas the aligned RR-1
Consensus region has extensive indels, particularly near the center of
the alignment. Thus, the two classes differ substantially in the
structural diversity of the R&R Consensus. These architectural
differences also suggest that RR-2 proteins may be more amenable to
novel gene formation by transposition of an intact R&R Consensus
domain, although such an occurrence has not been demonstrated for a CPR
gene.
Strain variation in gene number
While there has been no systematic survey of variation in CPR gene
number within a species, serendipitous findings of gene copy number
variation suggest that it may be common, particularly in large tandem
arrays. For example, investigators have identified strain variation in
CPR gene number within D. melanogaster [10,28].
In the present study, we identified a large tandem array of RR-2 genes
on 3R that appears to be particularly dynamic with respect to gene copy
number. Previously, Dotson et al. [14] had sequenced a 17.4 kb genomic clone from the Sua strain of An. gambiae and identified 3 CPR genes (Agcp2a,b,c) in this genomic region. We were able to identify orthologs for all three genes in the PEST strain, but Agcp2b (CPR97) is separated from 2c (CPR100) by 24.3 kb rather than 3.5 kb (Figure 3, Additional File 1).
That larger region has six additional CPR genes. Two are in the same
sequence cluster, one belongs to another sequence cluster, and three
are single-copy genes. We have not investigated if the additional
region or its component genes are present somewhere else in the Sua
strain. Whether copy number variation is prevalent or adaptive in
natural populations remains to be determined.
Figure 3 Example of strain variation in gene number and/or order within An. gambiae.
Orthologous genomic regions from the Sua and PEST strains that contain
different numbers of CPR genes are depicted. The Sua clone sequenced by
Dotson et al. [14] contains three CPR genes, two of which, Agcp2b and Agcp2c, are orthologs of the genes CPR97 and CPR100 that we identified in the PEST strain. Due to a duplication event, the other Sua gene, Agcp2a, is orthologous to both the first half of the pseudogene PseudoA and the second half of CPR96.
The intergenic regions between the three Sua genes are also present in
the PEST genome sequence, but there is also an additional 24.3 kb
segment containing six additional CPR genes. It is not known whether
this segment is absent in the Sua strain or has simply been rearranged
relative to PEST. Diagram is not drawn to scale; single-copy genes
(those not in sequence clusters) are underlined.
An inversion on chromosome 2L in some strains of An. gambiae is
of particular interest because one form (2La) correlates with
desiccation resistance and vector competence discussed in Sharakhov et
al. [29].
Their careful mapping of the inversion boundaries enabled us to place
it in the context of the genes we have annotated with the surprising
result that there are 73 CPR genes within the inverted region (Figure 2, Additional File 1). We used inversion specific primers [30] to verify that the G3 strain, which we used for analysis, is heterozygous for the inversion (data not shown).
General protein properties
A summary of protein properties is in Table 1 and Additional File 4.
The CPR proteins averaged 169 residues (minus the signal peptide), with
a considerable range, 87–837. The major feature of the CPR proteins is,
of course, the presence of the R&R Consensus (see Figure 1).
In all but six sequences, it began with the aromatic triad described
above; in four cases only a diad was found, and in two cases a single
aromatic residue defined the start of the Consensus region. The
position of the start of the R&R Consensus ranges from the fourth
to the 90th percentile of the length of the mature protein, and the average position was just at the start of the second quartile (27th percentile).
Glycine and alanine are major amino acids; 26 proteins have over one
fourth of their residues as just these two amino acids. Three RR-1
proteins (CPR79, CPR133, and CPR153) had over 40% of their residues as
a combination of just these two amino acids. Proline is another major
amino acid in many of the proteins, reaching 15% in CPR79; the average
was 7%. Both proline and glycine-glycine pairs tend to introduce kinks
in protein chains and thereby influence protein chain folding [2].
Both histidine and lysine have been shown to be involved in
sclerotization, the cross-linking of quinone derivatives to the
proteins [31]. On average, RR-2 genes have more histidine residues, 11% vs. 2% (Table 1, Additional File 4).
Table 1. Properties of CPRs by class and by individual RR-2 sequence clusters
We used principal components analysis (PCA) of the
amino-acid composition matrix to further investigate patterns of
variation in amino-acid content of RR-1 and RR-2 proteins. For this
analysis, we excluded cysteine and methionine because these two amino
acids are virtually absent from mature CPR proteins (Additional File 4), hence their removal reduces the dimensionality and noise of the composition matrix. Figure 4
shows a scatterplot of all CPR proteins along the two major principal
component axes, which together explain 57% of the total variation. The
labelled vectors represent the relative contribution of each amino acid
to the variation explained by these two axes. Not surprising given the
composition values mentioned above, alanine, histidine, and glycine are
prominent in discriminating among proteins. Two features of CPR
diversity are particularly apparent from Figure 4.
First, RR-1 proteins show much less variation in amino-acid composition
than RR-2 proteins. Secondly, RR-1 and RR-2 proteins are strongly
discriminated by the second principal-component axis, for which the
proportion of histidine is the most strongly loaded variable.
Figure 4 Scatterplot representing variation in amino-acid content of all putative An. gambiae CPR proteins.
The horizontal and vertical axes are the first and second principal
components, respectively, of the amino-acid composition matrix after
removal of the predicted signal peptide. The labeled vectors indicate
the relative loadings of each amino acid in the first and second
principal components. The amino acids cysteine and methionine, which
are scarce in mature CPR proteins (0.03 and 0.4% respectively), were
excluded from the composition matrix. The percentage of the total
variation explained by each axis is also given. RR-1 and RR-2 proteins
are indicated separately according to the legend. PC axis 1 and 2 =
33.4 + 23.5 = 57% of total variation. The major loadings for the first
five PC axes (82% of the variation) are: A, H, G-Q, V, Q-Y. RR-1 and
RR-2 are strongly separated on the second axis and RR-2s have more
variation overall.
Commonly occurring sequence motifs
Within the two major classes of the R&R Consensus sequence, some
common variants are recognizable that extend the region of alignable
sequence in the N- or C-terminal directions for a subset of genes. One
common variant found in RR-1 genes has a proline-rich region adjacent
to the defined Consensus (Figure 1),
which can be approximated by the expression GFQPQGxHxPxPPP. A second
common variant is found in RR-2 genes at the C-terminus of the R&R
Consensus, approximated by the expression GFNAVV(HR)RE(GP). Also
present in 38 RR-2 genes was RDGDVVKG. All three of these variants
occur in tandem gene arrays in An. gambiae as well as other
Dipterans, indicating that they arose to high frequency by tandem gene
duplication (Cornman and Willis, MS in preparation). The first two,
however, are also present in Apis and Tribolium (Cornman,
unpublished data) indicating that these motifs are old and, given their
high level of conservation, probably have functional importance,
possibly in addition to their participation in chitin-binding.
CPR genes frequently contain low complexity sequence in the regions
flanking the R&R Consensus, and several common repeats have been
recognized (reviewed in [2,3]). Andersen et al. [2]
described a common repeat in cuticular proteins, AAP(A/V). We
identified this repeat or a variant form, AAPL, in one quarter of An. gambiae CPR proteins (Additional File 4).
To systematically search for more complex sequence patterns outside of
the R&R Consensus, we used the MEME server (see Methods). For this
analysis, we removed the signal peptide and R&R Consensus so that
they would not influence the motif search. We submitted all annotated An. gambiae and Aedes aegypti CPR
genes to achieve the best representation of sequence variation, and we
ensured that no putative motif spanned the artificial gap created by
removing the Consensus. To eliminate the effect of duplication per se,
we ignored any motifs present only within a single sequence cluster.
Given the high proportion of low-complexity sequence, we also required
that the motif be more than five amino-acids long. Only one motif was
found that met these criteria, a 16 amino-acid motif that is present at
or very near the C-terminus of all sequences that contain it. All An. gambiae genes
with this motif are RR-2 genes that occur on chromosome 2L. A sequence
logo representation of observed matches is shown in Figure 5A.
The position-specific scoring matrix for this motif, which can be used
to search custom datasets via the MAST option of the MEME server, is
presented as Additional File 6.
When we compared the motif matches to the full alignment of protein
sequences, it was apparent that a more concise motif could be obtained
(Figure 5B)
by removing the first position and allowing single-position indels. It
is interesting to note that this motif is rich in histidine, which is
along with alanine and glycine a major contributor to variation in
amino-acid composition among CPR proteins as discussed above. We show
in a later section that this modified motif is under purifying
selection.
Additional file 6. Supplementary Table 6.
Position-specific scoring matrix of the motif identified at the
C-terminus of most RR-2 genes on chromosome 2L.
Format: PDF
Size: 9KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 5 C-terminal sequence motif identified by MEME analysis. A. Sequence logo representing actual matches to the motif position-specific scoring matrix in An. gambiae and Ae. aegypti. The motif is present in 29 An. gambiae RR-2 genes on chromosome 2L and in 65 annotated genes in Ae. aegypti.
B. Sequence logo representing a modified version of the motif that is
reduced by one position and permits indels. Arrows indicate sites under
negative selection at P < 0.05, as determined by
single-likelihood-ancestor counting [32]. Overall Ka/Ks for the
modified motif was estimated to be 0.42 [95% CI: 0.39, 0.65].
Phylogeny and genomic organization of CPR genes
In An. gambiae, CPR proteins cannot generally be aligned
outside of the R&R Consensus and signal peptide, consistent with
the pattern of insect CPRs generally. We therefore based our
phylogenetic analysis on the region that spans the R&R Consensus
from the fifth position N-terminal to the tenth position C-terminal of
the pfam00379 sequence (Figure 1). We included these flanking regions because, while they are not alignable across all An. gambiae CPRs,
they incorporate common sequence variants that are alignable within
large subsets of genes and thus are phylogenetically informative.
However, we double-weighted all positions from the aromatic triad to
two positions past the final invariant glycine (Figure 1)
as these positions can be aligned across all CPRs. We used amino-acid
sequence rather than codon-aligned nucleotide sequence because
mutational saturation of the latter is evident across the gene family
as a whole.
The neighbor-joining phylogeny we recovered identifies two main groups that constitute the core RR-1 and RR-2 proteins. Figure 6 shows the overall topology of the tree; detailed phylogenies of RR-1 and RR-2 genes are presented as Figures 7 and 8.
The RR-2 clade has much shorter branch lengths on average than the RR-1
clade because the former group is dominated by sequence clusters.
Several long-branch genes lie between the main RR-1 and RR-2 clades.
This group has low bootstrap support and is probably an artificial
group caused by long-branch attraction. Both RR-1 and RR-2 genes are
present in this group. The group includes five of the six X-chromosome
genes as well as all three genes that match the "RR-3" designation of
Andersen et al. [16] which are indicated in Figure 6.
Removal of this group of long-branch genes raises bootstrap support for
the core RR-1 and RR-2 clades to 100% (results not shown).
Figure 6 Topology of the neighbor-joining tree of all An. gambiae CPR genes.
Gene names are omitted here for clarity. Detailed phylogenies of the
RR-1 and RR-2 clades are presented as Figures 7 and 8. Distances are
based on the amino-acid sequence of the aligned and weighted R&R
Consensus as described in the text. We used the JTT exchange matrix
[49] as implemented in MEGA3 [48]. The core group of RR-1 and RR-2
proteins and an intermediate group of RR-1, RR-2, and RR-3 proteins are
marked. The sequence clusters defined in the text are also marked.
Symbols represent the chromosome arm on which each gene is located. The
three without symbols have not been placed on a chromosome.
Figure 7 Phylogenetic relationships of the core RR-1 and intermediate CPR proteins. A detailed view of the neighbor-joining tree of An. gambiae CPR
genes with the branch leading to the main group of RR-2 proteins shown
in Figure 6 collapsed. Numbers at nodes indicate the percentage of 1000
bootstrap replicates that support the node.
Figure 8 Phylogenetic relationships of the core RR-2 proteins. A detailed view of the neighbor-joining tree of An. gambiae CPR
genes with the branch leading to the main group of RR-1 proteins shown
in Figure 6 collapsed as are the individual sequence clusters. Numbers
at nodes indicate the percentage of 1000 bootstrap replicates that
support the node.
CPR genes are found on all of the chromosomal arms
except for the Y chromosome but have a biased distribution with 51%
occurring on 2L and 28% occurring on 3R. Phylogenetically, genes show a
very strong tendency to cluster by chromosome (Figure 2, Figure 6), indicating that inter-chromosome duplications are rare. RR-1 and RR-2 tandem arrays show no overlap on chromosomes (Figure 2, Additional File 1) and contain few non-CPR genes.
Evolutionary patterns within the R&R Consensus and flanking sequence
The pattern of natural selection acting during the evolutionary
history of a gene family can be illuminated by comparing the rate of
nucleotide substitutions that change the protein sequence (Ka) to the
rate of nucleotide substitutions that do not (Ks). Coding sequences
that have a ratio of Ka to Ks equal to one are evolving in a manner
consistent with neutrality, that is, nonsynonymous mutations are as
likely to be fixed as synonymous ones. Ka/Ks greater than one indicates
a higher rate of amino-acid substitution than expected under neutrality
and implies that positive selection has driven the divergence from the
ancestral state. Ka/Ks lower than one implies that changes in protein
sequence are selected against on average.
To investigate the possibility of adaptive evolution of the R&R
Consensus during the diversification of CPR genes, we calculated Ka/Ks
within this region for all An. gambiae paralog pairs and An. gambiae – Ae. aegypti ortholog
pairs that we identified. For this analysis, we used a shortened
version of pfam00379, in which we excluded the first seven positions,
to define a more strict R&R Consensus for An. gambiae.
This was done because the alignment of the first seven sites across all
RR-1 or RR-2 proteins was ambiguous and therefore not useful for
analyses of substitution patterns.
One difficulty with the interpretation of Ka/Ks ratios between
functional paralogs is that a finding of increased Ka/Ks after
duplication may be explained by relaxed selection on initially
redundant genes or by adaptive evolution at some fraction of sites
within a functionally constrained sequence, or both successively. Only
if Ka/Ks is substantially greater than one is adaptive evolution
strongly implicated, although it remains a possible explanation for
even small increases in Ka/Ks. Codon-based models of nucleotide
substitution are potentially more useful because they can identify
individual sites under selection within a local region of low Ka/Ks,
but their power is strongly dependent on sample size [32]
and the actual pattern of selection. We used both approaches to
investigate the molecular evolution of the chitin-binding R&R
Consensus, which has a well-characterized secondary structure and both
labile and highly conserved positions [8,33].
Figure 9
shows mean Ka/Ks for all pairwise comparisons among single-copy RR-1
and RR-2 paralogs, respectively, for which the Jukes-Cantor corrected
Ks is less than 2 (to reduce the effect of mutational saturation). The
Ka/Ks between ortholog pairs in An. gambiae and Ae. aegypti with
corrected Ks < 2 are also shown for comparison. Ka/Ks is higher on
average among RR-1 paralogs than RR-2. A significant fraction of RR-1
gene pairs have Ka/Ks greater than 1, whereas RR-2 genes pairs have
Ka/Ks almost exclusively below 1. Interestingly, the six CPR genes on
the X chromosome (CPR125 – CPR130) were all among the genes with the highest average pairwise Ka/Ks (Table 2). This finding accords with previous work [34] that found a significant increase in the Ka/Ks of X-linked versus autosomal duplicates in D. melanogaster.
The higher rate of evolution of X-linked duplicates is a theoretical
prediction if one assumes that most advantageous mutations are
recessive and hence more visible to selection when X-linked [35].
Figure 9 Ka/Ks of RR-1 and RR-2 single-copy genes. Ka/Ks is plotted as a function of Ks for all An. gambiae paralog pairs and all An. gambiae – Ae. aegypti ortholog
pairs for which the estimated Ks < 2. Genes in sequence clusters are
excluded because they are likely to be products of recent duplication
or gene conversion, which can bias Ka/Ks.
Table 2. CPR genes with the highest mean pairwise Ka/Ks. Genes in bold are on the X chromosome
We next used codon- and branch-based tests of selection to test for positive selection on the R&R Consensus of An. gambiae CPRs. We used the single-likely-ancestor counting method, SLAC [32], and GA-Branch method [36] implemented by the DataMonkey web server [37].
The SLAC method identifies individual codons that deviate from neutral
expectation with respect to nonsynonymous versus synonymous
substitutions, whereas the GA-Branch method identifies lineages of a
phylogeny that differ in overall Ka/Ks. Subsets of CPR genes were
investigated by submitting well supported interior clades that are
likely to have experienced less mutational saturation and by removing
highly similar gene duplicates. In no case did we find any positively
selected sites within the R&R Consensus at P < 0.05, whereas a
large majority of sites were found to be under negative selection.
Conceptually similar but methodologically distinct methods for
identifying sites under positive selection were also implemented with
the codeml program of the PAML package [38]
and gave similar results. Specifically, evolutionary models that
included a category of Ka/Ks > 1 in addition to either a single
category of Ka/Ks < 1 or a beta distribution of Ka/Ks < 1 were
not significantly more likely than models that did not include positive
selection (codeml model options 2 versus 1 and options 8 versus 7,
respectively, 2ΔlnL ≈ 0 for both
comparisons). Thus, diversification of the CPR gene family does not
appear to have been driven by strong positive selection on specific
sites within the R&R Consensus. We estimated the global branch
Ka/Ks within the R&R Consensus under the beta-distribution model
(model 7) of codeml, which was significantly more likely than a single
Ka/Ks < 1 category (model 1) for both the RR-1 and RR-2 clades (P
<< 0.01), and after removing all but one haphazardly chosen gene
of each sequence cluster. The evolutionary rate of amino-acid
substitutions was more than twofold higher within the RR-1 clade (Ka/Ks
= 0.271) than within the RR-2 clade (Ka/Ks = 0.118).
Although no sites or sites within lineages showed significant
evidence of positive selection, the relative rates of amino-acid
evolution are nonetheless variable among lineages, as indicated by
Tajima's test [39]. The most notable example is the branch leading to the 3RB sequence cluster (Figures 6, 8)
compared with the linked sequence clusters 3RA and 3RC (average P <
0.01 across all possible gene comparisons for the three sequence
clusters). Since we found no support for elevated Ka/Ks occurring along
this or any other lineage as described above, the high rate of
evolution in the 3RB sequence cluster appears to be due to
mutation-rate variation alone. Consistent with this interpretation, the
3RB sequence cluster has a substantially lower GC content (36.9%) than
the other 3R sequence clusters (46.7%), suggesting divergent mutational
histories. This localized difference in GC content is surprising given
the very tight linkage of the genes in these three sequence clusters
(Additional File 1).
We also performed a SLAC analysis on all examples of the motif identified by MEME (aligned with indels) from An. gambiae and Ae. aegypti. Ten of fifteen sites were found to be under purifying selection at P < 0.05, as shown in Figure 5B,
and the Ka/Ks for the region was estimated to be 0.42 (95% C.I. 0.39,
0.65). While the motif is evidence of sequence conservation among
paralogs outside the R&R Consensus, the level of conservation is
somewhat less than what is seen among single-copy orthologs in the two
mosquito genomes (Ka/Ks of 0.08 – 0.33 with the R&R Consensus
removed).
Our analysis of the CPR gene family in An. gambiae reveals
that a high number of paralogs can be retained in insect genomes with
remarkably low rates of pseudogene formation. The retention of these
paralogous single-copy genes or sequence clusters over periods of 100
million years or more, as evidenced by orthologs in Ae. aegypti (Cornman
and Willis, MS in preparation), implies that there has been extensive
functional diversification of this group. We do not yet know the nature
of this functional diversification, although there is certainly broad
variation in the complement of CPR proteins expressed at different
developmental stages and in different tissues as evidenced by gene
expression and proteomic studies ([12,13], He unpublished data). There are also a few examples from D. melanogaster of
CPR proteins that have been shown to be essential components of
specialized cuticular structures, such as crystallin in the eyes [22] and resilin in wing tendons, ligaments, etc. [22,23]. Nonetheless, codon-based analyses of the R&R Consensus of An. gambiae CPR
proteins failed to identify any sites under positive selection,
although such tests can have limited power to detect ancient positive
selection and Ka/Ks>1 within gene families is rarely found [40].
Our comparison of the RR-1 and RR-2 classes presents an interesting
dichotomy regarding the diversification of the R&R Consensus and
flanking sequences. The R&R Consensus of the RR-1 class has higher
rates of amino-acid evolution and is structurally much more variable
than the RR-2 family, and a number of RR-1 proteins have average
pairwise Ka/Ks approaching one, many times higher than the mean Ka/Ks
for orthologous pairs in An. gambiae and Ae. aegypti.
To the extent that there has been adaptive evolution of the
chitin-binding Consensus, it has probably been more prevalent in the
RR-1 clade. In contrast, the RR-2 clade appears to have diversified in
terms of amino-acid composition to a much greater extent than the RR-1
class and in general there is a higher frequency of histidine residues,
which are involved in cross-linking, in this group. It seems likely
that the functional diversification of the RR-2 group derives to a
greater extent from the properties of the sequence flanking the
Consensus than from any particular changes in the chitin-binding
sequence itself.
One of the most remarkable features of this gene family in An. gambiae,
the presence of sequence clusters of highly similar genes, begs
investigation as to the possible selective advantage of increased copy
number for these particular genes. A separate analysis of these genes
was too extensive to include here, but our unpublished data indicate
that these proteins have complex repeats and unusual amino-acid
compositions relative even to other RR-2 proteins. Furthermore,
different sequence clusters appear to have independently acquired
compact gene architectures, perhaps as a response to selection for
increased gene expression, and concerted evolution in addition to
purifying selection appears to be an important mode by which protein
similarity is maintained in these groups.