3.1. Specific trends in amino acid substitution patterns between mouse and human orthologs
To investigate whether the mouse and human proteins of high-, medium-, and low-GC composition have followed the same or different evolutionary trajectories since the divergence of the two species, trends in amino acid substitution between the human–mouse orthologous pairs were studied separately in the three groups of genes, using the program SPAST developed in-house. Tables 1, 2, and 3 present the AARMs for aligned regions of orthologous pairs (gap-free regions of length >100 residues at a stretch) in the high-, medium-, and low-GC groups, respectively. As already mentioned, mouse-to-human replacements were taken by convention as the forward direction and human-to-mouse replacements as the reverse direction.
For each group of orthologs, some specific amino acid pairs exhibit significant bias in their replacement patterns. For instance, in the high-GC group, the value of RIG, i.e. the ratio of [Ile]Mouse → [Gly]Human to [Gly]Mouse → [Ile]Human replacements, is 2.51 and statistically significant, implying that Ile in a mouse sequence is replaced by Gly in human >2.5 times as often as Gly in a mouse sequence is replaced by Ile in human. In contrast, in the low-GC group, RIG = 0.48, indicating that in low-GC orthologs, the frequency of substitution of Ile in mouse by Gly in human is less than half the frequency of the reverse substitution. For the medium-GC group, the value of RIG is not statistically significant, suggesting that the frequencies of substitution of Ile by Gly and of Gly by Ile are comparable in the medium-GC orthologs of mouse and human.
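The replacement ratio described above can be illustrated with a minimal sketch. This is not the in-house SPAST program; the function names, the toy alignment, and the simplification of skipping individual gap columns (rather than selecting gap-free stretches of >100 residues) are assumptions made purely for illustration.

```python
from collections import Counter


def replacement_counts(mouse_seq, human_seq):
    """Count (mouse residue, human residue) pairs at mismatched, gap-free columns."""
    if len(mouse_seq) != len(human_seq):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for m, h in zip(mouse_seq, human_seq):
        if m != h and m != "-" and h != "-":
            counts[(m, h)] += 1
    return counts


def replacement_ratio(counts, i, j):
    """Rij = N(i in mouse -> j in human) / N(j in mouse -> i in human)."""
    forward = counts[(i, j)]
    reverse = counts[(j, i)]
    if reverse == 0:
        return None  # ratio is undefined without any reverse replacements
    return forward / reverse


# Toy alignment, invented for illustration (not real orthologs).
mouse = "IIIGAIKI"
human = "GGGIAGKG"
counts = replacement_counts(mouse, human)
r_ig = replacement_ratio(counts, "I", "G")  # forward Ile->Gly over reverse Gly->Ile
```

A ratio above 1 corresponds to an excess of forward (mouse to human) replacements, and a ratio below 1 to an excess of reverse replacements; whether a given ratio is significant requires a separate statistical test that is not shown here.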
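As Rij = 1/Rji, out of the 380 off-diagonal elements of an AARM (Tables 1, 2, and 3), only 190 are independent. Out of these 190 AARM elements, 53 are significantly biased in a specific direction in the high-GC group (46 of them at one of the two significance levels used), and 67 are significantly biased in the low-GC group (56 and 11 at the two significance levels, respectively). In the medium-GC group, only 15 AARM elements are statistically significant (ten of them at one of the two significance levels); some of these share their directionality with the high-GC group and some with the low-GC group.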
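The count of independent elements can be checked with a short sketch: 20 amino acids give 20 × 19 = 380 ordered (i, j) pairs, and because each ratio is the reciprocal of its mirror element, only the 190 unordered pairs carry independent information. The variable names are illustrative.

```python
from itertools import combinations, permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Off-diagonal elements of a 20 x 20 matrix: ordered pairs with i != j.
ordered_pairs = list(permutations(AMINO_ACIDS, 2))

# Independent elements: one per unordered pair, since Rji = 1 / Rij.
unordered_pairs = list(combinations(AMINO_ACIDS, 2))

# Example of the reciprocal relation, using the RIG value quoted in the text.
r_ig = 2.51
r_gi = 1 / r_ig  # the mirror element carries no additional information
```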
3.2. Significant trends in amino acid substitution in high-GC and low-GC groups are, in general, opposite to one another
A careful examination of Tables 1, 2, and 3 reveals that when i represents a residue encoded by A/U-rich codons and j represents a residue encoded by relatively G/C-rich codons, the AARM element, Rij, in most cases (but not in all), is >1 in the high-GC group, <1 in the low-GC group, and close to 1 in the medium-GC group. The reverse situation occurs, in general, when i represents a residue encoded by G/C-rich codons and j a residue encoded by relatively A/U-rich codons. For instance, for i = Ala (A) (encoded by GCN), RAj is significantly >1 in the low-GC group for j = Ile, Asn, Lys, Ser, Thr, Val, Glu, etc. (encoded, respectively, by AUH, AAY, AAR, UCN/AGY, ACN, GUN, and GAR). In contrast, for i = Asn (N) (encoded by AAY), RNj > 1 in the high-GC group when j = Gly, Ala, or Arg (encoded by GGN, GCN, and CGN/AGR, respectively). Altogether, there are 33 AARM elements that are polarized in opposite directions (>1 and <1) in the high- and low-GC groups and are statistically significant in both groups.
Table 4 lists the 15 amino acid pairs having the largest differences between the total numbers of forward (mouse to human) and backward (human to mouse) substitutions for the three groups of orthologs under study. There are eight pairs of residues (marked with ±) that appear among the top 15 trends in both the high- and low-GC groups, but with opposite directionality (Table 4). There are four other pairs of residues among the top 15 of the high-GC group (marked with +) that exhibit significant, but opposite, bias in the low-GC group (Table 4), but do not appear among the top 15 in the latter group. Similarly, there are four pairs (marked with –) among the top 15 of the low-GC group that show opposite and significant bias in the high-GC group, but are not among its top 15. Thus, the trends in amino acid substitutions between the mouse and human orthologs follow reverse directionality, in general, in the high- and low-GC groups. Among the top 15 trends in the medium-GC group, some share their directionality with the high-GC group and some with the low-GC group.
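The ranking behind such a list can be sketched as follows, assuming directional substitution counts are already available as a dictionary keyed by (mouse residue, human residue). The function name, the tie-breaking rule, and the toy counts are hypothetical and do not reproduce the values of Table 4.

```python
def top_biased_pairs(counts, k=15):
    """Rank unordered residue pairs by |forward - backward| substitution counts.

    counts maps (mouse_residue, human_residue) -> number of substitutions.
    Returns tuples (difference, x, y, forward, backward), where forward is
    x in mouse -> y in human and x precedes y alphabetically.
    """
    seen = set()
    ranked = []
    for a, b in counts:
        pair = tuple(sorted((a, b)))
        if pair in seen:
            continue
        seen.add(pair)
        x, y = pair
        forward = counts.get((x, y), 0)
        backward = counts.get((y, x), 0)
        ranked.append((abs(forward - backward), x, y, forward, backward))
    # Largest difference first; ties broken alphabetically for determinism.
    ranked.sort(key=lambda t: (-t[0], t[1], t[2]))
    return ranked[:k]


# Toy counts, invented for illustration.
toy_counts = {
    ("I", "V"): 120, ("V", "I"): 80,
    ("D", "E"): 60, ("E", "D"): 50,
    ("S", "T"): 30,
}
top_two = top_biased_pairs(toy_counts, k=2)
```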
In the high-GC group, although amino acids of mouse proteins encoded by A/U-rich codons tend to be replaced by amino acids encoded by G/C-rich codons in their human orthologs, not all amino acid residues encoded by A/U-rich codons exhibit equal bias in their replacement patterns. There are six residues, viz. Phe, Tyr, Met, Ile, Asn, and Lys, that are encoded by A/U-rich codons and four residues, viz. Gly, Ala, Arg, and Pro, that are encoded by G/C-rich codons. As can be seen from Tables 1, 2, and 3, among the amino acid residues encoded by A/U-rich codons, Ile, Asn, and Lys have a larger number of significantly biased replacement ratios (AARM elements) in both the high-GC group and the low-GC group. The replacement ratios of Phe and Tyr, though they follow the general trend, are not statistically significant in most cases. Rather, some other residues such as Ser, Thr, Val, and Leu, which are not necessarily encoded by A/U-rich codons, exhibit significant bias in their replacement ratios (Tables 1 and 3). Similarly, among Gly, Ala, Arg, and Pro, the former two have a larger number of significant Rij values. Previous analyses of many prokaryotic genomes [28] and of high-GC rice genes with their Arabidopsis homologs [29] showed that proteins encoded by GC-rich sequences are characterized by increased levels of Gly, Ala, Arg, and Pro residues and a corresponding decrease in Phe, Tyr, Met, Ile, Asn, and Lys residues. It is, therefore, of interest to examine to what extent the overall usage of the residues Gly/Ala/Arg/Pro and that of Phe/Tyr/Met/Ile/Asn/Lys vary within the mouse and human orthologs of the high-, medium-, and low-GC groups. Our analysis indicates that the mouse and human orthologs of the three groups are indeed characterized by distinct usage profiles of these residues (Supplementary Fig. S1).
In the high-GC group, the human orthologs have higher usage of Gly, Ala, Arg, and Pro and lower usage of Phe, Tyr, Met, Ile, Asn, and Lys compared with their mouse orthologs, but the differences are not as pronounced as those shown previously for homologous gene pairs from rice and Arabidopsis, which differ greatly in GC content [29]. In the low-GC group, the reverse is true, whereas in the medium-GC group, there is no significant difference between mouse and human orthologous pairs in the usage of these two groups of residues.
3.3. (Asp)Mouse → (Glu)Human trend in all groups of orthologs irrespective of their GC content
There is only one replacement ratio, RDE, that exhibits the same directionality and almost the same value in all three AARMs (Tables 1, 2, and 3), indicating that in all groups of orthologs, the frequency of (Asp)Mouse → (Glu)Human replacements is slightly higher than that of the replacement in the opposite direction. As the Asp → Glu replacement is among the top 15 trends in substitution in all three groups (Table 4), it is one of the most common trends in amino acid replacement in mouse–human orthologs. These observations suggest that irrespective of the GC content of the encoding genes, there has been a consistent increase in glutamic acid in human proteins at the cost of aspartic acid compared with their mouse orthologs. The structural and/or functional implications of this unique evolutionary trend are, however, not clear. There are two other substitution trends, Ser → Thr and Phe → Tyr, that also exhibit the same directionality in all three groups under study, but their replacement ratios are not statistically significant for the high-GC group.
3.4. High-GC orthologs are biased towards (A/T)Mouse → (G/C)Human replacements, whereas in low-GC orthologs, (G/C)Mouse → (A/T)Human replacements prevail
As already emphasized, the major trends in amino acid replacements between mouse and human orthologs (Tables 1–4) indicate that in the high-GC group, the amino acid residues encoded by relatively GC-rich codons tend to increase in human proteins compared with their mouse orthologs, and in the low-GC group, the reverse trends prevail. In the medium-GC group, however, there is no such specific directionality in codon substitution patterns. These observations prompted us to examine the trends in nucleotide substitution patterns individually at the three codon positions in the three groups of orthologs. As can be seen from the NRMs shown in Table 5, in the high-GC dataset, rij is significantly greater than 1 when i = A or T and j = G or C. Conversely, rij is significantly less than 1 when i = G or C and j = A or T. These trends hold at all three codon positions, although the deviation of rij from 1 (for a particular pair of i and j) is highest at the third codon positions, followed by the first and second codon positions. Therefore, in the high-GC group, there has been an excess of (A/T)Mouse → (G/C)Human replacements over (G/C)Mouse → (A/T)Human replacements at each codon position individually. For the low-GC group, the reverse situation is encountered (Table 5), i.e. there is a tendency for G and C in mouse genes to be replaced by A or T in their human orthologs, the bias being maximal at the third codon positions. For the medium-GC group, however, no significant difference between (A/T)Mouse → (G/C)Human and (G/C)Mouse → (A/T)Human replacements could be observed at the first and second codon sites, whereas at the third codon sites, the (A)Mouse → (G)Human and (T)Mouse → (C)Human replacements dominate over the reverse replacements.
These observations imply that for the high-GC group, either the GC content tends to increase in human genes relative to mouse or it tends to decrease in mouse genes relative to human, whereas for the low-GC group, either the GC content tends to decrease in human genes relative to mouse or it tends to increase in mouse genes relative to human. This suggests that, with time, compositional heterogeneity has either increased within human genes relative to mouse genes or decreased within mouse genes relative to human genes.
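The per-codon-position comparison discussed above can be sketched as follows, assuming in-frame, gap-free aligned coding sequences. The sketch pools replacements into a single A/T to G/C ratio per codon position, which is a simplification of the full 4 × 4 NRM; the function names and the toy sequences are illustrative assumptions.

```python
from collections import Counter


def nucleotide_replacements(mouse_cds, human_cds):
    """Count (mouse base, human base) mismatches separately for codon positions 1-3."""
    if len(mouse_cds) != len(human_cds) or len(mouse_cds) % 3 != 0:
        raise ValueError("expected in-frame aligned sequences of equal length")
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    for idx, (m, h) in enumerate(zip(mouse_cds, human_cds)):
        if m != h:
            counts[idx % 3 + 1][(m, h)] += 1
    return counts


def at_to_gc_ratio(position_counts):
    """Ratio of (A/T)mouse -> (G/C)human to (G/C)mouse -> (A/T)human replacements."""
    forward = sum(n for (m, h), n in position_counts.items()
                  if m in "AT" and h in "GC")
    reverse = sum(n for (m, h), n in position_counts.items()
                  if m in "GC" and h in "AT")
    if reverse == 0:
        return None  # undefined without reverse replacements
    return forward / reverse


# Toy in-frame sequences, invented for illustration.
mouse = "ATAGCAAATCCG"
human = "ATGGCGGACCCA"
by_position = nucleotide_replacements(mouse, human)
third_position_ratio = at_to_gc_ratio(by_position[3])
```

A ratio above 1 at a given codon position corresponds to the pattern described for the high-GC group, and a ratio below 1 to the pattern described for the low-GC group.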
Our next task was to see to what extent the observed trends in nucleotide substitution patterns have affected the relative GC divergence between mouse and human orthologs. To this end, the number of genes was plotted against their GC12 and GC3 values for both mouse and human in all three groups of orthologs (Fig. 2) using STATISTICA (version 6.0). In all cases, normal distributions were obtained (Fig. 2). In the high-GC group, both the GC12 and GC3 distributions in human are skewed towards the right (increasing GC content) compared with mouse (Fig. 2A and B), but for the low-GC group, the reverse is true (Fig. 2E and F). The extent of inter-species divergence in GC distribution is much more apparent at the third codon positions (Fig. 2B, D, and F) than at the first and second positions (Fig. 2A, C, and E). For medium-GC orthologs, the medians of both the GC12 and GC3 distributions are almost the same in the two species under study (Fig. 2C and D). These observations imply that the intra-species divergence in base composition is higher in human genes than in their mouse orthologs, such that among the GC-rich pairs of orthologs, human coding sequences are usually higher in GC content than their mouse counterparts, but among the GC-poor orthologous pairs, human coding sequences are, in general, lower in GC content than the respective mouse sequences.
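For reference, GC12 and GC3 can be computed from a coding sequence as in the sketch below, taking GC12 as the G+C fraction over the first and second codon positions combined and GC3 as the G+C fraction at the third positions. The function name and the toy sequence are illustrative; the sketch assumes an in-frame sequence containing only unambiguous bases.

```python
def gc12_gc3(cds):
    """Return (GC12, GC3) for an in-frame coding sequence."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("expected a non-empty, in-frame coding sequence")
    # Bases at codon positions 1 and 2 (index modulo 3 equals 0 or 1).
    first_second = [b for k, b in enumerate(cds) if k % 3 != 2]
    # Bases at codon position 3.
    third = cds[2::3]

    def gc_fraction(bases):
        return sum(b in "GC" for b in bases) / len(bases)

    return gc_fraction(first_second), gc_fraction(third)


# Toy sequence: codons ATG, GCC, AAA.
gc12, gc3 = gc12_gc3("ATGGCCAAA")
```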
3.5. Multivariate analysis of synonymous codon usage confirms opposite trends in high-GC and low-GC groups of orthologs
The skewness of GC3 in human genes towards increasing GC in the high-GC group and decreasing GC in the low-GC group compared with mouse orthologs (Fig. 2B, D, and F) is also reflected in the COA of RSCU. Fig. 3A–C shows the axis-1 versus axis-2 plots of the COA on RSCU of genes in the three groups. In all cases, axis-1 exhibits a strong negative correlation with GC content at synonymous substitution sites (GC3S). The distribution of human and mouse genes along axis-1 confirms that in the high-GC group (Fig. 3A), human genes exhibit higher usage of G- or C-ending synonymous codons compared with their mouse orthologs, whereas for the low-GC group, the reverse trend dominates (Fig. 3C). For the medium-GC group, as expected, usage of G/C-ending codons is comparable in mouse and human (Fig. 3B).
It is worth mentioning that the GC contents of the synonymous substitution sites in the mouse and human orthologous pairs exhibit negative correlations in all three groups (supplement-I). These observations are in accordance with the previous report by Takahashi and Nakashima [30].
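The RSCU values that serve as input to the COA can be sketched as below, using the usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch assumes the standard genetic code and DNA codons, excludes stop codons and single-codon families (Met, Trp), and does not implement the correspondence analysis itself; the names are illustrative.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds):
    """Relative synonymous codon usage for each codon of the families present in cds."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[k:k + 3] for k in range(0, usable, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if total == 0 or len(family) == 1:
            continue  # skip absent amino acids and single-codon families
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values


# Toy sequence: GCC x3, GCT, AAG, AAA.
toy_rscu = rscu("GCCGCCGCCGCTAAGAAA")
```

An RSCU value above 1 indicates a codon used more often than expected under equal synonymous usage, so a shift towards G- or C-ending codons appears as higher RSCU values for those codons.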
3.6. Rates of synonymous and nonsynonymous substitutions are the same in all three groups of orthologs
To examine whether the rate of nucleotide substitution between mouse and human orthologs varies among the three groups, the number of synonymous substitutions per synonymous site, dS, and the number of nonsynonymous substitutions per nonsynonymous site, dN, were estimated for 500 randomly selected pairs of orthologs from each group. The value of dS remains almost the same in all three groups (data not shown). The value of dN appears to be lower in the medium-GC group, but its difference from the values for the other two groups is not statistically significant. These observations indicate that although the trends in nucleotide substitutions are polarized in opposite directions in the high- and low-GC groups of orthologs, the rates of synonymous and nonsynonymous substitutions do not vary with the GC bias of the genes.