2.1. Sequence retrieval
Nucleotide sequences of 12 405 pairs of orthologous coding regions of human and mouse were extracted from the Searchable Prototype Experimental Evolutionary Database (SPEED)21 (http://www.bioinfobase.umkc.edu/speed/), using an in-house program developed in Perl. To minimize sampling errors, a total of 174 sequences that were shorter than 100 codons in either organism were excluded from the analysis. The remaining sequences were subjected to a codon integrity check using a freely available program, CodonW22 (http://www.molbiol.ox.ac.uk/cu/), and the dataset was further screened to remove redundant sequences. The final dataset of human-mouse orthologs contains 12 024 nonredundant sequences. We generated the corresponding nonredundant protein sequences using a C program developed in-house.
2.2. Classification of orthologs in three compositional groups
The pairs of orthologous sequences under study exhibited significant correlations, not only between the overall GC contents, but also between the GC contents at third codon sites (GC3), as shown in Fig. 1A and B. These orthologs were then classified into three groups according to the GC3 contents of the human genes: the low-GC group with (GC3)Human < 50%, the medium-GC group with 50% ≤ (GC3)Human ≤ 70%, and the high-GC group with (GC3)Human > 70%. The numbers of pairs of orthologous genes in these three groups were comparable with one another (3896, 3960, and 4168 pairs in the high-, medium-, and low-GC groups, respectively). The sequences in these three groups were used to examine the trends in amino acid and nucleotide substitution patterns.
The total dataset was also classified into another three groups on the basis of the GC3 contents of the mouse genes: the new low-GC group with (GC3)Mouse < 50%, the new medium-GC group with 50% ≤ (GC3)Mouse ≤ 70%, and the new high-GC group with (GC3)Mouse > 70%. The entire study carried out with the high-, medium-, and low-GC groups was re-checked with these new high-, new medium-, and new low-GC groups of orthologs.
We classified the datasets on the basis of the GC3 of coding sequences rather than the overall GC, because the GC3 contents of mammalian genes are known to correlate strongly with the GC content of the genomic region in which the genes are located.23,24
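The GC3-based grouping described above can be illustrated with a minimal Python sketch. It is not the in-house program used in this study. The function names gc3_percent and gc3_group are hypothetical, and the sketch assumes an in-frame coding sequence in which every complete codon (including any stop codon) contributes to GC3.

```python
def gc3_percent(cds: str) -> float:
    """Percentage of G or C at third codon positions of an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    thirds = cds[2:usable:3]              # every third base, starting at index 2
    if not thirds:
        raise ValueError("sequence contains no complete codon")
    return 100.0 * sum(base in "GC" for base in thirds) / len(thirds)


def gc3_group(gc3: float) -> str:
    """Assign a compositional group using the 50% and 70% thresholds."""
    if gc3 < 50.0:
        return "low"
    if gc3 <= 70.0:
        return "medium"
    return "high"


# Example: group a pair by the GC3 of its human member.
human_cds = "ATGGCCAAGCTGTAA"
print(gc3_group(gc3_percent(human_cds)))
```

The same two functions can be applied to the mouse member of each pair to obtain the alternative (mouse-based) grouping.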
2.3. Analysis of amino acid substitution patterns: evaluation of amino acid replacement matrix (AARM) for different groups of orthologs
The alignments of orthologous sequences of the three groups were created separately using the pairwise alignment program ClustalW25 and only the gap-free aligned regions of length >100 residues were considered, to avoid any spurious short-alignment regions. The numbers of pairs of aligned orthologous genes with gap-free regions of length >100 residues were lower than in the previous set (3659, 2669, and 3291 pairs in the high-, medium-, and low-GC groups, respectively). Amino acid replacements were calculated for all gap-free alignment regions of >100 residues and also for fully aligned sequences including gaps. The replacement data are represented in a 20 × 20 matrix, designated the amino acid replacement matrix (AARM), as shown in Tables 1, 2, and 3 for gap-free alignment regions of >100 residues. (To avoid confusion with standard amino acid substitution matrices such as PAM or BLOSUM, we have used the term ‘Replacement’ matrix.) The elements of the AARM represent the ratio of the number of forward replacements to the number of backward replacements for any specific pair of residues, i.e. the value of any element Rij of the AARM represents the ratio of the total number of [i]Mouse → [j]Human replacements to the number of [j]Mouse → [i]Human replacements. If Rij > 1, there is an overall gain of amino acid residue j at the cost of amino acid residue i in human compared with mouse. If Rij < 1, the reverse will be true. The actual numbers of forward and backward replacements for all possible pairs of amino acid residues for the high-, medium-, and low-GC groups are given in Supplementary Table S1a–c. Apart from the diagonal positions of the matrices (representing identical substitutions), all elements represent nonidentical substitutions. The replacement values for the gap-free alignment regions did not differ significantly from those obtained by alignment of full sequences including gaps.
To test whether there were any significant intra-group variations in the replacement values, subsets of 500 pairs of sequences were taken sequentially from start to end and also randomly from the entire dataset of a specific group of orthologs (i.e. high-/medium-/low-GC group), and a 20 × 20 AARM was determined for each subset of sequence pairs. The replacement values obtained from different subsets of any particular group were then compared, and no significant variations in replacement values were found for individual residue pairs within a group. All these computations were done using the Substitution Pattern Analysis Software Tool (SPAST), a program in C developed in-house.
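The definition of Rij given in this section can be illustrated with a short Python sketch. This is not SPAST. The names count_replacements and replacement_ratio are hypothetical, and the sketch makes two simplifying assumptions: the input is already a list of aligned (mouse, human) protein strings of equal length, and columns containing a gap or a nonstandard residue are skipped rather than used to delimit gap-free regions of >100 residues.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_replacements(aligned_pairs):
    """Count (mouse_residue, human_residue) columns over aligned protein pairs."""
    counts = Counter()
    for mouse, human in aligned_pairs:
        if len(mouse) != len(human):
            raise ValueError("aligned sequences must have equal length")
        for m, h in zip(mouse.upper(), human.upper()):
            # Skip gap characters and ambiguous residues (assumption of this sketch).
            if m in AMINO_ACIDS and h in AMINO_ACIDS:
                counts[(m, h)] += 1
    return counts


def replacement_ratio(counts, i, j):
    """R_ij = N([i]Mouse -> [j]Human) / N([j]Mouse -> [i]Human)."""
    forward = counts[(i, j)]
    backward = counts[(j, i)]
    if backward == 0:
        # Undefined (nan) if neither direction was observed, infinite otherwise.
        return float("inf") if forward else float("nan")
    return forward / backward


pairs = [("ACDA", "ACEA"), ("AEEA", "ADDA"), ("AEA", "ADA")]
counts = count_replacements(pairs)
print(replacement_ratio(counts, "E", "D"))  # three E->D against one D->E
```

Evaluating replacement_ratio for all 20 × 20 residue pairs gives the full matrix; restricting the input list to a subset of 500 pairs gives the per-subset matrices used for the intra-group comparison.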
2.4. Analysis of nucleotide substitution patterns: evaluation of nucleotide replacement matrix (NRM) for three codon sites of different groups of orthologs
We created the nucleotide sequence alignments on the basis of the amino acid alignments and calculated the nucleotide replacements in the form of a 4 × 4 NRM, separately for each of the three codon positions in each of the three groups of orthologs under study. The elements rij of the NRM represent the ratio of the number of forward replacements to that of backward replacements for any specific pair of nucleotides. The nucleotide replacement values obtained from different subsets of 500 orthologous sequences (taken sequentially from start to end and also randomly) of any particular group were then compared, and no significant variations in replacement values were found for individual nucleotide pairs within a group.
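A per-codon-position tally of this kind can be sketched in Python as follows. The names codon_site_counts and nrm_element are hypothetical, and the sketch assumes that the two aligned coding sequences have equal length, start in frame at the first base, and contain gaps only as whole codons, as would be the case for an alignment derived from a protein alignment.

```python
from collections import Counter


def codon_site_counts(mouse_cds, human_cds):
    """Return three Counters of (mouse_base, human_base), one per codon position."""
    if len(mouse_cds) != len(human_cds):
        raise ValueError("aligned sequences must have equal length")
    sites = [Counter(), Counter(), Counter()]
    for pos, (m, h) in enumerate(zip(mouse_cds.upper(), human_cds.upper())):
        # Skip gaps and ambiguous bases; pos % 3 is the codon position (0, 1, 2).
        if m in "ACGT" and h in "ACGT":
            sites[pos % 3][(m, h)] += 1
    return sites


def nrm_element(site_counts, i, j):
    """r_ij = forward (i in mouse, j in human) / backward; None if undefined."""
    forward = site_counts[(i, j)]
    backward = site_counts[(j, i)]
    return forward / backward if backward else None


sites = codon_site_counts("GCAGCAGCG", "GCGGCGGCA")
print(nrm_element(sites[2], "A", "G"))  # ratio at the third codon position
```

Summing the three Counters over all pairs in a group before calling nrm_element gives the group-level matrix for each codon position.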
2.5. Tests for statistical significance of different elements (Rij/rij) of AARM and NRM
For a given pair of amino acids or nucleotides, the mouse-to-human replacement was taken as the forward direction and the human-to-mouse replacement as the reverse direction. Each Rij in the AARM or rij in the NRM represents the ratio of the number of replacements of residue i by residue j in the forward direction (mouse to human) to that in the reverse direction (human to mouse). This means that if Rij (or rij) > 1, the number of (i)Mouse → (j)Human replacements is higher than the number of (j)Mouse → (i)Human replacements, and if Rij (or rij) < 1, the reverse is true.
For each pair of replacements, the ratio of forward to reverse replacements was expected to be 1:1 under unbiased conditions. To test this hypothesis, the observed and expected numbers (based on a 1:1 ratio) were recorded for each pair of a particular group. In all cases, the Chi-square test was used to assess the significance of the directional bias, if any, at p = 0.001 and 0.0001. For each pair of replacements, the first and second rows of the 2 × 2 contingency table represented, respectively, the number of replacements from one particular residue (say, i) to another (say, j) of the pair and the total count of the remaining replacements from the residue i (to residues k, where k ≠ j). The procedure was also repeated for orthologous replacements in subsets of 500 sequences taken sequentially from start to end and randomly. The significant (at p < …) results for the dataset are also consistent with those for sequences taken sequentially from start to end and randomly.
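The 1:1 expectation can be illustrated with a simplified Python sketch. Note that this is a one-degree-of-freedom goodness-of-fit test on the forward and reverse counts alone, not the 2 × 2 contingency-table form described above; the function name one_to_one_chi_square is hypothetical. The p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def one_to_one_chi_square(forward: int, backward: int):
    """Chi-square goodness-of-fit of two counts against a 1:1 expectation (1 df)."""
    total = forward + backward
    if total == 0:
        raise ValueError("no replacements observed in either direction")
    expected = total / 2.0
    chi2 = ((forward - expected) ** 2 + (backward - expected) ** 2) / expected
    # Upper-tail probability of a chi-square variate with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p = one_to_one_chi_square(60, 40)
print(chi2, p, p < 0.001)
```

A pair would be flagged as directionally biased when the returned p-value falls below the chosen threshold (0.001 or 0.0001 in the text).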
2.6. Correspondence analysis (COA) on relative synonymous codon usage (RSCU) and estimation of synonymous and nonsynonymous substitution
Correspondence analysis on RSCU26 was performed using the CodonW program (version 1.4.2)22 to identify the major factors influencing the variation in synonymous codon usage in the three groups of orthologs. This analysis generates a series of orthogonal axes that identify trends explaining the variation within a dataset, with each subsequent axis explaining a decreasing amount of the variation.
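For orientation, the RSCU values that serve as input to the correspondence analysis can be computed as in the Python sketch below. It does not reproduce CodonW or the correspondence analysis itself. The function name rscu is hypothetical, and the sketch assumes the standard genetic code, an in-frame sequence, and the usual definition of RSCU as the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict:
    """RSCU per codon for amino acids with at least two synonymous codons."""
    cds = cds.upper()
    observed = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":                      # stop codons are excluded
            families[aa].append(codon)
    values = {}
    for aa, codons in families.items():
        if len(codons) < 2:                # Met and Trp have a single codon
            continue
        total = sum(observed[c] for c in codons)
        if total == 0:                     # amino acid absent from this sequence
            continue
        for c in codons:
            values[c] = observed[c] * len(codons) / total
    return values


print(rscu("GCTGCTGCCAAA")["GCT"])
```

A value of 1 indicates that a codon is used as often as expected under uniform synonymous usage; values above or below 1 indicate over- or under-representation.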
To examine the nucleotide substitution patterns, we estimated the number of synonymous substitutions per synonymous site, dS, and the number of nonsynonymous substitutions per nonsynonymous site, dN, for ten sets of 500 randomly chosen pairs of orthologs in each of the three groups using the MEGA program (version 2.1), following the method described by Nei and Gojobori.27 The values of dS, dN, and dN/dS of the three orthologous groups were compared by t-test.