Inference of the evolution of intron phase distribution
Koonin's group  compiled intron patterns in the conserved regions of 684 gene orthologs from eight eukaryotes, D. melanogaster, A. gambiae, H. sapiens, C. elegans, Saccharomyces cerevisiae, S. pombe, A. thaliana, and P. falciparum. We used this database for our analysis, but excluded S. cerevisiae due to its sparse intron distribution. Following our previous analysis , we assumed the ecdysozoa tree for these species and applied our maximum likelihood method to infer rates of intron gains and losses as well as the distribution of introns in the last common ancestor of these species. These parameters were then used to infer the most reliable events (>90% confidence) for intron gain and loss along each branch and for intron presence at each ancestor. Phase distributions were then calculated for these events using the phase information for each intron pattern. Note that our method  assumes the same model of intron evolution with the method of Csűrös  but the implementation details are different [29,30].
Compilation of a genome-wide dataset
We downloaded data about the genomes of six eukaryotes (H. sapiens, D. melanogaster, C. elegans, S. pombe, A. thaliana, and P. falciparum) from NCBI , three eukaryotes (N. crassa, F. graminearum, and C. neoformans) from BROAD Institute's Fungal Genome Initiative website , and X. tropicalis from the JGI Eukaryotic Genomics website . For all genomes except X. tropicalis, gene structures were built using annotation. For X. tropicalis, there was no annotation for gene structures, so we first used the cDNA sequences as input to the BLAST program  to query against the DNA sequences. Then the DNA region covering the query result of each cDNA sequence was extracted and the SIM4 program  was used to reconstruct the exon/intron structure. If SIM4 failed to reconstruct the exon/intron structure of a gene (i.e., either match ratio or cover ratio ad hoc program was written in the C programming language to automate the construction of gene structures.
The genes of each genome were then subjected to a purging process to remove redundancy by using a criterion of ad hoc program, which makes use of the program ALIGN  for calculating the identity of a pair of amino acid sequences, was written in C to automate the purging process.
Prediction of intron phase distribution for the all-pattern model
In the all-pattern intron insertion model, we used patterns of 5-bp length, with 3 bp upstream and 2 bp downstream of the splice sites. The number of patterns, N, is therefore 1,024 (= 45). Let Oi (i = 1..N) be the count of pattern i among all observed splice sites; Ci (i = 1..N) be the count of pattern i in all coding regions; and Dij (i = 1..N, j = 0..2) be the count of pattern i appearing at phase j in all coding regions. The preference of intron insertion in pattern i is proportional to Ei = Oi/Ci and the frequency of intron insertion in pattern i, Fi, is calculated by:
Then the expected number of phase-j introns, Pj, is calculated by:
Finally, the expected percentage of phase-j introns, Wj, is calculated by:
Inference of the GC content and intron density in the RP gene dataset
We compiled 79 orthologs of RP genes from four eukaryotes: H. sapiens, A. thaliana, O. sativa, and C. reinhardtii. The RP genes of H. sapiens and A. thaliana were taken from the manually curated Ribosomal Protein Gene database . The RP genes of O. sativa and C. reinhardtii were first collected from the TIGR Rice Genome Annotation website  and the JGI Eukaryotic Genomics website , respectively, by performing BLAST searches using RP genes of A. thaliana as queries. Their gene structures were then manually constructed by using both annotation and the gene structures of H. sapiens and A. thaliana as references. When a gene of a species existed in multiple copies, the copy with the most introns was used.
Multiple sequence alignments for each of these gene orthologs were built using CLUSTAL W , and an ad hoc program was written in C to extract an intron presence/absence matrix and the conserved DNA regions of these alignments. The conserved regions were then concatenated together and the DNAML program of the PHYLIP package  was used to infer the phylogenetic tree and GC contents of the internal nodes of these four species. The GC contents were based only on inferred sites of >95% confidence. Finally, the intron presence/absence matrix and the phylogenetic tree (with H. sapiens as the outgroup) were used as input to our maximum likelihood method  to infer the intron evolution for these species.
Prediction of intron phase distribution with mutation correction
We applied a simple model for mutation correction in which only mutations [see Additional file 2] that change the GC content of a codon but do not affect the translated amino acid are allowed. All positions that allow these mutations are assumed to have the same mutation rate, with positive/negative values meaning that these mutations will happen in the direction that increases/decreases the GC content of a codon. For each mutation rate, the original sequences were randomly mutated using this rate and then the intron phase distribution was predicted using the same protocol for the all-pattern model, but with Ci and Dij taken from the mutated sequences instead of the original sequences. The simulated mutation correction was repeated 20 times and the average of the 20 χ2 values between the predicted and observed intron phase distributions was used as the prediction error for the mutation rate. We then searched for the best mutation rate (i.e., the rate at which the prediction error is smallest) in the range (-1, +1) using the Brent search algorithm . Intron phase distribution predicted using the best mutation rate was taken as the output. A program was written in C to automatically perform the prediction of intron phase distribution for the all-pattern intron insertion model, both with and without mutation correction.
RP: ribosomal protein
HDN, MY, and NK conceived and designed the research. HDN performed the experiments. HDN, MY, and NK analyzed the data. HDN contributed analysis tools. HDN and NK wrote the paper. All authors read and approved the final manuscripts.
We thank Akihiro Nakao, Tamayo Uechi, and Sayomi Higa for useful discussions, and Dr. Ikuo Yoshihara for his encouragement. This study was supported by Grants-in-Aid for Scientific Research (14035103, 15310135, 188093, and 17770207) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), and the Japan Society for the Promotion of Science (JSPS). HDN is a research fellow of the JSPS (17-05174). The sequence data of N. crassa, F. graminearum and C. neoformans were produced by the Fungal Genome Initiative http://www.broad.mit.edu/annotation/fgi/. The sequence data of X. tropicalis and C. reinhardtii were produced by the US Department of Energy Joint Genome Institute http://www.jgi.doe.gov/ and are provided for use in this publication only.