Details of analysis parameters and settings are available in SI Text.
Data Composition. Orthology extraction and multiple sequence alignment were performed by using Online Codon-Preserved Alignment Tool (OCPAT), an in-house-developed tool that combines the BLAST (34) and Clustal (35) algorithms and preserves the correct protein coding reading frame in all taxa aligned. The tool is available at http://homopan.wayne.edu/Pise/ocpat.html. Details of the analysis pipeline are provided at http://homopan.wayne.edu/ocpat/index.html (54). Taxa included in the study were Homo sapiens (human) (36), Pan troglodytes (chimpanzee) (37), Macaca mulatta (Rhesus monkey) (38), Mus musculus (mouse) (39), Rattus norvegicus (rat) (40), Oryctolagus cuniculus (rabbit), Canis familiaris (dog) (41), Bos taurus (cow), Dasypus novemcinctus (armadillo), Loxodonta africana (African elephant), Echinops telfairi (tenrec), Monodelphis domestica (gray short-tailed opossum) (42), Gallus gallus (chicken) (43), and Xenopus tropicalis (Western clawed frog). All data used in the analysis were last updated at the end of August 2006. Ornithorhynchus anatinus (platypus) and other recently completed draft genomes were not included. Nucleotide composition was calculated by using MEGA 4.0 (44).
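As an illustration of what a nucleotide composition calculation involves, the following minimal Python sketch counts the four bases across a set of aligned sequences and reports their relative frequencies. It is not the MEGA 4.0 procedure: the function name nucleotide_composition, the toy alignment, and the choice to ignore gaps and ambiguity codes are assumptions made for this example only.

```python
from collections import Counter


def nucleotide_composition(sequences):
    """Return the relative frequency of A, C, G and T over all sequences.

    Assumption for this sketch: gap characters and ambiguity codes
    (anything other than A, C, G, T) are simply ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous nucleotides found")
    return {base: counts[base] / total for base in "ACGT"}


# Hypothetical toy alignment (not data from the study).
toy_alignment = ["ATGC-A", "ATGCNA"]
composition = nucleotide_composition(toy_alignment)
print(composition)
```

Frequencies computed this way sum to 1 by construction, as do the four assumed nucleotide frequencies reported for the ML analysis.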
Phylogenetic Analyses. ML analysis. ML analyses were conducted in PAUP* Ver. 4.0b10 (45) by using Model settings determined by the program ModelTest Ver. 3.7 (46), as chosen by Akaike Information Criterion (AIC). These settings correspond to the general time reversible (GTR) + Γ + I model with four rate categories. Assumed nucleotide frequencies were: A = 0.27630, C = 0.24200, G = 0.25440, and T = 0.22730. The assumed proportion of invariable sites = 0.3082 and the distribution of rates at variable sites = gamma (discrete approximation) with a shape parameter (α) = 0.6954. Our heuristic analysis included 10 random sequence additions and tree-bisection-reconnection (TBR) tree searching. Complete settings are available in SI Text. Additional analyses were conducted in a new beta version (3.0) of PhyML (47), which includes fast subtree pruning and regrafting (SPR) tree search (48). We first ran PhyML with an SPR search from 10 random starting trees and with GTR + Γ + I model with four rate categories but without bootstrap. The proportion of invariant sites, shape parameter of the gamma distribution and GTR parameters were estimated from the data. Then, we performed a bootstrap analysis with 100 replicates by using BioNJ trees as starting points, SPR tree search, and the model parameter values estimated in the first run. All inferred trees (i.e., 10 with random starting trees and 100 with resampled data) were identical.
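Model selection by AIC can be sketched as follows. The criterion is AIC = 2k − 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood; the model with the lowest AIC is preferred. This is a minimal illustration of the criterion, not the ModelTest implementation. The function names, the candidate model list, and in particular the log-likelihood values and parameter counts below are invented for the example and are not results from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def best_model_by_aic(candidates):
    """Return the name of the candidate with the lowest AIC.

    candidates maps a model name to a (log_likelihood, n_params) pair.
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical values for illustration only (NOT from the study).
toy_candidates = {
    "JC69": (-1050.0, 0),
    "HKY85": (-1020.0, 4),
    "GTR+G+I": (-1000.0, 10),
}

for name, (lnl, k) in toy_candidates.items():
    print(name, aic(lnl, k))
print("preferred:", best_model_by_aic(toy_candidates))
```

The penalty term 2k is what keeps a parameter-rich model from being selected unless its likelihood gain justifies the extra parameters.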
Bayesian analysis. The parallel MrBayes [Ver. 3.1.2 (49–51)] analysis was carried out by using the resources of the Computational Biology Service Unit from Cornell University (Ithaca, NY).
With 14 taxa and a concatenated sequence of 1,443,825 bases, the parallel processes ran for 1 week for 1 million generations of Markov-Chain Monte Carlo. We chose to assume partition homogeneity rather than using mixed models, which potentially would suffer from overparameterization from the 1,600+ loci included in the analysis. For the 1.4-Mb data set, we also ran the analysis using BayesPhylogenies (52). In this case, we assumed a single GTR pattern.
MP analysis. We conducted a heuristic search in PAUP* Ver. 4.0b10 consisting of 100 random addition sequence replicates using the tree-bisection-reconnection (TBR) branch-swapping algorithm. We excluded third codon positions from the parsimony analysis because of potential saturation at those sites. Thus, 481,275 of the 1,443,825 characters were excluded, leaving a data set of 962,550 characters. In addition to searching for the optimal tree, we conducted a bootstrap analysis that was based on 1,000 pseudoreplicates with 10 random addition sequence replicates per pseudoreplicate. The TBR algorithm was also used in this analysis. We also conducted an MP analysis on the translated amino acid sequences. In addition to the full complement of taxa, we ran parsimony analyses to reflect the taxon sampling in refs. 4 and 6.
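The exclusion of third codon positions can be illustrated with a short Python sketch. It assumes that each aligned sequence is in frame, starts at codon position 1, and has a length divisible by 3, which is consistent with a codon-preserving alignment. The function name drop_third_positions and the toy sequence are hypothetical. In the study itself the exclusion was performed within PAUP*, not with code like this.

```python
def drop_third_positions(aligned_seq):
    """Remove every third site (codon position 3) from an in-frame sequence.

    Assumption for this sketch: the sequence starts at codon position 1
    and its length is a multiple of 3.
    """
    if len(aligned_seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    # Zero-based indices 2, 5, 8, ... are the third codon positions.
    return "".join(base for i, base in enumerate(aligned_seq) if i % 3 != 2)


# Hypothetical toy sequence: codons ATG GCC TAA.
print(drop_third_positions("ATGGCCTAA"))  # ATGCTA

# Character counts for the concatenated alignment reported in the text.
total_sites = 1_443_825
excluded = total_sites // 3        # one third-position site per codon
retained = total_sites - excluded
print(excluded, retained)          # 481275 962550
```

The arithmetic matches the reported counts: one third of the 1,443,825 aligned sites are third codon positions, and the remaining two thirds are retained.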
NJ analysis. We conducted an NJ search with 1,000 full heuristic bootstrap replicates in PAUP* Ver. 4.0b10, using the ML distances as selected by ModelTest, and in MEGA 4.0, using the ML composite distance. We also conducted an NJ bootstrap analysis (1,000 replicates) in MEGA on the translated amino acid sequences using the Jones–Taylor–Thornton distance. In addition to the full complement of taxa, we ran NJ analyses to reflect the taxon sampling in refs. 4 and 6.
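The bootstrap analyses described for the ML, MP, and NJ methods all rest on the same resampling step: alignment columns are drawn with replacement to build a pseudoreplicate of the same length as the original, and a tree is then inferred from each pseudoreplicate. The sketch below shows only that resampling step in Python. It is not the PAUP*, PhyML, or MEGA implementation; the function name bootstrap_alignment, the toy alignment, and the fixed seed are assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap pseudoreplicate of an alignment.

    alignment maps taxon name -> aligned sequence (all of equal length).
    The same randomly drawn columns are used for every taxon, so each
    resampled site remains a complete alignment column.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    # Draw n_sites column indices with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(seq[i] for i in columns)
        for taxon, seq in alignment.items()
    }


# Hypothetical toy alignment (not data from the study).
toy_alignment = {
    "human": "ATGGCCTAA",
    "chimp": "ATGGCTTAA",
    "mouse": "ATGGCATGA",
}
rng = random.Random(2006)  # fixed seed so the example is reproducible
replicate = bootstrap_alignment(toy_alignment, rng)
for taxon, seq in replicate.items():
    print(taxon, seq)
```

Repeating this step 100 or 1,000 times and recording how often each clade appears among the resulting trees gives the bootstrap support values.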