Calculations were done under the RedHat Linux 6.3 operating system on an Intel Pentium III machine using Blackdown's Java SDK 1.1.8. PAML calculations were done on an IBM PC using the Unix operating system. Sequence analyses were aided by the DARWIN bioinformatics package. The DARWIN package can be obtained by emailing a request to [email protected].
Initially, pairwise alignments were constructed for the aromatase protein sequences available in the database. An evolutionary distance in PAM units was calculated for each pair by applying the PamEstimator package from DARWIN with an empirical log-odds matrix. From these distances, a preliminary evolutionary tree was built for the mammalian sequences, with branch lengths along internal nodes calculated to minimize a least-squares distance. The sequences of the ancestral genes and proteins at branch points in the tree were then reconstructed. Mutations (including fractional mutations) at both the DNA level and the protein level were then assigned to individual branches in the tree using the method of Fitch.
The evolutionary history of the aromatase family was then analyzed using the transition redundant exchange (TREx) metric, which is based on an analysis of two-fold redundant codon systems [24,78]. TREx distances were obtained for each pairwise comparison of aligned aromatase genes. First, the number (n) of two-fold redundant amino acids (Cys, Asp, Glu, Phe, His, Lys, Asn, Gln, and Tyr) that are conserved in the aligned pair was determined. Next, the number (c) of those amino acids that are encoded by the same codon in both sequences was determined, and the fraction of identical codons (f2 = c/n) was tabulated (Supplementary Table [see Additional File 2]). The TREx distances were calculated from the f2 values using the expression kt = -ln((f2 - Equil)/(1 - Equil)), where Equil is the f2 value expected after a large number of nucleotide substitutions have occurred at the synonymous sites.
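The f2 tabulation and the TREx expression above can be illustrated with a minimal Python sketch. It is not the code used in this work: the function names `f2_fraction` and `trex_distance` are hypothetical, the input is assumed to be two already-aligned lists of codons, the standard genetic code is assumed, and the `equil` value of 0.5 in the usage note is a placeholder chosen for illustration, not the equilibrium value used in the analysis.

```python
import math

# Codons of the nine two-fold redundant amino acids named in the text
# (standard genetic code; the two codons of each pair differ at the third position).
TWO_FOLD = {
    "C": {"TGT", "TGC"}, "D": {"GAT", "GAC"}, "E": {"GAA", "GAG"},
    "F": {"TTT", "TTC"}, "H": {"CAT", "CAC"}, "K": {"AAA", "AAG"},
    "N": {"AAT", "AAC"}, "Q": {"CAA", "CAG"}, "Y": {"TAT", "TAC"},
}
CODON_TO_AA = {codon: aa for aa, codons in TWO_FOLD.items() for codon in codons}


def f2_fraction(codons_a, codons_b):
    """Return (n, c, f2) for two aligned codon lists of equal length.

    n counts aligned positions where both sequences encode the same
    two-fold redundant amino acid; c counts those that use the same codon.
    Gapped or non-two-fold positions are skipped.
    """
    if len(codons_a) != len(codons_b):
        raise ValueError("aligned codon lists must have the same length")
    n = c = 0
    for a, b in zip(codons_a, codons_b):
        a, b = a.upper(), b.upper()
        aa_a = CODON_TO_AA.get(a)
        if aa_a is None or aa_a != CODON_TO_AA.get(b):
            continue  # not a conserved two-fold redundant amino acid
        n += 1
        if a == b:
            c += 1
    if n == 0:
        raise ValueError("no conserved two-fold redundant positions")
    return n, c, c / n


def trex_distance(f2, equil):
    """kt = -ln((f2 - Equil) / (1 - Equil)); defined only for Equil < f2 <= 1."""
    if not (equil < f2 <= 1.0):
        raise ValueError("f2 must lie in the interval (equil, 1]")
    return -math.log((f2 - equil) / (1.0 - equil))
```

As a usage example, `f2_fraction(["GAT", "AAA", "TTT", "GGG", "CAT"], ["GAC", "AAA", "TTT", "GGA", "CAT"])` finds four conserved two-fold positions, three of which use the same codon, so f2 = 0.75. With the placeholder `equil = 0.5`, `trex_distance(0.75, 0.5)` evaluates to ln 2. The expression is undefined when f2 falls to or below Equil, which is why the sketch raises an error in that case.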
The DNA sequences for aromatase were phylogenetically analyzed using a maximum likelihood framework in PAUP* 4.0 (beta 10), with the following parameters: shape parameter (alpha) of the gamma distribution (2.1), transition/transversion ratio (1.6), proportion of invariable sites (0.24), and empirical base frequencies. The topology of the resulting tree mirrors those obtained in other molecular studies.
For inter-taxon analyses, families in the MasterCatalog (EraGen Biosciences, Madison WI) were identified that contained at least one representative protein from each of the two taxa of interest. For these families, all inter-taxa pairs of genes were extracted, together with the pairwise protein sequence alignment. A pairwise alignment of the DNA sequences was then generated to follow the protein sequence alignment. If a family contained more than one sequence from a species belonging to one of the taxa analyzed, those sequences were checked to determine whether they were duplicate entries in the database. If so, only one of the duplicate sequences was retained in the analysis. A histogram of the f2 values of the inter-taxa pairs was constructed, and the f2 value characteristic of orthologs was determined. This value was used to calibrate the TREx clock using the divergence of pigs and oxen, and of pigs and humans.
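The step in which the DNA alignment is made to follow the protein alignment can be sketched as follows. This is a minimal illustration, not the procedure used in this work: the function name `thread_codons` is hypothetical, the coding sequence is assumed to start in frame at the first aligned residue, and the sketch does not check that each codon actually translates to the aligned residue.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map a coding sequence onto a gapped protein alignment row.

    Each residue in aligned_protein consumes the next three bases of cds;
    each gap character becomes a three-character codon gap. Returns a list
    of codons with one entry per alignment column.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the aligned protein")
    codons = []
    pos = 0
    for ch in aligned_protein:
        if ch == gap:
            codons.append(gap * 3)  # keep the column, but with no codon
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return codons
```

For example, `thread_codons("MK-F", "ATGAAATTT")` returns `["ATG", "AAA", "---", "TTT"]`. Threading both members of a pair in this way yields two codon lists of equal length that can be compared column by column.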
Codon biases were obtained from the CUTG (Codon Usage Tabulated from GenBank) database made available by the Kazusa DNA Research Institute Foundation, Japan.
Pairwise TREx distances were used to generate lengths for the branches connecting the swine paralogs using the minimum evolution criterion in PAUP. This preliminary analysis was followed by a maximum likelihood analysis of the complete dataset using the PAML program, which included the assignment of KA/KS values to individual branches. Tests of parallel evolution were conducted using Converge, implementing the JTT model.
Secondary structural data based on homology modeling for aromatases were generated using the DARWIN bioinformatics package, and are in agreement with previous studies [83,84]. Renderings of the three-dimensional structure of the proteins were obtained using a beta version of the HyperProtein package (HyperCube, Gainesville FL, USA 32601).
EAG performed the evolutionary, statistical and structural analyses, and prepared the manuscript. LGG cloned genes as part of his Master's work, and called the evolutionary problem to the attention of SAB. TL provided computational infrastructure. RCMS and FAS initiated the work with suid reproductive endocrinology, and supervised LGG. DRS and DAL did the initial bioinformatics analysis. CMJ provided paleontological expertise, constructed the cladogram, and helped prepare the manuscript. SAB developed planetary biological analysis as a paradigm for generating hypotheses about the biological function of proteins, and prepared the manuscript.
We thank three anonymous reviewers for their invaluable comments. We also thank Alaric Falcon, Andres A. Kowalski and Ge Zhao for their assistance. This work was supported in part by N.I.H. grants GM 54075 and GM 067439-01 (S.A.B.), N.I.H. grant HD 21961 (R.C.M.S., F.A.S.) and USDA-NRICGP grant 98-35205-6739 (F.A.S., R.C.M.S.).