Genome-scale mining for phylogenetic markers
Whole-genome sequences of Danio rerio and Takifugu rubripes were retrieved from the ENSEMBL database [54].
Exon sequences longer than 800 bp were then extracted from the
genome databases. The extracted exons were compared in two steps: (1)
within-genome sequence comparisons and (2) between-genome comparisons.
The first step is designed to generate a set of single-copy nuclear
gene exons (length > 800 bp) within each genome, whereas the second
step should identify single-copy, putatively orthologous exons between D. rerio and T. rubripes (Figure 2).
The BLAST algorithm was used for sequence similarity comparison. In
addition to the parameters available in the BLAST program, we applied
another parameter, coverage (C), to identify global sequence similarity
between exons. Coverage was defined as the ratio of the total length of
locally aligned sequences to the length of the query sequence. The
similarity (S) was set to S < 50% for the within-genome comparison,
which means that only genes that have no counterpart more than 50%
similar to themselves were kept. For the cross-genome comparison, the similarity was set to S > 70%
and the coverage was set to C > 30%,
which selected genes that are more than 70% similar and more than 30% aligned between D. rerio and T. rubripes. Subsequent comparisons were performed on the newly available genomes of stickleback (Gasterosteus aculeatus) and Japanese rice fish (Oryzias latipes),
as described above. We automated this procedure in the PERL programming
language and made the source code publicly
available on our website [43]. We are working to make it applicable to other genomic sequences and parameter values.
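The coverage definition and the two filtering thresholds described above can be sketched as follows. This is a minimal illustration in Python, not the authors' PERL code: the function names, the input layout (1-based inclusive query coordinates for each local alignment, similarities given in percent), and the choice to count overlapping local alignments only once are all assumptions made for this sketch.

```python
from typing import Iterable, List, Tuple


def query_coverage(alignments: List[Tuple[int, int]], query_length: int) -> float:
    """Coverage C: aligned query positions divided by query length.

    `alignments` holds (start, end) query coordinates, 1-based and inclusive.
    Assumption: positions covered by several local alignments count once.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    covered = 0
    last_end = 0  # last query position already counted
    for start, end in sorted(alignments):
        start = max(start, last_end + 1)  # skip positions counted earlier
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / query_length


def is_single_copy(similarities_to_other_exons: Iterable[float],
                   s_max: float = 50.0) -> bool:
    """Within-genome filter: keep an exon only if every other exon
    in the same genome is less than s_max percent similar (S < 50%)."""
    return all(s < s_max for s in similarities_to_other_exons)


def is_putative_ortholog(similarity: float, coverage: float,
                         s_min: float = 70.0, c_min: float = 0.30) -> bool:
    """Cross-genome filter: S > 70% and C > 30%."""
    return similarity > s_min and coverage > c_min


if __name__ == "__main__":
    # Hypothetical exon of 1000 bp with two overlapping local alignments.
    c = query_coverage([(1, 300), (200, 500)], 1000)
    print(c, is_putative_ortholog(82.0, c), is_single_copy([31.5, 44.0]))
```

Keeping the thresholds as keyword arguments mirrors the stated goal of supporting other parameter values; the defaults are the values given in the text.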
Experimental testing for candidate markers
PCR and sequencing primers were designed on aligned sequences of D. rerio and T. rubripes for 15 randomly selected genes. Primer3 was used to design the primers [55].
Degenerate primers and a nested-PCR design were used to ensure
amplification of each gene in most of the taxa. Ten of the 15 genes
tested were amplified as a single fragment in most of the 36 taxa
examined. PCR primers for these 10 gene markers are listed in Table 1.
The amplified fragments were directly sequenced, without cloning, using
the BigDye system (Applied Biosystems). Sequences of the frequently
used RAG1 gene were retrieved for the same taxa from GenBank for
comparison to the newly developed markers [GenBank: AY430199, NM_131389, U15663, AB120889, DQ492511, AY308767, AF108420, EF033039 – EF033043]. When RAG1 sequences for the same taxa were not available, a taxon of the same family was used, i.e. Nimbochromis was used instead of Oreochromis and Neobythites was used instead of Brotula.
Phylogenetic analysis
Sequences of the 10 new markers in the 14 taxa were used in
phylogenetic analysis to assess their performance. Sequences were
aligned using ClustalX [56] on the translated protein sequences. Uncorrected genetic distances were calculated using PAUP [57]. The relative substitution rate for each marker was estimated using a Bayesian approach [58]. Relative composition variability (RCV) and treeness were calculated following Phillips and Penny [44]. Prottest [45]
was used to choose the best model for the protein sequence data, and the AIC
criterion was used to determine the scheme of data partitioning. Bayesian
analysis implemented in MrBayes v3.1.1 and maximum likelihood analysis
implemented in TreeFinder [59]
were performed on the protein sequences. One million generations with 4
chains were run for the Bayesian analysis, and the trees sampled prior to
reaching convergence were discarded (as burn-in) before computing the
consensus tree and posterior probabilities. Two independent runs were
used to provide additional confirmation of convergence of the posterior
probability distribution. Given the biased base composition in the
nucleotide data indicated by the RCV value (Table 2),
we analyzed the nucleotide data under the RY-coding scheme (C and T =
Y, A and G = R), partitioned by gene in TreeFinder, since RY-coded data
are less sensitive to base compositional bias [44]. Alternative hypotheses were tested using the one-tailed Shimodaira and Hasegawa (SH) test [53] with 1000 RELL bootstrap replicates, as implemented in TreeFinder.
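The RY-coding scheme mentioned above (C and T = Y, A and G = R) can be illustrated with a short Python sketch. The function name is hypothetical, and the handling of characters other than A, C, G and T (gaps and ambiguity codes are left unchanged) is an assumption of this sketch rather than something stated in the text.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
_RY_TABLE = str.maketrans("AGCT", "RRYY")


def ry_code(sequence: str) -> str:
    """Recode a nucleotide sequence under the RY scheme.

    Assumption: any character other than A, C, G or T (for example
    gaps '-' or 'N') is passed through unchanged.
    """
    return sequence.upper().translate(_RY_TABLE)


if __name__ == "__main__":
    print(ry_code("ACGT-N"))
```

Because both purines collapse to one state and both pyrimidines to another, differences among taxa in the proportions of A versus G or C versus T no longer appear in the recoded alignment.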