Unigene set and ORF prediction
The dataset used consisted of 24,845 unigenes based on ~130,000 public P. patens expressed sequence tags (EST), the production parameters of which have been described before [22,23]. An evaluation of several tools for open reading frame (ORF) prediction was carried out, including ESTScan [66], FrameD [67] and Estwise [68]. As it turned out, using A. thaliana or plant-trained models yielded a high rate of false positive predictions for P. patens genes [23]. Homology-based ORF prediction was hindered by the fact that the closest homologs often shared only 30–40% identity on amino acid level. By dividing all publicly available P. patens mRNAs into a training set (226 sequences) and a test set (100 sequences), FrameD turned out to be the most accurate individual tool. After testing, a P. patens specific hidden Markov model (HMM) for ESTScan and interpolated HMMs for FrameD were built by using all 326 sequences. In order to improve the quality of the predicted ORFs, FrameD was given the results of a BLASTX-search of the unigenes against Genpept using an E-value cutoff of 1E-10. ORFs were determined by combining the prediction results from ESTScan and FrameD, preferring the latter. In total, 19,313 ORF were detected by FrameD and 21,344 by ESTScan, the combination yielding 22,237 ORF which were used for further analyses.
Calculating the KS distribution
Two Perl scripts were written to identify clusters of paralogous genes and subsequently calculate KS distributions. The software ("KeyS") is available upon request. The method used to calculate Figure 1A is described below.
Identification of pairs of paralogous genes
To identify similar sequences on peptide level, an all-against-all BLAST-search was performed using BLASTP with an E-value cutoff of 1E-10. Two sequences were defined as paralogs if the sequences could be aligned over a length of at least 150 amino acids and showed at least 30% identity [25]. Gene pairs with a BLAST identity of 98% or higher were further tested for identity because near identical sequences occasionally are present in clustered EST data due to sequencing errors and the fragmentary nature of EST. To do this, the nucleic acid sequences were aligned globally using the EMBOSS [69] implementation of the Needleman-Wunsch algorithm, needle. Afterwards all leading and trailing gaps were removed from the alignment. Two sequences were then defined as identical if the aligned sequences had an identity of at least 98.0%. From all identical gene pairs the longer sequence was kept and all gene pairs containing the shorter sequence were discarded.
Clustering of paralogous genes
In order to reduce the computational complexity, genes were clustered prior to KS calculation. From the list of gene pairs, the genes of the pair with the highest BLAST-derived bit score were chosen as the first two genes of a new cluster. If several pairs shared the same bit score, the pair with the shortest alignment length was selected first. New members were subsequently added to the cluster using agglomerative linkage clustering until no more suitable candidate genes were left. After completion of each cluster, all gene pairs having at least one clustered member gene were deleted from the gene pair list.
Estimation of KS values for gene pairs
In a first step the peptide sequences were aligned globally using needle. Afterwards, all positions containing a gap were removed from the alignment and the amino acids were replaced by their corresponding codons. The nucleotide alignment was used to calculate the KS-value with the maximum likelihood method implemented in codeml of the PAML package [70]. Codon frequencies were calculated from the average nucleotide frequencies at the three codon positions (codon frequency model F3 × 4). Because codeml can get stuck in suboptimal likelihood maxima, the calculation was repeated five times and the KS-value with the highest likelihood was then assigned to the gene pair.
Calculating the KS values used in the distribution
To remove node-connecting KS values > 5.0, subtree clustering based on the KS-values as distance measure was performed using average linkage clustering. Assuming that all genes in a resulting cluster with n members originate from the same ancestor gene, n-1 duplication events have taken place. However, the number of possible gene pairs or KS-values of a cluster with n members is n × (n-1)/2, which exceeds the number of duplication events for n > 2. Using all pairwise KS-values of a cluster directly in the age distribution would thus falsify it. Instead, we used approximate KS-values for the n-1 duplication events that were derived from the pairwise KS-values during the clustering. The merging steps taken during the KS-based clustering were represented in a bifurcating dendrogram. The terminal nodes represent the genes of the original cluster and each inner node represents the joining of two clusters, which also can be regarded as the duplication event giving rise to the two clusters. To each inner node, and each duplication event respectively, the average inter cluster KS-value of the merged gene clusters can be assigned. The inter cluster KS-values ≤ 5.0 were used to represent the duplication events of the cluster in the age distribution.
Construction of linearized trees
We have inferred the age of P. patens duplicated genes by constructing linearized trees and comparing the time of gene duplication with the A. thaliana-poplar split or the monocot-eudicot split, following the method used by Vandepoele et al. [30]. To this end, the 22,237 protein sequences of P. patens were grouped into 1,967 gene families containing two to ten P. patens proteins based on sequence similarity [25]. All protein sequences of each gene family were used as queries to do BLASTP searches against proteins from Oryza sativa (TIGR release 4), Populus trichocarpa (JGI version 1, released June 7, 2006), Arabidopsis thaliana (TAIR release 6), Chlamydomonas reinhardtii (JGI release 3), and Ostreococcus tauri (released August 8, 2006). Gene families were built and neighbor-joining trees were constructed using LINTREE [29] based on the alignments of each gene family [30]. Only those gene families that included at least one outgroup sequence (C. reinhardtii or O. tauri) and sequences that could be used as reference or calibration points (see below) to estimate the date of the P. patens duplication, i.e. sequences from at least two different organisms out of the three angiosperm species (rice, poplar and A. thaliana), and which all had to have higher BLASTP scores than that of the outgroup sequence, were considered for further analyses. Linearized trees, which assume equal rates of evolution in different lineages of the tree [29], were constructed for each gene family after sequences evolving at highly deviated rates were removed [30]. The split of A. thaliana and poplar, or monocots (rice) and eudicots (Arabidopsis and poplar), set at 100 and 150 MYA, respectively, were used as reference points to estimate the age of the P. patens duplicates. Each node that was used for dating had to have bootstrap support ≥ 70% [71]. In cases where the tree had more than one reference point that could be used for dating, the duplication date was first calculated separately using each reference point. The tree was then discarded if the minimum and maximum date differed by >20 MYA. If the difference was ≤20 MY, the average of the date estimates from all possible reference points was taken as the date of the duplication event.
Gene Ontology and pathway mapping
GO terms were assigned to the sequences using Blast2GO [44] with an E-value cutoff of 1E-25 and a minimal hit length of 80 amino acids. The GO Slim annotation (which avoids the redundancy of GO term association) was created using the generic GO Slim file, GO terms, definitions and ontologies [72]. It was determined (using five-fold leave-one-out cross validation) how many genes are necessary to do GO bias comparisons in order not to be affected by sampling bias. As it turned out, a sample size of at least 500 genes is sufficient to detect significantly biased categories. The fractions of genes assigned/devoid of individual GO terms were tested for deviation within the 765 duplicated genes in comparison with a randomly chosen reference set (excluding genes from the KS peak) of equal size (765 out of 2,202 possible genes) using Fisher's exact test. Resulting p values were adjusted to control for multiple testing by calculating the false discovery rate [73]. Statistics were performed with R 2.1.0 [74]. The KEGG pathways were assigned to the sequences representing the peaks using KAAS (KEGG Automatic Annotation Server 1.10; [75]). Searches were performed against the whole dataset with a bit score threshold of 60 and the bi-directional best hit method (BBH).