Sequences were downloaded for each of the five sequenced angiosperm species: 31 921 gene models from A. thaliana (TAIR, version 7.0), 25 536 from C. papaya (version 1.0, complete), 40 567 from M. truncatula (IMGA, version 1.0, 60% complete), 45 555 from P. trichocarpa (JGI, version 1.0) and 66 710 from O. sativa (TIGR, version 5.0). The Carica and Medicago genome-sequencing projects are underway; the data for these species were included in the protein scaffold, and results for these species will go ‘live’ for public access following the publication of these genomes. As summarized in Figure 1, we compared the predicted proteins for all five species in an all-against-all BLASTP search (e = 1e-10, b = 10 000) using the NCBI BLAST package (7). MCL clustering was then performed at low, medium and high stringencies (inflation, I = 1.2, 3.0 and 5.0, respectively) to produce sets of objectively defined gene families (tribes).
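As an illustration of the step between BLASTP and MCL, the following Python sketch converts tabular BLAST hits into weighted edges of the kind MCL accepts as label input. It is not the code used to build the tribes. It assumes the 12-column tabular layout (as written by legacy BLAST with `-m 8`, with the e-value in column 11), and the use of -log10(e-value) as the edge weight and the cap of 200 for e-values reported as zero are assumptions, since the text does not state how hits were weighted. The name `blast_to_abc` is hypothetical.

```python
import math

def blast_to_abc(lines, max_evalue=1e-10, cap=200.0):
    """Turn tabular BLAST hits into (query, subject, weight) edges.

    Assumption: weight = -log10(e-value), capped at `cap`; the paper
    does not specify the weighting scheme.
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])  # column 11 holds the e-value
        if evalue > max_evalue:
            continue  # apply the e-value cutoff
        # An e-value of 0.0 has no finite logarithm, so use the cap.
        weight = cap if evalue == 0.0 else min(cap, -math.log10(evalue))
        edges.append((query, subject, weight))
    return edges
```

Writing each edge as a tab-separated `query subject weight` line gives a file that MCL can read with its `--abc` option; each inflation setting would then be a separate run, for example `mcl tribes.abc --abc -I 3.0 -o tribes.I30`, where the file names are placeholders.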
A second iteration of MCL was conducted to connect distant, but potentially related, gene clusters, which we define as SuperTribes. To construct SuperTribes, we computed both the average and the minimum e-value between all pairs of tribes and used each as the input matrix for MCL. We ran MCL with low, medium and high inflation values to generate SuperTribe clusters at the three different stringencies. In total, there are 18 SuperTribe classifications for users to access and compare (i.e. 3 original tribe stringencies × 3 SuperTribe stringencies × 2 metrics, average or minimum e-value).
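The two tribe-to-tribe metrics and the count of 18 classifications can be sketched in Python as follows. This is a minimal illustration rather than the original implementation: the function names are hypothetical, and taking the arithmetic mean over only those gene pairs that produced a BLAST hit is an assumption, since the text does not say how pairs without a hit were treated.

```python
from collections import defaultdict
from itertools import product

def tribe_pair_evalues(hits, gene_to_tribe):
    """Summarize e-values between every pair of distinct tribes.

    hits: iterable of (query, subject, evalue) tuples.
    Assumption: the average is an arithmetic mean over observed hits only.
    """
    pooled = defaultdict(list)
    for query, subject, evalue in hits:
        tribe_a = gene_to_tribe.get(query)
        tribe_b = gene_to_tribe.get(subject)
        if tribe_a is None or tribe_b is None or tribe_a == tribe_b:
            continue  # ignore unassigned genes and within-tribe hits
        pooled[tuple(sorted((tribe_a, tribe_b)))].append(evalue)
    return {
        pair: {"average": sum(values) / len(values), "minimum": min(values)}
        for pair, values in pooled.items()
    }

def supertribe_classifications(inflations=(1.2, 3.0, 5.0)):
    """Enumerate tribe inflation x SuperTribe inflation x metric."""
    return list(product(inflations, inflations, ("average", "minimum")))
```

Either metric can then be written out as the edge weight between two tribes before the second MCL run.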
To annotate each tribe, we used additional information connected to all member genes according to the following criteria: gene ontology (GO), presence of domains, manually curated gene families and common word patterns associated with the gene descriptions within a tribe. We downloaded the gene_ontology.v1.2.obo, goslim_plant.obo and gene_association.TAIR flat files (8) and used the map2slim.pl script to create a GO slim database for the Arabidopsis genes in each tribe. To annotate our tribes by domain information, we downloaded NCBI's Conserved Domain Database (CDD) (9) and used the formatrpsdb utility (default parameters, with f = 9.82, S = 100.0) to index the domains. We then searched all protein sequences from the five genomes using rpsblast (default parameters, with e-value = 1e-5). To annotate tribes according to manually curated gene families, we downloaded gene_family_tab_121906.txt from TAIR, which lists 996 gene families comprising 8331 genes. Finally, a Perl script extracted all gene descriptions within a tribe and determined the most common words, keeping track of the relative position of each word and retaining only the top five words. Each tribe therefore has a composite annotation defined by the four criteria.
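The word-pattern annotation can be illustrated with a short Python sketch (the original was a Perl script that is not reproduced here). The tokenization rule, the use of mean word position to order the output and the name `common_words` are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def common_words(descriptions, top_n=5):
    """Return the top_n most frequent words, ordered by typical position.

    Assumptions: words are lower-cased alphanumeric runs, and "relative
    position" is approximated by the mean index of a word within the
    descriptions that contain it.
    """
    counts = Counter()
    positions = defaultdict(list)
    for text in descriptions:
        words = re.findall(r"[a-z0-9]+", text.lower())
        for index, word in enumerate(words):
            counts[word] += 1
            positions[word].append(index)
    top = [word for word, _ in counts.most_common(top_n)]
    # Order the retained words by their average position in a description.
    return sorted(top, key=lambda w: sum(positions[w]) / len(positions[w]))
```

In practice a stop-word list would probably be needed so that uninformative words such as "putative" or "protein" do not dominate; whether the original script filtered such words is not stated.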
The resulting constellation of gene family tribes was used as a scaffold for plant gene space onto which roughly 4 million unigene sequences were sorted. These unigenes, derived from over 11 million ESTs, were downloaded from TIGR PTA (http://plantta.tigr.org). In addition, we sorted the predicted proteomes of Chlamydomonas reinhardtii (green alga; JGI, version 3) and Physcomitrella patens (moss; JGI, version 1). The unigene sequences were searched against the five sequenced proteomes using blastx (e-value = 1e-5), and the distantly related Physcomitrella and Chlamydomonas proteomes were searched against them using blastp (e-value = 1e-1).
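One plausible way to sort query sequences onto the tribe scaffold is to assign each query to the tribe of its best-scoring hit. The Python sketch below shows that rule; the best-hit criterion itself is an assumption, as the text gives only the search programs and e-value cutoffs, and `sort_unigenes` is a hypothetical name.

```python
def sort_unigenes(hits, gene_to_tribe, max_evalue=1e-5):
    """Assign each query to the tribe of its lowest-e-value hit.

    hits: iterable of (query, scaffold_gene, evalue) tuples.
    Assumption: best-hit assignment; the paper does not state the rule.
    """
    best = {}
    for query, gene, evalue in hits:
        if evalue > max_evalue or gene not in gene_to_tribe:
            continue  # below the cutoff or hit gene not in any tribe
        if query not in best or evalue < best[query][0]:
            best[query] = (evalue, gene_to_tribe[gene])
    return {query: tribe for query, (_, tribe) in best.items()}
```

The same function would apply to the Physcomitrella and Chlamydomonas proteins by passing `max_evalue=1e-1`.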
Phylogenetic analysis pipeline
The sequence alignment and phylogenetic analysis pipeline comprised the following steps. We generated FASTA files of both amino acid and DNA sequences (CDS) for each tribe. Each amino acid file was aligned using the MUSCLE alignment program (10). We then forced the DNA sequences onto the amino acid alignments using custom Perl/Bioperl scripts.
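Forcing a coding sequence onto its aligned protein amounts to replacing each residue with its codon and each gap with three gap characters. The following Python sketch shows the idea; it stands in for the custom Perl/Bioperl scripts, which are not reproduced here. The function name and the handling of a trailing stop codon are assumptions.

```python
def thread_cds(aligned_protein, cds, gap="-"):
    """Map an unaligned CDS onto a gapped amino acid alignment row."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(codons) == n_residues + 1:
        # Assumption: a single extra codon at the end is a stop codon.
        codons = codons[:-1]
    if len(codons) != n_residues:
        raise ValueError("CDS and protein lengths disagree")
    codon_iter = iter(codons)
    # Each gap column becomes three gap characters; each residue, its codon.
    return "".join(
        gap * 3 if ch == gap else next(codon_iter) for ch in aligned_protein
    )
```

For example, `thread_cds("M-K", "ATGAAA")` yields `ATG---AAA`, so every row of the DNA alignment is exactly three times the length of the corresponding protein row.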
Phylogenetic trees were built using a fast maximum-likelihood ratchet approach (Morrison, D.A. (in press) Increasing the efficiency of searches for the maximum likelihood tree in a phylogenetic analysis of up to 150 nucleotide sequences. Syst. Biol., in press) as newly implemented in PRAP (11) v.2.0 for this study. PRAP generated command files that were handed over to PAUP (12). The heuristic involves (i) rapidly obtaining a starting tree not too far from the optimal score; (ii) moving rapidly to a (near-)optimal tree island; and (iii) finding the best tree within the island. Step (i) was achieved by calculating a BioNJ tree using LogDet distances, followed by one round of NNI and then one round of SPR branch swapping, optimizing the substitution model parameters between these steps. Similar to the parsimony ratchet (13), step (ii) alternated between branch swapping on the original matrix and branch swapping on a matrix with 25% of characters upweighted. Unlike in Nixon's strategy for parsimony, SPR branch swapping was used, only 10 iterations were performed and, during the weighted analyses, only one tree was saved. Particularly for datasets with low levels of phylogenetic signal, this strategy was found to be more successful (Morrison, D.A. (in press, as above)) than the strategies implemented in GARLI (14) or RAxML (15). To assess confidence in clades, bootstrapping was performed by executing PRAP-generated command files in PAUP. Using optimized parameters from the likelihood ratchet search, SPR branch swapping was performed on the maximum-likelihood topology for each bootstrapped data matrix, and the proportion of bootstrap replicates in which a given clade was recovered was mapped onto the maximum-likelihood tree using a strongly modified version of TreeGraph (16) (Müller et al., manuscript in preparation). The latter program was also used to generate SVG trees that can be viewed via the web interface.
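The alternation in step (ii) can be outlined in Python as below. This is a schematic of the ratchet loop only, not of PRAP or PAUP: `swap` and `score` are hypothetical callables standing in for SPR branch swapping and log-likelihood evaluation, and doubling the weight of the selected characters is an assumption, since the text says only that 25% of characters were upweighted.

```python
import random

def ratchet_weights(n_chars, fraction=0.25, factor=2, rng=random):
    """Character weights with a random `fraction` of sites upweighted.

    Assumption: upweighted sites get weight `factor` (2), all others 1.
    """
    chosen = set(rng.sample(range(n_chars), round(n_chars * fraction)))
    return [factor if i in chosen else 1 for i in range(n_chars)]

def likelihood_ratchet(tree, n_chars, swap, score, iterations=10, seed=0):
    """Alternate swapping on a reweighted and on the original matrix.

    swap(tree, weights) -> one tree (hypothetical stand-in for SPR swapping)
    score(tree) -> log-likelihood under the original weights (higher is better)
    """
    rng = random.Random(seed)
    uniform = [1] * n_chars
    current = best = tree
    for _ in range(iterations):
        # Perturbation phase: swap on the reweighted matrix, keep one tree.
        perturbed = swap(current, ratchet_weights(n_chars, rng=rng))
        # Search phase: swap on the original matrix from the perturbed tree.
        current = swap(perturbed, uniform)
        if score(current) > score(best):
            best = current
    return best
```

The reweighting step is what lets the search leave a local optimum: a tree that is optimal for the perturbed matrix is usually a different, but still reasonable, starting point for the next round on the original data.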
Understanding how gene expression patterns vary among gene family members will inform our understanding of the evolutionary processes shaping plant gene function and genome structure. Characterization of changing gene expression following gene duplication and speciation events, e.g. Ref. (17), will improve as additional plant genomes are sequenced and genome-wide gene expression studies are performed on a wide range of plant species (18). We aim to advance this research by placing gene expression data within a gene family context. To incorporate expression data into PlantTribes, we downloaded all AFFY expression data and associated descriptions of experiments, tissues, etc. from NASCArrays (19). This has allowed us to link tribes containing Arabidopsis genes to a curated expression dataset including 327 experiments conducted on more than 200 tissues and organs, developmental stages and growth conditions. Gene expression data for additional species will be added to future versions of PlantTribes, as an ontology is developed to relate organs and developmental stages across plant species (20–22).