PlantTribes: a gene and gene family resource for comparative genomics in plants
P. Kerr Wall1, Jim Leebens-Mack1,2, Kai F. Müller1,3, Dawn Field4, Naomi S. Altman5 and Claude W. dePamphilis1,*
1Department of Biology, Institute of Molecular Evolutionary Genetics, and The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA, 2Department of Plant Biology, University of Georgia, Athens, GA 30602, USA, 3Nees Institute for the Biodiversity of Plants, University of Bonn, Meckenheimer Allee 170, 53115 Bonn, Germany, 4Molecular Evolution and Bioinformatics Group, NERC Centre for Ecology and Hydrology, Mansfield Road, Oxford, OX1 3SR, UK and 5Department of Statistics and The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA
*To whom correspondence should be addressed. Tel: +1 814 863 6412; Fax: +1 814 863 1357; Email: [email protected]
Received August 15, 2007. Revised October 17, 2007. Accepted October 18, 2007.
The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) is a plant gene family database based on the inferred proteomes of five sequenced plant species: Arabidopsis thaliana, Carica papaya, Medicago truncatula, Oryza sativa and Populus trichocarpa. We used the graph-based clustering algorithm MCL [Van Dongen (Technical Report INS-R0010 2000) and Enright et al. (Nucleic Acids Res. 2002; 30: 1575–1584)] to classify all of these species’ protein-coding genes into putative gene families, called tribes, using three clustering stringencies (low, medium and high). For all tribes, we have generated protein and DNA alignments and maximum-likelihood phylogenetic trees. A parallel database of microarray experimental results is linked to the genes, which lets researchers identify groups of related genes and their expression patterns. Unified nomenclatures were developed, and tribes can be related to traditional gene families and conserved domain identifiers. SuperTribes, constructed through a second iteration of MCL clustering, connect distant, but potentially related gene clusters. The global classification of nearly 200 000 plant proteins was used as a scaffold for sorting 4 million additional cDNA sequences from over 200 plant species. All data and analyses are accessible through a flexible interface allowing users to explore the classification, to place query sequences within the classification, and to download results for further study.
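To make the clustering step concrete, the following is a minimal, illustrative sketch of the Markov Cluster (MCL) procedure, which alternates expansion (matrix squaring) with inflation (element-wise powering followed by column normalization) on a similarity graph. It is not the PlantTribes pipeline: the function names, the toy adjacency matrix, and the convergence settings are assumptions made for illustration. The sketch also assumes that clustering stringency is controlled through the inflation parameter, which is the usual granularity control in MCL; the specific values behind the low, medium and high settings are not stated here.

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] > 0 else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square matrix product, adequate for small toy graphs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adj, inflation=2.0, iterations=50, tol=1e-9):
    """Toy MCL: adj is a symmetric matrix of non-negative similarity weights.

    Returns a sorted list of clusters, each a sorted list of node indices.
    Higher inflation values tend to give finer (more stringent) clusters.
    """
    n = len(adj)
    # Add self-loops, as is customary, then make the matrix column-stochastic.
    m = normalize_columns(
        [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(iterations):
        expanded = matmul(m, m)                      # expansion step
        inflated = normalize_columns(                # inflation step
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row with non-negligible entries acts as an attractor; the columns
    # it retains are the members of its cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy graph: proteins 0-2 are mutually similar, as are 3 and 4.
toy = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
tribes = mcl(toy, inflation=2.0)
```

In this toy case the two disconnected groups are recovered as separate clusters. In practice, edge weights would typically be derived from all-against-all protein similarity scores, and a second clustering round over the resulting clusters (as described for SuperTribes) would operate on a graph whose nodes are clusters rather than individual proteins.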
Nucleic Acids Research, doi:10.1093/nar/gkm972. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/)