Use of DNA barcodes to identify flowering plants
W. John Kress*, Kenneth J. Wurdack*, Elizabeth A. Zimmer*, Lee A. Weigt, and Daniel H. Janzen
*Department of Botany and Laboratories of Analytical Biology, National Museum of Natural History, Smithsonian Institution, P.O. Box 37012, Washington, DC 20013-7012; and Department of Biology, University of Pennsylvania, Philadelphia, PA 19104
Contributed by Daniel H. Janzen, April 15, 2005
Methods for identifying species by using short orthologous DNA sequences, known as "DNA barcodes," have been proposed and initiated to facilitate biodiversity studies, identify juveniles, associate sexes, and enhance forensic analyses. The cytochrome c oxidase 1 sequence, which has been found to be widely applicable in animal barcoding, is not appropriate for most species of plants because of a much slower rate of cytochrome c oxidase 1 gene evolution in higher plants than in animals. We therefore propose the nuclear internal transcribed spacer region and the plastid trnH-psbA intergenic spacer as potentially usable DNA regions for applying barcoding to flowering plants. The internal transcribed spacer is the most commonly sequenced locus used in plant phylogenetic investigations at the species level and shows high levels of interspecific divergence. The trnH-psbA spacer, although short (450 bp), is the most variable plastid region in angiosperms and is easily amplified across a broad range of land plants. Comparison of the total plastid genomes of tobacco and deadly nightshade, enhanced with trials on widely divergent angiosperm taxa, including closely related species in seven plant families and a group of species sampled from a local flora encompassing 50 plant families (for a total of 99 species, 80 genera, and 53 families), suggests that the sequences in this pair of loci have the potential to discriminate among the largest number of plant species for barcoding purposes.
angiosperm | internal transcribed spacer | Plummers Island | species identification | trnH-psbA
PNAS | June 7, 2005 | vol. 102 | no. 23 | 8369-8374. OPEN ACCESS ARTICLE.
The identification of animal biological diversity by using molecular markers has recently been proposed and demonstrated on a large scale through the use of a short DNA sequence in the cytochrome c oxidase 1 (CO1) gene (1-5). These "DNA barcodes" show promise in providing a practical, standardized, species-level identification tool that can be used for biodiversity assessment, life history and ecological studies, and forensic analysis. Engineered DNA sequences also have been suggested as exact identifiers and intellectual property tags for transgenic organisms (6). A Consortium for the Barcode of Life (www.barcoding.si.edu) has been established to stimulate the creation of a database of documented and vouchered reference sequences to serve as a universal library to which comparisons of unidentified taxa can be made. Here, we propose two DNA regions for barcoding plants and provide an initial test of their utility.
DNA barcoding follows the same principle as does the basic taxonomic practice of associating a name with a specific reference collection in conjunction with a functional understanding of species concepts (i.e., interpreting discontinuities in interspecific variation). Presently, some controversy exists over the value of DNA barcoding (7), largely because of the perception that this new identification method would diminish rather than enhance traditional morphology-based taxonomy, that species determinations based solely on the amount of genetic divergence could result in incorrect species recognition, and that DNA barcoding is a means to reconstruct phylogenies when it is actually a tool to be used largely for identification purposes (8-10). In support of barcoding as a species identification process, Besansky et al. (11), Janzen (12, 13), Hebert et al. (1-4), and Kress (14) have offered arguments for the utility of DNA barcoding as a powerful framework for identifying specimens. Our objective in this paper is not to debate the validity of using barcodes for plant identification, but rather to determine appropriate DNA regions for use in flowering plants.
A portion of the mitochondrial CO1 gene was deliberately chosen for use in animal identification when DNA barcoding was proposed (1), and its broad utility in animal systems has been demonstrated in subsequent pilot studies (1-5). The taxonomic limits of CO1 barcoding in animals are not fully known, but it has proven useful to discriminate among species in most groups tested (2). The choice of a DNA region usable for barcoding has been little investigated in other eukaryotes, whereas in prokaryotes, rRNA genes are favored for identifications (e.g., ref. 15). Among plants, especially angiosperms, DNA-based identifications, although not strictly through the use of DNA barcodes, have been creatively used to reconstruct extinct herbivore diets (16, 17), to identify species of wood (18), to correlate roots growing in Texas caves with the surface flora (19), and to determine species used in herbal supplements (20). However, some of these identifications have not been entirely successful at the species level, and DNA barcoding per se has not yet been applied to plants. The primary reason that barcoding has not been applied to plants by the emerging initiative is that plant mitochondrial genes, because of their low rate of sequence change, are poor candidates for species-level discrimination. The divergence of CO1 coding regions among families of flowering plants has been documented to be only a few base pairs across 1.4 kb of sequence (21, 22). Furthermore, plants rapidly change their mitochondrial genome structure (23), thereby precluding the existence of universal intergenic spacers that otherwise would be appropriately variable unique identifiers at the species level (e.g., ref. 24).
For plant molecular systematic investigations at the species level, the internal transcribed spacer (ITS) region of the nuclear ribosomal cistron (18S-5.8S-26S) is the most commonly sequenced locus (25). This region has shown broad utility across photosynthetic eukaryotes (with the exception of ferns) and fungi and has been suggested as a possible plant barcode locus (26). Species-level discrimination and technical ease have been validated in most phylogenetic studies that employ ITS, and a large body of sequence data already exists for this region (>36,000 angiosperm sequences were available in GenBank in December 2004, although these sequences have not been filtered for taxa, so it is not certain how many species are represented). However, the limitations of this nuclear region in some taxa are well established. ITS has reduced species-level variability in certain groups (especially recently diverged taxa on islands), divergent paralogues that require cloning of multiple copies, and secondary structure problems resulting in poor-quality sequence data (25, 27). In some cases, the preferential amplification of endophytic or contaminating fungi may occur, although this can be eliminated with plant-specific primer design (28, 29).
An advantage of the ITS region is that it can be amplified in two smaller fragments (ITS1 and ITS2) adjoining the 5.8S locus, which has proven especially useful for degraded samples. The quite conserved 5.8S region in fact contains enough phylogenetic signal for discrimination at the level of orders and phyla (29), although identification at this taxonomic level is not the concern of barcoding. Alignments are trivial to optimize for 5.8S because of the few indels found in plants and fungi (30). In contrast, for phylogenetic reconstruction, ITS or any rapidly evolving noncoding region can require complex sequence alignment for homology assessments. Thus, the 5.8S locus can serve as a critical alignment-free anchor point for search algorithms that make sequence comparisons for both phylogenetic and barcoding purposes. The utility of conserved regions such as 5.8S to generate a pool of nearest neighbors for refined comparisons will be critical for effective database searches, especially when comparing a sequence that has no identical match in a sequence library. GenBank BLAST searches with our ITS data (see below) returned correct matches for the sequences in GenBank. This success suggests that, despite alignment concerns, current search algorithms will be fast and effective at using ITS for species-level identifications, given an adequate database for comparison. For all of these reasons, ITS, even with its recognized limitations, is a prime candidate as an effective locus for DNA barcoding in plants.
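The idea of using a conserved region to generate a pool of nearest neighbors, followed by a refined comparison on the variable spacer, can be illustrated with a minimal sketch. It is not the BLAST procedure used for the searches reported here. The function names, the toy sequences, and the choice of shared k-mers (Jaccard similarity) as the similarity measure are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Alignment-free similarity: shared k-mers divided by all k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def two_stage_search(query, library, pool_size=2, k=4):
    """Hypothetical two-stage lookup.

    Stage 1 ranks references by similarity of the conserved anchor
    (standing in for 5.8S) and keeps the top `pool_size` as the pool
    of nearest neighbors. Stage 2 picks the pool member whose variable
    spacer (standing in for ITS1/ITS2) is most similar to the query.
    """
    ranked = sorted(
        library,
        key=lambda name: jaccard(query["anchor"], library[name]["anchor"], k),
        reverse=True,
    )
    pool = ranked[:pool_size]
    return max(
        pool,
        key=lambda name: jaccard(query["spacer"], library[name]["spacer"], k),
    )


# Toy reference library; sequences are invented, not real ITS data.
library = {
    "species_A": {"anchor": "ACGTACGTGG", "spacer": "TTTTCCCCAAAA"},
    "species_B": {"anchor": "ACGTACGTGG", "spacer": "TTTTCCGGAAAA"},
    "species_C": {"anchor": "GGGGCCCCTT", "spacer": "TTTTCCCCAAAA"},
}
query = {"anchor": "ACGTACGTGG", "spacer": "TTTTCCGGAAAA"}

print(two_stage_search(query, library))  # species_B
```

In this toy case the anchor excludes species_C from the pool before any spacer comparison is made, and the spacer then separates species_A from species_B. A production search would replace the k-mer measure with an established tool and a vouchered reference library.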
However, the recognition that ITS has certain functional limitations for DNA barcoding of plants is a compelling argument that a search for additional loci is warranted. For phylogenetic investigations, the plastid genome has been more readily exploited than the nuclear genome and may offer for plant barcoding what the mitochondrial genome does for animals. It is a uniparentally inherited, nonrecombining, and, in general, structurally stable genome. Universal primers are available for a number of loci and intergenic spacers that are evolving at a variety of rates. The plastid locus most commonly sequenced by plant systematists for phylogenetic purposes is rbcL, followed by the trnL-F intergenic spacer, matK, ndhF, and atpB (e.g., refs. 31-33). rbcL has been suggested as a candidate for plant barcoding (34), even though it has generally been used to determine evolutionary relationships at the generic level and above. With the exception of rbcL and atpB, all of these plastid loci have been used at the species level with various degrees of success. Most of them (except the trnL-F spacer) require full-length sequences of >1 kb to provide enough variation to discriminate species. Most relevant to plant barcoding, no region of the plastid genome has been found to have the high level of variation seen in most animal CO1 barcodes, although a few intergenic spacers have shown more promise than any plastid locus now in general use (33).
When evaluating other genetic loci appropriate for plant DNA barcoding, three criteria must be satisfied: (i) significant species-level genetic variability and divergence, (ii) an appropriately short sequence length so as to facilitate DNA extraction and amplification, and (iii) the presence of conserved flanking sites for developing universal primers. With regard to sequence length, we note that in CO1 barcoding systems, the 600- to 700-bp length fortuitously matches high-quality sequence data from average capillary sequencer reads, although it is expected that routine read length will improve with new technology. An important rationale for using short sequences also resides in the need to obtain useful data from potentially degraded samples found in museum specimens. Amplicon size and gene copy number have been shown to account for much of the variability of amplification success: smaller sizes and increased copy number promote greater success with PCR, presumably by increasing the likelihood that a desired sequence has been preserved (18).
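The three criteria above can be expressed as a simple screen over a set of aligned sequences for a candidate locus. The sketch below is illustrative only: the function names are hypothetical, the divergence measure is the uncorrected proportion of differing sites (p-distance), and the thresholds are assumptions. The 700-bp default reflects the CO1 read length mentioned above, whereas the minimum divergence value is arbitrary.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected proportion of differing sites between two aligned
    sequences. Positions with a gap or N in either sequence are skipped."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_pairwise_divergence(aligned):
    """Mean p-distance over all pairs of aligned sequences."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def screen_locus(aligned, flanks_conserved,
                 min_divergence=0.01, max_length=700):
    """Check a candidate locus against the three criteria.

    `flanks_conserved` is supplied by the caller (criterion iii);
    this sketch does not try to detect conserved primer sites itself.
    """
    longest = max(len(s.replace("-", "")) for s in aligned)
    divergence = mean_pairwise_divergence(aligned)
    return {
        "variable_enough": divergence >= min_divergence,   # criterion (i)
        "short_enough": longest <= max_length,             # criterion (ii)
        "universal_primers_feasible": bool(flanks_conserved),  # criterion (iii)
    }


# Toy alignment of three invented sequences.
toy = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTAC"]
print(screen_locus(toy, flanks_conserved=True, min_divergence=0.05))
```

A real evaluation would use a corrected distance model, many more taxa, and empirical amplification trials rather than a caller-supplied flag for primer universality.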