Department of Plant Biology, Carnegie Institution, Stanford, California 94305 (K.I., L.R., N.T.W., S.Y.R.); Department of Biology, University of Missouri, St. Louis, Missouri 63121 (E.A.K.); Department of Plant Breeding, Cornell University, Ithaca, New York 14853 (P.J., A.P., S.R.M.); Missouri Botanical Garden, St. Louis, Missouri 63121 (F.Z., P.F.S.); Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724 (S.A., D.H.W., L.D.S.); University of Missouri, Columbia, Missouri 65211 (L.P.V., M.L.S.); Maize Genetics Cooperation, Stock Center, and Department of Crop Sciences, University of Illinois, Urbana, Illinois 61801 (M.M.S.); and United States Department of Agriculture-Agricultural Research Service, Washington, DC 20250 (M.M.S., M.L.S., D.H.W.)
An Open Access article from Plant Physiology 143:587-599 (2007).
Formal description of plant phenotypes and standardized annotation of gene expression and protein localization data require uniform terminology that accurately describes plant anatomy and morphology. This facilitates cross species comparative studies and quantitative comparison of phenotypes and expression patterns. A major drawback is variable terminology that is used to describe plant anatomy and morphology in publications and genomic databases for different species. The same terms are sometimes applied to different plant structures in different taxonomic groups. Conversely, similar structures are named by their species-specific terms. To address this problem, we created the Plant Structure Ontology (PSO), the first generic ontological representation of anatomy and morphology of a flowering plant. The PSO is intended for a broad plant research community, including bench scientists, curators in genomic databases, and bioinformaticians. The initial releases of the PSO integrated existing ontologies for Arabidopsis (Arabidopsis thaliana), maize (Zea mays), and rice (Oryza sativa); more recent versions of the ontology encompass terms relevant to Fabaceae, Solanaceae, additional cereal crops, and poplar (Populus spp.). Databases such as The Arabidopsis Information Resource, Nottingham Arabidopsis Stock Centre, Gramene, MaizeGDB, and SOL Genomics Network are using the PSO to describe expression patterns of genes and phenotypes of mutants and natural variants and are regularly contributing new annotations to the Plant Ontology database. The PSO is also used in specialized public databases, such as BRENDA, GENEVESTIGATOR, NASCArrays, and others. Over 10,000 gene annotations and phenotype descriptions from participating databases can be queried and retrieved using the Plant Ontology browser. The PSO, as well as contributed gene associations, can be obtained at www.plantontology.org.
Angiosperms are one of the most diverse groups of plants that vary greatly in morphology, size, habitat, and longevity. Agriculture is almost entirely dependent on angiosperms. Besides providing food and fiber, angiosperms are important sources for pharmaceuticals, lumber, paper, and biofuel. Understanding the origins, mechanisms, and functions of morphological diversity in flowering plants is one of the fundamental questions in plant biology. Modern approaches to studying plant development integrate classical knowledge in plant anatomy and development with molecular genetics and genomics tools. Among powerful tools, analyses of mutants that affect developmental processes have shed new light on our understanding of the complexity of plant development. More recently, high-throughput, genome-wide phenomic screens in Arabidopsis (Arabidopsis thaliana; for review, see Alonso and Ecker, 2006), and large-scale gene expression-profiling technologies (for review, see Rensink and Buell, 2005) generated a huge amount of data in plant science. These tools and resources have the potential to contribute to efforts to link genes with developmental morphology (i.e. genotype with phenotype) and make an impact on our understanding of functions of genes involved in plant development. However, an accurate interpretation of the function of genes that control various aspects of plant development must be embedded in detailed knowledge of the anatomy and morphology of a plant. Explicitly, the structural features of plant cells, tissues, and organs need to be correctly understood and uniformly described. Accurate and standardized nomenclature for plant anatomy and morphology is also required for comparative purposes (i.e. for comparisons of genes involved in plant development among related or evolutionarily distant taxa). Semantic perplexity presents a major obstacle for conducting such comparative studies in plants; similar plant structures are described by their species-specific terms. 
For example, in scientific publications, fruit is often referred to as silique in Arabidopsis, grain or caryopsis in rice (Oryza sativa), and kernel in maize (Zea mays). Conversely, the inherent ambiguity of some plant anatomical terms led to the same or similar terms being applied to different structures (e.g. cork cell in the epidermis of grasses and cork cell in the periderm in all other angiosperms).
Standard vocabulary for describing anatomy and developmental stages was developed for several plant species at major plant genomic databases, such as Arabidopsis at The Arabidopsis Information Resource (TAIR; Berardini et al., 2004), rice and other cereals at Gramene (Yamazaki and Jaiswal, 2005), and maize at MaizeGDB (Vincent et al., 2003). These vocabularies have been used to describe gene expression data and mutant or natural variant phenotypes in several plant databases. However, they were developed independently of each other and were based on different principles and rules. In addition, variation in nomenclature used for different taxonomic groups in angiosperms presented obstacles for conducting queries in more than one plant database and retrieving meaningful results. For the purpose of comparative genomics, diverse terminology needed to be organized into a standardized language that could be shared among individual databases and used for accurate description of phenotypes and gene expression data.
To address these problems, the Plant Ontology Consortium (POC; Jaiswal et al., 2005) has developed a simple and extensible controlled vocabulary that describes anatomy, morphology, and growth and developmental stages of a generic flowering plant. In addition, the POC has established a database through which the data curated using this vocabulary can be accessed in a one-stop manner. Here, we describe the first representation of anatomy and morphology of a generic flowering plant, the Plant Structure Ontology (PSO). This ontology represents the morphological-anatomical aspect of the Plant Ontology (PO); the temporal aspect and the Plant Growth and Developmental Stages Ontology have been described elsewhere (Pujar et al., 2006). We also discuss the guiding principles and rationale for the development and maintenance of the PSO and its importance for describing phenotypes and large-scale gene expression data in reference plants and crop species.
The best known bioontology, the Gene Ontology (GO), was the first to offer a practical solution for describing gene products in a human- and computer-comprehensible manner spanning diverse taxonomic groups (Gene Ontology Consortium, 2006; http://www.geneontology.org). The GO consists of three mutually independent ontologies; each describes cellular components, biological processes, or molecular functions that occur in organisms. Over the years, the GO has become a standard for describing functional aspects of gene products in a consistent way in various genomic databases. Following the GO paradigm and embracing the idea of generic, standardized terminology that can be used across diverse taxonomic groups, the POC has largely adopted the ontology design model and rules established by the GO consortium. However, the PSO is conceptually different and is governed independently from the GO. Some important differences between the PO and GO are discussed in more detail below.
The PSO is the first multispecies ontology of plant anatomy and morphology. Its main purpose is to provide a standardized set of terms describing plant structures, serving as a tool for annotation of gene expression patterns and phenotypes of germplasms across angiosperms. Hence, this vocabulary is intended for a broad plant research community, including curators in genomic databases, bioinformaticians, and bench scientists. The PSO initially integrated existing species-specific ontologies for Arabidopsis, maize, and rice; however, it is not intended only for a few model plant organisms. Rather, we envision it as a continuously expanding ontology that will gradually encompass crop species and woody species. Recently, the ontology has been expanded to include terms for Fabaceae, Solanaceae, additional cereal crops (wheat [Triticum aestivum], oat [Avena sativa], barley [Hordeum vulgare]), and poplar (Populus spp.), a model plant organism for woody species.
A common set of criteria was established to ensure that the PSO would be biologically accurate and adequately meet practical requirements for annotation. Analysis of the three original species-specific plant ontologies (predecessors of the PSO) greatly influenced our decisions on the rationale and design for the PSO. Foremost, we defined the scope of this ontology to be limited to anatomical and morphological structures pertinent to flowering plants during their normal course of development. Botanical terms, from the cellular to the whole organism level, are entities (i.e. terms [in italics in this article]) in the PSO. Besides this main criterion for creating a term, in some cases (following annotation requirements), we have considered derivation (i.e. origin of plant parts and cell lineages), as well as spatial/positional organization of tissues, organs, and organ systems of a flowering plant (e.g. leaf abaxial epidermis and leaf adaxial epidermis).
We established general rules for deciding when not to add terms to the ontology. To a great extent, qualifiers (or attributes) of the terms are avoided, and the ontology makes only very limited use of attributes. Thus, the term corolla is included, but the terms "sympetalous corolla" and "apopetalous corolla" are not. Attributes that are specific for describing mutant plants (e.g. wrinkled seed) are also excluded. Because it does not include attributes, the PSO is not sufficient as, nor is it intended to be, a taxonomic vocabulary on its own and does not address phylogeny of angiosperms. Moreover, the most granular terminology in the PSO is at the cell-type level. Therefore, terms for subcellular compartments are not included in the PSO. These terms are handled by the GO Cellular Component ontology. In addition, temporal landmarks (i.e. morphological and anatomical changes that occur via developmental progression of organs and organ systems) are excluded from the PSO; this aspect is a part of the Plant Growth and Developmental Stages Ontology (Pujar et al., 2006). Nonetheless, some temporal aspects are indirectly present in the PSO. Unlike in animal systems, most plant organs are developed in the postembryonic phase of the life cycle. Many plant structures develop continually, whereas others exist only temporarily; that is, at a particular time during the life cycle. Structures that exist even for a very short period of time, such as a leaf primordium, are included as terms in the PSO. For example, terms such as apical hook (defined as a hook-like structure that develops at the apical part of the hypocotyl in dark-grown seedlings in dicots) and leaf primordium (defined as an organized group of cells that will differentiate into a leaf that emerges as an outgrowth in the shoot apex) exist in the PSO. A leaf primordium is merely the first visible appearance of a leaf and, therefore, both terms, leaf and leaf primordium, describe the same entity (leaf) at different time points in development. 
There are genes that are expressed in organ primordia, such as JAGGED and FILAMENTOUS FLOWER genes in Arabidopsis (both expressed in leaf, sepal, petal, stamen, and carpel primordia) with expression levels declining in the developing or adult organs (Dinneny et al., 2004). To accurately annotate expression patterns of such genes, we created separate terms for each primordium structure. Currently, the PSO has 11 such terms.
To integrate terms from different species, we extensively used synonymy wherever feasible. This allows users to search existing plant databases using either a generic term or its taxon-specific synonyms. For example, silique, caryopsis, and kernel are listed as synonyms of the term fruit. Therefore, a search for fruit in the PO database would retrieve all genes expressed in the silique of Arabidopsis, caryopsis of rice, and kernel of maize. In reality, silique, caryopsis, and kernel are types (classes) of fruit, rather than strict synonyms. However, for the purpose of this ontology, specific types of a few high-level terms (e.g. fruit, inflorescence, and stem) are included only as synonyms. Thus, we intentionally overlooked an enormous morphological diversity of flowering plants in favor of cross species comparisons, generic searches, and intuitive ontology browsing. Therefore, synonyms in the PSO can be taxon-specific morphological forms of a generic structure. Also, an entity in the PSO can either be a term or a synonym, but not both. In a few cases where synonymy was not a suitable option, we created new terms as specific classes. Typical examples are the terms tassel and ear, the staminate and pistillate inflorescences, respectively, specific to the genus Zea. In addition to the synonyms described above, the PSO contains a number of terms that have authentic (exact) synonyms. Examples include the terms male gametophyte (synonym: pollen grain), female gametophyte (synonym: embryo sac), or perisperm (synonym: seed nucellus). 
Extensive use of synonymy in the PSO resulted in reduced granularity (i.e. the degree of detail in the ontology) and emphasized generic aspects of the ontology. As a rule, a high level of granularity was limited in the PSO because we strove to keep the ontology relatively simple, yet sufficiently broad and generic to encompass a number of flowering plants.
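The synonym-based search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synonym table holds just the examples named in the text, the gene identifiers (gene_a to gene_d) are hypothetical placeholders, and the function names are ours rather than part of any PO software. The index also rejects a synonym shared by two terms, which is stricter than the rule stated in the text.

```python
# Synonym table limited to the examples given in the text; the real
# ontology file is not parsed here.
SYNONYMS = {
    "fruit": ["silique", "caryopsis", "kernel"],
    "male gametophyte": ["pollen grain"],
    "female gametophyte": ["embryo sac"],
    "perisperm": ["seed nucellus"],
}

# Hypothetical (gene, structure as written by the curator) pairs.
ANNOTATIONS = [
    ("gene_a", "silique"),
    ("gene_b", "caryopsis"),
    ("gene_c", "kernel"),
    ("gene_d", "pollen grain"),
]


def build_index(synonyms):
    """Map every term name and synonym (lowercased) to its primary term."""
    index = {}
    for term, names in synonyms.items():
        for name in [term, *names]:
            key = name.lower()
            # An entity is either a term or a synonym, never both; this
            # check additionally rejects a synonym shared by two terms.
            if key in index and index[key] != term:
                raise ValueError(f"{name!r} already belongs to {index[key]!r}")
            index[key] = term
    return index


def resolve(name, index):
    """Return the primary term for a name or synonym, or None if unknown."""
    return index.get(name.strip().lower())


def search(query, annotations, index):
    """Return genes annotated to the query term or to any of its synonyms."""
    target = resolve(query, index)
    if target is None:
        return []
    return sorted(g for g, s in annotations if resolve(s, index) == target)
```

With this table, a search for fruit returns the three placeholder genes annotated to silique, caryopsis, and kernel, whereas the gene annotated to pollen grain is returned only by a search for male gametophyte or its synonym.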
A term (also called a node) in the PSO is an entity that represents a component of plant structure, such as cell, tissue, organ, and organ system. Each plant structure in the PSO has a term name, a unique numerical identifier (accession no.), a definition, and a specified relationship to at least one other term. An accession number always starts with the PO prefix followed by seven digits (e.g. PO:0009011). Once assigned to the PO term, the accession number never changes or gets reassigned to another term. Users should always cite an ontology term by its exact name and a complete accession number, including the prefix. Similar to the GO, the PSO is organized into a hierarchical network called the Directed Acyclic Graph (for definition, see http://www.nist.gov/dads/HTML/directAcycGraph.html). Three types of parent-child relationships are used in the PSO to specify the type of association between two terms: is_a, part_of, and develops_from (described in more detail in Jaiswal et al., 2005). The term plant structure (PO:0009011) is the highest level of the PSO. Each term immediately below plant structure represents high-level structures (broadly defined entities) that contain specific classes or types, positioned in the hierarchy as their direct descendants, called children terms. There are five direct children of plant structure: plant cell, tissue, organ, gametophyte, and sporophyte (Fig. 1). The remaining two nodes, in vitro cultured cell, tissue, and organ and whole plant, were originally included in all three plant species-specific anatomical ontologies that preceded the PSO. Because these terms were used in annotations by all three databases, we included them as top-level nodes. The latter node, whole plant, is conceptually inconsistent with the rest of the terms in the PSO (it is not a botanical term) and is intentionally left without children terms. We recommend that this term be used as a last option, only when precise annotation to any other term in the PSO is not possible. 
Sporophyte and gametophyte exist as separate terms because they represent diploid and haploid generations of the plant life cycle, respectively. The largest node, sporophyte, includes seed, root, shoot, and infructescence as direct children nodes. The term shoot is broadly defined as part of the sporophyte composed of the stems and leaves and includes shoot apical meristems. It has phyllome, stem, and inflorescence as part_of children terms, and rhizome, shoot borne shoot, root borne shoot, stolon, and tuber as specific types of a shoot. The term embryo (part_of seed) consists of a number of terms that are applicable for both eudicots and monocots (particularly members of Poaceae). Compared to eudicots, embryo development in grasses is more advanced; a fully developed embryo has body parts, such as coleoptile, coleorhiza, and scutellum, which are nonhomologous or absent in eudicots. Because no plant embryo has all body parts that are designated as part_of embryo in the PSO, we adopted a nonrestrictive part_of relationship type; the child must be a part_of the parent to exist in the ontology. However, a parent structure does not have to be composed of all of its part_of children. For example, scutellum is necessarily part_of embryo; that is, wherever scutellum exists, it is always a part of an embryo. However, not all embryos have scutellum (only embryos in Poaceae do). The high-level term infructescence was created to accommodate terms that describe both simple fruits, formed from a single ovary (e.g. grape [Vitis vinifera]), and compound fruits, formed from multiple ovaries (e.g. pineapple [Ananas comosus] or mulberry [Morus]). Currently, this node has only one direct descendant, fruit, which refers to a simple fruit. Terms specifically describing compound fruit will be included at a later time. Similar to the embryo node, the fruit node contains several part_of children and not every fruit type necessarily has all part_of descendants. 
Overlapping subsets of part_of terms can be created, each applicable to siliques of Arabidopsis and other Brassicaceae, caryopsis in cereals, and berry, a fleshy type of fruit, in tomato (Solanum lycopersicum) and other Solanaceae.
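The accession format and the nonrestrictive part_of relationship described above can be illustrated with a minimal Python sketch. The class and function names are hypothetical and do not correspond to any PO tool; only the PO prefix with seven digits, the three relationship types, and the embryo example are taken from the text.

```python
import re
from dataclasses import dataclass, field

# "PO" prefix, a colon, then exactly seven digits (e.g. PO:0009011).
ACCESSION_RE = re.compile(r"PO:[0-9]{7}")
RELATIONSHIP_TYPES = ("is_a", "part_of", "develops_from")


def is_valid_accession(accession):
    """True if the string is a complete PO accession, prefix included."""
    return ACCESSION_RE.fullmatch(accession) is not None


@dataclass
class Term:
    """Hypothetical minimal term record: a name plus typed parent links."""
    name: str
    parents: list = field(default_factory=list)  # (relationship, parent name)

    def add_parent(self, relationship, parent_name):
        if relationship not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {relationship}")
        self.parents.append((relationship, parent_name))


def declared_parts(terms, whole):
    """All terms declared part_of the given parent structure."""
    return {t.name for t in terms if ("part_of", whole) in t.parents}


def is_consistent(observed_parts, terms, whole):
    """Nonrestrictive part_of: every observed part must be a declared part
    of the parent, but the parent need not contain every declared part."""
    return set(observed_parts) <= declared_parts(terms, whole)


# Toy terms taken from the embryo example in the text.
TERMS = [Term("embryo"), Term("scutellum"), Term("coleoptile")]
TERMS[0].add_parent("part_of", "seed")
TERMS[1].add_parent("part_of", "embryo")
TERMS[2].add_parent("part_of", "embryo")
```

Under this reading, a grass embryo with a scutellum and a coleoptile and a eudicot embryo with neither are both consistent with the ontology, because the check constrains only the child-to-parent direction.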
Recently, terms relevant for the Solanaceae and Fabaceae families, perennials, and woody species were added to the PSO. For example, terms such as tuber (and its children terms subterranean tuber and aerial tuber) and root nodule (with children terms adventitious root nodule, determinate nodule, and indeterminate nodule) were added to accommodate annotations to genes and germplasms in the Solanaceae and Fabaceae families. In addition, the first attempt to add terms relevant for perennials and woody species was made (such as epicormic shoot, defined as a shoot developing from a trunk), with more terms still to be incorporated. A number of terms for secondary growth were also added, grouped under secondary xylem (such as heartwood, sapwood, growth ring, growth ring boundary, and others), secondary phloem (such as bark, libriform fiber, septate fiber, and phloem fiber), and vascular cambium (such as ray initial and fusiform initial), including several cell-type terms under parenchyma cell, such as wood parenchyma cell, with direct descendants, axial wood parenchyma cell, and ray wood parenchyma cell (with additional children terms underneath).
At the very top level of the hierarchy is the node obsolete. As in the GO, a term that has been removed from the ontology is never permanently deleted. Instead, the term and its assigned identifier are kept in the ontology file for the record. The definition is appended with OBSOLETE and an explanation is provided as to why a term was removed. The note in the definition or comment field might also contain suggested terms for searching and annotating. Obsoleted terms are not intended for use. Consequently, obsoleted terms do not have any annotations associated with them. In many cases, terms in the obsolete node are valid botanical terms (such as tunica and corpus); they are simply no longer in use in the PSO, mainly to avoid having duplicated terms that describe a similar plant structure. 
Instead of using the outdated concept of tunica and corpus, shoot apex organization is described by the following terms: central zone, peripheral zone, and rib zone. Other examples include terms depicting plant-specific subcellular structures (e.g. filiform apparatus), all of which were made obsolete in the PSO to avoid overlap with the GO. Users are advised to use cellular component terms in the GO instead.
Compared to the GO and other anatomical ontologies, the PSO is a rather small ontology. The top-level term (also called root node), plant structure (PO:0009011), has 726 children terms (release PO_0906; Table I), of which 384 (or 53%) are leaf terms, also called terminal nodes (the most specific terms with no children terms below), and 342 (47%) interior nodes (terms with children). In addition, the PSO currently has 304 synonyms assigned to 149 terms. The relatively small size of the PSO reflects the generic nature of the ontology; often, the most granular terms are specific to taxonomic groups and are included only when necessary (i.e. to retain biological accuracy and to comply with annotation requirements). Having reached a balance between broadness and granularity, the PSO is a stable and inclusive vocabulary. All of the top nodes, with the exception of the infructescence, are populated with necessary terms to describe the phenotypes and gene expression data in angiosperms that are currently being annotated.
We analyzed the structure of the PSO and the distribution of annotations to the PSO terms to assess the breadth, depth, and current usage of the ontology. The depth of a term was defined as the number of nodes in the longest path from the root to that term. Distribution of the depths of the terms in the PSO is shown in Figure 2A. The mean and mode of the depth in the ontology were 6.5 and 5, respectively, indicating that the majority of the terms were fairly granular. The greatest depth was 15, with the majority of the leaf terms (86%) having a depth between three and 10 (Fig. 2A). To some extent, this variability is due to the nature of the domain that the PSO describes (i.e. anatomy and morphology of an angiosperm). Certain morphological structures of an angiosperm are more complex, resulting in greater depths (such as flower or leaf), whereas others are much simpler (such as male gametophyte and female gametophyte). The pattern of distribution for terminal terms was similar to that for interior terms.
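The depth measure defined above can be made concrete with a short Python sketch on a toy fragment of the hierarchy. Two points are assumptions on our part: the root is counted as depth 0, so that a direct child such as whole plant sits at the first node, and the second parent given to flower is a hypothetical edge added only to show that the longest path determines the depth. The function names are illustrative.

```python
# Toy child -> parents map built from terms named in the text.
PARENTS = {
    "plant structure": [],
    "whole plant": ["plant structure"],
    "sporophyte": ["plant structure"],
    "seed": ["sporophyte"],
    "embryo": ["seed"],
    "scutellum": ["embryo"],
    "shoot": ["sporophyte"],
    "inflorescence": ["shoot"],
    # The direct link to sporophyte is hypothetical, for illustration only.
    "flower": ["inflorescence", "sporophyte"],
}


def term_depths(parents):
    """Depth of each term = length of the longest path from the root.

    Assumption: the root has depth 0 and its direct children depth 1.
    """
    memo = {}

    def depth(term):
        if term not in memo:
            ps = parents[term]
            memo[term] = 0 if not ps else 1 + max(depth(p) for p in ps)
        return memo[term]

    return {term: depth(term) for term in parents}


def leaf_terms(parents):
    """Terminal nodes: terms that are not a parent of any other term."""
    used_as_parent = {p for ps in parents.values() for p in ps}
    return set(parents) - used_as_parent
```

In this fragment, flower receives depth 4 through shoot and inflorescence even though the hypothetical shortcut through sporophyte would give depth 2, and whole plant is reported as a leaf term at depth 1.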
The number and distribution of the annotations at different depths of the ontology are a measure of the usage of the ontology, indicating how adequate the depth of the ontology is for the annotations of gene expression data and phenotypic descriptions. Because annotation to the most granular terms is the ultimate curation goal, we analyzed the current distribution of direct annotations across the PSO and distribution of annotation to leaf terms (Fig. 2B). The majority of direct annotations (83%) are made to nodes with a depth between two and five nodes, indicating that terms with more granularity (with a path depth of seven or more nodes) are less frequently used for direct annotations. Direct annotations to leaf terms are distributed between terms of depth between four and 11, with the exception of 405 annotations to the top-level term whole plant (Fig. 2B). Because this term does not have any children, it appears as a terminal term in the PSO at the first node. However, it is not a granular term and is excluded from further analysis. Only 155 leaf nodes, or 41% of total leaf nodes (excluding whole plant node), have direct annotations (1,075 annotations), accounting for 11% of total annotations to the PSO terms (Table I). Close to 90% of the annotations are made to nonleaf terms and the majority of the leaf terms are not currently used in annotations. This suggests that the granularity of the ontology is sufficient for the majority of the branches in the ontology. These data may also be indicative of the extent of knowledge of gene expression and phenotype characterization and could be further analyzed to determine which aspects of the ontology are less well studied than others. It is also possible that the distribution of the annotation reflects the extent of curation efforts in contributing databases and could be used to strategize directions in curation efforts. Finally, it may also reflect the current state of the technology used for gene expression data. 
Commonly available technologies for measuring gene expression (e.g. microarray technology, northern blots, reverse transcription-PCR) are most frequently applied to organs and organ systems, which are high-level terms in the ontology. This is not necessarily true for in-depth analyses of mutant phenotypes, even though a large number of phenotypic descriptions are generated in greenhouses or in the field, where observations are made using limited tools. As new technologies become more available for plant researchers, such as laser-capture microdissection, which allows for the procurement of specific cells of nearly any plant tissue, more granular terms in the PSO will likely be used for annotations.
The POC database is set up as a portal through which the data curated using PO for different plant organisms, such as Arabidopsis, rice, and maize, can be easily accessed at one site. Information from one hierarchical level in the ontology is propagated up to the next level (i.e. annotation to any given term with is_a or part_of relationship type implies automatic annotation to all ancestors of that term). Therefore, users can make inferences and perform queries at different levels in the PSO. For example, all Arabidopsis, rice, and maize genes expressed in the flower and phenotypes with altered floral development can be retrieved not only by a search using the term flower, but also by a search using the term inflorescence, of which the flower is a part. Also, a search with the term flower should retrieve all genes expressed in stamens, pistils, petals, or sepals. To elucidate the primary application of the PO, the annotation process in contributing databases is described below, followed by specific examples of how scientists can efficiently use the PSO in their research.
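The upward propagation of annotations described above can be sketched as follows. This is a simplified illustration, not the POC database implementation: the gene identifiers are hypothetical placeholders, the function names are ours, and the develops_from edge between leaf and leaf primordium is an assumed example included only to show that the rule stated in the text covers is_a and part_of links.

```python
# (child, relationship, parent) edges; the flower examples follow the text.
EDGES = [
    ("stamen", "part_of", "flower"),
    ("petal", "part_of", "flower"),
    ("flower", "part_of", "inflorescence"),
    # Assumed edge, used only to show a relationship that does not propagate.
    ("leaf", "develops_from", "leaf primordium"),
]

# Relationship types along which annotations are propagated (per the text).
PROPAGATING = {"is_a", "part_of"}


def ancestors(term, edges):
    """All ancestors reachable from term through is_a or part_of links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, relationship, parent in edges:
            if (child == current and relationship in PROPAGATING
                    and parent not in found):
                found.add(parent)
                stack.append(parent)
    return found


def propagate(direct_annotations, edges):
    """Map each term to the genes annotated to it directly or via descendants."""
    by_term = {}
    for gene, term in direct_annotations:
        for target in {term} | ancestors(term, edges):
            by_term.setdefault(target, set()).add(gene)
    return by_term


# Hypothetical direct annotations.
DIRECT = [("gene_1", "stamen"), ("gene_2", "flower"), ("gene_3", "leaf")]
```

Here a query for inflorescence returns both gene_1 (annotated to stamen) and gene_2 (annotated to flower), while the annotation to leaf is not carried over to leaf primordium.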
Annotations to the PSO in Participating Databases
A user interested in genes involved in leaf vascular development can query the PSO by entering an appropriate term, for example, leaf vein, in the PO browser and retrieve all annotations to this term and its children terms (midvein and secondary vein) in Arabidopsis, rice, and maize. The list includes genes that are expressed in leaf veins as well as phenotypes with altered leaf vein development. Annotations to this term are contributed by TAIR, NASC, and Gramene (Fig. 3A). The user can obtain more information about each gene or germplasm by clicking on the name of the contributing database (Source), as shown for the Arabidopsis YELLOW STRIPE LIKE 1 (YSL1) gene, annotated by TAIR (Fig. 3B).
Functional annotation of a gene, which is an association between a gene and a term in an ontology, summarizes information about its function at the molecular level, its biological roles, protein localization patterns, and spatial/temporal expression patterns (Berardini et al., 2004). Generally, annotation tasks are carried out at genomic databases, by manual or computational methods. All annotations contributed to the POC are composed manually by curators (biologists with an advanced degree) who either extract the information from published literature and generate concise statements by creating gene-to-term associations (Berardini et al., 2004; Clark et al., 2005) or record phenotype descriptions directly by observing plants (natural variants and mutants) in greenhouses or in the field. Literature curation is usually conducted at species-specific genomic databases (TAIR, Gramene, and MaizeGDB). Curators at plant stock centers, such as NASC, Arabidopsis Biological Resource Center, and Maize Genetics Cooperation Stock Center, often combine their in-house description of germplasms, based on greenhouse observations and/or stock donor information, with information available from the literature. Each gene-to-term association is a separate annotation entry and a gene can be annotated with several ontology terms. For instance, the YSL1 gene in Arabidopsis is annotated to multiple PO terms in TAIR (Fig. 3, B and C). YSL1 is expressed in male gametophyte, fruit, shoot, filament, sepal, petal, and leaf vein, with evidence codes inferred from expression pattern (IEP) and inferred from direct assay (IDA), extracted from the publication by Jean et al. (2005). Evidence codes are defined types of evidence, which are used to support the annotation. The most commonly used evidence codes for annotating gene expression data and phenotypes are IDA, IEP, and inferred by mutant phenotype (IMP). 
In addition to the evidence code, TAIR provides evidence description, which depicts more specific assay types for supporting the annotations. For instance, YSL1 is expressed in the shoot, with evidence code IEP and evidence description transcript levels (e.g. northerns; Fig. 3C). Details on evidence codes and evidence descriptions can also be found online (http://www.plantontology.org/docs/otherdocs/evidence_codes.html). More details on literature curation using controlled vocabulary and components of annotations can be found elsewhere (Berardini et al., 2004; Clark et al., 2005). Each contributing database has developed its own annotation interface and has taken different approaches to displaying gene and phenotype annotations. However, association files contributed to the POC Concurrent Versions System repository are uniformly formatted and are compliant with POC standards.
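The components of a single annotation entry described above (annotated object, ontology term, evidence code, reference, and contributing database) can be pictured with a small Python record. The field names and layout are hypothetical and are not the POC association file format, which is not specified here; only the three evidence codes and the YSL1 example are taken from the text.

```python
from dataclasses import dataclass

# Evidence codes named in the text as the most commonly used.
EVIDENCE_CODES = {
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IMP": "inferred by mutant phenotype",
}


@dataclass(frozen=True)
class Annotation:
    """One gene-to-term (or germplasm-to-term) association.

    Hypothetical layout for illustration; not the POC file format.
    """
    annotated_object: str  # gene or germplasm name
    term: str              # PSO term name
    evidence: str          # evidence code, e.g. "IEP"
    reference: str         # publication supporting the annotation
    source: str            # contributing database

    def __post_init__(self):
        if self.evidence not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence}")

    @property
    def evidence_name(self):
        """Spelled-out meaning of the evidence code."""
        return EVIDENCE_CODES[self.evidence]


# The YSL1 shoot annotation described in the text.
YSL1_SHOOT = Annotation("YSL1", "shoot", "IEP", "Jean et al. (2005)", "TAIR")
```

Because each association is a separate entry, the remaining YSL1 annotations would simply be further records of the same shape, one per term.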
Use of the PSO in Gene Expression and Protein Localization Experiments
Besides gene annotations, another common application of the PSO is in categorizing experiments and describing biological samples. For example, databases containing large-scale gene expression profiling data, such as GENEVESTIGATOR (Zimmermann et al., 2004) and NASCArrays (Craigon et al., 2004), are using the PSO to show genes that are expressed in certain plant structures and to describe microarray experiments, respectively. The Plant Expression Database (Shen et al., 2005) is currently incorporating PSO terms in their microarray experiment sample description and also in their data submission forms (R. Wise, personal communication). Similarly, ArrayExpress plans to implement PSO terms in the near future (H. Parkinson, personal communication). NASCArrays uses PSO terms to describe tissue sample sources used in microarray experiments (as BioSource Information; Supplemental Fig. S1).
Researchers can, and are encouraged to, use the PSO for describing tissue samples for various transcript analyses (e.g. northern blot/reverse transcription-PCR, β-glucuronidase/green fluorescent protein, in situ mRNA hybridization), protein localization experiments (e.g. immunolabeling, proteomic data), and gene expression assays from microarray experiments or laser-capture microdissection experiments in their publications and Web sites. Descriptions of other expression data, such as expressed sequence tags (ESTs) and cDNA libraries, can be enhanced by using proper botanical terms and accession numbers from the PSO. These datasets are submitted to dbEST at the National Center for Biotechnology Information (NCBI) and consistent use of standardized anatomical terms can greatly improve cross species comparison. For instance, a user interested in finding all ESTs from EST libraries generated from pollen grains across plant taxa could query the NCBI GenBank using the unique ID for the PSO term male gametophyte (synonym: pollen grain), PO:0020091, and retrieve all ESTs generated from pollen tissue samples. Currently, such a query is not feasible at the NCBI; instead, a search for the words pollen AND plant retrieves all EST entries in which both words, pollen and plant, appear anywhere in the text. The Gramene database has already started using the PSO for tissue-type description of 201 EST and cDNA libraries for cereals obtained from dbEST. The list of libraries and the links to the PSO terms can be viewed at http://www.gramene.org/db/ontology/association_report?id=PO:0009011&object_type=Marker%20library.
In summary, the consistent use of PSO terms across different plant species and the use of available annotations of gene expression data and phenotype descriptions are valuable aids to bench scientists and can facilitate new discoveries. Researchers who are involved in large-scale expression profiling projects, or who have generated mutant collections and are creating their own databases to store phenotypic data, are encouraged to use the PSO. The POC has already been contacted by a number of such laboratories with questions on how to use the ontologies for describing tissue samples in EST collections, laser-capture microdissection experiments, microarray experiments, and mutant phenotype collections. We are continuously making an effort to reach out to our prospective users and to meet the particular annotation needs of the collaborating databases, as well as the needs of the broader plant research community. Users are encouraged to contact the POC to get help, contribute their feedback, and suggest new ontology terms by writing to firstname.lastname@example.org.
Comparison of Gene Expression and Phenotype Data in Arabidopsis, Rice, and Maize
The data curated using the PSO, contributed by participating plant databases, can be easily accessed by performing one-stop queries in the POC database. As of August 31, 2006, the database has over 4,400 unique genes and nearly 1,900 germplasms annotated with PSO terms, with a total of over 10,000 associations, contributed by TAIR, Gramene, MaizeGDB, and NASC. Annotations are displayed and can be queried using the PO browser tool (http://www.plantontology.org/amigo/go.cgi), a modified AmiGO tool (see "Materials and Methods"). A user interested in genes involved in inflorescence development and their comparison between grasses (rice and maize) and Arabidopsis can search for the term inflorescence (PO:0009049) and retrieve all gene annotations and phenotypic descriptions associated with this term. Direct annotations to the PSO term and annotations to all its child terms are displayed on the term detail page. Hyperlinks to the original publications from which annotations were extracted provide quick access to the original experimental data and methodology. Combined with direct links to the gene and locus detail pages at contributing databases, they also give quick access to deposited DNA and protein sequences. Also, on the gene detail pages at Gramene and TAIR, functional annotations with GO terms are displayed and hyperlinked to the GO, providing access from the PO to the GO through these links.
The gene expression data available at the POC Web site, combined with sequence similarity and phylogenetic analysis, can facilitate comparative structural and functional studies of related plant genes. Although it is yet to be experimentally verified that the evolutionary conservation among plant genomes is manifested by functional similarity, such as distinct overlapping expression patterns of orthologous genes, available annotations of gene expression can be used as a starting point in such studies. This approach can be particularly useful for orthologs in maize and rice, considering their evolutionary relatedness (i.e. their monophyletic origin) and, to some degree, also for comparison to their putative orthologs in Arabidopsis. A known example is the study of functional complementation and overlapping expression patterns of the vp1 gene in maize and its Arabidopsis ortholog ABI3, both genes involved in seed maturation and germination (Suzuki et al., 2001). ABI3 is expressed in the Arabidopsis embryo and seed coat (TAIR), whereas germplasm of the maize vp1 mutant is annotated to the PSO term fruit (MaizeGDB). Thus, the query for the term fruit, of which the seed is a part, using species-specific filters for Arabidopsis and maize (available on the PO browser) would retrieve all genes/germplasms annotated in these two species, including vp1 and ABI3. Although the PO database does not yet have tools to address orthology or even sequence similarity in rice, maize, and Arabidopsis, annotation data available at the POC Web site can be used as a starting point for detailed studies of the function and expression of putative orthologous genes in rice and maize and their corresponding homologs in Arabidopsis. Web sites such as InParanoid provide orthology information for sequenced eukaryote genomes (O'Brien et al., 2005) and could be used in combination with the POC to address these questions.
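The fruit query described above can be sketched in a few lines of Python. This is a minimal illustration and not the PO browser's implementation: the term graph is a hypothetical miniature in which only the "seed is part of fruit" link comes from the text, the placement of embryo and seed coat under seed is assumed, and the function names `descendants` and `query` are invented for this sketch.

```python
from collections import defaultdict

# Hypothetical miniature of the ontology: child term -> parent terms.
# Only "seed is part of fruit" is stated in the text; placing embryo and
# seed coat under seed is an assumption made for illustration.
PARENTS = {
    "seed": ["fruit"],
    "embryo": ["seed"],
    "seed coat": ["seed"],
}

# (annotated object, species, term, contributing database), as in the example.
ANNOTATIONS = [
    ("ABI3", "Arabidopsis", "embryo", "TAIR"),
    ("ABI3", "Arabidopsis", "seed coat", "TAIR"),
    ("vp1", "maize", "fruit", "MaizeGDB"),
]


def descendants(term):
    """Return the term itself plus every term below it in the graph."""
    children = defaultdict(list)
    for child, parents in PARENTS.items():
        for parent in parents:
            children[parent].append(child)
    found, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def query(term, species=None):
    """Objects annotated to a term or any descendant, optionally filtered by species."""
    terms = descendants(term)
    return sorted({
        obj
        for obj, sp, annotated_term, _db in ANNOTATIONS
        if annotated_term in terms and (species is None or sp in species)
    })


print(query("fruit", species={"Arabidopsis", "maize"}))
```

Because annotations to child terms are included, a query on fruit returns ABI3 even though ABI3 is annotated only to embryo and seed coat.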
Extended Annotation of Mutant Phenotypes Using Controlled Vocabularies
Describing a phenotype is a complex task; to capture relevant biological information about an entire set of characteristics of an organism, one needs to consider all observable (measurable) traits, qualitative and quantitative, the type of assays, and specific experimental conditions in which interaction of genotype and environment occurs. Traditionally, curators at plant genomic databases have relied on the free-text description (usually as a short summary), often combined with images of mutant phenotypes and natural variants. This approach largely limits data manipulation and searches and prevents easy comparison across species.
The PSO is an essential ontology for moving toward more systematic annotation of phenotypes. However, it depicts plant structures only during normal development of a plant. It does not include terms that describe morphological variations of cells, tissues, and organs in mutated plants (e.g. fasciated ear) or qualitative and quantitative descriptors (e.g. type of branching, trichome shape, spikelet density). If used exclusively, the PSO would therefore be insufficient to capture all of the details of a phenotype in a controlled vocabulary format, and additional ontologies are required to capture the relevant biological information fully.
Recently, the NASC, Gramene, and MaizeGDB moved toward combining PO terms with other ontologies to annotate mutant phenotypes and natural variants to allow computation and more efficient cross species comparison. At Gramene, PO terms are used in conjunction with Trait Ontology terms (Yamazaki and Jaiswal, 2005) to describe phenotypes. As an example, the phenotype description of the allele cg.1 of the cigar shape panicle (cg) gene in rice (Seetharaman and Srivastava, 1969; Prasad and Seetharaman, 1991) is shown in Supplemental Figure S2A. This mutation affects the morphology of a panicle, rachis, and grain (see the text description in Supplemental Fig. S2); thus, the annotations were made to PSO terms inflorescence (PO:0009049), stem (PO:0009047), and seed (PO:0009010). In addition to PO terms, curators from the Gramene database chose the following Trait Ontology terms to annotate the cg.1 allele in rice: panicle type (TO:0000089), seed length (TO:0000146), seed size (TO:0000391), and stem length (TO:0000576).
A different approach has been taken by the NASC database for describing mutant phenotypes and natural variants in Arabidopsis. In addition to a free-text description, short statements, referred to as an entity, attribute, value (EAV) description, are composed by combining terms from orthogonal (i.e. nonoverlapping) ontologies. This model has been tested in pilot projects at a few model organism databases, namely, ZFIN (Sprague et al., 2003) and FlyBase (FlyBase Consortium, 2002). The EAV model relies on the Phenotype and Trait Ontology (PATO), a species-independent controlled vocabulary created as a schema in which the qualitative phenotypic data are represented as nouns and phrases (Gkoutos et al., 2005). The core of the PATO is composed of a set of attribute and value terms (such as color, shape, and size; green, serrate, and dwarf), which were recently converted to a single hierarchy of qualities (G. Gkoutos, personal communication). At the NASC database, the allele ckh1-1 (in Landsberg erecta background), a mutation of the CYTOKININ-HYPERSENSITIVE 1 gene in Arabidopsis, is annotated to the PO term inflorescence (PO:0009049) and to the PATO term ShortHeight-Value (PATO:0000569), creating the following syntax: inflorescence:short:height. An additional annotation to primary root (PO:0020127) is followed by ShortLength-Value (PATO:0000574), creating the syntax primary root:short:length. Thus, multiple controlled vocabulary statements can be created for any germplasm/seed stock.
Presently, the POC database and ontology browser are not set up to display annotations to multiple ontologies. Therefore, controlled vocabulary annotations to ontologies other than the PO can be viewed on gene/germplasm/stock detail pages at contributing databases, which can be accessed by clicking on the appropriate database link (Supplemental Fig. S2B). More details on using the Trait Ontology and the PATO and EAV model can be found at the Gramene and NASC Web sites, respectively. Whereas the Trait Ontology is plant specific and was created for the purpose of annotating mutants in rice and other cereal crops, the PATO is species independent and intended for description of mutant phenotypes across kingdoms. PATO terms can be used in combination with a wide range of other ontologies that describe entities, such as GO, Cell Ontology, and anatomical and developmental stage ontologies, among others.
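The EAV statements given above for the ckh1-1 allele can be represented as simple records. The following Python sketch is illustrative only and does not reflect how NASC stores its annotations: the class name `EAV`, its field names, and the `statement` method are assumptions, while the term names, identifiers, and the entity:value:attribute ordering are taken from the two examples in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EAV:
    """One entity-attribute-value statement (hypothetical record layout)."""
    entity: str      # PO term name
    entity_id: str   # PO accession
    attribute: str   # PATO attribute, e.g. height
    value: str       # PATO value, e.g. short
    pato_id: str     # PATO accession for the combined value term

    def statement(self):
        # Ordering follows the examples in the text: entity:value:attribute.
        return f"{self.entity}:{self.value}:{self.attribute}"


# Annotations for the ckh1-1 allele as described in the text.
ckh1_1 = [
    EAV("inflorescence", "PO:0009049", "height", "short", "PATO:0000569"),
    EAV("primary root", "PO:0020127", "length", "short", "PATO:0000574"),
]

for annotation in ckh1_1:
    print(annotation.statement())
```

Keeping the entity and quality identifiers in separate fields reflects the orthogonality of the two ontologies: either side can be queried or updated without touching the other.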
A major concern for the PSO is the proliferation of terms. The number of terms needs to be large enough for precise annotation of genes and phenotypes, but small enough for curators and end users to navigate the ontology easily. The terminology for describing plants is rich and complex and is often species or family specific. Available visualization and editing software portrays the ontologies as strictly hierarchical, whereas plant structure is not. Rather, it is modular in nature, with a relatively small number of tissue and cell types recurring, often with slight modification, in different organ systems at different times during development. Converting a modular structure to a formal hierarchy requires extensive redundancy in the ontology. For example, a flower might be a part of a cyme, a raceme, or any other inflorescence type. To maintain the appropriate upward flow of information through the hierarchy, we would need to create a term specifying a distinct type of flower within each inflorescence type. Thus, we faced the possibility of creating the terms flower of cyme, flower of raceme, flower of panicle, etc., followed by stamen of flower of cyme, gynoecium of flower of cyme, etc. With inflorescence and fruit types, we solved the problem by placing all of the different inflorescence and fruit types as synonyms of inflorescence and fruit, respectively. This effectively removed one hierarchical level from the ontology at these positions (Supplemental Fig. S3).
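The growth in term count described above can be made concrete with a small enumeration. This Python sketch is a toy calculation under stated assumptions, not a count of actual PSO terms: it uses only the three inflorescence types and two flower parts named in the text, and the variable names are invented for illustration.

```python
from itertools import product

# Inflorescence types and flower parts named in the text; the real
# vocabulary is much larger, so these lists are illustrative only.
inflorescence_types = ["cyme", "raceme", "panicle"]
flower_parts = ["stamen", "gynoecium"]

# Strict hierarchy: a distinct flower term per inflorescence type,
# then a distinct part term per flower term.
hierarchical_terms = [f"flower of {infl}" for infl in inflorescence_types]
hierarchical_terms += [
    f"{part} of flower of {infl}"
    for infl, part in product(inflorescence_types, flower_parts)
]

# Synonym approach: the types become synonyms of inflorescence,
# so one flower term and one term per part suffice.
synonym_terms = ["flower"] + flower_parts

print(len(hierarchical_terms), "terms in the strict hierarchy")
print(len(synonym_terms), "terms with inflorescence types as synonyms")
```

In the strict hierarchy the count is the number of inflorescence types multiplied by one plus the number of parts, so every added type or part multiplies the redundancy; with synonyms the count is independent of the number of inflorescence types.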
Synonymy was not appropriate to account for staminate and pistillate inflorescences of Zea, which are physically separate and morphologically distinct from each other (the plant is monoecious). The two types of inflorescence often have different phenotypes in single-gene mutants, and identical genes are often deployed differently in each. Maize geneticists thus often want to be able to distinguish these two. Therefore, the maize ear and tassel are the only two inflorescence types that are treated as types of inflorescence rather than as synonyms.
The solution by synonymy does not fully eliminate the problems with proliferation of terms. Users of the ontology will find extensive residual redundancy in some areas. Ultimately, new visualization and ontology editing software and a different approach to creating ontologies will be needed to reflect the modularity of biological reality more precisely and intuitively.
Homology Assessment and Taxon-Specific Forms
The PSO is designed to be a practical tool for annotating genes and germplasms and to be, as far as possible, neutral on questions of homology. Thus, for example, the terms cotyledon and scutellum are not treated as synonyms, even though there is a body of thought that suggests that they might be derived from the same sort of ancestral structure. As our knowledge of plant structure continues to develop, however, some of these terms may be merged. More problematic, but also perhaps more interesting, are structures that are unique to particular clades of plants. These are currently accommodated by the sensu designation, but as major groups are added, the number of such terms is likely to increase. For example, stipules are considered to have arisen independently in multiple lineages and they may prove to be developmentally and genetically distinct. If true, in addition to a common term stipule (PO:0020041), the PO could be faced with multiple terms such as stipule sensu Rubiaceae, stipule sensu Fabaceae, stipule sensu Brassicaceae, etc. Handling phylogenetic relationships is currently beyond the scope of the PSO, but it is an important topic to address in the long run. As more genes are annotated from more species, the PSO may help to discover whether similar structures that have evolved independently are produced by very distinct underlying genetic mechanisms.
The PSO was based on three species-specific ontologies, TAIR Anatomy Ontology (Berardini et al., 2004), the Cereal Ontology from Gramene, and the maize (Zea mays) Ontology (Vincent et al., 2003). However, most PSO terms and definitions were adopted from a few well-known textbooks and glossaries. Most definitions come from Plant Anatomy (Esau, 1977) and from the Angiosperm Phylogeny Web site created and maintained by Peter Stevens and the Missouri Botanical Garden (http://www.mobot.org/MOBOT/research/APweb). Definitions were sometimes taken verbatim from references or modified for clarity. Original publications are often consulted. In addition, opinions of plant researchers in respective areas of expertise are periodically sought by the POC. The ontologies were created and edited using the GO ontology editing tool, Directed Acyclic Graph Editor, which is freely available from SourceForge (http://sourceforge.net/project/showfiles.php?group_id=36855).
As part of an ongoing effort to actively maintain the plant ontologies, the POC meets on a regular basis to discuss new terms and ontology structure suggestions. Users are encouraged to use the feedback navigation bar menu option on the POC Web site to suggest new ontology terms or send feedback, or to contact the POC at email@example.com. Ontology and annotation updates are released on the POC Web site in the last week of every month. Each release is indicated by the month and year of the release date (e.g. PO_0906), displayed at the left side of the ontology browser header. POC ontology and association files in the Concurrent Versions System are tagged accordingly to connect the respective flat files with the database release. The same files that are used for each database release are also posted at the SourceForge OBO Web site (http://obo.sourceforge.net/cgi-bin/detail.cgi?po_anatomy). Synchronization between POC ontology releases and participating database releases of PO is handled individually by each database. The individual databases regularly update their PO versions either on a monthly (TAIR) or quarterly (Gramene) basis.
Ontology and Annotation Analysis
To generate statistics for the path depth of PSO terms and annotations, we downloaded, installed, and queried the PO MySQL database version 09/06 (http://www.plantontology.org/download/database). The depth of a term was determined as the number of nodes in the longest path from the root node to that term. This measure of depth was used so that a parent-child relation would never decrease the depth of a term.
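The depth measure can be illustrated outside the database with a short Python sketch. It is not the SQL used for the analysis: the graph below is a hypothetical four-term example with abstract names, the function name `depth` is invented here, and counting the root itself as one node is an assumed reading of "number of nodes in the longest path."

```python
from functools import lru_cache

# Hypothetical term graph: term -> list of parent terms.
# Term "c" has two parents at different depths, as can happen in a
# directed acyclic graph.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["a"],
    "c": ["root", "b"],
}


@lru_cache(maxsize=None)
def depth(term):
    """Number of nodes on the longest path from the root down to the term."""
    parents = PARENTS[term]
    if not parents:
        return 1  # assumption: the root counts as one node
    return 1 + max(depth(parent) for parent in parents)


for term in PARENTS:
    print(term, depth(term))
```

Taking the maximum over all parents is what guarantees that a child is always deeper than every one of its parents; using the shortest path instead would place "c" directly under the root and above its own parent "b".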
Database and Ontology Browser
We used the GO database schema and ontology browsing tool, AmiGO, for storing and displaying the PSO and its annotations, respectively. The AmiGO browser, a Web-based tool for searching ontologies and their associations developed by the GO consortium, is freely available open-source software. We made minor modifications to make it more suitable to the specific requirements of PO. Modifications of AmiGO pertinent to the general ontology community were contributed to GO. The PO browser accesses the MySQL POC database at Cold Spring Harbor Laboratory, Cold Spring Harbor, NY. The structure of the POC database and main features of the Web site have been previously described (Jaiswal et al., 2005).
The following materials are available in the online version of this article.
We acknowledge the Gene Ontology Consortium for software infrastructure and technical support. We thank our two industry collaborators, Monsanto and their Genome Knowledge Enhancement Program, and Pioneer Hi-Bred International, Inc., for contributing their species-specific ontologies. We also thank Quentin Cronk, Rex Nelson, Naama Menda, Victoria Carollo, and William Friedman for their participation in ontology development, as well as numerous researchers and curators who have reviewed the plant ontologies and are listed individually online (http://www.plantontology.org/docs/otherdocs/acknowledgment_list.html).
Received November 8, 2006; accepted November 26, 2006; published December 1, 2006.
Berardini TZ, Mundodi S, Reiser L, Huala E, Garcia-Hernandez M, Zhang P, Mueller LA, Yoon J, Doyle A, Lander G, et al (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol 135: 1–11
Jaiswal P, Avraham S, Ilic K, Kellogg EA, McCouch S, Pujar A, Reiser L, Rhee SY, Sachs MM, Schaeffer M, et al (2005) Plant ontology: A controlled vocabulary of plant structures and growth stages. Comp Funct Genom 6: 388–397
Pujar A, Jaiswal P, Kellogg EA, Ilic K, Vincent L, Avraham S, Stevens P, Zapata Z, Reiser L, Rhee SY, et al (2006) Whole plant growth stage ontology for angiosperms and its application in plant biology. Plant Physiol 142: 414–428
Sprague J, Clements D, Conlin T, Edwards P, Frazer K, Schaper K, Segerdell E, Song P, Sprunger B, Westerfield M (2003) The Zebrafish Information Network (ZFIN): the zebrafish model organism database. Nucleic Acids Res 31: 241–243
Zimmermann P, Hirsch-Hoffmann M, Hennig L, Gruissem W (2004) GENEVESTIGATOR: Arabidopsis microarray database and analysis toolbox. Plant Physiol 136: 2621–2632