1Walloon Agricultural Research Centre (CRA-W), Gembloux, Liroux 9, B-5030, Belgium, 2Agroscope Changins-Wädenswil (ACW), Phytopathologie, P.O. Box 185, Schloss CH-8820 Wädenswil, Switzerland, 3Genetics and Horticulture – GenHort, National Institute for Agricultural Research, 15 INRA, BP 60057, F-49071 Beaucouzé Cedex, France and 4Plant Research International (PRI), P.O. Box 16, 6700 AA Wageningen, The Netherlands
Open Access article from Bioinformatics 2007 23(7):882-891.
Objective: AppleBreed DataBase (DB) aims to store genotypic and phenotypic data from multiple pedigree-verified plant populations (crosses, breeding selections and commercial cultivars) so that they are easily accessible for geneticists and breeders. It will help in elucidating the genetics of economically important traits, in identifying molecular markers associated with agronomic traits, in allele mining and in choosing the best parental cultivars for breeding. It also provides high traceability of data over generations, years and localities. AppleBreed DB could serve as a generic database design for other perennial crops with long economic lifespans, long juvenile periods and clonal propagation.
Results: AppleBreed DB is organized as a relational database. The core element is the GENOTYPE entity, which has two sub-classes at the physical level: TREE and DNA-SAMPLE. This approach facilitates all links between plant material, phenotypic and molecular data. The entities TREE, DNA-SAMPLE, PHENOTYPE and MOLECULAR DATA allow multi-annual observations to be stored as individual samples of individual trees, even if the nature of these observations differs greatly (e.g. molecular data on parts of the apple genome, physico-chemical measurements of fruit quality traits, and evaluation of disease resistance). AppleBreed DB also includes synonyms for cultivars and pedigrees. Finally, it can be loaded and explored through the web, and comes with tools to present basic statistical overviews and with validation procedures for phenotypic and marker data to certify data quality.
AppleBreed DB was developed initially as a tool for scientists involved in apple genetics within the framework of the European project, ‘High-quality Disease Resistance in Apples for Sustainable Agriculture’ (HiDRAS), but it is also applicable to many other perennial crops.
The demand for the storage of genotyping data is also increasing tremendously due to the pace at which large numbers of PCR-based molecular markers are being developed. Initially, studies on marker-trait associations were limited in size, usually involving just a single cross. The use of a single cross suffices as long as the genetic basis of a trait is extremely simple (only one locus with one + allele). In all other cases, multiple crosses are needed if sound conclusions are to be reached on the number of loci, alleles and mode of action of genes (intra- and/or inter-locus interactions). Studies on multiple crosses therefore demand high quality and good data management facilities.
In the perennial apple crop, a new concept of gene and QTL identification was initiated called Pedigree Genotyping (Van de Weg et al., 2004). This approach aims to identify marker-gene associations, functional allelic diversity and both intra- and inter-locus interactions by the integrated analysis of multiple plant populations (crosses, breeding selections and commercial cultivars) that are genetically related by their pedigree. The European project ‘High-quality Disease Resistance in Apples for Sustainable Agriculture’ (HiDRAS) (Gianfranceschi and Soglio, 2004) was initiated to test the concept. In this study, more than 2000 genotypes are being extensively phenotyped and genotyped, delivering more than 1 million data points. Each phenotypic data point is associated with its own descriptors for tree, year, sample and locality. Each genetic data point is associated with its own descriptors for DNA sample, tree, genotype, marker and map position. To meet the needs for the storage and accessibility of these data, a database was needed. There are already several databases managing both genomic and phenotypic information for the plant kingdom. The MaizeGDB database (Lawrence et al., 2004), for instance, is a repository for maize sequence, stock, phenotype, genotypic and karyotypic variation, as well as chromosomal mapping data. The GrainGenes database (Matthews et al., 2003) focuses on grasses and cereals, storing both genetic and phenotypic information. It holds, amongst others, the genealogy and allelic constitution of markers and genes from 69 632 wheat accessions. Other databases have been developed for managing genome molecular information (Rhee et al., 2003; Schoof et al., 2002) or for storing genes and protein information for Arabidopsis thaliana (ABRC, NASC, MATDB).
All these databases focus on annual plants and most of them manage genomic or phenotypic information separately. None of them allows the management of pluri-annual data on the same individual plants (Reiser et al., 2002; Sakata et al., 2000). As none of the existing public databases were able to support extensive studies on marker-trait associations in pedigreed populations of perennial crops, AppleBreed DB was developed. In the context of database construction, apples could serve as a model for perennial crops. Apples are a woody perennial and have a 3–7 year juvenile phase, which is a significant handicap in combining high fruit quality and durable disease resistance by classical breeding. Apples are self-incompatible due to a gametophytic incompatibility system, and therefore inbreeding methods are not applied (Lespinasse, 1992). Apples are vegetatively propagated, have an economic lifespan of about 15 years during which they produce 13 crops, are economically important and are highly rated among consumers, being ranked third in a fresh fruit consumption survey after banana and citrus (Pollack, 2001). Currently, there are more than 10 000 apple cultivars (Morgan and Richardson, 2002; Way et al., 1991) throughout the world. Nevertheless, world apple production is based on a handful of cultivars that are grown in commercial orchards. The most important commercial cultivars are highly susceptible to the most important apple diseases (scab, powdery mildew and European canker), and most of the resistant cultivars do not yet meet the quality demands of consumers.
The most important objective of worldwide apple breeding programmes is therefore to combine high fruit quality with good disease resistance. To achieve this aim, breeders need a better understanding of the genetic basis of fruit quality traits and disease resistance, and to obtain access to molecular markers for the most important genes controlling these traits.
AppleBreed DB supports breeders and geneticists in their genetic studies and in their exploration of germplasm collections. Structured information stored in the database should help them not only to elucidate the genetics of complex traits and to assess marker-trait associations, but also to choose more easily and more quickly the most interesting genitors to cross with (e.g. with good disease resistance, a particular taste, or a skin colour preferred by consumers). In this way, it is expected that breeders will more easily be able to create new cultivars meeting consumer preferences and allowing sustainable production systems. This article describes the database model of AppleBreed DB. AppleBreed DB is sufficiently generic to allow it to be used as a model database for many other perennial crops.
Conceptual data model (CDM)
The CDM includes six main super-classes: GENOTYPE, PHENOTYPE DATA, MOLECULAR DATA, GROWTHSITE, ORGANIZATION and REFERENCE. As shown in Figure 1, all entities are structured around the super-class GENOTYPE, which is the core element of the model. It covers all plant material by individual trees and DNA samples which can come from any kind of material (cultivars, breeding selections, segregating populations and gene bank accessions). GENOTYPE is subdivided into three classes: PLANT MATERIAL, PASSPORT and SYNONYMS. PLANT MATERIAL includes the two main sub-classes TREE and DNA-SAMPLE, PASSPORT includes the PEDIGREE and ACCESSION main sub-classes, and SYNONYMS includes the SYNONYM and PATRONYM sub-classes.
TREE and DNA-SAMPLE hold the identity descriptors for each individual tree, DNA sample and genotype name. TREE also includes descriptors for the precise location where trees were grown (institute, plot, row and position in row) and their origin (origin of bud wood, year of sowing, planting and grafting, and rootstock). DNA-SAMPLE also includes the origin of each sample (tree from which the sample was derived, date of isolation, position on micro-titre plates of the original sample as well as its sub-samples, etc.).
ACCESSION is used to identify and characterize the plant material (cultivars, breeding selection, segregating population and gene bank accession information). PEDIGREE describes the parentage of each accession up to the founder level and therefore facilitates ‘Pedigree Genotyping’, a new pedigree-based approach to QTL identification and allele mining in pedigreed populations (Van de Weg et al., 2004). The class SYNONYMS holds the known synonyms and patronyms of each genotype, and accounts for the most frequently occurring typing errors.
Figure 1 shows the relationships between GENOTYPE and other elements of the database. Each genotype is localized in one or more specific trial plots (GROWTHSITE) and each institution (ORGANIZATION) supervises its trial plots. Genotypes are evaluated for their fruit quality and disease resistance (PHENOTYPE DATA). The procedures and results of the genotype DNA analyses are stored in MOLECULAR DATA. Each genotype listed in the database is referenced according to the literature references (Silbereisen et al., 1996; Smith, 1971) in the REFERENCE super-class. Table 1 summarizes the information included in each super-class and the corresponding main classes.
Most classes are further divided into one or more generations of sub-classes, until the desired level of detail is reached. All these entities have been converted into tables at the LDM level. A class or sub-class may include one or several tables. The most important tables of the database are listed in Table 2.
As stated earlier, GENOTYPE holds information that identifies genotypes (names of cultivars, breeding selections, crosses and gene bank accession characteristics) and the tangible part of the plant material (trees and DNA samples). Phenotypic information concerns fruit quality and disease resistance. Finally, molecular information relates to molecular markers used to construct genetic linkage maps, and to information on allele mining, loci and the pedigree of alleles. Each genotype listed in the database is considered to be a central key for the traceability of the information stored in the AppleBreed DB.
Logical data model (LDM)
The LDM describes entities defined within each super-class and their relationships with other entities defined above. The database diagrams (see Figs 2–4) give an external view of the AppleBreed DB data content. The consistency of data is automatically checked by the database management system itself, at a higher level, according to the rules and the relationships defined when the schema is implemented. The LDM is presented in more detail for the super-classes (i) GENOTYPE, (ii) PHENOTYPE and (iii) MOLECULAR DATA, specifying their primary and secondary keys.
In the GENOTYPE super-class (Fig. 2) the GT_TREE table and GT_DNA_SAMPLE table are the most important because they allow the individual genotype for the phenotype assessment and the molecular data analysis to be set up. Because of the high importance of plant material identification and certification for genetic studies, the emphasis was put on tracking and tracing aspects for the definition of the structure of these tables. Their detailed content is presented in Tables 3 and 4. The link between them is made through the TREE_LABEL primary key.
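The link between the two tables can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the MySQL implementation, and every column other than TREE_LABEL and DNA_SAMPLE_ID is a hypothetical placeholder, not the actual content of Tables 3 and 4.

```python
import sqlite3

# SQLite stands in for MySQL; columns other than TREE_LABEL and
# DNA_SAMPLE_ID are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE GT_TREE (
    TREE_LABEL      TEXT PRIMARY KEY,
    GENOTYPE_NAME   TEXT NOT NULL,
    PLOT            TEXT,
    ROW_NO          INTEGER,
    POSITION_IN_ROW INTEGER
);
CREATE TABLE GT_DNA_SAMPLE (
    DNA_SAMPLE_ID  INTEGER PRIMARY KEY,
    TREE_LABEL     TEXT NOT NULL REFERENCES GT_TREE(TREE_LABEL),
    ISOLATION_DATE TEXT
);
""")

# One illustrative tree and one DNA sample taken from it.
conn.execute("INSERT INTO GT_TREE VALUES ('T001', 'Example cultivar', 'P1', 3, 12)")
conn.execute("INSERT INTO GT_DNA_SAMPLE VALUES (1, 'T001', '2005-06-01')")

# Trace a DNA sample back to its tree and genotype name via TREE_LABEL.
rows = conn.execute("""
    SELECT s.DNA_SAMPLE_ID, t.GENOTYPE_NAME
    FROM GT_DNA_SAMPLE AS s
    JOIN GT_TREE AS t ON t.TREE_LABEL = s.TREE_LABEL
""").fetchall()
```

Under these assumptions, every DNA sample can be traced to exactly one tree, and through it to a genotype name.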
The GT_ACCESSION table is used to store information that is assigned to an accession when it is entered into a collection. The key element of this table is the accession number, which is unique in the collection. Once assigned, this number can never be reassigned to another accession. The GT_PEDIGREE table allows a user to determine whether a relationship exists between phenotypic characteristics and genomic results from genitors and their progenies.
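Tracing an accession back to its founders from parentage records could look like the following sketch. The dictionary layout, the accession names and the `founders` helper are all hypothetical; they only illustrate the kind of traversal that pedigree records make possible.

```python
# Hypothetical pedigree records: accession -> (mother, father).
# None marks an unknown parent; accessions without an entry are founders.
pedigree = {
    "Selection-1": ("Cultivar-A", "Cultivar-B"),
    "Cultivar-A": ("Founder-1", "Founder-2"),
    "Cultivar-B": ("Founder-2", None),
}


def founders(accession, pedigree):
    """Return the set of founders reachable from an accession.

    An accession with no recorded parent is treated as a founder.
    Assumes the pedigree contains no cycles.
    """
    parents = [p for p in pedigree.get(accession, ()) if p is not None]
    if not parents:
        return {accession}
    result = set()
    for parent in parents:
        result |= founders(parent, pedigree)
    return result


selection_founders = founders("Selection-1", pedigree)
```

Here `selection_founders` contains `Founder-1` and `Founder-2`; the unknown father of `Cultivar-B` is simply skipped.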
The high number of synonyms for cultivars is a recurrent problem for breeders, geneticists and managers of gene banks; they impair the efficient management and exploitation of the collections.
This is especially true for old genotypes received or collected in different places and times.
For example, Cox's Orange Pippin has more than 40 synonyms. In addition, very modern cultivars often have both a cultivar and a trademark name. Finally, for widely grown cultivars there are often many mutants, each with its own name. For the old cultivars, there are many sources of synonyms. One is translation or transliteration of original names into local languages. There are also spelling errors due to the ‘appropriation’, over time, of introduced foreign genotypes in local traditions, resulting in new local names adapted to the language or dialect (Oger and Lateur, 2004). This problem can lead to major disappointments. For example, geneticists might believe they are working on different genotypes, but after obtaining their results they realize they are working on the same genotype with different names. The database model addresses this problem by using the SYNONYMS main class. The first appellation found in the literature has, in most cases, to be considered as the patronymic name. This name is filled out in the identifier field PATRONYM_NAME in the GT_PATRONYM table, as displayed in Figure 2, and a link between the patronymic name of a genotype and its synonyms is assured through the PATRONYM_ID.
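The patronym lookup can be sketched as follows, again with sqlite3 standing in for MySQL. PATRONYM_ID and PATRONYM_NAME come from the text; the SYNONYM_NAME column, the `resolve` helper and all cultivar names are invented for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; SYNONYM_NAME and all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GT_PATRONYM (
    PATRONYM_ID   INTEGER PRIMARY KEY,
    PATRONYM_NAME TEXT UNIQUE NOT NULL
);
CREATE TABLE GT_SYNONYM (
    SYNONYM_NAME TEXT PRIMARY KEY,
    PATRONYM_ID  INTEGER NOT NULL REFERENCES GT_PATRONYM(PATRONYM_ID)
);
""")
conn.execute("INSERT INTO GT_PATRONYM VALUES (1, 'Example Pippin')")
conn.executemany(
    "INSERT INTO GT_SYNONYM VALUES (?, ?)",
    [("Exemple Pepin", 1), ("Exampel Pippin", 1)],  # translation, typing error
)


def resolve(name):
    """Return the patronymic name for a patronym or synonym, else None."""
    row = conn.execute(
        """
        SELECT p.PATRONYM_NAME
        FROM GT_PATRONYM AS p
        LEFT JOIN GT_SYNONYM AS s ON s.PATRONYM_ID = p.PATRONYM_ID
        WHERE p.PATRONYM_NAME = ? OR s.SYNONYM_NAME = ?
        LIMIT 1
        """,
        (name, name),
    ).fetchone()
    return row[0] if row else None
```

With this layout, two data sets recorded under different names resolve to the same patronym before any analysis is run, which is the situation the SYNONYMS class is meant to handle.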
PHENOTYPE DATA super-class
The PHENOTYPE DATA super-class (Fig. 3) includes two main classes: FRUIT QUALITY and DISEASE RESISTANCE. Each genotype is studied for several traits (Gianfranceschi and Soglio, 2004), such as: fruit external characteristics (shape, ground colour, overall colour, fruit size, etc.), fruit internal quality (sugar content, starch index, acidity, etc.), the sensorial evaluations of expert panels to determine the quality of the fruits (sourness, juiciness, firmness, etc.) and the disease levels under natural conditions in the orchard as well as in specially designed greenhouse tests. Data are encoded for each individual assessment, which can be made for a series of individual apples (e.g. firmness data) for different dates (e.g. 0, 2 and 4 months after harvest), localities and years.
Figure 3 also displays the relationships among the main tables of this super-class as well as the relationships with other tables included in other entities, such as GENOTYPE and GROWTHSITE. With regard to the sensorial, instrumental, external and expert panel observations, a composite primary key identifies each observation. This key includes an identifier for the sample, an identifier for harvest times (a date), an identifier for the applied method of assessment and an identifier for the institution making the observations.
This kind of primary key structure gives each institution the possibility of marking its own samples (there is a unique sample code number for each institution).
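A composite key of this kind can be sketched as below, with sqlite3 standing in for MySQL. The table name and PH_SAMPLE_ID appear in the article; the other column names and all values are assumptions made for the example.

```python
import sqlite3

# SQLite stands in for MySQL; column names other than PH_SAMPLE_ID
# and all inserted values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE PH_INSTRUMENTAL_ANALYSIS (
    PH_SAMPLE_ID   INTEGER NOT NULL,
    HARVEST_DATE   TEXT    NOT NULL,
    METHOD_ID      INTEGER NOT NULL,
    INSTITUTION_ID INTEGER NOT NULL,
    FIRMNESS       REAL,
    PRIMARY KEY (PH_SAMPLE_ID, HARVEST_DATE, METHOD_ID, INSTITUTION_ID)
)
""")

insert = "INSERT INTO PH_INSTRUMENTAL_ANALYSIS VALUES (?, ?, ?, ?, ?)"
conn.execute(insert, (1, "2004-09-15", 1, 10, 7.2))
# Same sample code at another institution: a different key, so accepted.
conn.execute(insert, (1, "2004-09-15", 1, 20, 6.8))

# Repeating the full key is rejected by the database itself.
try:
    conn.execute(insert, (1, "2004-09-15", 1, 10, 7.5))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_rows = conn.execute(
    "SELECT COUNT(*) FROM PH_INSTRUMENTAL_ANALYSIS"
).fetchone()[0]
```

Because the institution identifier is part of the key, two institutions can reuse the same sample code without collision, while a repeated observation from the same institution is refused.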
Each genotype is linked to the fruit quality assessment tables (PH_INSTRUMENTAL_ANALYSIS, PH_SENSORIAL_ANALYSIS, PH_EXPERT_PANEL, PH_EXTERNAL_ANALYSIS) through the GT_TREE table (primary key TREE_LABEL) and then the PH_SAMPLE table. The PH_SAMPLE_ID field links all the information from instrumental, sensorial and disease observations to trees, and thereby to genotypes.
Each tree is assessed individually, making it possible to connect phenotypic observations with molecular marker data by means of the genotype. This structure allows users to select, for example, a genotype with fruits that have the same level of sugar content and the same starch index, or are similar or dissimilar for other important characteristics. The primary key for PH_DISEASE_ASSESSMENT (the table is included in the DISEASE RESISTANCE class) is also a composite key. This key includes the identifiers of each individual tree, observation date, disease identifier and observed plant organ, as well as an identifier for the applied method of assessment.
MOLECULAR DATA super-class
The objective of the HiDRAS project is to molecularly characterize all the individuals belonging to a selected pedigree using highly informative markers. Families and their connected progenies are chosen for being representative of apple breeding material and differentiated for fruit quality and disease resistance.
One aspect of the project concerns the development of new highly informative molecular markers to fill the gaps in the available apple linkage maps. The origin of all alleles of each marker/genotype combination is assessed in terms of the alleles of the founding cultivars (identity by descent) by analysing marker data.
The MOLECULAR DATA super-class (Fig. 4) is one of the most important components of the model. Its data describe the genetic constitution of each genotype (allelic composition of molecular markers and major genes) and must allow alleles to be traced over generations. Starting from the genotype, all information is linked in the database as a chain. The molecular information is linked to the genotype by the GT_DNA_SAMPLE table and the DNA_SAMPLE_ID. In the MOLECULAR DATA super-class, the MOL_DNA_LINK_LOCUS table and the LOCUS_ID make the link with the MOL_LOCUS, MOL_ALLELE, MOL_MAPS and MOL_MARKER tables.
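The chain from genotype to allele data can be followed with a sequence of joins, sketched here with sqlite3 standing in for MySQL. The table names and the keys TREE_LABEL, DNA_SAMPLE_ID and LOCUS_ID are taken from the text; GENOTYPE_NAME, LOCUS_NAME, ALLELE_SIZE_BP and all values are hypothetical simplifications.

```python
import sqlite3

# SQLite stands in for MySQL. GENOTYPE_NAME, LOCUS_NAME and ALLELE_SIZE_BP
# are hypothetical columns; the real tables hold many more fields.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GT_TREE (
    TREE_LABEL TEXT PRIMARY KEY,
    GENOTYPE_NAME TEXT
);
CREATE TABLE GT_DNA_SAMPLE (
    DNA_SAMPLE_ID INTEGER PRIMARY KEY,
    TREE_LABEL TEXT REFERENCES GT_TREE(TREE_LABEL)
);
CREATE TABLE MOL_LOCUS (
    LOCUS_ID INTEGER PRIMARY KEY,
    LOCUS_NAME TEXT
);
CREATE TABLE MOL_DNA_LINK_LOCUS (
    DNA_SAMPLE_ID INTEGER REFERENCES GT_DNA_SAMPLE(DNA_SAMPLE_ID),
    LOCUS_ID INTEGER REFERENCES MOL_LOCUS(LOCUS_ID),
    ALLELE_SIZE_BP INTEGER
);
INSERT INTO GT_TREE VALUES ('T001', 'Genotype-X');
INSERT INTO GT_DNA_SAMPLE VALUES (1, 'T001');
INSERT INTO MOL_LOCUS VALUES (100, 'Locus-1');
INSERT INTO MOL_DNA_LINK_LOCUS VALUES (1, 100, 152);
INSERT INTO MOL_DNA_LINK_LOCUS VALUES (1, 100, 160);
""")

# Walk the chain: genotype -> tree -> DNA sample -> link table -> locus.
alleles = conn.execute("""
    SELECT t.GENOTYPE_NAME, l.LOCUS_NAME, k.ALLELE_SIZE_BP
    FROM GT_TREE AS t
    JOIN GT_DNA_SAMPLE AS d ON d.TREE_LABEL = t.TREE_LABEL
    JOIN MOL_DNA_LINK_LOCUS AS k ON k.DNA_SAMPLE_ID = d.DNA_SAMPLE_ID
    JOIN MOL_LOCUS AS l ON l.LOCUS_ID = k.LOCUS_ID
    ORDER BY k.ALLELE_SIZE_BP
""").fetchall()
```

In this toy data set the query returns two alleles at one locus for a single genotype; MOL_ALLELE, MOL_MAPS and MOL_MARKER would be joined on in the same way.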
The content of the MOL_LOCUS and MOL_MARKER tables is described in Table 5.
Due to the links between all the tables, the AppleBreed DB can easily provide input data for QTL software to search for combinations of certain molecular markers and fruit quality traits (e.g. skin colour, shape or global taste).
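As an illustration of how linked marker and trait records might be reshaped into a single input file, the sketch below pivots two lists of query results into one genotype-per-row CSV table. The `to_wide` helper, the row layout, the `NA` missing-value code and the data are all assumptions; the formats actually required by individual QTL packages differ and are not specified here.

```python
import csv
import io

# Hypothetical rows, shaped as they might come out of joined queries.
marker_rows = [
    ("Genotype-X", "Locus-1", "152/160"),
    ("Genotype-Y", "Locus-1", "152/152"),
]
trait_rows = [
    ("Genotype-X", "firmness", 7.2),
    ("Genotype-Y", "firmness", 6.1),
]


def to_wide(marker_rows, trait_rows, missing="NA"):
    """Pivot long marker and trait records into one CSV row per genotype."""
    table = {}
    for genotype, locus, call in marker_rows:
        table.setdefault(genotype, {})[locus] = call
    for genotype, trait, value in trait_rows:
        table.setdefault(genotype, {})[trait] = value
    # Column order is alphabetical so that output is reproducible.
    columns = sorted({name for row in table.values() for name in row})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["genotype"] + columns)
    for genotype in sorted(table):
        writer.writerow(
            [genotype] + [table[genotype].get(c, missing) for c in columns]
        )
    return out.getvalue()


wide_csv = to_wide(marker_rows, trait_rows)
```

Genotypes lacking a marker call or a trait value receive the missing-value code rather than being dropped, so that marker and trait columns stay aligned.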
AppleBreed DB was implemented within a MySQL database system and a Linux environment. A web interface was developed in PHP. Figure 5 illustrates the data management system adopted for the submission and validation of data. Users send their data to the database administrator via specific, standardized templates (Excel files). Data quality control involves three steps: (1) the structure of the encoding templates (templates created to collect data include controls on the allowed numeric values or class evaluations), (2) the quality check by the database manager and (3) the constraints existing in the database structure itself (a journal with the error values is generated). Once these checks are completed, the results regarding suspicious data are sent back to users for validation. After re-submission, the administrator carries out the transfer and integration of data into the database structure. Finally, users can visualize and upload both the raw and interpreted results by accessing specific web pages. Simple SQL queries allow on-line access to the database through the Internet. Various real-time query tools have been developed, including specific multiple-choice questionnaires for different views of the requested information. Data output formats can be generated ‘à la carte’, making output directly compatible with a wide range of software packages, including packages for QTL mapping.
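The range checks applied during submission can be pictured with a small sketch. The trait names, the allowed ranges and the `validate` helper are hypothetical; the actual controls are built into the Excel templates and the database constraints described above.

```python
# Hypothetical rules: trait -> (minimum, maximum) allowed numeric value.
RULES = {"firmness": (0.0, 15.0), "sugar_brix": (5.0, 25.0)}


def validate(records):
    """Split submitted records into accepted rows and an error journal.

    Each journal entry is (record number, reason), so that suspicious
    values can be sent back to the submitter for validation.
    """
    accepted, journal = [], []
    for number, record in enumerate(records, start=1):
        low, high = RULES.get(record["trait"], (None, None))
        if low is None:
            journal.append((number, "unknown trait"))
        elif not (low <= record["value"] <= high):
            journal.append((number, "value out of range"))
        else:
            accepted.append(record)
    return accepted, journal


submitted = [
    {"trait": "firmness", "value": 7.2},
    {"trait": "firmness", "value": 72.0},  # likely a misplaced decimal
    {"trait": "colour", "value": 1},       # trait not in the rule set
]
accepted, journal = validate(submitted)
```

Only the first record is accepted here; the other two end up in the journal with their record numbers, mirroring the feedback loop between administrator and users.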
AppleBreed DB is built on a relational model. The structure of its conceptual model allows for the flexible addition of new entities. In other words, the AppleBreed DB structure allows data with new characteristics to be easily and quickly integrated into the database, at least as long as the database integrity rules are respected. The ability to encode new data into the database is checked by the database structure itself.
Due to the relational structure of the database, users’ queries are easily handled through SQL requests. Other potential real-time query tools can be easily added, such as specific multiple-choice questionnaires for different views of the requested information. Modules to export data in ‘à la carte’ output formats are also under development, making data directly compatible with a wide range of software packages, including packages for QTL mapping. An interesting point for geneticists and breeders is that it is possible to manage traceability of plant material, a genotype or a family and to follow the parents and their descendants. In addition, the flexibility of the data model makes it possible to adapt this system for other multi-annual botanical species. Unfortunately, one characteristic of relational databases might represent an inconvenience. Direct encoding of results is not allowed, for example, for new genotypes or markers. It is always necessary to insert new data in a particular and logical order and according to a specific and defined format.
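The insertion-order constraint follows from foreign keys, as this sketch shows. It uses sqlite3 with foreign key enforcement switched on as a stand-in for MySQL; the two tables are reduced to their keys and the `try_insert` helper is hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; tables are reduced to their key columns.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE GT_TREE (TREE_LABEL TEXT PRIMARY KEY);
CREATE TABLE GT_DNA_SAMPLE (
    DNA_SAMPLE_ID INTEGER PRIMARY KEY,
    TREE_LABEL TEXT NOT NULL REFERENCES GT_TREE(TREE_LABEL)
);
""")


def try_insert(sql, params):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# A DNA sample cannot be stored before the tree it was taken from.
sample_before_tree = try_insert("INSERT INTO GT_DNA_SAMPLE VALUES (?, ?)", (1, "T001"))
tree = try_insert("INSERT INTO GT_TREE VALUES (?)", ("T001",))
sample_after_tree = try_insert("INSERT INTO GT_DNA_SAMPLE VALUES (?, ?)", (1, "T001"))
```

The first attempt fails and the last succeeds, which is why results for a new genotype or marker can only be loaded after the records they depend on.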
AppleBreed DB can store phenotypic data at the level on which they were originally assessed, including at the level of individual samples. In addition, the position of trees in the orchard and the genetic relationships among genotypes are documented. Together, this allows in-depth analysis of the data because experimental design, position effects, genetic relationships and experimental variation can be taken into account.
This not only allows in-depth classical analysis of the phenotypic data itself, such as heritability estimates and the effect of different cultivation practices and environments, but also ensures a high-power detection of marker-trait associations. As it stands, AppleBreed DB will be a powerful tool for resolving the genetic base of horticulturally important traits. In addition, it has the potential to support valorization of EST and genome sequencing projects, since its phenotypic and genetic data can be helpful in the identification of the candidate genes validated by geneticists. Currently, there are various public databases for perennial crops that are related to different aspects of genetics and breeding. The USDA-ARS Germplasm Resources Information Network (GRIN, http://www.ars-grin.gov/npgs/) is a database which stores information about clonal germplasm in the USDA system, including various tree species such as apples, pears, stone fruits, grapes, etc. The Genome Database for Rosaceae (GDR, http://www.mainlab.clemson.edu/gdr/) is a curated and integrated web-based relational database. GDR contains data on physical and linkage maps, annotated EST sequences and all publicly available Rosaceae sequences. Although this database started as a database for Prunus, it is now extending to other families of the Rosaceae. Various databases for the management of genetic resources were created by the European Cooperative Programme for Plant Genetic Resources Networks (ECP/GR). These databases are crop specific and include Apple (http://www.ecpgr.cgiar.org/databases/Crops/Malus.htm; Maggioni et al., 2002), Pear (http://pyrus.cra.wallonie.be/) and various stone fruits (http://www.bordeaux.inra.fr/urefv/base/). The HiDRAS SSRdb (http://www.hidras.unimi.it/) contains detailed information on more than 300 SSR markers that have been mapped in apple. The AppleBreed DB is currently uploading the HiDRAS data, most of which are likely to become public. All these databases are relational, curated and web based.
They are continuously extending in content and functionality. Much synergism could be obtained by tuning into their policies, content and formats, and much added value could be obtained if private databases such as the HortResearch Apple EST Database (Crowhurst et al., 2005) became part of the network.
This research was carried out with financial support from the Commission of the European Communities (Contract No. QLK5-CT-2002-01492), Directorate-General Research, Quality of Life and Management of Living Resources Program. This manuscript does not necessarily reflect the Commission's views and in no way anticipates its future policy in this area. Its content is the sole responsibility of the authors. The authors are deeply indebted to all participants of the HiDRAS project for their involvement, collaboration and support in the development of the conceptual data model of AppleBreed DB. Funding to pay the Open Access publication charges was provided by Agricultural Walloon Research Centre of Gembloux (Belgium).
Conflict of Interest: none declared.
Associate Editor: Chris Stoeckert
Received on November 20, 2006; revised on January 11, 2007; accepted on January 12, 2007.
Crowhurst RN, et al. The HortResearch apple EST database – a resource for apple genetics and functional genomics. Proceedings of Plant & Animal Genomes XIII Conference (2005). http://www.intl-pag.org/13/abstracts/PAG13_P499.html
Oger R, Lateur M. Development of a specific software for the management of the recurrent synonymous problem of cultivars inside plant genetic resources databases: the case of the European ECP/GR Pyrus database. Acta Hortic (2004) 663: 593–596.
Rhee SY, et al. The Arabidopsis information resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res (2003) 31: 224–228.
Way RD, et al. Apple (Malus). Acta Hortic (1991) 290: 3–46.
|Super-class||Prefix||Content||Main classes|
|GENOTYPE||GT||Information on the material that represents the genotype (tree, DNA sample), passport data of the genotype (accession, pedigree) and synonyms||Plant material, passport, synonyms|
|PHENOTYPE DATA||PH||Fruit quality results (external, sensorial and instrumental evaluations and expert panel results) and disease resistance evaluations||Fruit quality, disease resistance|
|MOLECULAR DATA||MOL||Information on all results related to markers, linkage groups and allelic forms of the markers, and all necessary information for building maps with markers of a specific genotype. Marker information includes sizes of observed bands, PCR protocols, date and laboratory at which the data were raised, primer sequences, and the gDNA or EST sequences from which the markers were derived.||Allele, markers, locus, linkage group, maps|
|GROWTHSITE||GRO||Information on location of trees and orchards||Site, trial plot|
|ORGANIZATION||ORG||Information about institutions supervising the site and the trial plot||Institution|
|REFERENCES||REF||Information on literature references used to describe the genotypes and the evaluation procedure||Reference|
|Super-classes||Main classes||Main tables||Content|
|GENOTYPE||PLANT MATERIAL||GT_TREE||Trees traced in the model|
|GENOTYPE||PLANT MATERIAL||GT_DNA_SAMPLE||Information on DNA samples used in the model|
|GENOTYPE||PASSPORT||GT_ACCESSION||Type of material (cultivar, breeding selection, segregating population, gene bank accession) and their names, including synonyms|
|GENOTYPE||PASSPORT||GT_PEDIGREE||Parents of each accession, if known|
|GENOTYPE||SYNONYMS||GT_SYNONYM||List of synonyms for each patronym|
|GENOTYPE||SYNONYMS||GT_PATRONYM||Patronym names with their literature references|
|PHENOTYPE DATA||FRUIT QUALITY||PH_INSTRUMENTAL_ANALYSIS||Instrumental analysis made during the observation period|
|PHENOTYPE DATA||FRUIT QUALITY||PH_EXTERNAL_ANALYSIS||External analysis made during the observation period|
|PHENOTYPE DATA||FRUIT QUALITY||PH_SENSORIAL_ANALYSIS||Sensorial analysis made during the observation period|
|PHENOTYPE DATA||FRUIT QUALITY||PH_EXPERT_PANEL||Expert panel evaluation|
|PHENOTYPE DATA||FRUIT QUALITY||PH_SAMPLE||Sampling of the fruit to facilitate the traceability of the information|
|PHENOTYPE DATA||DISEASE RESISTANCE||PH_DISEASE||Information on diseases|
|PHENOTYPE DATA||DISEASE RESISTANCE||PH_DISEASE_ASSESSMENT||Observations made over several years|
|PHENOTYPE DATA||DISEASE RESISTANCE||PH_DISEASE_ORGAN||Information on the plant organ evaluated|
|MOLECULAR DATA||ALLELE||MOL_ALLELE||Allele information|
|MOLECULAR DATA||MARKERS||MOL_MARKERS||Markers used in the molecular analyses|
|MOLECULAR DATA||LOCUS||MOL_LOCUS||Locus names and other locus information|
|MOLECULAR DATA||LINKAGE GROUP||MOL_LG||Linkage group information: link between genotype, loci and allele|
|GROWTHSITE||SITE||GRO_SITE||Sites used to locate the genotype|
|GROWTHSITE||TRIAL PLOT||GRO_TRIAL_PLOT||Trial plots used to locate the genotype|
|GROWTHSITE||TRIAL PLOT||GRO_TP_FERTILITY||Soil fertility classes|
|GROWTHSITE||TRIAL PLOT||GRO_TP_DRAINAGE||Soil drainage classes|
|GROWTHSITE||TRIAL PLOT||GRO_TP_ORGANIC_MATTER||Soil organic matter classes|
|GROWTHSITE||TRIAL PLOT||GRO_TP_TEXTURE||Soil texture classes|
|ORGANIZATION||INSTITUTION||ORG_INSTITUTIONS||Institutions which supervised a growth site|
|REFERENCES||REFERENCES||REF_REFERENCES||Information on references used to describe the genotypes or the analysis methods|
PK: primary key; FK: foreign key.
Figure 1 Conceptual data model of AppleBreed DB and existing links between various super-classes.
Figure 2 Detailed structure of the GENOTYPE super-class.
Figure 3 Detailed structure of the PHENOTYPE DATA super-class.
Figure 4 Detailed structure of the MOLECULAR DATA super-class.
Figure 5 Data flow setup within the framework of the model.