Conceptual data model (CDM)
The CDM includes six main super-classes: GENOTYPE, PHENOTYPE DATA, MOLECULAR DATA, GROWTHSITE, ORGANIZATION and REFERENCE. As shown in Figure 1, all entities are structured around the super-class GENOTYPE, which is the core element of the model. It covers all plant material, represented by individual trees and DNA samples, which can come from any kind of material (cultivars, breeding selections, segregating populations and gene bank accessions). GENOTYPE is subdivided into three classes: PLANT MATERIAL, PASSPORT and SYNONYMS. PLANT MATERIAL includes the two main sub-classes TREE and DNA-SAMPLE, PASSPORT includes the PEDIGREE and ACCESSION main sub-classes, and SYNONYMS includes the SYNONYM and PATRONYM sub-classes.
TREE and DNA-SAMPLE hold the identity descriptors for each individual tree, DNA sample and genotype name. TREE also includes descriptors for the precise location where trees were grown (institute, plot, row and position in row) and their origin (origin of bud wood, year of sowing, planting and grafting, and rootstock). DNA-SAMPLE also includes the origin of each sample (tree from which the sample was derived, date of isolation, position on micro-titre plates of the original sample as well as of its sub-samples, etc.).
ACCESSION is used to identify and characterize the plant material (cultivar, breeding selection, segregating population and gene bank accession information). PEDIGREE describes the parentage of each accession up to the founder level and therefore facilitates ‘Pedigree Genotyping’, a new pedigree-based approach to QTL identification and allele mining in pedigreed populations (Van de Weg et al., 2004). The class SYNONYMS holds the known synonyms and patronyms of each genotype, and accounts for the most frequently occurring typing errors.
Figure 1 shows the relationships between GENOTYPE and the other elements of the database. Each genotype is located in one or more specific trial plots (GROWTHSITE) and each institution (ORGANIZATION) supervises its trial plots. Genotypes are evaluated for their fruit quality and disease resistance (PHENOTYPE DATA). The procedures and results of the genotype DNA analyses are stored in MOLECULAR DATA. Each genotype listed in the database is referenced according to the literature (Silbereisen et al., 1996; Smith, 1971) in the REFERENCE super-class. Table 1 summarizes the information included in each super-class and the corresponding main classes.
Most classes are further divided into one or more generations of sub-classes, until the desired level of detail is reached. All these entities have been converted into tables at the LDM level. A class or sub-class may include one or several tables. The most important tables of the database are listed in Table 2.
As stated earlier, GENOTYPE holds information that identifies genotypes (names of cultivars, breeding selections, crosses and gene bank accession characteristics) and the tangible part of the plant material (trees and DNA samples). Phenotypic information concerns fruit quality and disease resistance. Finally, molecular information relates to the molecular markers used to construct genetic linkage maps, and to allele mining, loci and the pedigree of alleles. Each genotype listed in the database is considered to be a central key for the traceability of the information stored in the AppleBreed DB.
Logical data model (LDM)
The LDM describes the entities defined within each super-class and their relationships with the other entities defined above. The database diagrams (see Figs 2–4) give an external view of the AppleBreed DB data content. The consistency of the data is checked automatically by the database management system itself, at a higher level, according to the rules and relationships defined when the schema is implemented. The LDM is presented in more detail for the super-classes (i) GENOTYPE, (ii) PHENOTYPE DATA and (iii) MOLECULAR DATA, specifying their primary and secondary keys.
In the GENOTYPE super-class (Fig. 2) the GT_TREE and GT_DNA_SAMPLE tables are the most important because they allow the individual genotype to be set up for phenotype assessment and molecular data analysis. Because plant material identification and certification are highly important for genetic studies, the structure of these tables was defined with an emphasis on tracking and tracing. Their detailed content is presented in Tables 3 and 4. The link between them is made through the TREE_LABEL primary key.
The GT_ACCESSION table is used to store the information that is assigned to an accession when it is entered into a collection. The key element of this table is the accession number, which is unique in the collection. Once assigned, this number can never be reassigned to another accession. The GT_PEDIGREE table allows a user to determine whether a relationship exists between the phenotypic characteristics and genomic results of parents and those of their progenies.
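The tracing of parentage up to the founder level, which the pedigree information is meant to support, can be illustrated with a minimal Python sketch. The dictionary layout, the function name `founders` and the accession names are hypothetical and chosen only for illustration; they are not taken from the GT_PEDIGREE table definition.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical pedigree records: accession -> (mother, father).
# An accession absent from the mapping, or with both parents unknown (None),
# is treated as a founder in this sketch.
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def founders(accession: str, pedigree: Pedigree) -> Set[str]:
    """Walk up the parentage of an accession and collect its founders."""
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = [accession]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited through another line of descent
        seen.add(current)
        known_parents = [
            p for p in pedigree.get(current, (None, None)) if p is not None
        ]
        if known_parents:
            stack.extend(known_parents)
        else:
            found.add(current)
    return found


# Illustrative placeholder names, not real accessions.
example: Pedigree = {
    "selection_X": ("cultivar_A", "cultivar_B"),
    "cultivar_B": ("cultivar_C", "cultivar_D"),
}
print(sorted(founders("selection_X", example)))
```

In this sketch an accession with only one known parent is not counted as a founder; a real implementation would need to decide how to handle such partially documented parentage.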
The high number of synonyms for cultivars is a recurrent problem for breeders, geneticists and managers of gene banks, because synonyms impair the efficient management and exploitation of the collections. This is especially true for old genotypes received or collected in different places and at different times.
For example, Cox's Orange Pippin has more than 40 synonyms. In addition, very modern cultivars often have both a cultivar name and a trademark name. Finally, for widely grown cultivars there are often many mutants, each with its own name. For the old cultivars, there are many sources of synonyms. One is the translation or transliteration of original names into local languages. There are also spelling errors due to the ‘appropriation’, over time, of introduced foreign genotypes in local traditions, resulting in new local names adapted to the language or dialect (Oger and Lateur, 2004). This problem can lead to major disappointments. For example, geneticists might believe they are working on different genotypes, only to realize after obtaining their results that they have been working on the same genotype under different names. The database model addresses this problem by using the SYNONYMS main class. The first appellation found in the literature has, in most cases, to be considered as the patronymic name. This name is entered in the identifier field PATRONYM_NAME of the GT_PATRONYM table, as displayed in Figure 2, and the link between the patronymic name of a genotype and its synonyms is ensured through the PATRONYM_ID.
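The link between a patronymic name and its synonyms can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module as a stand-in for the MySQL system; GT_PATRONYM, PATRONYM_NAME and PATRONYM_ID come from the text, whereas the GT_SYNONYM table, its SYNONYM_NAME column, the helper `resolve_patronym` and the synonym spellings are assumptions made for the example.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE GT_PATRONYM (
        PATRONYM_ID   INTEGER PRIMARY KEY,
        PATRONYM_NAME TEXT NOT NULL UNIQUE
    );
    -- Hypothetical synonym table; each synonym points to one patronym.
    CREATE TABLE GT_SYNONYM (
        SYNONYM_NAME TEXT PRIMARY KEY,
        PATRONYM_ID  INTEGER NOT NULL REFERENCES GT_PATRONYM(PATRONYM_ID)
    );
    """
)
conn.execute("INSERT INTO GT_PATRONYM VALUES (?, ?)", (1, "Cox's Orange Pippin"))
# Illustrative spellings only (a shortened form and a typing error).
conn.executemany(
    "INSERT INTO GT_SYNONYM VALUES (?, ?)",
    [("Cox Orange", 1), ("Cox's Orange Pipin", 1)],
)


def resolve_patronym(conn: sqlite3.Connection, name: str) -> Optional[str]:
    """Return the patronymic name for a patronym or any of its synonyms."""
    row = conn.execute(
        """
        SELECT p.PATRONYM_NAME
        FROM GT_PATRONYM AS p
        LEFT JOIN GT_SYNONYM AS s ON s.PATRONYM_ID = p.PATRONYM_ID
        WHERE p.PATRONYM_NAME = ? OR s.SYNONYM_NAME = ?
        LIMIT 1
        """,
        (name, name),
    ).fetchone()
    return row[0] if row else None


print(resolve_patronym(conn, "Cox Orange"))
```

With such a lookup, data recorded under different names can be grouped under a single genotype before any analysis is run.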
PHENOTYPE DATA super-class
The PHENOTYPE DATA super-class (Fig. 3) includes two main classes: FRUIT QUALITY and DISEASE RESISTANCE. Each genotype is studied for several traits (Gianfranceschi and Soglio, 2004), such as: fruit external characteristics (shape, ground colour, overall colour, fruit size, etc.), fruit internal quality (sugar content, starch index, acidity, etc.), the sensorial evaluations of expert panels to determine the quality of the fruits (sourness, juiciness, firmness, etc.) and the disease levels under natural conditions in the orchard as well as in specially designed greenhouse tests. Data are encoded for each individual assessment, which can be made for a series of individual apples (e.g. firmness data), for different dates (e.g. 0, 2 and 4 months after harvest), localities and years.
Figure 3 also displays the relationships among the main tables of this super-class as well as the relationships with tables included in other entities, such as GENOTYPE and GROWTHSITE. With regard to the sensorial, instrumental, external and expert panel observations, a composite primary key identifies each observation. This key includes an identifier for the sample, an identifier for the harvest time (a date), an identifier for the applied method of assessment and an identifier for the institution making the observations.
This kind of primary key structure gives each institution the possibility of marking its own samples (there is a unique sample code number for each institution).
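The effect of such a composite primary key can be shown with a short sketch, again using `sqlite3` in place of MySQL. The column names (SAMPLE_ID, HARVEST_DATE, METHOD_ID, ORGANIZATION_ID, FIRMNESS), the helper `insert_observation` and all values are assumptions for illustration; only the four components of the key are taken from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE PH_INSTRUMENTAL_ANALYSIS (
        SAMPLE_ID       INTEGER NOT NULL,  -- sample code, unique per institution
        HARVEST_DATE    TEXT    NOT NULL,  -- harvest time (a date)
        METHOD_ID       INTEGER NOT NULL,  -- applied method of assessment
        ORGANIZATION_ID INTEGER NOT NULL,  -- institution making the observation
        FIRMNESS        REAL,              -- hypothetical measured value
        PRIMARY KEY (SAMPLE_ID, HARVEST_DATE, METHOD_ID, ORGANIZATION_ID)
    )
    """
)


def insert_observation(conn: sqlite3.Connection, key: tuple, firmness: float) -> bool:
    """Insert one observation; return False if the composite key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO PH_INSTRUMENTAL_ANALYSIS VALUES (?, ?, ?, ?, ?)",
                (*key, firmness),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Two institutions may reuse the same sample code without conflict.
first = insert_observation(conn, (1, "2005-09-15", 1, 10), 7.2)
other_institution = insert_observation(conn, (1, "2005-09-15", 1, 20), 6.8)
# Repeating the full key for the same institution is rejected.
duplicate = insert_observation(conn, (1, "2005-09-15", 1, 10), 7.5)
print(first, other_institution, duplicate)
```

Because the institution identifier is part of the key, sample codes only need to be unique within one institution, which is the property described above.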
Each genotype is linked to the fruit quality assessment tables (PH_INSTRUMENTAL_ANALYSIS, PH_SENSORIAL_ANALYSIS, PH_EXPERT_PANEL) through two successive tables: GT_TREE (via PK_TREE_LABEL) and PH_SAMPLE. The PH_SAMPLE_ID field links all the information from instrumental, sensorial and disease observations to trees, and thereby to genotypes.
Each tree is assessed individually, making it possible to connect phenotypic observations with molecular marker data by means of the genotype. This structure allows users to select, for example, genotypes whose fruits have the same level of sugar content and the same starch index, or that are similar or dissimilar for other important characteristics. The primary key for PH_DISEASE_ASSESSMENT (a table included in the DISEASE RESISTANCE class) is also a composite key. This key includes the identifiers of each individual tree, the observation date, the disease identifier and the observed plant organ, as well as an identifier for the applied method of assessment.
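The chain from an observation to a genotype (assessment table, then PH_SAMPLE, then GT_TREE) can be sketched as a simple join. The example uses `sqlite3` as a stand-in for MySQL; the table names and PH_SAMPLE_ID and TREE_LABEL follow the text, while GENOTYPE_NAME, SUGAR_CONTENT, the helper `genotypes_with_sugar_between` and all values are hypothetical.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE GT_TREE (
        TREE_LABEL    TEXT PRIMARY KEY,
        GENOTYPE_NAME TEXT NOT NULL          -- hypothetical column
    );
    CREATE TABLE PH_SAMPLE (
        PH_SAMPLE_ID INTEGER PRIMARY KEY,
        TREE_LABEL   TEXT NOT NULL REFERENCES GT_TREE(TREE_LABEL)
    );
    CREATE TABLE PH_INSTRUMENTAL_ANALYSIS (
        PH_SAMPLE_ID  INTEGER NOT NULL REFERENCES PH_SAMPLE(PH_SAMPLE_ID),
        SUGAR_CONTENT REAL NOT NULL          -- hypothetical column
    );
    """
)
# Illustrative placeholder data: two trees of genotype_1, one of genotype_2.
conn.executemany(
    "INSERT INTO GT_TREE VALUES (?, ?)",
    [("T1", "genotype_1"), ("T2", "genotype_2"), ("T3", "genotype_1")],
)
conn.executemany("INSERT INTO PH_SAMPLE VALUES (?, ?)", [(1, "T1"), (2, "T2"), (3, "T3")])
conn.executemany(
    "INSERT INTO PH_INSTRUMENTAL_ANALYSIS VALUES (?, ?)",
    [(1, 12.0), (2, 14.5), (3, 12.4)],
)


def genotypes_with_sugar_between(
    conn: sqlite3.Connection, low: float, high: float
) -> List[str]:
    """Return genotypes having at least one sample within the sugar range."""
    rows = conn.execute(
        """
        SELECT DISTINCT t.GENOTYPE_NAME
        FROM PH_INSTRUMENTAL_ANALYSIS AS a
        JOIN PH_SAMPLE AS s ON s.PH_SAMPLE_ID = a.PH_SAMPLE_ID
        JOIN GT_TREE   AS t ON t.TREE_LABEL = s.TREE_LABEL
        WHERE a.SUGAR_CONTENT BETWEEN ? AND ?
        ORDER BY t.GENOTYPE_NAME
        """,
        (low, high),
    ).fetchall()
    return [row[0] for row in rows]


print(genotypes_with_sugar_between(conn, 11.5, 13.0))
```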
MOLECULAR DATA super-class
The objective of the HiDRAS project is to molecularly characterize all the individuals belonging to a selected pedigree using highly informative markers. Families and their connected progenies are chosen for being representative of apple breeding material and differentiated for fruit quality and disease resistance.
One aspect of the project concerns the development of new highly informative molecular markers to fill the gaps in the available apple linkage maps. The origin of all alleles of each marker/genotype combination is assessed in terms of the alleles of the founding cultivars (identity by descent) by analysing marker data.
The MOLECULAR DATA super-class (Fig. 4) is one of the most important components of the model. Its data describe the genetic constitution of each genotype (allelic composition of molecular markers and major genes) and must allow alleles to be traced over generations. Starting from the genotype, all information is linked in the database as a chain. The molecular information is linked to the genotype by the GT_DNA_SAMPLE table and the DNA_SAMPLE_ID. In the MOLECULAR DATA super-class, the MOL_DNA_LINK_LOCUS table and the LOCUS_ID make the link with the MOL_LOCUS, MOL_ALLELE, MOL_MAPS and MOL_MARKER tables.
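The chain described above (GT_DNA_SAMPLE, then MOL_DNA_LINK_LOCUS, then MOL_LOCUS) can be sketched in the same way. The table names, DNA_SAMPLE_ID and LOCUS_ID are taken from the text; the remaining columns, the helper `loci_for_tree`, the use of `sqlite3` instead of MySQL and the data are assumptions for illustration.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE GT_DNA_SAMPLE (
        DNA_SAMPLE_ID INTEGER PRIMARY KEY,
        TREE_LABEL    TEXT NOT NULL          -- tree the sample was derived from
    );
    CREATE TABLE MOL_LOCUS (
        LOCUS_ID   INTEGER PRIMARY KEY,
        LOCUS_NAME TEXT NOT NULL             -- hypothetical column
    );
    -- Link table: which loci were scored on which DNA sample.
    CREATE TABLE MOL_DNA_LINK_LOCUS (
        DNA_SAMPLE_ID INTEGER NOT NULL REFERENCES GT_DNA_SAMPLE(DNA_SAMPLE_ID),
        LOCUS_ID      INTEGER NOT NULL REFERENCES MOL_LOCUS(LOCUS_ID),
        PRIMARY KEY (DNA_SAMPLE_ID, LOCUS_ID)
    );
    """
)
# Illustrative placeholder data.
conn.executemany("INSERT INTO GT_DNA_SAMPLE VALUES (?, ?)", [(1, "T1"), (2, "T2")])
conn.executemany("INSERT INTO MOL_LOCUS VALUES (?, ?)", [(1, "locus_a"), (2, "locus_b")])
conn.executemany("INSERT INTO MOL_DNA_LINK_LOCUS VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])


def loci_for_tree(conn: sqlite3.Connection, tree_label: str) -> List[str]:
    """Follow the chain tree -> DNA sample -> link table -> locus."""
    rows = conn.execute(
        """
        SELECT DISTINCT l.LOCUS_NAME
        FROM GT_DNA_SAMPLE AS d
        JOIN MOL_DNA_LINK_LOCUS AS k ON k.DNA_SAMPLE_ID = d.DNA_SAMPLE_ID
        JOIN MOL_LOCUS AS l ON l.LOCUS_ID = k.LOCUS_ID
        WHERE d.TREE_LABEL = ?
        ORDER BY l.LOCUS_NAME
        """,
        (tree_label,),
    ).fetchall()
    return [row[0] for row in rows]


print(loci_for_tree(conn, "T1"))
```

Since phenotypic samples are attached to the same tree labels, a query of this kind could be combined with the phenotype tables to pair marker data with trait values per genotype.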
The content of the MOL_LOCUS and MOL_MARKER tables is described in Table 5.
Due to the links between all the tables, the AppleBreed DB can easily provide input data for QTL software to search for combinations of certain molecular markers and fruit quality traits (e.g. skin colour, shape or global taste).
AppleBreed DB was implemented within a MySQL database system and a Linux environment. A web interface was developed in the PHP language. Figure 5 illustrates the data management system adopted for the submission and validation of data. Users send their data to the database administrator via specific, standardized templates (Excel files). Data quality control involves three steps: (1) the structure of the encoding templates (the templates created to collect data include controls on the allowed numeric values or class evaluations), (2) the quality check by the database manager and (3) the constraints existing in the database structure itself (a log of the erroneous values is generated). Once these checks are completed, the results regarding suspicious data are sent back to users for validation. After re-submission, the administrator carries out the transfer and integration of data into the database structure. Finally, users can visualize and upload both the raw and interpreted results by accessing specific web pages. Simple SQL queries allow on-line access to the database through the Internet. Various real-time query tools have been developed, including specific multiple-choice questionnaires for different views of the requested information. Data output formats can be generated ‘à la carte’, making output directly compatible with a wide range of software packages, including packages for QTL mapping.
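The first quality-control step, checking submitted values against allowed numeric ranges or evaluation classes, can be sketched in Python as follows. The field names, the ranges, the class scale and the function `validate_rows` are hypothetical; the actual template rules are not given in the text.

```python
from typing import Any, Dict, List, Set, Tuple

# Hypothetical control rules, standing in for those built into the templates.
NUMERIC_RANGES: Dict[str, Tuple[float, float]] = {"sugar_content": (0.0, 30.0)}
ALLOWED_CLASSES: Dict[str, Set[int]] = {"starch_index": set(range(1, 11))}


def validate_rows(rows: List[Dict[str, Any]]) -> List[Tuple[int, str, Any]]:
    """Return a log of (row number, field, value) for every rejected value."""
    errors: List[Tuple[int, str, Any]] = []
    for row_number, row in enumerate(rows, start=1):
        for field, (low, high) in NUMERIC_RANGES.items():
            value = row.get(field)
            if not isinstance(value, (int, float)) or not low <= value <= high:
                errors.append((row_number, field, value))
        for field, allowed in ALLOWED_CLASSES.items():
            value = row.get(field)
            if value not in allowed:
                errors.append((row_number, field, value))
    return errors


submitted = [
    {"sugar_content": 12.5, "starch_index": 4},
    {"sugar_content": -3, "starch_index": 12},
]
for entry in validate_rows(submitted):
    print(entry)
```

A log of this kind corresponds to the list of suspicious values that is sent back to users for validation before the data are integrated.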