Richard Bruskiewich,1* Martin Senger,1 Guy Davenport,2 Manuel Ruiz,3 Mathieu Rouard,4 Tom Hazekamp,4 Masaru Takeya,5 Koji Doi,5 Kouji Satoh,5 Marcos Costa,6 Reinhard Simon,7 Jayashree Balaji,8 Akinnola Akintunde,9 Ramil Mauleon,1 Samart Wanchana,1,10 Trushar Shah,2 Mylah Anacleto,1 Arllet Portugal,1 Victor Jun Ulat,1 Supat Thongjuea,10 Kyle Braak,2 Sebastian Ritter,2 Alexis Dereeper,3 Milko Skofic,4 Edwin Rojas,7 Natalia Martins,6 Georgios Pappas,6 Ryan Alamban,1 Roque Almodiel,1 Lord Hendrix Barboza,1 Jeffrey Detras,1 Kevin Manansala,1 Michael Jonathan Mendoza,1 Jeffrey Morales,1 Barry Peralta,1 Rowena Valerio,1 Yi Zhang,1 Sergio Gregorio,1,11 Joseph Hermocilla,1,11 Michael Echavez,1,12 Jan Michael Yap,1,12 Andrew Farmer,13 Gary Schiltz,13 Jennifer Lee,14 Terry Casstevens,15 Pankaj Jaiswal,15 Ayton Meintjes,16 Mark Wilkinson,17 Benjamin Good,18,19 James Wagner,18,19 Jane Morris,16 David Marshall,14 Anthony Collins,7 Shoshi Kikuchi,5 Thomas Metz,1 Graham McLaren,1 and Theo van Hintum,20
1Crop Research Informatics Laboratory, International Rice Research Institute (IRRI), DAPO Box 7777, Metro Manila, Philippines
2Crop Research Informatics Laboratory, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600 Mexico, DF, Mexico
3Centre International de Recherche Agronomique pour le Développement (CIRAD), Avenue Agropolis, 34398 Montpellier, Cedex 5, France
4Bioversity International, Via dei Tre Denari 472/a, 00057 Maccarese (Fiumicino), Rome, Italy
5National Institute for Agrobiological Sciences (NIAS), Kannondai 2-1-2, Tsukuba, Ibaraki 305-8602, Japan
6Empresa Brasileira de Pesquisa Agropecuaria (EMBRAPA), Parque Estação Biologia Final W5 Norte, 70770-900 Brasilia, DF, Brazil
7Centro Internacional de la Papa (CIP), Avenida La Molina 1895, La Molina, Apartado Postal 1558, Lima 12, Peru
8International Crops Research Institute for the Semi-Arid Tropics, Patancheru, Andhra Pradesh 502324, India
9International Center for Agricultural Research in the Dry Areas, P.O. Box 5466, Aleppo, Syria
10National Center for Genetic Engineering and Biotechnology, 113 Thailand Science Park, Phahonyothin Road, Klong 1, Klong Luang, Pathumthani 12120, Thailand
11Institute of Computer Science, College of Arts and Sciences, University of the Philippines, Los Baños, Laguna 4031, Philippines
12Department of Computer Science, University of the Philippines, Room 215, Melchor Hall, Diliman, Quezon City 1101, Philippines
13National Center for Genome Resources, 2935 Rodeo Park Drive East, Santa Fe, NM 87505, USA
14Scottish Crop Research Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK
15Department of Plant Breeding, Cornell University, Ithaca, NY 14853, USA
16African Centre for Gene Technologies, P.O. Box 75011, Lynnwood Ridge 0040, South Africa
17Department of Medical Genetics, Faculty of Medicine, The University of British Columbia, Vancouver, BC, Canada V6T 1Z3
18School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A 1S6
19Bioinformatics Graduate Program, Genome Sciences Centre, BC Cancer Agency, 100-570 West 7th Avenue, Vancouver, BC, Canada V5Z 4S6
20Centre for Genetic Resources, The Netherlands (CGN), P.O. Box 16, 6700 AA Wageningen, The Netherlands
*Richard Bruskiewich: Email: [email protected]
Recommended by Chunguang Du
Received September 22, 2007; Accepted December 14, 2007.
This is an open access article from Int J Plant Genomics. 2008; 2008: 369601 distributed under the Creative Commons Attribution License.
The Generation Challenge Programme (GCP) is a global crop research consortium directed toward crop improvement through the application of comparative biology and genetic resources characterization to plant breeding. A key consortium research activity is the development of a GCP crop bioinformatics platform to support GCP research. This platform includes the following: (i) shared, public platform-independent domain models, ontology, and data formats to enable interoperability of data and analysis flows within the platform; (ii) web service and registry technologies to identify, share, and integrate information across diverse, globally dispersed data sources, as well as to access high-performance computational (HPC) facilities for computationally intensive, high-throughput analyses of project data; (iii) platform-specific middleware reference implementations of the domain model integrating a suite of public (largely open-access/-source) databases and software tools into a workbench to facilitate biodiversity analysis, comparative analysis of crop genomic data, and plant breeding decision making.
The fast-moving fields of comparative genomics, molecular breeding, and bioinformatics have the potential to bring new knowledge to bear on problems encountered by resource-poor farmers. These problems include abiotic stresses (such as drought and soil salinity) and biotic stresses (such as plant diseases and pests). The Generation Challenge Programme (GCP; http://www.generationcp.org/) aims to exploit advances in molecular biology to harness the rich global heritage of plant genetic resources and contribute to a new generation of stress-tolerant varieties that meet the needs of these farmers and the consumers of their crops. The GCP brings together three sets of partners: member agricultural research institutes of the Consultative Group on International Agricultural Research (CGIAR; http://www.cgiar.org/), advanced research institutes in developed countries, and national agricultural research and extension systems in developing countries, to undertake a long-term program of globally integrated scientific research, capacity building, and delivery of products for the above goal.
Central to GCP activities is the development of an integrated platform of molecular biology and bioinformatics tools to be applied to the research objectives of the GCP. The resulting platform is also intended to be a “global public good” to be made freely available to all crop researchers and breeders around the world, thus enabling agricultural scientists, particularly in developing countries, to more readily apply information about elite genetic stocks, genomic knowledge, and new breeding technologies that are becoming available to their local breeding programmes.
The goal of this GCP crop informatics platform is to provide solutions for priority end-user needs for biodiversity analysis, comparative analysis of crop genomic data, and plant breeding decision making. Development of the platform is driven by the following observations:
A GCP crop information platform is being developed to better meet these challenges by managing genetic resources, genomics, and crop information using the following components:
This paper will survey progress on some of the central components of the platform, with a special emphasis on the domain model, a reference Java middleware implementation, and Internet protocol aspects of the project.
GCP domain model
To cope with the scope, diversity, and dispersion of crop information, GCP researchers formulated a vision to specify a consensus blueprint of a scientific domain model and associated ontology. The resulting models and ontology allow a “model-driven architecture” for the development of GCP software and network protocols.
The domain model is documented in Unified Modeling Language (UML). Computable versions of the UML model are archived in the DemeterUML folder of the “Pantheon” project in the CropForge (http://cropforge.org/) software project repository. The UML diagrams themselves are indexed and published with supporting narratives on a project website (http://pantheon.generationcp.org/demeter). The bulk of the models are specified with the UML <<interface>> stereotype.
At the heart of the domain model are generic core model interfaces from which other specific scientific model interfaces are derived. This core model starts with the concept of simple identification of data objects in the system (using the SimpleIdentifier interface), which is extended by several more specific interfaces. The core includes a general concept of Entity, which serves as the superclass for most other interfaces describing major scientific concepts or data types in the system. The Entity interface documents generic metadata about objects in the system, including specific annotation of object characteristics using a rich Feature model. Other packages in the core models provide utility models for ontology, publication, and experimental study management.
Additional scientific models are derived as extensions of the core models. For example, the base interface classes of most specific major concepts or experimental objects in the scientific domain of discourse of the GCP, such as Germplasm, Map, or GeneProduct, directly extend the Entity model, adding subdomain-specific attributes as required. More lightweight concepts in the system extend simpler interfaces such as Feature.
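The inheritance pattern described above can be illustrated with a minimal Java sketch. It is not the Demeter specification itself: only the interface names SimpleIdentifier, Entity, Feature, and Germplasm come from the text, while every accessor, the SimpleGermplasm class, and the sample values are hypothetical and chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the core interface hierarchy; accessor names are hypothetical. */
final class DomainModelSketch {

    /** Simple identification of a data object in the system. */
    interface SimpleIdentifier {
        String getId();
    }

    /** Annotation of one object characteristic (the type would be a CVO term). */
    interface Feature {
        String getType();
        String getValue();
    }

    /** Generic metadata shared by the major scientific concepts. */
    interface Entity extends SimpleIdentifier {
        String getName();
        List<Feature> getFeatures();
    }

    /** Subdomain concept that adds its own attributes to Entity. */
    interface Germplasm extends Entity {
        String getSpecies();
    }

    /** Minimal in-memory implementation, for demonstration only. */
    static final class SimpleGermplasm implements Germplasm {
        private final String id;
        private final String name;
        private final String species;
        private final List<Feature> features = new ArrayList<>();

        SimpleGermplasm(String id, String name, String species) {
            this.id = id;
            this.name = name;
            this.species = species;
        }

        @Override public String getId() { return id; }
        @Override public String getName() { return name; }
        @Override public String getSpecies() { return species; }
        @Override public List<Feature> getFeatures() { return features; }
    }

    /** Builds a read-only Feature from a type/value pair. */
    static Feature feature(final String type, final String value) {
        return new Feature() {
            @Override public String getType() { return type; }
            @Override public String getValue() { return value; }
        };
    }

    /** Written against Entity, so it works for any subdomain type. */
    static String describe(Entity entity) {
        StringBuilder text = new StringBuilder(entity.getId()).append(' ').append(entity.getName());
        for (Feature f : entity.getFeatures()) {
            text.append(" [").append(f.getType()).append('=').append(f.getValue()).append(']');
        }
        return text.toString();
    }

    public static void main(String[] args) {
        SimpleGermplasm g = new SimpleGermplasm("germplasm:1", "Example accession", "Oryza sativa");
        g.getFeatures().add(feature("plant height", "95 cm"));
        System.out.println(describe(g));
    }
}
```

Because describe() depends only on Entity, the same code would serve a Map or GeneProduct implementation unchanged, which is the practical benefit of deriving subdomain interfaces from a shared core.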
For the elaboration of specific components of the core, as well as scientific domain models, the project generally adapts extant public domain models. For example, the Germplasm and Study subdomain models are derived from the data models of the open-source International Crop Information System (ICIS, http://www.icis.cgiar.org/; [2–4]). Aspects of the genotype (and associated genetic map and genomic sequence) models are influenced by public initiatives such as the Chado relational database schemata of the Generic Model Organism Database (GMOD) project [5]. The production-release GCP domain model is being validated through feedback from project scientists and developers, who are testing it by practical application in data management and platform implementation.
A significant feature of the domain model is the reliance on extensible controlled vocabulary and ontology (CVO) to define the full semantics of specialized types, feature attributes, and annotation values of instances of the model classes. Where possible, the GCP is simply adopting existing CVO standards, such as those from the Gene Ontology [6], Plant Ontology [7], and Microarray Gene Expression Data Society (MGED) Ontology [8] consortia. Where no appropriate ontology has yet been formalized, new dictionaries of terms are being compiled in collaboration with GCP scientists. CVO dictionaries selected for the platform are being catalogued in a dedicated online database (at http://pantheon.generationcp.org/) with web browser and web service access. Each selected dictionary is assigned a GCP ontology index number to facilitate platform management of the ontology. Where an existing public ontology already has its own accession identifiers (e.g., GO identifiers for the GO CVO), these identifiers are propagated into the full GCP identifier for the corresponding CVO terms. However, newly specified CVO lacking such a number space are assigned de novo GCP accession identifiers.
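The two identifier rules just described (propagating an existing public accession versus assigning a de novo one) can be sketched as follows. The text does not state the actual GCP identifier syntax, so the "GCP_<index>:<accession>" layout, the seven-digit padding, and the index numbers 100, 300, and 301 used below are all assumptions made only for this illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch of CVO term identifier assignment; the identifier layout is hypothetical. */
final class CvoIdentifierSketch {

    /** Last de novo accession number handed out, per ontology index number. */
    private final Map<Integer, Integer> lastLocalAccession = new HashMap<>();

    /** Propagates an existing public accession into the composite identifier. */
    String fromPublicAccession(int ontologyIndex, String publicAccession) {
        return format(ontologyIndex, publicAccession);
    }

    /** Assigns the next de novo accession for a dictionary with no number space of its own. */
    String assignDeNovo(int ontologyIndex) {
        // merge() stores 1 on first use, otherwise adds 1 to the stored value.
        int next = lastLocalAccession.merge(ontologyIndex, 1, Integer::sum);
        return format(ontologyIndex, String.format(Locale.ROOT, "%07d", next));
    }

    /** Hypothetical layout: dictionary index, a colon, then the term accession. */
    private static String format(int ontologyIndex, String accession) {
        return "GCP_" + ontologyIndex + ":" + accession;
    }

    public static void main(String[] args) {
        CvoIdentifierSketch ids = new CvoIdentifierSketch();
        System.out.println(ids.fromPublicAccession(100, "GO:0008150"));
        System.out.println(ids.assignDeNovo(300));
        System.out.println(ids.assignDeNovo(300));
    }
}
```

Keeping the public accession intact inside the composite identifier means that terms can still be looked up in the originating ontology, while the index prefix tells the platform which dictionary a term belongs to.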
Additional support libraries are being provided within Ceres to support GCP domain model-driven DataSource development. In addition to core and support libraries, the Pantheon project provides a clearinghouse for platform and data-type-specific components. These components include adapters (archived in Osiris) that implement the DataSource interface for crop databases at various GCP partner and external sites. Among others, current DataSource implementations include wrappers for ICIS and for GMOD schemata (Chado, GBrowse). Other Pantheon components (in Belenus) provide application support, including a search engine, data visualization, and web service provider implementations. Examples of such application support are support for NCGR ISYS [9], support for stand-alone applications based on Eclipse/RCP, and a web-based, GCP domain-model-compliant search engine (Koios).
For BioMOBY, data types were designed using GCP domain model semantics. Although generally faithful in translating the semantics of the Demeter UML specification of the domain model (e.g., the SimpleIdentifier interface is represented as a GCP_SimpleIdentifier data type), the GCP BioMOBY data types simplify the data representation as a concession to BioMOBY design constraints and to web service performance.
One key example of this is the extensive substitution of GCP_SimpleIdentifier objects for fully detailed data objects at the end of model-to-model association edges found in the Demeter model. The rationale is the expectation that, in most cases, web services can apply “lazy loading” of data-type components: a parent object identifies which objects are embedded in it, but their details are not retrieved until the user needs them. Retrieval is then done through a separate web service that accepts a GCP_SimpleIdentifier for the object and returns the fully populated complex object of the specified type.
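The lazy-loading idea can be sketched in plain Java without any BioMOBY dependency. This is an illustration under stated assumptions, not the Pantheon implementation: SimpleId stands in for a GCP_SimpleIdentifier, fetchMap() simulates the separate detail web service, and the class names, the "GCP_Map" namespace, and the remoteCalls counter are all hypothetical.

```java
import java.util.function.Function;

/** Sketch of lazy loading behind an identifier; all names are hypothetical. */
final class LazyLoadingSketch {

    /** Stand-in for a GCP_SimpleIdentifier: just enough to name an object. */
    static final class SimpleId {
        final String namespace;
        final String id;

        SimpleId(String namespace, String id) {
            this.namespace = namespace;
            this.id = id;
        }
    }

    /** Fully populated object that the detail service would return. */
    static final class MapDetail {
        final SimpleId id;
        final String name;

        MapDetail(SimpleId id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Holds only an identifier and fetches the detail at most once, on first use. */
    static final class LazyRef<T> {
        private final SimpleId id;
        private final Function<SimpleId, T> resolver;
        private T value;
        private boolean loaded;

        LazyRef(SimpleId id, Function<SimpleId, T> resolver) {
            this.id = id;
            this.resolver = resolver;
        }

        SimpleId id() { return id; }

        T get() {
            if (!loaded) {
                value = resolver.apply(id);
                loaded = true;
            }
            return value;
        }
    }

    /** Counts simulated web service invocations. */
    static int remoteCalls = 0;

    /** Simulated detail service: identifier in, populated object out. */
    static MapDetail fetchMap(SimpleId id) {
        remoteCalls++;
        return new MapDetail(id, "Genetic map " + id.id);
    }

    public static void main(String[] args) {
        LazyRef<MapDetail> ref = new LazyRef<>(new SimpleId("GCP_Map", "42"), LazyLoadingSketch::fetchMap);
        System.out.println("calls before use: " + remoteCalls);
        System.out.println(ref.get().name);
        ref.get(); // second access reuses the cached object
        System.out.println("calls after use: " + remoteCalls);
    }
}
```

A parent object that holds LazyRef fields stays small on the wire, and the cost of a remote call is paid only for the associations a user actually opens.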
UML diagrams with supporting explanatory narration for these GCP-specific BioMOBY data types are published on the Pantheon website (http://pantheon.generationcp.org/moby), which is complemented by a website documenting GCP BioMOBY implementation details (http://moby.generationcp.org/). Supporting the BioMOBY protocol in Pantheon are a series of Pantheon modules for interconversion between GCP MOBY data types and Demeter-compliant Java objects, for web service provider implementation, and for a MOBY client DataSource adapter to communicate with GCP-compliant web service providers.
Using GCP model-constrained BioMOBY data types (all prefixed with “GCP_” in their name in the MOBY central registry), various GCP teams are deploying GCP-compliant web services from a common proposed list of documented web service use cases. Concurrently, the MOBY client DataSource adapter is being elaborated to communicate with these web services and import remote data into local “workbench” instances of the GCP platform.
In addition to the data sources and tools mentioned above, open-source third-party analysis tools that are already coded in Java, but are agnostic concerning the GCP framework, are being connected to the platform through targeted software engineering. To this end, GCP developers are connecting several public open-source applications by writing suitable DataSource adapters, DataConsumer, or DataTransformer integration code. These include Java software hosted by GMOD such as the Apollo genome browser [15], tools forming part of the Genomic Diversity and Phenotype Connection (GDPC) protocol such as Tassel [16], and tools such as TIGR Multiple Experiment Viewer [17] for microarray analysis, the Comparative Map and Trait Viewer [18] connected to the NCGR ISYS framework [9], the Cytoscape network visualization tool, and the MAXD microarray system [19].
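How the three kinds of integration code might fit together can be sketched as below. Only the names DataSource, DataTransformer, and DataConsumer come from the text; the method signatures, the generic parameters, and the run() helper are assumptions for illustration and are not the actual Pantheon interfaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch of adapter-style integration; interface signatures are hypothetical. */
final class AdapterPipelineSketch {

    /** Wraps a database or web service and returns domain objects for a query. */
    interface DataSource<T> {
        List<T> fetch(String query);
    }

    /** Converts objects from one representation to another. */
    interface DataTransformer<A, B> {
        B transform(A input);
    }

    /** Wraps a third-party tool that receives data, such as a viewer. */
    interface DataConsumer<T> {
        void consume(T item);
    }

    /** Pulls from the source, converts each item, and hands it to the consumer. */
    static <A, B> void run(DataSource<A> source, String query,
                           DataTransformer<A, B> transformer, DataConsumer<B> consumer) {
        for (A item : source.fetch(query)) {
            consumer.consume(transformer.transform(item));
        }
    }

    /** Wires toy implementations together and returns what the consumer received. */
    static List<String> demo() {
        DataSource<String> source = query -> Arrays.asList(query + "-1", query + "-2");
        DataTransformer<String, String> upperCase = s -> s.toUpperCase(Locale.ROOT);
        List<String> received = new ArrayList<>();
        run(source, "marker", upperCase, received::add);
        return received;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this separation, a third-party tool needs only a thin DataConsumer wrapper and, where its input format differs from the domain model, a DataTransformer; the tool itself stays unaware of the GCP framework.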
The GCP consortium was formally established in 2003. The first meeting of the bioinformatics and crop informatics development team of the GCP, designated as Subprogramme 4, was hosted in Rome in February 2004. The general user needs and project goals were coarsely mapped out at this meeting, with considerable differences of opinion voiced about how to construct the required informatics framework for the GCP. In May 2004, a smaller team of software experts met in Mexico to discuss project management, identify key user needs and platform requirements, and make some initial progress in the design of the system. Key decisions at this latter meeting were to adopt the “model-driven architecture” paradigm for system development and to embrace web services as a key technology for global integration of systems. Numerous development meetings have been convened annually since these initial meetings to further refine and advance the design and implementation of the platform.
In particular, a milestone review of the GCP domain model and initial software systems using the model was held in Pretoria, South Africa, in March 2006. Since that time, a number of early release versions of software systems based on GCP platform technology have become available, generally documented at http://pantheon.generationcp.org/ and publicly downloadable from various CropForge projects. A special “communications” project for GCP-specific projects is also available on CropForge at http://cropforge.org/projects/gcpcomm to further inform prospective users about the variety of GCP software tools now available, and to provide a venue for user discussions and feedback about the tools.
In this light, a number of practical “use cases” may be described in general terms, as a series of data manipulation steps, to highlight some of the anticipated usage of the platform. As an indication of the data retrieval and analysis scope of the GCP platform, we describe a general integrative use case below, in terms of a series of defined steps.
General GCP platform analysis use case for crop improvement
The vision of the platform development team of the bioinformatics and crop informatics subprogramme of the GCP is to establish a state-of-the-art but truly easy-to-use and extensible open-source workbench providing interoperability and enhanced data access across all GCP partner sites and, by extension, the global crop research community.
Although several attempts have been made in the past to build such globally integrative bioinformatics systems, few match the GCP consortium in the global distribution of partners, scope of crop research, diversity of data types, and magnitude of datasets, nor do they have its long-term project perspective of 10 years. In addition, the GCP platform is specifically targeted to bioinformatics for developing world crop research, in contrast to biomedical research, and also strives to integrate databases for the many plants and crops that are less well represented than well-funded model organisms and crops.
In these respects, the GCP platform effort represents an extremely ambitious but very useful global public good resource for crop research. It is still conceded to be, in several respects, an incomplete evolving product, one with many rough edges and incompletely met end-user needs; however, the open-source and public nature of the project provides a credible venue for wide participation of interested developers and prospective end users in the future evolution and deployment of the platform.
2. Fox PN, Skovmand B. The international crop information system (ICIS) - connects genebank to breeder to farmer's field. In: Cooper M, Hammer GL, editors. Plant Adaptation and Crop Improvement. Wallingford, UK: CAB International; 1996. pp. 317–326.
3. Bruskiewich R, Cosico AB, Eusebio W, et al. Linking genotype to phenotype: the international rice information system (IRIS). Bioinformatics. 2003;19(supplement 1):i63–i65.
4. McLaren CG, Bruskiewich R, Portugal AM, Cosico AB. The international rice information system. A platform for meta-analysis of rice crop data. Plant Physiology. 2005;139(2):637–642.
5. http://www.gmod.org/, September 2007.
6. http://www.geneontology.org/, September 2007.
7. http://www.plantontology.org/, September 2007.
8. http://www.mged.org/, September 2007.
9. Siepel A, Farmer A, Tolopko A, et al. ISYS: a decentralized, component-based approach to the integration of heterogeneous bioinformatics resources. Bioinformatics. 2001;17(1):83–94.
11. Wilkinson M, Schoof H, Ernst R, Haase D. BioMOBY successfully integrates distributed heterogeneous bioinformatics web services. The PlaNet exemplar case. Plant Physiology. 2005;138(1):5–17.
12. Senger M, Rice P, Oinn T. Soaplab - a unified Sesame door to analysis tools. In: Cox SJ, editor. Proceedings of the 2nd UK E-Science All Hands Meeting; September 2003; Nottingham, UK. pp. 509–513.
13. http://www.sswap.info/, September 2007.
14. http://www.tdwg.org/activities/tapir, September 2007.
15. Lewis SE, Searle SM, Harris N, et al. Apollo: a sequence annotation editor. Genome Biology. 2002;3(12):research0082.
16. Casstevens TM, Buckler ES. GDPC: connecting researchers with multiple integrated data sources. Bioinformatics. 2004;20(16):2839–2840.
17. Saeed AI, Bhagabati NK, Braisted JC, et al. TM4 microarray software suite. Methods in Enzymology. 2006;411:134–193.
18. Sawkins MC, Farmer AD, Hoisington D, et al. Comparative map and trait viewer (CMTV): an integrated bioinformatic tool to construct consensus maps and compare QTL and functional genomics data across genomes and experiments. Plant Molecular Biology. 2004;56(3):465–480.
19. Hancock D, Wilson M, Velarde G, et al. maxdLoad2 and maxdBrowse: standards-compliant tools for microarray experimental annotation, data management and dissemination. BMC Bioinformatics. 2005;6:264.