GCP domain model
To cope with the scope, diversity,
and dispersion of crop information, GCP researchers formulated a vision to
specify a consensus blueprint of a scientific domain model and associated
ontology. The resulting models and ontology allow a “model-driven architecture”
for the development of GCP software and network protocols [1].
The domain model is documented in
Unified Modeling Language (UML). Computable versions of the UML model are archived
in the DemeterUML folder of the “Pantheon” project in the CropForge
(http://cropforge.org/) software project
repository. The UML diagrams themselves are indexed and published with supporting
narratives on a project website (http://pantheon.generationcp.org/demeter). The
bulk of the models are specified with the UML <<interface>>
stereotype.
At
the heart of the domain model are generic core model interfaces from which
other specific scientific model interfaces are derived. This core model starts
with the concept of simple identification of data objects in the system (using
the SimpleIdentifier interface), which is extended by several more
specific interfaces. The core includes a general concept of Entity, which
serves as the superclass for most other interfaces describing major scientific concepts
or data types in the system. The Entity interface documents generic metadata
about objects in the system, including specific annotation of object characteristics
using a rich Feature model. Other packages in the core models provide
utility models for ontology, publication, and experimental study management.
Additional
scientific models are derived as extensions of the core models. For example,
the base interface classes of most specific major concepts or experimental
objects in the scientific domain of discourse of the GCP, such as Germplasm, Map, or GeneProduct, directly extend the Entity model,
adding subdomain-specific attributes as required. More lightweight concepts in
the system extend simpler interfaces such as Feature.
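The layering described above can be pictured with a small Java sketch. It is illustrative only: the interface names SimpleIdentifier, Entity, Feature, and Germplasm come from the text, but every method name, the shape of Feature (a type plus a value), the getTaxon attribute, and the Basic* helper classes are assumptions made for this example and are not taken from the Demeter UML.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: member names are assumptions, not the Demeter UML.
class CoreModelSketch {

    /** Simple identification of a data object in the system. */
    interface SimpleIdentifier {
        String getId();
    }

    /** Annotation of one object characteristic (assumed shape: type plus value). */
    interface Feature {
        String getType();   // would normally be a CVO term
        String getValue();
    }

    /** Generic base for major scientific concepts: identity, metadata, features. */
    interface Entity extends SimpleIdentifier {
        String getName();
        List<Feature> getFeatures();
    }

    /** Subdomain concept: extends Entity with a subdomain-specific attribute. */
    interface Germplasm extends Entity {
        String getTaxon();  // hypothetical attribute
    }

    static final class BasicFeature implements Feature {
        private final String type;
        private final String value;
        BasicFeature(String type, String value) { this.type = type; this.value = value; }
        public String getType() { return type; }
        public String getValue() { return value; }
    }

    static final class BasicGermplasm implements Germplasm {
        private final String id;
        private final String name;
        private final String taxon;
        private final List<Feature> features = new ArrayList<>();
        BasicGermplasm(String id, String name, String taxon) {
            this.id = id; this.name = name; this.taxon = taxon;
        }
        public String getId() { return id; }
        public String getName() { return name; }
        public String getTaxon() { return taxon; }
        public List<Feature> getFeatures() { return features; }
    }

    /** Code written against Entity works for every derived concept. */
    static String describe(Entity e) {
        return e.getId() + " (" + e.getName() + "), " + e.getFeatures().size() + " feature(s)";
    }

    public static void main(String[] args) {
        // Placeholder values, not real GCP records.
        Germplasm g = new BasicGermplasm("germplasm:1", "Example accession", "Oryza sativa");
        g.getFeatures().add(new BasicFeature("example-trait", "example-value"));
        System.out.println(describe(g));
    }
}
```

Because describe accepts any Entity, the same code would serve a Map or GeneProduct implementation without change, which is the practical benefit of deriving subdomain interfaces from a common core.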
For
the elaboration of specific components of the core, as well as scientific
domain models, the project generally adapts extant public domain models. For
example, the Germplasm and Study subdomain models are derived
from the data models of the open-source International Crop Information System
(ICIS, http://www.icis.cgiar.org/; [2–4]). Aspects of the genotype (and associated
genetic map and genomic sequence) models are influenced by public initiatives such
as the Chado relational database schemata of the Generic Model Organism Database (GMOD) project [5]. The production-release GCP domain
model is being validated through practical application in data management and
platform implementation, drawing on feedback from project scientists and
developers.
A
significant feature of the domain model is the reliance on extensible
controlled
vocabulary and ontology (CVO) to define the full semantics of
specialized types, feature attributes, and annotation values of
instances of the model classes. Where possible, the GCP is simply
adopting existing CVO standards, such as those from the Gene Ontology [6], Plant Ontology [7], and Microarray Gene Expression Data Society (MGED) Ontology [8]
consortia. Where no appropriate ontology has yet been formalized, new
dictionaries of terms are being compiled in collaboration with GCP
scientists. CVO dictionaries selected for the platform are being
catalogued in a dedicated online database (at http://pantheon.generationcp.org/)
with web browser and web service access. Each selected dictionary is
assigned a
GCP ontology index number to facilitate platform management of the
ontology. Where an existing public ontology already has its own
accession identifiers (e.g., GO identifiers for the GO CVO), these
identifiers are propagated into the full GCP identifier for the
corresponding CVO terms. However, newly specified CVO lacking such a
number space are assigned de novo GCP accession identifiers.
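The identifier scheme can be sketched as follows. The text does not give the actual layout of a full GCP identifier, so the "GCP_<ontology index>:<accession>" format, the seven-digit zero padding, the index numbers, and the method names adopt and mint are all hypothetical choices made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: the identifier layout below is an assumption, not the GCP format.
class CvoIdentifierSketch {

    /** Per-dictionary counters for dictionaries without their own accession space. */
    private final Map<Integer, Integer> lastSerial = new HashMap<>();

    /** Hypothetical layout: "GCP_<ontology index>:<accession>". */
    static String fullId(int ontologyIndex, String accession) {
        return "GCP_" + ontologyIndex + ":" + accession;
    }

    /** Propagate an existing public accession (e.g., a GO identifier) unchanged. */
    String adopt(int ontologyIndex, String publicAccession) {
        return fullId(ontologyIndex, publicAccession);
    }

    /** Assign a de novo accession for a term in a newly compiled dictionary. */
    String mint(int ontologyIndex) {
        // merge() stores 1 on first use, otherwise adds 1 to the stored serial.
        int serial = lastSerial.merge(ontologyIndex, 1, Integer::sum);
        return fullId(ontologyIndex, String.format("%07d", serial));
    }

    public static void main(String[] args) {
        CvoIdentifierSketch ids = new CvoIdentifierSketch();
        // Index numbers 1 and 42 are placeholders, not real GCP ontology indexes.
        System.out.println(ids.adopt(1, "GO:0008150"));
        System.out.println(ids.mint(42));
        System.out.println(ids.mint(42));
    }
}
```

Keeping the public accession intact inside the full identifier means a term can still be looked up in its source ontology, while the ontology index keeps identifiers from different dictionaries from colliding.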
GCP platform middleware
Since a March 2006 public review of
the GCP domain model, the GCP informatics team has developed selected technology-specific
GCP implementations of the model, primarily focusing on Java-based middleware
specifying a Model-View-Controller (MVC) architecture (see
Figure 1). Although the primary development stream of
the project is focusing on a Java language implementation, the GCP domain model
is a “platform-independent model” amenable to implementation with other
computing languages and is, indeed, being used to guide some complementary work
with languages such as Perl, JavaScript, and PHP. The Java-based middleware was given the
overall name “Pantheon” to reflect the use of the names of ancient
gods (mostly agricultural, e.g., Demeter, Ceres, Belenus,
Osiris) for the various layers and component parts of the code
base. This code base is open source and managed under the Pantheon project in CropForge.
In
addition to a Java implementation of the GCP domain model, a Java
application programming interface (API) was specified to assist with
and standardize software integration of components
within the middleware architecture. These interfaces are collected into
a core Java library called “PantheonBase” hosted as a module in the
Ceres section of Pantheon
(under Ceres/projects/Pantheonbase). PantheonBase includes a simple
DataSource interface for read-only query retrieval of data from any source (local or distributed); a
DataConsumer interface to guide integration and synchronization of applications and viewers wishing to use data extracted using
the middleware; and finally, a
DataTransformer
interface to provide a framework for analysis and transformations
(e.g., reformatting) of data. PantheonBase was deliberately designed to
be essentially
agnostic about the GCP domain model per se, for maximum flexibility and
possible reuse with non-GCP-compliant data.
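To make the division of roles concrete, the following Java sketch shows one way three such interfaces could fit together. Only the interface names DataSource, DataConsumer, and DataTransformer come from the text; the generic type parameters, the method names find, consume, and transform, and the run helper are assumptions for this example and should not be read as the actual PantheonBase signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative sketch: signatures are assumptions, not the real PantheonBase API.
class PantheonBaseSketch {

    /** Read-only query retrieval from any source, local or distributed. */
    interface DataSource<Q, T> {
        List<T> find(Q query);
    }

    /** An application or viewer that receives data extracted via the middleware. */
    interface DataConsumer<T> {
        void consume(List<T> data);
    }

    /** An analysis or reformatting step applied to retrieved data. */
    interface DataTransformer<I, O> {
        List<O> transform(List<I> input);
    }

    /** Wire source, transformer, and consumer together; no domain model types needed. */
    static <Q, I, O> void run(DataSource<Q, I> source, Q query,
                              DataTransformer<I, O> transformer,
                              DataConsumer<O> consumer) {
        consumer.consume(transformer.transform(source.find(query)));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("alpha", "beta", "alphabet");

        // In-memory source: returns the names that start with the query string.
        DataSource<String, String> source = prefix -> names.stream()
                .filter(n -> n.startsWith(prefix))
                .collect(Collectors.toList());

        // Reformatting step: upper-case every record.
        DataTransformer<String, String> upper = in -> in.stream()
                .map(s -> s.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());

        // Consumer: collects whatever it is handed.
        List<String> received = new ArrayList<>();
        DataConsumer<String> viewer = received::addAll;

        run(source, "alpha", upper, viewer);
        System.out.println(received);
    }
}
```

Nothing in these interfaces refers to Entity or any other domain model type, which mirrors the stated design goal that the base library remain usable with non-GCP-compliant data.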
Additional support libraries are
being provided within Ceres to support GCP domain model-driven DataSource development. In addition to core and support
libraries, the Pantheon project provides a clearinghouse for platform and data-type-specific
components. These components include adapters implementing the DataSource
interface (archived in Osiris) for crop databases at GCP partner and
external sites. Among others,
current DataSource implementations include wrappers for
ICIS and for GMOD schemata (Chado,
GBrowse). Other Pantheon components provide application support,
including search engine, data visualization, and web service provider
implementations (in
Belenus). Examples include support for NCGR ISYS [9], support for stand-alone
applications based on Eclipse/RCP [10], and a web-based, GCP domain-model-compliant search engine (Koios).
GCP network protocols
The GCP domain model is also being applied
to platform-specific implementation of a GCP network based on Internet bioinformatics
data exchange protocols such as BioMOBY [11], SoapLab [12], SSWAP [13], and Tapir [14].
For brevity, this paper discusses only
BioMOBY, a representative protocol being used in the GCP network.
For BioMOBY, data types were designed using GCP domain model semantics. Although generally faithful in
translating the semantics of the Demeter UML specification of the domain model (e.g., the SimpleIdentifier interface is represented as a GCP_SimpleIdentifier
data type), the GCP BioMOBY data types simplify the data representation
as a concession to BioMOBY design constraints and to web service
performance.
One key example of this is the extensive substitution of GCP_SimpleIdentifier
objects, instead of fully detailed data objects, at the end of
model-to-model association edges found in the Demeter model. The
rationale for this is the expectation that, in most cases, web services
can apply a concept of “lazy loading” of data-type components, in which
one identifies what objects might be embedded in a parent object, but
does not necessarily retrieve their details until the user needs them
(through a separate web service that accepts a GCP_SimpleIdentifier of the object and returns the fully populated complex object of the specified
type).
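The lazy-loading idea can be sketched in Java as follows. This is an assumption-laden illustration, not GCP code: SimpleId stands in for a GCP_SimpleIdentifier, Study stands in for a fully populated object, and the resolver function plays the role of the separate web service that expands an identifier into the complete object.

```java
import java.util.function.Function;

// Illustrative sketch: class names and the resolver are assumptions, not GCP code.
class LazyLoadingSketch {

    /** Stand-in for a GCP_SimpleIdentifier: names an object, carries no details. */
    static final class SimpleId {
        final String id;
        SimpleId(String id) { this.id = id; }
    }

    /** Stand-in for a fully populated complex object. */
    static final class Study {
        final String id;
        final String title;
        Study(String id, String title) { this.id = id; this.title = title; }
    }

    /** End of an association edge: holds the identifier, fetches details on first use. */
    static final class LazyRef<T> {
        private final SimpleId id;
        private final Function<SimpleId, T> resolver;  // plays the second web service
        private T value;
        private boolean loaded;

        LazyRef(SimpleId id, Function<SimpleId, T> resolver) {
            this.id = id;
            this.resolver = resolver;
        }

        /** Always available without any remote call. */
        SimpleId getId() { return id; }

        /** Resolves the full object once, then returns the cached copy. */
        T get() {
            if (!loaded) {
                value = resolver.apply(id);
                loaded = true;
            }
            return value;
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};  // counts simulated service round trips
        LazyRef<Study> study = new LazyRef<>(new SimpleId("study:1"), sid -> {
            calls[0]++;
            return new Study(sid.id, "Example study");
        });

        System.out.println("known id: " + study.getId().id + ", calls so far: " + calls[0]);
        System.out.println("title: " + study.get().title + ", calls: " + calls[0]);
        study.get();  // second access reuses the cached object
        System.out.println("calls after second access: " + calls[0]);
    }
}
```

The parent object stays small on the wire because it carries only identifiers, and the cost of retrieving details is paid only for the associations a user actually follows. This sketch is not thread-safe; a shared client would need synchronization around get.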
UML
diagrams with supporting explanatory narration for these GCP-specific
BioMOBY data types are published on the Pantheon website (http://pantheon.generationcp.org/moby), complemented by a website documenting GCP BioMOBY implementation details (http://moby.generationcp.org/). The BioMOBY protocol is supported in Pantheon by a series of modules: for interconversion between GCP MOBY data types
and Demeter-compliant Java objects, for web service provider implementation, and for a MOBY client DataSource adapter that communicates with GCP-compliant web service providers.
Using GCP model-constrained BioMOBY
data types (all prefixed with “GCP_” in their name in the MOBY central
registry), various GCP teams are deploying GCP-compliant web services from a
common proposed list of documented web service use cases. Concurrently, the
MOBY client DataSource adapter is being elaborated to communicate with
these web services and import remote data into local “workbench” instances of
the GCP platform.
Additional tools integrated into the GCP platform
The GCP domain model and associated
platform middleware are not ends in themselves. Rather, the goal of these
informatics products is to serve as a semantically and operationally rich
scaffold for the integration of both local and remote (Internet-connected)
bioinformatics data resources and analysis tools.
In addition to the data sources and
tools mentioned above, open-source third-party analysis tools
that are written in Java but are agnostic about the GCP framework are being
connected to the platform through targeted software engineering. To this end, GCP
developers are connecting several public open-source applications by writing
suitable DataSource adapters or DataConsumer or DataTransformer integration code. These
include Java software hosted by GMOD, such as the Apollo genome browser [15];
tools forming part of the Genomic Diversity and Phenotype Connection (GDPC)
protocol, such as Tassel [16]; and tools such as the TIGR Multiple Experiment Viewer
[17] for microarray analysis, the Comparative Map and Trait Viewer [18] connected
to the NCGR ISYS framework [9], the MAXD microarray system [19], and the Cytoscape network visualization tool [20].