Along with the introduction of omics technologies in the life sciences came the need to handle, analyze and interpret biological data in a different, more formalized way. The semantic web approach enhances data exchange and integration by providing standardized formats such as RDF, RDFS and OWL to achieve a formalized computational environment. In this study, we have investigated the potential of the semantic web concept for the purpose of data integration by computational experimentation in the life sciences. We focused on the basic linkage of data to knowledge using semantic web based data and knowledge models.
Because of the complexity of biology, we adopted a strategy to formalize a (part of a) domain that is of interest to a specific (group of) scientist(s) by capturing the knowledge via a network of interrelated semantic models using ontologies as a controlled vocabulary. This allows a modular approach for data integration in which the individual scientists can use existing (general) models, potentially in combination with small specific models that they create themselves. This means that each scientist can interact with external data and knowledge models from their own perspective using a kind of ‘personal semantic framework’. In this way the involved scientists are familiar with the concepts and the relationships in the models they work with and can create semantic queries using their own terms. The external models should be made either by coordinating data-managing organizations (e.g. NCBI) or by organized domain experts (e.g. the FlyBase consortium).
After creating several necessary knowledge and data models, we were able to provide a proof-of-principle for our SWEDI approach. Multiple genomics data sets, which involved histone modification and transcription factor binding sites (TFBS), were successfully linked via a common domain. The integrated data in our biological use case resulted in some interesting biological observations that may lead to new hypotheses regarding the role of histone modification in gene expression regulation. With our approach, we established a type of formalization of the problem domain by creating a vocabulary in the form of knowledge models that describe the data and capture the domain knowledge. This promoted the transparency and reproducibility, as well as the easy extensibility, of our experiments. However, more sophisticated tools than, for example, the OntoViz plug-in in Protégé are needed for the visualization of concepts and their relationships when more and more data sets are added. Our approach also gives us flexibility in asking questions of the data sets. Although sites like the UCSC Genome Browser hold tremendous amounts of data and information, it is rather difficult there to ask simple questions like: what is the overlap between any number of tracks concerning histone modifications, transcription factors and genes?
Although in this study we used a rather limited use case, we could still show that SWEDI is extendable by starting from just one cell line (GM06990) and adding similar H3K4me3 modification data sets from four additional human cell lines and three recalculated H3K4me3 data sets. The use of new data from the same site (the UCSC Genome Browser) and from the same track resulted in the re-use of the H3K4me3 data model, because this model describes the data at a very low level and the format is identical. Since the data model can be re-used, the linking between the data and knowledge model remains the same, as does the common domain. In contrast, the SeRQL queries do have to be altered slightly. The addition of extra data sets was facilitated by the great similarity between the data sets, but in essence any data set that contains at least a chromosome number, a start position and an end position can be integrated. New data models only need to be created if existing ones cannot be re-used. Although we showed extensibility by adding similar data sets, it will be a challenge to add totally different data, e.g. data not related to genomic location.
There are also drawbacks to SWEDI. The main problem is that the initial setting-up costs are high, because hardly any adequate knowledge models are available yet. Part of this problem is inherent to formalizing a domain, but in addition the applied semantic web technologies are still immature, which makes RDF data set manipulation hard. A future problem could also arise when many, highly divergent personalized frameworks eventually have to be merged. This relates to the more general problem of ontology alignment (Euzenat and Valtchev, 2003). So it seems fair to assume that if any domain in biology wants to embrace this approach, it will take a community as well as a multidisciplinary effort to make and maintain something like a domain semantic framework. The effort can be compared to those of initiatives such as NCBI, the UCSC Genome Browser, FlyBase, WormBase, etc. An example in this context is the effort of the W3C-HCLS (Ruttenberg et al., 2007) to recommend a standard scheme for the URIs that refer to commonly used bioconcepts. If widely adopted, such a URI scheme would have a normalizing effect that would greatly increase the ease of data integration and model sharing. This would be a first step in the direction of a domain semantic framework. Although the necessary consensus for a domain semantic framework would be a challenge to establish, it would only need to happen once. In contrast, consider the countless times that database schemas are re-invented for use with the same data but at different institutions. The scalability of SWEDI is also a matter of concern, as performance and provenance may become bottlenecks. With the sizes of our whole-genome data sets, the queries took two days to run, because the query engine is not yet optimized for our type of semantic web based query. As noted earlier, query optimizations can greatly improve the performance (Marshall et al., 2006).
Although we could have used SQL for better query performance, we would then have had to give up all the benefits of the explicit use of our models in the query itself. Furthermore, in SWEDI data models are mapped to knowledge models using subPropertyOf statements that are stored in the data models. This poses a problem because, in our scenario, we do not control these data models and so cannot add mapping statements directly to them. We therefore had to store the linking statements in a separate file, thus keeping the data models intact but adding an extra layer. Finally, choosing the common domain is done manually, which demands extensive domain knowledge. It would be extremely useful if methods could be developed that identify common domains automatically.
In the context of biological data and knowledge integration, numerous solutions have been developed to enable the retrieval of data from heterogeneous distributed sources (Eckman et al., 2003; Searls, 2005; Stein, 2003). The solutions range from monolithic systems, such as SRS (Zdobnov et al., 2002) with keyword indexes and hyperlinks, Kleisli/K2 (Ritter et al., 1994) with a query language that spans databases as if they were one, data warehouses such as BioZon (Birkland and Yona, 2006), and automated annotation systems such as PhosphaBase-myGrid (Wolstencroft et al., 2006), to BioMOBY, which uses web services acting as portals to biological data (Wilkinson et al., 2005). Perhaps the most widely used system is SRS, providing integration of over 400 databases. Our approach to data integration uses semantic models to provide a schema for integration. TAMBIS pioneered such an approach by creating a molecular-biology ontology as a global schema for transparent access to a number of sources, including Swiss-Prot and Blast (Stevens et al., 2000). Systems such as BACIIS (Miled et al., 2003), BioMediator (Mork et al., 2005) and INDUS (Caragea et al., 2005) build on this example. For instance, BioMediator uses a ‘source knowledge base’ that represents a ‘semantic web’ of sources linked by typed objects. The knowledge base includes a ‘mediated schema’ that can represent a user's domain of discourse. INDUS shows important similarities to our approach, offering an integrated user interface to import or create user ontologies, and creating ontological mappings between concepts and ‘ontology-extended’ data sources. In contrast to our approach, however, INDUS does not use semantic web formats such as OWL and RDF. While the syntactic step of our import is similar to that of YeastHub (Cheung et al., 2005), our explicit linking of the semantic types to the syntactic types with RDFS moves the work of discovering semantics from the query stage to the model alignment stage.
A different approach that uses Semantic Web methodologies to integrate gene data with phenotype data applies RDF graph analysis to prioritize candidate disease genes (Gudivada et al., 2007).
With respect to the applicability of SWEDI in biology, there is no doubt that semantic modeling is a necessity for biological knowledge bases (Ruttenberg et al., 2007). Life sciences research today is all about data, information and knowledge management. Whereas previously the important domain information and knowledge resided mainly in the heads of the responsible principal investigators and in the literature, we are moving into the era of data warehouses/repositories, information management systems and knowledge bases. For any life sciences research group, this means that they have to deal with these e-science issues if they want to stay competitive (Goble et al., 2005; Rauwerda et al., 2006). Furthermore, merely managing all resources will not help much. Once all resources are accessible, multidisciplinary skills such as data mining, data integration, data analysis, statistics, etc. are needed. With SWEDI we advanced towards formalized resource management by semantic models. Even our limited SWEDI approach can be used to integrate a substantial number of data sets. However, at present SWEDI covers only data integration, not data analysis or interpretation. This means that with SWEDI we succeeded in integrating multiple genomics data sets; as always, once the technical bottleneck for data integration was lifted, it shifted immediately to data analysis and interpretation. These phases are not yet covered by SWEDI, and it takes quite an effort to extract relevant biological knowledge from raw integration results, as we experienced in our use case. However, the data and findings of any integrative bioinformatics experiment using SWEDI are by definition in a standardized format. This facilitates putting them in semantic web repositories, which subsequently increases their re-use by other members of the research community.
SWEDI is based on building OWL models confined within the scope of an experiment (Marshall et al., 2006). OWL enables the linking of small models to form a larger semantic web, hence a ‘bottom-up’ approach. This ensures freedom for scientists to compose and extend models to their specific needs, such as new (hypothetical) concepts that have yet to reach the level of consensus necessary for consortia-managed ontologies. In contrast, OBO models are built ‘top-down’ by consortia to encompass an extensive number of concepts. Integration could be enhanced through the use of upper-bio-ontologies (Grenon et al., 2004; Rector and Rogers, 2004; Schulz et al., 2006), and the anticipated problems of merging many divergent personalized frameworks can be circumvented by linking personalized frameworks via such upper-bio-ontologies. Upper-bio-ontologies can also facilitate the introduction of new domains to the created formalized computational environment. These upper ontologies increasingly offer guidance on how to categorize new concepts; careful use of them, together with best practices, should simplify the alignment of semantic models.
Altogether, our SWEDI approach is a first step towards a formalized computational environment for integrative bioinformatics experimentation. The modular nature of SWEDI, in combination with the use of standardized semantic web formats, ensures the extensibility and scalability of the approach. SWEDI can be used either to create small personal semantic frameworks bottom-up or to create larger domain semantic frameworks top-down. An important advantage of semantic web repositories compared to relational databases is that their complete (meta)data models (i.e. data schemas) are described in a standardized language, which enhances transparency because they can be visualized and manipulated by nonproprietary tools. These schemas are also referenced by the query, so that it is possible to examine any RDF query to discover precisely what it means, i.e. to track data provenance. Where RDFS and OWL constructions are used, the corresponding reasoning can be applied to the data schemas, creating opportunities for innovative integrative bioinformatics experimentation. Furthermore, semantic web repositories can be flexibly accessed and modified, because many types of modifications require neither specialized knowledge about repository internals nor risky processes such as table migration. Finally, if upper ontologies and best practices are carefully applied in the smaller personal semantic frameworks, it will be possible to link them together into a functional semantic web.