With the development of high-throughput biotechniques and the subsequent omics studies, exciting avenues of scientific exploration are opening up. Instead of being constrained to analyzing a handful of genes or proteins per experiment, researchers can now study whole genomes and proteomes. This allows biologists to investigate more complex processes that were not accessible before (Carroll et al., 2006; Lein et al., 2007; Souchelnytskyi, 2005; Spellman et al., 1998; van Steensel, 2005).
As became evident from the human genome project, once the technological limitations were lifted, the bottleneck rapidly shifted to the annotation of the produced DNA sequence data. Therefore, as with the biotechniques, large projects in which numerous research groups collaborate now tackle complex issues such as annotating the human genome (The ENCODE Project Consortium, 2004). As a result, a growing layer of biological annotations is being produced on top of the omics data. These data are increasingly made available through public web-accessible data stores like Ensembl and the UCSC Genome Browser. Because the data are distributed across the web, new issues of data management, maintenance and usage arise. Biologists use these data as a reference, but increasingly also for in silico data integration experiments. Integrating these heterogeneous data sets across different databases, however, is technically quite challenging, because one must find a way to extract information from a variety of search interfaces, web pages and APIs. To complicate matters, some databases periodically change their export formats, effectively breaking the tools that provide access to their data. At the same time, most omics databases do not yet provide computer-readable metadata and, when they do, it is not in a standard format. Hence, expert domain-specific knowledge is required from the user to interpret what the data actually represent before using them in integration experiments. This limits the practical scale and breadth of integration, given the variety and amount of data available from distributed resources.
The Semantic Web is designed to bring meaning to raw data content by using ontologies to define relationships between distinct concepts (http://www.w3.org/2001/sw/). This allows data to be shared and processed by automated agents that can assist in the retrieval of relevant information and metadata (Roos et al., 2004). The Resource Description Framework (RDF) specification is a metadata model that forms the basis of the Semantic Web. The metadata model describes everything as a resource that can be linked to other resources by defining relationships as properties. Resources are described by making statements that identify the resource (the subject), its property (the predicate) and the value of the property (the object), e.g. (MAPKAP-2, hasFunction, Kinase). The vocabulary used in RDF statements is defined in RDF Schema (RDFS), which describes the semantics and defines the class and property hierarchies of the domain for which the RDF document is used.
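The subject, predicate, object structure of RDF statements can be illustrated with a minimal sketch in plain Python. It does not use an RDF library; the `EX` namespace, the `Protein` and `Enzyme` classes and the `match` helper are assumptions introduced here for illustration, and only the (MAPKAP-2, hasFunction, Kinase) statement comes from the text.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace; real RDF resources are identified by URIs.
EX = "http://example.org/bio#"

triples = [
    # The statement used as an example in the text.
    Triple(EX + "MAPKAP-2", EX + "hasFunction", EX + "Kinase"),
    # Assumed statements: schema-level information is expressed as triples too.
    Triple(EX + "MAPKAP-2", "rdf:type", EX + "Protein"),
    Triple(EX + "Kinase", "rdfs:subClassOf", EX + "Enzyme"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which functions are stated for MAPKAP-2?
functions = [
    t.obj
    for t in match(triples, subject=EX + "MAPKAP-2", predicate=EX + "hasFunction")
]
```

Because every statement has the same three-part shape, a single pattern-matching function is enough to query both data and schema statements, which is the property that lets automated agents process such data uniformly.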
To allow machine reasoning over the formalized knowledge of a domain, the W3C has developed a standard for a web-based ontology language: OWL (http://www.w3.org/2004/OWL/), a language that builds upon RDF and RDFS. Essentially, an ontology is a formalization of a domain, defining concepts (i.e. collections of biological elements that share common properties) and the relationships between them, thus creating a common, controlled vocabulary that can be reasoned over in a well-defined manner. Applying ontologies to data involves populating the concepts with individuals, i.e. real-life entities. By defining ontologies for a field as complex as biology, one can eventually build a knowledge base that facilitates the exchange and interoperability of the data present in numerous available databases. Many biological ontology initiatives exist (http://obo.sourceforge.net/), with the Gene Ontology (GO) being the most widely adopted (Ashburner et al., 2000). Ontologies allow these databases to evolve from data stores into knowledge stores. Thus, ontologies will greatly aid biological research by providing a structured approach to capturing knowledge in a computer-understandable way (Bodenreider and Stevens, 2006; Good and Wilkinson, 2006; Ruttenberg et al., 2007; Strizh, 2006).
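The idea of populating concepts with individuals and then reasoning over them can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in for an OWL reasoner: it supports only a single-parent concept hierarchy, and the concept names and the hierarchy itself are hypothetical.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
SUBCLASS_OF = {
    "Kinase": "Enzyme",
    "Enzyme": "Protein",
    "TranscriptionFactor": "Protein",
}

# Individuals (real-life entities), each asserted as a member of one concept.
INSTANCE_OF = {
    "MAPKAP-2": "Kinase",
}


def ancestors(concept):
    """Return the concept followed by all of its superclasses, nearest first."""
    chain = [concept]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def individuals_of(concept):
    """Return individuals that belong to the concept directly or by inheritance."""
    return sorted(
        name
        for name, asserted in INSTANCE_OF.items()
        if concept in ancestors(asserted)
    )
```

Here `individuals_of("Protein")` includes MAPKAP-2 even though it was only asserted to be a kinase; the membership follows from the hierarchy. OWL supports far richer constructs (multiple inheritance, property restrictions, disjointness), so this shows only the simplest kind of inference.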
Because of the heterogeneity of life sciences data, the semantic web approach could be useful throughout the entire cycle of integrative bioinformatics experimentation. Figure 1 shows this cycle divided into five phases: problem definition, experimental design, data integration, data analysis and data interpretation. In the current study, we have applied the semantic web approach to the data integration phase of an example integrative bioinformatics experiment and evaluated its applicability in the context of the whole cycle. As a biological use case, we set out to combine two genomics data sets from UCSC: data about a specific histone modification and data about transcription factor binding sites. Our approach, semantic web-enabled data integration (SWEDI), is based on semantic web technology for a model-based integration of data sets in the life sciences domain (Marshall et al., 2006). We constructed three OWL biological knowledge models, one OWL technical knowledge model and two RDFS data models. We then transformed and mapped relevant data to the data models, linked the data models to the knowledge models using linkage statements and ran a semantic query. The analysis of the results of the biological use case demonstrated the relevance of these kinds of integrative bioinformatics experiments. Our findings are that the initial ‘startup’ costs for SWEDI are high, but that subsequent addition of (similar) data is straightforward.
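The general pattern of mapping two differently structured data sets to shared concepts and then querying across them can be sketched as follows. This is an illustrative toy in plain Python, not the SWEDI implementation: the records are invented (not UCSC data), the field names, concept names and `LINKS` table are hypothetical, and a simple interval-overlap test with half-open coordinates stands in for the semantic query.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    concept: str  # knowledge-model concept assigned through a linkage statement


# Toy records in two hypothetical source formats with different field names.
histone_rows = [{"chrom": "chr1", "chromStart": 100, "chromEnd": 200}]
tfbs_rows = [
    {"chr": "chr1", "from": 150, "to": 160},
    {"chr": "chr1", "from": 300, "to": 310},
    {"chr": "chr2", "from": 150, "to": 160},
]

# Stand-in for linkage statements:
# data model -> (shared concept, (chromosome, start, end) field names).
LINKS = {
    "histone": ("HistoneModificationSite", ("chrom", "chromStart", "chromEnd")),
    "tfbs": ("TranscriptionFactorBindingSite", ("chr", "from", "to")),
}


def to_regions(rows, model):
    """Map source records onto the shared Region representation."""
    concept, (chrom_key, start_key, end_key) = LINKS[model]
    return [Region(r[chrom_key], r[start_key], r[end_key], concept) for r in rows]


def overlapping(a_regions, b_regions):
    """Return pairs on the same chromosome whose half-open intervals overlap."""
    return [
        (a, b)
        for a in a_regions
        for b in b_regions
        if a.chrom == b.chrom and a.start < b.end and b.start < a.end
    ]


pairs = overlapping(to_regions(histone_rows, "histone"), to_regions(tfbs_rows, "tfbs"))
```

In this sketch, adding a further data set requires only a new entry in `LINKS`, while the query code stays unchanged. That loosely mirrors the observation that the startup cost is high but the subsequent addition of similar data is straightforward.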