We set out to analyze the applicability of our SWEDI approach for a specific phase of the integrative bioinformatics experimentation cycle by means of a biological use case. Although we only want to analyze one specific phase of this experimentation cycle, its indissoluble nature forces us to perform an experiment including all phases in order to determine the usability of SWEDI for life sciences research. We identified five phases in our experimental cycle: problem definition, experimental design, data integration, data analysis and interpretation (Fig. 1). At the end of each phase an outcome is generated that serves as input for the next phase. Our study focuses mainly on the application of SWEDI in the data integration phase.
3.1 Problem definition
We started by defining the biological hypothesis of our use case with input from domain experts, i.e. biologists. Figure 2 shows a cartoon representation of the hypothesis. A puzzling phenomenon in biology pertains to histone modifications. DNA is bound by histone octamers, called nucleosomes, that package it inside the nucleus. Nucleosomes are built of eight histones, which can undergo post-translational chemical modifications at their N-terminal tails (Felsenfeld and Groudine, 2003). Different histone marks are associated with different cellular processes (Peterson and Laniel, 2004). It is believed that these histone marks act in concert to form a ‘histone code’ that defines the transcriptional state of the chromatin (Strahl and Allis, 2000). For instance, the presence of three methyl groups on the fourth amino acid, a lysine (K), of histone 3, named H3K4me3 (Turner, 2005), is believed to be a histone mark for active gene transcription (Schneider et al., 2004).
Transcription factors regulate gene expression by (in)direct binding to specific regions in the genome. In its simplest view, one transcription factor with a DNA binding domain recognizes and binds a specific DNA sequence upstream of a gene (i.e. transcription factor binding site, TFBS) to alter its transcriptional state (Fig. 2). For many transcription factors the associated TFBS sequence motifs they recognize have been identified (Matys et al.
Because H3K4me3 is a histone mark for active gene transcription, we formulated a biological hypothesis that postulates a direct relationship between the presence of this histone mark and specific cTFBS or cTFt (Heintzman et al., 2007). Although in essence this relationship is known in biology, it is a suitable hypothesis for testing our approach and possibly for interpreting this relationship further.
3.2 Experimental design
For this hypothesis, we identified relevant data sources about histone modification H3K4me3 and TFBS at the UCSC Genome Browser site, which stores various genome annotations concerning a number of different species including human. We used a data track about cTFt conserved among human, mouse and rat plus data tracks from the ENCODE project about ChIP-on-chip intensity scores of human H3K4me3. The ENCODE-H3K4me3 data tracks hold data for 1% of the whole genome. As a proof-of-principle, we decided to start with only one human cell line, GM06990 from the Sanger Institute track, together with the cTFBS data track from UCSC. Subsequently, we added similar data from four additional human cell lines, plus HMM-analyzed H3K4me3 data from three human cell lines to show extensibility.
In order to discover a relationship between the level of H3K4me3 modification and specific cTFBS or cTFt, we devised SWEDI, a model-based data integration approach, to integrate these genomics data sets. For this, we decided to model the domains that cover the minimal relevant biological and technical features associated with the data sets in a way that would allow future extension. This means several small models rather than one big, all-inclusive model. We also decided to use RDFS data models that capture the low-level metadata related to the data, to link the RDF data to our knowledge models in OWL.
3.3 Data integration
This is the actual phase in which we want to apply SWEDI. In our case we subdivided this phase into five steps: import or create models, transform raw data to RDF data, link data models to knowledge models, select a common domain, and construct and run a semantic query (Fig. 1). Given the complexity of this phase, we will explain each consecutive step separately.
3.3.1 Importing or creating models
The initial step in SWEDI is translating the hypothesis into a formalized, composite knowledge model that captures the domain-specific concepts and their relationships. This is done either by using an existing domain-specific ontology or creating a new ontology. Although an ontology like GO captures some of the desired concepts, as a knowledge model it is too restricted due to the limited types of relationships, only isA and partOf. Also, the level of granularity in the area of histones is limited. Since we could not find any suitable ontology for our use case, we constructed our own ontologies (Supplementary Material, Fig. S2).
Because our approach is meant to be scalable by allowing addition of more models and data, we purposely created distinct models that capture different aspects of the involved biology domain. Figure 3 shows our four OWL ontologies to model the domains that cover the data sets: an epigenetics model (epi), a histone model (HistOn), a TFBS model (tfbs) and a technical model (tech). The knowledge models were created with the help of experts in the field of nuclear organization and on the basis of peer-reviewed literature. We chose to express the knowledge models using OWL-DL in order to have enough expressiveness but still remain computationally efficient (http://www.w3.org/TR/owl-features/).
- epi, the largest model, was constructed to capture general concepts in the biology domain of histone modification from an epigenetics viewpoint. Epigenetics is about (heritable) DNA-related features other than the actual DNA sequence, such as DNA methylation, histone modification and chromatin structure. The model epi contains 78 concepts and 16 properties. Concepts range from amino acid modifications to sequences and chromosomes.
- HistOn, the histone model, is more specific and covers histones and histone-related concepts like histone modifying proteins. It contains 17 concepts plus 19 properties and is nested within epi.
- tfbs, the TFBS model, covers TF, TFBS and related concepts like promoters, enhancers and repressors. It contains 14 concepts and 6 properties. tfbs is nested within epi.
- tech, the technical model, was created to cover abstract terms in the data sets such as experimental measurement scores and calculated z-scores. It contains seven concepts and four properties.
Together, these knowledge models capture all concepts essential for our use case, but they can be further expanded as needed.
We also constructed two RDFS data models that semantically capture the data sets we want to integrate. We based our data models on the database table schema of the data sets. The semantics of the data models is limited to describing the rows and columns of the data files.
- The H3K4me3 data model describes the H3K4me3 data track on UCSC and contains two objects and six properties.
- The cTFBS data model describes the cTFBS data track on UCSC and contains two objects and nine properties.
Although we constructed the data models using Protégé/OWL, the expressiveness of RDFS is sufficient for our data models, and RDFS is computationally less expensive.
3.3.2 Transform raw data into RDF format
We retrieved cTFBS and H3K4me3 data sets of human cell line GM06990 from UCSC and used an adapted version of Mapper with the associated data model to transform each tab-delimited flat file to RDF/XML. We chose to keep the data separated from the ontology so that we can describe the data using ‘simple’ RDF/XML.
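The transformation step can be illustrated with a minimal Python sketch. It is not the adapted Mapper tool used in this study; it only shows the general idea of turning each row of a tab-delimited file into one RDF/XML resource whose properties are named after the columns. The namespace URI, the class name Row and the column names other than chrom are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DATA_NS = "http://example.org/h3k4me3-data#"  # hypothetical namespace, not from the study


def rows_to_rdfxml(tsv_text, columns, row_class="Row"):
    """Serialize each tab-delimited row as one RDF/XML resource.

    Every column becomes a literal-valued property named after the column,
    so the RDF stays a plain description of file rows and columns.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("d", DATA_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for i, row in enumerate(reader):
        if not row:  # skip blank lines
            continue
        node = ET.SubElement(
            root,
            f"{{{DATA_NS}}}{row_class}",
            {f"{{{RDF_NS}}}about": f"{DATA_NS}row{i}"},
        )
        for name, value in zip(columns, row):
            ET.SubElement(node, f"{{{DATA_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical column layout and a single made-up row
COLUMNS = ["chrom", "chromStart", "chromEnd", "score"]
sample = "chr1\t100\t200\t3.5\n"
print(rows_to_rdfxml(sample, COLUMNS))
```

Because every cell value is wrapped in a namespaced element, the output is considerably longer than the input row.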
A drawback of expressing data in RDF/XML is bloating of data size on disk. An approximate 15-fold increase in file size was observed when the cTFBS tab-delimited data set was transformed to RDF/XML and an approximate 18-fold increase for the H3K4me3 data sets. Although the XML bloat is no longer a problem once the data are loaded into memory, file storage on disk and file exchange are issues that will require attention.
3.3.3 Link data model to knowledge model
The next step is to link the models that capture biological knowledge to the data models. Figure 4 shows how we started by populating the knowledge model with individuals representing each data set. We then linked the data models to the knowledge models by linking properties. Using the inference feature of RDFS, we declared that a property of the data model, for instance chrom, is a sub-property of the property Chromosome_identifier from the knowledge model. For each data set this resulted in two collections (i.e. files) that contain all linkage statements (Fig. 3). As such, the linkage statements function as a user-defined viewpoint of how a data model is related to the knowledge models, which allows defining queries in the more familiar terms of knowledge models.
We chose to keep our data independent of the knowledge models, with an explicit mapping in the form of the linking statements. This approach to linking also preserves the data supplier's naming scheme. We could have transformed the raw data files into RDF that includes our own OWL terms directly in the RDF version of the data. However, such an approach would shift control of linking to the import stage, and subsequent changes to our knowledge models could require an entirely new import process for any affected data to correct obsolete links embedded in the data.
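The effect of the sub-property links can be sketched in a few lines of Python. This is an illustrative re-implementation of the RDFS sub-property rule on plain tuples, not the reasoner used in the study. The prefixes data: and km: and the resource name data:row0 are hypothetical; only chrom and Chromosome_identifier come from the text above.

```python
SUBPROP = "rdfs:subPropertyOf"


def infer_superproperties(data_triples, link_triples):
    """Apply the RDFS sub-property rule.

    If (s, p, o) is asserted and p is a sub-property of q, then (s, q, o)
    also holds. The data triples and the linkage statements are kept in
    separate collections, mirroring the separation of data and links.
    """
    parents = {}
    for s, p, o in link_triples:
        if p == SUBPROP:
            parents.setdefault(s, set()).add(o)

    def ancestors(prop):
        # Follow sub-property links transitively
        seen, stack = set(), [prop]
        while stack:
            for parent in parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    inferred = set(data_triples)
    for s, p, o in data_triples:
        for sup in ancestors(p):
            inferred.add((s, sup, o))
    return inferred


# Data keeps the supplier's property name; the link lives in its own collection
data = {("data:row0", "data:chrom", "chr1")}
links = {("data:chrom", SUBPROP, "km:Chromosome_identifier")}
result = infer_superproperties(data, links)
print(("data:row0", "km:Chromosome_identifier", "chr1") in result)
```

With this arrangement, a change in the knowledge model only requires editing the link collection; the data triples themselves stay untouched.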
3.3.4 Select common domain
To determine the relationship between H3K4me3 and cTFBS or cTFt, we had to find a path between these concepts in our models and identify a domain for comparison. By selecting a common domain we could integrate these data sets. In our case, we chose ChromosomeRegion as the common domain, because both histone modification and cTFBS data have genomic position coordinates in the form of chromosome number, start and end position.
3.3.5 Construct and run semantic query
With a common domain identified, we created a query to test the relationship under question: ‘Which DNA regions are bound by an H3K4me3-modified histone as well as a cTFt?’ We constructed a SeRQL query (Supplementary Material, Fig. S1) that checks whether an H3K4me3 region and a cTFBS region lie on the same chromosome and fall within each other's start and stop positions. Two regions, one from each data set, are considered to overlap when they share at least one base pair. Each overlap thereby identifies a cTFBS for the corresponding cTFt.
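The overlap criterion can be illustrated outside SeRQL with a short Python sketch. It is not the query from Supplementary Material, Fig. S1; it only expresses the same condition on plain records. The Region type, its field names and the half-open coordinate convention (start inclusive, end exclusive) are assumptions.

```python
from collections import namedtuple

# Hypothetical record for the common domain: a region on a chromosome
Region = namedtuple("Region", "chrom start end label")


def overlaps(a, b):
    """True when a and b share at least one base pair.

    Coordinates are assumed to be half-open [start, end); with inclusive
    end positions the comparisons would be <= instead of <.
    """
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_overlaps(histone_regions, tfbs_regions):
    """Compare every histone region with every TFBS region (naive join)."""
    return [(h, t) for h in histone_regions for t in tfbs_regions if overlaps(h, t)]


histone = [Region("chr1", 100, 200, "H3K4me3")]
tfbs = [
    Region("chr1", 150, 160, "site_A"),  # inside the histone region
    Region("chr1", 200, 210, "site_B"),  # adjacent, shares no base pair
    Region("chr2", 150, 160, "site_C"),  # different chromosome
]
for h, t in find_overlaps(histone, tfbs):
    print(h.label, t.label)
```

The naive join compares every pair of regions, so its cost grows with the product of the two data set sizes.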
Initially, the query took around 45 h (wall time) to complete and returned 12 349 overlaps for GM06990 (Supplementary Material, Fig. S3). Restricting the whole-genome cTFBS data set (1 077 457 cTFBS) to cTFBS that are within an ENCODE region (13 779) dramatically reduced the query run time to 30 min.
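The reported numbers correspond to a roughly 78-fold smaller cTFBS input (1 077 457 versus 13 779) and a 90-fold shorter run time (45 h versus 30 min). A minimal Python sketch of such a pre-filtering step is given below. The tuple layout (chrom, start, end), the made-up coordinates and the reading of ‘within’ as full containment are assumptions, not details taken from the actual query.

```python
def within(site, region):
    """True when site lies entirely inside region on the same chromosome.

    Both arguments are (chrom, start, end) tuples.
    """
    return site[0] == region[0] and region[1] <= site[1] and site[2] <= region[2]


def restrict_to_regions(sites, regions):
    """Keep only the sites that fall inside at least one region."""
    return [s for s in sites if any(within(s, r) for r in regions)]


# Made-up coordinates for illustration only
encode_regions = [("chr1", 1000, 2000)]
all_sites = [
    ("chr1", 1100, 1110),  # inside the region, kept
    ("chr1", 5000, 5010),  # outside the region, dropped
    ("chr2", 1100, 1110),  # other chromosome, dropped
]
print(restrict_to_regions(all_sites, encode_regions))

# Size reduction and speed-up implied by the reported figures
size_ratio = 1_077_457 / 13_779
time_ratio = (45 * 60) / 30
print(round(size_ratio, 1), time_ratio)
```

Filtering once before the join means that the expensive pairwise comparison only sees sites that can possibly overlap the measured regions.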
The raw integration results are in fact the proof-of-principle of SWEDI, because they show that it accomplishes data integration of heterogeneous data sets by means of semantic web technology. An important difference between the use of a traditional database and semantic web repositories is that the (meta)data model for the semantic web approach is described in a standardized language. Where RDFS and OWL are used, reasoning can be applied to the model.
3.4 Extension of data integration experiment
After achieving this proof-of-principle using data from just one cell line (GM06990), we evaluated the extensibility of SWEDI, since that is its main motivation. For this, we extended the experimental design with data from four additional human cell lines: HeLa, HFL-1, K562 and Molt4. By simply changing the identity of the histone modification data set and slightly adapting the SeRQL queries we were quickly able to achieve raw H3K4me3–cTFBS integration results for these cell lines: 13 350 overlaps for HeLa, 13 350 for HFL-1, 13 315 for K562 and 13 341 for Molt4.
We further extended our analysis with additional UCSC data: HMM-analyzed H3K4me3 data from cell lines GM06990, HeLa and K562. The integration steps for these three data sets remained the same, again with the exception of the query step. The results were: 3289 overlaps for GM06990, 2134 for HeLa and 3273 for K562. Because HMM-analyzed data sets are much smaller than non-analyzed data, the query took 1.5 h (wall time) to complete for the whole-genome cTFBS data set and 2 min for cTFBS that are within an ENCODE region.
The UCSC Genome Browser contains an extensive number of genome annotation tracks that can potentially be integrated using our approach. There are 13 tracks containing 97 sub tracks that are almost identical to the data sets we used for our use case. These can be integrated almost directly, if we use the H3K4me3 data model to transform the tab-delimited data sets to RDF data. Also, the HistOn and tfbs knowledge models need to be populated with the new data sets and the SeRQL query needs to be changed to cover the new data sets. There are also 12 tracks containing 37 sub tracks that can be integrated requiring only minor concept additions to the tech model and/or a new data model in addition to the changes mentioned above (Supplementary Material, Table S1).
3.5 Data analysis and interpretation
Through SWEDI we have coupled H3K4me3 intensity scores to all cTFBS of each cTFt and we obtained raw integration results (Supplementary Material, Fig. S4). These results showed that the majority of cTFBS displayed a low H3K4me3 score. Applying an H3K4me3 score cutoff of >2 resulted in 1382 overlaps for GM06990, 984 for HeLa, 1063 for HFL-1, 1303 for K562 and 739 for Molt4. An in-depth analysis and interpretation is beyond the scope of this article. A brief preliminary analysis and interpretation of the results can be found in Supplementary Material, Fig. S5.
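The cutoff step amounts to a simple filter on the integration results. A minimal Python sketch is shown below; the representation of each overlap as a (cell line, score) pair and the example scores are made up for illustration and do not reproduce the reported counts.

```python
from collections import Counter


def count_above_cutoff(overlaps, cutoff=2.0):
    """Count, per cell line, the overlaps whose score exceeds the cutoff.

    The comparison is strict, matching a cutoff of >2.
    """
    counts = Counter()
    for cell_line, score in overlaps:
        if score > cutoff:
            counts[cell_line] += 1
    return counts


# Made-up (cell line, H3K4me3 score) pairs
example = [
    ("GM06990", 2.5),
    ("GM06990", 2.0),  # equal to the cutoff, not counted
    ("HeLa", 0.3),
    ("HeLa", 4.1),
]
print(count_above_cutoff(example))
```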
The analysis of the raw integration results from SWEDI on the HMM-analyzed H3K4me3 data from cell lines GM06990, HeLa and K562 showed essentially the same outcome as the original H3K4me3 data (data not shown).