2.1 Data
All data sets were downloaded from the UCSC genome browser website (http://genome.ucsc.edu/). We used H3K4me3 data from the ChIP-on-chip data set produced by the Sanger Institute (http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=88704835&g=encodeSangerChip) for five human cell lines: GM06990, HeLaS3, HFL-1, K562 and MOLT-4. Each data set contains the locations on the human genome where H3K4me3 is present, together with an intensity score that indicates the amount of H3K4me3 at that position (ftp://ftp.sanger.ac.uk/pub/encode/H3K4me3_GM06990_2/README). The Sanger data set is an ENCODE region-wide H3K4me3 analysis, which covers 1% of the total human genome. These motif sequences are usually represented by position weight matrices of conserved TFBS types (cTFt). With these cTFt matrices, genomic locations that closely match a matrix can be identified as potential conserved TFBS (cTFBS) for each cTFt. The cTFBS data was generated by UCSC using 410 binding matrices from the Transfac Matrix and Factor database v8.3 from Biobase (http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=88704835&g=tfbsConsSites). In essence, the track lists each cTFt and the predicted occurrences of its associated cTFBS across the whole human genome. The Hidden Markov Model (HMM) data identifies hit regions in the Sanger data set, based on a two-state HMM analysis performed by EBI (http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=86022194&g=encodeSangerChipHits). We used HMM data for the three available human cell lines: GM06990, HeLaS3 and K562.
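To illustrate how a position weight matrix can be used to identify matching locations in a sequence, the following minimal sketch scores every window of a sequence against a log-odds matrix. It is not the procedure used by UCSC to generate the cTFBS track: the count matrix, the uniform background frequency, the pseudocount and the score threshold are all invented for illustration, and no conservation filtering is applied.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
# The counts are invented for illustration and are not taken from Transfac.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert base counts to log2-odds scores against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((n + pseudocount) / total) / background)
            for base, n in column.items()
        })
    return pwm


def scan(pwm, sequence, threshold):
    """Return (start, window, score) for each window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    pwm = build_pwm(COUNTS)
    for start, window, score in scan(pwm, "TTACGTAA", threshold=4.0):
        print(start, window, round(score, 2))
```

In this sketch, start positions are 0-based offsets into the scanned sequence. A higher threshold yields fewer but more motif-like hits.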
2.2 Semantic web technology
We created the data models, knowledge models and linkage statement files using Protégé 3.1.1 with OWL plug-in V2.1 (http://protege.stanford.edu/). To visualize the data sets within the knowledge models, we used Protégé to create individuals for each data set within the corresponding concept. To transform the tab-delimited data into RDF/XML, we used a version of Mapper (https://gforge.vl-e.nl/projects/mapper) that we modified for RDF output. The transformed data was loaded into Sesame v1.2.6 (http://www.openrdf.org/). Subsequently, SeRQL queries (Supplementary Material, Fig. S1) were constructed to find cTFBS that overlapped with H3K4me3 regions. Sesame was run on a server with two 2.8 GHz Intel Xeon processors and 4 GB of main memory.
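The overlap condition that such a query has to express can be sketched in plain Python. The example below is not the SeRQL query from Fig. S1 and does not use Sesame. It assumes a hypothetical four-column tab-delimited layout (chromosome, start, end, label) with half-open coordinates, and all records are invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 0-based, inclusive
    end: int    # assumed exclusive (half-open interval)
    label: str


def parse_tab(text: str) -> List[Region]:
    """Parse tab-delimited lines: chrom, start, end, label (hypothetical layout)."""
    regions = []
    for line in text.strip().splitlines():
        chrom, start, end, label = line.split("\t")
        regions.append(Region(chrom, int(start), int(end), label))
    return regions


def overlaps(a: Region, b: Region) -> bool:
    """Two half-open intervals overlap if each starts before the other ends."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_overlapping(sites: List[Region],
                     marks: List[Region]) -> List[Tuple[Region, Region]]:
    """Return every (site, mark) pair that overlaps; simple O(n*m) comparison."""
    return [(s, m) for s in sites for m in marks if overlaps(s, m)]


if __name__ == "__main__":
    # Invented example records, not taken from the UCSC tracks.
    sites = parse_tab("chr7\t100\t110\tsiteA\nchr7\t105\t115\tsiteB\n"
                      "chr7\t500\t510\tsiteC\nchr11\t100\t110\tsiteD")
    marks = parse_tab("chr7\t90\t105\tH3K4me3")
    for site, mark in find_overlapping(sites, marks):
        print(site.label, "overlaps", mark.label)
```

The pairwise comparison is adequate for small inputs. For genome-scale tracks, sorting both lists by chromosome and start position and sweeping through them once would avoid the quadratic cost.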