We describe the implementation of DataRail, a flexible toolbox
for storing and manipulating experimental data for the purpose
of numerical modeling. Metadata in DataRail
is based on a ‘minimum
information’ MIDAS standard closely related to standards
that have already proven their utility in the analysis of DNA
microarray and other types of high-throughput data. Because
MIDAS is a simplified version of the MIACA standard, export
into a MIACA-compliant file is straightforward.
Based on several use cases with up to 1.5 × 10^6 data points
(see Table S1), DataRail
appears to be scalable and broadly
useful, thanks to its efficient reuse of primary data and data
processing algorithms. Compared to traditional relational databases, DataRail
is significantly easier to deploy and modify, and it
can accommodate a wider range of data formats since its internal
arrays can have any dimensionality. Careful management of arrays
via semantically typed identifiers (which also serve as URLs),
use of a strict containment hierarchy, and imposition of a metadata
standard take the place of the rigid tabular structure found
in relational databases. However, in cases in which data formats
stabilize, or greater transactional capacity is desired, all
or part of a DataRail
data model can be implemented in an RDBMS.
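The interplay of typed identifiers, a containment hierarchy and a record of processing steps described above can be illustrated with a minimal Python sketch. This is not DataRail's actual implementation or API: the class names (Project, DataArray), the identifier layout ("project:.../array:...") and the helper shape_of are hypothetical, and nested lists stand in for numeric arrays of arbitrary dimensionality.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


def shape_of(values: Any) -> Tuple[int, ...]:
    """Shape of a regular nested list; works for any number of dimensions."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)


@dataclass
class DataArray:
    uid: str                 # typed identifier (hypothetical layout)
    values: Any              # nested lists of arbitrary depth
    dim_labels: List[str]    # metadata: one label per dimension
    parent_uid: str = ""     # empty for primary data
    step: str = ""           # processing step that produced this array


@dataclass
class Project:
    """Container for arrays; every array lives inside exactly one project."""
    name: str
    arrays: Dict[str, DataArray] = field(default_factory=dict)

    def _uid(self, array_name: str) -> str:
        return f"project:{self.name}/array:{array_name}"

    def _store(self, arr: DataArray) -> DataArray:
        if len(arr.dim_labels) != len(shape_of(arr.values)):
            raise ValueError("need exactly one label per dimension")
        self.arrays[arr.uid] = arr
        return arr

    def add_primary(self, array_name: str, values: Any,
                    dim_labels: List[str]) -> DataArray:
        return self._store(DataArray(self._uid(array_name), values,
                                     list(dim_labels)))

    def derive(self, parent: DataArray, array_name: str,
               func: Callable[[Any], Any], step: str,
               dim_labels: Optional[List[str]] = None) -> DataArray:
        # Record the parent identifier and the step so provenance is kept.
        labels = list(parent.dim_labels if dim_labels is None else dim_labels)
        return self._store(DataArray(self._uid(array_name),
                                     func(parent.values), labels,
                                     parent.uid, step))

    def lineage(self, uid: str) -> List[Tuple[str, str]]:
        """Walk back from an array to the primary data it came from."""
        chain = []
        while uid:
            arr = self.arrays[uid]
            chain.append((arr.uid, arr.step))
            uid = arr.parent_uid
        return chain


project = Project("demo")
raw = project.add_primary(
    "raw", [[[1.0, 2.0], [3.0, 4.0]]], ["cell_type", "treatment", "time"])
scaled = project.derive(
    raw, "scaled",
    lambda v: [[[x / 4.0 for x in row] for row in plane] for plane in v],
    step="divide by 4")
history = project.lineage(scaled.uid)
```

In this sketch the identifier alone is enough to locate an array and to trace it back to primary data, which is the role a fixed table schema would otherwise play; nothing in it constrains the number of dimensions an array may have.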
The current DataRail implementation meets our original design goals in the following ways: (i) data provenance is maintained through the containment hierarchy, the record of processing steps, and the assignment of UIDs; (ii) visualization and modeling are possible with internal tools specialized to PCA and PLSR; (iii) interaction with external software such as CellNetAnalyzer, PottersWheel and Spotfire is implemented, and export routines are available to expand this list; (iv) flexibility is provided by the use of data arrays with user-determined dimensionality and a simple interface for adding new analysis routines; (v) data sharing and publication are facilitated by a special category of project that packages together transformed arrays and figures describing key analyses, including those in published papers.
Future developments in DataRail include the creation of utilities for managing image and mass spectrometry data, importers for a range of common laboratory instruments, and support for the HDF5 file format (http://hdf.ncsa.uiuc.edu/HDF5/). HDF is a widely supported, open-source format used in many fields dealing with large data sets, such as earth imaging or astronomy. HDF files are self-describing and permit access to binary data in a manner that is much more efficient than with XML. Moreover, integration of DataRail with Gaggle and similar interoperability standards is a high priority (Shannon et al., 2006). Gaggle coordinates multiple analysis tools, among them the R/Bioconductor statistical environment (Gentleman et al., 2004), thereby providing access to tools for the statistical analysis of high-throughput data. Finally, versions of DataRail based on the open-source languages R or Python are in development, and discussions are under way with instrument vendors to create direct export routines for MIDAS-compatible files. In the context of commercial use, we are discussing, with a commercial partner, the implementation of granular access control functionality.
A model-centric approach explicitly encodes specific hypotheses about data and its meaning, and can therefore merge data not only at the level of information but at the more useful level of knowledge. Even in the database-dependent world of business, knowledge is usually derived from information in specialized databases (data warehouses, which are static representations of transactional databases processed to ensure data consistency) using business intelligence tools. Business intelligence is, in essence, an approach to modeling business and financial processes mathematically and then testing the models on data. In the case of biological models, data plays an even more central role because many model parameters can be estimated only by induction from experimental observations. Thus, for mathematical models of biology to realize their full potential, a tight link between model and experiment is necessary. This involves not only an effective means to calibrate models, but also reliable information on data provenance. Only then can model-based predictions be evaluated in light of assumptions and uncertainties. DataRail therefore represents a step forward in the complex task of designing software that supports model-driven knowledge creation in biomedicine.