
The authors describe the implementation of DataRail, an open source MATLAB-based toolbox …


- Flexible informatics for linking experimental data to mathematical models via DataRail

We describe the implementation of DataRail, a flexible toolbox for storing and manipulating experimental data for the purpose of numerical modeling. Metadata in DataRail is based on a ‘minimum information’ MIDAS standard closely related to standards that have already proven their utility in the analysis of DNA microarray and other types of high-throughput data. Because MIDAS is a simplified version of the MIACA standard, export from DataRail into a MIACA-compliant file is straightforward. Based on several use cases with up to 1.5 × 10^6 data points (see Table S1), DataRail appears to be scalable and broadly useful, thanks to its efficient reuse of primary data and data processing algorithms. Compared to traditional relational databases, DataRail is significantly easier to deploy and modify, and it can accommodate a wider range of data formats since its internal arrays can have any dimensionality. Careful management of arrays via semantically typed identifiers (which also serve as URLs), use of a strict containment hierarchy, and imposition of a metadata standard take the place of the rigid tabular structure found in relational databases. However, in cases in which data formats stabilize, or greater transactional capacity is desired, all or part of a DataRail data model can be implemented in an RDBMS.
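To make the contrast with a fixed tabular schema concrete, the following Python sketch shows one way a strict containment hierarchy, arrays of arbitrary dimensionality, and typed path-like identifiers could fit together. It is an illustration only: the class names (`Store`, `Project`, `DataArray`), the identifier syntax, and the sample values are assumptions made for this example and are not taken from the DataRail code base, which is written in MATLAB.

```python
from dataclasses import dataclass, field


@dataclass
class DataArray:
    """Labeled array of arbitrary dimensionality (hypothetical structure)."""
    name: str
    dim_labels: tuple                            # e.g. ("time", "stimulus", "readout")
    values: dict = field(default_factory=dict)   # index tuple -> measurement

    def set(self, index, value):
        # The rank of every index must match the declared dimensions.
        if len(index) != len(self.dim_labels):
            raise ValueError("index rank must match the number of dimensions")
        self.values[tuple(index)] = value


@dataclass
class Project:
    """Container for arrays; each array belongs to exactly one project."""
    name: str
    arrays: dict = field(default_factory=dict)

    def add(self, array):
        self.arrays[array.name] = array
        return array


@dataclass
class Store:
    """Top of the containment hierarchy (hypothetical name)."""
    name: str
    projects: dict = field(default_factory=dict)

    def add(self, project):
        self.projects[project.name] = project
        return project

    def identifier(self, project_name, array_name):
        # Assumed identifier syntax: each path segment carries its type.
        return f"store:{self.name}/project:{project_name}/array:{array_name}"

    def resolve(self, identifier):
        # Split "type:name" segments and walk down the hierarchy.
        parts = dict(segment.split(":", 1) for segment in identifier.split("/"))
        if parts["store"] != self.name:
            raise KeyError(identifier)
        return self.projects[parts["project"]].arrays[parts["array"]]


store = Store("liver")
project = store.add(Project("screen"))
raw = project.add(DataArray("raw", ("time", "stimulus", "readout")))
raw.set((0, 1, 2), 0.73)                         # made-up measurement
uid = store.identifier("screen", "raw")
assert store.resolve(uid) is raw
```

Because the identifier encodes the position of an array in the hierarchy, no table schema has to change when an experiment adds a dimension; only `dim_labels` differs between arrays.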

The current DataRail implementation meets our original design goals in the following ways: (i) data provenance is maintained through the containment hierarchy, the record of processing steps, and the assignment of UIDs; (ii) visualization and modeling are possible with internal tools specialized to PCA and PLSR; (iii) interaction with external software such as CellNetAnalyzer, PottersWheel and Spotfire is implemented, and export routines are available to expand this list; (iv) flexibility is provided by the use of data arrays with user-determined dimensionality and a simple interface for adding new analysis routines; (v) data sharing and publication are facilitated by a special category of project that packages together transformed arrays and figures describing key analyses, including those in published papers. Future developments in DataRail include the creation of utilities for managing image and mass spectrometry data, importers for a range of common laboratory instruments, and support for the HDF5 file format. HDF is a widely supported, open-source format used in many fields dealing with large data sets, such as earth imaging or astronomy. HDF files are self-describing and permit access to binary data in a manner that is much more efficient than with XML rules. Moreover, integration of DataRail with Gaggle and similar interoperability standards is a high priority (Shannon et al., 2006). Gaggle coordinates multiple analysis tools, among them the R/Bioconductor statistical environment (Gentleman et al., 2004), thereby providing access to tools for the statistical analysis of high-throughput data. Finally, versions of DataRail based on the open-source languages R or Python are in development, and we are in discussions with instrument vendors about creating direct export routines for MIDAS-compatible files. In the context of commercial use, we are discussing, with a commercial partner, the implementation of granular access control functionality.
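Design goal (i) can be illustrated with a short Python sketch in which every transformed array keeps the UID of its parent and a description of the processing step that produced it, so the full history can be recovered by walking back to the primary data. The names `TrackedArray`, `apply_step` and `lineage`, the use of random UUIDs, and the sample numbers are assumptions made for this example; they do not describe how DataRail itself assigns UIDs or stores its processing record.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrackedArray:
    """Data plus the minimum needed to trace where it came from (hypothetical)."""
    values: list
    parent_uid: Optional[str] = None   # None marks primary (unprocessed) data
    step: Optional[str] = None         # description of the processing step
    uid: str = field(default_factory=lambda: uuid.uuid4().hex)


def apply_step(source, func, description):
    """Return a new array; the source is left untouched so it can be reused."""
    return TrackedArray(
        values=[func(v) for v in source.values],
        parent_uid=source.uid,
        step=description,
    )


def lineage(array, registry):
    """List processing steps from primary data up to `array`, oldest first."""
    steps = []
    while array.parent_uid is not None:
        steps.append(array.step)
        array = registry[array.parent_uid]
    return list(reversed(steps))


primary = TrackedArray([2.0, 4.0, 8.0])                      # made-up values
scaled = apply_step(primary, lambda v: v / 8.0, "scale by maximum")
shifted = apply_step(scaled, lambda v: v - 0.5, "subtract 0.5")

registry = {a.uid: a for a in (primary, scaled, shifted)}
print(lineage(shifted, registry))   # ['scale by maximum', 'subtract 0.5']
```

Keeping the primary array immutable and deriving new arrays from it mirrors the reuse of primary data described above: several analyses can branch from the same source without overwriting one another.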

A model-centric approach explicitly encodes specific hypotheses about data and its meaning, and can therefore merge data not only at the level of information but at the more useful level of knowledge. Even in the database-dependent world of business, knowledge is usually derived from information in specialized databases (data warehouses, which are static representations of transactional databases processed to ensure data consistency) using business intelligence tools. Business intelligence is, in essence, an approach to modeling business and financial processes mathematically and then testing the models on data. In the case of biological models, data plays an even more central role because many model parameters can be estimated only by induction from experimental observations. Thus, for mathematical models of biology to realize their full potential, a tight link between model and experiment is necessary. This involves not only an effective means to calibrate models, but also reliable information on data provenance. Only then can model-based predictions be evaluated in light of assumptions and uncertainties. DataRail therefore represents a step forward in the complex task of designing software that supports model-driven knowledge creation in biomedicine.

Updated on: 3 Nov 2008
