
The authors describe the implementation of DataRail, an open source MATLAB-based toolbox …


- Flexible informatics for linking experimental data to mathematical models via DataRail

A fundamental goal of systems biology is constructing mathematical models that elucidate key features of biological processes as they exist in real cells. A critical step in realizing this goal is effectively calibrating models against experimental data. The challenges of model calibration are well recognized (Jaqaman and Danuser, 2006), but we have found systematizing and processing data prior to calibration to be tricky as well. This is particularly true as the volume of data or the complexity of models grows. Few information systems exist to organize, store and normalize the wide range of experimental data encountered in contemporary molecular biology in a sufficiently systematic manner to maintain provenance while retaining the adaptability necessary to accommodate changing methods. Partly as a consequence, relatively few complex physiological processes have been modeled using a combination of theory and high-throughput experimental data.

An information management system for experimental data must record data provenance and experimental conditions, maintain data integrity as various numerical transformations are performed, describe data in terms of a standardized terminology, promote data reuse and facilitate data sharing. The most common way to meet these requirements is via a relational database management system (RDBMS; see SBEAMS or Bioinformatics Resource Manager for relevant examples; Shah et al., 2006). Databases in biology resemble those previously developed for business and have proven spectacularly successful in managing data on DNA and protein sequences. In a relational database, the subdivision of information and its subsequent storage into cross-indexed tables follows a precise, predefined schema. The granularity and stability of the schema allow an RDBMS to identify and maintain links between disparate pieces of information, even in the face of frequent read–write operations. However, this power comes at a considerable cost in terms of inflexibility. It is difficult for a relational database to accommodate frequent changes in the formats of data or metadata, and to incorporate unstructured information.
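The rigidity described above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The table name, column names and values here are made up for illustration and are not taken from SBEAMS, Bioinformatics Resource Manager or any other real system.

```python
import sqlite3

# In-memory database with a fixed, predefined schema (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurement ("
    " cell_type TEXT NOT NULL,"
    " analyte TEXT NOT NULL,"
    " value REAL NOT NULL)"
)

# A record that matches the schema is stored without trouble.
conn.execute(
    "INSERT INTO measurement (cell_type, analyte, value) VALUES (?, ?, ?)",
    ("hepatocyte", "ERK", 0.82),  # illustrative value, not real data
)

# A new assay reports an extra field (a hypothetical 'time_min').
# The schema has no such column, so the insert is rejected.
try:
    conn.execute(
        "INSERT INTO measurement (cell_type, analyte, value, time_min)"
        " VALUES (?, ?, ?, ?)",
        ("hepatocyte", "ERK", 0.91, 30),
    )
    rejected = False
except sqlite3.OperationalError as err:
    rejected = True
    print("schema rejected the new field:", err)

# Accommodating the new field requires an explicit schema migration.
conn.execute("ALTER TABLE measurement ADD COLUMN time_min REAL")
conn.execute(
    "INSERT INTO measurement (cell_type, analyte, value, time_min)"
    " VALUES (?, ?, ?, ?)",
    ("hepatocyte", "ERK", 0.91, 30),
)
row_count = conn.execute("SELECT COUNT(*) FROM measurement").fetchone()[0]
```

Adding one column is simple in this toy case. The difficulty the text describes arises when such changes are frequent and must be propagated through many cross-indexed tables and the software that depends on them.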

Whereas the sequence of a human gene represents valuable information independent of how sequencing was performed or of the individual from whom the DNA was obtained (a statement that remains true despite the value of characterizing sequence variations), such is not the case for measures of protein activity or cellular state. Such biochemical and physiological data are highly context dependent. Data on ERK kinase activity, for example, are uninformative in the absence of information on cell type, growth conditions, etc. Moreover, a wide range of techniques are used to make biochemical and physiological measurements, and both the assays and the data they generate change over time as new methods are developed (e.g. in imaging, see Swedlow et al., 2003). Context dependence and rapidly changing data formats pose fundamental problems for databases because RDBMS schemas are not easily modified.

Moreover, even if effective metadata standards are developed to describe the context dependence of experimental findings, data from different experiments cannot be reconciled simply by storing them in a single database. Subtle distinctions must be made about different types of data, and biological insight must be brought to bear. Currently this is performed implicitly in the minds of individual investigators, but we envision a future in which the unique ability of mathematical models to formalize hypotheses and manage contingent information makes them the primary repositories of biological knowledge. As we work towards a model-centric future, it is our contention that information systems based solely on relational databases are unnecessarily limiting; rarely do we modify a difficult experiment simply to conform to a pre-existing database schema (whereas conformity to uniform, even arbitrary, standards is a strength for a business database). New approaches to data management that reconcile competing requirements for flexibility and structure are required.

One response to the challenges of systematizing biological data has been the creation of lightweight data standards focused on the most important metadata. Pioneered by the Microarray and Gene Expression Data Society's Minimum Information about a Microarray Experiment (MIAME), these ‘minimum information’ approaches typically define a simple data model that can be instantiated as an XML file, a database schema, etc. A strength of ‘minimum information’ standards is that they specify the subset of the metadata that is relatively constant among ever-shifting and context-sensitive experiments. The philosophy is that of the Pareto principle or 80-20 rule, namely that 80% of the information can be captured with 20% of the effort, whereas the final 20% requires exponentially greater effort. An underlying assumption is that a minimum information standard successfully records the information needed to make experimental data intelligible.

In this article we implement an information processing system, DataRail, intended to bridge the gap between data acquisition and modeling. A new minimum information standard (MIDAS) is part of the DataRail system, but a series of additional tools are also applied to maintain the provenance of data and ensure its integrity through multiple steps of numerical manipulation. DataRail is model-centric rather than data-centric in that the task of creating and transmitting knowledge is invested in mathematical models constructed using the software, rather than in the data storage system itself, but it is designed to support existing modeling tools rather than to serve as an integrated modeling environment itself. We illustrate this capacity in DataRail using a large set of protein measurements derived from primary and transformed hepatocytes; through the use of DataRail we derive insight both into the biology of these cell types and into the optimal means by which to perform partial least squares regression (PLSR) modeling of cue-signal-response data.
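The combination of a small required set of metadata with provenance tracking across numerical transformations can be sketched as follows. This is only an illustration of the general idea: the field names, the `Dataset` class and the `checksum` helper are hypothetical and do not represent the MIDAS specification or DataRail's actual (MATLAB) interface, and the values are made up.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical minimal metadata fields; NOT the actual MIDAS specification.
REQUIRED_FIELDS = ("cell_type", "treatment", "time_min", "analyte")


def checksum(values):
    """Return a short digest of the values so later changes are detectable."""
    payload = json.dumps(values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class Dataset:
    metadata: dict
    values: list
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Enforce only the minimal fields; extra free-form fields are allowed.
        missing = [k for k in REQUIRED_FIELDS if k not in self.metadata]
        if missing:
            raise ValueError(f"missing required metadata: {missing}")
        if not self.history:
            self.history.append(("raw", checksum(self.values)))

    def transform(self, name, func):
        """Apply func to each value and return a new Dataset.

        The parent is left untouched and the step is appended to the history.
        """
        new_values = [func(v) for v in self.values]
        new_history = self.history + [(name, checksum(new_values))]
        return Dataset(dict(self.metadata), new_values, new_history)


raw = Dataset(
    metadata={
        "cell_type": "hepatocyte",
        "treatment": "ligand A",
        "time_min": 30,
        "analyte": "ERK",
        "operator_note": "plate 2",  # extra, unstructured field is accepted
    },
    values=[2.0, 4.0, 8.0],  # illustrative values, not real data
)
peak = max(raw.values)
normalized = raw.transform("divide_by_max", lambda v: v / peak)
steps = [name for name, _ in normalized.history]
```

Here each derived dataset carries the names of the steps that produced it together with a digest of its values, so a normalized array can be traced back to its raw source. Returning a new object from `transform` rather than modifying the original is one simple way to keep earlier stages intact.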

updated on: 3 Nov 2008
