Flexible informatics for linking experimental data to mathematical models via DataRail



Julio Saez-Rodriguez 1,2,†, Arthur Goldsipe 1,3,†, Jeremy Muhlich 1,2, Leonidas G. Alexopoulos 1,2, Bjorn Millard 1,2, Douglas A. Lauffenburger 1,3 and Peter K. Sorger 1,2,3

1Center for Cell Decision Processes, 2Department of Systems Biology, Harvard Medical School, Boston, MA 02115 and 3Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139 

Bioinformatics 2008 24(6):840-847. Open Access Article.



Motivation: Linking experimental data to mathematical models in biology is impeded by the lack of suitable software to manage and transform data. Model calibration would be facilitated and models would increase in value were it possible to preserve links to training data along with a record of all normalization, scaling, and fusion routines used to assemble the training data from primary results.

Results: We describe the implementation of DataRail, an open source MATLAB-based toolbox that stores experimental data in flexible multi-dimensional arrays, transforms arrays so as to maximize information content, and then constructs models using internal or external tools. Data integrity is maintained via a containment hierarchy for arrays, imposition of a metadata standard based on a newly proposed MIDAS format, assignment of semantically typed universal identifiers, and implementation of a procedure for storing the history of all transformations with the array. We illustrate the utility of DataRail by processing a newly collected set of ~22 000 measurements of protein activities obtained from cytokine-stimulated primary and transformed human liver cells.

Availability: DataRail is distributed under the GNU General Public License and available at http://code.google.com/p/sbpipeline/



A fundamental goal of systems biology is constructing mathematical models that elucidate key features of biological processes as they exist in real cells. A critical step in realizing this goal is effectively calibrating models against experimental data. The challenges of model calibration are well recognized (Jaqaman and Danuser, 2006) but we have found systematizing and processing data prior to calibration to be tricky as well. This is particularly true as the volume of data or the complexity of models grows. Few information systems exist to organize, store and normalize the wide range of experimental data encountered in contemporary molecular biology in a sufficiently systematic manner to maintain provenance while retaining the adaptability necessary to accommodate changing methods. Partly as a consequence, relatively few complex physiological processes have been modeled using a combination of theory and high throughput experimental data.

An information management system for experimental data must record data provenance and experimental conditions, maintain data integrity as various numerical transformations are performed, describe data in terms of a standardized terminology, promote data reuse and facilitate data sharing. The most common way to meet these requirements is via a relational database management system (RDBMS; see SBEAMS, http://www.sbeams.org, or Bioinformatics Resource Manager for relevant examples; Shah et al., 2006). Databases in biology resemble those previously developed for business and have proven spectacularly successful in managing data on DNA and protein sequences. In a relational database, the subdivision of information and its subsequent storage into cross-indexed tables follows a precise, predefined schema. The granularity and stability of the schema allows an RDBMS to identify and maintain links between disparate pieces of information, even in the face of frequent read–write operations. However, this power comes at a considerable cost in terms of inflexibility. It is difficult for a relational database to accommodate frequent changes in the formats of data or metadata, and to incorporate unstructured information.

Whereas the sequence of a human gene represents valuable information independent of how sequencing was performed or of the individual from whom the DNA was obtained (a statement that remains true despite the value of characterizing sequence variations), such is not the case for measures of protein activity or cellular state. Such biochemical and physiological data are highly context dependent. Data on ERK kinase activity, for example, are uninformative in the absence of information on cell type, growth conditions, etc. Moreover, a wide range of techniques are used to make biochemical and physiological measurements, and both the assays and the data they generate change over time, as new methods are developed (e.g. in imaging see Swedlow et al., 2003). Context dependence and rapidly changing data formats pose fundamental problems for databases because RDBMS schemas are not easily modified.

Moreover, even if effective metadata standards are developed to describe the context-dependence of experimental findings, data from different experiments cannot be reconciled simply by storing them in a single database. Subtle distinctions must be made about different types of data and biological insight brought to bear. Currently this is performed implicitly in the minds of individual investigators, but we envision a future in which the unique ability of mathematical models to formalize hypotheses and manage contingent information makes them the primary repositories of biological knowledge. As we work towards a model-centric future, it is our contention that information systems based solely on relational databases are unnecessarily limiting; rarely do we modify a difficult experiment simply to conform to a pre-existing database schema (whereas conformity to uniform, even arbitrary, standards is a strength for a business database). New approaches to data management that reconcile competing requirements for flexibility and structure are required.

One response to the challenges of systematizing biological data has been the creation of lightweight data standards focused on the most important metadata. Pioneered by the Microarray and Gene Expression Data Society's Minimum Information about a Microarray Experiment (MIAME), these ‘minimum information’ approaches typically define a simple data model that can be instantiated as an XML file, a database schema, etc. A strength of ‘minimum information’ standards is that they specify that subset of the metadata that is relatively constant among ever-shifting and context-sensitive experiments. The philosophy is that of the Pareto principle or 80-20 rule, namely that 80% of the information can be captured with 20% of the effort whereas the final 20% requires exponentially greater effort. An underlying assumption is that a minimum information standard successfully records the information needed to make experimental data intelligible.

In this article we implement an information processing system, DataRail, intended to bridge the gap between data acquisition and modeling. A new minimum information standard (MIDAS) is part of the DataRail system, but a series of additional tools are also applied to maintain the provenance of data and ensure its integrity through multiple steps of numerical manipulation. DataRail is model- rather than data-centric in that the task of creating and transmitting knowledge is invested in mathematical models constructed using the software, rather than the data storage system itself, but it is designed to support existing modeling tools rather than serve itself as an integrated modeling environment. We illustrate this capacity in DataRail using a large set of protein measurements derived from primary and transformed hepatocytes; through the use of DataRail we derive insight both into the biology of these cell types and the optimal means by which to perform partial least squares regression (PLSR) modeling of cue-signal-response data.


2.1 Design goals and implementation
To facilitate the collection, annotation and transformation of experimental data, DataRail software is designed to meet the following specific requirements (see Fig. 1): (i) serve as a stable repository for experimental results of different types while recording key properties of the biological setting and complete information about all data processing steps; (ii) promote model development and analysis via internal visualization and modeling capabilities; (iii) interact efficiently and transparently with external modeling and mining tools; (iv) meet new requirements in data collection, annotation and transformation as they arise and (v) facilitate data sharing and publication through compatibility with existing bioinformatics standards. A system meeting these requirements was designed in which data is stored in a succession of regular multi-dimensional arrays, known as ‘data cubes’ in information technology (Gray et al., 1997), each representing transformations of an original set of primary data. The integrity of data is maintained by tagging the primary data with metadata referenced to a controlled ontology, storing all arrays arising from the same primary data in one file structure, documenting the relationships of arrays to each other, storing algorithms used for data transformation with data arrays and assigning each data structure a unique identifier (UID) based on a controlled semantic. DataRail was implemented as a MATLAB toolbox (http://www.mathworks.com/) with scripting and GUI-based interaction and incorporating a variety of data processing algorithms. DataRail works best as a component of a loosely coupled set of software tools including commercial data mining packages such as Spotfire (http://spotfire.tibco.com/) or public toolboxes for modeling. In addition, DataRail is designed to communicate with a semantic Wiki, to be described in a separate paper but available at the DataRail download site, that is better designed for storing textual information, such as experimental protocols, and that documents DataRail's use of UIDs.

2.2 System overview
Information in DataRail arising from a single set of experiments is organized into a compendium, which consists of multiple n-dimensional data arrays, each of which contains either primary data or processed data (see Fig. 2). It is left up to users to determine the breadth of experimental data included within each compendium, but good practice is to group results with similar experimental aims, biological setting or place of publication into one compendium. DataRail also supports creation of containers for multiple compendia known as projects. The dimensionality of arrays containing primary data is determined by the user at the time of import, making it possible to accommodate a wide range of experimental approaches and measurement technologies. For example, measuring a few properties of many samples by flow cytometry generates an array of different dimensionality than measuring many variables in a few samples by mass spectrometry. In practice, data in our laboratory can usually be described in six dimensions: three for the experimental conditions (e.g. cell type, cytokine stimuli and small-molecule treatment), one for time, one for experimental replicates and one for actual measurements.

Arrays of transformed data are generated from primary data by applying numerical algorithms that normalize, scale or otherwise increase accuracy and information content. Algorithms used during data processing, along with the values of all free parameters, are stored with each array to maintain a complete record of all transformations performed prior to data mining or modeling.

2.3 Test cases
We have tested DataRail on seven sets of recent data available in our laboratories, containing between 5 × 10^3 and ~1.6 × 10^6 data points. Each set had a unique structure and gave rise to arrays with 4–6 dimensions (see Supplementary Table S1). Here we discuss the analysis of a ‘CSR Liver compendium’, a cue-signal-response dataset (Gaudet et al., 2005) comprising 22 512 measurements in primary human hepatocytes and a hepatocarcinoma cell line (HepG2 cells; L.A. et al., unpublished data). In this compendium, cells were exposed to 11 cytokine treatments and 8 small-molecule drugs, upon which the states of phosphorylation of 17 signaling proteins (at 30 min and 3 h) and the concentrations of 50 extracellular cytokines (at 12 and 24 h) were measured using bead-based micro-ELISA assays.

2.4 Storing primary data and metadata
Tagging primary data with metadata is essential to its utility and involves two aspects of DataRail: a new metadata standard and a process for actually collecting the metadata. The metadata standard is based on our proposed MIDAS format (Minimum Information for Data Analysis in Systems Biology) that is itself based on pre-existing minimum-information standards such as MIACA (Minimum Information About a Cellular Assay, http://miaca.sourceforge.net/). MIDAS is a tabular (or spreadsheet) format that specifies the layout of experimental data files that gives rise, upon import into DataRail, to an n-dimensional data array. The MIDAS format was derived from the ‘experimental module’ concept in MIACA, with modifications required for model-centric data management (see Fig. 3). Typically a MIDAS file is used to input information from instruments into DataRail and to export information from DataRail into other software that uses spreadsheets. However, export from a data array to a MIDAS file entails loss of information about data provenance and prior processing steps. We are therefore in the process of implementing a standardized format for exchanging DataRail files that does not depend on the use of MATLAB files (see Section 3 for details). Each row in a MIDAS table represents a single experimental sample; each column represents one sample attribute, such as identity (e.g. multi-well plate name or well coordinate), treatment condition, or value obtained from an experimental assay. A column header consists of two values: (i) a two-letter code defining the type of column (e.g. TR for treatment, DV for data value), and (ii) a short column name (e.g. a small molecule inhibitor added or a protein assayed). The body of each column stores the corresponding value for each row (sample) such as a plate/well name, reagent concentration, time point, or data value (see Supplementary Materials for details and example MIDAS spreadsheets). MIDAS is designed to fulfill the need for data exchange and analysis within a closely knit research group. It is not a stand-alone solution for archival storage or publication and should be implemented in conjunction with MIAME, MIACA or DataRail itself.
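As an illustration only, the row-per-sample, typed-column layout described above can be sketched in Python (DataRail itself is a MATLAB toolbox). The `CODE:name` header syntax, the `ID` and `DA` codes, the sample values and the `parse_midas` helper are all assumptions made for this sketch; the text names only TR and DV as example codes, and the authoritative layout is given in the Supplementary Materials.

```python
import csv
import io

# Hypothetical MIDAS-style table. The "CODE:name" header syntax and the
# ID/DA column codes are assumptions; only TR (treatment) and DV (data
# value) are named in the text. All values are invented.
MIDAS_TEXT = """\
ID:well,TR:IL1a,TR:p38i,DA:time_min,DV:Hsp27,DV:cJun
A1,0,0,30,120.0,80.0
A2,1,0,30,950.0,310.0
A3,1,1,30,210.0,295.0
"""


def parse_midas(text):
    """Split each header into (type code, name) and group row values by code."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    columns = []
    for field in header:
        code, _, name = field.partition(":")
        if len(code) != 2 or not name:
            raise ValueError(f"malformed column header: {field!r}")
        columns.append((code, name))

    # One dictionary per sample (row); values are kept as strings here.
    samples = []
    for row in body:
        sample = {}
        for (code, name), value in zip(columns, row):
            sample.setdefault(code, {})[name] = value
        samples.append(sample)
    return columns, samples


columns, samples = parse_midas(MIDAS_TEXT)
```

Grouping values by column type is what would let an importer decide which columns span the dimensions of the resulting array (treatments, time) and which hold the measurements.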

The sequence of steps involved in entering metadata into DataRail is designed to accommodate the rhythm of a typical laboratory in which simple annotation is possible while experiments are in progress, but detailed data analysis is performed subsequently. As an experiment is being designed, a MIDAS file specifying the dimensionality and format of the data (treatments, time points, readouts, etc.) is created, and scripts specialized to different instruments or experimental methodologies are then used to add results to the ‘empty’ MIDAS file. Thus far we have written a script to import bead-based micro-ELISA data generated by a Luminex reader running BioRad software (Bio-Plex). We have also implemented a general purpose Java program for MIDAS file creation that can be used to import data into DataRail, used as a stand-alone application, or integrated into other software. Within the MIDAS layout utility, wells that will be treated in a similar or sequential manner are selected via a GUI and appropriate descriptions of the samples added via pop-up tabs (see Supplementary Fig. S1). When layout is complete, a correctly formatted MIDAS file is generated, ready for the addition of data. Lists that assist experimentation are also created (these lists typically specify times of reagent addition, sample withdrawal, etc.). We invite instrument manufacturers to incorporate this utility into their software so that creation of MIDAS-compliant files is automatic; the code is therefore distributed under a non-viral caBIG open source license developed by the National Cancer Institute. If a MIDAS file has not been generated at the outset of an experiment, it is possible to convert experimental data at any point prior to import, but in this case MIDAS-associated support tools are not available to help with experiments.

As mentioned above, DataRail need not be used in combination with SBWiki, a wiki based on semantic web technology (Berners-Lee, 2001). For the current discussion, four features of SBWiki are important. First, a web form used for upload prompts users to enter the metadata such as user name, date, cell type, etc., required for full MIDAS compliance, and this data is stored as a wiki page. Because continuous web access is easy to arrange, even for geographically dispersed instruments, users record metadata when files are first saved to a central location. This is very important in practice because metadata is rarely added when the process is cumbersome or separated in time from data collection. Second, use of semantic web forms makes it possible to create simple, familiar and easily modified interfaces while collecting structured information. In contrast, tools for accessing metadata in traditional databases or XML files are more difficult to use and require considerable expertise to modify. Third, as data is imported it is assigned a UID by SBWiki itself, which directly encodes, among other things, the type of data and the person who created it (see Supplementary Materials). The assignment of a UID makes it possible to track the origin of all data in DataRail, independent of the array-compendium-project structure. Fourth, although metadata describing key aspects of experiments are stored internal to the MIDAS file, complete details of experimental protocols and reagents are stored in SBWiki. Storage external to the MIDAS file allows complex textual information to be modified and reused more easily. Links from data arrays to external files are made via URLs that follow the UID scheme described above and can use the revision history in SBWiki to reference a specific version of a protocol or reagent.

When constructing the CSR Liver compendium, a spreadsheet generated by Bio-Plex software was appended to a MIDAS file, and a second MIDAS file containing data on total protein concentrations was generated using a plate reader. Overall three primary data arrays were created from CSR Liver data: one recording phosphorylation states of 17 proteins at three time points (0, 30 min, 3 h), one recording extracellular cytokine concentrations, also at three time points (0, 12 h, 24 h), and one recording total protein concentration. In principle these arrays could be combined to bundle data together, but the resulting single array would be sparsely populated. In addition, bundling data into a single array is not the same as fusing different types of data. The fusion of flow cytometry, Western blot and live-cell imaging data (J.A. et al., unpublished data) is facilitated by DataRail but also requires biological insight and problem-specific modeling.

2.5 Adding transformed arrays to compendia
Once primary data is imported into a new compendium, it is then transformed by one or more algorithms internal to DataRail, by user-specified algorithms, or by external programs, to create a new transformed data array. Transformations can change numerical values within an array or can expand or collapse the dimensionality of arrays. A long time series, for example, can be transformed into a shorter series involving selected times, time-averaged data or integrated values. When a transformation is performed on an array, the code used for the transformation and the values of free parameters are stored, along with a reference to the input data (in the current implementation, the algorithms themselves are recorded as the text of MATLAB functions), so that the compendium is a self-documenting entity, in which the provenance of data can be tracked.
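The record-keeping idea in this paragraph can be sketched in a few lines of Python. This is not DataRail's MATLAB implementation: the dictionary layout, the field names and the `apply_transformation` helper are hypothetical, and arrays are reduced to flat lists for brevity. The one feature mirrored from the text is that the transformation is kept as source text, together with its free parameters and a reference to the input array.

```python
# Transformation stored as text, mirroring the idea of recording the
# algorithm itself alongside the data (hypothetical example function).
SUBTRACT_BACKGROUND_SRC = '''
def subtract_background(values, background):
    return [v - background for v in values]
'''


def apply_transformation(compendium, input_name, output_name,
                         source, func_name, **params):
    """Run a stored transformation and append a provenance record."""
    namespace = {}
    exec(source, namespace)          # compile the recorded source text
    func = namespace[func_name]

    parent = compendium[input_name]
    record = {
        "input": input_name,         # reference to the input array
        "function": func_name,
        "source": source,            # full text of the algorithm
        "parameters": dict(params),  # values of all free parameters
    }
    compendium[output_name] = {
        "values": func(parent["values"], **params),
        "history": parent["history"] + [record],
    }
    return compendium[output_name]


# A toy compendium with a single primary array (invented numbers).
compendium = {"primary": {"values": [10.0, 12.5, 30.0], "history": []}}
corrected = apply_transformation(
    compendium, "primary", "background_corrected",
    SUBTRACT_BACKGROUND_SRC, "subtract_background", background=10.0,
)
```

Because each derived array carries the full history of its parent plus one new record, any array can be traced back to primary data without consulting an external log.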

Overall, DataRail can perform a diversity of transformations falling into several general categories. Simple arithmetic operations include subtracting background from primary data, or dividing one type of data by another (see Supplementary Fig. S2). For example, Bio-Plex-based measures of protein phosphorylation in CSR Liver data were divided by total protein concentration to correct for differences in cell number and extraction efficiency. In a second type of transformation, metrics such as ‘area under the curve’, maximal value of a variable in a series, standard deviation of a series and relative values are computed. Third, complex data transformations are performed, including mean-centering and variance-scaling, both of which are helpful in performing principal component analysis (PCA) or assembling models using PLSR (Gaudet et al., 2005). Finally, computations specific to particular modeling methods are performed, including transformation of continuous variables into discrete values for the purpose of Boolean or discrete data modeling. For example, to support Boolean modeling, a discretization routine assigns a value of ‘1’ to a variable if and only if (i) it is above a typical background value for the assay, as determined by the user or extracted automatically from primary data, (ii) it is above a user-supplied threshold and (iii) it is high with respect to the values of the same signal under other conditions in the data set.
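The three-part discretization rule can be written down as a short Python sketch. The text does not say how ‘high with respect to other conditions’ is quantified, so criterion (iii) is implemented here, purely as an assumption, as reaching a given fraction of the largest value observed for that signal; the function name, the `relative_fraction` parameter and the numbers are likewise invented.

```python
def discretize(signal_by_condition, background, threshold,
               relative_fraction=0.5):
    """Assign 1 only if all three criteria hold, otherwise 0.

    (i)   value is above the assay background,
    (ii)  value is above the user-supplied threshold,
    (iii) value is high relative to the same signal in other conditions
          (ASSUMPTION: at least `relative_fraction` of the maximum).
    """
    peak = max(signal_by_condition.values())
    result = {}
    for condition, value in signal_by_condition.items():
        above_background = value > background
        above_threshold = value > threshold
        high_relative = value >= relative_fraction * peak
        result[condition] = int(
            above_background and above_threshold and high_relative
        )
    return result


# Invented readings for one signal across four conditions.
hsp27 = {"none": 100.0, "IL1a": 900.0, "TNFa": 400.0, "IL6": 150.0}
states = discretize(hsp27, background=120.0, threshold=200.0)
```

In this example only the IL1a condition is scored ‘1’: TNFa passes criteria (i) and (ii) but falls below half of the peak value, and IL6 fails the threshold.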

2.6 Data mining and visualization
Visualization can involve data export directly to an external application such as Spotfire, or it can be performed within the pipeline. Internal visualization routines that make use of transformations performed by DataRail are often an effective means to create thumbnails of time-courses, heat maps, etc. For example, the data viewer in Figure 4 was developed to display time courses of protein modification in the CSR Liver compendium, corrected for background and protein concentration and scaled to a common vertical axis. Data from primary hepatocytes and HepG2 cells was compared, and the difference between the integrated activities in the two lines then computed and displayed in the background as a red-blue heat map. Discretization was then used to score responses as transient, sustained or invariant, each of which was assigned a different color. Finally, a heat map of the phenotypic responses was generated to facilitate comparison of signals and outcomes (see Fig. 4). Importantly, efficient generation of plots such as this relies on the inclusion in DataRail of multiple data transformation routines.

2.7 Constructing and evaluating models
DataRail supports three approaches to modeling. First, several routines that create statistical models, such as PLSR, have been integrated directly into the code. Second, efficient links have been created to other MATLAB toolboxes such as CellNetAnalyzer (Klamt et al., 2007), which performs Boolean modeling, and the differential-equation-based modeling package PottersWheel (http://www.PottersWheel.de/). Third, export of primary or transformed data from DataRail as vectors, matrices or n-dimensional arrays has been implemented to facilitate links to other modeling tools. In this case, users need to ensure continuing compliance with the MIDAS data standard so as to preserve the integrity of metadata. Thus far we have implemented export into a MIDAS file, which can be read by Spotfire, and formats compatible with either PottersWheel or CellNetAnalyzer.

It is well recognized that modeling in biology is an iterative process in which modeling, hypotheses generation and experiments alternate. Less obvious is that the relationship between models and data can be very complex. We have previously shown that the quality of statistical models can be improved by various pre-processing algorithms that mean-center data or scale it to unit variance (Gaudet et al., 2005). Moreover, metrics derived from time course data such as area under the curve, maximum slope and mean value can be more informative than primary data because they implicitly account for time in different ways. However, it is rarely known a priori which data transformations will yield the best model. Instead, multiple models must be computed and the choice among them made using an objective function such as least squares fit to experimental data. From the point of view of workflow, the key point is that a single primary data array can give rise to multiple transformed arrays, and each of these to multiple models that differ in their underlying assumptions. As a consequence, a very large number of models are generated, each of which needs to be referenced correctly to underlying data and data processing algorithms. DataRail excels at maintaining these links between model and data.
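To make concrete how the three time-course metrics named above treat time differently, the following Python sketch computes them for one invented series. The trapezoidal rule for the area, the use of slopes between consecutive time points, and an unweighted mean are assumptions of this sketch; the text does not specify how DataRail computes these quantities.

```python
def _segments(times, values):
    """Pair up consecutive (time, value) points."""
    points = list(zip(times, values))
    return zip(points, points[1:])


def area_under_curve(times, values):
    """Trapezoidal area under the time course (assumed integration rule)."""
    return sum((t1 - t0) * (v0 + v1) / 2.0
               for (t0, v0), (t1, v1) in _segments(times, values))


def max_slope(times, values):
    """Largest rate of change between consecutive time points."""
    return max((v1 - v0) / (t1 - t0)
               for (t0, v0), (t1, v1) in _segments(times, values))


def mean_value(values):
    """Unweighted mean of the measurements; ignores time spacing."""
    return sum(values) / len(values)


# Invented time course sampled at 0, 30 and 180 min.
times = [0.0, 30.0, 180.0]
values = [1.0, 4.0, 2.0]

auc = area_under_curve(times, values)   # weights the long late interval
slope = max_slope(times, values)        # dominated by the early rise
mean = mean_value(values)
```

For this series the area is dominated by the long interval between 30 and 180 min, whereas the maximum slope reflects only the initial rise, which is why the metrics can rank conditions differently.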

For example, data in the CSR Liver compendium were processed to account for variation in experimental protocol. PCA was then used to reduce the dimensionality of the cytokine data, and k-means clustering applied to identify relevant cytokine subsets. PLSR was then performed, taking as an input phosphorylation data (signals) and as an output a PCA-derived cluster of important mediators of the inflammatory response, namely the pro-inflammatory cytokine IL1β and several activators of granulocytes (MIP1α/CCL3, MIP1β/CCL4, RANTES/CCL5, GCSF). We could have chosen a different response set, but this cluster served to demonstrate key steps in statistical modeling by PLSR. Next, 24 transformed data arrays were created for signals and responses based on different scaling (mean-centering or variance-scaling) or metrics (area under the curve, slope, and mean activation; see Fig. 5). PLSR was performed on pairs of signal-response arrays, generating 576 models that were then ranked by goodness of fit to data (a least squares fit based on R^2, see Supplementary Table S3). To prevent overfitting, the number of components for each model was determined using 7-fold cross-validation (Wold et al., 2004). Importantly, the whole process of creating and evaluating models ran in DataRail in a matter of minutes, and every model could be traced back to the transformed data from which it was derived.
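The bookkeeping behind this model sweep can be sketched as follows. The figure of 576 models is consistent with pairing each of 24 signal arrays with each of 24 response arrays (24 × 24 = 576). The Python sketch below enumerates all signal-response pairs and ranks them by R^2, but it substitutes a one-variable least-squares line for PLSR and omits cross-validation entirely; the variant names and all numbers are invented, and only the enumerate-fit-rank pattern is meant to carry over.

```python
from itertools import product


def r_squared(x, y):
    """R^2 of a one-variable least-squares line (a stand-in for PLSR)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def rank_models(signal_variants, response_variants):
    """Fit one model per (signal, response) pair and sort by R^2."""
    results = []
    for (s_name, s), (r_name, r) in product(signal_variants.items(),
                                            response_variants.items()):
        # Keeping the variant names with the score preserves the link
        # from each model back to the transformed arrays it was built on.
        results.append({"signal": s_name, "response": r_name,
                        "r2": r_squared(s, r)})
    return sorted(results, key=lambda m: m["r2"], reverse=True)


# Invented per-condition values for differently transformed arrays.
signal_variants = {
    "auc": [1.0, 2.0, 3.0, 4.0],
    "max_slope": [0.5, 0.1, 0.9, 0.3],
    "mean": [2.0, 2.5, 2.0, 4.0],
}
response_variants = {
    "auc": [2.1, 3.9, 6.2, 7.8],
    "mean": [1.0, 3.0, 2.0, 2.5],
}

ranked = rank_models(signal_variants, response_variants)
best = ranked[0]
```

With three signal variants and two response variants the sweep yields six models; the same loop scales to 24 × 24 pairs, and each entry still names the arrays from which it was derived.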

A variety of input arrays gave rise to top scoring models, but area under the curve was clearly the best measure of output (Supplementary Table S4). Models based on unit variance scaling of input data and area under the curve, which constituted the best form for the input in an earlier PLSR study (Gaudet et al., 2005), ranked no better than 218th out of the 576 models and had R^2 values 4-fold worse than the best model. Had we simply assumed our previous findings to be universally applicable, we would have generated models with very poor performance. When the best performing model (whose scores and loading plots can be found in Supplementary Fig. S3) was examined by variable importance of projection (VIP; Gaudet et al., 2005) to see which signals were most predictive of cytokine secretion, the levels of phosphorylation of Hsp27 and cJun (each at 0, 30 min and 3 h) comprised 6 of the 10 highest scoring variables. Phospho-Hsp27 is an integrated measure of p38 kinase activity and cJun of JNK kinase activity; intriguingly, the levels of activating phosphorylation on p38 and JNK kinases were considerably less informative. Thus, the steady-state activities of p38 and JNK (captured by t = 0 data) appear to play a key role in determining the extracellular concentrations of five cytokines and growth factors involved in epithelia-immune cell interactions. Consistent with this idea, it has previously been described that RANTES secretion is positively regulated by p38 MAPK and JNK in intestinal and airway epithelial cells (Pazdrak et al., 2002; Yan et al., 2006), as it is in liver.

2.8 Facilitating data sharing and publication
The fact that DataRail packages primary and transformed data arrays and their provenance together makes it a good means to share data among laboratories. However, knowledge transfer would be greatly facilitated by including figures, particularly those destined for publication or public presentation, within DataRail in a manner that maintained the analysis itself, the provenance of the data, and the identities of all algorithms and free parameters. Users could then interact with published figures in a dynamic fashion that would go far beyond what is available in today's journals, while also discovering new ways in which the data could be viewed or put to use. We have implemented a special category of project whose UID can contain a Pubmed ID and in which figures are saved as structured variables, ‘pseudo-arrays’, that are embedded in compendia in the same manner as other arrays. We are currently working on an additional feature in which all of the relevant data in a linked SBWiki are stored as a wiki-book (e.g. a PDF file), thereby ensuring a complete description of all experimental procedures, reagents, etc. In the case of open-source publication, the actual manuscript could also be embedded; otherwise, a link would be provided to the publisher.





We describe the implementation of DataRail, a flexible toolbox for storing and manipulating experimental data for the purpose of numerical modeling. Metadata in DataRail is based on a ‘minimum information’ MIDAS standard closely related to standards that have already proven their utility in the analysis of DNA microarray and other types of high-throughput data. Because MIDAS is a simplified version of the MIACA standard, export from DataRail into a MIACA-compliant file is straightforward. Based on several use cases with up to 1.5 × 10^6 data points (see Table S1), DataRail appears to be scalable and broadly useful, thanks to its efficient reuse of primary data and data processing algorithms. Compared to traditional relational databases, DataRail is significantly easier to deploy and modify, and it can accommodate a wider range of data formats since its internal arrays can have any dimensionality. Careful management of arrays via semantically typed identifiers (which also serve as URLs), use of a strict containment hierarchy, and imposition of a metadata standard take the place of the rigid tabular structure found in relational databases. However, in cases in which data formats stabilize, or greater transactional capacity is desired, all or part of a DataRail data model can be implemented in an RDBMS.

The current DataRail implementation meets our original design goals in the following ways: (i) data provenance is maintained through the containment hierarchy, the record of processing steps, and the assignment of UIDs; (ii) visualization and modeling are possible with internal tools specialized to PCA and PLSR; (iii) interaction with external software such as CellNetAnalyzer, PottersWheel and Spotfire is implemented, and export routines are available to expand this list; (iv) flexibility is provided by the use of data arrays with user-determined dimensionality and a simple interface for adding new analysis routines; (v) data sharing and publication are facilitated by a special category of project that packages together transformed arrays and figures describing key analyses, including those in published papers. Future developments in DataRail include the creation of utilities for managing image and mass spectrometry data, importers for a range of common laboratory instruments, and support for the HDF5 file format (http://hdf.ncsa.uiuc.edu/HDF5/). HDF is a widely supported, open-source format used in many fields dealing with large data sets, such as earth imaging or astronomy. HDF files are self-describing and permit access to binary data in a manner that is much more efficient than with XML. Moreover, integration of DataRail with Gaggle and similar interoperability standards is a high priority (Shannon et al., 2006). Gaggle coordinates multiple analysis tools, among them the R/Bioconductor statistical environment (Gentleman et al., 2004), thereby providing access to tools for the statistical analysis of high-throughput data. Finally, versions of DataRail based on the open-source languages R or Python are in development, as are discussions with instrument vendors to create direct export routines for MIDAS-compatible files. In the context of commercial use, we are discussing, with a commercial partner, the implementation of granular access control functionality.

A model-centric approach explicitly encodes specific hypotheses about data and its meaning, and can therefore merge data not only at the level of information but at the more useful level of knowledge. Even in the database-dependent world of business, knowledge is usually derived from information in specialized databases (data warehouses, which are static representations of transactional databases processed to ensure data consistency) using business intelligence tools. Business intelligence is, in essence, an approach to modeling business and financial processes mathematically and then testing the models on data. In the case of biological models, data plays an even more central role because many model parameters can be estimated only by induction from experimental observations. Thus, for mathematical models of biology to realize their full potential, a tight link between model and experiment is necessary. This involves not only an effective means to calibrate models, but also reliable information on data provenance. Only then can model-based predictions be evaluated in light of assumptions and uncertainties. DataRail therefore represents a step forward in the complex task of designing software that supports model-driven knowledge creation in biomedicine.



We thank S. Gaudet and J. Albeck for helpful discussions and B. Hendriks, C. Espelin, and M. Zhang for testing DataRail. This work was funded by NIH grant P50-GM68762 and by a grant from Pfizer Inc. to P.K.S. and D.A.L.

Conflict of Interest: none declared.


Associate Editor: Trey Ideker

{dagger}The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received on October 10, 2007; revised on December 11, 2007; accepted on January 9, 2008


Berners-Lee T, et al. The semantic web: a new form of web content that is meaningful to computers will unleash a revolution of new possibilities. Sci. Am. (2001) 284:34.

Gaudet S, et al. A compendium of signals and responses triggered by prodeath and prosurvival cytokines. Mol. Cell. Proteomics (2005) 4:1569–1590.

Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. (2004) 5:R80.

Gray J, et al. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowl. Discov. (1997) 1:29–53.

Jaqaman K, Danuser G. Linking data to models: data regression. Nat. Rev. Mol. Cell. Biol. (2006) 7:813–819.

Klamt S, et al. Structural and functional analysis of cellular networks with CellNetAnalyzer. BMC Syst. Biol. (2007) 1:2–14.

Pazdrak K, et al. MAPK activation is involved in posttranscriptional regulation of RSV-induced RANTES gene expression. Am. J. Physiol. Lung Cell. Mol. Physiol. (2002) 283:L364–L372.

Shah AR, et al. Enabling high-throughput data management for systems biology: the bioinformatics resource manager. Bioinformatics (2006) 23:906–909.

Shannon PT, et al. The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC Bioinformatics (2006) 7:176.

Swedlow JR, et al. Informatics and quantitative analysis in biological imaging. Science (2003) 300:100–102.

Wold S, et al. The PLS method and its applications in industrial RDP (research, development, and production). (2004) Accessed 2007 Dec 2 at http://umetrics.com/default.asp/pagename/news_pastevents/c/2.

Yan SR, et al. Differential pattern of inflammatory molecule regulation in intestinal epithelial cells stimulated with IL-1. J. Immunology (2006) 177:5604–5611.


Figure 1. Process diagram for model-centric information management in DataRail. Measurements generated using one or more methods (left side of diagram) are processed to output new knowledge (right); hypothesis testing links modeling and measurement in an iterative cycle. Processes and entities within the red box have been implemented; those outside the box remain to be completed; dotted lines denote external processes that have been linked to DataRail. Experimental measurements are first converted into a MIDAS format using one or more routines (pink lozenges; see text for details) and then used to assemble a multi-dimensional primary data array (green). Alternatively, an empty MIDAS-compliant spreadsheet is generated using a Java utility and experimental values then entered. Algorithms for normalization, scaling, discretization, etc. transform the data to create new data arrays (orange) that can then be modeled using internal or external routines. Finally, analysis and visualization assist in knowledge generation. The calibration of kinetic and Boolean models is not shown explicitly, although it constitutes a critical and complicated step in the overall workflow of systems biology that is as-yet external to DataRail.

Figure 2. Containment hierarchy for DataRail. Individual arrays of primary or transformed data are gathered together into a MATLAB structure we call a compendium; multiple compendia are linked together into a project. Each compendium contains a unique name (UID), a short textual documentation, and a set of multi-dimensional arrays. Each array is stored together with simple metadata (name, free-text information, source, algorithm, and free parameters used in array creation). The representation follows the conventions of UML (Unified Modeling Language) format, indicating that a compendium contains one or more arrays, which contain one or more labels and zero or more parameters.
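
The containment hierarchy in Figure 2 can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not DataRail code: the class names, field names, and the `history` helper are hypothetical, chosen only to mirror the metadata listed in the caption (name, free-text information, source, algorithm, parameters, labels). The sketch assumes that an array's `source` field holds either the name of its parent array or an external file name, so that following `source` links recovers the sequence of processing steps.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class DataArray:
    """One primary or transformed array plus its simple metadata."""
    name: str
    info: str                      # free-text information
    source: str                    # assumed: parent array name or input file
    algorithm: str                 # routine that created this array
    parameters: Dict[str, Any] = field(default_factory=dict)  # zero or more
    labels: List[str] = field(default_factory=list)           # one or more
    values: Any = None             # the multi-dimensional data itself


@dataclass
class Compendium:
    """A uniquely named collection of arrays."""
    uid: str
    documentation: str
    arrays: Dict[str, DataArray] = field(default_factory=dict)

    def add(self, array: DataArray) -> None:
        self.arrays[array.name] = array

    def history(self, name: str) -> List[Tuple[str, str, Dict[str, Any]]]:
        """Return (array, algorithm, parameters) steps, oldest first."""
        steps = []
        seen = set()               # guards against circular source links
        current = self.arrays.get(name)
        while current is not None and current.name not in seen:
            seen.add(current.name)
            steps.append((current.name, current.algorithm, current.parameters))
            current = self.arrays.get(current.source)
        return steps[::-1]


@dataclass
class Project:
    """Several compendia linked together."""
    name: str
    compendia: List[Compendium] = field(default_factory=list)


# Hypothetical usage: one primary array and one array derived from it.
example = Compendium(uid="example-compendium", documentation="illustration only")
example.add(DataArray("primary", "imported measurements", "plate1.csv",
                      "import_midas", labels=["time", "treatment", "signal"]))
example.add(DataArray("scaled", "scaled to maximum signal", "primary",
                      "scale_to_max", {"per": "signal"},
                      labels=["time", "treatment", "signal"]))
for step in example.history("scaled"):
    print(step)
```

Because each array carries its own source, algorithm, and parameters, the processing record travels with the data and does not depend on a separate log.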

Figure 3. Minimum information for data analysis in systems biology (MIDAS). (A) A simplified map of a multi-well experiment in which Akt phosphorylation is to be assayed at 0 and 30 min in extracts from cells treated, or not, with lipopolysaccharide (LPS) and a PI3-kinase inhibitor (PI3Ki). (B) MIDAS representation of the experiment. A column header consists of a two-letter code defining the type of column and a short column name. For clarity headers are color-coded to match the corresponding values on the plate map. The leftmost five columns (codes ID: identity, TR: treatment, and DA: data acquisition) are experimental design parameters and would be filled in before bench work begins. The rightmost column holds measured data values (DV) that are appended as data acquisition is performed. See Supplementary Table S2 for a larger example. (C) A list of the type codes used for MIDAS columns and a few relevant SBWiki types.
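
As an illustration of the header convention in Figure 3B, the following Python sketch splits column headers into a type code and a column name, and separates experimental design columns from data value columns. It is a sketch under stated assumptions and not the DataRail importer: the colon separator, the function names, and the example headers are hypothetical, and only the four type codes named in the caption are included.

```python
# Type codes named in the Figure 3 caption; the full MIDAS list may be longer.
MIDAS_CODES = {
    "ID": "identity",
    "TR": "treatment",
    "DA": "data acquisition",
    "DV": "data value",
}


def parse_header(header, sep=":"):
    """Split a header into (code, name); the separator is an assumption."""
    code, found, name = header.partition(sep)
    code = code.strip().upper()
    name = name.strip()
    if not found or code not in MIDAS_CODES or not name:
        raise ValueError(f"not a recognized MIDAS-style header: {header!r}")
    return code, name


def split_columns(headers):
    """Separate design columns (ID, TR, DA) from measured data columns (DV)."""
    design, data = [], []
    for header in headers:
        code, name = parse_header(header)
        if code == "DV":
            data.append((code, name))
        else:
            design.append((code, name))
    return design, data


# Hypothetical headers loosely modeled on the experiment in panel A.
headers = ["ID:well", "TR:LPS", "TR:PI3Ki", "DA:time", "DV:Akt"]
design, data = split_columns(headers)
print("design columns:", design)
print("data columns:", data)
```

Keeping the type code inside the header means a file can be checked for completeness of the experimental design before any data values have been entered.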

Figure 4. Visualizing data in DataRail by exploiting data in transformed arrays. (A) Structure of the compendium used to generate this plot and the relationship of each feature to data in a transformed array. This structural map was generated using routines internal to DataRail. (B) Time courses for the phosphorylation of 17 key proteins (rows) in primary hepatocytes under 11 different conditions of cytokine stimulation (columns) and treated with seven different small molecule drugs (subpanels within each cytokine-signal block). Curves are colored according to their dynamics (green = sustained, yellow = transient, magenta = late activation, grey = no significant signal). The intensity of the signal determines the intensity of the color. The corresponding signals from HepG2 tumor cells are plotted behind without color coding. The background is blue if the mean signal is stronger for primary cells and red if it is stronger for HepG2 cells; larger differences lead to stronger coloring. In addition, the levels of IL8 at 24 h, a measure of cellular response, are added as a heat map.

Figure 5. PLSR analysis in DataRail. Liver CSR data was imported into DataRail, with values for protein phosphorylation designated as inputs and levels of secreted cytokine as outputs. The data was not normalized with respect to total protein concentration, so as not to introduce additional experimental error. The extent of cytokine co-expression was determined using internal PCA and k-means clustering routines. This yielded a set of five tightly clustered cytokines that were used as outputs for modeling (see row 1 of Table S1 for information about the dimensionality of the data). Primary data and data scaled with respect to maximum signal were then analyzed to compute area under the curve, slope, and mean change; this generated 8 transformed arrays for both input and output data. The resulting arrays were rescaled using routines for mean-centering, variance-scaling, or both combined (auto-scaling). The resulting 24 input cubes and 24 output cubes gave rise to 576 PLSR models, which were ranked according to their goodness of fit. For the best model, the variable importance in projection (VIP) is shown as a way to assess the relative importance of different inputs for cytokine secretion.
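
The model count in Figure 5 follows from simple combinatorics, which the Python sketch below reproduces. It is illustrative only and does not use DataRail routines. It assumes one reading of the caption that is consistent with the stated count of 8 arrays: two base data sets (primary and scaled to maximum), each kept as-is or summarized by one of three metrics. All names are hypothetical, and the `rescale` helper uses the population standard deviation as one possible convention.

```python
from itertools import product
from statistics import mean, pstdev

# Assumed decomposition of the 8 transformed arrays: 2 bases x 4 features.
BASES = ["primary", "max_scaled"]
FEATURES = ["raw", "area_under_curve", "slope", "mean_change"]
SCALINGS = ["mean_centering", "variance_scaling", "auto_scaling"]


def rescale(values, method):
    """Rescale one variable (a list of numbers) with the named method."""
    mu = mean(values)
    sd = pstdev(values)  # assumed convention; must be non-zero for scaling
    if method == "mean_centering":
        return [v - mu for v in values]
    if method == "variance_scaling":
        return [v / sd for v in values]
    if method == "auto_scaling":
        return [(v - mu) / sd for v in values]
    raise ValueError(f"unknown scaling method: {method}")


# Each cube is one (base, feature, scaling) combination: 2 * 4 * 3 = 24.
cubes = list(product(BASES, FEATURES, SCALINGS))

# Every input cube is paired with every output cube: 24 * 24 = 576 models.
model_pairs = list(product(cubes, cubes))

print(len(cubes), "cubes per side;", len(model_pairs), "candidate PLSR models")
print(rescale([1.0, 2.0, 3.0], "mean_centering"))
```

Enumerating the combinations explicitly makes it straightforward to fit one model per pair and then rank the fits, which is the procedure the caption describes.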
