2.1 Design goals and implementation
To facilitate the collection, annotation and transformation of experimental data, DataRail software is designed to meet the following specific requirements (see Fig. 1): (i) serve as a stable repository for experimental results of different types while recording key properties of the biological setting and complete information about all data processing steps; (ii) promote model development and analysis via internal visualization and modeling capabilities; (iii) interact efficiently and transparently with external modeling and mining tools; (iv) meet new requirements in data collection, annotation and transformation as they arise and (v) facilitate data sharing and publication through compatibility with existing bioinformatics standards. A system meeting these requirements was designed in which data is stored in a succession of regular multi-dimensional arrays, known as ‘data cubes’ in information technology (Gray et al., 1997), each representing transformations of an original set of primary data. The integrity of data is maintained by tagging the primary data with metadata referenced to a controlled ontology, storing all arrays arising from the same primary data in one file structure, documenting the relationships of arrays to each other, storing algorithms used for data transformation with data arrays and assigning each data structure a unique identifier (UID) based on a controlled semantic. DataRail was implemented as a MATLAB toolbox (http://www.mathworks.com/) with scripting and GUI-based interaction and incorporating a variety of data processing algorithms. DataRail works best as a component of a loosely coupled set of software tools including commercial data mining packages such as Spotfire (http://spotfire.tibco.com/) or public toolboxes for modeling.
In addition, DataRail is designed to communicate with a semantic Wiki, to be described in a separate paper but available at the DataRail download site, that is better designed for storing textual information, such as experimental protocols, and that documents DataRail's use of UIDs.
2.2 System overview
Information in DataRail arising from a single set of experiments is organized into a compendium, which consists of multiple n-dimensional data arrays, each of which contains either primary data or processed data (see Fig. 2). It is left up to users to determine the breadth of experimental data included within each compendium, but good practice is to group results with similar experimental aims, biological setting or place of publication into one compendium. DataRail also supports creation of containers for multiple compendia known as projects. The dimensionality of arrays containing primary data is determined by the user at the time of import, making it possible to accommodate a wide range of experimental approaches and measurement technologies. For example, measuring a few properties of many samples by flow cytometry generates an array of different dimensionality than measuring many variables in a few samples by mass spectrometry. In practice, data in our laboratory can usually be described in six dimensions: three for the experimental conditions (e.g. cell type, cytokine stimuli and small-molecule treatment), one for time, one for experimental replicates and one for actual measurements.
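The six-dimensional layout described above can be illustrated with a short sketch. It is written in Python purely for illustration (DataRail itself is a MATLAB toolbox), and the axis labels, the dictionary-based storage and the function names `empty_cube`, `shape` and `select` are all hypothetical, not part of DataRail.

```python
from itertools import product

# Hypothetical axis labels: the six axes follow the text, the values are invented.
AXES = {
    "cell_type": ["primary", "HepG2"],
    "cytokine": ["none", "TNF"],
    "inhibitor": ["none", "p38i"],
    "time_min": [0, 30, 180],
    "replicate": [1, 2],
    "measurement": ["p-Hsp27", "p-cJun"],
}


def empty_cube(axes):
    """Map every coordinate tuple to None, i.e. a cube with no values yet."""
    return {coords: None for coords in product(*axes.values())}


def shape(axes):
    """Length of each axis, in axis order."""
    return tuple(len(labels) for labels in axes.values())


def select(cube, axes, **fixed):
    """Return the entries whose coordinates match every fixed axis value."""
    names = list(axes)
    positions = {name: names.index(name) for name in fixed}  # ValueError if unknown
    return {
        coords: value
        for coords, value in cube.items()
        if all(coords[positions[name]] == wanted for name, wanted in fixed.items())
    }


cube = empty_cube(AXES)
cube[("HepG2", "TNF", "none", 30, 1, "p-Hsp27")] = 1520.0  # invented value
hepg2_30min = select(cube, AXES, cell_type="HepG2", time_min=30)
```

Fixing two of the six axes, as in the last line, leaves a four-dimensional slice; this is the kind of sub-array that a viewer or a model would typically consume.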
Arrays of transformed data are generated from primary data by applying numerical algorithms that normalize, scale or otherwise increase accuracy and information content. Algorithms used during data processing, along with the values of all free parameters, are stored with each array to maintain a complete record of all transformations performed prior to data mining or modeling.
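A minimal sketch of such record keeping is given below. It is an assumption-laden illustration in Python, not DataRail's MATLAB data structure: the `DataArray` class, its field names and the helpers `transform` and `lineage` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DataArray:
    """Hypothetical container: values plus the record of how they were made."""
    name: str
    values: List[float]
    source: Optional[str] = None       # name of the input array; None for primary data
    algorithm: Optional[str] = None    # name of the function that produced the values
    parameters: Dict[str, Any] = field(default_factory=dict)  # free parameters used


def subtract_background(value, background):
    """Example transformation with one free parameter."""
    return value - background


def transform(array, new_name, func, **params):
    """Apply func element-wise and store the algorithm and parameters with the result."""
    return DataArray(
        name=new_name,
        values=[func(v, **params) for v in array.values],
        source=array.name,
        algorithm=func.__name__,
        parameters=dict(params),
    )


def lineage(compendium, name):
    """Follow source links from an array back to the primary data."""
    chain = []
    while name is not None:
        chain.append(name)
        name = compendium[name].source
    return chain


primary = DataArray("raw", [10.0, 12.5])
corrected = transform(primary, "bg_subtracted", subtract_background, background=2.0)
compendium = {a.name: a for a in (primary, corrected)}
```

Because each array carries a reference to its input, the full processing history can be recovered by walking the `source` links, here with `lineage(compendium, "bg_subtracted")`.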
2.3 Test cases
We have tested DataRail on seven sets of recent data available in our laboratories, containing between 5 × 10³ and 1.6 × 10⁶ data points. Each set had a unique structure and gave rise to arrays with 4–6 dimensions (see Supplementary Table S1). Here we discuss the analysis of a ‘CSR Liver compendium’, a cue-signal-response dataset (Gaudet et al., 2005) comprising 22 512 measurements in primary human hepatocytes and a hepatocarcinoma cell line (HepG2 cells; L.A. et al., unpublished data). In this compendium, cells were exposed to 11 cytokine treatments and 8 small-molecule drugs, after which the states of phosphorylation of 17 signaling proteins (at 30 min and 3 h) and the concentrations of 50 extracellular cytokines (at 12 and 24 h) were measured using bead-based micro-ELISA assays.
2.4 Storing primary data and metadata
Tagging primary data with metadata is essential to its utility and involves two aspects of DataRail: a new metadata standard and a process for actually collecting the metadata. The metadata standard is based on our proposed MIDAS format (Minimum Information for Data Analysis in Systems Biology) that is itself based on pre-existing minimum-information standards such as MIACA (Minimum Information About a Cellular Assay, http://miaca.sourceforge.net/). MIDAS is a tabular (or spreadsheet) format that specifies the layout of experimental data files that gives rise, upon import into DataRail, to an n-dimensional data array. The MIDAS format was derived from the ‘experimental module’ concept in MIACA, with modifications required for model-centric data management (see Fig. 3). Typically a MIDAS file is used to input information from instruments into DataRail and to export information from DataRail into other software that uses spreadsheets. However, export from a data array to a MIDAS file entails loss of information about data provenance and prior processing steps. We are therefore in the process of implementing a standardized format for exchanging DataRail files that does not depend on the use of MATLAB files (see Section 3 for details). Each row in a MIDAS table represents a single experimental sample; each column represents one sample attribute, such as identity (e.g. multi-well plate name or well coordinate), treatment condition, or value obtained from an experimental assay. A column header consists of two values: (i) a two-letter code defining the type of column (e.g. TR for treatment, DV for data value), and (ii) a short column name (e.g. a small molecule inhibitor added or a protein assayed). The body of each column stores the corresponding value for each row (sample) such as a plate/well name, reagent concentration, time point, or data value (see Supplementary Materials for details and example MIDAS spreadsheets).
MIDAS is designed to fulfill the need for data exchange and analysis within a closely knit research group. It is not a stand-alone solution for archival storage or publication and should be implemented in conjunction with MIAME, MIACA or DataRail itself.
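The row-and-column layout described above can be sketched as follows. This is an illustration in Python under stated assumptions and not the MIDAS specification: only the codes TR and DV are named in the text, so the codes `ID` and `DA`, the colon separating code from column name, the sample values and the functions `parse_header` and `read_midas` are hypothetical (see the Supplementary Materials for the actual format).

```python
import csv
import io

# Invented example table: one row per sample, one column per attribute.
# The "CODE:name" header syntax is an assumption made for this sketch.
SAMPLE = """ID:well,TR:TNF,TR:p38i,DA:time_min,DV:p-Hsp27
A1,0,0,30,210.5
A2,1,0,30,1520.0
A3,1,1,30,480.2
"""


def parse_header(header):
    """Split a column header into its two-letter type code and its short name."""
    code, sep, name = header.partition(":")
    if not sep or len(code) != 2 or not name:
        raise ValueError(f"malformed column header: {header!r}")
    return code, name


def read_midas(text):
    """Return the parsed headers and, per sample, values grouped by column type."""
    reader = csv.DictReader(io.StringIO(text))
    headers = list(reader.fieldnames)
    columns = [parse_header(h) for h in headers]
    samples = []
    for row in reader:
        sample = {}
        for header, (code, name) in zip(headers, columns):
            sample.setdefault(code, {})[name] = row[header]
        samples.append(sample)
    return columns, samples


columns, samples = read_midas(SAMPLE)
```

Grouping values by type code makes it straightforward to decide, at import, which columns define array dimensions (treatments, times) and which hold the measurements themselves.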
The sequence of steps involved in entering metadata into DataRail is designed to accommodate the rhythm of a typical laboratory in which simple annotation is possible while experiments are in progress, but detailed data analysis is performed subsequently. As an experiment is being designed, a MIDAS file specifying the dimensionality and format of the data (treatments, time points, readouts, etc.) is created, and scripts specialized to different instruments or experimental methodologies are then used to add results to the ‘empty’ MIDAS file. Thus far we have written a script to import bead-based micro-ELISA data generated by a Luminex reader running BioRad software (Bio-Plex). We have also implemented a general purpose Java program for MIDAS file creation that can be used to import data into DataRail, used as a stand-alone application, or integrated into other software. Within the MIDAS layout utility, wells that will be treated in a similar or sequential manner are selected via a GUI and appropriate descriptions of the samples added via pop-up tabs (see Supplementary Fig. S1). When layout is complete, a correctly formatted MIDAS file is generated, ready for the addition of data. Lists that assist experimentation are also created (these lists typically specify times of reagent addition, sample withdrawal, etc.). We invite instrument manufacturers to incorporate this utility into their software so that creation of MIDAS-compliant files is automatic; the code is therefore distributed under a non-viral caBIG open source license developed by the National Cancer Institute. If a MIDAS file has not been generated at the outset of an experiment, it is possible to convert experimental data at any point prior to import, but in this case MIDAS-associated support tools are not available to help with experiments.
As mentioned above, DataRail need not be used in combination with SBWiki, a wiki based on semantic web technology (Berners-Lee, 2001). For the current discussion, four features of SBWiki are important. First, a web form used for upload prompts users to enter the metadata such as user name, date, cell type, etc., required for full MIDAS compliance, and this data is stored as a wiki page. Because continuous web access is easy to arrange, even for geographically dispersed instruments, users record metadata when files are first saved to a central location. This is very important in practice because metadata is rarely added when the process is cumbersome or separated in time from data collection. Second, use of semantic web forms makes it possible to create simple, familiar and easily modified interfaces while collecting structured information. In contrast, tools for accessing metadata in traditional databases or XML files are more difficult to use and require considerable expertise to modify. Third, as data is imported it is assigned a UID by SBWiki itself, which directly encodes, among other things, the type of data and the person who created it (see Supplementary Materials). The assignment of a UID makes it possible to track the origin of all data in DataRail, independent of the array-compendium-project structure. Fourth, although metadata describing key aspects of experiments are stored internal to the MIDAS file, complete details of experimental protocols and reagents are stored in SBWiki. Storage external to the MIDAS file allows complex textual information to be modified and reused more easily. Links from data arrays to external files are made via URLs that follow the UID scheme described above and can use the revision history in SBWiki to reference a specific version of a protocol or reagent.
When constructing the CSR Liver compendium, a spreadsheet generated by Bio-Plex software was appended to a MIDAS file, and a second MIDAS file containing data on total protein concentrations was generated using a plate reader. Overall three primary data arrays were created from CSR Liver data: one recording phosphorylation states of 17 proteins at three time points (0, 30 min, 3 h), one recording extracellular cytokine concentrations, also at three time points (0, 12 h, 24 h), and one recording total protein concentration. In principle these arrays could be combined to bundle data together, but the resulting single array would be sparsely populated. In addition, bundling data into a single array is not the same as fusing different types of data. The fusion of flow cytometry, Western blot and live-cell imaging data (J.A. et al., unpublished data) is facilitated by DataRail but also requires biological insight and problem-specific modeling.
2.5 Adding transformed arrays to compendia
Once primary data is imported into a new compendium, it is transformed by one or more algorithms internal to DataRail, by user-specified algorithms, or by external programs, to create a new transformed data array. Transformations can change numerical values within an array or can expand or collapse the dimensionality of arrays. A long time series, for example, can be transformed into a shorter series involving selected times, time-averaged data or integrated values. When a transformation is performed on an array, the code used for the transformation and the values of free parameters are stored, along with a reference to the input data (in the current implementation, the algorithms themselves are recorded as the text of MATLAB functions), so that the compendium is a self-documenting entity, in which the provenance of data can be tracked.
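The three ways of collapsing a time series mentioned above can be sketched as follows. The example is in Python for illustration only; the function names are hypothetical, the data are invented, and the trapezoidal rule is an assumed choice of integration scheme rather than the one documented for DataRail.

```python
def select_times(times, values, keep):
    """Shorter series containing only the values measured at the selected times."""
    return [v for t, v in zip(times, values) if t in keep]


def integrate(times, values):
    """Integrated value of the series (trapezoidal rule, an assumed scheme)."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (values[i] + values[i - 1]) / 2.0
    return area


def time_average(times, values):
    """Time-weighted mean: the integral divided by the total duration."""
    return integrate(times, values) / (times[-1] - times[0])


times = [0, 30, 180]        # minutes; invented example
values = [1.0, 3.0, 1.0]    # arbitrary units

endpoints = select_times(times, values, {0, 180})
area = integrate(times, values)
average = time_average(times, values)
```

The first function shortens the time dimension, whereas the other two remove it altogether, which is one way a transformation can collapse the dimensionality of an array.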
Overall, DataRail can perform a diversity of transformations falling into several general categories. Simple arithmetic operations include subtracting background from primary data, or dividing one type of data by another (see Supplementary Fig. S2). For example, Bio-Plex-based measures of protein phosphorylation in CSR Liver data were divided by total protein concentration to correct for differences in cell number and extraction efficiency. In a second type of transformation, metrics such as ‘area under the curve’, maximal value of a variable in a series, standard deviation of a series and relative values are computed. Third, complex data transformations are performed, including mean-centering and variance-scaling, both of which are helpful in performing principal component analysis (PCA) or assembling models using PLSR (Gaudet et al., 2005). Finally, computations specific to particular modeling methods are performed, including transformation of continuous variables into discrete values for the purpose of Boolean or discrete data modeling. For example, to support Boolean modeling, a discretization routine assigns a value of ‘1’ to a variable if and only if (i) it is above a typical background value for the assay, as determined by the user or extracted automatically from primary data, (ii) it is above a user-supplied threshold and (iii) it is high with respect to the values of the same signal under other conditions in the data set.
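The three-part discretization rule can be sketched as below. The sketch is in Python for illustration and is not DataRail's routine: criteria (i) and (ii) follow the text directly, but the text does not define ‘high with respect to other conditions’, so criterion (iii) is implemented here under the assumption that the value must reach a fixed fraction of the largest value of that signal across all conditions. The function name, the `relative_fraction` parameter and the data are hypothetical.

```python
def discretize(value, background, threshold, all_conditions, relative_fraction=0.5):
    """Return 1 if and only if all three criteria hold, otherwise 0."""
    above_background = value > background   # criterion (i)
    above_threshold = value > threshold     # criterion (ii)
    # Criterion (iii), ASSUMED definition: at least relative_fraction of the
    # maximum observed for the same signal across all conditions.
    high_relative = value >= relative_fraction * max(all_conditions)
    return int(above_background and above_threshold and high_relative)


# Invented values of one signal under four conditions.
signal = {"none": 120.0, "TNF": 1500.0, "IL1": 900.0, "IL6": 300.0}
states = {
    condition: discretize(value, background=100.0, threshold=200.0,
                          all_conditions=list(signal.values()))
    for condition, value in signal.items()
}
```

In this example the IL6 condition passes the background and threshold tests but is scored ‘0’ because it is low relative to the strongest response, which shows why the third criterion is not redundant with the first two.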
2.6 Data mining and visualization
Visualization can involve data export directly to an external application such as Spotfire, or it can be performed within the pipeline. Internal visualization routines that make use of transformations performed by DataRail are often an effective means to create thumbnails of time-courses, heat maps, etc. For example, the data viewer in Figure 4 was developed to display time courses of protein modification in the CSR Liver compendium, corrected for background and protein concentration and scaled to a common vertical axis. Data from primary hepatocytes and HepG2 cells was compared, and the difference between the integrated activities in the two lines then computed and displayed in the background as a red-blue heat map. Discretization was then used to score responses as transient, sustained or invariant, each of which was assigned a different color. Finally, a heat map of the phenotypic responses was generated to facilitate comparison of signals and outcomes (see Fig. 4). Importantly, efficient generation of plots such as this relies on the inclusion in DataRail of multiple data transformation routines.
2.7 Constructing and evaluating models
DataRail supports three approaches to modeling. First, several routines that create statistical models, such as PLSR, have been integrated directly into the code. Second, efficient links have been created to other MATLAB toolboxes such as CellNetAnalyzer (Klamt et al., 2007), which performs Boolean modeling, and the differential-equation-based modeling package PottersWheel (http://www.PottersWheel.de/). Third, export of primary or transformed data from DataRail as vectors, matrices or n-dimensional arrays has been implemented to facilitate links to other modeling tools. In this case, users need to ensure continuing compliance with the MIDAS data standard so as to preserve the integrity of metadata. Thus far we have implemented export into a MIDAS file, which can be read by Spotfire, and formats compatible with either PottersWheel or CellNetAnalyzer.
It is well recognized that modeling in biology is an iterative process in which modeling, hypothesis generation and experiments alternate. Less obvious is that the relationship between models and data can be very complex. We have previously shown that the quality of statistical models can be improved by various pre-processing algorithms that mean-center data or scale it to unit variance (Gaudet et al., 2005). Moreover, metrics derived from time course data such as area under the curve, maximum slope and mean value can be more informative than primary data because they implicitly account for time in different ways. However, it is rarely known a priori which data transformations will yield the best model. Instead, multiple models must be computed and the choice among them made using an objective function such as least squares fit to experimental data. From the point of view of workflow, the key point is that a single primary data array can give rise to multiple transformed arrays, and each of these to multiple models that differ in their underlying assumptions. As a consequence, a very large number of models are generated, each of which needs to be referenced correctly to underlying data and data processing algorithms. DataRail excels at maintaining these links between model and data.
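The two pre-processing steps named above can be written compactly. The following Python sketch is illustrative only: the function names are hypothetical, the data are invented, and the use of the sample standard deviation (n − 1 denominator) for unit-variance scaling is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, stdev


def mean_center(column):
    """Subtract the column mean so that the values average to zero."""
    m = mean(column)
    return [x - m for x in column]


def scale_to_unit_variance(column):
    """Divide by the sample standard deviation (assumed estimator)."""
    s = stdev(column)
    if s == 0:
        raise ValueError("cannot scale a constant column to unit variance")
    return [x / s for x in column]


column = [2.0, 4.0, 6.0]                  # one variable across three samples
centered = mean_center(column)
scaled = scale_to_unit_variance(column)
autoscaled = scale_to_unit_variance(centered)  # both steps applied in sequence
```

Each variant yields a different transformed array from the same primary data, which is why the number of candidate inputs to modeling grows quickly.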
For example, data in the CSR Liver compendium were processed to account for variation in experimental protocol. PCA was then used to reduce the dimensionality of the cytokine data, and k-means clustering applied to identify relevant cytokine subsets. PLSR was then performed, taking as an input phosphorylation data (signals) and as an output a PCA-derived cluster of important mediators of the inflammatory response, namely the pro-inflammatory cytokine IL1β and several activators of granulocytes (MIP1α/CCL3, MIP1β/CCL4, RANTES/CCL5, GCSF). We could have chosen a different response set, but this cluster served to demonstrate key steps in statistical modeling by PLSR. Next, 24 transformed data arrays were created for signals and responses based on different scaling (mean-centering or variance-scaling) or metrics (area under the curve, slope, and mean activation; see Fig. 5). PLSR was performed on pairs of signal-response arrays, generating 576 models that were then ranked by goodness of fit to data (a least squares fit based on R², see Supplementary Table S3). To prevent overfitting, the number of components for each model was determined using 7-fold cross-validation (Wold et al., 2004). Importantly, the whole process of creating and evaluating models ran in DataRail in a matter of minutes, and every model could be traced back to the transformed data from which it was derived.
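The bookkeeping behind this model sweep can be sketched as follows. The Python code below illustrates only the enumeration of signal-response pairs, the partitioning of samples for 7-fold cross-validation, the R² score and the ranking step; it does not implement PLSR. The array names, the interleaved fold assignment, the sample count of 21 and all function names are hypothetical, and the sketch assumes 24 signal arrays and 24 response arrays, the reading consistent with 576 models.

```python
from itertools import product


def all_pairs(signal_arrays, response_arrays):
    """Every signal-response combination; each pair would yield one model."""
    return list(product(signal_arrays, response_arrays))


def k_fold_indices(n_samples, k):
    """Split sample indices into k (train, test) partitions (interleaved folds)."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    everything = set(range(n_samples))
    return [(sorted(everything - set(test)), test) for test in folds]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    m = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - m) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rank_models(scores):
    """Model keys sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


signals = [f"signals_{i}" for i in range(24)]      # hypothetical array names
responses = [f"responses_{i}" for i in range(24)]
pairs = all_pairs(signals, responses)
folds = k_fold_indices(21, 7)                      # 21 samples is an invented count
```

Keeping the pair of array names as the key of each model is what allows every fitted model to be traced back to the transformed arrays from which it was derived.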
A variety of input arrays gave rise to top scoring models, but area under the curve was clearly the best measure of output (Supplementary Table S4). Models based on unit variance scaling of input data and area under the curve, which constituted the best form for the input in an earlier PLSR study (Gaudet et al., 2005), scored no better than 218 out of the 576 models and had R² values 4-fold worse than the best model. Had we simply assumed our previous findings to be universally applicable, we would have generated models with very poor performance. When the best performing model (whose scores and loading plots can be found in Supplementary Fig. S3) was examined by variable importance of projection (VIP; Gaudet et al., 2005) to see which signals were most predictive of cytokine secretion, the levels of phosphorylation of Hsp27 and cJun (each at 0, 30 min and 3 h) comprised 6 of the 10 highest scoring variables. Phospho-Hsp27 is an integrated measure of p38 kinase activity and cJun of JNK kinase activity; intriguingly, the levels of activating phosphorylation on p38 and JNK kinases were considerably less informative. Thus, the steady-state activities of p38 and JNK (captured by t = 0 data) appear to play a key role in determining the extracellular concentrations of five cytokines and growth factors involved in epithelia-immune cell interactions. Consistent with this idea, it has previously been described that RANTES secretion is positively regulated by p38 MAPK and JNK in intestinal and airway epithelial cells (Pazdrak et al., 2002; Yan et al., 2006), as it is in liver.
2.8 Facilitating data sharing and publication
The fact that DataRail packages primary and transformed data arrays and their provenance together makes it a good means to share data among laboratories. However, knowledge transfer would be greatly facilitated by including figures, particularly those destined for publication or public presentation, within DataRail in a manner that maintained the analysis itself, the provenance of the data, and the identities of all algorithms and free parameters. Users could then interact with published figures in a dynamic fashion that would go far beyond what is available in today's journals, while also discovering new ways in which the data could be viewed or put to use. We have implemented a special category of project whose UID can contain a PubMed ID and in which figures are saved as structured variables, ‘pseudo-arrays’, that are embedded in compendia in the same manner as other arrays. We are currently working on an additional feature in which all of the relevant data in a linked SBWiki are stored as a wiki-book (e.g. a PDF file), thereby ensuring a complete description of all experimental procedures, reagents, etc. In the case of open-source publication, the actual manuscript could also be embedded; otherwise, a link would be provided to the publisher.