2.1 Design goals and implementation
To facilitate the collection, annotation and transformation of experimental data, DataRail software is designed to meet the following specific requirements (see Fig. 1): (i) serve as a stable repository for experimental results of different types while recording key properties of the biological setting and complete information about all data processing steps; (ii) promote model development and analysis via internal visualization and modeling capabilities; (iii) interact efficiently and transparently with external modeling and mining tools; (iv) meet new requirements in data collection, annotation and transformation as they arise; and (v) facilitate data sharing and publication through compatibility with existing bioinformatics standards. A system meeting these requirements was designed in which data is stored in a succession of regular multi-dimensional arrays, known as ‘data cubes’ in information technology (Gray et al., 1997), each representing transformations of an original set of primary data. The integrity of data is maintained by tagging the primary data with metadata referenced to a controlled ontology, storing all arrays arising from the same primary data in one file structure, documenting the relationships of arrays to each other, storing algorithms used for data transformation with data arrays and assigning each data structure a unique identifier (UID) based on a controlled semantic. DataRail was implemented as a MATLAB toolbox (http://www.mathworks.com/) with scripting and GUI-based interaction and incorporating a variety of data processing algorithms. DataRail works best as a component of a loosely coupled set of software tools including commercial data mining packages such as Spotfire (http://spotfire.tibco.com/) or public toolboxes for modeling.
In addition, DataRail is designed to communicate with a semantic Wiki, to be described in a separate paper but available at the DataRail download site, that is better designed for storing textual information, such as experimental protocols, and that documents DataRail's use of UIDs.
2.2 System overview
Information in DataRail arising from a single set of experiments is organized into a compendium, which consists of multiple n-dimensional data arrays, each of which contains either primary data or processed data (see Fig. 2). It is left up to users to determine the breadth of experimental data included within each compendium, but good practice is to group results with similar experimental aims, biological setting or place of publication into one compendium. DataRail also supports creation of containers for multiple compendia known as projects. The dimensionality of arrays containing primary data is determined by the user at the time of import, making it possible to accommodate a wide range of experimental approaches and measurement technologies. For example, measuring a few properties of many samples by flow cytometry generates an array of different dimensionality than measuring many variables in a few samples by mass spectrometry. In practice, data in our laboratory can usually be described in six dimensions: three for the experimental conditions (e.g. cell type, cytokine stimuli and small-molecule treatment), one for time, one for experimental replicates and one for actual measurements.
Arrays of transformed data are generated from primary data by applying numerical algorithms that normalize, scale or otherwise increase accuracy and information content. Algorithms used during data processing, along with the values of all free parameters, are stored with each array to maintain a complete record of all transformations performed prior to data mining or modeling.
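The compendium organization described above can be illustrated with a minimal Python sketch. It is not DataRail code (DataRail is a MATLAB toolbox); the class names `DataArray` and `Compendium`, the axis names and all labels such as `drugA` are hypothetical and chosen only to mirror the six dimensions listed in the text.

```python
from dataclasses import dataclass, field


@dataclass
class DataArray:
    """One n-dimensional array with labeled axes (hypothetical layout)."""
    name: str
    dims: dict                                   # axis name -> labels, in axis order
    values: dict = field(default_factory=dict)   # coordinate tuple -> value

    @property
    def shape(self):
        # Size of each axis, in the order the axes were declared
        return tuple(len(labels) for labels in self.dims.values())

    def _key(self, coords):
        # Check every label against its axis and order the key by axis
        key = []
        for axis, labels in self.dims.items():
            label = coords[axis]
            if label not in labels:
                raise KeyError(f"{label!r} is not a label of axis {axis!r}")
            key.append(label)
        return tuple(key)

    def set(self, value, **coords):
        self.values[self._key(coords)] = value

    def get(self, **coords):
        # Entries that were never set return None ("not measured")
        return self.values.get(self._key(coords))


@dataclass
class Compendium:
    """Groups the primary and transformed arrays of one set of experiments."""
    name: str
    arrays: dict = field(default_factory=dict)

    def add(self, array):
        self.arrays[array.name] = array


# Six axes: three experimental conditions, time, replicate, measurement
primary = DataArray(
    name="primary",
    dims={
        "cell_type": ["primary_hepatocyte", "HepG2"],
        "cytokine": ["none", "TNF"],
        "inhibitor": ["none", "drugA"],
        "time_min": [0, 30, 180],
        "replicate": [1],
        "readout": ["p-Akt", "p-ERK"],
    },
)
primary.set(1.7, cell_type="HepG2", cytokine="TNF", inhibitor="none",
            time_min=30, replicate=1, readout="p-Akt")

compendium = Compendium("demo")
compendium.add(primary)
```

Storing values by labeled coordinates rather than by position keeps the sketch short; a dense numerical array would be the more natural choice for real data.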
2.3 Test cases
We have tested DataRail on seven sets of recent data available in our laboratories, containing between 5 × 10^3 and 1.6 × 10^6 data points. Each set had a unique structure and gave rise to arrays with 4–6 dimensions (see Supplementary Table S1). Here we discuss the analysis of a ‘CSR Liver compendium’, a cue-signal-response dataset (Gaudet et al., 2005) comprising 22 512 measurements in primary human hepatocytes and a hepatocarcinoma cell line (HepG2 cells; L.A. et al., unpublished data). In this compendium, cells were exposed to 11 cytokine treatments and 8 small-molecule drugs, upon which the states of phosphorylation of 17 signaling proteins (at 30 min and 3 h) and the concentrations of 50 extracellular cytokines (at 12 and 24 h) were measured using bead-based micro-ELISA assays.
2.4 Storing primary data and metadata
Tagging primary data with metadata is essential to its utility and involves two aspects of DataRail: a new metadata standard and a process for actually collecting the metadata. The metadata standard is based on our proposed MIDAS format (Minimum Information for Data Analysis in Systems Biology), which is itself based on pre-existing minimum-information standards such as MIACA (Minimum Information About a Cellular Assay, http://miaca.sourceforge.net/). MIDAS is a tabular (or spreadsheet) format that specifies the layout of experimental data files that give rise, upon import into DataRail, to an n-dimensional data array. The MIDAS format was derived from the ‘experimental module’ concept in MIACA, with modifications required for model-centric data management (see Fig. 3). Typically a MIDAS file is used to input information from instruments into DataRail and to export information from DataRail into other software that uses spreadsheets. However, export from a data array to a MIDAS file entails loss of information about data provenance and prior processing steps. We are therefore in the process of implementing a standardized format for exchanging DataRail files that does not depend on the use of MATLAB files (see Section 3 for details). Each row in a MIDAS table represents a single experimental sample; each column represents one sample attribute, such as identity (e.g. multi-well plate name or well coordinate), treatment condition, or value obtained from an experimental assay. A column header consists of two values: (i) a two-letter code defining the type of column (e.g. TR for treatment, DV for data value), and (ii) a short column name (e.g. a small molecule inhibitor added or a protein assayed). The body of each column stores the corresponding value for each row (sample), such as a plate/well name, reagent concentration, time point, or data value (see Supplementary Materials for details and example MIDAS spreadsheets).
MIDAS is designed to fulfill the need for data exchange and analysis within a closely knit research group. It is not a stand-alone solution for archival storage or publication and should be implemented in conjunction with MIAME, MIACA or DataRail itself.
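As an illustration of the row-and-column layout described above, the following Python sketch reads a small MIDAS-style table. It is not the DataRail importer. The `TR` and `DV` codes come from the text; the `ID` code, the colon separating code from column name, the function name `parse_midas` and all sample values are assumptions made for this example (the actual conventions are given in the Supplementary Materials).

```python
import csv
import io


def parse_midas(text, sep=":"):
    """Parse a MIDAS-style table into (columns, samples).

    Assumes each header cell is '<two-letter code><sep><column name>';
    the separator is an assumption, not taken from the MIDAS definition.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = []
    for cell in header:
        code, _, name = cell.partition(sep)
        columns.append((code.strip().upper(), name.strip()))

    samples = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        sample = {}
        for (code, name), cell in zip(columns, row):
            value = cell.strip()
            if code == "DV":
                # Data values are numeric; an empty cell means "not measured"
                value = float(value) if value else None
            sample.setdefault(code, {})[name] = value
        samples.append(sample)
    return columns, samples


# Hypothetical two-sample table: one row per sample, one column per attribute
example = """ID:well,TR:TNF,TR:inhibitorX,DV:p-Akt
A1,0,0,0.8
A2,1,0,2.4
"""

columns, samples = parse_midas(example)
```

Grouping each sample's cells by column code makes it straightforward to treat `TR` columns as array dimensions and `DV` columns as measurements at import time.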
The sequence of steps involved in entering metadata into DataRail is designed to accommodate the rhythm of a typical laboratory in which simple annotation is possible while experiments are in progress, but detailed data analysis is performed subsequently. As an experiment is being designed, a MIDAS file specifying the dimensionality and format of the data (treatments, time points, readouts, etc.) is created, and scripts specialized to different instruments or experimental methodologies are then used to add results to the ‘empty’ MIDAS file. Thus far we have written a script to import bead-based micro-ELISA data generated by a Luminex reader running BioRad software (Bio-Plex). We have also implemented a general purpose Java program for MIDAS file creation that can be used to import data into DataRail, used as a stand-alone application, or integrated into other software. Within the MIDAS layout utility, wells that will be treated in a similar or sequential manner are selected via a GUI and appropriate descriptions of the samples added via pop-up tabs (see Supplementary Fig. S1). When layout is complete, a correctly formatted MIDAS file is generated, ready for the addition of data. Lists that assist experimentation are also created (these lists typically specify times of reagent addition, sample withdrawal, etc.). We invite instrument manufacturers to incorporate this utility into their software so that creation of MIDAS-compliant files is automatic; the code is therefore distributed under a non-viral caBIG open source license developed by the National Cancer Institute. If a MIDAS file has not been generated at the outset of an experiment, it is possible to convert experimental data at any point prior to import, but in this case MIDAS-associated support tools are not available to help with experiments.
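The idea of an ‘empty’ MIDAS file created at design time can be sketched as follows. This is a simplified Python illustration, not the Java layout utility described above: the function name `empty_midas`, the `DA` code used for the time column, the colon separator and the reagent and readout names are all assumptions made for the example.

```python
import csv
import io
from itertools import product


def empty_midas(treatments, time_points, readouts):
    """Return a MIDAS-style CSV with one row per planned sample.

    treatments: dict mapping treatment name -> list of levels.
    Data-value (DV) cells are left empty, to be filled in later by an
    instrument-specific import script.
    """
    header = ([f"TR:{name}" for name in treatments]
              + ["DA:time"]  # assumed code for the time column
              + [f"DV:{name}" for name in readouts])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    # Full factorial design: every combination of treatment levels and times
    for levels in product(*treatments.values()):
        for t in time_points:
            writer.writerow(list(levels) + [t] + [""] * len(readouts))
    return buffer.getvalue()


template = empty_midas(
    treatments={"TNF": [0, 1], "inhibitorX": [0, 1]},
    time_points=[0, 30, 180],
    readouts=["p-Akt", "p-ERK"],
)
```

A full factorial design is used here only for brevity; a real layout would follow the wells actually selected by the experimenter.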
As mentioned above, DataRail need not be used in combination with SBWiki, a wiki based on semantic web technology (Berners-Lee, 2001). For the current discussion, four features of SBWiki are important. First, a web form used for upload prompts users to enter the metadata, such as user name, date, cell type, etc., required for full MIDAS compliance, and this data is stored as a wiki page. Because continuous web access is easy to arrange, even for geographically dispersed instruments, users record metadata when files are first saved to a central location. This is very important in practice because metadata is rarely added when the process is cumbersome or separated in time from data collection. Second, use of semantic web forms makes it possible to create simple, familiar and easily modified interfaces while collecting structured information. In contrast, tools for accessing metadata in traditional databases or XML files are more difficult to use and require considerable expertise to modify. Third, as data is imported it is assigned a UID by SBWiki itself, which directly encodes, among other things, the type of data and the person who created it (see Supplementary Materials). The assignment of a UID makes it possible to track the origin of all data in DataRail, independent of the array-compendium-project structure. Fourth, although metadata describing key aspects of experiments are stored internal to the MIDAS file, complete details of experimental protocols and reagents are stored in SBWiki. Storage external to the MIDAS file allows complex textual information to be modified and reused more easily. Links from data arrays to external files are made via URLs that follow the UID scheme described above and can use the revision history in SBWiki to reference a specific version of a protocol or reagent.
When constructing the CSR Liver compendium, a spreadsheet generated by Bio-Plex software was appended to a MIDAS file, and a second MIDAS file containing data on total protein concentrations was generated using a plate reader. Overall, three primary data arrays were created from CSR Liver data: one recording phosphorylation states of 17 proteins at three time points (0, 30 min, 3 h), one recording extracellular cytokine concentrations, also at three time points (0, 12 h, 24 h), and one recording total protein concentration. In principle these arrays could be combined to bundle data together, but the resulting single array would be sparsely populated. In addition, bundling data into a single array is not the same as fusing different types of data. The fusion of flow cytometry, Western blot and live-cell imaging data (J.A. et al., unpublished data) is facilitated by DataRail but also requires biological insight and problem-specific modeling.
2.5 Adding transformed arrays to compendia
Once primary data is imported into a new compendium, it is then transformed by one or more algorithms internal to DataRail, by user-specified algorithms, or by external programs, to create a new transformed data array. Transformations can change numerical values within an array or can expand or collapse the dimensionality of arrays. A long time series, for example, can be transformed into a shorter series involving selected times, time-averaged data or integrated values. When a transformation is performed on an array, the code used for the transformation and the values of free parameters are stored, along with a reference to the input data (in the current implementation, the algorithms themselves are recorded as the text of MATLAB functions), so that the compendium is a self-documenting entity, in which the provenance of data can be tracked.
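The record-keeping described above can be sketched in Python as follows. This is an illustration under stated assumptions, not DataRail's implementation: the dictionary layout and the names `apply_transformation` and `integrate` are hypothetical, the numbers are invented, and the sketch records only the function name, whereas DataRail stores the full text of the MATLAB function.

```python
def integrate(times, values, baseline=0.0):
    """Trapezoidal area under the curve after subtracting a baseline."""
    return sum(
        (t1 - t0) * ((v0 - baseline) + (v1 - baseline)) / 2
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


def apply_transformation(compendium, source_name, new_name, func, **params):
    """Apply func to every time series of one array and store the result
    in the compendium together with a provenance record."""
    source = compendium[source_name]
    data = {
        sample: func(source["times"], values, **params)
        for sample, values in source["data"].items()
    }
    compendium[new_name] = {
        "data": data,  # the time dimension has been collapsed
        "provenance": {
            "input": source_name,        # reference to the input array
            "function": func.__name__,   # simplification: name, not source text
            "parameters": params,        # values of all free parameters
        },
    }
    return compendium[new_name]


# Hypothetical primary array: one time series (0, 30, 180 min) per condition
compendium = {
    "primary": {
        "times": [0, 30, 180],
        "data": {"TNF": [1.0, 3.0, 2.0], "control": [1.0, 1.0, 1.0]},
        "provenance": {},
    }
}

auc = apply_transformation(compendium, "primary", "auc", integrate, baseline=1.0)
```

Because each transformed array names its input and parameters, the chain of arrays can be walked back to the primary data, which is the sense in which the compendium is self-documenting.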
Overall, DataRail can perform a diversity of transformations falling into several general categories. Simple arithmetic operations include subtracting background from primary data, or dividing one type of data by another (see Supplementary Fig. S2). For example, Bio-Plex-based measures of protein phosphorylation in CSR Liver data were divided by total protein concentration to correct for differences in cell number and extraction efficiency. In a second type of transformation, metrics such as ‘area under the curve’, maximal value of a variable in a series, standard deviation of a series and relative values are computed. Third, complex data transformations are performed, including mean-centering and variance-scaling, both of which are helpful in performing principal component analysis (PCA) or assembling models using PLSR