


Systems and Method
- Construction of an open-access database that integrates cross-reference information from the transcriptome and proteome of immune cells

2.1 Samples of immune cells and tissues
In this study, we analyzed a number of samples for mRNA and/or protein profiling, including those derived from various tissues and immune cells extracted from human and mouse biopsies, such as T lymphocytes, B lymphocytes, natural killer cells, natural killer T (NKT) cells, dendritic cells, macrophages, mast cells and intestinal epithelial cells, as well as several popular lymphoid or myeloid cell lines. Detailed information regarding these samples and their preparation is available in sample attribute tables, which can be accessed by clicking each sample identifier, termed ‘cellID’, in the data sets section of the website.

2.2 Extracting data from public databases
For the integration of mRNA and protein profiling data for immune cells, we took advantage of the Entrez Gene database at the National Center for Biotechnology Information (NCBI), because it provides solid references for various lines of genomic information, such as transcript and protein sequences and Gene Ontology (GO) terms, with unique, stable and traceable gene identifiers (Maglott et al., 2007). The sequence data for transcripts and proteins were extracted from the RefSeq, GenBank and UniGene databases maintained by NCBI. Additionally, protein sequence data were extracted from the International Protein Index (IPI), served by the European Bioinformatics Institute (EBI), since IPI is designed to provide complete non-redundant data sets for the human, mouse and rat proteomes (Kersey et al., 2004). The protein domain data were extracted from Pfam. Because entries in the RefSeq protein database do not have external links to Pfam, we assigned Pfam domains to the respective RefSeq peptide entries using HMMER (Eddy, 1998), with the threshold E-value set to 0.1. All sequence data for probe sets on Affymetrix GeneChip expression arrays were downloaded from the Affymetrix website. Human and mouse genome sequence data (Build 36) were also downloaded from the NCBI website. We collected publicly available microarray experimental data from public repositories including Gene Expression Omnibus (GEO) and ArrayExpress.
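The E-value cut-off used for Pfam domain assignment can be illustrated with a minimal sketch. It assumes the HMMER results have already been parsed into (protein, domain, E-value) tuples. The function name, the tuple layout, the placeholder identifiers and the choice to treat the threshold as inclusive are all assumptions made for illustration and are not taken from the RefDIC pipeline.

```python
from collections import defaultdict

EVALUE_THRESHOLD = 0.1  # threshold stated in the text


def assign_domains(hits, threshold=EVALUE_THRESHOLD):
    """Group Pfam domain accessions by protein, keeping only hits whose
    E-value is at or below the threshold (inclusive cut-off is assumed)."""
    domains = defaultdict(list)
    for protein_id, pfam_id, evalue in hits:
        if evalue <= threshold:
            domains[protein_id].append(pfam_id)
    return dict(domains)


# Hypothetical parsed hits: (RefSeq peptide ID, Pfam accession, E-value).
example_hits = [
    ("NP_000001", "PF00001", 1e-30),  # kept
    ("NP_000001", "PF00002", 0.5),    # dropped: E-value above threshold
    ("NP_000002", "PF00003", 0.05),   # kept
]

print(assign_domains(example_hits))
```

A protein whose hits all exceed the threshold simply does not appear in the result, which corresponds to an entry with no assigned Pfam domain.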

2.3 Microarray experiments
Total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA) and/or an RNeasy kit (Qiagen, Hilden, Germany). The RNA integrity was assessed using a bioanalyzer (Agilent Technologies Inc., Palo Alto, CA, USA). Samples whose RNA integrity number was greater than 7.0 were used for mRNA profiling by microarray analysis. cDNA synthesis, cRNA amplification, biotinylation and fragmentation were performed with a One-Cycle Target Labeling Kit (Affymetrix, Santa Clara, CA, USA). Twenty micrograms of labeled target RNA was hybridized with Mouse Genome 430 2.0 (Mouse430_2) or 430A 2.0 (Mouse430A_2), or Human Genome U133 Plus 2.0 (HG-U133_Plus2) GeneChip expression arrays (Affymetrix) at 45°C for 16 h, as described in the manufacturer's instructions. Washing stages and streptavidin-phycoerythrin staining were conducted using a GeneChip Fluidics Station (Affymetrix). Subsequently, the chips were scanned using a GeneChip Scanner 3000 (Affymetrix). Array data were normalized using either the MAS5 (Hubbell et al., 2002) or the gcRMA (Irizarry et al., 2003) algorithm. All of the microarray data were deposited in CIBEX under accession number CBX19.

2.4 Gene annotation of Affymetrix probe sets
Each probe set on the Affymetrix arrays consists of 22 oligonucleotides, each 25 bases long. Half of these probes were designed as a ‘perfect match’ (PM) for a specific transcript. To identify which transcript could be probed by each probe set, the 11 PM sequences were subjected to a Basic Local Alignment Search Tool (BLAST) search (Altschul et al., 1990) against the RefSeq (39 179 human and 47 930 mouse cDNA sequences), GenBank (128 863 human and 147 850 mouse cDNA sequences) and UniGene [6 988 853 human and 4 277 970 mouse expressed sequence tag (EST) sequences] databases (updated on 1 November 2006). In this study, when the nucleotide sequences of more than 9 of the 11 probes in the set matched that of a given transcript perfectly, we considered that the probe set targeted this particular transcript. If they matched a complementary sequence of a given transcript, or multiple transcript sequences originating from different gene loci, or if they failed to match any transcript in any of the databases, then it was concluded that the probe set targeted an antisense transcript of a known gene, or was a cross-hybridizing or non-informational probe set, respectively. Consequently, the same Entrez GeneID was given to a probe set targeting a specific transcribed sequence as for one targeting the corresponding transcript. Based on the results of the BLAST search, we classified all of the probe sets into six categories: categories A, B and C included the probe sets targeting specific RefSeq transcripts, specific GenBank transcripts (i.e. present in GenBank but not in RefSeq), and specific transcripts found only in the UniGene database, respectively; category D included those sets targeting antisense transcripts; category E consisted of cross-hybridizing probe sets; and category X comprised non-informational probe sets.
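The six-category scheme above can be sketched as a small decision function. This is an illustration under stated assumptions, not the authors' implementation. The input record layout (`db`, `gene_id`, `antisense`, `n_perfect`) is hypothetical, and the text does not say in which order the rules are applied when several hold at once. The sketch assumes that cross-hybridization is checked first, then antisense-only matches, then the database of origin.

```python
MIN_PERFECT_PROBES = 10  # "more than 9 of the 11 probes" in the text

# Sense-strand category by database, in order of priority (A before B before C).
DB_CATEGORY = [("RefSeq", "A"), ("GenBank", "B"), ("UniGene", "C")]


def classify_probe_set(matches):
    """Return the category letter (A, B, C, D, E or X) for one probe set.

    `matches` is a list of dicts, one per transcript hit by the BLAST search,
    with hypothetical keys: 'db', 'gene_id', 'antisense', 'n_perfect'.
    """
    # Keep only transcripts matched perfectly by more than 9 of the 11 PM probes.
    hits = [m for m in matches if m["n_perfect"] >= MIN_PERFECT_PROBES]
    if not hits:
        return "X"  # non-informational
    if len({m["gene_id"] for m in hits}) > 1:
        return "E"  # cross-hybridizing: transcripts from different gene loci
    sense_dbs = {m["db"] for m in hits if not m["antisense"]}
    if not sense_dbs:
        return "D"  # only complementary-strand matches: antisense transcript
    for db, category in DB_CATEGORY:
        if db in sense_dbs:
            return category
    raise ValueError("unknown database in matches")


example = [
    {"db": "GenBank", "gene_id": 101, "antisense": False, "n_perfect": 11},
    {"db": "UniGene", "gene_id": 101, "antisense": False, "n_perfect": 10},
]
print(classify_probe_set(example))
```

Because a RefSeq hit takes priority over a GenBank hit, and GenBank over UniGene, category B is only returned when no RefSeq transcript qualifies, which mirrors the "present in GenBank but not in RefSeq" wording.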

2.5 Two-dimensional gel-based proteome experiments
Quantitative protein profiling by 2-DE followed by gel image analyses and mass spectrometry was performed essentially as described (Kimura et al., 2006) with a few modifications (Kimura et al., manuscript in preparation). Briefly, whole-cell protein samples (250 µg) were separated by isoelectric focusing (IEF) in the first dimension and by SDS-PAGE in the second dimension, and subsequently the proteins in the gel were stained with SYPRO Ruby (Molecular Probes, Eugene, OR, USA). The 2-DE gel images were scanned using a ProXPRESS fluorescent imager (PerkinElmer, Inc., Boston, MA, USA) and analyzed using Progenesis Workstation (Nonlinear Dynamics Ltd, Newcastle upon Tyne, UK). The protein levels for each spot on a given gel were normalized by median centering. In order to identify the proteins in each spot on the gels, the mass spectra of digest fragments originating from the proteins excised from each gel were obtained by peptide mass fingerprinting (PMF) and/or MS/MS methods and were searched against the IPI database (Version 3.21; 51 432 mouse protein sequences) using MASCOT Version 1.8.06 (Matrix Science, London, UK). To define a spot on a gel that corresponded to one on another gel, all gel images were superimposed. When the position of a spot on a gel matched that of one on another, we considered them provisionally as identical protein spots. Each protein was assigned to one of five classes (A–E) based on reliability of identification, primarily according to the degree of confidence in the database search results for PMF and/or MS/MS analysis data (Kimura et al., 2006). The reliability for classes A (conclusive) and B (most likely) was based solely on the MS data. Protein identification in class C was supported by comigration of the previously identified proteins on 2-DE and was consistent with the predicted MS data for the identified proteins, although the MS data alone did not allow us to identify the protein with sufficient confidence. On the other hand, proteins identified in class D were supported only by their electrophoretic behaviors on 2-DE, while proteins in class E remained uncharacterized even after we combined 2-DE and MS data. All of the proteomic data were deposited in the PRIDE database under accession numbers 2354–2378 and 2414.
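As a rough illustration of the median-centering step mentioned above, the sketch below rescales the spot volumes of one gel so that the gel's median becomes 1. The text does not state whether centering was done by division on raw volumes or by subtraction on a log scale. Division is assumed here, and the function name and spot labels are hypothetical.

```python
from statistics import median


def median_center(spot_volumes):
    """Divide every spot volume on a gel by that gel's median volume.

    `spot_volumes` maps a spot label to its raw volume. Division on the raw
    scale is an assumption; on log-transformed data the equivalent step
    would subtract the median instead.
    """
    gel_median = median(spot_volumes.values())
    if gel_median == 0:
        raise ValueError("median spot volume is zero; cannot center")
    return {spot: volume / gel_median for spot, volume in spot_volumes.items()}


# Hypothetical raw volumes for three spots on one gel.
gel = {"spot_1": 2.0, "spot_2": 4.0, "spot_3": 8.0}
print(median_center(gel))
```

Centering each gel on its own median makes spot levels comparable across gels that differ in overall loading or staining intensity.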

2.6 Correlation analysis of mRNA and protein profiles
For correlation analysis of mRNA and protein levels from the same gene in different samples, Pearson's correlation coefficients were used. Values for the signal intensity of the probe sets were taken as expression levels of transcripts from the corresponding genes. When gene products from a single gene produced multiple protein spots on a 2-DE gel, the sum of the volumes of these protein spots was taken as the volume of these gene products in this study. When a protein spot was found to include two or more proteins, it was excluded from the analysis. The significance of the correlation coefficients was tested using the one-tailed t-distribution with (n–2) degrees of freedom.
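The calculation described above can be sketched with the standard library alone. Spot volumes from the same gene are summed per sample, Pearson's r is computed against the probe-set intensities, and r is converted to a t statistic with n-2 degrees of freedom using the standard relation t = r·sqrt((n-2)/(1-r²)). The function names and the toy numbers are illustrative assumptions, not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    denom = 1.0 - r * r
    if denom <= 0:  # perfect correlation (or rounding just past +/-1)
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / denom)


# Toy data for one gene across five samples (made-up values).
mrna_signal = [1.0, 2.0, 3.0, 4.0, 5.0]
# Each inner list holds the volumes of all spots assigned to the gene in a sample.
spot_volumes = [[1.0, 1.0], [2.0, 1.0], [2.0, 3.0], [3.0, 3.0], [6.0, 2.0]]
protein_level = [sum(spots) for spots in spot_volumes]  # sum over spots

r = pearson_r(mrna_signal, protein_level)
t = t_statistic(r, len(mrna_signal))
print(r, t)
```

The one-tailed p-value is then the upper-tail probability of the t-distribution with n-2 degrees of freedom at this t. It is not computed here because it needs a t-distribution table or a statistics library such as SciPy.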

2.7 Web database platform
The web server for RefDIC runs on a machine running CentOS and the Apache web server. All scripts for data querying, retrieving and visualization were written in Perl. A MySQL 4.1x server is used as the storage engine for the database.
