
Results and Discussion
- Construction of an open-access database that integrates cross-reference information from the transcriptome and proteome of immune cells

4.1 Data statistics for RefDIC
RefDIC comprises quantitative transcriptomic and proteomic data for immune cells. For transcriptomic data, a total of 125 and 34 microarray data sets from 60 and 21 different subsets of immune cells and tissues from mouse and human, respectively (Table 1), have been newly obtained and stored in the database. Table 2 summarizes the results for the annotation and classification of Affymetrix GeneChip probe sets. Based on our annotation system, we found that 39 406 and 43 835 probe sets on Mouse430_2 and HG-U133_Plus2, respectively, could detect specific transcripts from 20 623 and 19 300 distinct mouse and human genes, respectively. For proteomic data, 23 2-DE gel images from 21 different subsets of immune cells and 2 subsets of epithelial cells were obtained and stored for the mouse (Table 1). We could identify at least 963 protein spots on each gel image, and 435 proteins in total from distinct genes (Table 3 and Supplementary Table S1). We quantified the amount of protein in these spots by gel image analysis. However, because most of our quantitative proteome data were obtained from a single 2-DE experimental run, the quantitative protein data must be interpreted with caution (Fievet et al., 2004; Gustafsson et al., 2004; Mahon et al., 2001). According to our previous study, the average coefficient of variation for the protein spot quantitative values obtained using essentially the same protocol as in this study was 26.8% (range 3.6–105%; Kimura et al., 2006). For convenience, the actual gel images used for quantification are accessible by clicking on the matchIDs found in the proteome profile section. The current availability of mRNA and protein profile data can be checked under ‘Statistics’ in the data set section of RefDIC (http://refdic.rcai.riken.jp/dataset.cgi).
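As a reminder of what the coefficient of variation quoted above measures, the following minimal sketch computes it for one protein spot quantified on replicate gels. The replicate values are invented for illustration, and the use of the sample standard deviation is an assumption; the text does not state which estimator was used in Kimura et al. (2006).

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the coefficient of variation (in percent) of replicate measurements.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100.0 * stdev(values) / m


# Hypothetical quantities of one spot measured on three replicate gels.
replicate_spot_quantities = [90.0, 100.0, 110.0]
cv_percent = coefficient_of_variation(replicate_spot_quantities)
print(f"CV = {cv_percent:.1f}%")
```

A spot whose value comes from a single run has no such estimate of its own, which is why the single-run quantities above call for cautious interpretation.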

4.2 Evaluation of the probe set annotation for Affymetrix GeneChip arrays and their relevance to protein profiling data
As some groups have already pointed out, there are problems associated with Affymetrix GeneChip probe set annotation (Dai et al., 2005; Harbig et al., 2005; Zhang et al., 2005). Because Affymetrix utilized information that was incomplete at the time of GeneChip design, the current GeneChip probe set design could include some discrepancies when compared to the most up-to-date genomic data. Furthermore, although the probe annotation data provided by Affymetrix stated that multiple probe sets were designed for the same gene, the relationship among these probe sets was unclear from the data provided. To enable the interpretation of such relationships, we have re-annotated all of the probe sets based on the results of a BLAST search against transcript sequences in the RefSeq, GenBank and UniGene databases, and have classified them into six categories (Table 2). This classification was based on the extent of curation for each database: the RefSeq database is widely accepted as a manually curated collection of sequences representing genomes and thus contains highly reliable transcript information (Pruitt et al., 2007); the GenBank database is a comprehensive public repository of sequences, including data from high-throughput cDNA sequencing projects, submitted by the scientific community (Benson et al., 2007); the UniGene database is an informational platform for partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters in silico (Schuler, 1997). Therefore, the probe sets in category A, which corresponded to transcripts in RefSeq, were most likely to probe mature mRNAs specifically. In contrast, we observed that ~44% and 51% of probe sets in categories B and C on Mouse430_2 and HG_U133_Plus2, respectively, were mapped to non-exonic regions (i.e. introns or downstream of the 3'UTR) of protein-coding RefSeq genes by BLAST searches against the genome sequences.
Figure 2A shows the distribution of the mean hybridization signal intensities for probe sets across 119 samples in categories A, B and C on the Mouse430_2 array. As we expected, the mean hybridization signal intensity for category A (5.56 ± 2.87) was significantly higher than that for B (4.09 ± 2.13) and C (3.6 ± 1.67). We observed similar trends with the data for the HG_U133_Plus2 array (data not shown). These results are consistent with those previously reported (Zhang et al., 2005). Based on our annotation, 10 984 and 12 008 genes were targets for multiple probe sets on Mouse430_2 and HG_U133_Plus2 arrays, respectively. We calculated Pearson's correlation coefficients for the expression patterns in different samples among the probe sets to which we assigned the same gene in a pair-wise fashion, and confirmed that there was a better correlation for pairs of probe sets from category A compared to pairs from the other categories (Fig. 2B).
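The pair-wise comparison described above can be sketched as follows. This is an illustrative outline only, not the pipeline used for RefDIC: the probe set identifiers, gene names and signal values are hypothetical, and the function names `pearson` and `pairwise_correlations` are introduced here for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant profile has no defined correlation
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(expression, gene_of):
    """Correlate every pair of probe sets that are assigned to the same gene.

    expression: probe set id -> list of signal intensities across samples
    gene_of:    probe set id -> assigned gene
    Returns a dict mapping (probe_a, probe_b) -> correlation coefficient.
    """
    probes_by_gene = {}
    for probe, gene in gene_of.items():
        probes_by_gene.setdefault(gene, []).append(probe)

    correlations = {}
    for probes in probes_by_gene.values():
        for a, b in combinations(sorted(probes), 2):
            correlations[(a, b)] = pearson(expression[a], expression[b])
    return correlations


# Hypothetical data: four probe sets measured in four samples.
expression = {
    "probe_1": [1.0, 2.0, 3.0, 4.0],
    "probe_2": [2.0, 4.0, 6.0, 8.0],
    "probe_3": [4.0, 3.0, 2.0, 1.0],
    "probe_4": [5.0, 5.0, 6.0, 7.0],
}
gene_of = {
    "probe_1": "geneX",
    "probe_2": "geneX",
    "probe_3": "geneX",
    "probe_4": "geneY",
}

for pair, r in pairwise_correlations(expression, gene_of).items():
    print(pair, round(r, 3))
```

Genes targeted by a single probe set (here the hypothetical geneY) contribute no pairs. Grouping the resulting coefficients by the annotation category of each probe set would then allow a comparison of the kind shown in Fig. 2B.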


4.4 Links to the external microarray data on public repositories
To serve as a data-sharing platform for genomics information in immunology, RefDIC can search and retrieve microarray experimental data in the public repositories GEO and ArrayExpress. At present, data searching is confined to data sets using Affymetrix GeneChip Mouse430_2, Mouse430A_2 and HG_U133_Plus2 arrays, simply because they are directly comparable with our data. This function should greatly facilitate data-sharing processes in immunology. In the future, we plan to add a tool to enable the visualization of expression profiles for publicly available microarray data that are relevant to immunology, together with our data on the website.


updated on: 28 Oct 2008