


- In silico identification and comparative analysis of differentially expressed genes in human and mouse tissues

Data retrieving, screening, and classifying

Raw EST reports from dbEST (as of 2003/05/23 for human and 2003/07/10 for mouse) and cluster information from UniGene (build #161 for human and build #128 for mouse) were downloaded from NCBI. We parsed the EST reports to extract the EST data of Homo sapiens and Mus musculus, from which we retrieved the EST unique identifier (GI number), the GenBank accession number, and the library information, including "Organism", "dbEST lib id", "Lib Name", "Tissue type", and "Organ". For each EST record, we retrieved the corresponding UniGene data, including cluster ID, gene name (gene symbol), and gene description.
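The field extraction described above can be sketched as follows. This is a minimal illustration, not the parser used in the study: it assumes each report is a block of "label: value" lines and that records are separated by a line containing only "||". The labels "GenBank gi" and "GenBank Acc" and the function names are assumptions; the library field names are the ones quoted in the text.

```python
# Fields to keep; library labels are from the text, identifier labels are assumed.
WANTED = {
    "GenBank gi", "GenBank Acc",
    "Organism", "dbEST lib id", "Lib Name", "Tissue type", "Organ",
}


def parse_est_reports(text, separator="||"):
    """Return one dict per EST record, keeping only the wanted fields."""
    records = []
    current = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == separator:          # assumed end-of-record marker
            if current:
                records.append(current)
            current = {}
            continue
        key, colon, value = stripped.partition(":")
        key = key.strip()
        # Keep the first occurrence of each wanted label within a record.
        if colon and key in WANTED and key not in current:
            current[key] = value.strip()
    if current:                            # last record without a trailing separator
        records.append(current)
    return records


def by_organism(records, organism):
    """Select the records of one species, e.g. 'Homo sapiens'."""
    return [r for r in records if r.get("Organism") == organism]
```

Matching on the full label before the first colon keeps "Organ" and "Organism" apart, which a prefix match would not.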

For each EST library, we extracted a triplet of title, tissue, and organ from the "Lib Name", "Tissue type", and "Organ" fields, respectively, of the dbEST report files. Based on this triplet, each library was assigned to a tissue category in the TissuDB tissue hierarchy [45]. The library classification process is illustrated in Fig. 6. Libraries without a definite pathological description in the triplet were considered to be derived from normal tissues. To reduce variation due to unspecified tissue origin and artificially modified expression, libraries described as pooled, mixed, subtracted, differentially displayed, normalized, or derived from multiple tissues were excluded. Libraries without a clear description in the triplet were also discarded. Some artificially modified libraries may still have escaped this screening, but their effect on the present analysis should be small; moreover, some of them may in fact equalize expression counts, which would make the detection of differential expression more stringent.
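The screening rules above lend themselves to a simple keyword filter. The sketch below is an assumption about how such a filter might look, not the procedure of Fig. 6: the exclusion terms are taken from the text, while the disease terms, the function name, and the returned labels are hypothetical, and the mapping onto the TissuDB hierarchy is not attempted.

```python
# Terms named in the text as grounds for exclusion.
EXCLUDE_TERMS = (
    "pooled", "mixed", "subtracted",
    "differentially displayed", "normalized", "multiple tissues",
)
# Hypothetical examples of pathological descriptions; not from the source.
DISEASE_TERMS = ("tumor", "carcinoma", "cancer")


def screen_library(title, tissue, organ):
    """Classify a (title, tissue, organ) triplet as discard/exclude/diseased/normal."""
    text = " ".join(part for part in (title, tissue, organ) if part).lower()
    if not text.strip():
        return "discard"        # no usable description in the triplet
    if any(term in text for term in EXCLUDE_TERMS):
        return "exclude"        # artificially modified or multi-tissue library
    if any(term in text for term in DISEASE_TERMS):
        return "diseased"
    return "normal"             # no pathological description: treated as normal
```

Plain substring matching is crude: a title such as "non-normalized" would also be excluded, so a real filter would need curated patterns or manual review of borderline titles.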

In all, we downloaded 5,372,149 human ESTs from 8,145 EST libraries. The screening process described above left 6,247 libraries and 3,352,546 ESTs, distributed in 96,444 UniGene clusters, for analysis. Similarly, 841 mouse EST libraries were downloaded, of which 630 survived the same screening, leaving 3,009,721 ESTs (out of 3,132,883) distributed in 30,172 UniGene clusters for analysis.
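To put these counts in proportion, the retained fractions can be computed directly. The short script below uses only the numbers quoted above; the function and dictionary names are illustrative.

```python
def retained_percent(kept, total):
    """Percentage of items that survived screening."""
    return 100.0 * kept / total


# (kept, total) pairs taken from the counts reported in the text.
counts = {
    "human libraries": (6_247, 8_145),
    "human ESTs": (3_352_546, 5_372_149),
    "mouse libraries": (630, 841),
    "mouse ESTs": (3_009_721, 3_132_883),
}

for label, (kept, total) in counts.items():
    print(f"{label}: {retained_percent(kept, total):.1f}% retained")
```

The two species keep a similar share of their libraries, but the human set loses a much larger share of its ESTs than the mouse set.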

The 6,247 human libraries were classified by the process shown in Fig. 6 into 157 tissue/organ categories, of which 94 were normal, 53 tumor-related, and 10 related to other diseases. The 630 mouse libraries were classified into 108 tissue/organ categories, of which 99 were normal, 9 tumor-related, and none were related to other diseases. To simplify matters, only the analysis results for normal tissues are presented here; those for diseased tissues will be reported elsewhere.

A-C test for differentially expressed genes

To profile the genes expressed in a tissue, we extracted the UniGene cluster IDs of the ESTs that were assigned to the target tissue. For each gene in the target tissue, we performed the A-C test [21] to evaluate tissue specificity:
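In its commonly cited form, the A-C test of [21] gives the probability of observing y ESTs of a gene in one sample given x ESTs in the other. The expression below is that standard form, stated as an assumption about what was used here, with the symbols defined in the next paragraph:

```latex
\[
p(y \mid x) \;=\; \left(\frac{N_2}{N_1}\right)^{y}
\frac{(x+y)!}{x!\,y!\,\left(1+\dfrac{N_2}{N_1}\right)^{x+y+1}}
\]
```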

where x and y are the numbers of ESTs clustered in the same gene and expressed, respectively, in the target tissue and in all other tissues, and N1 and N2 are the total numbers of ESTs from the target tissue and from all other tissues, respectively. Following the criteria for using the Poisson distribution [21], tissues with insufficient ESTs (N1 or N2 < …) and genes accounting for too large a share of the ESTs (x ≥ N1 × 5% or y ≥ N2 × 5%) were excluded from the statistical test.
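A minimal sketch of the test in Python, assuming the commonly cited form of the A-C statistic, p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)). The function names, the cumulative lower-tail summation, and the `min_total` parameter are illustrative assumptions: the text does not state how p(y|x) was turned into a p value, and the minimum EST count is not given here, so it is left without a default.

```python
from math import exp, lgamma, log


def ac_prob(x, y, n1, n2):
    """Probability of y ESTs in sample 2 given x ESTs in sample 1 (A-C form)."""
    ratio = n2 / n1
    # Work in log space so that factorials of large counts do not overflow.
    log_p = (
        y * log(ratio)
        + lgamma(x + y + 1)              # log (x+y)!
        - lgamma(x + 1)                  # log x!
        - lgamma(y + 1)                  # log y!
        - (x + y + 1) * log(1 + ratio)
    )
    return exp(log_p)


def ac_lower_tail(x, y, n1, n2):
    """Assumed p value: chance of seeing y or fewer ESTs in sample 2."""
    return sum(ac_prob(x, k, n1, n2) for k in range(y + 1))


def passes_poisson_criteria(x, y, n1, n2, min_total, max_share=0.05):
    """Exclusion rules from the text; min_total is hypothetical (value not given)."""
    if n1 < min_total or n2 < min_total:
        return False                     # too few ESTs in one of the samples
    return x < n1 * max_share and y < n2 * max_share
```

With equal sample sizes and x = 0, the probabilities reduce to 1/2, 1/4, 1/8, ... for y = 0, 1, 2, ..., which is a convenient sanity check because they sum to 1.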

Orthologous gene data retrieval and correlation analysis

The raw data of HomoloGene (released on Feb. 2, 2004) were downloaded from NCBI. Using the taxonomy IDs in this database, we extracted curated human and mouse orthologous gene pairs and discarded those annotated as putative. For the curated orthologous gene pairs, we obtained their gene names and UniGene cluster IDs and linked them to the expression profiles we had computed using the A-C test. For each ortholog pair expressed in at least 3 tissues in both human and mouse, the association between their expression profiles was analyzed by applying Pearson's correlation to their tissue specificity p values. We classified the strength of association, using the absolute value of Pearson's correlation coefficient (r), as follows: 0.00–0.19 was regarded as very weak, 0.20–0.39 as weak, 0.40–0.59 as moderate, 0.60–0.79 as strong, and 0.80–1.00 as very strong.
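The correlation step can be sketched as below. This is an illustration under stated assumptions: each profile is a hypothetical dict mapping a tissue name to its p value, the correlation is computed over tissues present in both species, and the strength bins are treated as half-open intervals (the text leaves small gaps such as 0.19 to 0.20). Function names are not from the source.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson's correlation coefficient of two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_a * var_b)


def strength(r):
    """Map |r| to the five strength classes used in the text."""
    size = abs(r)
    if size < 0.2:
        return "very weak"
    if size < 0.4:
        return "weak"
    if size < 0.6:
        return "moderate"
    if size < 0.8:
        return "strong"
    return "very strong"


def ortholog_correlation(human, mouse, min_tissues=3):
    """Correlate two tissue -> p value dicts; None if too few shared tissues."""
    shared = sorted(set(human) & set(mouse))
    if len(shared) < min_tissues:
        return None
    r = pearson_r([human[t] for t in shared], [mouse[t] for t in shared])
    return r, strength(r)
```

Requiring shared tissues is one reading of "expressed in at least 3 tissues in both human and mouse"; a stricter or looser tissue-matching rule would change which pairs are analyzed.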

updated on: 31 Oct 2006