Two separate, normalized cDNA libraries were constructed from a single pool of RNA extracted from the nerve cord tissue of several individual crickets. Approximately 22,000 clones were isolated from these libraries. We sequenced 388 clones from the first library (LK01) and 14,114 clones from the second library (LK04), generating a total of 14,502 sequences. Preliminary sequence analysis revealed that sequencing from the 5' end of the ESTs provided higher quality reads than sequencing from the 3' end. As a result, the majority of our sequencing effort was directed at the 5' end: 14,261 sequences were generated from the 5' end and 241 sequences from the 3' end of the insert. Of the 14,502 sequences, 14,377 were at least 100 bases long after the vector and linker sequences were stripped. Read lengths for these 14,377 sequences ranged from 100 bases to 1,051 bases, with an average read length of 704 bases. Table 1 summarizes the results of the cDNA sequencing and basic bioinformatics analysis. All 14,377 sequences were submitted to GenBank and can be accessed through the accession numbers EH628894-EH643270.
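The length filter and read-length summary described above can be sketched as follows. This is a minimal illustration and not the pipeline used in this study: the function name `summarize_reads`, the `min_length` parameter, and the toy reads are assumptions, and the reads are assumed to have already been stripped of vector and linker sequence.

```python
from statistics import mean


def summarize_reads(trimmed_reads, min_length=100):
    """Keep reads of at least min_length bases and summarize their lengths.

    trimmed_reads: iterable of sequences (strings) that have already been
    stripped of vector and linker sequence (an assumption of this sketch).
    """
    lengths = [len(seq) for seq in trimmed_reads if len(seq) >= min_length]
    if not lengths:
        return {"kept": 0, "min": None, "max": None, "mean": None}
    return {
        "kept": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


# Toy input: one read is too short and is discarded.
toy_reads = ["A" * 50, "A" * 100, "A" * 300]
print(summarize_reads(toy_reads))
```

Applied to the real trimmed reads, a summary of this kind would be expected to report the counts and lengths given in Table 1.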
A Gene Index was created from these 14,377 acceptable sequences [77]. We identified 8,607 unique sequences, comprising 6,032 singletons and 2,575 tentative consensus sequences (TCs). TCs are assembled from multiple sequencing reads with overlapping sequence alignments. The 2,575 TCs were derived from 8,345 ESTs (Table 2) and ranged in length from 167 bases to 3,317 bases, with an average length of 935 bases. The number of ESTs per TC ranged from 2 to 41, with a mean of 3.24 ESTs per TC. The remaining unique sequences each consisted of a single EST. Singleton sequences ranged in length from 102 bases to 1,019 bases, with an average length of 700 bases (Table 3).
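The bookkeeping behind these counts can be made explicit with a short sketch. It assumes only a list giving the number of ESTs in each unique sequence (1 for a singleton, 2 or more for a TC); the function name `summarize_clusters` and the toy input are hypothetical and are not part of the gene indexing pipeline.

```python
def summarize_clusters(cluster_sizes):
    """Summarize a gene index from the number of ESTs in each unique sequence.

    cluster_sizes: list of integers, one per unique sequence
    (1 = singleton, >= 2 = tentative consensus sequence).
    """
    singletons = sum(1 for n in cluster_sizes if n == 1)
    tc_sizes = [n for n in cluster_sizes if n >= 2]
    ests_in_tcs = sum(tc_sizes)
    return {
        "unique": len(cluster_sizes),
        "singletons": singletons,
        "tcs": len(tc_sizes),
        "ests_in_tcs": ests_in_tcs,
        "mean_ests_per_tc": ests_in_tcs / len(tc_sizes) if tc_sizes else 0.0,
    }


# Consistency checks using only the totals reported in the text.
assert 6032 + 2575 == 8607    # singletons + TCs = unique sequences
assert 6032 + 8345 == 14377   # singleton ESTs + ESTs in TCs = all ESTs
print(round(8345 / 2575, 2))  # mean ESTs per TC: 3.24
```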
The 8,607 unique sequences were translated into all six possible reading frames and compared using BLAT [78] against a comprehensive non-redundant protein database maintained by the Dana-Farber Cancer Institute. This database contains ~3 million entries collected from UniProt, Swiss-Prot, RefSeq, and GenBank resources, together with additional sequences from TIGR and its affiliates. The BLAT algorithm is integrated into the gene indexing bioinformatics pipeline to reduce computing times when building and annotating other large gene indices (e.g. human, [79]; mouse, [80]; and rat, [81]). In future releases, the pipeline may be modified to use additional algorithms, such as BLASTX, when working with more limited and/or phylogenetically distinct gene indices such as our cricket gene index.
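For readers unfamiliar with six-frame translation, the following sketch shows the idea: three reading frames on the given strand and three on its reverse complement. It is an illustration only and not the translation step performed inside BLAT or the gene indexing pipeline. The standard genetic code is assumed, codons containing ambiguous bases are rendered as `X`, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from position 0; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict of the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
```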
Of the 8,607 unique sequences, 5,225 (60.7%) had a significant sequence similarity match to an entry in the protein database [see Additional file 1]. The remaining 3,382 (39.3%) unique sequences returned no significant matches to entries in the database, and no putative function could be assigned to them. However, 2,393 of these 3,382 sequences (70.8%) were identified by ESTScan [82] as having putative ORFs with an average length of 295 nucleotides. This suggests that the majority of these unidentified ESTs encode proteins and highlights the dearth of genomic information available for basal insect taxa.
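To illustrate what a putative open reading frame means at the sequence level, the sketch below reports the longest stop-free run of codons across all six frames. This is a deliberately naive stand-in: ESTScan uses a statistical model of coding sequence and tolerates sequencing errors, which this code does not attempt. The function name `longest_open_frame` is hypothetical, and the result is a length in nucleotides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_open_frame(seq):
    """Length (in nucleotides) of the longest stop-free codon run in six frames.

    A naive scan that ignores start codons and sequencing errors; it is not
    a reimplementation of ESTScan.
    """
    seq = seq.upper()
    strands = (seq, seq.translate(COMPLEMENT)[::-1])
    best = 0
    for strand in strands:
        for offset in range(3):
            run = 0
            for i in range(offset, len(strand) - 2, 3):
                if strand[i:i + 3] in STOP_CODONS:
                    run = 0  # a stop codon closes the current open frame
                else:
                    run += 3
                    best = max(best, run)
    return best


print(longest_open_frame("ATGGCCTAA"))
```

Because the scan does not require a start codon, it is suited to partial transcripts such as ESTs, where the true start may lie outside the read.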
The observed sequence similarities produced by the comparative analysis are consistent with our expectations given the tissue from which the cDNA library was constructed. While some of the unique sequences are similar to housekeeping genes, many are similar to genes that may influence stridulation (Table 4). For example, several unique sequences are similar to genes that regulate the timing of biological events (e.g. Period and Diapause bioclock protein; Table 4), others are similar to genes involved in nervous system signal transduction (e.g. cGMP-gated cation channel protein, G-protein-coupled receptor, Shab-related delayed rectifier K+ channel, Na+/K+/2Cl- cotransporter, Nicotinic acetylcholine receptor non-alpha subunit precursor, Potassium channel tetramerisation domain-containing protein 5, Voltage-dependent anion channel, and Syntaxin 7; Table 4), and still others are similar to genes that contribute to developmental events shaping either the nervous system (e.g. Even-Skipped; Table 4) or the wing (e.g. Notch, Wnt inhibitory factor 1; Table 4). In addition to potentially influencing our primary phenotype, many of these sequences will be useful to researchers interested in insect neural function (e.g. Calmodulin, Innexin; Table 4) and insect molecular evolution (e.g. Opsin, Dynein; Table 5).
Within our unigene set, we identified a number of genes that would be of comparative interest. To explore the utility of the Laupala unigene set as a comparative resource, we compared the sequences of ten ESTs from our unigene set to unigene sets available for Drosophila melanogaster, Anopheles gambiae, Bombyx mori, Apis mellifera, Tribolium castaneum, and Locusta migratoria (Table 5). The results show the evolutionary distinctiveness of Laupala sequences and their phylogenetic distance from EST sequences of other genomic models. Across the ten ESTs, the mean uncorrected sequence divergence (p) between Laupala and the other insect taxa surveyed was 30%. Furthermore, the mean distance between Laupala and Locusta was 89% of the mean pairwise distance among all taxa in the analysis. Thus, although Laupala and Locusta are both members of the insect order Orthoptera, the sequence divergence between them for this sample of ESTs is close to that found among different insect orders.
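The uncorrected divergence (p) is the proportion of compared sites at which two aligned sequences differ. A minimal sketch of that calculation follows; it assumes the two sequences are already aligned to equal length and that sites containing a gap or ambiguous character in either sequence are excluded. The function name `p_distance`, the `gap_chars` parameter, and the toy alignment are hypothetical and are not taken from the analysis reported in Table 5.

```python
def p_distance(seq_a, seq_b, gap_chars="-N?"):
    """Uncorrected proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguity character are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites in the alignment")
    return differences / compared


# Toy alignment: 2 differences across 10 compared sites.
print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))
```

Because p is not corrected for multiple substitutions at the same site, it underestimates the true divergence between distantly related taxa.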
Of the 5,225 sequences that matched protein entries, 408 could be assigned a Gene Ontology (GO, [83,84]) term (Figures 3, 4, 5). A total of 572 Biological Process GO terms were associated with predicted amino acid sequences from these 408 sequences. The 25 most frequent Biological Process GO terms are presented in Figure 3. The majority of Biological Process GO terms (488 or 85%) were assigned to five or fewer of the 408 sequences, and no Biological Process GO term was assigned to more than 45 sequences. A total of 275 Molecular Function GO terms were associated with predicted amino acid sequences from the 408 unique sequences. The 25 most frequent Molecular Function GO terms are presented in Figure 4. The majority of Molecular Function GO terms (221 or 80%) were assigned to five or fewer sequences. One Molecular Function GO term (protein binding) was assigned to 100 of the 408 sequences. A total of 212 Cellular Component GO terms were associated with predicted amino acid sequences from the 408 unique sequences. The 25 most frequent Cellular Component GO terms are presented in Figure 5. The most frequent Cellular Component GO term was nucleus, with 106 of the 408 unique sequences predicted to encode nuclear proteins. Again, the majority of these GO terms (163 or 77%) were assigned to no more than five of the 408 sequences.
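The kind of tally reported here (number of distinct terms, the most frequent terms, and the share of terms assigned to five or fewer sequences) can be sketched as follows. The input format, a mapping from sequence identifier to its GO terms within one ontology, is an assumption of this example, as are the function name `tally_go_terms` and the toy annotations.

```python
from collections import Counter


def tally_go_terms(annotations, top_n=25, rare_cutoff=5):
    """Count how many sequences each GO term is assigned to.

    annotations: dict mapping a sequence ID to an iterable of GO terms
    (hypothetical input format). Each term is counted once per sequence.
    """
    counts = Counter()
    for terms in annotations.values():
        counts.update(set(terms))
    rare = sum(1 for n in counts.values() if n <= rare_cutoff)
    return {
        "distinct_terms": len(counts),
        "top": counts.most_common(top_n),
        "rare_terms": rare,
        "rare_fraction": rare / len(counts) if counts else 0.0,
    }


# Toy annotations for three sequences.
toy = {
    "seq1": ["protein binding", "nucleus"],
    "seq2": ["protein binding"],
    "seq3": ["protein binding", "protein binding"],
}
print(tally_go_terms(toy, top_n=1, rare_cutoff=1))
```

A high `rare_fraction` corresponds to the pattern described above, in which most GO terms are assigned to only a handful of sequences.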
The low redundancy of the GO terms, together with the large proportion of singletons in the library and the small number of ESTs per TC, indicates that the normalization was successful and that a large proportion of the genes expressed in the developing cricket nerve cord were identified. The putative functions of the singletons and tentative consensus sequences, as inferred from the BLAT comparison and the GO term assignments, are consistent with genes expected to be expressed in a nerve cord.