Identifying gene function: searching and alignment
- Bioinformatics in microbial biotechnology – a mini review

After the ORFs (open reading frames) have been identified, the next step is to annotate the genes with their structure and function. Gene function is usually identified using popular sequence search and pair-wise gene alignment techniques. The four most popular approaches for functional annotation of genes are BLAST [4] and its variations [5]; the dynamic programming technique of Smith-Waterman alignment [70] and its variations; the indexing-based scheme FASTA [53] and its variations; and BLOCKS [35], which uses multiple sequence alignment of conserved domains to identify motifs, the patterns that characterize proteins.

A BLAST search works by expanding multiple probable seed points (longer than four nucleotides) that match, with the help of scoring matrices such as BLOSUM or PAM [46,70], to identify the largest matching nonrandom segment. Scoring matrices assign positive match values to amino acids that share biochemical or biophysical properties and negative match values to amino acids that do not. Substitution matrices such as BLOSUM (BLOcks SUbstitution Matrix) have been derived by statistically comparing the frequency patterns of the amino acids occurring in conserved domains of protein families. Nucleotide sequences are scored with a nucleotide matrix that penalizes non-matching positions. The BLAST algorithm has near-linear time complexity, and current implementations are fast. However, to enhance computational efficiency, BLAST indexes the sequences in the database using only the most probable combinations of nucleotide seeds, which sacrifices some accuracy.
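The seed-and-extend idea described above can be illustrated with a minimal sketch. It is not the BLAST implementation: it assumes exact k-mer seeds, simple identity scoring (+1/-1) instead of a BLOSUM or PAM matrix, ungapped extension, and a hypothetical `drop` cutoff that stops an extension once the running score falls too far below the best score seen. All function and parameter names are illustrative.

```python
from collections import defaultdict

def build_index(db_seq, k):
    """Map every k-mer of the database sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(db_seq) - k + 1):
        index[db_seq[i:i + k]].append(i)
    return index

def extend_seed(query, db_seq, q, d, k, match=1, mismatch=-1, drop=2):
    """Ungapped extension of an exact seed query[q:q+k] == db_seq[d:d+k].

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def score(a, b):
        return match if a == b else mismatch

    best = k * match  # the seed itself is an exact match
    # Extend to the right of the seed.
    run, right, i = best, 0, 0
    while q + k + i < len(query) and d + k + i < len(db_seq):
        run += score(query[q + k + i], db_seq[d + k + i])
        i += 1
        if run > best:
            best, right = run, i
        elif best - run > drop:
            break
    # Extend to the left of the seed, starting from the best score so far.
    run, left, i = best, 0, 0
    while q - 1 - i >= 0 and d - 1 - i >= 0:
        run += score(query[q - 1 - i], db_seq[d - 1 - i])
        i += 1
        if run > best:
            best, left = run, i
        elif best - run > drop:
            break
    return best, q - left, q + k + right

def seed_and_extend(query, db_seq, k=4):
    """Return (score, query_start, query_end, db_start) of the best segment."""
    index = build_index(db_seq, k)
    best_hit = (0, 0, 0, 0)
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            score, q_start, q_end = extend_seed(query, db_seq, q, d, k)
            if score > best_hit[0]:
                best_hit = (score, q_start, q_end, d - (q - q_start))
    return best_hit

hit = seed_and_extend("TTACGTACGAA", "GGGACGTACGCCC")
print(hit)  # (7, 2, 9, 3): the shared segment is "ACGTACG"
```

Because only positions that share an exact k-mer are ever extended, most of the database is never compared character by character. That is the source of both the speed and the loss of sensitivity noted above.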

The BLAST algorithm has gone through many heuristic improvements that increase its execution speed and accuracy and reduce its dependence on predefined scoring matrices. Two major improvements are: (i) requiring two or more hits within a matching region before extending the high-scoring segment, and (ii) using multiple iterations of matching to derive a position-specific scoring matrix that is used in place of the predefined biochemical matrix. PSI-BLAST (Position Specific Iterative BLAST) [5] is a popular implementation of BLAST that uses both of these improvements. The use of two hits improves the execution efficiency of segment extension, and the use of a position-specific matrix improves the search for weakly homologous sequences in evolutionarily distant species. The position-specific matrix is built by deriving a multiple sequence alignment of the best matching segments and analyzing the frequency of the amino acid substitutions in those segments.
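As a rough illustration of how a position-specific scoring matrix can be derived from aligned matching segments, the sketch below counts residues per column and converts the frequencies to log-odds scores. It makes several simplifying assumptions that PSI-BLAST itself does not: the segments are ungapped and of equal length, the background distribution is uniform, and a flat pseudocount is added to every residue. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(segments, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from aligned segments."""
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("all segments must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Start from the pseudocount so unseen residues get a finite score.
        counts = {res: pseudocount for res in alphabet}
        for seg in segments:
            counts[seg[col]] += 1
        total = sum(counts.values())
        pssm.append({res: math.log2((counts[res] / total) / background)
                     for res in alphabet})
    return pssm

def score_with_pssm(pssm, candidate):
    """Sum the column-specific scores of a candidate of the same length."""
    return sum(column[res] for column, res in zip(pssm, candidate))

pssm = build_pssm(["ACD", "ACE", "ACD"])
print(score_with_pssm(pssm, "ACD") > score_with_pssm(pssm, "WWW"))  # True
```

Unlike a fixed matrix such as BLOSUM, the score for a given residue here depends on the column in which it occurs, so a residue that is conserved at one position of the family is rewarded only at that position.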

Dynamic programming algorithms such as Smith-Waterman [70], and indexing schemes such as FASTA [53], are more accurate for pair-wise gene alignment. The alignment of gene pairs using dynamic programming is based upon incremental matching: at each step the algorithm maximizes the sum of the score of the best alignment of the preceding subsequences and the score of matching the current amino-acid (or nucleotide) characters. Mismatches in amino-acid sequences are penalized using scoring matrices such as BLOSUM or PAM [46,70]; nucleotide sequences use a nucleotide matrix that penalizes non-matching positions. A gap is inserted to represent the insertion or deletion of nucleotides (or amino acids). Gap penalties are not part of a substitution matrix and are provided as parameters by the user. The presence of a gap also results in a score penalty. There are two major types of protein (or gene) alignment using dynamic programming: global and local. In global alignment, the amino-acid (or nucleotide) characters are placed to maximize the overall score. In contrast, local alignment finds the segment with the maximum score, and segments with negative scores are ignored. For comparing amino acid sequences from evolutionarily distant organisms, local alignment is preferred because it copes with large-scale amino-acid variation. Global alignment fares well when only a small amount of random mutation is involved. Because every character of one sequence is compared with every character of the other to identify the best matching subsequence, all dynamic programming techniques have quadratic time complexity, which makes them less suitable for large-scale pair-wise genome comparisons unless the data are first preprocessed by BLAST to remove dissimilar genes [11].
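The local alignment recurrence described above can be sketched as follows. This is a minimal Smith-Waterman variant under stated assumptions: identity scoring (`match`, `mismatch`) stands in for a BLOSUM or PAM matrix, and a single linear `gap` penalty is used; all three values are illustrative defaults, not recommended parameters.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Best of: restart (0), align the two characters, or open a gap.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTACGTACGAA", "GGGACGTACGCCC"))
# (14, 'ACGTACG', 'ACGTACG')
```

The nested loops fill a table of len(a) × len(b) cells, which is the quadratic cost mentioned above. Clamping each cell at zero is what makes the alignment local; a global alignment would drop the zero and initialize the first row and column with cumulative gap penalties.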

Multiple sequence alignment techniques [22] compare multiple homologous genes (genes that have similar sequences) to derive conserved segments and to derive an evolutionary tree. The technique integrates pair-wise alignments between homologs with a notion of distance between two nucleotide sequences or between two amino-acid sequences. The distance can be derived either as an edit distance, the number of mismatches remaining after pair-wise alignment of two sequences, or as the evolutionary distance between two microorganisms given by an evolutionary tree. The technique is based upon progressive pair-wise comparison that builds intermediate alignments between nearest neighbors, the homologs separated by the shortest distance, and has been implemented as a greedy algorithm. ClustalX [22] is a popular multiple sequence alignment tool that has been used to identify conserved portions of a gene and to develop new evolutionary trees [36].
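The greedy, nearest-neighbor-first ordering described above can be sketched in a few lines. This is only an illustration of the idea, not the procedure used by ClustalX: it assumes plain edit distance as the distance measure and single-linkage clustering (the closest pair of members defines the distance between two groups) to decide which sequences or groups would be aligned next. The actual alignment of the groups is omitted, and all names are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def guide_order(seqs):
    """Greedy merge order for a dict of name -> sequence.

    Returns a list of (group_a, group_b, distance) in the order in which
    the groups would be aligned, nearest neighbors first.
    """
    dist = {frozenset((x, y)): edit_distance(seqs[x], seqs[y])
            for x, y in combinations(seqs, 2)}

    def linkage(g1, g2):
        # Assumption: single linkage between groups.
        return min(dist[frozenset((x, y))] for x in g1 for y in g2)

    groups = [(name,) for name in seqs]
    merges = []
    while len(groups) > 1:
        g1, g2 = min(combinations(groups, 2), key=lambda p: linkage(*p))
        merges.append((g1, g2, linkage(g1, g2)))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return merges

seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTACGT", "d": "TTTTACGG"}
for step in guide_order(seqs):
    print(step)
```

In this example the two closest pairs are merged first, and the resulting two groups are merged last. The same merge order can be read as a rough tree, which is why progressive alignment and tree construction are so closely linked.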

A major source of problems in the above sequence comparison techniques is that indels (gaps) are assigned a user-defined, equal weight, which undermines the importance that a specific amino acid or a group of amino acids may have. A minor problem is the presence of repeat characters in the sequences: the repeats only mark the functional or structural separation of the component units within a gene and cannot be mixed with other amino-acid characters.

Multiple sequence comparison techniques such as BLOCKS [35] have been used to identify the conserved subsequences in very similar gene sequences and are well suited to deriving motifs. Motifs, sets of unique subsequences that characterize a protein, have been found very useful for identifying genes with the same functionality. Motifs are derived by aligning functionally equivalent genes from multiple organisms and then identifying the conserved subsequences.
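A toy version of deriving motif candidates from already-aligned sequences is shown below. It assumes a much stricter rule than BLOCKS uses: a column counts as conserved only if every sequence has the same residue there and none has a gap. The `min_len` threshold is an illustrative parameter.

```python
def conserved_blocks(aligned, min_len=3):
    """Return (start_column, subsequence) for each run of identical,
    gap-free columns at least min_len long."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    blocks, start = [], None
    # Iterate one column past the end so a trailing run is closed too.
    for col in range(length + 1):
        conserved = (col < length
                     and aligned[0][col] != "-"
                     and all(s[col] == aligned[0][col] for s in aligned))
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, aligned[0][start:col]))
            start = None
    return blocks

alignment = ["MKHGDSL-A",
             "TRHGDSLQA",
             "MKHGDSLQA"]
print(conserved_blocks(alignment))  # [(2, 'HGDSL')]
```

A more realistic approach would tolerate conservative substitutions within a column, for example by scoring columns with a substitution matrix rather than requiring identity.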

A protein domain is the basic unit of protein function and is associated with a unique folding pattern, possibly a single one (alpha helix, beta sheet, or their variations), at the structure level. Researchers have used multiple sequence alignment and hidden Markov models (HMMs) to identify the regions that are individually homologous to each other across multiple homologous genes. These regions are probable domains. Currently there are many domain-related databases, such as PRODOM [21], Pfam [16] and SMART [39] (also see http://elm.eu.org). Pfam [16,63] (and http://pfam.cgb.ki.se) is a database of multiple alignments of protein domains or conserved protein regions. The alignments represent evolutionarily conserved structure that has implications for the protein's function. Profile hidden Markov models (profile HMMs) built from the Pfam alignments are useful for automatically recognizing that a new protein belongs to an existing protein family, even under weak homology. Currently Pfam is derived automatically by cluster analysis of the PRODOM database.

Techniques based on sequence search assume that the best sequence match is sufficient to annotate the function. This assumption is generally true. However, in many cases the best sequence match fails to identify the function because: (1) the function is localized to a specific area of the protein, such as a hydrophobic region; (2) the function depends on the presence of a specific pattern of amino acids; or (3) the function depends on a specific 3D conformational state of a multi-domain protein. Sometimes a mutation of a few nucleotides alters the corresponding amino acids, resulting in a different 3D conformation of the protein. Another limitation of best-match techniques is that they cannot identify all possible functions of a multi-domain protein. A protein may have multiple domains and may be multifunctional. The problem is made more complex by the absence of a direct correlation between the number of domains in a protein and the number of its functions [32,37].


Updated on: 31 Oct 2006
