Everything on bioinformatics, the science of information technology as applied to biological research.
I answered some questions related to databases and pairwise alignments. Could you check whether my answers are correct? If not, could you correct them or add to them?
Q.1. Outline and briefly discuss the 3 main methods used by the programs/tools currently available to predict genes in eukaryotic sequences.
1. Content based methods (intrinsic):
• Rely on the overall, bulk properties of a sequence in making a determination.
• Classify sequence as coding or noncoding.
• Markov models commonly used.
• Characteristics considered here are:
- how often particular codons are used.
- The periodicity of repeats.
- The compositional complexity of the sequence.
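As a rough illustration of the first characteristic (codon usage), here is a minimal Python sketch. The function name `codon_usage` and the toy sequence are made up for this example; real gene finders compare such frequencies against a trained model rather than just counting.

```python
from collections import Counter

def codon_usage(seq, frame=0):
    """Return relative codon frequencies for one reading frame of a DNA string."""
    seq = seq.upper()
    # Step through the sequence three bases at a time, starting at `frame`
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Hypothetical toy sequence, not real data
usage = codon_usage("ATGGCCGCCGCCTAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

A coding region tends to show a skewed codon distribution in the correct frame, which is the kind of bulk property a content based method looks at.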
2. Site based methods (also intrinsic, sometimes called signal based):
• Focus on the presence or absence of a specific sequence, pattern or consensus.
• Used to detect features such as donor and acceptor splice sites, binding sites for transcription factors, poly A tracts and start and stop codons.
• Various methods for detection: position weight matrices and hidden Markov models are common.
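To make the position weight matrix idea concrete, here is a small Python sketch. The helper names (`build_pwm`, `scan`), the pseudocount, the uniform background of 0.25 and the example sites are all assumptions for illustration, not taken from any particular gene prediction program.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned DNA sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        scores = {}
        for base in "ACGT":
            # Pseudocount avoids log(0) for bases never seen at this position
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm

def scan(seq, pwm):
    """Score every window of the sequence; return (position, score) pairs."""
    w = len(pwm)
    return [(i, sum(pwm[j][seq[i + j]] for j in range(w)))
            for i in range(len(seq) - w + 1)]

# Hypothetical toy sites resembling a splice donor, not real training data
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(sites)
hits = scan("CCAGGTAAGTCC", pwm)
best_pos, best_score = max(hits, key=lambda h: h[1])
print(best_pos, round(best_score, 2))
```

Windows scoring above some chosen threshold would be reported as candidate sites; the threshold trades sensitivity against specificity.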
3. Comparative methods (extrinsic):
• Make determinations based on sequence homology.
• Translated sequences are searched against protein sequence databases to determine whether they have already been characterized.
Q.2 What is the difference between a Primary and Secondary database? Give an example of a primary and secondary database. What is a specialized database? Give an example. What is a Database Retrieval System? Give an example.
• Primary databases hold experimental results (with some interpretation) but are not curated reviews. Curated reviews are found in secondary databases. Examples of primary databases are GenBank and Swiss-Prot (note that Swiss-Prot is classed as primary in some textbooks and, because it is curated, as secondary in others).
'Raw' nucleotide (DNA or RNA) sequence data. May be annotated and/or curated; quality varies.
• original biological data
• primary sequence data: DNA, RNA and protein sequences
• protein structures (PDB)
• content controlled by submitter
• A secondary database contains information derived from the primary databases, such as the conserved sequences, signature sequences and active site residues of protein families, obtained by multiple sequence alignment of a set of related proteins. Examples are PROSITE, PRINTS and Pfam.
Add value to information in primary databases and make it more useful for specialist applications e.g. protein pattern/motif databases, Swiss-Prot, OMIM, SCOP.
• add value to primary databases
• derived from primary databases
• curated and annotated
• list structural/functional motifs
• comments and cross references
• content controlled by third party (e.g. NCBI)
• Specialised databases (secondary)
- specific collections of one type of data or information
- cater to a specific research interest
- may be from particular organism or group of organisms
• A database retrieval system is an integrated, text-based search system for the major databases, including PubMed, nucleotide and protein sequences, protein structures, complete genomes, taxonomy, and others. Entrez is an example: it allows retrieval of both literature and sequence data, is now the query interface for almost all searches at NCBI, and uses cross-references between NCBI databases to integrate information.
Q.3 Describe the Dot matrix method for pairwise sequence comparison, including the means used to control sensitivity and specificity. Include a simple example.
The two sequences to be aligned are assigned as the column and row headings of a rectangular matrix. Dots are then placed in the matrix when the two sequences share an identical site.
- If the two sequences are identical, there will be entries in all the diagonal elements of the matrix.
- If the two sequences are similar and of the same length (can be aligned without gaps) there will be dots in most of the diagonal elements.
- If gaps have occurred the alignment diagonal will be displaced either vertically or horizontally.
Sequences containing repeating regions will have alignments parallel to the main alignment diagonal but displaced from it by a distance equal to the length of the repeat unit.
The power of this conceptually simple method may be extended by comparing a sliding window of a predetermined number of residues (instead of single sites) and allowing a degree of mismatch within the window. Dot plots are ideally suited for computer analysis and graphical output.
• Dot plots can be generated plotting points residue by residue.
• For long sequences, especially DNA which has an alphabet of only 4 bases, the resulting dot plots have many coincidental matches (noise) and are hard to interpret.
• To overcome this problem, a sliding window is used for which only a proportion of the residues have to match (the stringency).
For example, for a DNA repeat one may use a sliding window of 23 bases of which only 21/23 bases need to match to generate a point on the plot. The stringency here is 21.
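Since the question asks for a simple example, here is a short Python sketch of a dot plot with a sliding window and stringency. The function name `dot_plot`, the text output with `*` characters and the two toy sequences are my own choices for illustration.

```python
def dot_plot(seq_a, seq_b, window=1, stringency=1):
    """Return a grid of '*' and ' ' rows; rows follow seq_a, columns follow seq_b.

    A dot at (i, j) means at least `stringency` of the `window` residues
    starting at seq_a[i] and seq_b[j] are identical.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row += "*" if matches >= stringency else " "
        rows.append(row)
    return rows

# Hypothetical toy sequences: seq a is seq b repeated twice
a = "GATTACAGATTACA"
b = "GATTACA"
noisy = dot_plot(a, b)                          # single-residue matches
clean = dot_plot(a, b, window=3, stringency=3)  # 3 of 3 must match
print("\n".join(clean))
```

With `window=1` the plot is cluttered by chance matches; raising the window and stringency leaves only the two diagonals, one for each copy of the repeat. Lowering the stringency relative to the window (for example 21 of 23) raises sensitivity at the cost of more noise.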
10. Briefly describe the two main methods used for generating local and global pairwise alignments of nucleotide and amino acid sequences? Why are these algorithms not suitable for multiple sequence alignments?
• Alignment based method – backward computation (original Needleman and Wunsch)
– Largely of historical interest and not covered here
• Optimal score based method – forward computation – gap penalties/mismatch penalties imposed.
– Variations of the Needleman and Wunsch method
- Types of optimal score based alignments:
1. Global alignment: Both entire sequences aligned. Penalties for end gaps.
2. Cost-free ends alignment (semi-global alignment): At each end, deletions (gaps) in one sequence are free
3. Local alignment (e.g. Smith-Waterman): Best matching subsequences identified
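The difference between the global and local variants is easiest to see in code. The following Python sketch computes only the optimal score (no traceback), with a linear gap penalty; the function name `align_score` and the scoring values (match 1, mismatch -1, gap -2) are arbitrary assumptions for illustration, not the parameters of any specific program.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score by forward dynamic programming.

    local=False: global alignment (Needleman-Wunsch style, end gaps penalised)
    local=True : local alignment (Smith-Waterman style, cells floored at zero)
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for leading gaps in either sequence
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both full sequences; local: best cell anywhere
    return best if local else score[-1][-1]

print(align_score("ACGT", "AGT"))                             # global
print(align_score("TTTTACGTTTTT", "GGGACGGGG", local=True))   # local
```

Both variants fill a table with one cell per pair of positions, so time and memory grow with the product of the two sequence lengths.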
I am not sure about the last part: are these algorithms unsuitable for multiple sequence alignments because they cannot deal with gaps?
Q.4. How was the BLOSUM 62 substitution matrix constructed?
BLOSUM is an acronym for BLOcks SUbstitution Matrices. This method was developed by Henikoff and Henikoff in 1992. The substitution matrix is obtained by using blocks of similar amino acid sequences as data and then applying statistical methods to the data to obtain the similarity scores.
Initially a simple unit matrix is used, where the score is 1 for a match and 0 for a mismatch. Then, from the sets of proteins from public databases that have been grouped into related families, some blocks of sequences (ungapped sequences) that appear to have been conserved are found. Thus this is not dependent on any specific scoring scheme as the starting point is the unit matrix.
To summarize the method, we see that BLOSUM matrices are based on local alignments. We start with blocks of sequence fragments from different protein families which can be aligned without the introduction of gaps. These sequence blocks correspond to highly conserved regions. Amino acid pair frequencies are then compiled from these blocks by summing over all possible pairs in each column of the block. These frequencies are written down in a frequency table, and log-odds scores (observed pair frequency over the frequency expected by chance) are calculated, which are then rounded to integers to get the BLOSUM matrix.
- For BLOSUM62, sequences within a block that are 62% or more identical are clustered and weighted as a single sequence, which reduces the contribution of closely related sequences.
- The BLOSUM matrix is based on conserved blocks.
- The blocks are derived from local alignments (blocks of amino acid sequence) characteristic of the protein families used.
- The substitution scores (values in the matrix) at different clustering thresholds are then estimated.
- For example, the BLOSUM62 matrix is derived with a 62% identity clustering threshold.
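The pair counting and log-odds step can be sketched in a few lines of Python. This is only a toy under stated assumptions: the function name `blosum_like_scores` and the two-column block are invented, scores are scaled to half-bits, and the clustering of similar sequences is left out entirely, so the numbers it prints are not real BLOSUM values.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block):
    """Half-bit log-odds scores from an ungapped block (equal-length strings)."""
    # Count every unordered residue pair within each column of the block
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Probability of each residue occurring in a pair
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        # Expected frequency of the pair if residues paired at random
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores

# Hypothetical toy block of four aligned fragments, two columns wide
scores = blosum_like_scores(["AW", "AW", "SW", "AY"])
for pair, s in sorted(scores.items()):
    print(pair, s)
```

Pairs that are never observed in the block get no entry here; a real matrix is built from enough blocks that every amino acid pair is observed.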
10. They aren't suitable because they're computationally expensive. Think about how much memory and computation you need for just ten 5 kb sequences.
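To put a number on that: a straightforward extension of pairwise dynamic programming to N sequences fills an N-dimensional table with one cell per combination of positions. A quick Python sketch of the arithmetic, assuming equal-length sequences (the function name `dp_cells` is made up here):

```python
def dp_cells(num_sequences, length):
    """Cells in the full dynamic-programming table for equal-length sequences."""
    return (length + 1) ** num_sequences

pairwise = dp_cells(2, 5000)   # two 5 kb sequences
ten_way = dp_cells(10, 5000)   # ten 5 kb sequences
print(f"{pairwise:.2e} cells for 2 sequences")
print(f"{ten_way:.2e} cells for 10 sequences")
```

The pairwise table has about 25 million cells, while the ten-sequence table has on the order of 10^36, which is why multiple alignment programs rely on heuristics such as progressive alignment instead.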