The MPI Bioinformatics Toolkit for protein sequence analysis


Andreas Biegert*, Christian Mayer, Michael Remmert, Johannes Söding and Andrei N. Lupas

Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Spemannstrasse 35, 72076 Tübingen, Germany

Nucleic Acids Research 2006 34(Web Server issue):W335-W339; doi:10.1093/nar/gkl217. 

 

Abstract

The MPI Bioinformatics Toolkit is an interactive web service which offers access to a great variety of public and in-house bioinformatics tools. They are grouped into different sections that support sequence searches, multiple alignment, secondary and tertiary structure prediction and classification. Several public tools are offered in customized versions that extend their functionality. For example, PSI-BLAST can be run against regularly updated standard databases, customized user databases or selectable sets of genomes. Another tool, Quick2D, integrates the results of various secondary structure, transmembrane and disorder prediction programs into one view. The Toolkit provides a friendly and intuitive user interface with an online help facility. As a key feature, various tools are interconnected so that the results of one tool can be forwarded to other tools. One could run PSI-BLAST, parse out a multiple alignment of selected hits and send the results to a cluster analysis tool. The Toolkit framework and the tools developed in-house will be packaged and freely available under the GNU Lesser General Public Licence (LGPL). The Toolkit can be accessed at http://toolkit.tuebingen.mpg.de.


Introduction

As this special issue shows, the number of public bioinformatic tools and web servers is growing quickly. However, the wealth of powerful tools and servers is, in our opinion, only utilized by a fraction of biologists who would be able to profit from them. Especially for non-experts it can be very time-consuming to find out which services exist, what they can or cannot do, how to use them and how to feed results from one service to the next in the right format. This has spawned the development of two classes of servers. The first class, exemplified by PredictProtein (1), accepts a single sequence as input, runs a whole set of standard protein analysis tools and returns the bare, concatenated results in a single email or web page, requiring users to be familiar with the tools and their output format. The second class offers a collection of web interfaces to local versions of public bioinformatic tools. For instance, PAT (protein analysis toolkit) (2) facilitates the combination of different analysis methods by automating repetitive data processing tasks. However, its user interface and the lack of an integrated help system make PAT suited primarily for users with biocomputing experience. Two further servers designed as toolboxes for sequence analysis are the Biology Workbench (3), which has not been updated for quite some time, and AnaBench (4), which is more geared toward analysis of DNA data.

The primary aim in developing the MPI Bioinformatics Toolkit was to offer a web service that is as easy to use as possible and that integrates a selected set of most useful methods for the analysis of protein sequences. From our own experience as users of the toolkit, its main advantages are as follows:


Web interface

Currently, 30 bioinformatics tools and utilities can be launched from the MPI Bioinformatics Toolkit (Table 1). All tool sections are accessible from a tabbed menu bar located at the top of the page (Figure 1). Each tab reveals a submenu containing the section-specific tools, an overview page with brief descriptions for each tool and a list of selected links. Located on the left of the screen is a sidebar pane that holds a status- and section-coded list of all recent jobs in the current session. One can click on previously submitted jobs to check their status and view their results. Users can also choose their own job names to organize their work. Each tool has a separate input page with a web form, in which the user can input sequence data, upload sequence files, and specify options.

 


Tool sections

The search section contains popular search tools, such as NucleotideBLAST, ProteinBLAST (11), PSI-BLAST (12), and HMMER (13), as well as our in-house developments such as HHpred, HHsenser and PatternSearch. In comparison with the NCBI server, our BLAST tools offer greater flexibility and functionality: searches can be run against uploaded personal databases or selectable sets of genomes (updated weekly from NCBI and ENSEMBL), databases can be switched between PSI-BLAST runs, alignments can be extracted, viewed online or forwarded to other tools, and two graphs show matched regions and E-value distributions. The fastHMMER tool performs HMMER searches of all standard sequence databases in ~10% of the time by reducing the database with one iteration of PSI-BLAST at a cut-off E-value of 10 000. PatternSearch identifies sequences containing a user-defined Prosite pattern or regular expression. HHpred is a new server for protein structure and function prediction (5). It takes a query sequence as input and searches user-selected databases for homologs with a new and very sensitive method based on pairwise comparison of hidden Markov models (HMMs). Available databases include, among others, InterPro, CDD and an alignment database that we build from Protein Data Bank (PDB) sequences and that can be used for 3D structure prediction. HHsenser is a transitive search method based on HMM-HMM comparison (7). This method takes a sequence as input and builds an alignment with as many near or remote homologs as possible, often covering the whole protein superfamily.
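To illustrate the kind of matching PatternSearch is described as performing, the following Python sketch converts a Prosite-style pattern into a regular expression and scans a sequence with it. It is not the Toolkit's implementation: the function names `prosite_to_regex` and `find_pattern` are hypothetical, and only the core Prosite syntax is handled (`x` wildcards, `[..]` allowed sets, `{..}` forbidden sets, `(n)` and `(n,m)` repeats, `<` and `>` terminal anchors).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-style pattern into a Python regular expression.

    Hypothetical helper covering only the core syntax; rarer constructs
    such as '>' inside square brackets are not supported.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):           # any residue
            body = "."
        elif core.startswith("[") and core.endswith("]"):
            body = core                  # allowed residues: same syntax
        elif core.startswith("{") and core.endswith("}"):
            body = "[^" + core[1:-1] + "]"   # forbidden residues
        elif len(core) == 1 and core.isalpha():
            body = core.upper()          # a single fixed residue
        else:
            raise ValueError(f"unsupported pattern element {element!r}")
        if low is not None:              # repetition count or range
            body += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + body + suffix)
    return "".join(parts)


def find_pattern(pattern: str, sequence: str):
    """Return (start, matched_text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence.upper())]
```

For example, `prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")` yields `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`. Note that `finditer` reports non-overlapping matches only; a scanner that must report overlapping sites would need a lookahead-based search instead.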

The alignment section includes the well-known, popular multiple alignment program ClustalW (14), together with the more recently developed multiple alignment methods ProbCons (15), MUSCLE (16) and MAFFT (17). Also in this section is Blammer (10), which converts BLAST or PSI-BLAST output to a multiple alignment by realigning gapped regions using ClustalW and removing local inconsistencies through comparison with an HMM. HHalign aligns two alignments with each other by pairwise comparison of HMMs and displays similarities in a profile–profile dotplot.

In the sequence analysis section, we have grouped tools for repeat identification and analysis of periodic regions in proteins. HHrep is a server for de novo repeat detection that is very sensitive in finding proteins with strongly diverged repeats, such as TIM barrels and β-propellers (6). REPPER (8) analyzes regions with short gapless repeats in protein sequences. It finds periodicities by Fourier transform and internal sequence similarity. The output is complemented by coiled-coil prediction and secondary structure prediction using PSIPRED (18). Aln2Plot shows a graphical overview of average hydrophobicity and side chain volume in a multiple alignment.
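The idea of finding periodicities by Fourier transform can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not REPPER's actual procedure: the binary hydrophobicity scale (`HYDROPHOBIC`), the scanned period range and the function names are all hypothetical choices made for this example.

```python
import cmath

# Assumption: a crude binary hydrophobicity scale chosen for illustration.
HYDROPHOBIC = set("AILMFVWY")


def periodicity_spectrum(sequence, min_period=2.0, max_period=10.0, step=0.1):
    """Return (period, power) pairs for a scan over candidate periods."""
    signal = [1.0 if aa in HYDROPHOBIC else 0.0 for aa in sequence.upper()]
    mean = sum(signal) / len(signal)
    centered = [value - mean for value in signal]   # remove the constant term
    n_steps = int(round((max_period - min_period) / step))
    spectrum = []
    for i in range(n_steps + 1):
        period = min_period + i * step
        # Fourier coefficient of the signal at frequency 1/period
        coeff = sum(
            value * cmath.exp(-2j * cmath.pi * k / period)
            for k, value in enumerate(centered)
        )
        spectrum.append((round(period, 3), abs(coeff) ** 2 / len(centered)))
    return spectrum


def dominant_period(sequence, **scan_options):
    """Return the scanned period with the highest spectral power."""
    return max(periodicity_spectrum(sequence, **scan_options), key=lambda p: p[1])[0]
```

For an idealized heptad repeat with hydrophobic residues at the first and fourth positions, such as `"LSSLSSS" * 10`, the strongest signal in this scan lies at period 3.5, the second harmonic of the seven-residue repeat, rather than at 7 itself.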

In the secondary structure section, Quick2D integrates the results of various secondary structure prediction programs, such as PSIPRED (18), JNET (19) and PROFKing (20), the transmembrane prediction of MEMSAT2 (21) and HMMTOP (22) and the disorder prediction of DISOPRED (23) into a single colored view. The AlignmentViewer clusters sequences by a sequence identity criterion, annotates groups of sequences using PSIPRED and MEMSAT2 predictions of a multiple alignment and graphically displays the results in an interactive Java applet.
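Clustering by a sequence identity criterion can be illustrated with a simple greedy scheme. The text does not say which criterion or algorithm the AlignmentViewer uses, so the sketch below is an assumption throughout: identity is computed over columns where neither sequence has a gap, the 0.7 threshold is arbitrary, and the function names are hypothetical.

```python
from typing import Dict, List


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)


def cluster_by_identity(alignment: Dict[str, str], threshold: float = 0.7) -> List[List[str]]:
    """Greedy clustering: the first member of each cluster is its representative.

    A sequence joins the first cluster whose representative it matches at or
    above the threshold; otherwise it founds a new cluster.
    """
    clusters: List[List[str]] = []
    for name, sequence in alignment.items():
        for cluster in clusters:
            representative = alignment[cluster[0]]
            if pairwise_identity(representative, sequence) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```

Greedy clustering depends on input order, since representatives are fixed when a cluster is created; it is shown here because it is short, not because it is the best choice for every alignment.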

The tertiary structure section contains Modeller (24) and HHpred (5). Modeller is a very popular program for comparative modeling. It generates a 3D structural model from a sequence alignment of a protein sequence with one or more structural templates. In contrast to the standalone version of Modeller, the input format does not need to be PIR but can also be FASTA or most other standard multiple alignment formats. Modeller is tightly integrated with HHpred, allowing selected hits of HHpred results to be used as templates for subsequent comparative modeling. On the results page, models can be evaluated by using a browser-embedded 3D-viewer and charts with output from several model quality assessment programs are provided. This allows fast interactive refinement cycles of the underlying multiple sequence alignment. The page also provides a link to the iMolTalk server, which offers several additional tools for the detailed analysis of structures and models (25,26).
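The format conversion that makes FASTA input acceptable can be pictured with the following minimal Python sketch. It is an assumption about what such a conversion might look like, not the Toolkit's code: the function names are hypothetical, every record is written as a target `sequence` entry with a ten-field header in the style commonly used for Modeller PIR files, and template (`structureX`) entries with residue ranges and chain identifiers are not handled.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else "unnamed"   # first word is the identifier
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def fasta_to_pir(text, width=60):
    """Convert FASTA records to PIR-style entries (target sequences only)."""
    lines = []
    for name, sequence in parse_fasta(text):
        lines.append(f">P1;{name}")
        # Assumed header layout: ten colon-separated fields, mostly left empty.
        lines.append(f"sequence:{name}:::::::0.00: 0.00")
        body = sequence + "*"                           # '*' terminates the entry
        lines.extend(body[i:i + width] for i in range(0, len(body), width))
    return "\n".join(lines) + "\n"
```

Gap characters in an aligned FASTA file pass through unchanged, so the same routine can be applied to a multiple alignment as long as the header fields are edited afterwards for any template structures.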

In the classification section, we offer modules of the widely used phylogenetic analysis suite PHYLIP (27), the ANCESCON package (28) for distance-based phylogenetic analysis and CLANS (9). CLANS clusters user-provided sequences based on BLAST pairwise similarities (29). The results can be analysed with a CLANS Java applet or can be exported to CLANS format.

Finally, in the utilities section there is a collection of tools which help to perform simple tasks that the user will often be confronted with. It includes a sequence reformatting utility, a six-frame translation tool for nucleotide sequences, Extract_gis for the extraction of gi-numbers from BLAST files, the RetrieveSeq tool for identifier-based sequence retrieval from the non-redundant protein or nucleotide databases at NCBI, gi2Promotor for the extraction of nucleotide sequences upstream of genes identified by the gi-numbers of their encoded proteins and a backtranslation tool.
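As an example of what one of these utilities computes, a six-frame translation can be sketched in Python as follows. This is an illustrative sketch rather than the Toolkit's own tool: it assumes the standard genetic code, labels frames `+1` to `+3` and `-1` to `-3` by a convention chosen here, and writes `X` for any codon containing a character other than A, C, G or T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate all three frames of both strands; '*' marks stop codons."""
    dna = dna.upper().replace("U", "T")     # accept RNA input as well
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames
```

Calling `six_frame_translation("ATGGCCTAA")` gives `MA*` for frame `+1` and `LGH` for frame `-1`; trailing bases that do not fill a codon are ignored in every frame.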

Future plans

Our own research on protein evolution now heavily depends on the toolkit server. We will therefore continue to integrate new tools as they become available and improve the usability of the toolkit. For instance, a project manager will be added that will further facilitate the organization and long-term storage of job results. On the technical side, we are currently in the process of porting the Toolkit to a new Rails-based web framework that permits shorter development cycles and more flexible tool interactions. The new architecture is fully object-oriented and renders the Toolkit easily installable. We will package the Toolkit framework together with our in-house tools and distribute it freely under the GNU LGPL.

Acknowledgements

We thank Pawel Szczesny for contributing Aln2Plot and Tancred Frickey for many fruitful discussions and developing various tools. We thank all users who helped to improve our server with their questions, feedback, bug reports and tool suggestions. Funding to pay the Open Access publication charges for this article was provided by the Max-Planck Society.

Conflict of interest statement. None declared.

References

  1. Rost, B. and Liu, J. (2003) The PredictProtein server. Nucleic Acids Res., 31, 3300–3304.

  2. Gracy, J. and Chiche, L. (2005) PAT: a protein analysis toolkit for integrated biocomputing on the web. Nucleic Acids Res., 33, Suppl. 2, W65–W71.

  3. Subramaniam, S. (1998) The biology workbench—a seamless database and analysis environment for the biologist. Proteins, 32, 1–2.

  4. Badidi, E., De Sousa, C., Lang, B., Burger, G. (2003) AnaBench: a web/CORBA-based workbench for biomolecular sequence analysis. BMC Bioinformatics, 4, 63.

  5. Söding, J., Biegert, A., Lupas, A.N. (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res., 33, W244–W248.

  6. Söding, J., Remmert, M., Biegert, A. (2006) HHrep: de novo protein repeat detection and the origin of TIM barrels. Nucleic Acids Res.

  7. Söding, J., Biegert, A., Remmert, M., Lupas, A. (2006) HHsenser: exhaustive transitive profile search using HMM-HMM comparison. Nucleic Acids Res.

  8. Gruber, M., Söding, J., Lupas, A. (2005) REPPER—repeats and their periodicities in fibrous proteins. Nucleic Acids Res., 33, W239–W243.

  9. Frickey, T. and Lupas, A. (2004) CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics, 20, 3702–3704.

  10. Frickey, T. and Lupas, A.N. (2004) PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res., 32, 5231–5238.

  11. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D. (1990) Basic Local Alignment Search Tool. J. Mol. Biol., 215, 403–410.

  12. Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

  13. Eddy, S. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.

  14. Thompson, J., Higgins, D., Gibson, T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

  15. Do, C.B., Mahabhashyam, M.S., Brudno, M., Batzoglou, S. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.

  16. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

  17. Katoh, K., Misawa, K., Kuma, K.-I., Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30, 3059–3066.

  18. Jones, D. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.

  19. Cuff, J. and Barton, G. (2000) Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins, 40, 502–511.

  20. Ouali, M. and King, R. (2000) Cascaded multiple classifiers for secondary structure prediction. Protein Sci., 9, 1162–1176.

  21. Jones, T., Taylor, W., Thornton, J. (1994) A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry, 33, 3038–3049.

  22. Tusnády, G. and Simon, I. (1998) Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol., 283, 489–506.

  23. Ward, J., Sodhi, J., McGuffin, L., Buxton, B., Jones, D. (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol., 337, 635–645.

  24. Sali, A., Potterton, L., Yuan, F., vanVlijmen, H., Karplus, M. (1995) Evaluation of comparative protein modeling by MODELLER. Proteins, 23, 318–326.

  25. Diemand, A.V. and Scheib, H. (2004) MolTalk—a programming library for protein structures and structure analysis. BMC Bioinformatics, 5, 39.

  26. Diemand, A.V. and Scheib, H. (2004) iMolTalk: an interactive, internet-based protein structure analysis server. Nucleic Acids Res., 32, W512–W516.

  27. Felsenstein, J. (1989) PHYLIP—Phylogeny Inference Package (Version 3.2). Cladistics, 5, 164–166.

  28. Cai, W., Pei, J., Grishin, N.V. (2004) Reconstruction of ancestral protein sequences and its applications. BMC Evol. Biol., 4, 33.

  29. Frickey, T. and Lupas, A.N. (2004) Phylogenetic analysis of AAA proteins. J. Struct. Biol., 146, 2–10.

  30. Söding, J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951–960.

  31. Lupas, A., Van Dyke, M., Stock, J. (1991) Predicting Coiled Coils from Protein Sequences. Science, 252, 1162–1164.

  32. Clamp, M., Cuff, J., Searle, S.M., Barton, G.J. (2004) The Jalview Java alignment editor. Bioinformatics, 20, 426–427.

Table 1

Overview of tools

Tool | Source reference | Description

Search
    NucleotideBLAST† | Altschul et al. (11) | Sequence search against nucleotide databases (blastn, tblastn, tblastx)
    ProteinBLAST† | Altschul et al. (11) | Sequence search against protein databases (blastpgp1, blastx)
    PSI-BLAST† | Altschul et al. (12) | Iterated sequence search against protein databases
    fastHMMER† | Eddy (13) | Fast profile HMM search tool derived from HMMER
    HHpred* | Söding et al. (5) | Sensitive protein homology detection, function and structure prediction by HMM-HMM comparison
    HHsenser* | Söding et al. (7) | Sensitive iterative sequence search based on HMM-HMM comparison
    PatternSearch* | Unpublished | Search for sequences containing a given pattern
Alignment
    ClustalW | Thompson et al. (14) | Multiple alignment program for protein and DNA sequences
    MUSCLE | Edgar (16) | Multiple alignment program for protein sequences
    ProbCons | Do et al. (15) | Multiple alignment program for protein sequences
    MAFFT | Katoh et al. (17) | Multiple alignment program for protein and DNA sequences
    Blammer* | Frickey and Lupas (10) | Converts BLAST/PSI-BLAST output to a multiple alignment by realigning gapped regions with ClustalW and removing local inconsistencies through comparison with an HMM
    HHalign* | Söding (30) | Comparison of two alignments using HMMs
Sequence Analysis
    HHrep* | Söding et al. (6) | Sensitive de novo repeat identification in protein sequences by HMM-HMM comparison
    PCOILS* | Lupas et al. (31) | Coiled-coil prediction
    REPPER* | Gruber et al. (8) | Identification of repeats and their periodicity by Fourier transform and internal sequence comparisons
    TPRpred* | Unpublished | Prediction of TPRs (Tetratrico Peptide Repeats) and related repeats (Pentatrico Peptide Repeats and SEL1-like)
    Aln2Plot* | Unpublished | Graphical overview of average hydrophobicity and side chain volume in a multiple alignment
Secondary Structure
    Quick2D* | Unpublished | Concise overview of secondary structure prediction by PSIPRED (18), JNET (19) and PROFKing (20); of coiled-coils by COILS (31); of transmembrane helices by MEMSAT2 (21) and HMMTOP (22) and of natively disordered regions by DISOPRED2 (23)
    Alignment Viewer* | Unpublished | Annotate an alignment with individual PSIPRED (18) and MEMSAT2 (21) predictions
Tertiary Structure
    Modeller† | Sali et al. (24) | Comparative protein structure modeling by satisfaction of spatial restraints
    HHpred* | Söding et al. (5) | Sensitive protein homology detection, function and structure prediction by HMM-HMM comparison
Classification
    PHYLIP-NEIGHBOR | Felsenstein (27) | Modules of the phylogenetic analysis package PHYLIP which allow the construction of distance-based, neighbor-joining trees
    CLANS* | Frickey and Lupas (9) | Clustering tool based on all-against-all BLAST comparisons
    ANCESCON | Cai et al. (28) | Distance-based phylogenetic inference and reconstruction of ancestral protein sequences
Utilities
    Reformat* | Unpublished | Sequence reformatting utility
    6FrameTranslation* | Unpublished | Six-frame translation of nucleotide sequences
    Extract_gis* | Unpublished | Extraction of gi-numbers from BLAST files
    RetrieveSeq* | Unpublished | Sequence retrieval from the nr or nt database using a list of identifiers
    gi2Promotor* | Unpublished | Extraction of nucleotide sequences upstream of genes identified by the gi-numbers of their encoded proteins
    Backtranslator* | Unpublished | Reverse translation of amino acids into nucleotide sequences

An asterisk after the tool name indicates that the tool was developed in our group.

A dagger indicates a public tool with extended functionality.


Figure

Figure 1. Input and result pages of PSI-BLAST with overlaid windows for genome databases and the Jalview alignment viewer (32).


 


http://www.biology-online.org/articles/mpi-bioinformatics-toolkit-protein-sequence.html