In order to move towards a complete understanding of the biochemical functions and their mechanisms of action within the cell, structural biology faces the task of characterizing the shapes and modes of action of the entire protein repertoire encoded within the genomes. However, with the rapid growth in the number of known genome sequences and the relatively tiny number of experimentally solved protein structures, it is of considerable importance to develop efficient strategies to structurally and functionally annotate sequence space.
The combined advances in the late 1990s of methods such as X-ray crystallography, Nuclear Magnetic Resonance (NMR), gene cloning and expression, and whole genome sequencing signalled the advent of structural genomics, enabling the goal of high-throughput protein structure determination to become a feasible proposition. As a result, a considerable number of structural genomics projects have been initiated around the world [1,2] and, though each may differ in its absolute objectives, all work upon the principle of achieving high-throughput structural determination with a view to gaining novel insights into protein function from a structural perspective. Structural genomics has now become a driving force behind new developments in protein structure prediction technology, aiming to automate, and consequently expedite, all areas of the experimental pipeline, ultimately benefiting the structural biology community as a whole. Recent analyses of structures released by the initiatives have highlighted the significant contribution they are now making in both the scope and depth of our structural knowledge of protein families, especially when compared to the relative contribution of non-structural genomics structures. The worldwide structural genomics initiatives now contribute approximately half of new structurally characterised families and over five times as many novel folds as mainstream structural biology, despite accounting for only ~20% of the new structures [3,4].
Through the use of homology modelling methods to extend the value of each newly solved structure across 'sequence space' (a term used here to describe all known protein sequences), it is not unreasonable to expect structural genomics to make far-reaching advances into the structural landscape over the coming years. Recent advances in sensitive sequence comparison algorithms, homology modelling and threading methods [5-8] mean that it is not necessary to experimentally characterise the structure of every protein – a procedure clearly limited in terms of time and cost. Evolutionarily related proteins share similar structures [9], and in cases where one or more members of a related set of sequences, or domain family, have been structurally characterised, structural data can be transferred to the remaining structurally uncharacterised family members. The accurate one-to-many structural annotation of protein sequences is fundamental to gaining a significant structural coverage across the genomes.
Amongst the structural genomics projects that were instigated to work upon this principle is the Protein Structure Initiative (PSI), funded by the National Institute of General Medical Sciences in the United States [10,11], which aims to target new areas of structure space for which an experimental structure has not yet been solved. In so doing, it is hoped that each structure will maximally cover surrounding sequence space by acting as a structural template for comparative modelling and fold recognition [12-18]. The PSI project has recently moved into its production phase, in which it is hoped that approximately 3000 new structures will be solved over its five-year duration [19].
Structural genomics target selection, the procedure through which specific proteins are selected for structural characterisation, often views sequence space in terms of its organisation into protein domain families, that is, collections of evolutionarily related sequences that can be prioritised according to a range of properties, such as size, taxonomic distribution and suitability of family representatives for high-throughput structure determination [20-23]. Sequence comparison methods, such as Position Specific Scoring Matrices or hidden Markov models, are now sensitive enough to group distantly related sequences into a 'coarse-grained' classification of domain families. For example, domain-level family annotations of the genomes can be achieved using hidden Markov model libraries derived from domain structure classifications such as the SCOP [24] and CATH [25] databases and domain sequence classifications such as Pfam [26].
Through the use of such coarse-grained family annotations one can examine sequence and structure space at the domain-level in order to quantify the number and types of, as yet, structurally uncharacterised domain families. For instance, from a structural genomics viewpoint it is of considerable interest to calculate the number of experimental structures that will be required to structurally annotate all protein sequences. Clearly such findings can vary depending on how domain families are constructed (i.e. the number of families (or granularity) is dependent on the quality of homology-models required) and how sequence-structure coverage is calculated (e.g. as a percentage of sequences or percentage of residues that have a near or distant structure homologue) [27-32].
Such estimates generally focus on coarse-grained coverage where domain sequences have been grouped through the identification of distant relationships. Domain families for which a structural representative has yet to be experimentally solved (i.e. a structurally uncharacterised family) can be identified and prioritised according to the number of family-members, where the benefit of solving a structure for a given family can therefore be seen to correlate with family-size – the greater the family size, the greater the structural coverage. For example, recently Chandonia and Brenner [30] proposed the Pfam5000 target selection strategy which suggested that target selection could be guided through the selection of a manageable number of target proteins from a list of the largest 5000 Pfam families, many of which lack a member of known structure.
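The size-based prioritisation described above can be illustrated with a minimal sketch. It is not the Pfam5000 procedure itself: the function names, the input dictionaries and the simplifying assumption that each sequence belongs to exactly one family are all hypothetical and introduced here only for illustration.

```python
from typing import Dict, List, Tuple


def rank_uncharacterised_families(
    family_sizes: Dict[str, int],
    has_structure: Dict[str, bool],
    top_n: int,
) -> List[Tuple[str, int]]:
    """Return the top_n largest families that lack a solved structure.

    family_sizes maps a (hypothetical) family identifier to its number of
    sequence members; has_structure flags families that already have a
    structural representative. Families missing from has_structure are
    treated as structurally uncharacterised.
    """
    candidates = [
        (name, size)
        for name, size in family_sizes.items()
        if not has_structure.get(name, False)
    ]
    # Largest family first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:top_n]


def cumulative_coverage_gain(
    ranked: List[Tuple[str, int]], total_sequences: int
) -> List[float]:
    """Fraction of all sequences newly covered after solving each family in turn.

    Assumes (for illustration only) that every sequence belongs to a single
    family, so family sizes can simply be summed.
    """
    gains = []
    running = 0
    for _, size in ranked:
        running += size
        gains.append(running / total_sequences)
    return gains
```

Under these assumptions, the returned list makes the diminishing return explicit: each additional family adds its own size to the running total, so the gain per solved structure falls as smaller families are reached.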
Targeting sequence space through coarse-grained target selection provides an important directed approach to characterising structural representatives of protein families. However, other analyses, including our own [32], have also conceded that structural coverage based on a coarse-grained measure will be limited in terms of reliable structural and functional annotation. The fraction of proteins in a given genome for which we can infer structures by homology modelling depends on the accuracy required for a model. For instance, high-accuracy models require high levels of sequence identity between target and template sequences, with lower measures of similarity simply providing the basic description of protein fold. The level of granularity required for the selection of additional targets should therefore account for the number of sequences that can be computationally modelled with 'useful' accuracy from each solved structure. It is generally accepted that sequences sharing 30% or more sequence identity are likely to share a similar fold [9] and accordingly, to confidently construct models of reasonable accuracy, at least 30% sequence identity (60% overlap) must be shared between the sequence to be modelled and the structural template [33]. Reliable functional inference is even more limited, with 60% or more sequence identity required [34-36]. The selection of 'fine-grain' targets from within larger coarse-grained families of distantly related proteins would provide a more thorough coverage of functional space as it relates to protein structure [37].
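The thresholds quoted above can be read as a simple decision rule for a target-template pair. The sketch below is one hypothetical reading of them: the function name and the category labels are invented for illustration, and applying the 60% overlap requirement to the functional tier as well is an assumption rather than something stated in the cited work.

```python
def annotation_level(identity_pct: float, overlap_pct: float) -> str:
    """Classify what a template could plausibly offer a target sequence.

    identity_pct: percentage sequence identity between target and template.
    overlap_pct:  percentage overlap of the alignment (how it is measured is
                  left unspecified here; this is an illustrative input).

    Returns one of three hypothetical labels:
      "fold_only"          - below the modelling thresholds
      "model"              - >= 30% identity and >= 60% overlap
      "model_and_function" - >= 60% identity and >= 60% overlap (assumed)
    """
    MODEL_IDENTITY = 30.0     # threshold quoted in the text for modelling
    MODEL_OVERLAP = 60.0      # overlap quoted alongside the 30% threshold
    FUNCTION_IDENTITY = 60.0  # threshold quoted for functional inference

    if identity_pct < MODEL_IDENTITY or overlap_pct < MODEL_OVERLAP:
        return "fold_only"
    if identity_pct >= FUNCTION_IDENTITY:
        return "model_and_function"
    return "model"
```

For example, under these assumptions a pair sharing 35% identity over 80% overlap would be labelled "model", whereas the same identity over only 50% overlap would fall back to "fold_only".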
Furthermore, one must also consider the number of structurally uncharacterised families that are within the scope of the experimental methodology used by structural genomics. The principal aim of structural genomics initiatives is the high-throughput determination of protein structure. Achieving such levels of structure solution has required the development of automated protein structure determination pipelines [38,39]. However, predicting protein behaviour within these pipelines is still problematic, especially considering the fact that high-throughput methods require well-expressing and highly soluble proteins. Accordingly, the cloning and expression of target proteins is often parallelized in order to increase the chances of producing sufficient protein that is not only soluble but ultimately amenable to X-ray crystallography or NMR structure analysis [40-43]. Strategies include the cloning of target sequence homologues from as wide a range of sequenced organisms as possible, often described as a multi-orthologue approach. Historically, most structural genomics targets tend to be prokaryotic in origin, allowing direct amplification from genomic DNA. Large-scale expression of eukaryotic proteins is much more challenging and considerably more expensive and therefore has not become routine in structural genomics, although several centres are developing new methodology [19].
In this work we consider how a broad structural coverage of the genomes might be achieved and the limitations that may be encountered in a family-based target selection procedure. A principal aim of Phase 2 of the PSI (PSI2) structural genomics initiative is to solve structures for coarse-grained families which do not yet have a representative of known structure. We aim to identify the number of these families and what level of additional structural coverage will be achieved if structural genomics devotes much of its efforts towards solving the largest (in terms of sequence members) of these families. We also ask what proportion of these coarse-grained families are within reach of structural genomics pipelines – specifically those employing a multi-orthologue approach focused on solving prokaryotic sequences. We also consider what benefits might be achieved if structural genomics forfeited the opportunity to solve 'novel folds' with a view to solving additional structures for large structurally diverse families. We ask how this would compare to pursuing structurally uncharacterised coarse-grained families in terms of structural coverage and species distribution.
Through a comprehensive analysis of CATH, Pfam and NewFam [32] domain families we find that many of the largest structurally uncharacterised domain families are eukaryotic or viral and have no prokaryotic sequence members. Therefore these families may well be more challenging targets. In addition, many of the coarse-grained families which do have prokaryotic relatives have relatively few members, offering small returns from a structural annotation viewpoint.
We show that solving structures for many of the largest fine-grained subfamilies (derived through sequence clustering at 30% sequence identity) from structurally characterised families can offer similar increases in sequence coverage, with more available prokaryotic sequences for potential high-throughput structure determination. Such an approach could be used in concert with the targeting of structurally uncharacterised families to achieve a broad coverage of sequence space. For instance, one could identify those cases where a structurally characterised family member exists (e.g. CATH, SCOP families), but reliable modelling coverage of the remaining family is low, particularly in the case of many large families of protein domains which can display considerable divergence in their molecular function [34,37,44].
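To make the notion of fine-grained subfamilies concrete, the sketch below groups family members by single-linkage clustering at a 30% identity cutoff. The clustering algorithm actually used is not specified in this section, so single linkage is a stand-in chosen for brevity; the function name, the precomputed pairwise identity table and the sequence identifiers are likewise hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def single_linkage_subfamilies(
    members: Sequence[str],
    pairwise_identity: Dict[Tuple[str, str], float],
    cutoff: float = 30.0,
) -> List[List[str]]:
    """Group family members into subfamilies by single linkage.

    Two members end up in the same subfamily if they are connected by a
    chain of pairs whose identity is >= cutoff. pairwise_identity is an
    assumed, precomputed table of percentage identities; every identifier
    it mentions must also appear in members.
    """
    # Union-find: each member starts as its own cluster representative.
    parent = {m: m for m in members}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), identity in pairwise_identity.items():
        if identity >= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for m in members:
        clusters.setdefault(find(m), []).append(m)

    # Largest subfamily first, mirroring a size-based target ranking.
    return sorted(
        (sorted(c) for c in clusters.values()),
        key=lambda c: (-len(c), c),
    )
```

Single linkage tends to chain distant members together through intermediates, so a stricter linkage criterion would generally yield more, smaller subfamilies from the same identity table.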
We evaluate the current structural coverage of domain families in the Swiss-Prot and TrEMBL sequence databases [45], which encompass a wide representation of known sequences, using the CATH domain structure classification. An additional mapping of the manually curated Pfam-A and our in-house automatically derived NewFam domain family supplement [32] (described in Methods) is then made in order to comprehensively identify the number and distribution of remaining structurally uncharacterised domains and corresponding families. Structural coverage can be calculated by a number of criteria; we consider the coverage of these families across Swiss-Prot and TrEMBL using three measures: the percentage of sequences, of domains and of residues covered.
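The three coverage measures can be sketched as follows. The exact definitions used in the analysis are given in Methods, so the choices here are assumptions made for illustration: a sequence counts as covered if at least one of its domains has a structural homologue, residue coverage is taken over full sequence lengths (including regions with no domain assignment), and domain assignments are assumed not to overlap. The `Domain` and `Protein` records and the `structural_coverage` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Domain:
    start: int           # 1-based, inclusive
    end: int             # 1-based, inclusive
    has_structure: bool  # True if the domain family has a known structure


@dataclass
class Protein:
    length: int
    domains: List[Domain]


def structural_coverage(proteins: List[Protein]) -> Dict[str, float]:
    """Return coverage as fractions of sequences, domains and residues."""
    n_sequences = len(proteins)
    n_domains = sum(len(p.domains) for p in proteins)
    n_residues = sum(p.length for p in proteins)

    covered_sequences = sum(
        1 for p in proteins if any(d.has_structure for d in p.domains)
    )
    covered_domains = sum(
        1 for p in proteins for d in p.domains if d.has_structure
    )
    covered_residues = sum(
        d.end - d.start + 1
        for p in proteins
        for d in p.domains
        if d.has_structure
    )

    def fraction(part: int, whole: int) -> float:
        # Avoid division by zero for empty inputs.
        return part / whole if whole else 0.0

    return {
        "sequences": fraction(covered_sequences, n_sequences),
        "domains": fraction(covered_domains, n_domains),
        "residues": fraction(covered_residues, n_residues),
    }
```

The three figures can differ markedly for the same dataset: a long multi-domain protein with one characterised domain counts fully towards sequence coverage but contributes only part of its length to residue coverage.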
From these observations we show that whilst targeting structurally uncharacterised domain families may achieve small gains in structural coverage compared to existing structural coverage, such efforts will expand our knowledge of protein folds and function. Furthermore, PSI2 structural genomics has the potential to solve structures for around half the remaining structurally uncharacterised families accessible to multi-orthologue approaches. We also demonstrate that identifying additional targets within fine-grained subfamilies from broad, structurally characterised families, often with a wide species distribution, will enable a comparable increase in structural and functional coverage, whilst expanding our knowledge of these highly expanded protein families. The proportion of effort that should be spent on solving structurally uncharacterised families or re-sampling from large structurally characterised superfamilies should be addressed as the initiatives progress.