Domain family annotation of protein sequences
In order to measure what contribution structural genomics must make in order to provide broad structural annotation coverage of protein families we have based our calculations on the CATH, Pfam and NewFam protein domain-level annotations (as outlined in the Methods). Accordingly, our first aim was to assign a comprehensive coverage of protein domains to the sequences held in the Swiss-Prot and TrEMBL sequence databases. By organising sequence data into domain families we are able to quantify those families lacking structural coverage, or large families with limited structural representation. In addition, remaining unassigned sequence regions to which no family assignments can be made must also be accounted for, as they still represent a significant proportion of the genomes. Definitions of the datasets and terms used in this analysis can be seen in Table 1.
The classification of proteins into families is most commonly carried out at the domain level (e.g. SCOP, CATH, Pfam, SMART [46]) since it is well recognised that the protein domain represents the evolutionary building blocks from which larger multi-domain proteins have been constructed via domain duplication and recombination events. From a structural perspective, domains can be viewed as independent units of protein folding, whilst from a sequence perspective they tend to be considered as recurring units of evolution. Despite the difference in definitions, in many cases the domain families generated by each system are equivalent [47]. Those domains belonging to the same family share a common protein structure and, depending on their degree of relatedness (i.e. how the family has evolved), can share similarities in their function.
Structural domain assignments were made using HMMs [48] based on the CATH domain database (version 3.0). Of the 2,241,277 non-redundant sequences (greater than 50 residues in length) in SP-TrEMBL (Swiss-Prot version 48.1, TrEMBL version 31.1), 1,101,819 sequences (49.2%) were assigned to one or more CATH domain families. Whilst this figure may be considered a rough guide to the extent to which known protein sequences are covered, in whole or in part, by structural annotation, it does not account for (1) domains within these sequences that cannot be structurally annotated, (2) Protein Data Bank (PDB) [49] structures not yet classified in the CATH domain database and (3) the quality of the structural models that can be obtained (i.e. it represents only a coarse-grained coverage).
To extend the domain coverage beyond the CATH domain assignments we assigned Pfam-A domain families (version 19), bringing the total number of sequences in SP-TrEMBL with one or more domain family assignments to 75.0% (1,681,640 sequences). Finally, assignments using NewFam families, available from the Gene3D database (see Methods and [32]), were used to further extend domain family assignments across the remaining unannotated sequence regions. As such, 79.1% of sequences in the SP-TrEMBL sequence database could be assigned to one or more domain families containing 2 or more sequences.
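The per-sequence assignment percentages quoted above follow directly from the sequence counts given in the text. The short check below reproduces them; the variable names are illustrative only, and no count for the NewFam step is computed because the text reports that step only as a percentage.

```python
# Sequence counts as quoted in the text (SP-TrEMBL, sequences > 50 residues).
TOTAL_SEQUENCES = 2_241_277
CATH_ASSIGNED = 1_101_819          # one or more CATH domain families
CATH_OR_PFAM_ASSIGNED = 1_681_640  # one or more CATH or Pfam-A families


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


cath_pct = percent(CATH_ASSIGNED, TOTAL_SEQUENCES)
cath_pfam_pct = percent(CATH_OR_PFAM_ASSIGNED, TOTAL_SEQUENCES)

# Sequences that gained their first assignment from Pfam-A.
pfam_gain = CATH_OR_PFAM_ASSIGNED - CATH_ASSIGNED

print(f"CATH only:       {cath_pct:.1f}%")
print(f"CATH + Pfam-A:   {cath_pfam_pct:.1f}%")
print(f"Added by Pfam-A: {pfam_gain} sequences")
```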
Removal of families unsuitable for structure determination
When calculating the existing and additional structural annotation coverage of domain sequences that might be achieved by structural genomics, it is important to identify those sequences that are unlikely to be tractable to high-throughput structural characterisation. It is generally agreed that, in order to reduce the high attrition rate encountered in high-throughput structural genomics pipelines, sequences with low-complexity regions, coiled-coils and transmembrane helices are best avoided. Such features can be reliably predicted using computational methods, and sequences or families with a significant proportion of 'problematic' residues (see Methods) can be removed from the target list. In Table 2 we show the breakdown of these categories across the SP-TrEMBL sequences, 268 completed genomes annotated by the integr8 database [50] and the model genome examples. Over 18% of the domain sequences in SP-TrEMBL are considered problematic and are excluded from high-throughput structural characterisation. Just 13% of domains in the compact genome of T. maritima appear problematic for structure determination, whereas another prokaryotic genome, B. anthracis, appears to have the highest level of potentially intractable domain sequences (nearly 20%). It is worth noting that such prediction methods offer only a rough measure for the exclusion of problematic sequences. Parameters that accurately linked sequence composition and features to the bottlenecks in structure characterisation, such as protein expression, solubility and crystallisation, would be of considerable value to structural genomics target selection. It is also important to note that a domain-based approach to high-throughput structural characterisation brings its own difficulties in terms of resolving domain boundaries that enable the expression of soluble protein.
Table 2 also shows the percentage of 'singleton' domain sequences within SP-TrEMBL and the model genomes. The structural characterisation of true singleton sequences would offer considerable insight into the uniqueness of each species, and would provide a more comprehensive understanding of the structure-sequence relationship in cases where they can be assigned as very remote homologues to known structural families. However, by definition, their structural characterisation would provide only a small modelling leverage across the genomes. Additionally, their species-specific nature reduces the chance of achieving a successful structural characterisation, since they cannot be characterised through a multi-orthologue approach. The percentage of singleton sequences varies considerably between genomes (e.g. 7% in E. coli compared to over 22% in the eukaryote C. elegans). By our definition, the proportion of singleton sequences calculated in this analysis is subject to the proportion of domain sequences that are assigned to CATH, Pfam-A and NewFam domain families, and is therefore partially dependent on the sensitivity of assignment of these domain families using HMM methods. The true nature of singleton or 'ORFan' sequences has been open to much debate [31,32,51,52]. It has been suggested that their existence is partly due to the sparse sampling of sequence space (and that, over time, sequence relatives will be found), or that many of these sequences relate to mis-predicted, non-expressed proteins. It remains unclear whether the number of singleton sequences will rise or fall as more genomes are completed and their gene maps revised. Additionally, in this study the percentage of singleton sequences is related to the length threshold used to include unassigned regions.
We cannot be certain that all unassigned regions are indeed true protein domains (or multi-domains). However, bearing in mind that structural genomics target lists tend to avoid small fragment sequences, we apply a cautious length threshold of 80 residues (compared to 50 residues) for the inclusion of unassigned regions in our domain-level coverage calculations.
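The two filters described in this section can be sketched as follows. This is a minimal illustration, not the procedure given in the Methods: the `Region` record and the function names are hypothetical, and the cutoff on the fraction of 'problematic' residues is left as a parameter because its value is not stated in this section (the 0.3 used in the example is an arbitrary placeholder). Only the 80-residue threshold for unassigned regions is taken from the text.

```python
from dataclasses import dataclass

# Length threshold for unassigned regions, as stated in the text.
MIN_UNASSIGNED_LENGTH = 80


@dataclass
class Region:
    """A domain-like sequence region (hypothetical record for illustration)."""
    length: int                # number of residues
    problematic_residues: int  # predicted low-complexity, coiled-coil or TM residues
    assigned: bool             # True if assigned to a CATH, Pfam-A or NewFam family


def is_problematic(region: Region, max_fraction: float) -> bool:
    """Flag a region whose problematic-residue fraction exceeds the cutoff."""
    return region.problematic_residues / region.length > max_fraction


def include_in_coverage(region: Region) -> bool:
    """Assigned regions always count; unassigned ones must be long enough."""
    return region.assigned or region.length >= MIN_UNASSIGNED_LENGTH


def tractable(regions, max_fraction: float):
    """Regions counted in coverage and not flagged as problematic."""
    return [r for r in regions
            if include_in_coverage(r) and not is_problematic(r, max_fraction)]


example = [
    Region(length=120, problematic_residues=10, assigned=True),   # kept
    Region(length=100, problematic_residues=40, assigned=True),   # problematic
    Region(length=60, problematic_residues=0, assigned=False),    # too short
    Region(length=80, problematic_residues=0, assigned=False),    # kept
]
kept = tractable(example, max_fraction=0.3)  # 0.3 is a placeholder cutoff
print(len(kept))
```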
Protein sequence coverage by current structural data
Our attention now turns to the proportion of sequences to which some structural data can already be assigned. We identified 2486 coarse-grained CATH and Pfam_struc domain families already containing one or more PDB structures (see Methods). Table 3 summarises the coverage statistics across all sequences in the SP-TrEMBL dataset and also the 263 completed genomes using three measures of coverage. The first is per-sequence coverage, calculated as the fraction of whole-protein sequences with at least one domain belonging to a given structurally characterised family. The second is per-domain coverage, calculated as the fraction of all domain sequences belonging to a given structurally characterised family. The third is per-residue coverage, calculated as the fraction of residues assigned to a structurally characterised family (where all residues between the N- and C-termini of an HMM alignment are included).
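The three measures can be made concrete with a small sketch. The data model below (`Domain`, `Protein`, and the `coverage` function) is hypothetical and assumes that domains within a protein do not overlap and that unassigned domain-like regions are represented with a family of `None`; the estimation of domain counts described in the Methods is not reproduced here.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Domain(NamedTuple):
    family: Optional[str]  # None for an unassigned domain-like region
    start: int             # 1-based, inclusive (N-terminal end of the HMM alignment)
    end: int               # 1-based, inclusive (C-terminal end of the HMM alignment)


class Protein(NamedTuple):
    length: int
    domains: List[Domain]


def coverage(proteins: List[Protein],
             structural_families: Set[str]) -> Tuple[float, float, float]:
    """Return (per-sequence, per-domain, per-residue) coverage fractions."""
    all_domains = [d for p in proteins for d in p.domains]

    # Per-sequence: proteins with at least one structurally characterised domain.
    seq_hits = sum(
        any(d.family in structural_families for d in p.domains) for p in proteins
    )
    # Per-domain: domains belonging to a structurally characterised family.
    dom_hits = sum(d.family in structural_families for d in all_domains)
    # Per-residue: residues spanned by those domains.
    res_hits = sum(
        d.end - d.start + 1 for d in all_domains if d.family in structural_families
    )
    total_residues = sum(p.length for p in proteins)

    return (seq_hits / len(proteins),
            dom_hits / len(all_domains),
            res_hits / total_residues)


proteins = [
    Protein(200, [Domain("A", 1, 100), Domain("B", 101, 200)]),
    Protein(100, [Domain("B", 1, 100)]),
    Protein(100, [Domain("A", 11, 60)]),
]
per_seq, per_dom, per_res = coverage(proteins, structural_families={"A"})
print(per_seq, per_dom, per_res)
```

In this toy example the per-sequence value is the highest of the three, because a multi-domain protein counts as covered as soon as any one of its domains is structurally characterised.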
For all measures, a slightly lower percentage structural coverage is found across the 263 completed genomes (integr8_263 dataset) compared to SP-TrEMBL. Of the three measures, per-sequence calculations give the highest levels of coverage, with 54.4% of SP-TrEMBL sequences (52.4% of completed genomes) containing one or more domains that can be assigned to a coarse-grained family that is already structurally characterised. This is an overestimate of our current ability to structurally annotate the genomes, because it does not account for the fact that many protein sequences (up to 80% in eukaryotes) contain two or more domains and, as yet, many of these domain sequences cannot be classified into a structurally characterised family (upon which we base our structural coverage calculations). Accordingly, we have attempted to calculate structural coverage on a per-domain basis. Whilst one cannot always accurately predict the domain content of a given sequence (robust domain boundary prediction is an as yet unsolved challenge), we have, where necessary, estimated the number of protein domains within a given sequence or genome (see Methods). In so doing, we calculate that 47.7% of domain-like sequences in SP-TrEMBL are structurally annotated at a coarse-grained level, a lower but possibly more realistic view. Calculating coverage on a per-residue basis shows that 36% of residues in SP-TrEMBL fall into the 2486 CATH and Pfam_struc families identified in this analysis.
In these values we account for every sequence, predicted domain and residue in our sequence database. However, the high-throughput nature of structural genomics generally requires that sequences or families with a significant percentage of transmembrane regions or 'problematic' regions, such as low-complexity regions or coiled-coils, should be excluded from the target selection process. In Table 3, we also show coarse-grained structural coverage expressed as a percentage of the domains and residues that are expected to be tractable to structural genomics, i.e. transmembrane and problematic domains are excluded from this calculation. These values suggest that we currently have at least fold-level annotations for 57.3% of domains and 44.1% of residues that are tractable to high-throughput methods. Even so, it is important to note that the remaining 42.7% of domains may still not be tractable to structural genomics because, for example, we do not consider components of complexes, low-expression proteins, poor-solubility proteins, etc.
Finally, in view of the suggestion that structural genomics should aim to solve structural representatives of sequence families, we show coverage excluding transmembrane sequences, problematic sequences and singletons. Such a calculation strips out a large proportion of sequence space, focusing coverage on areas accessible to structural genomics, with over 80% of 'accessible' non-singleton domain sequences already having some coarse-grained structural coverage (63.8% of residues). However, such values are misleading if one considers the goal of achieving a comprehensive structural and functional annotation of the genomes. Accordingly, we use per-domain coverage of all predicted domain sequences for the remaining calculations in this study.
Additional structural coverage
In order for structural genomics efforts to provide increased levels of structural coverage, it is logical that target lists should favour representatives from the largest coarse-grained sequence families for which no structure has yet been acquired. Indeed, such approaches to target selection have been discussed in various analyses [27-31], including the Pfam5000 (Chandonia and Brenner), which suggested that structural genomics should aim to solve structures such that each of the 5000 largest non-membrane protein Pfam-A families (which are therefore of significant biological interest) includes one or more structural representatives. With approximately half of the 5000 largest Pfam-A families already having a structural representative, approximately 2500 additional structures would be required to achieve this goal (a structure count that is considered to be within the capability of the production phase of the Protein Structure Initiative) [17].
In Figure 1 we show the coarse-grained sequence coverage of domain sequences in structurally characterised families, followed by the coverage of structurally uncharacterised domain families. Domain families are ordered according to their size (the number of domain sequence relatives identified by HMM searches), from largest to smallest, and coverage is given as the percentage of all domain sequences, including problematic sequences and singletons. We identify 2486 CATH and Pfam-A_struc families, covering just over 47% of domain sequences. Assignment of Pfam-A and NewFam domain families, over and above the CATH family annotations, identified 4100 structurally uncharacterised Pfam-A families and over 50,000 structurally uncharacterised NewFam families. Solving a new structure for a comparable number of the largest of these non-structural families (the vast majority of which are Pfam-A families) would increase coverage of SP-TrEMBL by 10%, a relatively small fraction compared to current coverage levels, revealing that we already have structural representatives for many of the largest domain families.
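A coverage curve of the kind described for Figure 1 can be sketched as a running sum over families sorted by size. The function below is an illustration only, with hypothetical names and toy numbers; it assumes each family is represented simply by its number of domain sequences, and that the denominator is the count of all domain sequences (so that singletons and problematic sequences keep the curve below 100%).

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_coverage(family_sizes: Sequence[int],
                        total_domains: int) -> List[float]:
    """Cumulative % of domain sequences covered, taking families largest first."""
    ordered = sorted(family_sizes, reverse=True)
    return [100.0 * running / total_domains for running in accumulate(ordered)]


# Toy data: four families among 100 domain sequences; the remaining 20
# sequences stand in for singletons and other sequences outside these families.
curve = cumulative_coverage([10, 40, 5, 25], total_domains=100)
print(curve)
```

Because the families are taken in descending order of size, each successive family adds no more coverage than the one before it, which is the diminishing return visible in the tail of such a curve.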
Historical structural coverage of Pfam
Using the CATH domain structure classification, we considered the extent to which newly structurally characterised Pfam-A domain families tend to represent a novel CATH fold or a founding member of a CATH superfamily, using the historical trend observed in the Pfam database. In Figure 2a we show the percentage frequency with which the first structure solved for a given Pfam-A family represented either a previously unobserved fold, or a known fold but the first member of a CATH domain superfamily. Values are given as the percentage of these first-solved structures that are classified in the CATH database. Since 1990, an average of 172 Pfam families have gained their first structure each year (an average of 255 per year since 2000). Between the years 1990 and 2005, the fraction of first-solved structures that are novel folds (as classified by CATH) has gradually reduced (from 75% to 17%), though the number of newly structurally characterised families has increased from 13 in 1990 to an average of 332 in 2004. From these figures it appears that, on average, 50% of the Pfam-A families newly structurally characterised in the last 5 years correspond to an experiment on a protein domain belonging to a new superfamily, if not a novel fold.
Figure 2b shows the source data in more detail for structures classified in the SCOP and CATH domain databases respectively. It can be seen that both databases fall behind in their classification of first-solved Pfam structures in the latter years. Despite this fact, when calculating values as a percentage of all first-solved structures, on average, approximately one third of the domains represent a new fold or new superfamily in SCOP in the early part of this decade (e.g. 1999–2003). It will be of considerable interest to repeat this analysis on the next releases of the CATH and SCOP databases. These data suggest that the pursuit of structures for uncharacterised sequence families, such as Pfam, is likely to yield structural characterisations that represent significant and interesting variations of known folds and also a fair percentage of novel folds.
Structural coverage of model genomes
The current coarse-grained structural coverage of domain families in eight completed 'model' genomes, followed by the coverage of non-structural Pfam-A and NewFam families, is shown in Figure 3. Indeed, a single-genome approach to structural genomics appears attractive because of the possibility of identifying the minimal complement of genes necessary for life. For many of the genomes illustrated, solving structures for 2000 additional families (within reach of PSI2) would result in almost a doubling of structural coverage. However, a single-genome approach to structural genomics has its drawbacks. Many of the sequences in a given genome tend to belong to small families with little or no overlap with other genomes, whilst, in addition, up to 20% of sequences may be classed as singletons (i.e. specific only to a given species). By our calculations E. coli contains 1571 domain families with no solved structure. Characterising a structure for each of these families would increase structural coverage in E. coli from 42% to 70%; however, the leverage of these new structures upon the other model genomes is comparatively small (Figure 3, inset table).
A closer look at structurally uncharacterised sequence families
Many of the consortia involved in Phase 1 of the Protein Structure Initiative focused their efforts on the characterisation of prokaryotic targets, often employing a multi-orthologue approach where multiple forms of the same target from related species were fed into the pipeline in order to increase the chances of obtaining expressed soluble protein and ultimately a solved structure [21-23,38-41]. With this in mind we calculated the species distribution within some of the largest structurally uncharacterised Pfam-A and NewFam families (Table 4). We divide the families into four groupings: eukaryotic, where all family members belong to a eukaryotic genome; viral, where all family members are viral in origin; and two prokaryotic subgroups, namely families containing five or more prokaryotic species (for which a multi-orthologue approach could be used) and those containing one or more prokaryotic sequences.
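The grouping of families by species distribution might be sketched as below. Several details are assumptions rather than statements from the text: members are represented as hypothetical (species, kingdom) pairs, the second prokaryotic subgroup is treated as the remainder (at least one prokaryotic sequence but fewer than five prokaryotic species), and families that fit none of the four groupings (for example, a mixture of eukaryotic and viral members only) are returned as "unclassified".

```python
from typing import Iterable, Tuple


def classify_family(members: Iterable[Tuple[str, str]]) -> str:
    """Assign a family to a species grouping.

    members: (species, kingdom) pairs, with kingdom one of
    "prokaryote", "eukaryote" or "virus" (hypothetical labels).
    """
    members = list(members)
    kingdoms = {kingdom for _, kingdom in members}
    prokaryotic_species = {
        species for species, kingdom in members if kingdom == "prokaryote"
    }

    if len(prokaryotic_species) >= 5:
        return "prokaryotic_5plus"   # candidates for a multi-orthologue approach
    if prokaryotic_species:
        return "prokaryotic_1plus"   # assumption: the remaining prokaryotic families
    if kingdoms == {"eukaryote"}:
        return "eukaryotic"
    if kingdoms == {"virus"}:
        return "viral"
    return "unclassified"            # assumption: not one of the four groupings


broad = [(f"species_{i}", "prokaryote") for i in range(5)] + [("H. sapiens", "eukaryote")]
narrow = [("E. coli", "prokaryote"), ("E. coli", "prokaryote")]
print(classify_family(broad), classify_family(narrow))
```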
It is hoped that during PSI2, consortia members will solve approximately 3000 new structures, with much of the effort going towards solving structurally uncharacterised families. As such, if we consider the largest 2500 non-membrane protein domain families with no solved structure (column 2, Table 4), we find that 53% have 5 or more prokaryotic sequences. A further 31.5% and 11.8% of these families are viral and eukaryotic families respectively, and are therefore of minor interest (viral families) or present considerable experimental challenges to most structural genomics groups (eukaryotic families). Solely eukaryotic families present a problem given that the majority of high-throughput structural genomics pipelines are not geared towards eukaryotic proteins, though some groups have put much effort into solving human proteins [42,43].
In Figure 4 we show the size distribution of these 2500 largest, structurally uncharacterised Pfam families (clear bars), with the proportion of these families that are accessible through prokaryotic organisms highlighted (light grey). For comparative purposes we also show the size distribution of the 2500 largest structurally uncharacterised domain families (Pfam-A and NewFam) chosen such that each must contain at least five prokaryotic members (black bars). As shown in Table 4 (column 3) there are 1959 structurally uncharacterised Pfam-A families with five or more prokaryotic sequences. Extra families containing prokaryotic sequences can be selected from the large number of additional NewFam families delineated by the Gene3D resource (Table 4, column 4). The comparative distributions in Figure 4 show that targeting sequence representatives for these more accessible prokaryotic families (black bars) results in the coverage of progressively smaller areas of sequence space – guiding target selection towards comparatively smaller prokaryotic families. Nevertheless, such efforts would still be of significant value, especially if one considers that targeting 2500 of these families would provide structural representatives for 40% of the remaining families in reach of high-throughput structural genomics utilising a multi-orthologue approach.
Coarse-grained vs fine-grained structural coverage
So far we have described structural coverage according to a coarse-grained measure. That is, we consider a domain sequence to be structurally annotated if it can be assigned to a CATH or Pfam-A_struc family through the use of hidden Markov model searches. In many cases domain sequences will be only very distantly related to the member structure, and it is likely that only fold-level inference of structure data can be reliably achieved. Defining the level of detail or granularity of the structural coverage is clearly an important issue for target selection. The domain families identified through CATH or Pfam represent a coarse-grained division of sequence space into broad superfamilies containing all relatives sharing a common ancestor. The size of structural families increases dramatically when lowering the threshold for detecting structural similarities (i.e. traditional pairwise sequence comparison vs. profile-based sequence comparison). Lower thresholds imply lower accuracy of comparative modelling. Therefore, the estimate of the number of targets for structural genomics is sensitive to the accuracy we require in comparative modelling to remove a protein from the potential target list.
In our previous analysis and others [32] it has been shown that, in order to achieve reasonable modelling coverage of genome sequences, many orders of magnitude more structures will be required compared to the numbers calculated for coarse-grained structural coverage. In this analysis we applied a greedy clustering algorithm to group sequences in diverse domain families into more closely related (30% sequence identity) subfamilies (see Methods for more detail). Figure 5 shows the structural coverage of those fine-grained subfamilies with a known structure, followed by subfamilies lacking a structure, in descending order of domain family size (i.e. the number of domain sequence members identified through HMM searches). These figures suggest that over 30,000 structures will be required to model half the domain sequences in SP-TrEMBL using a 30% sequence identity modelling cutoff, a huge effort by any standards. However, it is likely that significant advances in threading techniques and comparative modelling will make this a significant overestimate of the number of structures required.
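To illustrate the general idea of greedy clustering at a 30% identity threshold, a minimal sketch is given below. It is a generic incremental scheme (each sequence joins the first cluster whose representative it matches, otherwise it founds a new cluster) and should not be read as the algorithm detailed in the Methods. The identity measure is passed in as a function; `toy_identity` is a stand-in that compares ungapped positions and is an assumption made purely so that the example runs without an alignment program.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_cluster(sequences: Sequence[str],
                   identity: Callable[[str, str], float],
                   threshold: float = 0.30) -> List[Tuple[str, List[str]]]:
    """Greedy incremental clustering; returns (representative, members) pairs."""
    clusters: List[Tuple[str, List[str]]] = []
    # Process longest sequences first so that they become representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


def toy_identity(a: str, b: str) -> float:
    """Fraction of identical positions without gaps (illustrative stand-in only)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


seqs = ["AAAAAAAAAA", "AAAACCCCCC", "CCCCCCCCCC", "GGGGGGGGGG"]
clusters = greedy_cluster(seqs, toy_identity)
for representative, members in clusters:
    print(representative, len(members))
```

In the toy data the third sequence is 60% identical to the second but shares nothing with the first cluster's representative, so it founds its own subfamily. This order dependence is characteristic of greedy schemes, which compare each sequence only to cluster representatives.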
How does solving structures for subfamilies in structurally characterised families compare to solving coarse-grained structurally uncharacterised target families in terms of structural annotation coverage and the relative species distribution of the families? In Figure 6a we show the percentage of domain sequences currently covered by subfamilies that contain a solved structure (black line). This coverage distribution can be compared to the fine-grained sequence coverage that would be gained by solving structures for the largest non-structural subfamilies found in CATH sequence families (dark grey line), and non-structural subfamilies found in structurally uncharacterised Pfam families (light grey line). In terms of fine-grained coverage, slightly higher levels of modelling coverage could be achieved by solving the largest fine-grained subfamilies in structural families, such as CATH. However, of greater interest is where we also show the coarse-grained coverage of non-structural families in the same figure (thin black line). It is apparent that structural coverage achieved by fine-grained re-sampling in some of the largest subfamilies in CATH families would bring similar levels of additional sequence coverage as targeting structurally uncharacterised coarse-grained Pfam families.
In addition, we compare the number of unique prokaryotic genomes found within structurally uncharacterised coarse-grained families and within non-structural fine-grained subfamilies of structural families (Figure 6b). Interestingly, the structurally uncharacterised subfamilies in CATH families tend to have a wider distribution across the prokaryotic genomes and might therefore be considered more attractive targets from the viewpoint of an experimentalist.
Solving a representative structure for as yet structurally uncharacterised domain families forms a significant cornerstone of PSI structural genomics initiatives. Such families, if targeted carefully, are likely to represent proteins with previously unobserved folds and/or functions. However, it is also apparent that there is a law of diminishing returns if one measures the contribution of structural genomics in terms of additional coarse-grained structural coverage. We have seen that a significant proportion of domain sequences belong to relatively few large domain families. Beyond these families there exist a considerable number of smaller families from which target lists can be derived. How we access these families efficiently requires considerable thought if one considers their accessibility through prokaryotic organisms, especially for experiments utilising a multi-orthologue approach. Some structurally uncharacterised Pfam families may be very diverse relatives of known structural families in CATH and are therefore of interest because they may reveal the extent to which families can diverge, illuminating the structural plasticity of these families.
Structural genomics is complementary to traditional biology as there is a greater interest in solving the structures of proteins whose function is not yet known. It is hoped that, once solved, a given structure will lend itself to computational function prediction methods, where function can be inferred or predicted. The largest domain families contain a considerable proportion of genome sequences, but they also contain a considerable proportion of the sequence diversity of genome sequences. As can be seen in Figure 7, much of this sequence diversity lacks any close structural homologue. A website detailing the structural coverage of CATH superfamily domain sequence relatives has been made available [53]. Some protein families are quite well conserved in function during evolution. Clearly, fewer targets will be needed from these families. However, other families are extremely diverse; for example, relatives in the P-loop hydrolase superfamily exhibit more than 40 functions [34].