Recent analyses [3,4] have shown that structural genomics projects are already making significant contributions to our knowledge of structure space, in terms of both the number and the value of structure depositions. Despite this, structural genomics will clearly not be able to support the experimental determination of the structures of all proteins. Accordingly, target selection should sample broadly from family space in a manner that optimises the number of genome sequences that can be modelled. A Pfam5000-like strategy offers an effective approach to selecting target families for which a structure has not yet been solved. Whilst the strategy is conceptually simple, certain considerations must be addressed when putting it into practice. As has been shown, many of the largest structurally uncharacterised Pfam families have few or no prokaryotic members, and the size distribution of these families is skewed towards smaller families in comparison with the structurally characterised families. Nonetheless, the characterisation of such families should play a significant part in structural genomics, especially in light of the potential identification of novel folds or new superfamilies. In addition, it may be valuable to choose further targets from large structurally characterised families that so far have a very low level of fine-grained homology modelling coverage.
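The selection logic described above can be illustrated with a minimal sketch. It is not the procedure used in this work: the `Family` record, its field names, the `max_coverage` cut-off of 0.1 and the toy family names are all hypothetical, chosen only to show how size-ranked selection of uncharacterised families might be combined with a supplementary list of poorly modelled characterised families.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Family:
    """Hypothetical summary record for one sequence family."""
    name: str
    size: int                  # number of member sequences
    has_structure: bool        # True if any member has a solved structure
    prokaryotic_members: int   # members from prokaryotic genomes
    modelling_coverage: float  # fraction of members that can be modelled (0..1)


def rank_uncharacterised(families, n_targets):
    """Largest families with no solved structure, largest first."""
    candidates = [f for f in families if not f.has_structure]
    return sorted(candidates, key=lambda f: f.size, reverse=True)[:n_targets]


def rank_under_modelled(families, n_targets, max_coverage=0.1):
    """Largest characterised families whose modelling coverage is still low.

    max_coverage is an assumed, illustrative threshold.
    """
    candidates = [
        f for f in families
        if f.has_structure and f.modelling_coverage < max_coverage
    ]
    return sorted(candidates, key=lambda f: f.size, reverse=True)[:n_targets]


if __name__ == "__main__":
    # Toy data only; names and numbers are invented for illustration.
    toy = [
        Family("FAM_A", 5200, True, 3100, 0.05),
        Family("FAM_B", 1800, False, 0, 0.0),
        Family("FAM_C", 950, False, 620, 0.0),
        Family("FAM_D", 4000, True, 2500, 0.70),
    ]
    for family in rank_uncharacterised(toy, 2):
        print("novel target:", family.name, family.prokaryotic_members)
    for family in rank_under_modelled(toy, 2):
        print("supplementary target:", family.name)
```

Keeping the two rankings separate reflects the distinction drawn in the text between families lacking any structure and characterised families with sparse modelling coverage. The `prokaryotic_members` field is reported rather than used as a filter, since families with few or no prokaryotic members may still be worth pursuing.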
Comparative genome analysis has shown that many of the most functionally diverse domain superfamilies have expanded significantly during evolution through extensive gene duplication within a genome [54,55]. Following domain duplication, the evolution of a new function has been achieved in a number of ways, including fusion of the duplicate domains with a range of different domain partners. Other mechanisms include significant structural embellishment of a domain or changes in the oligomerisation state of a protein. Increasing the coverage of structure annotations will reveal new insights into the relationships between protein sequence, structure and function, which in turn will expedite our understanding of protein function at the molecular level and improve the methods by which we can automatically provide structure-guided functional annotations for new protein structures.