Assignment of CATH, Pfam-A and NewFam domain family annotations
A non-redundant set of protein sequences was taken from the Swiss-Prot version 48.1 and TrEMBL version 31.1 databases, with sequences shorter than 50 residues excluded, giving a dataset of 2,241,277 sequences. Domain families were taken from the Gene3D database, which includes domain annotations from the CATH and Pfam domain classifications as well as Gene3D NewFam families. CATH version 3.0 domain assignments (representing 2043 CATH domain superfamilies) were made by searching libraries of hidden Markov models (HMMs) against the sequence dataset using HMMer, with an HMM match assigned at an E-value cutoff of 0.01. Overlapping annotations were resolved using the DomainFinder algorithm. Pfam assignments were based on Pfam version 19, which classifies 8193 Pfam-A families. The HMMer protocol was used to identify family members, with the family-specific gathering threshold used to identify true matches.
To maximise coverage of protein families, regions without CATH or Pfam-A assignments were further annotated, where possible, by NewFam families. As described previously, Gene3D NewFam families are automatically generated from the TribeMCL clustering of whole or partial sequences to which no CATH or Pfam-A domain assignment can be made. Such unassigned regions have been shown to follow a domain-like length distribution (Figure S1, Supplementary Material); the largest NewFam cluster contains 453 sequences.
Resolving overlapping CATH and Pfam-A families
A hierarchical approach to domain assignment was applied where CATH and Pfam domain annotations were found to overlap. A hybrid of CATH and Pfam assignments was calculated in which CATH domain matches were given priority over Pfam domain matches. Domains in the CATH database are identified from both sequence and structure, which is generally considered a more reliable approach to protein domain delineation than identification from sequence alone. Conflicts were resolved according to the degree of sequence overlap and family overlap. In cases where 70% or more of the sequences in a Pfam family were found to overlap a CATH family by 70% or more of their sequence length, the Pfam family was inherited into the CATH family. In cases where fewer than 70% of the Pfam sequence members overlapped the CATH family, the non-overlapping remainder of the Pfam family was assigned as a structurally uncharacterised Pfam family remainder. In some cases, where a partial region of a Pfam family was overlapped by a CATH assignment, a domain-like (in terms of length) Pfam region remained. A cutoff of 80 residues was used to filter such remaining Pfam regions, which were subsequently assigned to Pfam sub-domain families. On average, 1.8 Pfam families were merged into each CATH domain family. In all cases, family size was defined as the number of unique sequences matching each domain family.
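The 70%/70% inheritance rule can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of 1-based inclusive residue coordinates, and the choice to measure overlap relative to the length of the Pfam match region are all assumptions made for illustration.

```python
def overlap_fraction(pfam_region, cath_region):
    """Fraction of the Pfam region covered by the CATH region.

    Regions are (start, end) tuples in 1-based inclusive coordinates
    (an assumption; the source does not specify a coordinate convention).
    """
    p_start, p_end = pfam_region
    c_start, c_end = cath_region
    shared = min(p_end, c_end) - max(p_start, c_start) + 1
    return max(shared, 0) / (p_end - p_start + 1)


def merge_into_cath(pairs, seq_cutoff=0.7, family_cutoff=0.7):
    """Decide whether a Pfam family is inherited into a CATH family.

    pairs: one (pfam_region, cath_region_or_None) entry per Pfam family
    member; None means the member has no match to the CATH family.
    """
    if not pairs:
        return False
    # Count members whose Pfam region is covered by at least seq_cutoff.
    overlapping = sum(
        1
        for pfam, cath in pairs
        if cath is not None and overlap_fraction(pfam, cath) >= seq_cutoff
    )
    # Inherit the family when enough of its members overlap.
    return overlapping / len(pairs) >= family_cutoff
```

For example, a hypothetical family in which three of four members are at least 70% covered by the CATH match would be merged, since 75% of its members meet the sequence-level cutoff.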
Families matching solved structures
Since the CATH database is partially reliant on manual curation, it is not entirely up to date with the PDB. To address this, we searched sequences from Pfam-A (after overlap with CATH families had been resolved) against all proteins deposited in the PDB up to 21st January 2006. Pfam families matching a PDB structure were defined as Pfam_struc.
Analysis of completed genomes was based on genome-sequence sets defined by Integr8. We used 263 completed genomes (913,094 sequences) in this study, 237 prokaryotes and 26 eukaryotes, each of which had 95% or more of its sequences included in the Swiss-Prot and TrEMBL databases. We refer to this completed genome sequence dataset as integr8-263.
Taxonomic data for each protein in Swiss-Prot and TrEMBL were taken from the UniProt Knowledge database version 8.0.
Identification of helical transmembrane and problematic sequences and families
The Memsat program was used to identify transmembrane helices using default thresholds. We used the COILS2 algorithm to identify coiled-coil regions, with a probability cutoff of 0.9 and a window size of 28 residues, and the SEG program with default parameters to identify regions of low complexity. Helical transmembrane families were defined as those with 30% or more members having one or more helical membrane regions predicted by the Memsat algorithm. Problematic families were defined as those in which 30% or more members had one or more low-complexity regions (15 or more residues in length) as annotated by SEG, or a coiled-coil prediction.
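The family-level definitions above reduce to a simple thresholding step once per-sequence predictions are available. The sketch below assumes those predictions have already been parsed into dictionaries; the key names and label strings are hypothetical, and a member is counted as problematic if it has either a qualifying low-complexity region or a coiled-coil prediction, which is one reading of the definition.

```python
def classify_family(members, fraction_cutoff=0.3, min_low_complexity=15):
    """Label a family from per-member predictions.

    members: list of dicts with hypothetical keys
      'tm_helices'     -> number of predicted transmembrane helices
      'low_complexity' -> list of low-complexity region lengths
      'coiled_coil'    -> True if a coiled-coil region was predicted
    Returns a set of labels (possibly empty).
    """
    if not members:
        return set()
    n = len(members)

    # Members with at least one predicted transmembrane helix.
    tm = sum(1 for m in members if m["tm_helices"] >= 1)

    # Members with a long low-complexity region or a coiled coil.
    problematic = sum(
        1
        for m in members
        if m["coiled_coil"]
        or any(length >= min_low_complexity for length in m["low_complexity"])
    )

    labels = set()
    if tm / n >= fraction_cutoff:
        labels.add("helical_transmembrane")
    if problematic / n >= fraction_cutoff:
        labels.add("problematic")
    return labels
```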
Greedy coverage algorithm to identify fine-grained sequence clusters
A greedy coverage algorithm was run on sequence relatives assigned to CATH and Pfam domain families as follows. Links between family members were assigned, using an implementation of the Needleman and Wunsch global alignment sequence comparison algorithm, in cases where sequence identity and overlap were found to be ≥ 30% and ≥ 80% respectively. The sequence with the highest number of links (representing a possible homology modelling template) is selected first and removed from further calculations, along with all the sequences to which it is linked, to form a new cluster. This step is repeated until no sequences are left in the family. In cases where a family contained one or more experimentally solved structures, a slightly different procedure was employed. First, each structurally characterised sequence was used to seed a cluster, identifying true families of sequences that can be homology modelled, after which the general method was applied to the remaining sequences.
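The clustering step can be sketched as follows. This is an illustrative sketch rather than the authors' code: it assumes the pairwise links have already been computed from the alignments, breaks ties between equally linked sequences by sorted order, and skips a seed that has already been absorbed into an earlier seed's cluster. None of those three choices is specified in the text.

```python
from collections import defaultdict


def greedy_cover(sequences, links, seeds=()):
    """Partition a family into clusters by greedy coverage.

    sequences: iterable of sequence identifiers in the family.
    links: iterable of (a, b) pairs that already satisfy the
           identity and overlap thresholds.
    seeds: structurally characterised sequences, used as cluster
           centres before the general greedy step.
    Returns a list of (centre, members) tuples.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    remaining = set(sequences)
    clusters = []

    def take(centre):
        # Remove the centre and everything still linked to it.
        members = {centre} | (neighbours[centre] & remaining)
        remaining.difference_update(members)
        clusters.append((centre, members))

    # Seeded step: solved structures become cluster centres first.
    for seed in seeds:
        if seed in remaining:
            take(seed)

    # General step: repeatedly pick the sequence with most live links.
    while remaining:
        centre = max(
            sorted(remaining),
            key=lambda s: len(neighbours[s] & remaining),
        )
        take(centre)

    return clusters
```

Link counts are recomputed against the remaining sequences at every iteration, so a sequence whose neighbours have already been clustered is not favoured as a later centre.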
Identifying structural relatives for Pfam families
To calculate the number of newly structurally characterised Pfam families that represented a new fold or superfamily, we identified the oldest PDB structure (in terms of release date) matching each Pfam family. Pfam-A HMMs were searched using the HMMer protocol against two sequence datasets representing the CATH domain database (version 3.0) and the SCOP domain database (version 1.69). Because each database lags behind the PDB, care was taken to identify whole sequences or domain sequences (from partially classified chains) not yet assigned in the respective classifications. Classified domains and unclassified sequences were combined to produce a representative sequence set for each database containing all PDB chains released up to 31st December 2005. COMBS sequences were used for the CATH domain database and ASTRAL sequences for the SCOP database (24). Seqres sequences were used for unclassified PDB sequences (49). The HMMer protocol was used to search the Pfam HMM library using Pfam-specific gathering thresholds.
RLM carried out the study, participated in its design and drafted the manuscript. TEL participated in data acquisition and analysis. CAO helped conceive the study, assisted in its design and in drafting the manuscript. All authors read and approved the final manuscript.
We thank Andrzej Joachimiak for critical reading of the manuscript. This work was funded by the NIGMS of the NIH as part of the Protein Structure Initiative, and the Wellcome Trust.