Validation against Proteins with Known Structural Motifs
To test the structural alphabet-based strategy for discovering metal-binding site structural motifs, a database of 42 structural zinc sites from 29 proteins compiled in previous work was searched for proteins containing the C(2)C(13–15)C(2)C sequence motif, where the number in parentheses indicates the number of amino acid residues separating the conserved Zn-binding cysteines. Proteins with such a sequence motif belong to the Zn-finger family of the nuclear receptor type, having a Cys4 Zn-binding site, which is known to adopt a specific structure. Each of the Zn-proteins containing the C(2)C(13–15)C(2)C sequence motif was represented by a 1D structural alphabet, as described in Methods and illustrated in Figure 1. All of these proteins were found to possess a f(2)o(13–15)f(2)m structural motif of the Zn-binding site (see Figure 1). This result suggests that the structural alphabet-based approach is promising for discovering new structural motifs.
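The motif notation used above, with letters separated by residue counts in parentheses, maps naturally onto a regular expression over a 1D sequence. The following is a minimal sketch of such a search, not the authors' implementation; the function names `motif_to_regex` and `find_motif` and the toy sequence in the usage note are hypothetical.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'C(2)C(13-15)C(2)C' into a compiled regex.

    Letters are matched literally; '(n)' or '(n-m)' means that n (or n to m)
    arbitrary residues separate the neighbouring letters.
    """
    # Tokens are single letters or parenthesised gaps (hyphen or en dash).
    tokens = re.findall(r"[A-Za-z]|\(\d+(?:[-–]\d+)?\)", motif)
    parts = []
    for tok in tokens:
        if tok.startswith("("):
            bounds = re.split(r"[-–]", tok[1:-1])
            lo, hi = bounds[0], bounds[-1]
            parts.append(".{%s,%s}" % (lo, hi))
        else:
            parts.append(re.escape(tok))
    return re.compile("".join(parts))


def find_motif(sequence, motif):
    """Return (start, end) index pairs, one per start position that matches."""
    pattern = motif_to_regex(motif)
    # The lookahead lets matches that overlap each other be reported.
    scan = re.compile("(?=(%s))" % pattern.pattern)
    return [(m.start(1), m.end(1)) for m in scan.finditer(sequence)]
```

The same function can be applied to an amino acid sequence or to a structural-alphabet string. For a hypothetical sequence `"xxCabC" + "d" * 13 + "CefC"`, `find_motif(seq, "C(2)C(13-15)C(2)C")` reports one match starting at index 2.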
Structural Preference of Mg2+-Binding Sites
Although the 70 Mg2+-proteins used herein share low sequence identity, do their Mg2+-binding sites prefer certain local structures? To answer this question, the 3D structure of each of the 70 nonredundant Mg2+ proteins was represented by a 16-letter structural alphabet (see Methods and Additional file 1), and the frequencies of the letters in all the first and second shells as well as in the entire Mg2+ dataset were compared (see Figure 2). The results reveal a clear preference towards certain types of local structures in the Mg2+-binding sites. The 'b', 'd', 'f', and 'h' frequencies of first-shell Mg2+-ligands and the 'd', 'e', 'f', and 'k' frequencies of second-shell Mg2+-ligands are statistically significantly higher than the respective frequencies of all the amino acid residues in the dataset (see Table 1). Both first and second-shell Mg2+-ligands favor the 'd' and 'f' structures. Furthermore, the first-shell (but not the second-shell) Mg2+-ligands strongly prefer the local structure 'h', whose frequency among first-shell ligands is 5.3-fold greater than that among all residues in Mg2+ proteins. In contrast, compared to all amino acid residues in the Mg2+ proteins, both first and second-shell Mg2+-ligands disfavor certain local protein structures such as the 'c' and 'm' structures: the 'c', 'i', 'm', and 'p' frequencies of first-shell Mg2+-ligands and the 'a', 'c', 'm', and 'o' frequencies of second-shell Mg2+-ligands are statistically significantly lower than the respective frequencies of all the amino acid residues in the dataset (see Table 1).
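The comparison of letter frequencies described above can be illustrated with a short calculation of fold enrichment and a tail probability. This is only a sketch: the text does not name the statistical test behind Table 1, so the one-sided binomial test below is an assumption, and all function names are hypothetical.

```python
from collections import Counter
from math import comb


def letter_frequencies(letters):
    """Relative frequency of each structural letter in a string of letters."""
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


def fold_enrichment(subset, background, letter):
    """Frequency of `letter` in `subset` divided by its background frequency."""
    bg = letter_frequencies(background).get(letter, 0.0)
    sub = letter_frequencies(subset).get(letter, 0.0)
    return sub / bg if bg > 0 else float("inf")


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    Assumed test: chance of seeing at least k occurrences of a letter among
    n ligand residues if each had the background probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

Here `subset` would hold the letters of the first-shell (or second-shell) ligands and `background` the letters of all residues in the dataset. A fold enrichment well above 1 combined with a small tail probability corresponds to a letter that is favored by the ligands.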
Additional file 1. The Mg2+-dataset containing 77 metal-binding sites in 70 nonredundant Mg2+-proteins. A table listing the PDB entries, protein description, native metal-cofactors (if known), EC code, metal-bound amino acid residues, and first-shell structural representation of the 70 nonredundant Mg2+-proteins.
To relate the observed bias of the first-shell Mg2+-ligands for certain structures to standard regular and irregular secondary structures, the percentage frequency distribution of first-shell, second-shell, and all amino acid residues that are found in α-helices, β-strands, or loops in the Mg2+-proteins, according to the secondary structure information in the Protein Data Bank (PDB) files, was computed (see Figure 3). The loop occurrence frequency of the first or second-shell Mg2+-residues (47–50%) is significantly higher than that of all residues (~32%), with p-values ≤ 0.014 (see Table 1). However, the β-sheet occurrence frequency of the first or second-shell Mg2+-residues (~29%) is not significantly higher than that of all residues (~22%). In contrast, the α-helix occurrence frequency of the first or second-shell Mg2+-residues (22–23%) is nearly half of the respective frequency of all residues (~46%), with p-values ≤ 0.0004.
In summary, the Mg2+-binding sites generally prefer certain local structures: compared to all amino acid residues in the Mg2+ proteins, both first and second-shell ligands tend to prefer loops to helices. This may be due to the need to position not only the first and second-shell ligands, but also the helix dipole, in a proper orientation for metal binding.
Structural Motifs of Mg2+-Binding Sites
Even when the Mg2+-proteins share no significant sequence homology, several of their Mg2+-binding sites have the same first-shell letters and similar interletter spacing (see Methods and Additional file 1). Patterns shared by 3 or more Mg2+-binding sites are treated as structural motifs; they are listed in Table 2 and illustrated in Figure 4, while first-shell structural patterns that are common to only 2 Mg2+-binding sites are listed in Additional file 2. For the first shell, 4 structural motifs, representing about a fifth (16/77 or 21%) of all Mg2+-binding sites, were discovered. All 4 motifs occur in proteins whose functions are Mg2+-dependent or whose native co-factors are Mg2+ according to UniProt and/or the literature. Consistent with the above finding that the 'h' structure is preferred by the first-shell Mg2+-ligands, 'h' is in the middle of all 4 motifs, and the partial motif 'f(1–2)h' accounts for half of the Mg2+-proteins with structural motifs. For the second shell, too many residues define the Mg2+-binding site; hence, not enough Mg2+-binding sites possess the same second-shell letters and similar interletter spacing. However, 5 partial motifs for the second shell were found: f(1)lm, kl(0–1)m, d(1–2)ff, d(1)e(1)i(0–5)l, and f(1)l(18–25)d, with occurrence frequencies of 21, 12, 11, 8, and 6%, respectively.
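The grouping step behind such motifs can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure from Methods: sites are grouped purely by their ordered first-shell letters, the observed range of each interletter gap is reported, and the criterion for "similar" spacing is not reproduced. The function names and the residue numbers in the usage note are hypothetical.

```python
from collections import defaultdict


def site_signature(site):
    """Split a site into its ordered letters and the gaps between them.

    `site` is a list of (residue_number, letter) pairs for the first-shell
    ligands; a gap is the number of residues separating two ligands.
    """
    ordered = sorted(site)
    letters = tuple(letter for _, letter in ordered)
    gaps = tuple(b[0] - a[0] - 1 for a, b in zip(ordered, ordered[1:]))
    return letters, gaps


def candidate_motifs(sites, min_sites=3):
    """Map each motif string to the number of sites that share its letters."""
    groups = defaultdict(list)
    for site in sites:
        if not site:
            continue  # nothing to group for a site without ligands
        letters, gaps = site_signature(site)
        groups[letters].append(gaps)

    motifs = {}
    for letters, gap_list in groups.items():
        if len(gap_list) < min_sites:
            continue  # patterns seen in fewer sites are not called motifs
        text = letters[0]
        for pos, letter in enumerate(letters[1:]):
            lo = min(g[pos] for g in gap_list)
            hi = max(g[pos] for g in gap_list)
            span = str(lo) if lo == hi else "%d-%d" % (lo, hi)
            text += "(%s)%s" % (span, letter)
        motifs[text] = len(gap_list)
    return motifs
```

For three hypothetical sites `[(10, 'f'), (13, 'h'), (140, 'm')]`, `[(5, 'f'), (8, 'h'), (150, 'm')]`, and `[(20, 'f'), (23, 'h'), (182, 'm')]`, the function returns `{'f(2)h(126-158)m': 3}`. A real implementation would additionally need a rule for rejecting groups whose spacing varies too much.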
Each of the 4 motifs in Table 2 is found in proteins whose Mg2+-binding domains belong to the same superfamily: proteins with the same Mg2+-structural motif have Mg2+-binding domains with the same CATH numbers (Table 2), implying structurally homologous domains. For example, all 3 proteins with the f(2)h(126–158)m motif possess in common a Mg2+-binding domain belonging to the fructose-1,6-bisphosphatase, subunit A, domain 1 superfamily (CATH number 3.30.540.10). Likewise, all 5 proteins with the k(26–29)h(1)a motif possess Mg2+-binding domains with the same CATH number, 188.8.131.520. The fact that the motifs are found in structurally homologous Mg2+-binding domains further supports the use of the structural alphabet to discover motifs.
The first-shell motifs discovered herein can also help to uncover relationships between proteins with unassigned CATH numbers. For example, 2 of the 3 proteins with the e(24–47)h(24)k motif (1SJC and 1TKK) possess Mg2+-binding domains pertaining to the enolase superfamily (CATH number 184.108.40.206), whereas the third protein (2AKZ) has not yet been assigned a domain and therefore has no CATH number. Although the N-acylamino acid racemase (1SJC) and gamma enolase (2AKZ) proteins share neither significant sequence homology (only 15.4% identity) nor overall structural similarity (protein backbone rmsd = 17.5 Å), they possess similar Mg2+-binding site structures (backbone rmsd of the first-shell letters = 0.5 Å), as shown in Figure 5. Similarly, 3 of the 5 proteins with the f(1)h(109–349)b motif (1O08, 1WPG, and 2B82) possess Mg2+-binding domains with the same CATH number (220.127.116.110), whereas the other 2 proteins (1U7P and 2C4N) have not yet been chopped into domains and therefore have not been assigned CATH numbers. The results in Table 2 predict that the Mg2+-dependent phosphatase (1U7P) and NagD (2C4N) proteins are likely to possess Mg2+-binding domains that are structurally homologous to those assigned the CATH number 18.104.22.1680.
Figure 5. The conserved binding site of 2 nonhomologous Mg2+-proteins. (a) Cartoon diagram of the metal-binding domain in N-acylamino acid racemase (1SJC). (b) Cartoon diagram of the metal-binding domain in gamma enolase (2AKZ). (c) Superposition of the first-shell structural letters of 1SJC (blue) and 2AKZ (yellow).
Relation between Mg2+-Structural Motifs and PROSITE Sequence Motifs
To determine whether any of the Mg2+-proteins containing structural motifs match sequence motifs stored in the PROSITE database, the sequences of the proteins listed in Table 2 were scanned for the occurrence of PROSITE sequence motifs. None of the proteins match any PROSITE sequence motifs encompassing residues involved in Mg2+-binding. However, the halotolerance protein hal 2 (1KA1) containing the f(2)h(126–158)m motif matched 2 inositol monophosphatase family signatures (PROSITE PDOC00547) containing conserved metal-binding residues. This supports the 'f(2)h(126–158)m' motif as a signature of Mg2+-binding sites.
Relation between Mg2+-Structural Motifs and Protein Function
Do any of the structural motifs found for the Mg2+-proteins map to specific protein functions? To answer this question, for each of the Mg2+-proteins found with a structural motif, the functional group of the protein from the PDB header and the enzyme classification (EC) code, if applicable, are listed in Table 2. Note that proteins belonging to the same functional group have the same first EC number. The results in Table 2 show that most of the structural motifs found for the Mg2+-proteins map to certain protein functions. For example, proteins with the partial f(1–2)h motif are all hydrolases, catalyzing the hydrolytic cleavage of mostly ester bonds (EC3.1.-.-), except for beta-phosphoglucomutase (1O08), which is an isomerase converting beta-D-glucose 1-phosphate to beta-D-glucose 6-phosphate. Interestingly, although class b acid phosphatase (2B82) and the halotolerance protein hal 2 (1KA1) contain structurally nonhomologous Mg2+-binding domains with different CATH numbers, both are phosphoric monoester hydrolases (EC3.1.3.-). Proteins with the e(24–47)h(24)k motif are lyases and/or isomerases, whereas proteins with the k(26–29)h(1)a motif have even more diverse functions: 3 are oxidoreductases (1POX, 1UMD, 2C3M), one is a lyase (1ZPD), and the other is a transferase (1ITZ). This shows that even if the proteins share structurally homologous domains (CATH number 22.214.171.1240) and structurally similar Mg2+-binding sites, as represented by the k(26–29)h(1)a motif, they can perform different functions.
Statistical Significance of the Mg2+-Structural Motifs
Do the structural motifs found for Mg2+-proteins in Table 2 occur in other proteins that do not bind metal ions? To address this question, de Brevern's databank of protein structures that have been encoded into 1D structural sequences was searched for the occurrence of each of the 4 structural motifs listed in Table 2. Consistent with the Mg2+ and Ca2+ datasets, proteins in de Brevern's databank sharing ≥ 30% sequence identity with ≥ 2.5-Å resolution X-ray structures were removed. Proteins in de Brevern's databank whose structures are complexed with metal ions were also removed, yielding a set of 385 non-homologous test proteins. To eliminate those matched structural letters that cannot spatially bind Mg2+, the maximum Cα-Cα distance between any pair of Mg2+-ligands belonging to proteins of a given structural motif in Table 2 was extracted; this distance is 9.32 Å for the e(24–47)h(24)k motif, 8.32 Å for f(1)h(109–349)b, 8.44 Å for f(2)h(126–158)m, and 7.86 Å for k(26–29)h(1)a. For a given structural motif in Table 2, matched letters in the test proteins whose Cα-Cα distances exceed the respective maximum distance were eliminated. This resulted in no matches for the e(24–47)h(24)k and f(2)h(126–158)m motifs, whereas 2 proteins (1C3K, 1GPE) contained the f(1)h(109–349)b motif, and another 2 proteins (1A7U, 1JFR) contained the k(26–29)h(1)a motif. A check of the literature confirmed that these 4 proteins (1C3K, 1GPE, 1A7U, 1JFR) do not bind metal ions. This suggests that (i) the 4 Mg2+-structural motifs discovered are statistically significant, and (ii) the e(24–47)h(24)k and f(2)h(126–158)m motifs could be used to predict metal-binding sites.
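The distance filter described above amounts to a maximum pairwise distance check on the Cα atoms of the matched residues. The sketch below uses the cutoffs quoted in the text; the function names, the `MOTIF_CUTOFFS` dictionary layout, and any coordinates passed in are hypothetical, and reading Cα coordinates from a PDB file is left out.

```python
from itertools import combinations
from math import dist

# Maximum Cα-Cα distances (in Å) quoted in the text for each motif.
MOTIF_CUTOFFS = {
    "e(24-47)h(24)k": 9.32,
    "f(1)h(109-349)b": 8.32,
    "f(2)h(126-158)m": 8.44,
    "k(26-29)h(1)a": 7.86,
}


def max_pairwise_distance(ca_coords):
    """Largest distance between any two Cα positions, given as (x, y, z)."""
    return max(
        (dist(a, b) for a, b in combinations(ca_coords, 2)),
        default=0.0,  # fewer than two atoms: no pair to measure
    )


def passes_distance_filter(ca_coords, motif):
    """Keep a letter match only if its residues are close enough in space."""
    return max_pairwise_distance(ca_coords) <= MOTIF_CUTOFFS[motif]
```

A match in a test protein would be retained only if `passes_distance_filter` returns `True` for the Cα coordinates of its matched residues; matches with widely separated residues are discarded because they could not coordinate the same ion.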
Metal Preference of the Mg2+-Structural Motifs
To check the specificity of the 4 structural motifs in Table 2 for Mg2+, the same procedure used to represent the Mg2+-binding sites in terms of their local structure was repeated for Ca2+, which is the metal ion most similar to Mg2+. Both Mg2+ and Ca2+ are closed-shell divalent cations belonging to the same group (IIA) with similar chemical properties: they are both "hard" dications that prefer to bind directly to "hard" oxygen-containing anions, and are hence often found to bind in the same protein cavity. Thus, the 3D structure of each of the 177 nonredundant Ca2+ proteins was represented by a 16-letter structural alphabet (see Methods), and the 1D structural letter representations of the 230 Ca2+-binding sites are listed in Additional file 3.
Additional file 3. The Ca2+-dataset containing 230 metal-binding sites in 177 nonredundant Ca2+-proteins. A table listing the PDB entries, protein description, native metal-cofactors (if known), EC code, metal-bound amino acid residues, and first-shell structural representation of the 177 nonredundant Ca2+-proteins.
None of the structural motifs in Table 2 or Additional file 2 were found in 3 or more Ca2+-binding sites, and they therefore cannot be classified as Ca2+-structural motifs according to our definition. The f(1)h(109–349)b motif is found in the Ca2+-binding site of the hydrolase from the haloacid dehalogenase family (2FI1), which appears to utilize Mg2+ as a natural co-factor. Although the k(26–29)h(1)a motif is found in the calcium-binding sites of the transketolase protein (1TRK) and benzoylformate decarboxylase (1Q6Z), the latter is a Mg2+-dependent enzyme. The e(24–47)h(24)k and f(2)h(126–158)m motifs did not match any first-shell structural letters of the Ca2+-binding sites, indicating that they seem to favor Mg2+ over its competitor, Ca2+.
Discussion and conclusion
Comparison with Previous Structural Motif Discovery Methods
Assuming that similarity in the local active site structure implies similarity in biological function, 3D patterns/templates of key active sites have been used to suggest the function of a protein whose structure is known. The 3D patterns/templates have been constructed either manually or automatically using various methods, which have been reviewed recently by Watson et al. Recently, 3D templates in the absence of experimental data have been constructed using the evolutionary trace method to identify evolutionarily important, solvent-accessible residues that cluster in the protein structure. Furthermore, structural motifs for the metal-binding sites of 3 distinct metalloenzyme families, viz., DNase 1 homologs, dimetallic phosphatases, and dioxygenases, have been obtained by first identifying physicochemical property-based sequence motifs in homologous protein sequences, and subsequently identifying those motifs whose structures are conserved in members of a family/superfamily [23,24]. However, to the best of our knowledge, 3D patterns of key active sites and recurrent patterns (structural motifs) have not been identified using the structural alphabet to convert 3D structures to the respective 1D letter sequences. Also, systematic studies of the structural preference or conservation of Mg2+-binding sites in nonhomologous proteins and of Mg2+-specific structural motifs have not been reported.
Advantages of the Structural-Alphabet Based Motif Discovery Approach
This work presents the first application of the structural alphabet approach to define the 3D patterns of metal active sites and to identify recurrent patterns (structural motifs). The method requires as input only the 3D protein structure to define a 1D structural representation of the respective active site. The structural alphabet-based approach has several advantages: (i) It is efficient at handling many structures as it takes less than a minute on a present-day PC to convert a 3D structure to the corresponding 1D structural sequence. (ii) It requires no sequence comparisons, no parameters or scoring functions and can thus produce consistent structural motifs, whose structures are readily visualized, as illustrated in Figures 4 and 5. (iii) It is general and can be used to define 3D patterns not only in metal-binding sites, but also in enzyme active sites, ligand-binding clefts and interacting regions between proteins and their respective partners. The 3D patterns/motifs discovered could be incorporated in methods to detect metal/ligand-binding sites to improve their prediction accuracy.
Secondary Structure Preference of Mg2+-Binding Residues
In this work, the structural alphabet-based approach has been used to reveal the structural preference of Mg2+-binding sites. Even though helix-like segments represented by the letter 'm' are the most common building block of the Mg2+-proteins in the dataset, residues that bind Mg2+ disfavor helices but favor loops. The similarity in the structural preference of the first and second-shell residues supports previous conclusions regarding the relationship between these 2 layers; namely, the structure and properties of the second shell are dictated by those of the first shell.
Similar Mg2+-Binding Site Structures in Dissimilar Protein Sequences
The motif discovery method herein has revealed 4 structural motifs, comprising 21% of the Mg2+-binding sites. The 3D structural motifs discovered seem to have more predictive utility in identifying Mg2+-binding sites than 1D sequence motifs: a scan of the Mg2+-protein sequences in our dataset for the occurrence of sequence motifs stored in the PROSITE database yielded only a single positive match, 1WC1, which contains a PROSITE sequence motif predicting the protein to bind Mg2+. However, the ScanProsite results did not identify any of the Mg2+ proteins with structural motifs.
Functional Preference of the Mg2+-Structural Motifs
The structural motifs discovered generally relate to the biological role of Mg2+ and the function of the respective proteins. They capture some essential biochemical and/or evolutionary properties, as proteins found to contain a specific structural motif possess structurally homologous Mg2+-binding domains, even though they share no significant sequence homology. Furthermore, the f(2)h(126–158)m motif maps to a specific functional group, namely, hydrolases, indicating the apparent importance of the local Mg2+-binding site structure for the function of these hydrolases. As the f(2)h(126–158)m motif was not found in non-metalloproteins and in Ca2+-binding proteins, the presence of this motif in a novel protein structure may suggest a likely Mg2+-binding site and the protein function. On the other hand, the other 3 motifs map to more than one functional group, suggesting that the local Mg2+-binding site structure is not the only determinant of the protein's function.
Why Mg2+-Specific Structural Motifs are Not Found For Most Mg2+-Proteins
Out of the 70 nonhomologous Mg2+-proteins, only 16 have first-shell structural motifs, while the rest do not seem to possess any metal-binding site structural motifs. Why? One reason might be that some Mg2+-structural motifs have been missed in this work. As the dataset included only proteins with Mg2+-bound structures (see Database subsection below), some PDB structures complexed with heavier metal ions such as Mn2+ may in fact correspond to native Mg2+-binding site(s); moreover, not all structures of proteins whose native co-factor is Mg2+ have been solved. A second reason is that for native Mg2+-binding sites that can accommodate other metal ions such as Ca2+ or Mn2+, the binding-site structure need not be conserved in order to recognize a specific metal co-factor. A third reason is that although Mg2+ occupies the binding site in the 3D structure, it may not be the native cofactor. Although all 70 proteins in our dataset are bound to Mg2+, according to PDBSUM and the UniProt annotation and references therein, 14 proteins do not employ Mg2+ as the native co-factor, while for 6 proteins, the native metal-cofactor is apparently not known (see Additional file 1). For example, calbindin d9K is a vitamin D-dependent calcium-binding protein, but in the 1IG5 structure, the native cofactor Ca2+ has been replaced by Mg2+. In Mg2+-proteins with multiple Mg2+-binding sites, one or more sites may be non-native, as they have been artificially induced by the high Mg2+ concentration used during crystallization. In these cases, the local structure of the non-native metal-binding site would not be expected to share any similarity with that of a native Mg2+-binding site, where the conserved local structure (as in the f(2)h(126–158)m motif) plays an important role in the protein's function. The absence of structural motifs for non-native Mg2+-binding sites indirectly supports our strategy.