Center for Studies in Physics and Biology, The Rockefeller University 1230 York Avenue, New York, NY 10021, USA 1Department of Biochemistry, University of Washington Box 357350, Seattle, WA 98195-7350, USA
*To whom correspondence should be addressed. Tel: +1 212 327 8139; Fax: +1 212 327 8544; Email: email@example.com
Received July 13, 2005. Revised September 13, 2005. Accepted September 13, 2005.
Protein–DNA interactions play a central role in transcriptional regulation and other biological processes. Investigating the mechanism of binding affinity and specificity in protein–DNA complexes is thus an important goal. Here we develop a simple physical energy function, which uses electrostatics, solvation, hydrogen bonds and atom-packing terms to model direct readout and sequence-specific DNA conformational energy to model indirect readout of DNA sequence by the bound protein. The predictive capability of the model is tested against another model based only on the knowledge of the consensus sequence and the number of contacts between amino acids and DNA bases. Both models are used to carry out predictions of protein–DNA binding affinities which are then compared with experimental measurements. The nearly additive nature of protein–DNA interaction energies in our model allows us to construct position-specific weight matrices by computing base pair probabilities independently for each position in the binding site. Our approach is less data intensive than knowledge-based models of protein–DNA interactions, and is not limited to any specific family of transcription factors. However, native structures of protein–DNA complexes or their close homologs are required as input to the model. Use of homology modeling can significantly increase the extent of our approach, making it a useful tool for studying regulatory pathways in many organisms and cell types.
Nucleic Acids Research 2005 33(18):5781-5798. Published by Oxford University Press. The online version of this article has been published under an open access model.
Gene regulation is mediated in part by protein transcription factors (TFs) binding to cis-regulatory regions of the genome. Accurate genomewide characterization of TF binding sites is thus a necessary prerequisite to deciphering complex gene expression patterns. Probabilistic models of TF binding profiles, often called position-specific weight matrices (PWMs), are typically used as input to such predictions (1–3). With the weight matrix representation of TF binding sites, the probability P(S|p) that sequence S is a binding site for the TF represented by p is given by
In principle, one should be able to predict the TF binding site profile from a structure of the protein–DNA complex or its close homolog. At first it was anticipated that structural studies would reveal a universal protein–DNA recognition code, which could be used for predicting TF binding sites based on amino acid identities at the protein–DNA interface (4,5). It became apparent, however, when more protein–DNA structures were solved and classified that despite some predominantly occurring interactions, such as Lys-G, the energetics of amino acid–base contacts depends on their structural context and, in particular, on the structural family of the DNA-binding protein (6–10). Many amino acids were observed to form favorable contacts with different bases, making it necessary to generalize a deterministic recognition code to a probabilistic binding profile based, for example, on maximizing the likelihood of observed protein–DNA contacts (11–13). Probabilistic recognition codes are more accurate when developed for a specific structural family, thereby implicitly taking protein–DNA structural context into account. Indeed, binding site profiles based on the classification of TFs into families were found to be useful in bioinformatics pattern detection algorithms (14). However, data availability has so far limited knowledge-based PWM predictions to the C2H2 zinc finger family (15,16).
An alternative approach to specificity and binding affinity predictions is based on all-atom modeling of protein–DNA complexes (17–19). Starting from a known structure of the protein bound to its consensus DNA sequence, an ensemble of models is created by threading novel DNA sequences onto the binding site. Protein–DNA binding energies, G, are then evaluated for each member of the structural ensemble. G predictions can be either used to directly infer high-affinity binding sites in genomic sequence or converted into PWM probabilities using the Boltzmann formula. In the latter case, it is only necessary to compute G values for all one-point mutations from the consensus-binding site. The main limiting factor of the structural approach to TF binding site specificity predictions is the availability of experimentally determined structures of protein–DNA complexes. The range of applicability of structural methods will significantly increase if the DNA-binding proteins can be modeled by homology. Homology modeling involves threading a protein amino acid sequence onto a suitable structural template chosen on the basis of its sequence similarity to the protein of interest. The threading procedure creates a new protein–DNA binding interface, for which G and PWM probability calculations are then carried out as in native structures.
Here we present a computational model for predicting protein–DNA binding affinities and specificities. The model can be applied to a wide variety of DNA-binding proteins for which there is either a native protein–DNA structure or a sufficiently close homolog. The model is based on a simple free energy function, which consists of the protein–DNA interaction energy and the DNA conformational energy. The protein–DNA interaction energy is used to describe direct readout of the DNA sequence by the protein, whereas the DNA conformational energy takes into account distortion of B-DNA shape caused by protein binding. We carried out a series of tests of our PWM and binding energy predictions. First, we checked the ability of the model to reproduce experimental binding free energy measurements. We also assessed the accuracy of the pairwise additivity approximation in our analysis. Second, we checked the ability of our algorithm to discriminate experimentally known TF binding sites from random ensembles of sequences. Third, we carried out PWM predictions for a number of TFs and compared them with experimental PWMs. For all these predictions native protein–DNA complexes were used as structural templates. Finally, the extent of applicability of homology modeling to protein–DNA binding affinity predictions was explored with several representative PWM calculations. The relative accuracy and computational efficiency of our approach allowed us to carry out numerous predictions of TF binding affinities and specificities, facilitating future experimental and computational studies of transcriptional regulation in different organisms and biological systems.
Binding stability and weight matrix predictions
Free energy model of protein–DNA interactions
We extended Rosetta protein–nucleic acid model developed in Ref. (20) by adding sequence-specific DNA conformational energy. The free energy function employed in protein–DNA binding affinity calculations consists of the protein–DNA interaction component describing intermolecular readout of the DNA sequence by the protein, and of the DNA deformation component describing intramolecular readout of the binding site sequence:
The DNA sequence-dependent conformational energy model is based on the effective harmonic potential developed in Ref. (25). The DNA conformational energy is computed as a sum over all base pair and base step energies:
All distributions are self-consistently trimmed by removing all data points for which at least one of the geometric parameters is more than three standard deviations away from the average, followed by updating all averages and standard deviations. This procedure is repeated until convergence, which usually requires 2–3 iterations and removes just a few percent of the data points (28). A non-homologous, manually curated set of 101 structures used in Ref. (20) was employed to derive the force constants for the DNA conformational potential.
With the pairwise additivity assumption, a set of 4L predicted energies can be converted into weight matrix probabilities using the Boltzmann formula (29):
Binding free energies
Table 1 collects structural data for protein–DNA complexes with binding free energy measurements available from the ProNIT database (http://dna01.bse.kyutech.ac.jp/jouhou/pronit/pronit.html) and from the literature. Each dataset in Table 1 consists of a structure of the protein–DNA complex and a series of binding free energy measurements G for wild-type and mutant DNA sequences. In several cases, association or dissociation constants reported by the authors were converted into binding free energies using
Binding sites and weight matrices
Table 2 collects protein–DNA structures for which binding site data and experimental weight matrix data are available. The Zif268 weight matrix was taken from a selection experiment (31). The experiment generated the G-C-G-T/g-G/A-G-G-C/a/t-G-G/T consensus sequence, which we used to create five binding site variants (in addition to the consensus sequence from the Zif268 structure). Binding sites and weight matrices for seven Escherichia coli TFs were obtained from the DPInteract database (32). For TrpR, 4 sites from DPInteract were augmented with 10 additional sites from RegulonDB (33). For Ndt80 in Saccharomyces cerevisiae we used the weight matrix from Ref. (34); Ndt80 binding sites were collected by searching promoters of the genes known to be regulated by Ndt80 with the YGNCACAAAA consensus sequence. A collection of 17 naturally occurring MAT a1/2 binding sites plus the synthetic consensus sequence were obtained from Ref. (30). Eight Gcn4p sites were assembled from Ref. (35) and the TRANSFAC database (36). Finally, binding sites and weight matrices for the Prd homeodomain homodimer were taken from Ref. (37), whereas binding data for the rest of Drosophila melanogaster TFs were collected by E. D. Siggia and E. Emberly (unpublished data).
Protein–DNA interaction model based on the number of interface atomic contacts
In order to test our G and PWM predictions we developed a simple null model that exploits the structure of the protein–DNA complex but does not require any detailed predictions of protein–DNA energetics. This so-called ‘contact’ model constructs a weight matrix from the consensus DNA sequence and the number of atomic contacts N between protein side chains and DNA base pairs. In particular, we assume that the three non-consensus bases occur with equal probabilities, whereas the consensus base is favored over any non-consensus base by (N/Nmax) if N Nmax. If N > Nmax, the consensus base becomes absolutely conserved:
Probabilities defined by Equation 12 are converted into energies using:
Significance test of PWM predictions
Statistical significance of PWM predictions is estimated using the -test, which is a generalization of the well-known 2-test (38):
Binding free energy predictions
All-atom free energy models
DNA-binding proteins employ two complementary mechanisms of binding site recognition. The intermolecular readout mechanism is based on direct interactions of protein side chains with DNA bases, whereas the intramolecular readout mechanism involves sequence-specific deformation of the DNA site by the bound protein. We developed a free energy function that takes both these mechanisms into account. Interactions of protein side chains with DNA are modeled using an all-atom representation of both protein and DNA, including all the hydrogen atoms. The protein–DNA interaction energy is a weighted sum of terms describing shape complementarity and packing at the interface, polar interactions (electrostatics and hydrogen bonds), van der Waals forces and solvation energies (Equation 3). The DNA conformational energy is calculated using a reduced geometric representation in which DNA bases and base pairs are represented by rigid bodies and their mutual orientation serves as a measure of deviation from the B-form DNA. The DNA conformational energy is a weighted sum of two terms describing base pairing and stacking of consecutive base pairs. Using this free energy function, we developed the following approach to predicting protein–DNA binding affinities. First, a suitable protein–DNA complex is identified as a structural template for computational modeling. Second, each novel DNA sequence (i.e. from a set for which experimental-binding affinity measurements are available) is threaded onto the DNA phosphate backbone with fixed DNA torsional angles. The result of this procedure is a set of initial structural models with novel DNA sequences. Third, binding free energies, G, for each member of the set are computed in either of two different ways.
One approach, which we shall call the static model, does not allow any side chain or DNA conformational rearrangements in the protein–DNA complex. The free energy is computed once for each initial model, and the difference in binding affinity between mutant and wild-type DNA sequences is calculated as where Gprot–dna is the free energy of the protein–DNA complex. The relative weights of the free energy terms in Equation 2 are found by the least-squares fit to a set of experimentally observed protein–DNA binding affinities. The experimental dataset used in the fitting consists of 11 series of G measurements with a total of 196 data points (Table 1). For each series of measurements there is a crystal structure of the protein–DNA complex with 3.0 Å resolution used as a template for base pair threading and binding affinity predictions. The set of weights obtained in this way is cross validated by removing parts of the dataset and refitting the weights. Very similar weights are obtained in each case (data not shown). The ratio of protein–DNA to DNA conformational energies obtained through the least squares fit to experimental G data is necessarily averaged over protein families included in the fit. Although there are not enough data to carry out separate fits for each protein family, restricting the fit to proteins known to significantly bend and twist DNA results in a larger contribution of the DNA conformational energies.
In the other approach, called the dynamic model, we minimize the total free energy of the protein–DNA complex starting from the initial model. The protein backbone stays fixed during minimization, whereas the torsional angles of DNA and interface side chains are allowed to relax (the protein–DNA interface is defined based on amino acid-dependent distance cutoffs). The conformational search used in Gprot–dna minimization consists of 10 two-step iterations. (i) Simulated annealing of amino acid side chains at the protein–DNA interface with side chains represented as discrete backbone-dependent rotamers (39) on a fixed protein backbone, and frozen DNA conformation. (ii) Continuous minimization of amino acid side chains at the protein–DNA interface together with simultaneous conformational relaxation of DNA. Amino acid side chains are no longer represented by rotamers at this step.
Experimental binding affinity data available to us are insufficient to reliably fit the weights by iterations to self-consistency when conformational rearrangement is allowed. Instead, we obtain the weights for components of the protein–DNA free energy function by maximizing the recovery of native amino acid side chains at all interface positions in a non-homologous set of protein–DNA complexes. In other words, we adopt a strategy used in protein sequence design in which rotamer conformations for all amino acids are substituted at all interface positions, and the probability of native amino acids is maximized by varying the weights (20,40). Similar to the static model, the ratio of the protein–DNA energy to the DNA conformational energy is expected to be protein family dependent and was estimated on average by requiring that the typical fluctuations from the equilibrium shape observed in the database of protein–DNA complexes be on the order of RT: (overbar denotes an average over all base step/base pair types; we neglect off-diagonal coupling of the effective degrees of freedom). The estimated ratio of the protein–DNA to intra-DNA energies is consistent with the non-iterative least squares fit to experimental G data from Table 1. In the dynamic model, G is calculated as , where is the minimized free energy of the protein–DNA complex, and is the reference free energy of the unbound DNA ( since the protein sequence is fixed). is computed by continuous minimization of the DNA conformation in the absence of the protein. Including separately minimized unbound DNA rather than ideal B-DNA as a reference state was found to be beneficial in most cases where DNA was not significantly distorted from its equilibrium shape. However, in several more extreme cases, such as BamHI endonuclease and the PU.1 ETS domain, local DNA minimization in the absence of bound protein was found to be insufficient for relaxing DNA conformation and was omitted from the model. Better conformational sampling might be provided by simulated annealing of DNA degrees of freedom.
Measures of prediction success and the contact model
We use three alternative measures to assess the quality of G predictions: a linear correlation coefficient r, an average unsigned error between predicted and experimental binding free energies, and a fraction of correct predictions F. Although the first two measures are computed using standard formulae, the fraction of correct predictions is based on a binary function: a prediction is considered to be correct if both Gcomp and Gexp are 1.0 kcal/mol, or else separated by to a 5-fold reduction in binding affinity at room temperature. G predictions are labeled as correct if they are successfully classified to be favorable or destabilizing, even if the absolute magnitudes of binding energies are not perfectly reproduced.
We compare calculations of binding energies described above with predictions based on a simple contact model. Instead of modeling detailed energetics of protein–DNA complexes, the contact model uses only the consensus sequence and the number of protein–DNA base atomic contacts at the binding interface (see Methods). The energy penalty for each mutation from the consensus sequence is given by Equation 13; it is a function of Nmax (the minimum number of contacts at which a given consensus base is assumed to be absolutely conserved, cf. Equation 12) and of Emax (the maximum energy penalty for a mutation of the consensus base). Nmax and Emax are adjusted to simultaneously maximize the fraction of correct predictions F and minimize the average error over the G dataset. The latter requirement is necessary since many Nmax/Emax pairs result in similar values of F. The minimum value of = 1.73 kcal/mol is obtained for Nmax = 15 and Emax = 3.0 kcal/mol. We find that the contact model provides a stringent test of more complicated models because, as demonstrated below, it is fairly successful in binding affinity and weight matrix predictions.
G predictions and weight fitting
The extent to which the static model reproduces experimental G measurements from the 196 point dataset used for static model weight fitting is shown in Figure 1A (closed circles). The overall correlation between experimental and predicted binding free energies is 0.57 (Table 3). This is better than the correlation of 0.42 predicted with the contact model (Figure 1A, red triangles). The number of mutations from the consensus sequence is also shown (Figure 1A, green triangles). With the static model, the fraction of correct predictions is 73%, somewhat higher than the 70% fraction for the contact model. The average unsigned error between experiment and prediction () is 1.58 kcal/mol for the static model. Most of the errors (with an exception of the repressor) occur when the experimentally measured reduction in binding affinity is underpredicted by our method. This will lead to false positive hits in genomic sequence scans, but will not miss true binding sites. Finally, we observe that including DNA conformational energy into the model is beneficial on average even though the relative degree of its importance is probably protein family specific: prediction quality decreases when it is excluded from the fit (Table 3).
Figure 1B shows the dynamic model predictions for the same G dataset. Unlike the static model, relative weights of energy terms are not adjusted using these data. Overall, the prediction quality is somewhat decreased compared with the static model, both in terms of the correlation coefficient and the fraction of correct predictions (Table 3). There is a striking increase in the average error (2.90 kcal/mol versus 1.58 kcal/mol), caused by the absence of the least squares fit to the data. Indeed, the error decreases considerably when such a fit is carried out (Table 3). The decision to ‘turn on’ conformational rearrangements and energy minimization depends on several crucial factors, such as the quality of the structural template (e.g. its X-ray resolution), the degree of DNA bending (in some extreme cases such as IHF in E.coli DNA is distorted so much that our effective quadratic description of DNA conformational energies may be no longer accurate), and the necessity to model changes in DNA shape caused by amino acid substitutions at the protein–DNA binding interface. It is remarkable that in most cases where the protein–DNA crystal structure is available, reasonable quality G predictions can be made simply by computing the free energy of the initial, non-minimized complex. The static model is very efficient computationally and is nearly pairwise additive by construction, making possible rapid scans of longer genomic sequences for high-affinity binding sites. However, it is more likely to fail when DNA conformational change is required to predict novel binding sites by homology, because the DNA shape from the structure will always favor the original binding site sequence unless it is allowed to relax.
In Figure 2 we show four binding free energy predictions with the static model. Static model weights are not fitted on this dataset. The first prediction is for Ndt80, the primary transcriptional activator of the middle sporulation genes in budding yeast (43). Ndt80 binds to the middle sporulation element (MSE) with the gNCRCAAAW consensus sequence found in promoters of middle sporulation genes. Its in vitro binding affinity was studied by Pierce et al. (34) for a number of one-point mutations from the GTCACAAAT MSE variant and its flanking sequence. The measurements were carried out using the electrophoretic mobility shift assay (EMSA) and reported as ratios of protein concentrations bound to the mutant sequence and the wild-type sequence. We converted these ratios into changes in free energy using the Boltzmann formula, and carried out binding affinity predictions starting from the Ndt80 crystal structure with 1.4 Å resolution and the GACACAAAA site. The correlation between theoretical predictions and experimental measurements is 0.67, the average error is 0.58 kcal/mol, and the fraction of correct predictions is 19/26. When the dynamic model is used for predictions the correlation improves to 0.74, but the fraction of correct predictions drops to 15/26 (data not shown).
For our third and fourth predictions the only available structural templates were solved by NMR rather than X-ray crystallography. AtERF1 is the ERF DNA-binding domain from Arabidopsis, which mediates gene regulation by the plant hormone ethylene (45). c-Myb is a product of the mouse protooncogene c-myb essential in proliferation and differentiation of hematopoietic cells (46). In both cases the binding affinity data were obtained by using EMSA (45,47). Surprisingly, the static model is capable of reasonable quality predictions (Figure 2). The fraction of correct predictions is 12/21 for AtERF1 and 20/27 for c-Myb. All incorrect predictions classify corresponding mutations as too favorable, probably as a result of experimental uncertainty in atomic coordinates of NMR structures. This observation is confirmed by a significant drop in prediction quality when the DNA conformation is allowed to relax (data not shown). We generally find that attempts at refinement of side chain and DNA atomic positions starting from NMR structures do not lead to better G and weight matrix predictions, in agreement with a previous observation that X-ray structures are consistently more accurate than NMR-derived structures (48).
DNA conformational energy and sequence specificity
Finally, we studied how well the DNA base step energy captures sequence specificity owing to indirect readout in BamHI endonuclease (49) and the PU.1 ETS domain (50). Engler et al. (49) mutated 3 bp sequences flanking the GGATCC core BamHI recognition site and measured resulting binding free energy changes. Inspection of the protein–DNA structure reveals that outside of the core binding site sequence specificity is mostly imparted by protein-phosphate backbone contacts and thus should be ‘recorded’ in the DNA shape. Using the 2.2 Å crystal structure of the BamHI/DNA complex as a modeling template we were able to reproduce experimental binding affinities measured for flanking sequence mutations with a correlation coefficient of 0.63 (Figure 3, left panel). Interestingly, this prediction requires DNA minimization in the context of the protein–DNA complex, probably owing to DNA geometry artifacts in the initial structure (the correlation drops to 0.24 if DNA is not allowed to relax). For the PU.1 ETS domain, up to 3 bp upstream and/or 2 bp downstream of the GGAA core recognition site were mutated and corresponding free energy changes measured using EMSA (50). We again observe that most of the flanking sequence specificity is due to the changes in DNA shape conferred by protein-phosphate backbone contacts and captured by the DNA base step energy (Figure 3, right panel). DNA minimization is not as critical in this case (the correlation is 0.69 for the static model and 0.70 for the dynamic model).
Additivity in protein–DNA interactions
The assumption of independence of DNA base pair probabilities at each position in the binding site forms a basis of the weight matrix description of binding specificity (1,2). Independence of DNA base pair probabilities implies that the binding energy of a given DNA sequence is a sum of energies associated with each base pair (Equation 8). Although there is direct experimental evidence indicating that pairwise additivity is a reasonable approximation for , Cro and Mnt repressors (51–53), there are also reports in the literature emphasizing the importance of dinucleotide and higher order correlations (54,55). Nonetheless, it is generally believed that pairwise additive energies do provide a reasonable approximation to true protein–DNA interaction energies (1,56).
In protein–DNA energy, non-additivity can only arise in the dynamic model because all atomic potentials are pairwise. When conformational rearrangement is allowed, the degree of additivity will depend on the range of protein–DNA interactions. For example, long-range electrostatic interactions may cause more deviations from pairwise additivity than relatively short-range van der Waals and hydrogen bonding interactions. The degree of conformational change at the protein–DNA interface in turn depends on the quality of experimental structures and the number of water molecules at the interface (as they are not explicitly modeled in our approach). In DNA conformational energy the base stacking term is non-additive by construction. If, however, the total binding energy predicted by our model turns out to be approximately pairwise additive, the search for binding sites in genomic sequence can be considerably simplified by constructing a weight matrix or a table of energies for all one-point mutations from the consensus sequence, instead of independently computing G values for each putative binding site.
In Figure 4, we compare protein–DNA binding energies computed directly and by assuming pairwise additivity. In the latter approach, the protein–DNA binding energy change caused by an arbitrary number of binding site mutations is computed as a sum over energy changes associated with one-point mutations from the consensus sequence:
In Table 4 we show Z-scores for the binding site from the protein–DNA complex (ZPDB) and Z-scores averaged over all sites from Table 2 (Zsite). In addition, binding energies with sites from protein–DNA complexes are ranked for all TFs with L quality of predictions is quite good, consistent with our previous G predictions. Interestingly, the energy of the binding site from the protein–DNA structure is more favorable than the average energy of all binding sites in all cases except Trl (1yui [PDB] ; Table 4), showing that most experimentally characterized binding sites have lower affinities than the consensus site. Nonetheless, almost all of the inspected sites have highly favorable binding energies and thus low Z-scores. One notable exception is the integration host factor protein (Ihf) for which the average Z-score for 27 sites from Table 2 is only –1.12. The large discrepancy between the average Z-score and the PDB Z-score might be explained in this case by the major role of indirect readout in Ihf binding: the crystal structure of the Ihf–DNA complex shows that DNA is bent almost 180° and has relatively few direct contacts between side chains and DNA bases, especially in the 5' region of the binding site. The experimental PWM is relatively non-specific, and different sites or groups of sites might utilize significantly different binding modes that are not captured well by our approach.
Ranking binding sites from the protein–DNA structure versus ensembles of random sites provides further illustration of the accuracy of our predictions: for example, the native binding site of the Ndt80 (1mnn [PDB] ) is 14th out of 16 777 216 sequences, whereas in Gcn4p (1ysa) the native binding site is the lowest in energy among 16 384 sequences (Table 4). Similar to G predictions, binding site discrimination from random sequences strongly depends on the quality of the structural template: Trl (1yui) native site is ranked only 91st out of 16 384, most probably because the structure of the protein–DNA complex was determined by NMR.
Results from the previous section show a reasonable degree of additivity in protein–DNA binding free energy predictions, most probably because of the limited role of long-range interactions in our model. Therefore, we can convert binding energies into weight matrices without significant loss of information and test our PWM predictions against experimental data. Similar to G predictions, we compare all-atom static and dynamic models with the simpler contact model, which uses the number of atomic contacts between protein side chains and DNA base pairs as a measure of binding specificity (Equation 12).
Experimental PWMs and prediction testing
We constructed experimental PWMs for 20 TFs using several alternative approaches. PWMs for R (1lmb), CroR (6cro), c-Myb (1mse) and AtERF1 (1gcc) were created by converting experimental binding free energy measurements for all one-point mutations of the binding site (Table 1) into probabilities using the Boltzmann formula at room temperature. The resulting PWMs are then constructed in a way directly comparable to computational predictions. Another and more commonly used method for constructing PWMs is based on the alignment of binding sites obtained from either a SELEX experiment or a set of promoter sequences of genes regulated by the TF (2). The quality of weight matrices obtained from such alignments depends on the number of sequences used in the alignment. In our 20 PWM dataset shown in Table 2, 4 PWMs are created from G measurements, 14 PWMs are based on genomic sites available from the literature, the Zif268 PWM is from a SELEX experiment (31), and the EcR/Usp PWM comes from a combination of SELEX (57) and genomic sites. PWM predictions are analyzed using the -test (Equation 14). (p, q) is a non-negative measure of the ‘goodness of fit’ between computational probabilities and experimental frequencies (38). It is a monotonic function of prediction quality.
Free parameters of PWM models
The contact model replaces detailed calculations of protein–DNA energetics with an assumption that the probability of each base in the consensus sequence is directly proportional to the number of contacts made between the base pair and all protein side chains (Equation 12). As in the binding free energy contact model, protein–DNA contacts are defined as protein–DNA base atomic pairs within Rmax = 4.5 Å; contacts to the phosphate backbone are ignored. As the number of contacts N increases, the consensus base becomes more and more specific, with the rest of the probability evenly divided between the other three bases. The probability of the consensus base becomes 1 for N = Nmax, and all other probabilities become 0. Given Rmax, Nmax is a free parameter to be adjusted by minimizing (p, q) averaged over a subset of TFs from Table 2. PWMs constructed from fewer than 10 binding site sequences [including Gcn4p (1ysa), Trl (1yui) and DnaA (1j1v)] are removed from the Nmax fit. The average value of is at a minimum for Nmax = 20.
Using the Boltzmann formula to convert energies into probabilities involves an inverse temperature factor ß = 1/RT, which can also be viewed as an adjustable scaling factor. The specificity of PWM predictions depends on ß: at low temperatures only a few lowest energy binding sites contribute to weight matrix probabilities, whereas at high temperatures a broader spectrum of sites is included. Therefore, incorrect predictions of low-energy sites will result in higher fitted temperatures. The static model fit with the 17 TF dataset used for adjusting Nmax in the contact model gives ß = 2.25 (kcal/mol)–1. For the dynamic model fit, we additionally exclude all NMR structures (1gcc and 1mse), crystal structures with >2.5 Å resolution (6cro, 2drp, 1run and 2puc) and the Ihf–DNA complex (1ihf) because its DNA conformation cannot be reasonably expected to be modeled by relaxation with the quadratic DNA potential. The dynamic model fit over 10 remaining TFs results in ß = 0.75 (kcal/mol)–1.
PWM predictions and comparison with experiment
The fitted values of Nmax and ß are used to make PWM predictions for all 20 TFs (16 for the dynamic model as 3 NMR structures and Ihf are excluded). In Table 5 we show values of computed using the contact, static and dynamic models. We compare model predictions with the average value of for an ensemble of randomly generated weight matrices (random column in Table 5). is lower than the corresponding random value if a prediction is successful. For the contact model the average value of over all TFs is 0.19, significantly better than the random value of 0.70. For the static and dynamic models the average over all TFs increases to 0.23 and 0.35, respectively. Note, however, that even for the least successful predictions (1b8i for the contact model, 1mj2 for the static model and 2puc for the dynamic model) is much smaller than the corresponding random (Table 5). Surprisingly, it is the contact model that is the most successful on average: the static model has a lower in only 6 cases out of 20. This finding demonstrates that PWM predictions may not require detailed models of protein–DNA energetics if native protein–DNA complexes are available. Furthermore, allowing conformational change generally makes predictions worse, consistent with our earlier observations regarding G predictions.
The -test provides only an average measure of success and does not necessarily reflect all relevant details of probability distributions in specific columns. Hence it is useful to analyze several PWM predictions in more detail. In Figure 5, PWM WebLogo representations (58) are shown for 2 TFs: Ndt80 (A) and Zif268 (B). The total height of each weight matrix column is constant and the relative height of each letter is proportional to its probability in a given PWM column. All bases are sorted by probability. For each panel in Figure 5 the top logo is the experimental PWM and the other three logos are predictions with the contact model, the static model and the dynamic model, respectively.
For Ndt80, is 0.14 for the contact model, 0.12 for the static model and 0.20 for the dynamic model (Table 5). Figure 5A shows that relative specificities of C at position 5 and A/g at position 6 are best reproduced with the static model (the contact model underpredicts specificity of C5 and overpredicts specificity of A6, whereas in the dynamic model both C5 and A6 are insufficiently conserved). All three computational models overpredict specificity of guanine at position 3 and underpredict specificity of adenine at position 9. Specificity of adenine at position 8 is predicted best with the contact model. Our second prediction is for the Zif268 zinc finger with the experimental PWM from a SELEX experiment (Figure 5B) (31). The value of is 0.13, 0.19 and 0.35 for the contact, static and dynamic models, respectively. Because almost all positions in the experimental PWM are strongly conserved, the contact model with the appropriate Nmax cutoff is capable of a near perfect prediction of experimental results. The high degree of specificity in the SELEX experiment could be an artifact of the stopping criteria in the selection rounds. The main advantage of more complicated models is in making some columns less specific in better agreement with the experimental data, notably G/a at position 5. Interestingly, the -test yields 0.13 for the weight matrix predicted using zinc finger energy matrices from Benos et al. (15), comparable to our predictions with contact and static models.
In summary, reliable PWM predictions can be carried out if the native structure of the protein–DNA complex is used as a starting point for computational modeling. The contact model which assigns base pair specificity based on the number of atomic contacts between protein amino acids and DNA bases is surprisingly successful. In some cases, the static model provides the best results: although the agreement with the experimental data is somewhat worse on average compared with the contact model, core motif probabilities are often better reproduced (Figure 5). Finally, the dynamic model is typically worse than the others, especially in cases where a high-resolution crystal structure of the protein–DNA complex is not available, or where ordered water molecules mediating protein–DNA interactions play a significant role. Water molecules are not explicitly modeled in our approach and, thus, their removal from the interface may result in artificially increased conformational freedom of neighboring bases and side chains. Limited utility of side chain conformational rearrangement in computational modeling of intermolecular binding interfaces was previously noted in the context of protein–protein complexes (59).
Binding affinity and specificity predictions described in the previous sections require native structures of protein–DNA complexes. However, even in well-studied organisms such as S.cerevisiae and D.melanogaster there are only 10–15 suitable structures in the database. Furthermore, in most cases experimentally available protein–DNA complexes are not focused on any specific biological pathway (such as regulation of the Drosophila segmentation gene network), being instead distributed across a range of regulatory pathways and cell types. Hence the ability to model protein–DNA interactions by homology is crucial to future practical applications of our approach.
A suitable modeling template is initially identified by sequence similarity to protein–DNA complexes in the structural database using Ginzu (60). Besides sequence similarity, the quality of experimental structures (such as X-ray resolution or missing atoms) is taken into account. Amino acid substitutions at the DNA-binding interface are identified by a sequence–structure alignment with the template using K*Sync (60) or ClustalW (61). The approximate pairwise additivity of base pair energies and the relatively short range nature of the free energy function make it possible to ignore amino acid substitutions elsewhere because they are less likely to mediate protein–DNA interactions. An obvious consequence of this assumption is that binding specificity does not change if all amino acids at the protein–DNA interface are conserved. To give a specific example, essentially all DNA contacting amino acids are conserved in the Rel homology region family (8) and, thus, binding sites for the embryonic polarity protein dorsal in fly bear a strong resemblance to nuclear factor B sites from mouse and human [e.g. a dorsal weight matrix from Ref. (62) gives the T-G/T-G-G-A/T-T-T-T-T/C-C-C consensus sequence, very close to the T-G-G-G-A-A-T-T-C-C-C binding site from the structure of the mouse nuclear factor B p50 homodimer bound to DNA]. A classification of protein–DNA complexes into families and subfamilies has been carried out on the basis of identical DNA contacting amino acids (8). The definition of interface amino acids is somewhat arbitrary: typically, amino acids are considered to be at the interface based on distance cutoffs and/or visual identification of hydrogen bonds, van der Waals contacts and favorable electrostatic interactions with DNA atoms. We adopt a simple interface definition based on the 4.5 Å cutoff between protein side chain atoms and DNA base or phosphate backbone atoms.
In Figure 6 we show PWM predictions by homology for two TFs from the Drosophila segmentation gene network: the maternal factor bicoid (Bcd) and the gap factor giant (Gt) (63). Bcd is a homeodomain TF; its experimentally characterized binding sites suggest monomeric binding. The experimental PWM for Bcd obtained from the alignment of 30 binding sites shows a strongly conserved C/T-T-A-A-T-C-C/T-C/G consensus sequence. We use structures of the Engrailed homeodomain–DNA wild-type complex (3hdd) and its Q50K mutant (2hdd) as modeling templates (64,65). The amino acid at position 50 is an important specificity determinant for the 2 bp 3' of the core TAAT homeodomain motif: Lys-50 makes favorable contacts with two guanines complementary to the CC dinucleotide immediately 3' of TAAT, contributing to the difference between Q50 wild-type Engrailed (GTAATTAC) and Engrailed Q50K (GTAATCCC) binding sites from the structures.
Owing to the differences between Bcd- and En-binding specificities, the contact model is not very successful in predicting the bicoid PWM starting from the wild-type En–DNA complex. The dynamic model reproduces PWM columns 8 and 9 significantly better, but in column 7 adenine is favored over cytosine, and in column 4 adenine is mixed with guanine to the extent not corroborated by the experiment (Figure 6A). One possible reason for this discrepancy is that the model does not fully reproduce experimentally observed conformational differences between wild-type and Q50K DNA sites (64). This difference is reflected in wild-type Engrailed DNA conformational energies which favor T/A at position 7 and G at position 8 (Figure 6A; the static model DNA weights are multiplied by 5 as before). Better DNA conformational sampling should improve the accuracy of PWM predictions with homology models.
Giant is a TF from the leucine zipper family. It binds DNA as a homodimer, with the TTAC consensus motif at positions 3–6 and its inverted complement GTAA at positions 7–10 (Figure 6B). The experimental PWM for Gt is constructed using just 7 binding sites and thus might be too specific at positions 1 and 2 and 11 and 12, where there are few contacts with protein side chains (as is evident from the contact model prediction). We used the 1.8 Å resolution crystal structure of the human nuclear factor NF-IL6 bound to the GATTGCGCAATA site (1gu4) as the homology modeling template. For each monomer in the homodimer there are 5 amino acid mutations at the protein–DNA binding interface: 1 DNA base contact and 4 phosphate backbone contacts. However, only one of these contacts is non-homologous (K contacting phosphate backbone in NF-IL6 is mutated to V in Gt), suggesting that NF-IL6- and Gt-binding specificities should not be very different. The mutated LYS makes contacts with the phosphate group joining a dinucleotide complementary to TG at positions 4 and 5 in one chain, and to CA at positions 8 and 9 in the other chain (Figure 6B, panel 2 from top; column numbers are from the experimental PWM). In accordance with this observation, comparison of the experimental PWM for Gt with the contact model shows that the specificity change is largely restricted to positions 5 (GA) and 8 (CT). By construction, the contact model is incapable of accommodating this change, whereas static and dynamic models are partially successful, but at the price of making the prediction at position 6 in the experimental PWM less specific (Figure 6B).
The Bcd and Gt examples described above are representative of the future applications of our approach. Binding site specificity predictions based on structural modeling can be used in conjunction with existing bioinformatic algorithms to study regulatory gene networks in many species. Even though the requirement of having homologous structures of protein–DNA complexes with a limited number of mutations at the protein–DNA binding interface is much less restrictive, it is still likely to be the main limitation of our approach. PWM prediction by homology becomes less tractable if the number of mutations at the interface is >2 or 3, probably because the current implementation of our model does not allow protein backbone degrees of freedom to relax. Modeling TF binding specificities using distant homologs may require including rigid body motion of the TF and sampling over multiple docking conformations.
We developed a computational all-atom approach for predicting protein–DNA binding affinities and TF weight matrices. Protein–DNA energetics is described with the empirical free energy function that accounts for protein–DNA interactions (including electrostatics, solvation, hydrogen bonding, van der Waals interactions and packing) and distortion of the DNA shape caused by protein binding. Each term in the free energy function is multiplied by a weight which is adjusted to optimize the performance of the model on an experimental dataset. Free energy minimization and conformational rearrangement at the protein–DNA binding interface are either not employed at all (static model), or limited to repacking interface side chains and DNA minimization (dynamic model). Protein–DNA docking orientation and protein backbone conformation are kept fixed during energy minimization. Our approach is computationally efficient and can be applied on the genomewide scale. We demonstrated its utility by carrying out a number of G and PWM predictions using native protein–DNA complexes as structural templates.
Proteins bind DNA in a sequence-specific manner by utilizing two distinct interaction mechanisms. The mechanism of direct readout is mediated by protein side chains directly contacting DNA base atoms. Favorable protein–DNA base contacts result in base pair preferences at corresponding positions in the binding site. The mechanism of indirect readout is mediated by side chain contacts with the DNA phosphate backbone. These contacts are typically as numerous as direct protein–DNA base contacts and can exploit DNA flexibility by twisting and bending it into the shape that fits best with the binding interface presented by the protein. Since some DNA sequences are more flexible than others, DNA conformational change confers additional sequence specificity to the binding site. In cases where indirect readout predominates, our model predicts a major contribution of the DNA conformational energy to the overall binding specificity (Figure 3).
None of the terms in the DNA base pair energy depend on neighboring base pairs in the absence of conformational rearrangement (except the base stacking energy) and, thus, DNA base pair energies are nearly independent in the static model and only weakly coupled in the dynamic model (Figure 4). Thus, we can convert binding affinity predictions for one-point mutations into weight matrix probabilities without significant information loss. Results in Table 5 demonstrate that we are reasonably successful in predicting experimental PWMs for a variety of TFs starting from the native protein–DNA complex. Surprisingly, a simple contact model based on the consensus sequence from the protein–DNA complex works quite well, even though specific examples in Figure 5 make evident some of its limitations compared with all-atom models of protein–DNA interactions.
The number of protein–DNA complexes currently available in the structural database is insufficient for modeling transcriptional regulation on a large scale. Therefore, the range of applicability of our approach depends on its accuracy in modeling TF binding specificities starting from homologous structures. Owing to the relatively short-range nature of our free energy function, it is sufficient to substitute amino acids only at the DNA-binding interface when creating protein–DNA homology models. Homology modeling should be easiest when there are no dissimilar amino acid substitutions at the interface, because in many instances TFs with conserved interfaces have identical binding specificities. Our model makes accurate predictions in such cases, but changes in binding specificity resulting from amino acid mutations are often predicted less accurately (Figure 6). Refinement of protein–DNA interfaces is a challenging problem which is strongly affected by the quality of experimental structural data and the presence of ordered water molecules mediating interactions across the protein–DNA binding interface. Homology-based predictions should be improved if interface waters are explicitly modeled and multiple protein docking conformations are allowed. Furthermore, homology modeling should be aided by better DNA conformational sampling (i.e. using simulated annealing techniques rather than minimization towards the nearest local minimum).
In summary, the computational algorithm developed here is useful for binding affinity and weight matrix predictions if either a native structure of the protein–DNA complex or its sufficiently close homolog is available. Unlike previously reported knowledge-based approaches (15,16), our algorithm is not limited to any specific TF family and is not as data intensive. However, its accuracy strongly depends on the quality of the experimental structure used as the modeling template, and the number of amino acid substitutions at the DNA-binding interface. In future, we intend to combine structurally predicted PWMs with motif detection algorithms in order to identify TF binding sites on the genomic scale.
The protein–nucleic acid interaction module is implemented in C++ in the ROSETTA software package (http://www.bakerlab.org). ROSETTA software package is freely available to academic users. We hope to foster further development of computational algorithms for protein–DNA binding specificity predictions by providing all experimental datasets used in this study (including G measurements, PWMs and TF binding sites) as Supplementary Data.
Supplementary Data are available at NAR Online.
The authors would like to thank Jonathan Widom for careful reading of the manuscript and many useful suggestions, and Eldon Emberly for providing experimental binding data for several TFs. A.V.M. would like to acknowledge a fellowship from the Leukemia and Lymphoma Society. J.J.H. is a fellow of the Jane Coffin Childs Memorial Fund for Medical Research. Funding to pay the Open Access publication charges for this article was provided by National Institutes of Health (NIH).
Conflict of interest statement. None declared.
Supplementary Data are available at NAR Online.
The authors would like to thank Jonathan Widom for careful reading of the manuscript and many useful suggestions, and Eldon Emberly for providing experimental binding data for several TFs. A.V.M. would like to acknowledge a fellowship from the Leukemia and Lymphoma Society. J.J.H. is a fellow of the Jane Coffin Childs Memorial Fund for Medical Research. Funding to pay the Open Access publication charges for this article was provided by National Institutes of Health (NIH).
Conflict of interest statement. None declared.
Figure 1 G predictions (ddGcomp) versus experimental measurements (ddGexp). (A) Static model binding energy predictions for the set of experimental measurements used in fitting static model weights. (B) Dynamic model binding energy predictions for the same set of experimental measurements. Closed circles, static/dynamic model; red triangles, contact model; green triangles, number of mutations from the consensus sequence. r1-Linear correlation coefficient for the static/dynamic model and r2-linear correlation coefficient for the contact model. Three Zif268 datasets from Table 1 [two for Zif268 wild-type (42,67) and one for Zif268 D20A mutant (67)] are combined into one panel.
Figure 2 G predictions (ddGcomp) versus experimental measurements (ddGexp). Static model binding energy predictions for Ndt80, (34) MAT a1/2, (30) AtERF1 (45) and c-Myb. (47) Closed circles, static model; red triangles, contact model; green triangles, number of mutations from the consensus sequence. r1-Linear correlation coefficient for the static model and r2-linear correlation coefficient for the contact model.
Figure 3 Experimental binding affinities conferred by indirect readout can be explained with DNA conformational energies alone. Dynamic model predictions of DNA base step energies (ddGdna) versus experimental binding free energies (ddGexp) for BamHI endonuclease (49) and PU.1 ETS domain (50).
Figure 4 Degree of pairwise additivity in binding energies predicted with the dynamic model. Comparison of binding energies computed after making multiple base pairs substitutions (ddGcomp) with the sum of binding energies computed for corresponding one-point mutations of the DNA site from the protein–DNA structure (ddGpw).
Figure 5 PWM predictions for Ndt80 (A) and Zif268 (B). From top to bottom: experiment, contact model based on the consensus sequence and the number of protein–DNA contacts, static model and dynamic model (see text for details). PWMs are displayed using the uniform height WebLogo representation(58): the height of each letter in the column is proportional to its probability in the PWM.
Figure 6 PWM predictions by homology for D.melanogaster TF bicoid (Bcd) (A) and giant (Gt) (B). From top to bottom: (A) panel 1: experiment; panels 2–4: contact model, static model with full energy function and static model with DNA conformational energy only (with dna–bp and dna–bs weights multiplied by 5) using D.melanogaster Engrailed homeodomain Q50K (2hdd) as a template; panels 5–7: contact model, dynamic model with full energy function (reference DNA energies are not subtracted since DNA is bent in homeodomains) and static model with DNA conformational energy only (with dna–bp and dna–bs weights multiplied by 5) using D.melanogaster Engrailed wild-type homeodomain (3hdd) as a template. (B) Experiment, contact model, static model and dynamic model using Homo sapiens nuclear factor NF-IL6 (C/EBP-ß;1gu4) as a template. All amino acids substituted at the protein–DNA interface are repacked in the static model. PWMs are displayed using the uniform height WebLogo representation (58).
Table 1 Experimental binding affinity dataset
For protein–DNA structures solved by X-ray crystallography, resolution (Å) is shown in parentheses. The total number of G measurements are shown for each dataset, with the number of data points used in free energy function parameterization listed in parentheses.
Table 2 Experimental binding site and weight matrix dataset
For protein–DNA structures solved by X-ray crystallography, resolution (Å) is shown in parentheses. Nseq is the total number of aligned binding site sequences (including the DNA sequence from the protein–DNA complex).
aPWM is obtained from G data for all one-point mutations of the binding site.
bPWM is obtained separately from the binding sites listed in the table, by SELEX experiments or independently published surveys of genomic sites.
cAlignment of binding sites listed in the Nseq column is used to create an experimental PWM (binding site from the protein–DNA structure is not included into PWM; pseudocounts are set to 0.0).
Table 3 G predictions summary
r, Linear correlation coefficient; F, fraction of successful predictions (see text); and (kcal/mol), average unsigned error between predicted and experimental-binding affinities.
Table 4 Summary of binding site energy predictions
ZPDB is the Z-score (Equation 16) for the protein–DNA binding energy with the binding site found in the protein–DNA structure; Zsite is the average Z-score for protein–DNA binding energies with binding sites listed in Table 2; Rank is the rank of the binding energy for the structural site in the ensemble of 4L sequences (L is the binding site length). Rank was computed for all binding sites with L
Table 5 The -test of PWM predictions for the contact, static, dynamic and random models
Random model uses randomly sampled normalized PWM entries (see Methods). Three NMR structures and Ihf (1ihf) are excluded from the dynamic model predictions.