table of contents table of contents

A simple physical energy function, which uses electrostatics, solvation, hydrogen bonds and …

Home » Biology Articles » Biophysics » Molecular Biophysics » Protein–DNA binding specificity predictions with structural models » Results and Discussion

Results and Discussion
- Protein–DNA binding specificity predictions with structural models


Binding free energy predictions
All-atom free energy models
DNA-binding proteins employ two complementary mechanisms of binding site recognition. The intermolecular readout mechanism is based on direct interactions of protein side chains with DNA bases, whereas the intramolecular readout mechanism involves sequence-specific deformation of the DNA site by the bound protein. We developed a free energy function that takes both these mechanisms into account. Interactions of protein side chains with DNA are modeled using an all-atom representation of both protein and DNA, including all the hydrogen atoms. The protein–DNA interaction energy is a weighted sum of terms describing shape complementarity and packing at the interface, polar interactions (electrostatics and hydrogen bonds), van der Waals forces and solvation energies (Equation 3). The DNA conformational energy is calculated using a reduced geometric representation in which DNA bases and base pairs are represented by rigid bodies and their mutual orientation serves as a measure of deviation from the B-form DNA. The DNA conformational energy is a weighted sum of two terms describing base pairing and stacking of consecutive base pairs. Using this free energy function, we developed the following approach to predicting protein–DNA binding affinities. First, a suitable protein–DNA complex is identified as a structural template for computational modeling. Second, each novel DNA sequence (i.e. from a set for which experimental-binding affinity measurements are available) is threaded onto the DNA phosphate backbone with fixed DNA torsional angles. The result of this procedure is a set of initial structural models with novel DNA sequences. Third, binding free energies, {Delta}G, for each member of the set are computed in either of two different ways.

One approach, which we shall call the static model, does not allow any side chain or DNA conformational rearrangements in the protein–DNA complex. The free energy is computed once for each initial model, and the difference in binding affinity between mutant and wild-type DNA sequences is calculated as where Gprot–dna is the free energy of the protein–DNA complex. The relative weights of the free energy terms in Equation 2 are found by the least-squares fit to a set of experimentally observed protein–DNA binding affinities. The experimental dataset used in the fitting consists of 11 series of {Delta}G measurements with a total of 196 data points (Table 1). For each series of measurements there is a crystal structure of the protein–DNA complex with ≤3.0 Å resolution used as a template for base pair threading and binding affinity predictions. The set of weights obtained in this way is cross validated by removing parts of the dataset and refitting the weights. Very similar weights are obtained in each case (data not shown). The ratio of protein–DNA to DNA conformational energies obtained through the least squares fit to experimental {Delta}{Delta}G data is necessarily averaged over protein families included in the fit. Although there are not enough data to carry out separate fits for each protein family, restricting the fit to proteins known to significantly bend and twist DNA results in a larger contribution of the DNA conformational energies.

In the other approach, called the dynamic model, we minimize the total free energy of the protein–DNA complex starting from the initial model. The protein backbone stays fixed during minimization, whereas the torsional angles of DNA and interface side chains are allowed to relax (the protein–DNA interface is defined based on amino acid-dependent distance cutoffs). The conformational search used in Gprot–dna minimization consists of 10 two-step iterations. (i) Simulated annealing of amino acid side chains at the protein–DNA interface with side chains represented as discrete backbone-dependent rotamers (39) on a fixed protein backbone, and frozen DNA conformation. (ii) Continuous minimization of amino acid side chains at the protein–DNA interface together with simultaneous conformational relaxation of DNA. Amino acid side chains are no longer represented by rotamers at this step.

Experimental binding affinity data available to us are insufficient to reliably fit the weights by iterations to self-consistency when conformational rearrangement is allowed. Instead, we obtain the weights for components of the protein–DNA free energy function by maximizing the recovery of native amino acid side chains at all interface positions in a non-homologous set of protein–DNA complexes. In other words, we adopt a strategy used in protein sequence design in which rotamer conformations for all amino acids are substituted at all interface positions, and the probability of native amino acids is maximized by varying the weights (20,40). Similar to the static model, the ratio of the protein–DNA energy to the DNA conformational energy is expected to be protein family dependent and was estimated on average by requiring that the typical fluctuations from the equilibrium shape observed in the database of protein–DNA complexes be on the order of RT: (overbar denotes an average over all base step/base pair types; we neglect off-diagonal coupling of the effective degrees of freedom). The estimated ratio of the protein–DNA to intra-DNA energies is consistent with the non-iterative least squares fit to experimental {Delta}{Delta}G data from Table 1. In the dynamic model, {Delta}{Delta}G is calculated as , where is the minimized free energy of the protein–DNA complex, and is the reference free energy of the unbound DNA ( since the protein sequence is fixed). is computed by continuous minimization of the DNA conformation in the absence of the protein. Including separately minimized unbound DNA rather than ideal B-DNA as a reference state was found to be beneficial in most cases where DNA was not significantly distorted from its equilibrium shape. However, in several more extreme cases, such as BamHI endonuclease and the PU.1 ETS domain, local DNA minimization in the absence of bound protein was found to be insufficient for relaxing DNA conformation and was omitted from the model. Better conformational sampling might be provided by simulated annealing of DNA degrees of freedom.

Measures of prediction success and the contact model
We use three alternative measures to assess the quality of {Delta}G predictions: a linear correlation coefficient r, an average unsigned error {varepsilon} between predicted and experimental binding free energies, and a fraction of correct predictions F. Although the first two measures are computed using standard formulae, the fraction of correct predictions is based on a binary function: a prediction is considered to be correct if both {Delta}{Delta}Gcomp and {Delta}{Delta}Gexp are 1.0 kcal/mol, or else separated by to a ~5-fold reduction in binding affinity at room temperature. {Delta}G predictions are labeled as correct if they are successfully classified to be favorable or destabilizing, even if the absolute magnitudes of binding energies are not perfectly reproduced.

We compare calculations of binding energies described above with predictions based on a simple contact model. Instead of modeling detailed energetics of protein–DNA complexes, the contact model uses only the consensus sequence and the number of protein–DNA base atomic contacts at the binding interface (see Methods). The energy penalty for each mutation from the consensus sequence is given by Equation 13; it is a function of Nmax (the minimum number of contacts at which a given consensus base is assumed to be absolutely conserved, cf. Equation 12) and of Emax (the maximum energy penalty for a mutation of the consensus base). Nmax and Emax are adjusted to simultaneously maximize the fraction of correct predictions F and minimize the average error {varepsilon} over the {Delta}{Delta}G dataset. The latter requirement is necessary since many Nmax/Emax pairs result in similar values of F. The minimum value of {varepsilon} = 1.73 kcal/mol is obtained for Nmax = 15 and Emax = 3.0 kcal/mol. We find that the contact model provides a stringent test of more complicated models because, as demonstrated below, it is fairly successful in binding affinity and weight matrix predictions.

{Delta}{Delta}G predictions and weight fitting
The extent to which the static model reproduces experimental {Delta}{Delta}G measurements from the 196 point dataset used for static model weight fitting is shown in Figure 1A (closed circles). The overall correlation between experimental and predicted binding free energies is 0.57 (Table 3). This is better than the correlation of 0.42 predicted with the contact model (Figure 1A, red triangles). The number of mutations from the consensus sequence is also shown (Figure 1A, green triangles). With the static model, the fraction of correct predictions is 73%, somewhat higher than the 70% fraction for the contact model. The average unsigned error between experiment and prediction ({varepsilon}) is 1.58 kcal/mol for the static model. Most of the errors (with an exception of the {lambda} repressor) occur when the experimentally measured reduction in binding affinity is underpredicted by our method. This will lead to false positive hits in genomic sequence scans, but will not miss true binding sites. Finally, we observe that including DNA conformational energy into the model is beneficial on average even though the relative degree of its importance is probably protein family specific: prediction quality decreases when it is excluded from the fit (Table 3).  

The EcoRI endonuclease example clearly demonstrates why the fraction of correct prediction is a relevant measure of success. The linear correlation between prediction and experiment in a series of destabilizing mutations of the EcoRI endonuclease (41) is only 0.19, but all of the mutations are correctly predicted to be destabilizing (Figure 1A, EcoRI panel). The experiment shows binding free energy reduction in the 4.1–5.7 kcal/mol range, whereas static model predictions range from 0.83 to 12.21 kcal/mol (the latter prediction is for the inverted 6 bp consensus sequence: GAATTC -> CTTAAG; experiment shows 5.2 kcal/mol reduction in binding energy in this case). Two binding free energy changes are underpredicted at 0.83 and 1.06 kcal/mol, but the rest penalize unfavorable mutations with 1.5 kcal/mol or higher. Therefore, high-affinity sequences will be correctly identified by our method even though the degree of suppression of unfavorable mutations is not perfectly reproduced. In accordance with this observation, only 1 out of 13 predictions is labeled as incorrect in the binary measure. Another example is a series of seven destabilizing mutations of the zinc finger consensus sequence (Figure 1A, Zif268 panel) (42). Apparent binding dissociation constants (determined by a nitrocellulose filter binding assay) were too low to be measured accurately; the experiment only provides the lower bound of 3.13 kcal/mol on the binding energy reduction. Our calculations underpredict the magnitude of destabilization in two out of seven cases (giving 0.55 and 0.77 kcal/mol rather than ≥3.13 kcal/mol), but the other five mutations are correctly identified as being very unfavorable. The linear correlation coefficient is less relevant in this case since experimental binding affinities are not exactly known.

Figure 1B shows the dynamic model predictions for the same {Delta}{Delta}G dataset. Unlike the static model, relative weights of energy terms are not adjusted using these data. Overall, the prediction quality is somewhat decreased compared with the static model, both in terms of the correlation coefficient and the fraction of correct predictions (Table 3). There is a striking increase in the average error (2.90 kcal/mol versus 1.58 kcal/mol), caused by the absence of the least squares fit to the data. Indeed, the error decreases considerably when such a fit is carried out (Table 3). The decision to ‘turn on’ conformational rearrangements and energy minimization depends on several crucial factors, such as the quality of the structural template (e.g. its X-ray resolution), the degree of DNA bending (in some extreme cases such as IHF in E.coli DNA is distorted so much that our effective quadratic description of DNA conformational energies may be no longer accurate), and the necessity to model changes in DNA shape caused by amino acid substitutions at the protein–DNA binding interface. It is remarkable that in most cases where the protein–DNA crystal structure is available, reasonable quality {Delta}{Delta}G predictions can be made simply by computing the free energy of the initial, non-minimized complex. The static model is very efficient computationally and is nearly pairwise additive by construction, making possible rapid scans of longer genomic sequences for high-affinity binding sites. However, it is more likely to fail when DNA conformational change is required to predict novel binding sites by homology, because the DNA shape from the structure will always favor the original binding site sequence unless it is allowed to relax.

In Figure 2 we show four binding free energy predictions with the static model. Static model weights are not fitted on this dataset. The first prediction is for Ndt80, the primary transcriptional activator of the middle sporulation genes in budding yeast (43). Ndt80 binds to the middle sporulation element (MSE) with the gNCRCAAAW consensus sequence found in promoters of middle sporulation genes. Its in vitro binding affinity was studied by Pierce et al. (34) for a number of one-point mutations from the GTCACAAAT MSE variant and its flanking sequence. The measurements were carried out using the electrophoretic mobility shift assay (EMSA) and reported as ratios of protein concentrations bound to the mutant sequence and the wild-type sequence. We converted these ratios into changes in free energy using the Boltzmann formula, and carried out binding affinity predictions starting from the Ndt80 crystal structure with 1.4 Å resolution and the GACACAAAA site. The correlation between theoretical predictions and experimental measurements is 0.67, the average error is 0.58 kcal/mol, and the fraction of correct predictions is 19/26. When the dynamic model is used for predictions the correlation improves to 0.74, but the fraction of correct predictions drops to 15/26 (data not shown).

The second prediction is for the MAT a1/{alpha}2 homeodomain heterodimer. a1 and {alpha}2 TFs bind cooperatively to repress transcriptional activity of haploid-specific genes in diploid a/{alpha} cells (44). Jin et al. (30) investigated effects of mutations in the a1–{alpha}2 binding site on in vivo repression of a heterologous promoter assayed for ß-galactosidase activity in wild-type diploid a/{alpha} cells. The presence of the a1–{alpha}2 binding site in the promoter causes MAT a1/{alpha}2 dependent repression of lacZ expression. Repression ratios relative to wild type are converted into energies using the Boltzmann formula (see Methods) and compared to the experimental predictions. For the static model, the correlation coefficient is 0.45, the average error is 0.62 kcal/mol, and 44 out of 54 measurements are predicted correctly according to our definition. All but one of the incorrect predictions result in energies that are lower than corresponding experimental energies (Figure 2). The dynamic model is less successful in this case (r = 0.37; data not shown), probably because the X-ray structural template is only resolved to 2.5 Å and the 19 bp long binding site presents a challenge for DNA conformational modeling.

For our third and fourth predictions the only available structural templates were solved by NMR rather than X-ray crystallography. AtERF1 is the ERF DNA-binding domain from Arabidopsis, which mediates gene regulation by the plant hormone ethylene (45). c-Myb is a product of the mouse protooncogene c-myb essential in proliferation and differentiation of hematopoietic cells (46). In both cases the binding affinity data were obtained by using EMSA (45,47). Surprisingly, the static model is capable of reasonable quality predictions (Figure 2). The fraction of correct predictions is 12/21 for AtERF1 and 20/27 for c-Myb. All incorrect predictions classify corresponding mutations as too favorable, probably as a result of experimental uncertainty in atomic coordinates of NMR structures. This observation is confirmed by a significant drop in prediction quality when the DNA conformation is allowed to relax (data not shown). We generally find that attempts at refinement of side chain and DNA atomic positions starting from NMR structures do not lead to better {Delta}{Delta}G and weight matrix predictions, in agreement with a previous observation that X-ray structures are consistently more accurate than NMR-derived structures (48).

DNA conformational energy and sequence specificity
Finally, we studied how well the DNA base step energy captures sequence specificity owing to indirect readout in BamHI endonuclease (49) and the PU.1 ETS domain (50). Engler et al. (49) mutated 3 bp sequences flanking the GGATCC core BamHI recognition site and measured resulting binding free energy changes. Inspection of the protein–DNA structure reveals that outside of the core binding site sequence specificity is mostly imparted by protein-phosphate backbone contacts and thus should be ‘recorded’ in the DNA shape. Using the 2.2 Å crystal structure of the BamHI/DNA complex as a modeling template we were able to reproduce experimental binding affinities measured for flanking sequence mutations with a correlation coefficient of 0.63 (Figure 3, left panel). Interestingly, this prediction requires DNA minimization in the context of the protein–DNA complex, probably owing to DNA geometry artifacts in the initial structure (the correlation drops to 0.24 if DNA is not allowed to relax). For the PU.1 ETS domain, up to 3 bp upstream and/or 2 bp downstream of the GGAA core recognition site were mutated and corresponding free energy changes measured using EMSA (50). We again observe that most of the flanking sequence specificity is due to the changes in DNA shape conferred by protein-phosphate backbone contacts and captured by the DNA base step energy (Figure 3, right panel). DNA minimization is not as critical in this case (the correlation is 0.69 for the static model and 0.70 for the dynamic model).

Since most of the binding specificity for base pairs dominated by indirect readout can be captured by the DNA conformational energy, it is interesting to know what fraction of the base pairs in protein–DNA complexes has mostly phosphate backbone contacts. For these base pairs, the DNA conformational energy could be more important than the energies of direct interactions between protein side chains and DNA bases. A survey of protein–DNA contacts in structures from Figure 1 shows that out of 143 bp which have at least one contact with the protein, 54 bp have a ratio of phosphate backbone contacts to DNA base contacts of 3 or higher. When the static model {Delta}{Delta}G fit is restricted to mutations at positions with high backbone/base contact ratio, the fitted DNA weights are indeed several times greater than in the fit with all data points, indicating the increased importance of DNA conformational energies in this limit (data not shown).

Additivity in protein–DNA interactions
The assumption of independence of DNA base pair probabilities at each position in the binding site forms a basis of the weight matrix description of binding specificity (1,2). Independence of DNA base pair probabilities implies that the binding energy of a given DNA sequence is a sum of energies associated with each base pair (Equation 8). Although there is direct experimental evidence indicating that pairwise additivity is a reasonable approximation for {lambda}, Cro and Mnt repressors (5153), there are also reports in the literature emphasizing the importance of dinucleotide and higher order correlations (54,55). Nonetheless, it is generally believed that pairwise additive energies do provide a reasonable approximation to true protein–DNA interaction energies (1,56).

In protein–DNA energy, non-additivity can only arise in the dynamic model because all atomic potentials are pairwise. When conformational rearrangement is allowed, the degree of additivity will depend on the range of protein–DNA interactions. For example, long-range electrostatic interactions may cause more deviations from pairwise additivity than relatively short-range van der Waals and hydrogen bonding interactions. The degree of conformational change at the protein–DNA interface in turn depends on the quality of experimental structures and the number of water molecules at the interface (as they are not explicitly modeled in our approach). In DNA conformational energy the base stacking term is non-additive by construction. If, however, the total binding energy predicted by our model turns out to be approximately pairwise additive, the search for binding sites in genomic sequence can be considerably simplified by constructing a weight matrix or a table of energies for all one-point mutations from the consensus sequence, instead of independently computing {Delta}G values for each putative binding site.

In Figure 4, we compare protein–DNA binding energies computed directly and by assuming pairwise additivity. In the latter approach, the protein–DNA binding energy change caused by an arbitrary number of binding site mutations is computed as a sum over energy changes associated with one-point mutations from the consensus sequence:

where L is the length of the binding site in base pairs, si is the base at position i in sequence S, {alpha} = {A, C, G, T}, and if {alpha} = si and 0 otherwise. In the left panel of Figure 4 we plot the energies of all adjacent dinucleotide mutations of the Ndt80 binding site using the dynamic model (all terms in the static model except the base stacking energy are exactly pairwise additive). The protein–DNA interaction energies of the mutant sequences are quite well reproduced by the additive model. One discrepancy is in the absolute magnitude of binding energies which is predicted to be too large in the additive model. In the right panel of Figure 4, a similar comparison is carried out for a set of sequences including naturally occurring {lambda} repressor operators and multiple mutations of the OR1 site. The agreement is reasonable but somewhat less pronounced, perhaps reflecting the increased number of mutations from the OR1 consensus sequence. For each sequence in this set, {Delta}{Delta}G and measurements were also made (51). A similar plot using experimental {Delta}{Delta}G measurements shows better correlation (51), demonstrating limits of applicability of computational models to predicting the experimental degree of pairwise additivity.  
Binding site discrimination
In order to carry out successful predictions of genomewide transcriptional regulation, computational models of protein–DNA binding free energy should be able to discriminate TF binding sites from random ensembles of sequences. Since we demonstrated that the static model works best for {Delta}{Delta}G predictions, we assess its discriminatory power here by computing binding free energies of 16 TFs for which multiple DNA-binding sites are available (Table 2). Because pairwise additivity is nearly exact for the static model, we can compute binding energies of sites with arbitrary sequences using only binding energies for one-point mutations from the consensus sequence as input (Equation 15). The degree of discrimination of TF binding sites from random sites is given by the Z-score:
where {Delta}{Delta}G(S) is the binding energy of sequence S measured relative to the wild-type binding energy, <{Delta}{Delta}G> is the average binding energy for the ensemble of all possible 4L sequences (L is the length of the binding site) and {sigma} is the standard deviation for the same ensemble.

In Table 4 we show Z-scores for the binding site from the protein–DNA complex (ZPDB) and Z-scores averaged over all sites from Table 2 (<Zsite>). In addition, binding energies with sites from protein–DNA complexes are ranked for all TFs with L quality of predictions is quite good, consistent with our previous {Delta}{Delta}G predictions. Interestingly, the energy of the binding site from the protein–DNA structure is more favorable than the average energy of all binding sites in all cases except Trl (1yui [PDB] ; Table 4), showing that most experimentally characterized binding sites have lower affinities than the consensus site. Nonetheless, almost all of the inspected sites have highly favorable binding energies and thus low Z-scores. One notable exception is the integration host factor protein (Ihf) for which the average Z-score for 27 sites from Table 2 is only –1.12. The large discrepancy between the average Z-score and the PDB Z-score might be explained in this case by the major role of indirect readout in Ihf binding: the crystal structure of the Ihf–DNA complex shows that DNA is bent almost 180° and has relatively few direct contacts between side chains and DNA bases, especially in the 5' region of the binding site. The experimental PWM is relatively non-specific, and different sites or groups of sites might utilize significantly different binding modes that are not captured well by our approach.

Ranking binding sites from the protein–DNA structure versus ensembles of random sites provides further illustration of the accuracy of our predictions: for example, the native binding site of the Ndt80 (1mnn [PDB] ) is 14th out of 16 777 216 sequences, whereas in Gcn4p (1ysa) the native binding site is the lowest in energy among 16 384 sequences (Table 4). Similar to {Delta}{Delta}G predictions, binding site discrimination from random sequences strongly depends on the quality of the structural template: Trl (1yui) native site is ranked only 91st out of 16 384, most probably because the structure of the protein–DNA complex was determined by NMR.

PWM predictions
Results from the previous section show a reasonable degree of additivity in protein–DNA binding free energy predictions, most probably because of the limited role of long-range interactions in our model. Therefore, we can convert binding energies into weight matrices without significant loss of information and test our PWM predictions against experimental data. Similar to {Delta}{Delta}G predictions, we compare all-atom static and dynamic models with the simpler contact model, which uses the number of atomic contacts between protein side chains and DNA base pairs as a measure of binding specificity (Equation 12).

Experimental PWMs and prediction testing
We constructed experimental PWMs for 20 TFs using several alternative approaches. PWMs for {lambda}R (1lmb), CroR (6cro), c-Myb (1mse) and AtERF1 (1gcc) were created by converting experimental binding free energy measurements for all one-point mutations of the binding site (Table 1) into probabilities using the Boltzmann formula at room temperature. The resulting PWMs are then constructed in a way directly comparable to computational predictions. Another and more commonly used method for constructing PWMs is based on the alignment of binding sites obtained from either a SELEX experiment or a set of promoter sequences of genes regulated by the TF (2). The quality of weight matrices obtained from such alignments depends on the number of sequences used in the alignment. In our 20 PWM dataset shown in Table 2, 4 PWMs are created from {Delta}{Delta}G measurements, 14 PWMs are based on genomic sites available from the literature, the Zif268 PWM is from a SELEX experiment (31), and the EcR/Usp PWM comes from a combination of SELEX (57) and genomic sites. PWM predictions are analyzed using the {psi}-test (Equation 14). {psi}(p, q) is a non-negative measure of the ‘goodness of fit’ between computational probabilities and experimental frequencies (38). It is a monotonic function of prediction quality.

Free parameters of PWM models
The contact model replaces detailed calculations of protein–DNA energetics with an assumption that the probability of each base in the consensus sequence is directly proportional to the number of contacts made between the base pair and all protein side chains (Equation 12). As in the binding free energy contact model, protein–DNA contacts are defined as protein–DNA base atomic pairs within Rmax = 4.5 Å; contacts to the phosphate backbone are ignored. As the number of contacts N increases, the consensus base becomes more and more specific, with the rest of the probability evenly divided between the other three bases. The probability of the consensus base becomes 1 for N = Nmax, and all other probabilities become 0. Given Rmax, Nmax is a free parameter to be adjusted by minimizing {psi}(p, q) averaged over a subset of TFs from Table 2. PWMs constructed from fewer than 10 binding site sequences [including Gcn4p (1ysa), Trl (1yui) and DnaA (1j1v)] are removed from the Nmax fit. The average value of {psi} is at a minimum for Nmax = 20.

Using the Boltzmann formula to convert energies into probabilities involves an inverse temperature factor ß = 1/RT, which can also be viewed as an adjustable scaling factor. The specificity of PWM predictions depends on ß: at low temperatures only a few lowest energy binding sites contribute to weight matrix probabilities, whereas at high temperatures a broader spectrum of sites is included. Therefore, incorrect predictions of low-energy sites will result in higher fitted temperatures. The static model fit with the 17 TF dataset used for adjusting Nmax in the contact model gives ß = 2.25 (kcal/mol)–1. For the dynamic model fit, we additionally exclude all NMR structures (1gcc and 1mse), crystal structures with >2.5 Å resolution (6cro, 2drp, 1run and 2puc) and the Ihf–DNA complex (1ihf) because its DNA conformation cannot be reasonably expected to be modeled by relaxation with the quadratic DNA potential. The dynamic model fit over 10 remaining TFs results in ß = 0.75 (kcal/mol)–1.

PWM predictions and comparison with experiment
The fitted values of Nmax and ß are used to make PWM predictions for all 20 TFs (16 for the dynamic model as 3 NMR structures and Ihf are excluded). In Table 5 we show values of {psi} computed using the contact, static and dynamic models. We compare model predictions with the average value of {psi} for an ensemble of randomly generated weight matrices (<{psi}random> column in Table 5). {psi} is lower than the corresponding random value if a prediction is successful. For the contact model the average value of {psi} over all TFs is 0.19, significantly better than the random value of 0.70. For the static and dynamic models the average over all TFs increases to 0.23 and 0.35, respectively. Note, however, that even for the least successful predictions (1b8i for the contact model, 1mj2 for the static model and 2puc for the dynamic model) {psi} is much smaller than the corresponding random {psi} (Table 5). Surprisingly, it is the contact model that is the most successful on average: the static model has a lower {psi} in only 6 cases out of 20. This finding demonstrates that PWM predictions may not require detailed models of protein–DNA energetics if native protein–DNA complexes are available. Furthermore, allowing conformational change generally makes predictions worse, consistent with our earlier observations regarding {Delta}{Delta}G predictions.

The {psi}-test provides only an average measure of success and does not necessarily reflect all relevant details of probability distributions in specific columns. Hence it is useful to analyze several PWM predictions in more detail. In Figure 5, PWM WebLogo representations (58) are shown for 2 TFs: Ndt80 (A) and Zif268 (B). The total height of each weight matrix column is constant and the relative height of each letter is proportional to its probability in a given PWM column. All bases are sorted by probability. For each panel in Figure 5 the top logo is the experimental PWM and the other three logos are predictions with the contact model, the static model and the dynamic model, respectively. 

For Ndt80, {psi} is 0.14 for the contact model, 0.12 for the static model and 0.20 for the dynamic model (Table 5). Figure 5A shows that relative specificities of C at position 5 and A/g at position 6 are best reproduced with the static model (the contact model underpredicts specificity of C5 and overpredicts specificity of A6, whereas in the dynamic model both C5 and A6 are insufficiently conserved). All three computational models overpredict specificity of guanine at position 3 and underpredict specificity of adenine at position 9. Specificity of adenine at position 8 is predicted best with the contact model. Our second prediction is for the Zif268 zinc finger with the experimental PWM from a SELEX experiment (Figure 5B) (31). The value of {psi} is 0.13, 0.19 and 0.35 for the contact, static and dynamic models, respectively. Because almost all positions in the experimental PWM are strongly conserved, the contact model with the appropriate Nmax cutoff is capable of a near perfect prediction of experimental results. The high degree of specificity in the SELEX experiment could be an artifact of the stopping criteria in the selection rounds. The main advantage of more complicated models is in making some columns less specific in better agreement with the experimental data, notably G/a at position 5. Interestingly, the {psi}-test yields 0.13 for the weight matrix predicted using zinc finger energy matrices from Benos et al. (15), comparable to our predictions with contact and static models.

In summary, reliable PWM predictions can be carried out if the native structure of the protein–DNA complex is used as a starting point for computational modeling. The contact model which assigns base pair specificity based on the number of atomic contacts between protein amino acids and DNA bases is surprisingly successful. In some cases, the static model provides the best results: although the agreement with the experimental data is somewhat worse on average compared with the contact model, core motif probabilities are often better reproduced (Figure 5). Finally, the dynamic model is typically worse than the others, especially in cases where a high-resolution crystal structure of the protein–DNA complex is not available, or where ordered water molecules mediating protein–DNA interactions play a significant role. Water molecules are not explicitly modeled in our approach and, thus, their removal from the interface may result in artificially increased conformational freedom of neighboring bases and side chains. Limited utility of side chain conformational rearrangement in computational modeling of intermolecular binding interfaces was previously noted in the context of protein–protein complexes (59).

Homology modeling
Binding affinity and specificity predictions described in the previous sections require native structures of protein–DNA complexes. However, even in well-studied organisms such as S.cerevisiae and D.melanogaster there are only 10–15 suitable structures in the database. Furthermore, in most cases experimentally available protein–DNA complexes are not focused on any specific biological pathway (such as regulation of the Drosophila segmentation gene network), being instead distributed across a range of regulatory pathways and cell types. Hence the ability to model protein–DNA interactions by homology is crucial to future practical applications of our approach.

A suitable modeling template is initially identified by sequence similarity to protein–DNA complexes in the structural database using Ginzu (60). Besides sequence similarity, the quality of experimental structures (such as X-ray resolution or missing atoms) is taken into account. Amino acid substitutions at the DNA-binding interface are identified by a sequence–structure alignment with the template using K*Sync (60) or ClustalW (61). The approximate pairwise additivity of base pair energies and the relatively short range nature of the free energy function make it possible to ignore amino acid substitutions elsewhere because they are less likely to mediate protein–DNA interactions. An obvious consequence of this assumption is that binding specificity does not change if all amino acids at the protein–DNA interface are conserved. To give a specific example, essentially all DNA contacting amino acids are conserved in the Rel homology region family (8) and, thus, binding sites for the embryonic polarity protein dorsal in fly bear a strong resemblance to nuclear factor {kappa}B sites from mouse and human [e.g. a dorsal weight matrix from Ref. (62) gives the T-G/T-G-G-A/T-T-T-T-T/C-C-C consensus sequence, very close to the T-G-G-G-A-A-T-T-C-C-C binding site from the structure of the mouse nuclear factor {kappa}B p50 homodimer bound to DNA]. A classification of protein–DNA complexes into families and subfamilies has been carried out on the basis of identical DNA contacting amino acids (8). The definition of interface amino acids is somewhat arbitrary: typically, amino acids are considered to be at the interface based on distance cutoffs and/or visual identification of hydrogen bonds, van der Waals contacts and favorable electrostatic interactions with DNA atoms. We adopt a simple interface definition based on the 4.5 Å cutoff between protein side chain atoms and DNA base or phosphate backbone atoms.

In Figure 6 we show PWM predictions by homology for two TFs from the Drosophila segmentation gene network: the maternal factor bicoid (Bcd) and the gap factor giant (Gt) (63). Bcd is a homeodomain TF; its experimentally characterized binding sites suggest monomeric binding. The experimental PWM for Bcd obtained from the alignment of 30 binding sites shows a strongly conserved C/T-T-A-A-T-C-C/T-C/G consensus sequence. We use structures of the Engrailed homeodomain–DNA wild-type complex (3hdd) and its Q50K mutant (2hdd) as modeling templates (64,65). The amino acid at position 50 is an important specificity determinant for the 2 bp 3' of the core TAAT homeodomain motif: Lys-50 makes favorable contacts with two guanines complementary to the CC dinucleotide immediately 3' of TAAT, contributing to the difference between Q50 wild-type Engrailed (GTAATTAC) and Engrailed Q50K (GTAATCCC) binding sites from the structures.

Using the Q50K Engrailed mutant as a structural template for Bcd makes homology modeling relatively easy: all amino acids are conserved at the DNA-binding interface, even though there are 28/55 amino acid substitutions and a 2 residue gap in the alignment. Because Q50K Engrailed and Bcd DNA-binding interfaces are virtually identical, the experimental PWM for Bcd is reasonably well reproduced by the contact model. Prediction of the motif 3' of the TAAT core is further improved with the static model, but the TAAT motif becomes less specific (Figure 6A). Structural analysis of homeodomain–DNA complexes reveals that homeodomain binding causes distortion in the DNA conformation (37,66). The conformational change serves to enclose the recognition helix within the major groove, increasing the surface area of the protein–DNA interface and resulting in the Beg–DNA (enlarged groove) DNA form (66). The distortion of the DNA site is captured by the DNA conformational component of the free energy function (Figure 6A; note that DNA conformational energy weights from the static model are multiplied by 5). Surprisingly, DNA shape confers binding specificity not only to the TAAT core homeodomain motif, but also to the 3' CCC motif.

Owing to the differences between Bcd- and En-binding specificities, the contact model is not very successful in predicting the bicoid PWM starting from the wild-type En–DNA complex. The dynamic model reproduces PWM columns 8 and 9 significantly better, but in column 7 adenine is favored over cytosine, and in column 4 adenine is mixed with guanine to the extent not corroborated by the experiment (Figure 6A). One possible reason for this discrepancy is that the model does not fully reproduce experimentally observed conformational differences between wild-type and Q50K DNA sites (64). This difference is reflected in wild-type Engrailed DNA conformational energies which favor T/A at position 7 and G at position 8 (Figure 6A; the static model DNA weights are multiplied by 5 as before). Better DNA conformational sampling should improve the accuracy of PWM predictions with homology models.

Giant is a TF from the leucine zipper family. It binds DNA as a homodimer, with the TTAC consensus motif at positions 3–6 and its inverted complement GTAA at positions 7–10 (Figure 6B). The experimental PWM for Gt is constructed using just 7 binding sites and thus might be too specific at positions 1 and 2 and 11 and 12, where there are few contacts with protein side chains (as is evident from the contact model prediction). We used the 1.8 Å resolution crystal structure of the human nuclear factor NF-IL6 bound to the GATTGCGCAATA site (1gu4) as the homology modeling template. For each monomer in the homodimer there are 5 amino acid mutations at the protein–DNA binding interface: 1 DNA base contact and 4 phosphate backbone contacts. However, only one of these contacts is non-homologous (K contacting phosphate backbone in NF-IL6 is mutated to V in Gt), suggesting that NF-IL6- and Gt-binding specificities should not be very different. The mutated LYS makes contacts with the phosphate group joining a dinucleotide complementary to TG at positions 4 and 5 in one chain, and to CA at positions 8 and 9 in the other chain (Figure 6B, panel 2 from top; column numbers are from the experimental PWM). In accordance with this observation, comparison of the experimental PWM for Gt with the contact model shows that the specificity change is largely restricted to positions 5 (G->A) and 8 (C->T). By construction, the contact model is incapable of accommodating this change, whereas static and dynamic models are partially successful, but at the price of making the prediction at position 6 in the experimental PWM less specific (Figure 6B).

The Bcd and Gt examples described above are representative of the future applications of our approach. Binding site specificity predictions based on structural modeling can be used in conjunction with existing bioinformatic algorithms to study regulatory gene networks in many species. Even though the requirement of having homologous structures of protein–DNA complexes with a limited number of mutations at the protein–DNA binding interface is much less restrictive, it is still likely to be the main limitation of our approach. PWM prediction by homology becomes less tractable if the number of mutations at the interface is >2 or 3, probably because the current implementation of our model does not allow protein backbone degrees of freedom to relax. Modeling TF binding specificities using distant homologs may require including rigid body motion of the TF and sampling over multiple docking conformations.

rating: 3.00 from 2 votes | updated on: 7 Dec 2007 | views: 12411 |

Rate article: