We developed a computational all-atom approach for predicting protein–DNA binding affinities and TF weight matrices. Protein–DNA energetics is described with an empirical free energy function that accounts for protein–DNA interactions (including electrostatics, solvation, hydrogen bonding, van der Waals interactions and packing) and for distortion of the DNA shape caused by protein binding. Each term in the free energy function is multiplied by a weight that is adjusted to optimize the performance of the model on an experimental dataset. Free energy minimization and conformational rearrangement at the protein–DNA binding interface are either not employed at all (static model) or limited to repacking interface side chains and DNA minimization (dynamic model). Protein–DNA docking orientation and protein backbone conformation are kept fixed during energy minimization. Our approach is computationally efficient and can be applied on a genome-wide scale. We demonstrated its utility by carrying out a number of ΔΔG and PWM predictions using native protein–DNA complexes as structural templates.
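The weighting of free energy terms described above can be illustrated with a minimal sketch. It assumes that the weights are fitted by ordinary least squares against measured binding free energies; the text states only that the weights are adjusted to optimize performance on an experimental dataset, so the objective, the function names (`fit_term_weights`, `weighted_free_energy`) and the synthetic data are all hypothetical choices made for illustration.

```python
import random

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_term_weights(terms, measured):
    """Least-squares weights from the normal equations (X^T X) w = X^T y.

    terms:    one row per complex, one unweighted energy term per column
    measured: experimental binding free energy for each complex
    """
    k = len(terms[0])
    xtx = [[sum(row[i] * row[j] for row in terms) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(terms, measured))
           for i in range(k)]
    return solve_linear(xtx, xty)

def weighted_free_energy(term_values, weights):
    """Total free energy as the weighted sum of the individual terms."""
    return sum(w * e for w, e in zip(weights, term_values))

# Synthetic, noise-free data (illustration only, not experimental values).
random.seed(0)
true_weights = [0.5, 1.2, 0.3, 0.8]
terms = [[random.gauss(0.0, 1.0) for _ in true_weights] for _ in range(50)]
measured = [weighted_free_energy(row, true_weights) for row in terms]
fitted = fit_term_weights(terms, measured)
print([round(w, 3) for w in fitted])
```

With real data the columns would hold the unweighted electrostatic, solvation, hydrogen bonding, van der Waals and DNA conformational terms computed for each complex, and the fit would not be exact.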
Proteins bind DNA in a sequence-specific manner by utilizing two distinct interaction mechanisms. The mechanism of direct readout is mediated by protein side chains directly contacting DNA base atoms. Favorable protein–DNA base contacts result in base pair preferences at corresponding positions in the binding site. The mechanism of indirect readout is mediated by side chain contacts with the DNA phosphate backbone. These contacts are typically as numerous as direct protein–DNA base contacts and can exploit DNA flexibility by twisting and bending it into the shape that fits best with the binding interface presented by the protein. Since some DNA sequences are more flexible than others, DNA conformational change confers additional sequence specificity to the binding site. In cases where indirect readout predominates, our model predicts a major contribution of the DNA conformational energy to the overall binding specificity (Figure 3).
In the absence of conformational rearrangement, none of the terms in the DNA base pair energy except the base stacking energy depends on neighboring base pairs. DNA base pair energies are therefore nearly independent in the static model and only weakly coupled in the dynamic model (Figure 4). As a result, we can convert binding affinity predictions for one-point mutations into weight matrix probabilities without significant information loss. Results in Table 5 demonstrate that we are reasonably successful in predicting experimental PWMs for a variety of TFs starting from the native protein–DNA complex. Surprisingly, a simple contact model based on the consensus sequence from the protein–DNA complex works quite well, even though specific examples in Figure 5 make evident some of its limitations compared with all-atom models of protein–DNA interactions.
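The conversion of one-point mutation energies into weight matrix probabilities can be sketched as follows. The sketch assumes that positions contribute independently and that each column is Boltzmann-weighted, p(b) ∝ exp(−ΔΔG(b)/kT); this formula is an assumption used for illustration and is not stated in this section. The function name `ddg_to_pwm`, the energy unit (kcal/mol) and the example values are likewise hypothetical.

```python
import math

BASES = "ACGT"
KT = 0.593  # approximate kT in kcal/mol near 298 K (assumed energy unit)

def ddg_to_pwm(ddg_by_position, kt=KT):
    """Convert per-position ddG values into a position weight matrix.

    ddg_by_position: list of dicts mapping each base to the predicted
    binding free energy change for that one-point mutation.
    Returns a list of dicts mapping each base to its probability.
    """
    pwm = []
    for energies in ddg_by_position:
        # Boltzmann factor for each base; lower energy gives higher weight.
        boltz = {b: math.exp(-energies[b] / kt) for b in BASES}
        z = sum(boltz.values())
        pwm.append({b: boltz[b] / z for b in BASES})
    return pwm

# Hypothetical energies: position 1 prefers A, position 2 is unspecific.
example = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
pwm = ddg_to_pwm(example)
for column in pwm:
    print({b: round(p, 3) for b, p in column.items()})
```

Because each column is normalized separately, the sketch discards any coupling between positions, which is the approximation that the near-independence of base pair energies is meant to justify.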
The number of protein–DNA complexes currently available in the structural database is insufficient for modeling transcriptional regulation on a large scale. Therefore, the range of applicability of our approach depends on its accuracy in modeling TF binding specificities starting from homologous structures. Owing to the relatively short-range nature of our free energy function, it is sufficient to substitute amino acids only at the DNA-binding interface when creating protein–DNA homology models. Homology modeling should be easiest when there are no dissimilar amino acid substitutions at the interface, because in many instances TFs with conserved interfaces have identical binding specificities. Our model makes accurate predictions in such cases, but changes in binding specificity resulting from amino acid mutations are often predicted less accurately (Figure 6). Refinement of protein–DNA interfaces is a challenging problem that is strongly affected by the quality of experimental structural data and by the presence of ordered water molecules mediating interactions across the protein–DNA binding interface. Homology-based predictions should improve if interface waters are explicitly modeled and multiple protein docking conformations are allowed. Furthermore, homology modeling should be aided by better DNA conformational sampling (e.g. using simulated annealing techniques rather than minimization towards the nearest local minimum).
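The difference between minimization towards the nearest local minimum and simulated annealing can be illustrated on a toy problem. The sketch below uses a hypothetical one-dimensional double-well function in place of a DNA conformational energy and a basic Metropolis acceptance rule with geometric cooling; none of these choices is taken from the method described here, and the function names and parameters are assumptions made for illustration.

```python
import math
import random

def energy(x):
    """Toy double-well landscape (assumed, not a DNA energy function).

    It has a local minimum near x = 1.13 and a deeper minimum near x = -1.30.
    """
    return x**4 - 3 * x**2 + x

def sample(x0, steps, temperature, rng, cooling=1.0, step=0.1):
    """Metropolis search that returns the lowest-energy point visited.

    With temperature=0 only downhill moves are accepted, which mimics
    minimization towards the nearest local minimum.
    """
    x, best = x0, x0
    t = temperature
    for _ in range(steps):
        trial = x + rng.uniform(-step, step)
        delta = energy(trial) - energy(x)
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability while the temperature is positive.
        if delta <= 0 or (t > 0 and rng.random() < math.exp(-delta / t)):
            x = trial
            if energy(x) < energy(best):
                best = x
        t *= cooling  # geometric cooling schedule
    return best

rng = random.Random(0)
local = sample(1.5, 5000, temperature=0.0, rng=rng)
annealed = sample(1.5, 5000, temperature=1.0, rng=rng, cooling=0.999)
print(round(local, 2), round(energy(local), 2))
print(round(annealed, 2), round(energy(annealed), 2))
```

Started at x = 1.5, the downhill-only search cannot leave the right-hand well, whereas the annealed search is able to cross the barrier and may reach the deeper minimum, depending on the schedule and the random seed.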
In summary, the computational algorithm developed here is useful for binding affinity and weight matrix predictions if either a native structure of the protein–DNA complex or a sufficiently close homolog is available. Unlike previously reported knowledge-based approaches (15,16), our algorithm is not limited to any specific TF family and is not as data intensive. However, its accuracy depends strongly on the quality of the experimental structure used as the modeling template and on the number of amino acid substitutions at the DNA-binding interface. In the future, we intend to combine structurally predicted PWMs with motif detection algorithms in order to identify TF binding sites on the genomic scale.