Gene regulation is mediated in part by protein transcription factors (TFs) binding to cis-regulatory regions of the genome. Accurate genome-wide characterization of TF binding sites is thus a necessary prerequisite to deciphering complex gene expression patterns. Probabilistic models of TF binding profiles, often called position-specific weight matrices (PWMs), are typically used as input to such predictions (1–3). With the weight matrix representation of TF binding sites, the probability P(S|p) that sequence S is a binding site for the TF represented by p is given by the product of the per-position base probabilities,
$$P(S|p) = \prod_{i=1}^{L} p_i(s_i),$$
where L is the length of the binding site, s_i is the base at position i of S, and p_i(s_i) is the weight matrix probability of observing that base at position i.
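A weight matrix scores a candidate site by multiplying, position by position, the probability of the observed base. The following is a minimal Python sketch of that calculation, assuming that positions contribute independently. The function name `site_probability`, the list-of-dicts layout, and the three-position matrix `toy_pwm` are hypothetical and chosen only for illustration; none of the numbers are taken from experimental data.

```python
BASES = "ACGT"


def site_probability(seq, pwm):
    """Return the probability of `seq` under a PWM.

    `pwm` is a list with one dict per position, mapping each base to its
    probability at that position (an assumed layout for this sketch).
    """
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match PWM length")
    prob = 1.0
    for base, column in zip(seq, pwm):
        # Independent positions: multiply the per-position probabilities.
        prob *= column[base]
    return prob


# Hypothetical three-position PWM; every column sums to 1.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

p_consensus = site_probability("AGC", toy_pwm)
p_weak = site_probability("CCC", toy_pwm)
```

In practice such products are usually accumulated as sums of logarithms to avoid numerical underflow on long sites; the plain product is kept here because it mirrors the definition directly.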
In principle, one should be able to predict a TF binding site profile from the structure of the protein–DNA complex or of a close homolog. It was initially anticipated that structural studies would reveal a universal protein–DNA recognition code, which could be used to predict TF binding sites from amino acid identities at the protein–DNA interface (4,5). As more protein–DNA structures were solved and classified, however, it became apparent that, despite some predominant interactions such as Lys-G, the energetics of amino acid–base contacts depends on their structural context and, in particular, on the structural family of the DNA-binding protein (6–10). Many amino acids were observed to form favorable contacts with different bases, making it necessary to generalize a deterministic recognition code to a probabilistic binding profile based, for example, on maximizing the likelihood of observed protein–DNA contacts (11–13). Probabilistic recognition codes are more accurate when developed for a specific structural family, because this implicitly takes the protein–DNA structural context into account. Indeed, binding site profiles based on the classification of TFs into families were found to be useful in bioinformatics pattern detection algorithms (14). However, data availability has so far limited knowledge-based PWM predictions to the C2H2 zinc finger family (15,16).
An alternative approach to specificity and binding affinity predictions is based on all-atom modeling of protein–DNA complexes (17–19). Starting from a known structure of the protein bound to its consensus DNA sequence, an ensemble of models is created by threading novel DNA sequences onto the binding site. Protein–DNA binding energies, ΔG, are then evaluated for each member of the structural ensemble. The ΔG predictions can either be used directly to infer high-affinity binding sites in genomic sequence or be converted into PWM probabilities using the Boltzmann formula. In the latter case, it is only necessary to compute ΔG values for all one-point mutations from the consensus binding site. The main limiting factor of the structural approach to TF binding site specificity predictions is the availability of experimentally determined structures of protein–DNA complexes. The range of applicability of structural methods will increase significantly if DNA-binding proteins can be modeled by homology. Homology modeling involves threading a protein amino acid sequence onto a suitable structural template chosen on the basis of its sequence similarity to the protein of interest. The threading procedure creates a new protein–DNA binding interface, for which ΔG and PWM probability calculations are then carried out as for native structures.
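The conversion of binding energies into PWM probabilities can be illustrated with a short Python sketch. It assumes that each position contributes independently, so that the energies of the one-point mutations at a position (relative to the consensus base) are turned into Boltzmann weights and normalized within that position. The function name `pwm_from_energies`, the choice of kcal/mol with kT ≈ 0.592 kcal/mol (about 298 K), and the energy values in `toy_energies` are assumptions made for this example, not values or code from the study.

```python
import math

BASES = "ACGT"
KT = 0.592  # kcal/mol near 298 K; units and temperature are assumed here


def pwm_from_energies(energies_by_position, kt=KT):
    """Convert per-position binding energies into PWM columns.

    `energies_by_position` holds one dict per position, mapping each base to
    the binding energy of the corresponding one-point mutant (lower is better).
    """
    pwm = []
    for energies in energies_by_position:
        # Boltzmann weight of each base at this position.
        weights = {base: math.exp(-energies[base] / kt) for base in BASES}
        partition = sum(weights.values())
        # Normalize so that the four probabilities sum to 1.
        pwm.append({base: w / partition for base, w in weights.items()})
    return pwm


# Hypothetical energies (kcal/mol) relative to the consensus base.
toy_energies = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 1.0, "C": 1.0, "G": 0.0, "T": 1.0},
]
toy_pwm = pwm_from_energies(toy_energies)
```

Because only differences in energy enter the normalized weights, adding a constant to all four energies at a position leaves the resulting column unchanged, which is why energies relative to the consensus are sufficient in this sketch.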
Here we present a computational model for predicting protein–DNA binding affinities and specificities. The model can be applied to a wide variety of DNA-binding proteins for which there is either a native protein–DNA structure or a sufficiently close homolog. The model is based on a simple free energy function, which consists of the protein–DNA interaction energy and the DNA conformational energy. The protein–DNA interaction energy is used to describe direct readout of the DNA sequence by the protein, whereas the DNA conformational energy takes into account distortion of B-DNA shape caused by protein binding. We carried out a series of tests of our PWM and binding energy predictions. First, we checked the ability of the model to reproduce experimental binding free energy measurements. We also assessed the accuracy of the pairwise additivity approximation in our analysis. Second, we checked the ability of our algorithm to discriminate experimentally known TF binding sites from random ensembles of sequences. Third, we carried out PWM predictions for a number of TFs and compared them with experimental PWMs. For all these predictions native protein–DNA complexes were used as structural templates. Finally, the extent of applicability of homology modeling to protein–DNA binding affinity predictions was explored with several representative PWM calculations. The relative accuracy and computational efficiency of our approach allowed us to carry out numerous predictions of TF binding affinities and specificities, facilitating future experimental and computational studies of transcriptional regulation in different organisms and biological systems.
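One way to picture the test of discriminating known binding sites from random sequences is sketched below in Python. This is not the procedure used in the study, whose statistic is not specified in this section. As an illustrative stand-in, the sketch sums per-position energies (assuming independent positions) and reports a z-score of the site energy against uniformly random sequences of the same length. The names `additive_energy` and `z_score`, the uniform background, and the energies in `toy_energies` are all hypothetical.

```python
import random
import statistics

BASES = "ACGT"


def additive_energy(seq, energies):
    """Sum per-position energies; assumes positions contribute independently."""
    return sum(column[base] for base, column in zip(seq, energies))


def z_score(site, energies, n_random=1000, seed=0):
    """Z-score of a site's energy against uniformly random sequences."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    background = [
        additive_energy("".join(rng.choice(BASES) for _ in site), energies)
        for _ in range(n_random)
    ]
    mean = statistics.mean(background)
    spread = statistics.stdev(background)
    return (additive_energy(site, energies) - mean) / spread


# Hypothetical energies (kcal/mol) relative to the consensus base at each position.
toy_energies = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 1.0, "C": 1.0, "G": 0.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 1.5},
]

z_consensus = z_score("AGC", toy_energies)
```

A strongly negative z-score indicates that the site binds more favorably than typical random sequences under the assumed energies. A realistic background would reflect genomic base composition rather than uniform sampling.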