Binding stability and weight matrix predictions
Free energy model of protein–DNA interactions
We extended Rosetta protein–nucleic acid model developed in Ref. (20) by adding sequencespecific DNA conformational energy. The free energy function employed in protein–DNA binding affinity calculations consists of the protein–DNA interaction component describing intermolecular readout of the DNA sequence by the protein, and of the DNA deformation component describing intramolecular readout of the binding site sequence:

(2) 
Protein–DNA interactions are modeled as a linear combination of the Lennard–Jones potential (which switches to a linear repulsive potential at short distances) (
20), the orientationdependent hydrogen bonding potential describing amino acid side chain–amino acid side chain and amino acid side chain–DNA base hydrogen bonds (
21), the Generalized Born electrostatics and solvation model (
22), and the implicit solvation model developed by Lazaridis and Karplus (
23):

(3) 
where each energy is a sum over all protein–DNA and protein–protein atomic pairs, and {
w} is a set of fitting weights. Parameterization of the Lennard–Jones potential, the hydrogen bonding potential and the Lazaridis and Karplus solvation model is carried out as described in Ref. (
20), whereas the Generalized Born model with AMBER
parm99 atomic partial charges (
24) is adopted from Ref. (
22). Other free energy components, including the knowledgebased potential derived from residue pair statistics and the solvent accessible surface area term, were tested but omitted from the final model since they did not significantly improve the predictions of protein–DNA binding energies. The model requires an atomic representation of protein–DNA complexes including all hydrogen and heavy atoms. Hydrogen atoms and heavy atoms missing from experimental structures were built using standard CHARMM27 bond lengths and bond angles, with all rotatable bonds connecting hydrogens to heavy atoms optimized using Monte Carlo search with the free energy function described above.
The DNA sequencedependent conformational energy model is based on the effective harmonic potential developed in Ref. (25). The DNA conformational energy is computed as a sum over all base pair and base step energies:

(4) 
where the first sum is over all base pairs in the double helical DNA (
,ß denote bases in a base pair), the second sum is over all consecutively stacked base pair steps (
,ß denote base pairs in a base step), and
w_{dna–bp},
w_{dna–bs} are fitting weights. Base pairs and base steps are counted once in the 5'–3' direction. The first term in
Equation 4 is necessary to enforce reasonable base pairing in the process of DNA minimization, which occurs in the space of DNA torsional angles and not in the space of effective degrees of freedom defined below. Both
E_{dna–bs} and
E_{dna–bp} are approximated with harmonic functions (
25):

(5) 
where the sum runs over effective degrees of freedom {
} (Twist, Tilt, Roll, Shift, Slide and Rise for base steps; Opening, Buckle, Propeller, Shear, Stretch and Stagger for base pairs) (
26). All effective geometric parameters are calculated as described in Ref. (
27). Every set of six geometric parameters completely describes the mutual orientation of two bases or base pairs represented as rigid bodies.
is a deviation of
from its value averaged over a set of experimental observations (DNA structures from a nonhomologous set of protein–DNA complexes):
The force constants
are evaluated by inverting the covariance matrix of
averaged over the same set of DNA structures:
.
All distributions are selfconsistently trimmed by removing all data points for which at least one of the geometric parameters is more than three standard deviations away from the average, followed by updating all averages and standard deviations. This procedure is repeated until convergence, which usually requires 2–3 iterations and removes just a few percent of the data points (28). A nonhomologous, manually curated set of 101 structures used in Ref. (20) was employed to derive the force constants for the DNA conformational potential.
Given free energies of the protein–DNA complex and its unbound partners, the binding free energy is computed as follows:

(6) 
Mutations of one or several base pairs in the binding site result in changes of protein–DNA binding free energies:

(7) 
where
G^{mut}(
G^{wt}) refers to the mutated and wildtype DNA sequence, respectively.
Weight matrix predictions based on binding free energies
Computation of weight matrix probabilities requires an assumption of the pairwise additivity of base pair energies:

(8) 
where
L is the length of the binding site in base pairs,
s_{i} is the base at position
i and
is the binding energy of base
(A, C, G or T) at position
i, which is assumed to be independent of all other bases. In our model, all energy terms except for the base step energies are explicitly pairwise independent. However, even the energies that are pairwise independent by construction (i.e. computed as sums over individual atom–atom interactions) can become nonadditive if conformational rearrangement is allowed at the protein–DNA binding interface. Thus, it is important to check explicitly if the energies in our model are approximately pairwise additive.
With the pairwise additivity assumption, a set of 4L predicted energies can be converted into weight matrix probabilities using the Boltzmann formula (29):

(9) 
where ß = 1/
RT is the inverse temperature used as a fitting parameter [ß was changed in steps of 0.25 (kcal/mol)
^{–1} in all fits].
Experimental datasets
Binding free energies
Table 1 collects structural data for protein–DNA complexes with binding free energy measurements available from the ProNIT database (http://dna01.bse.kyutech.ac.jp/jouhou/pronit/pronit.html) and from the literature. Each dataset in Table 1 consists of a structure of the protein–DNA complex and a series of binding free energy measurements G for wildtype and mutant DNA sequences. In several cases, association or dissociation constants reported by the authors were converted into binding free energies using

(10) 
with
RT = 0.59 kcal/mol. In the case of the MAT
a1/
2 TF,
in vivo repression levels of the heterologous reporter promoter construct were used as a measure of binding affinity:

(11) 
where
R is the repression ratio in the presence and absence of the wildtype or mutant
a1/
2 binding site (
30). Eleven datasets in the top part of
Table 1 (total of 196 data points) are used to train the weights in the free energy function.
Binding sites and weight matrices
Table 2 collects protein–DNA structures for which binding site data and experimental weight matrix data are available. The Zif268 weight matrix was taken from a selection experiment (31). The experiment generated the GCGT/gG/AGGC/a/tGG/T consensus sequence, which we used to create five binding site variants (in addition to the consensus sequence from the Zif268 structure). Binding sites and weight matrices for seven Escherichia coli TFs were obtained from the DPInteract database (32). For TrpR, 4 sites from DPInteract were augmented with 10 additional sites from RegulonDB (33). For Ndt80 in Saccharomyces cerevisiae we used the weight matrix from Ref. (34); Ndt80 binding sites were collected by searching promoters of the genes known to be regulated by Ndt80 with the YGNCACAAAA consensus sequence. A collection of 17 naturally occurring MAT a1/2 binding sites plus the synthetic consensus sequence were obtained from Ref. (30). Eight Gcn4p sites were assembled from Ref. (35) and the TRANSFAC database (36). Finally, binding sites and weight matrices for the Prd homeodomain homodimer were taken from Ref. (37), whereas binding data for the rest of Drosophila melanogaster TFs were collected by E. D. Siggia and E. Emberly (unpublished data).^{ }
Prediction testing
Protein–DNA interaction model based on the number of interface atomic contacts
In order to test our G and PWM predictions we developed a simple null model that exploits the structure of the protein–DNA complex but does not require any detailed predictions of protein–DNA energetics. This socalled ‘contact’ model constructs a weight matrix from the consensus DNA sequence and the number of atomic contacts N between protein side chains and DNA base pairs. In particular, we assume that the three nonconsensus bases occur with equal probabilities, whereas the consensus base is favored over any nonconsensus base by (N/N_{max}) if N N_{max}. If N > N_{max}, the consensus base becomes absolutely conserved:

(12) 
Here
is the probability of the base pair type
s_{i} = {A, C, G, T} in the PWM column
i, wt denotes a consensus base pair found in the protein–DNA complex at position
i,
N is the number of protein–DNA base atomic contacts summed over the base pair
i (protein and DNA atoms are defined to be in contact if they are separated by
4.5 Å; hydrogen atoms are excluded from the counts), and
N_{max} is the number of contacts above which the native base pair is always conserved.
N_{max} is treated as a free parameter in the probability model.
Probabilities defined by Equation 12 are converted into energies using:

(13) 
The
offset ensures that mutations from the consensus sequence are not penalized in the absence of protein–DNA contacts. The maximum energy penalty is capped at
E_{max}, which together with
N_{max} constitute the free parameters of the model. In all contact model fits
N_{max} was changed in steps of 5 kcal/mol and
E_{max} was changed in steps of 1.0 kcal/mol.
Significance test of PWM predictions
Statistical significance of PWM predictions is estimated using the test, which is a generalization of the wellknown ^{2}test (38):

(14) 
where {
p^{j}} are predicted probabilities, {
q^{j}} are experimental frequencies and
L is the length of the binding site in base pairs. Both
p and
q distributions are smoothed by adding 0.05 to all PWM entries and renormalizing. The quality of PWM predictions is further estimated by comparing
(
p,
q) with the average value of
computed for an ensemble of 10 000 alignments of random weight matrices
to the experimental weight matrix frequencies {
q^{j}} [
(
p_{random},
q)
]. Each column in the random weight matrix is obtained by uniformly sampling four numbers in the (0,1) interval and enforcing normalization afterwards. The difference between
(
p_{random},
q)
and
(
p,
q) can be viewed as a measure of success of our predictions.