Reliable high-resolution prediction of protein structure remains a formidable challenge, yet it is increasingly evident that we are entering an era in which high-resolution predictions and molecular designs will make important contributions to biology and medicine [1,2]. High-resolution models can be built by various comparative modeling procedures, and good models can sometimes also be obtained by template-free modeling of small globular proteins [2-5].
Determining and properly quantifying the properties that are characteristic of native protein structures is of primary importance for constructing an accurate tool for model quality assessment. Several approaches to optimal model selection have been proposed. One is the use of empirical or knowledge-based potentials [6,7] derived from databases of experimental structures. More straightforward, although computationally more expensive, is the evaluation of conformational energy by means of Molecular Mechanics force fields [8-10]. Another approach is structural clustering, which is especially useful when a large set of models must be assessed. Finally, learning-based scoring functions can be developed using machine learning methods such as support vector machines or neural networks [13,14].
It is widely believed that the native conformation of a protein corresponds to the global minimum of the free energy surface defined by the protein's conformational space and the molecular interactions. Straightforward protein modeling by all-atom energy minimization remains impractical because of the high complexity of the interactions and the astronomical size of the conformational space to be searched. Thus, most approaches to exploring the protein's energy surface resort to essential simplifications in the description of the polypeptide chain geometry and in the definition of molecular interactions. Properly designed reduced models enable a very effective search of the protein's conformational space. Model simplifications, while beneficial in filtering out the majority of unrealistic structures, limit the accuracy that can be achieved. Most contemporary approaches to protein structure prediction build large sets of alternative models, in which very good models are usually mixed with poorer ones. Selecting the best model is in many cases as difficult as generating the very good models in the first place.
Even in the simplest case of protein structure prediction, comparative modeling, the exact structure of a target protein differs from that of the nearest structural template used in modeling. Such deviations cannot be corrected at the low-resolution modeling level; a more detailed representation of the protein and a more realistic force field are needed. Unfortunately, more complex energy functions produce rougher energy landscapes, which makes sampling much more difficult. Thus, it seems reasonable to split the modeling process into two stages: fold assembly in a simplified representation, followed by a model refinement/selection procedure that uses a more detailed representation (preferably all-atom) and a more exact interaction scheme.
The first attempts to use all-atom modeling as the final stage of a hierarchical approach were applied to the GCN4 leucine zipper, a very simple homodimeric coiled coil consisting of two 33-mer monomers [3,4]. The simulations were carried out at a time when even short macromolecular simulations were hardly possible because of limited computer power. A backbone RMSD (coordinate Root-Mean-Square Deviation from the native structure after the best superimposition) of about 1 Å was achieved by means of reduced modeling of the GCN4 leucine zipper, followed by a molecular dynamics annealing protocol. Such improvement was possible only with the help of α-helical constraints applied to each residue. A decade after the pioneering work by Vieth et al., molecular dynamics was still too expensive for significant protein structure refinement. More recently, explicit solvent molecular dynamics and implicit solvent energy calculations on 12 small, single-domain proteins allowed successful ranking of the near-native conformations and selection of the best structure from predictions generated by the Rosetta method. However, the simulations were unable to refine the best structures. De novo models produced by Rosetta were also subjected to molecular dynamics simulations in explicit water. The RMSD values of the starting models increased during short simulations, but longer simulations appeared to generate tighter packing of helices and regularization of β-strands in some cases. A very encouraging result was also obtained by Simmerling et al., who managed to significantly improve a low-resolution, lattice-assembled structure of the 29-mer CMTI-1 protein (3.7 Å from native). The final model had the correct packing of β-strands and was much closer to the native structure (2.2 Å).
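For reference, the RMSD quoted throughout can be written in the standard form below. This is a sketch of the usual definition rather than a formula taken from the cited works; the symbols are assumptions introduced here: x_i and y_i denote the coordinates of the i-th of N corresponding atoms (for example backbone atoms) in the model and in the native structure, R is a proper rotation, and t is a translation.

```latex
\mathrm{RMSD} \;=\; \min_{R,\,t}\;
\sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl\lVert R\,x_i + t - y_i \bigr\rVert^{2}}
```

The minimization over R and t is the "best superimposition" mentioned above; in practice it is commonly solved in closed form, for instance with the Kabsch algorithm.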
A very interesting hierarchical approach to protein folding was developed by the Levitt group. First, a large set of compact decoys was generated on a very coarse-grained lattice. Then, fragments extracted from known structures were fitted to the lattice scaffolds. Subsequently, an elaborate procedure for model selection and evaluation was performed. Quite good structures were finally predicted. To some extent the present approach follows this idea, although the higher-resolution lattice decoys used here enable higher-resolution modeling by the entire hierarchical scheme.
Currently, probably the most successful refinement procedures combine an all-atom force field that focuses on short-range interactions with Monte Carlo minimization. Unfortunately, these methods consume a lot of computer power and can be used only for small protein domains. The authors suggest that the primary bottleneck in consistent high-resolution prediction is conformational sampling: when sampling is insufficient, the native basin is missed and a false minimum may be selected.
Here we show that high-resolution structure prediction (less than 1.0 Å from native) can be achieved by combining relatively high-resolution sampling in a reduced conformational space with model selection by a detailed all-atom potential and high-performance computing. To obtain such a result, the reduced models need to be diverse enough to cover the near-native subset of the conformational space. The CABS (CA-CB-Side chain) modeling tool was employed for this purpose [17,18]. The CABS model was successfully used by the Kolinski-Bujnicki group during the CASP6 (Critical Assessment of Protein Structure Prediction) experiment: the average score of the models submitted by this group was the second best among about 200 participating groups. Interestingly, inspection of the simulation trajectories after publication of the target structures showed that there were always better models (frequently much better) than those submitted to the CASP6 server. The main reason for the poor model selection was the lack of specificity of the CABS force field within 3 Å of the native structure. In this range the CABS energy is poorly correlated with RMSD for the majority of proteins. During the CASP6 experiment the group did not have sufficient computer resources for all-atom refinement. In addition, the authors at that time underestimated the role of even brief all-atom refinement in proper model selection. Nevertheless, several submitted models (comparative modeling using CABS) were of very high accuracy, similar to the accuracy of crystallographic structures (detailed results are available at the CASP6 website).
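One way to quantify how well an energy function discriminates among near-native decoys is a rank correlation between energy and RMSD restricted to decoys below an RMSD cutoff. The sketch below is illustrative only: the function names, the use of Spearman's coefficient, and the input layout (parallel lists of energies and RMSD values) are assumptions made here, not the measure used in this work. The 3.0 Å default simply mirrors the range discussed above.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation computed as Pearson correlation of ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def near_native_correlation(energies, rmsds, cutoff=3.0):
    """Energy/RMSD rank correlation over decoys with RMSD <= cutoff (in Å)."""
    pairs = [(e, r) for e, r in zip(energies, rmsds) if r <= cutoff]
    if len(pairs) < 2:
        raise ValueError("need at least two decoys below the RMSD cutoff")
    es, rs = zip(*pairs)
    return spearman(list(es), list(rs))
```

A value close to 1 would indicate that lower energy consistently accompanies lower RMSD in the near-native range, while a value near 0 corresponds to the poor correlation described above.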
In the present work we demonstrate that a short all-atom minimization with fixed Cα positions can properly rank-order large sets of near-native decoys generated by CABS. In this context it becomes apparent that the ability to generate sets of models containing some near-native structures is critical for high-resolution protein structure prediction. In comparative modeling with CABS, this can be achieved by using restraints extracted from various templates, with alternative alignments in the uncertain regions. To our knowledge, this is the first approach enabling a meaningful refinement of large protein domains. The procedure proposed here may also work for small proteins in template-free modeling. In such cases very large and diverse sets of decoys need to be generated and properly clustered before the all-atom based model selection. To further evaluate the proposed method for model assessment and ranking, we also performed tests on models generated by MODELLER, probably the most popular, versatile and quite accurate computational tool for comparative modeling. Such models are collected in the MOULDER testing set, a comprehensive and well-evaluated present-day decoy set. Numerous state-of-the-art methods for model selection have been tested using this set. Our method performed similarly to, or even better than, the majority of the other methods. Very rigorous criteria for assessing model ranking were used to make this comparison.
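The selection step described above, ranking decoys by energy and checking how close the top-ranked decoy is to the best one available, can be sketched as follows. The sketch assumes that energies and RMSD values have already been computed elsewhere (for example after the all-atom minimization). The dictionary keys, function names, and report fields are hypothetical and chosen here for illustration; they are not the assessment criteria used in this work.

```python
def rank_decoys(decoys):
    """Sort decoys by ascending energy (lowest energy first)."""
    return sorted(decoys, key=lambda d: d["energy"])


def selection_report(decoys):
    """Summarize how well energy ranking picks out the best decoy.

    Each decoy is a dict with hypothetical keys "name", "energy", "rmsd".
    """
    if not decoys:
        raise ValueError("empty decoy set")
    ranked = rank_decoys(decoys)
    selected = ranked[0]
    best_rmsd = min(d["rmsd"] for d in decoys)
    # 1-based energy rank of the decoy that is closest to native.
    rank_of_best = next(
        i for i, d in enumerate(ranked, start=1) if d["rmsd"] == best_rmsd
    )
    return {
        "selected": selected["name"],
        "selected_rmsd": selected["rmsd"],
        "best_rmsd": best_rmsd,
        "rmsd_loss": selected["rmsd"] - best_rmsd,
        "rank_of_best": rank_of_best,
    }
```

A perfect selection gives an `rmsd_loss` of zero and a `rank_of_best` of 1; larger values indicate that better decoys were present in the set but were not ranked first.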