Residue-wise propensity scores
We started with a non-redundant set of all carbohydrate binding proteins (Procarb40) collected from PDB as described in the Methods section. Residue-wise propensities of carbohydrate binding sites in Procarb40 dataset were calculated and compared with the propensities of protein-ligand interaction database PLD116 and protein-DNA interaction database PDNA62 complexes, former of which is used as control data sets and the later for additional comparison. These datasets are described in Methods section. Results obtained from this analysis are summarized in Figure 2 and Table 2. It may be observed that certain residues (e.g. TRP, GLN, and ASN) are over-represented within the binding sites of these 40 protein-carbohydrate complexes, which signifies their importance in protein-carbohydrate interactions. These results may be understood in the light of reported experimental and theoretical studies on carbohydrate interactions. For example, it has been argued that the side chain residues with polar planar groups- ASN, ASP, GLU, GLN, ARG, and HIS- are the only ones participating in all three forms of hydrogen bonding with sugars and are abundant in sugar-binding sites, which explains why their propensities in the binding sites is higher . Our analyses show that aromatic amino acid residues are often present in carbohydrate-binding sites of proteins. These binding sites are characterized by a placement of a carbohydrate moiety in a stacking orientation to an aromatic ring. This arrangement is an example of CH/pi interactions, which have been shown to play an important role in carbohydrate recognition by glycosidases and carbohydrate-binding proteins . Apart from confirming some of the widely accepted ideas on residue preference for carbohydrate binding, our study determines exact role and contribution of each residue to carbohydrate binding.
Highest propensity score (331% over representation) in the carbohydrate binding sites is observed from Figure 2
and Table 2
for tryptophan (TRP), which is in accordance with many reported mutational studies
]. This and other studies have provided experimental and theoretical evidence that the presence of TRP residues in mutation sites is crucial for their binding to carbohydrates
]. Additionally, the conservation of aromatic residues, such as tyrosine and phenylalanine, on an exposed surface is common in carbohydrate-binding modules (CBMs) from families 1, 3, 5 and 10, highlighting the role of aromatic residues in carbohydrate binding
]. (It may be noted that CBMs have been previously classified into different families in which groupings of Carbohydrate binding domains or CBDs were called "Types" and numbered with roman numerals (e.g. Type I or Type II CBDs)
The modification of tryptophan residues has also been shown to cause a compete loss of hemagglutinating activity . Involvement of two tryptophan residues in carbohydrate-binding site was also shown to be essential in the same study. Similarly, Lafora disease-related mutation of TRP32 to glycine (W32G) has also been shown to disrupt the polysaccharide-binding pocket and also potentially unfold the region immediately adjacent to the binding pocket . All these experimental results are well reflected in the high propensity of TRP in carbohydrate binding sites presented in this work (Figure 2).
If the propensity scores of carbohydrate binding are compared with other ligand-binding residues identified by PLD116 database (see Methods), TRP remains the most prominent high propensity residue. However, high propensity score of HIS residues is also shown by PLD binding sites, indicating that the role of HIS in ligand-binding sites does not have specific preference for carbohydrates, but HIS in general being an active site shows high propensity of binding to any ligand, including carbohydrates. Next important residue is ARG whose propensity for carbohydrate binding is less than TRP within Procarb40, yet the propensity (277% over representation) is even higher than what is observed for ARG in DNA-binding proteins database PDNA62 (= 241% over representation) (see Table 2 and Methods section). These results are supported by some published results of transmutagenesis experiments reporting crucial role of ARG residues in some protein-carbohydrate interactions . Lower propensity scores for the other basic residue LYS indicate that the interaction between ARG and sugar is not purely electrostatic in nature. Dahms et al. in 1993,  also report that the substitution of ARG residues by LYS in Insulin-like growth factor also caused loss of binding despite similar electropositive property of these residues and also despite overall conservation of structure upon this mutation. These results were interpreted that the proteins utilize residues with planar side chains (ARG, ASN, ASP, GLU) for their interaction with sugars. Higher propensity scores of ASP and GLU, which are also negatively charged residues, also support this argument. These propensity scores are higher than what is observed for other ligands (PLD116 database), thus highlighting a preference of these residues to interact carbohydrates in contrast to other types of ligands. In comparison to DNA-binding propensity scores of ASP and GLU are much higher, obviously because negatively charged bases in DNA repel negative charged residues.
Solvent accessibility of binding sites compared with the rest of the protein
We next attempted to establish a residue-wise relationship between solvent accessibility and carbohydrate binding. Figure 3 shows the mean solvent accessibility (ASA) values for the binding and non-binding regions in Procarb40 database. We observe that the most frequent carbohydrate binder TRP has a significantly higher ASA in binding locations compared with non-binding ones. Similar higher ASA for binding regions are also observed for other aliphatic residues ALA, GLY, ILE and LEU. Thus, the hydrophobic residues, which are usually in the buried states, do not apparently participate in sugar binding. In order to bind sugars they are expected to be on the surface, thus facilitating their hydrophobic interactions with carbohydrate atoms of protein-carbohydrate complexes and reveals that polar uncharged and certain hydrophobic residues (e.g. TYR, TRP, ALA, LEU and ILE) seem to have higher mean ASA-values in the binding regions. This result contrasts with similar binding sites analysis on DNA-binding proteins, where ASA of charged residues showed a better discrimination between binding and non-binding regions . Most charged and polar residues do not show any difference in their ASA for binding and non-binding regions, presumably because their probability to be on the surface is higher irrespective of their role in binding. For a quick comparison of role of ASA in binding regions of PLD116 and PDNA62 databases with Procarb40, ratio of mean ASA in binding to non-binding regions of the three databases have been plotted in Figure 4 (see 1). As discussed above, aliphatic residues ILE, LEU and GLY show the highest ratio for Procarb40, in addition to the most frequent binder TRP. Very low values of CYS and VAL residues are not significant as there are very few binding residue of this type (see Table 2).
Figure 3. Comparison between mean ASA values of residues in binding and non-binding sites for Procarb40. Error bars are taken from their standard deviation in each protein. The graph does not contain cystein and valine data as none of these residues were found to be in the binding regions.
Role of Secondary structure
We tried to explore if certain residues prefer any secondary structure for binding to carbohydrates. Results of these statistics are presented in 1 as Tables 5-10 and Figures 5a-g. If the number of binding sites is resolved into their secondary structure types, very few binding sites are assigned to each category. This leaves the resulting data to be insufficient for any statistical conclusions. These results are therefore not discussed here, but only provided in 1 for reference.
We also tried to find out the difference between the packing density of the binding and non-binding residues and observed that there is no statistically significant difference of packing density between binding and non-binding residues.
Looking at clear preferences of residues for binding carbohydrates (Figure 1), we sought to develop a prediction method, which could take the predisposition of residues and their sequence environments as an input and thereby identify binding residues from the information of protein-sequence. To do so, sequence environment at each residue level could be represented either as binary 20 bit vectors or by the rows of the matrices depicting evolutionary profiles of residues at each location. Sequence neighbor environment could be added as the corresponding rows of this matrix (called position-specific substitution matrix or PSSM) on either side. Schemes of these representations have been extensively developed for the problem of solvent accessibility and other residue-wise features of proteins . Table 3 summarized the results of predictions obtained in this way, using a leave-one-out method. This method also allows us to compute the standard errors in the prediction scores. Further, prediction performance of sequence-only predictors has been compared with those using PSSMs. The best performance for Procarb40 data set was found to be a modest 61%, indicating that the sequence and evolutionary information do not decisively determine a binding site. This not-so-good prediction performance for Procarb40 is is apparently because carbohydrates are diverse and finding overall general rules for their binding sites in proteins may not be possible with the amount of data we have. We need to have large data with sufficient representation of all types of sugars. To ensure that the low performance is caused by the diversity of sugars, we tried to develop a prediction model for only one type of sugar. We tried many differently classified carbohydrates, but due to further small size of data, could only use galactose binding proteins (GalBind18) data set used by Sujatha et al. (2004)  to have a sufficient number of binding sites to model. As expected prediction performance for proteins binding to only one type of sugars, was very much higher than all carbohydrates taken together. Table 3 shows that in GalBind18 carbohydrate binding sites could be predicted with as much as 79% specificity and 63% sensitivity. We speculate that much better prediction methods will be developed when a large number of proteins binding to each type of carbohydrates become available.
Table 3. Comparison of Binary and PSSM prediction results using jackknife leave-one-out method (binding sites were labeled at 3.5 Å cut-off distance between carbohydrate and protein atoms).
Single sequence versus evolutionary information
It may be a little surprising to note that PSSM based predictions (55%) were somewhat poorer than single sequences (61%) in Procarb40. However in the case of GalBind18 the situation in reversed. Lower values of prediction in PSSM based methods could be due to two reasons. First of all the number of sequences which gave significant alignments with Procarb40 was roughly 400, which is small and hence the evolutionary information transferred to PSSM may not be enough to improve performance. Secondly, the diversity of Procarb40 may lead to higher conservation scores to some residues and hence there would be many false positive predictions by this (that is why the specificity of PSSM based method was as low as 23%). In the case of GalBind18, the situation is reversed because the carbohydrates are more similar and hence conservation of a residue within them does convey positive information about its binding behaviour. Thus PSSMs do not carry false information to the neural network.
Comparison with other studies
Although some of the results presented in this work may be obvious to some experienced biologists, yet this work is the first attempt to summarize the sequence and structure features of carbohydrate binding proteins in such a comprehensive way. Previous studies have either focused on a small set of proteins aiming to analyze one or a few types of residues [43-45], or tried to focus either on the structural aspect [e.g. [16,17,26]] or just the sequence aspect of these interactions. This is also the first attempt to use sequence and evolutionary information to predict carbohydrate-binding sites using neural network based approach, which has been proved to be successful in making other sequence-based predictions. Earlier, structure-based methods have been employed to develop empirical rules on patches and other structure descriptors with a somewhat better (65%) accuracy. However, sequence-based methods, employing only sequence information presented in this work are new and will have a much wider application as no structure information will be required for prediction. We expect that this will trigger interest in the prediction of carbohydrate binding sites using machine learning methods and the performance will improve with the availability of more data.