Carbohydrates are often referred to as the third molecular chain of life, after DNA and proteins . These interactions are responsible for important biological functions such as inter cellular communication particularly in the immune system . Living cells in all organisms are usually covered with one or another type of carbohydrate . Some viruses like influenza, use sugars on the outside of human cells to gain entry. Sometimes the carbohydrate-binding proteins and their sugar-ligands are expressed on the same cell, and the sugar is a part of the regulation machinery of the cell . The functional roles of carbohydrates and their interactions with proteins are drawing more attention than before, since it has been recognized that carbohydrates are used as information carriers, rather than simple storage material . Protein-carbohydrate interactions are involved in a broad range of biological processes. These processes include, among others, infection by invading microorganisms and the subsequent immune response, leukocytic trafficking and infiltration, and tumor metastasis [4-12]. Carbohydrates are uniquely suited for this role in molecular recognition, as they possess the capacity to generate an array of structurally diverse moieties from a relatively small number of monosaccharide units . This could be attributed to the fact, that unlike the components of nucleic acids, carbohydrates can link together in multiple, nonlinear ways because each building block has about four functional groups for linkage. They can even form branched chains. Hence, the number of possible polysaccharides is enormous (Figure 1). Since carbohydrates assume a large variety of configurations, many carbohydrate-binding proteins are being considered as targets for new medicines.
In view of the above, accurate in silico identification of carbohydrate-binding sites is a key issue in genome annotation and drug targeting. The information about the factors, which prevent or support carbohydrate binding of an amino acid, is expected to be present in the evolutionary profile of the sequence as well as the identity and structure of amino acid residues in the neighborhood of potential carbohydrate binding sites. A number of reviews have been published on protein- carbohydrate interactions [14-18]. Different aspects of protein carbohydrate recognition have also been extensively studied [19-26]. However, bioinformatics approaches with a predictive goal are relatively rare [1,27]. Compared with the abundance of methodologies developed for protein-nucleic acid [28-32] or protein-protein interactions [33-38], there are still very few methods for predicting carbohydrate-protein interactions. Shionyu-Mitsuyama  has developed a program that uses the empirical rules of the spatial distribution of protein atoms at known carbohydrate-binding sites for prediction. In that work an analysis of the characteristic properties of sugar binding sites was performed on a set of 19 sugar binding proteins. For each site six parameters were evaluated viz. solvation potential, residue propensity, hydrophobicity, planarity, protrusion and relative accessible surface area. Three of the parameters were found to distinguish the observed sugar binding sites from the other surface patches. These parameters were then used to calculate the probability for a surface patch to be a carbohydrate-binding site . These prediction methods are based on local structural descriptors of proteins and cannot be used if complete 3 dimensional structures are not available. On the other hand, neural network based predictions of post-translational modification (O-glycosylation and phosphorylation) sites have been reported by two groups [39,40]. However, these studies are restricted to only one type of protein-carbohydrate interactions and therefore do not capture all protein-carbohydrate interactions, as sought out in this work.
In this work we explore the exact contribution from different sequence and evolutionary attributes of proteins in determining their carbohydrate binding regions. Propensity of each of the 20 amino acid residues in binding regions has been calculated and compared with non-binding regions. Solvent accessibility, secondary structure and packing density of binding sites have been analyzed in a similar way. We go on to design a neural network to model sequence and evolutionary information (obtained by Position Specific Scoring Matrices) and determine their role in the predictability of carbohydrate binding sites.
We also study the binding sites of other protein-ligand and protein-DNA complexes and compare the propensity scores of all residues and their secondary structures with protein-carbohydrate complexes.