Definition of a binding site
A binding residue is defined as any amino acid residue in the protein that has at least one atom within a cut-off distance of any atom of the sugar in the protein-carbohydrate complex. We tested several cut-off distances and found that 3.5 Å best separated binding from non-binding residues in the propensity graphs and also gave the best accuracy in the neural network based predictors. All reported results are therefore based on this cut-off unless otherwise stated.
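The definition above can be illustrated with a minimal sketch. The input layout (a list of residue identifier and coordinate pairs), the function name `binding_residues`, and the treatment of a distance exactly equal to the cut-off as "within" are assumptions made for illustration only, and the coordinates are invented.

```python
import math

CUTOFF = 3.5  # Angstrom; the cut-off distance reported in the text


def binding_residues(protein_atoms, sugar_atoms, cutoff=CUTOFF):
    """Return the ids of residues with any atom within `cutoff` of any sugar atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs (hypothetical layout)
    sugar_atoms:   iterable of (x, y, z) coordinates
    """
    sugar_atoms = list(sugar_atoms)
    binding = set()
    for res_id, coord in protein_atoms:
        if res_id in binding:
            continue  # one close atom is enough to mark the residue
        if any(math.dist(coord, s) <= cutoff for s in sugar_atoms):
            binding.add(res_id)
    return binding


# Invented toy coordinates: two atoms of ASP10, one atom of GLY11, one sugar atom.
protein = [
    ("ASP10", (0.0, 0.0, 0.0)),
    ("ASP10", (1.5, 0.0, 0.0)),
    ("GLY11", (10.0, 0.0, 0.0)),
]
sugar = [(4.0, 0.0, 0.0)]
result = binding_residues(protein, sugar)
```

In this toy case only the second ASP10 atom lies within 3.5 Å of the sugar atom, which is sufficient to label the whole residue as binding.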
A PDB search was performed for protein-carbohydrate complexes with a pair-wise similarity of 50% or less. Only one structure was taken when a family had more than one representative. For polypeptides, only the chain with the largest number of binding sites was selected. FASTA-formatted sequences were subsequently formatted using the formatdb program of the BLAST package. Clustering with the BLASTCLUST program at a 30% threshold reduced the set to 40 structures (Table 1). We call this database Procarb40.
This is a data set of 18 galactose-specific proteins selected for another analysis by Sujatha et al.
This is the non-redundant data set of 62 DNA-binding proteins (PDNA62).
This is a non-redundant data set of ligand-binding proteins developed for the current study. First, all 485 protein-ligand complexes were downloaded from the Protein-Ligand Database (v1.3 as on 25/01/06). Redundancy among sequences was first reduced using the CD-HIT program with a threshold of 40% sequence identity, which resulted in 178 clusters. FASTA-formatted sequences were subsequently formatted using the formatdb program of the BLAST package. Redundancy was then further reduced with a threshold of 30% sequence identity using the BLASTCLUST program. The final data set retains only the cluster representatives, so that no two sequences in it share more than 30% sequence identity. We call this database PLD116.
Other data sets
PDB-ALL (47,189 sequences) is a data set of all protein sequences obtained from NCBI. PIR is the sequence data set (283,177 sequences) of the Protein Information Resource at Georgetown University. SWISSPROT is another well-known database of sequences. NCBI-NR is a non-redundant data set of all protein sequences compiled by NCBI from GenBank, PIR, SwissProt, PDB and other resources. All of these data sets were also used in the current work.
Generation of PSSMs
Target sequences were scanned against the reference data sets to compile a set of alignment profiles, or position-specific scoring matrices (PSSMs), using the Position-Specific Iterative BLAST (PSI-BLAST) program. Three cycles of PSI-BLAST were run for each protein and the scores were saved as profile matrices (PSSMs). The NR database of NCBI, PDBAA (the database of all amino acid sequences of proteins in PDB), SWISSPROT and PIR were used for building the profiles. Profiles from the NR database of NCBI were used for most of the calculations presented in this work unless otherwise specified.
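As an illustration of how saved profile matrices might be read back, the sketch below parses the score columns from PSSM text. It assumes the common ASCII layout in which each data row starts with a position number and a one-letter residue code followed by 20 integer scores; the function name `parse_ascii_pssm` and the sample values are hypothetical and are not taken from any real PSI-BLAST run.

```python
def parse_ascii_pssm(text):
    """Parse the score columns of PSSM text in the assumed ASCII layout.

    Returns a list of (residue_letter, [20 integer scores]) tuples.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows: position number, one-letter residue code, then the scores.
        is_data_row = (
            len(parts) >= 22
            and parts[0].isdigit()
            and len(parts[1]) == 1
            and parts[1].isalpha()
        )
        if is_data_row:
            rows.append((parts[1], [int(v) for v in parts[2:22]]))
    return rows


# Made-up two-residue example in the assumed layout.
sample = """
Last position-specific scoring matrix computed
           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    1 M   -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1
    2 K   -1  2  0 -1 -3  1  1 -2 -1 -3 -3  5 -2 -3 -1  0 -1 -3 -2 -3
"""
pssm_rows = parse_ascii_pssm(sample)
```

Header and title lines are skipped because they do not begin with a position number, so only the per-residue score rows are returned.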
Calculation of amino acid composition, solvent accessibility and secondary structure at binding sites
We collected statistics on the amino acid residues involved in carbohydrate binding and examined whether any particular residue type was preferred. The frequency of occurrence of each residue type was calculated as the number of residues of that type relative to all residues found in the carbohydrate-binding proteins.
Solvent accessibility, or accessible surface area (ASA), values of the Procarb40, PDNA62 and PLD116 complexes were obtained from ASAVIEW, our earlier database of (relative) solvent accessibility of proteins, whereas the secondary structure was obtained using the DSSP program.
The propensity of a residue type i to occur in the binding site was calculated by the formula:
Propensity(i) = (NBi / Ni) / (NBall / Nall)
where NBi is the number of residues of type i that bind to carbohydrate, Ni is the total number of residues of type i, NBall is the total number of all binding residues, and Nall is the total number of all residues. To compute the propensity score of each residue type, the data for binding and non-binding residues were pooled and a single propensity score was obtained for the entire set of proteins. Propensity scores were also calculated for each protein separately, and the standard deviation of these per-protein scores for the same residue type was used as the error bar.
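The pooled propensity calculation can be sketched as follows. The sketch assumes that the propensity of residue type i is its binding fraction NBi / Ni normalised by the overall binding fraction NBall / Nall, which is what the four quantities defined above suggest; the function name `propensities` and the toy data are hypothetical.

```python
from collections import Counter


def propensities(residues):
    """Compute pooled propensity scores per residue type.

    residues: iterable of (residue_type, is_binding) pairs pooled over all proteins.
    Assumed formula: (NB_i / N_i) / (NB_all / N_all).
    """
    total = Counter()    # N_i
    binding = Counter()  # NB_i
    for res_type, is_binding in residues:
        total[res_type] += 1
        if is_binding:
            binding[res_type] += 1
    n_all = sum(total.values())
    nb_all = sum(binding.values())
    if nb_all == 0:
        raise ValueError("no binding residues in the pooled data")
    background = nb_all / n_all  # overall binding fraction
    return {t: (binding[t] / total[t]) / background for t in total}


# Invented toy data: 3 tryptophans (2 binding) and 5 alanines (1 binding).
pooled = [
    ("W", True), ("W", True), ("W", False),
    ("A", False), ("A", False), ("A", False), ("A", True), ("A", False),
]
scores = propensities(pooled)
```

Under this normalisation a score above 1 indicates a residue type that is over-represented in binding sites relative to the pooled background, and a score below 1 indicates under-representation.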
Neural network inputs
Each row of a PSI-BLAST PSSM corresponds to one residue, and its conservation scores for the 20 amino acid types occupy 20 columns of that row (column 3 onwards). For every residue, we make a binary prediction (1 for binding, 0 for non-binding) of whether it belongs to a binding site. The input for each prediction is the PSSM row of the target residue together with either one neighbouring row on each side (20 × 3 = 60 inputs) or two neighbouring rows on each side (20 × 5 = 100 inputs).
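The windowed inputs described above can be sketched as below. The text does not say how windows that extend past the ends of the chain were handled; padding with zero rows is an assumption made here, and the function name `window_inputs` is hypothetical.

```python
def window_inputs(pssm_rows, half_width=1):
    """Build one flat input vector per residue from a sliding window of PSSM rows.

    pssm_rows:  list of 20-score rows, one per residue
    half_width: neighbouring rows taken on each side (1 -> 60 inputs, 2 -> 100)
    Positions beyond the chain ends are zero-padded (an assumption).
    """
    n_cols = 20
    pad = [0] * n_cols
    inputs = []
    for i in range(len(pssm_rows)):
        vec = []
        for j in range(i - half_width, i + half_width + 1):
            vec.extend(pssm_rows[j] if 0 <= j < len(pssm_rows) else pad)
        inputs.append(vec)
    return inputs


# Invented 4-residue profile in which row k holds the value k in every column.
rows = [[k] * 20 for k in (1, 2, 3, 4)]
x3 = window_inputs(rows, half_width=1)  # 3-row windows
x5 = window_inputs(rows, half_width=2)  # 5-row windows
```

Each residue thus yields one vector of 60 or 100 values, with the target residue's own row in the centre of the window.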
Network architecture and transfer function
A fully connected, feed-forward neural network was constructed using the Stuttgart Neural Network Simulator (SNNS) version 4.2, developed at the University of Stuttgart. After varying the number of hidden layers and the number of units in them, we found that a network with a single hidden layer of two units and a single output unit performed slightly better than the other choices.
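For orientation, the forward pass of a network with this shape can be sketched in plain Python. This is not SNNS code: the logistic transfer function, the weight initialisation range and all names (`init_network`, `forward`) are assumptions made for illustration.

```python
import math
import random


def logistic(x):
    """Logistic transfer function (assumed; not stated in the text)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, n_hidden=2, seed=0):
    """Create small random weights; the last weight of each unit is its bias."""
    rng = random.Random(seed)
    return {
        "hidden": [
            [rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
            for _ in range(n_hidden)
        ],
        "output": [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)],
    }


def forward(net, x):
    """Propagate one input vector through the hidden layer to the output unit."""
    hidden = [
        logistic(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1])
        for ws in net["hidden"]
    ]
    out = net["output"]
    return logistic(sum(w * h for w, h in zip(out[:-1], hidden)) + out[-1])


net = init_network(n_inputs=60)  # 60 inputs correspond to a 3-row PSSM window
y = forward(net, [0.0] * 60)     # a value in (0, 1)
```

The output value would be compared against a threshold to obtain the binary binding or non-binding call; the choice of threshold is not covered by this sketch.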
Training and validation
Several data sets and cross-validation schemes were tried; results are presented for those that gave the best prediction performance. We use a leave-one-out approach for training and validation. In this approach, the data for one protein are removed from the data set and a neural network is trained on the remaining proteins. The performance on the left-out protein is then measured. The process is repeated systematically for all proteins, leaving each one out in turn and measuring its prediction accuracy. The final reported accuracy scores are the averages over the left-out proteins.
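The leave-one-out loop over proteins can be sketched as follows. The `train` and `evaluate` callables stand in for the actual network training and scoring, which are not reproduced here; the majority-label toy model and all names are hypothetical.

```python
def leave_one_protein_out(proteins, train, evaluate):
    """Hold out each protein in turn, train on the rest, score the held-out one.

    proteins: dict mapping protein id -> data for that protein
    train:    callable(training_dict) -> model
    evaluate: callable(model, held_out_data) -> score
    Returns the per-protein scores and their average.
    """
    scores = {}
    for held_out in proteins:
        training_set = {pid: d for pid, d in proteins.items() if pid != held_out}
        model = train(training_set)
        scores[held_out] = evaluate(model, proteins[held_out])
    mean_score = sum(scores.values()) / len(scores)
    return scores, mean_score


# Toy stand-ins: the "model" is simply the majority label of the training data.
def train_majority(training_set):
    labels = [lab for data in training_set.values() for lab in data]
    ones = sum(labels)
    return 1 if ones >= len(labels) - ones else 0


def fraction_correct(model, labels):
    return sum(1 for lab in labels if lab == model) / len(labels)


toy = {"p1": [1, 1, 0], "p2": [1, 1, 1], "p3": [0, 0, 0, 0]}
per_protein, average = leave_one_protein_out(toy, train_majority, fraction_correct)
```

Splitting by protein rather than by residue keeps all residues of the held-out protein out of training, which matches the procedure described above.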
Most other procedures for training and assessment of prediction accuracy were the same as in our earlier work .
Assessment of prediction performance
Three scores were used to measure prediction performance, viz. sensitivity (S1), specificity (S2) and their average, net prediction (NP). They are defined as follows:
Sensitivity (S1) = TP/(TP+FN)
Specificity (S2) = TN/(TN+FP)
Net Prediction (NP) = (S1+S2)/2
where TP is the number of correctly identified binding residues, TN is the number of correctly identified non-binding residues, FP is the number of non-binding residues wrongly identified as binding by the predictor, and FN is the number of binding residues predicted as non-binding.
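These scores can be computed from per-residue labels as in the following sketch, which takes NP to be the arithmetic mean of S1 and S2 as stated above. The function name `prediction_scores` and the example labels are hypothetical.

```python
def prediction_scores(actual, predicted):
    """Return (sensitivity, specificity, net prediction) for binary labels.

    actual, predicted: equal-length sequences of 1 (binding) and 0 (non-binding).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one binding and one non-binding residue")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    net_prediction = (sensitivity + specificity) / 2
    return sensitivity, specificity, net_prediction


# Invented labels: 3 binding and 5 non-binding residues.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
s1, s2, np_score = prediction_scores(actual, predicted)
```

Because binding residues are typically far fewer than non-binding ones, averaging S1 and S2 weights both classes equally regardless of their sizes.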