
Methods
- Conformational analysis of alternative protein structures

 

First, the criteria for selecting the structural data used in the analysis are described. Then we explain the alignment procedure and review the clustering method. Finally, we describe the different approaches for characterizing backbone and side-chain conformations: analysis of variation, comparison of subsets, and identification of invariant and variable regions. The Cα atom coordinates are used to analyse backbone conformations; alternatively, the coordinates of the side-chain atoms are used for a more detailed analysis of side-chain conformations.

2.1 Structural data
Structural models were obtained from ASTRAL SCOP 1.71 (Chandonia et al., 2004), which corresponds to the SCOP 1.71 (Andreeva et al., 2004) domain definitions. Each set contains the structural models (entries) classified at the same SCOP species level. There are 75 930 SCOP models in total. The analysis was restricted to sets with at least two but fewer than 60 entries. It was also restricted to the first seven SCOP classes: all alpha, all beta, alpha and beta (a/b), alpha plus beta (a + b), multi-domain, membrane and cell surface, and small proteins. Four classes were excluded: coiled-coil proteins, low-resolution protein structures, peptides and designed proteins. Only entries for which a diffraction-component precision index value could be computed (see below) and which could be aligned were used in the analysis. In total, 36 634 SCOP models were used, divided into 5837 sets.

The diffraction-component precision index (DPI) was used to derive estimates of atom coordinate uncertainties for models derived by X-ray crystallography (Cruickshank, 1999). The result is an estimate of coordinate uncertainty (σi) for atom i. The centroid coordinate uncertainty is the mean uncertainty of the atom coordinates used in the computation of the centroid. DPI values were computed for 85% of all PDB entries available in February 2007, and are available on the STRuster web site. See Supplementary Material for more details.

2.2 Alignment
Each set contains models for one type of protein matching a UniProt entry (Bairoch et al., 2005). Each model was aligned to the protein sequence from UniProt, using the mappings between PDB residue number and UniProt sequence position (Martin, 2005). A set was only used in the analysis if every PDB entry was mapped to the same UniProt entry; sets in which different PDB entries were mapped to different UniProt sequences were excluded. The pairwise PDB residue-UniProt sequence alignments were then combined using the UniProt sequence as reference, producing a multiple sequence alignment. Alignments can include substitutions and insertions/deletions relative to the UniProt sequence. We refer to the different residue positions in the protein by the alignment positions (starting at index 0). For the three examples provided in the Results section, the mapping between the alignment positions and the PDB residue numbering is provided on the STRuster web site and can be retrieved by typing the appropriate SCOP or PDB codes.
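The combination step can be sketched as follows. This is only an illustration under simplifying assumptions, not the authors' code: each model is assumed to come with a dictionary from PDB residue identifier to 1-based UniProt position, insertions relative to the UniProt sequence are ignored, and the function name `combine_mappings` and the example data are hypothetical.

```python
def combine_mappings(mappings, uniprot_length):
    """Combine per-model PDB-to-UniProt mappings into one alignment.

    mappings: {model_id: {pdb_residue_id: uniprot_position (1-based)}}
    Returns {model_id: row}, where row[p] is the PDB residue id found at
    alignment position p (0-based), or None if the model has a gap there.
    Assumption: insertions relative to the UniProt sequence are not handled.
    """
    alignment = {}
    for model_id, pdb_to_uniprot in mappings.items():
        row = [None] * uniprot_length
        for pdb_res, uni_pos in pdb_to_uniprot.items():
            row[uni_pos - 1] = pdb_res  # shift to a 0-based alignment index
        alignment[model_id] = row
    return alignment


# Hypothetical example: two models covering different parts of a 5-residue sequence.
mappings = {
    "model_a": {"10": 1, "11": 2, "12": 3},
    "model_b": {"2": 2, "3": 3, "4": 4, "5": 5},
}
alignment = combine_mappings(mappings, uniprot_length=5)

# Alignment positions observed in every model; only these can be compared directly.
shared = [p for p in range(5)
          if all(row[p] is not None for row in alignment.values())]
```

Positions missing from some models appear as gaps, which is why the dissimilarity measure used for clustering has to account for insertions and deletions.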

2.3 Clustering
A hierarchical clustering method for grouping the models according to structure similarity was applied to each set. Clustering is implemented as previously described (Domingues et al., 2004); see Supplementary Material for a review of the method. The main difference is that the dissimilarity measure was modified to take into account insertions and deletions in the residue mapping between two entries, as obtained from the multiple sequence alignment. The Cα atom distances are used when clustering according to backbone structural similarity. When a more detailed analysis sensitive to side-chain conformations is required, the distances between the side-chain centroids (including the Cα atom) are also used. The silhouette width value (Rousseeuw, 1987), a measure of cluster quality, is used to identify the best number of groups obtained by hierarchical clustering.
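As an illustration of how the silhouette width can rank alternative partitions, the sketch below computes the average silhouette width from a dissimilarity matrix and a list of cluster labels, following the standard definition by Rousseeuw. The function name and the toy data are hypothetical, and the original analysis was carried out in R rather than with this code.

```python
def average_silhouette_width(dist, labels):
    """Average silhouette width for a partition.

    dist: symmetric matrix (nested lists) of pairwise dissimilarities.
    labels: cluster label of each item, in the same order as dist.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette width needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention s(i) = 0 for a singleton cluster
        # a: mean dissimilarity to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Toy dissimilarities: four items lying at 0, 1, 10 and 11 on a line.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

good = average_silhouette_width(dist, [0, 0, 1, 1])
bad = average_silhouette_width(dist, [0, 1, 1, 1])
```

The partition with the larger average silhouette width (`good` here) would be preferred when choosing where to cut the tree.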

2.4 Variation matrices
The variation matrices are used for visualizing the location and extent of structural variability over a set of alternative models. The structural variability is measured at the level of the backbone using Cα atom coordinates, or alternatively using the centroid of the side chain (including the Cα atom). Four types of matrices are computed. They provide complementary information. Matrix S is the standard-deviation (SD) matrix, and gives the SDs of the coordinate distances. The total SD matrix (T) accounts not only for the distance variation but also for the coordinate uncertainties. Both the S and T matrices provide measures of distance variability in absolute units (Å). The relative SD matrix (R) provides a measure of structural variability relative to the estimates of uncertainty in order to help in the identification of significant conformational variation. Finally, the maximum relative difference matrix (X) contains the largest pairwise structural differences at each position in the set of models. The matrix rows and columns correspond to the alignment positions (i,j), and all the matrices are symmetric.

Given a set of models A = {a1, ... , ak, ... , am}, for any model ak ∈ A, the expression [Formula] denotes the Cα or centroid coordinate distance between residue positions i and j in the alignment. We define:


Formula 1

(1)


Formula 2

(2)
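Assuming that Equations (1) and (2) are the usual mean and standard deviation of the distances over the m models, they could be written as shown below. The symbols for the distance and its mean are notation chosen here, and the normalization by m − 1 (sample SD) rather than m is also an assumption.

```latex
\bar{d}^{A}_{ij} = \frac{1}{m} \sum_{k=1}^{m} d^{a_k}_{ij}
\qquad
\sigma^{SET}_{ij} = \sqrt{\frac{1}{m-1} \sum_{k=1}^{m}
    \left( d^{a_k}_{ij} - \bar{d}^{A}_{ij} \right)^{2}}
```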
The distance SD matrix SA(i, j) is defined by the values of [Formula], for each pair of alignment positions i, j.

The total SD matrix TA(i, j) also takes into account the estimates of coordinate uncertainty. The coordinate uncertainty for residue i in model ak is denoted by [Formula]. Neglecting the covariance, one can estimate the distance uncertainty [Formula] as:


Formula 3

(3)
The average SD for the set based on the uncertainties is then:

Formula 4

(4)
The T matrix contains the values of [Formula], which combine the contributions to the variance from the distribution of distances and from the uncertainties:

Formula 5

(5)

The relative SD matrix RA(i, j) provides a measure of significant variability as the ratio of σSET and σSUS:


Formula 6

(6)
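The S, T and R matrices can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published implementation: the distance uncertainty is taken as the root of the summed squared coordinate uncertainties of the two positions, the set uncertainty as the root mean square of the distance uncertainties over the models, the SD uses the m − 1 normalization, and the function names are hypothetical. All models are assumed to cover every alignment position.

```python
import math


def distance_matrix(coords):
    """All pairwise Euclidean distances between the points of one model."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]


def variation_matrices(models, sigmas):
    """Return the S, T and R matrices for a set of aligned models.

    models: list of m models, each a list of n (x, y, z) coordinates.
    sigmas: list of m lists with the coordinate uncertainty per position.
    """
    m, n = len(models), len(models[0])
    dists = [distance_matrix(c) for c in models]
    S = [[0.0] * n for _ in range(n)]
    T = [[0.0] * n for _ in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            values = [d[i][j] for d in dists]
            mean = sum(values) / m
            # Variance of the distance over the set (assumed m - 1 normalization).
            var_set = sum((v - mean) ** 2 for v in values) / (m - 1)
            # Assumed: mean squared distance uncertainty, covariance neglected.
            var_unc = sum(sigmas[k][i] ** 2 + sigmas[k][j] ** 2
                          for k in range(m)) / m
            S[i][j] = math.sqrt(var_set)
            T[i][j] = math.sqrt(var_set + var_unc)
            # Ratio is undefined without uncertainty estimates; use infinity.
            R[i][j] = math.sqrt(var_set / var_unc) if var_unc > 0 else float("inf")
    return S, T, R


# Toy set: two models of three positions; only the third position moves.
models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
]
sigmas = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
S, T, R = variation_matrices(models, sigmas)
```

In this toy set the distance between positions 0 and 1 does not vary, so the corresponding entries of S and R are zero, while every distance involving position 2 shows variation well above the assumed uncertainty.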

The maximum relative difference matrix X describes the structural outliers. Matrix X is based on the maximal differences between the distances and was previously proposed (Schneider, 2000).


Formula 7

(7)
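One plausible reading of the X matrix is sketched below: for each pair of positions, take the largest absolute difference between the distances in any two models, scaled by the combined uncertainty of the two distances. The scaling by the root of the summed squared distance uncertainties is an assumption made here in the spirit of error-scaled difference distances, and the function name and data are hypothetical.

```python
import math
from itertools import combinations


def max_relative_difference(dists, dist_sigmas):
    """Largest uncertainty-scaled distance difference over all model pairs.

    dists[k][i][j]: distance between positions i and j in model k.
    dist_sigmas[k][i][j]: estimated uncertainty of that distance.
    """
    n = len(dists[0])
    X = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            best = 0.0
            for k, l in combinations(range(len(dists)), 2):
                diff = abs(dists[k][i][j] - dists[l][i][j])
                err = math.sqrt(dist_sigmas[k][i][j] ** 2
                                + dist_sigmas[l][i][j] ** 2)
                if err > 0:
                    best = max(best, diff / err)
            X[i][j] = X[j][i] = best  # the matrix is symmetric
    return X


# Toy data: three models, two positions.
dists = [
    [[0.0, 5.0], [5.0, 0.0]],
    [[0.0, 5.3], [5.3, 0.0]],
    [[0.0, 6.0], [6.0, 0.0]],
]
dist_sigmas = [
    [[0.0, 0.3], [0.3, 0.0]],
    [[0.0, 0.4], [0.4, 0.0]],
    [[0.0, 0.4], [0.4, 0.0]],
]
X = max_relative_difference(dists, dist_sigmas)
```

Because the maximum is taken over model pairs, a single outlying model is enough to produce a large entry, which is what makes this matrix useful for spotting outliers.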

2.5 Comparison matrices
The comparison matrices are used for identifying the structural differences between two subsets A = {a1, ... , ak, ... , am} and B = {b1, ... , bk, ... , bn}. The method provides three types of comparisons, namely between two subsets, between a single entry and a subset, and between two single entries. The backbone conformations are compared using Cα atom distances, and side-chain conformations are compared using distances between the centroids of the side chains (including the Cα atoms).

For each pair of positions (i, j) in the alignment, the extent of agreement between the two distance distributions of the two subsets A and B, relative to the variance of the distributions, is given by the value of the Welch statistic (Welch, 1938). There are two components in estimating the variance, one resulting from the variance of the distances [Formula], and one from the distance uncertainties [Formula]. If only the distance distributions are considered, then:


Formula 8

(8)


Formula 9

(9)
If the coordinate uncertainties are considered as well then:

Formula 10

(10)
With distance distributions close to normal, a P-value estimate of significance can be computed from the Welch t-test (Welch, 1938). In our experience, however, the distance distributions are usually not close to the normal distribution, so the Welch t-test is in general not applicable here.
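For a single pair of positions, the Welch statistic between two subsets can be sketched as follows. The core formula (difference of means divided by the root of the summed variance-over-size terms) is the standard Welch statistic; adding a mean squared distance uncertainty to each subset variance is an assumption about how the uncertainties enter, and the function and parameter names are hypothetical.

```python
import math


def welch_statistic(dist_a, dist_b, unc_var_a=0.0, unc_var_b=0.0):
    """Welch statistic for one pair of alignment positions.

    dist_a, dist_b: distances for that pair in subsets A and B
                    (each subset needs at least two models).
    unc_var_a, unc_var_b: optional mean squared distance uncertainties,
                    added to the subset variances (assumption).
    """
    m, n = len(dist_a), len(dist_b)
    mean_a = sum(dist_a) / m
    mean_b = sum(dist_b) / n
    # Unbiased sample variances, as in the standard Welch statistic.
    var_a = sum((x - mean_a) ** 2 for x in dist_a) / (m - 1)
    var_b = sum((x - mean_b) ** 2 for x in dist_b) / (n - 1)
    denom = math.sqrt((var_a + unc_var_a) / m + (var_b + unc_var_b) / n)
    return (mean_a - mean_b) / denom


# Toy distances (in Å) for one position pair in two subsets.
subset_a = [10.0, 10.2, 9.8]
subset_b = [12.0, 12.4, 11.6]

w_plain = welch_statistic(subset_a, subset_b)
w_with_unc = welch_statistic(subset_a, subset_b, unc_var_a=0.09, unc_var_b=0.09)
```

Including the uncertainties enlarges the denominator, so the magnitude of the statistic shrinks and fewer differences stand out.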

Extending the formalism to the comparison of a single entry a to a subset B, and not considering the coordinate uncertainties, we define:


Formula 11

(11)


Formula 12

(12)
When considering the distance uncertainties, the distance uncertainty for entry a has to be considered together with the uncertainties from subset B:

Formula 13

(13)


Formula 14

(14)


Formula 15

(15)

If only two entries a and b are compared, the variance is derived from the distance uncertainties as previously proposed (Schneider, 2000):


Formula 16

(16)


Formula 17

(17)

2.6 Identification of hinges, variable and invariant regions
The relative orientation of the backbone is preserved in the invariant regions. Invariant regions can be composed of more than one segment of contiguous residues (invariant segments). Invariant segments in invariant regions are structurally conserved relative to each other. The backbone structure is not preserved in variable segments. Hinge segments are associated with short flexible fragments on the protein backbone and with transitions between different invariant and variable segments.

Invariant backbone regions, as well as hinges and variable segments, are identified from the variation matrices or from the comparison matrices using a data smoothing approach. First the hinge segments are identified, then the remaining inter-hinge segments are classified either as variable or as invariant segments. Finally, invariant segments that preserve their orientation relative to each other are grouped into invariant regions. See Supplementary Material for a detailed explanation.

A clustering approach is used to identify regions with invariant side chains, based on the variation or comparison matrices computed with side-chain centroid distances. The matrix elements are used as distances for hierarchical clustering with group average agglomeration. The resulting tree is cut at a certain cutoff (s_cutoff).
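The clustering step can be illustrated with a small pure-Python version of group average (average linkage) agglomeration that stops merging once the closest pair of clusters is farther apart than the cutoff. Because average linkage merge heights never decrease, this gives the same groups as cutting the full tree at that height. The original analysis used R; the function name, the toy matrix and the cutoff value below are hypothetical.

```python
def average_linkage_clusters(dist, cutoff):
    """Group items by average linkage, cutting the tree at `cutoff`.

    dist: symmetric matrix (nested lists) used as distances, for example
          entries of a variation or comparison matrix.
    Returns a list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def average_distance(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > cutoff:
            break  # every remaining merge would lie above the cut
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]


# Toy matrix: positions 0-1 and 2-3 vary little relative to each other.
matrix = [
    [0.0, 0.2, 5.0, 5.1],
    [0.2, 0.0, 4.9, 5.0],
    [5.0, 4.9, 0.0, 0.3],
    [5.1, 5.0, 0.3, 0.0],
]
regions = average_linkage_clusters(matrix, cutoff=1.0)
```

With these values the four positions fall into two groups, each of which would be reported as a region with invariant side chains.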

2.7 Superposition
For an invariant backbone region, an optimal superposition between all structural entries and a representative entry is computed. The representative structure is the one with the lowest sum of backbone clustering dissimilarity values in the invariant region. The superposition between each entry and the representative is computed based on the invariant regions. The superpositions were performed using the PDB module of Biopython (http://biopython.org/; Hamelryck and Manderick, 2003).
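The two steps can be sketched as below. Choosing the representative only needs the dissimilarity matrix restricted to the invariant region; the superposition uses `Superimposer` from Biopython's `Bio.PDB` module. The helper names are hypothetical, the atom lists are assumed to be matched Cα atoms from the invariant region, and this is not necessarily how the authors organized their code.

```python
def choose_representative(dissimilarity):
    """Index of the entry with the lowest summed dissimilarity to all others."""
    sums = [sum(row) for row in dissimilarity]
    return min(range(len(sums)), key=sums.__getitem__)


def superimpose_on(reference_atoms, moving_atoms, atoms_to_transform):
    """Fit moving_atoms onto reference_atoms and transform a whole entry.

    reference_atoms, moving_atoms: equally long lists of Bio.PDB Atom
        objects from the invariant region (assumed to be matched pairs).
    atoms_to_transform: all atoms of the moving entry.
    Returns the RMSD over the fitted atoms.
    """
    from Bio.PDB import Superimposer  # requires Biopython to be installed

    sup = Superimposer()
    sup.set_atoms(reference_atoms, moving_atoms)  # least-squares fit
    sup.apply(atoms_to_transform)  # rotate and translate in place
    return sup.rms


# Toy dissimilarities between three entries within an invariant region.
dissimilarity = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 0.5],
    [2.0, 0.5, 0.0],
]
representative = choose_representative(dissimilarity)
```

Fitting only on the invariant region while transforming all atoms keeps the conserved part aligned, so the remaining differences between entries are easier to see.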

2.8 Implementation and visualization tools
The methods were implemented in Python (http://www.python.org/), using the Biopython library. Version 2.1.0 of the R environment for statistical computing (R Development Core Team, 2005) was used for clustering, for data smoothing and for visualization. PyMOL (http://www.pymol.org) was used for molecular rendering and visualization.


Updated on: 1 Dec 2007
