
In general, the construction of trees is based on sequence alignments. This …


- Comparing sequences without using alignments: application to HIV/SIV subtyping

The N-local decoding method on complete genome sequences – tests on HIV/SIV complete sequences

The N-local decoding method was used to calculate dissimilarity matrices for all 70 sequences described in the Methods section. Four incomplete genome sequences (HIV-2 subtypes C, D, E and F) were also included because they may have retained the strong subtyping signals found in complete genome sequences. Figure 1 shows a tree obtained from the dissimilarity matrices calculated by our method. Two types of HIV, representing independent cross-species transmission events [9], are clearly distinguished: HIV-1 is closer to SIV-CPZ, and HIV-2 is closer to SIV-SMM. Nine subtypes of HIV-1 group M cluster distinctly, with sub-subtypes significantly more closely related to each other (A1 and A2, F1 and F2, B and D; B and D are regarded as subtypes rather than sub-subtypes for historical reasons [10]). Subtype K is more distant from sub-subtypes F1 and F2 than these are from each other, but closer to F1 and F2 than to other subtypes, i.e., in the range of the subtype B and D distances [11]. HIV-1 groups N and O lie at the expected distances from group M. Group N intercalates between HIV-1 group M and SIV-CPZ (CPZ-CAM3, -CAM5, -GAB and -US), consistent with the suggestion that group N is a recombinant between a SIV-CPZ strain and a virus related to the ancestor of group M [12]. Group O is intercalated between these CPZ sequences (CPZ-CAM3, -CAM5, -GAB and -US) and CPZ-ANT, which is the borderline of the HIV-1/SIV-CPZ lineages [13]. The HIV-2 subtypes also form clear clusters, including subtypes C, D, E and F, whose sequences cover only about half of the gag region.

Figure 1. The neighbor-joining tree obtained from 70 HIV/SIV nucleotide sequences (distance matrix calculated by using the N-local decoding method for N = 15). The sequence names are written as follows: their GenBank accession numbers, followed by their nomenclature names [9–11, 14, 15]. These sequences can be retrieved from the Los Alamos HIV sequence database [24]. Bootstrap values (≥ 90%) are indicated.

Consequently, the topology of the tree shown in Figure 1 agrees very well with existing knowledge [9-11,14,15] and recognizes significant HIV/SIV evolutionary events.
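The step from a dissimilarity matrix to a neighbor-joining tree can be illustrated with a minimal sketch. It is not the program used in this study: it applies the standard neighbor-joining Q criterion, returns only the topology (as nested tuples, without branch lengths or bootstrap values), and the function name `neighbor_joining` and the four-leaf toy distances are assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return an unrooted tree topology as nested tuples.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the dissimilarity of a and b.
    """
    nodes = list(names)
    d = dict(dist)  # work on a copy; joined clusters are added as new keys
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node: sum of its distances to all other nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimizes the Q criterion.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        new = (a, b)
        # Distance from the new cluster to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d[frozenset((a, b))]
                )
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return tuple(nodes)


# Toy, made-up additive distances for four hypothetical sequences.
toy = {
    ("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
    ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7,
}
toy_dist = {frozenset(pair): value for pair, value in toy.items()}
tree = neighbor_joining(["A", "B", "C", "D"], toy_dist)
```

In this toy case A pairs with B and C pairs with D. Any dissimilarity matrix, including one produced by an alignment-free method, could be supplied in the same dictionary form.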

The computational program used to calculate the dissimilarity matrices from which the trees can be constructed has been tested for a wide range of values of N, which is the only user-specified parameter of the local decoding method. When N ranges from 13 to 35, there is no significant change in the tree. Furthermore, the topology of the tree obtained by N-local decoding (N = 13–29) of 46 HIV/SIV complete nucleotide sequences is identical to that of the complete genome tree available from the HIV/SIV sequence compendium 2000 (compare Figure "Comp2000_13_tree" corresponding to the file "Comp_2000" at [16] with the published tree at [17]).
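One way to check that two trees share the same topology, for example trees built with different values of N, is to compare their sets of non-trivial bipartitions (splits). The sketch below assumes trees stored as nested tuples with string leaf labels; the helper names `leaves`, `splits` and `same_topology` are illustrative and do not come from the original software.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree node."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaves(child) for child in tree))
    return frozenset([tree])


def splits(tree):
    """Return the non-trivial bipartitions of an unrooted nested-tuple tree.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the result is independent of where the tree is rooted.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # reference leaf (labels must be comparable)
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return
        side = leaves(node)
        if anchor in side:
            side = all_leaves - side
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def same_topology(t1, t2):
    """True if both trees have the same leaves and the same splits."""
    return leaves(t1) == leaves(t2) and splits(t1) == splits(t2)
```

For instance, `(("A", "B"), ("C", "D"))` and `("A", ("B", ("C", "D")))` are the same unrooted tree drawn with different roots, whereas `(("A", "C"), ("B", "D"))` is a different topology.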

The N-local decoding method on short sequences – tests on HIV/SIV gag, pol, env and nef sequences

HIV/SIV exhibit great variety in different parts of their genomes, with env representing the most variable region. All 70 sequences described in the Methods section were used to test different sequence regions. For gag, 66 HIV/SIV sequences (only the gag region is used) and 4 HIV-2 sequences that cover partial gag (sequence length 771–781 nt, in contrast to 1473–1569 nt for the complete gag sequences) were tested. For the other genome regions, pol (2993–3360 nt), env (2499–2658 nt) and nef (292–783 nt), only 66 sequences were selected because sequences for these regions are unavailable for HIV-2 subtypes C, D, E and F. The trees of these four regions, based on the sequence distance matrices calculated by N-local decoding (see figures «gag18_tree», «pol22_tree», «env15_tree» and «nef13_tree» at [16]), also agree well with the established HIV/SIV phylogenetic trees in these regions [9-11,14,15], with the exception of nef. A few discrepancies exist in nef: in our method, HIV-1 group M sub-subtypes F1 and F2 mix together. Subtype K, depending on the chosen value of N, either remains close to F1/F2 as expected, or is isolated or loosely related to subtype J. SIV-CPZ-ANT is intermediate between HIV-1-M-N/SIV-CPZ-HIV-1-O and HIV-2/SIV-SMM. These discrepancies may simply reflect uneven sequence complexity in different genome regions, or differences in the treatment of ambiguous alignment regions: the N-local decoding method keeps all these regions, whereas traditional alignment-based methods have to delete them in order to produce an unbiased tree.

The best orders N tested in these regions are: gag, N = 11 to 23; pol, N = 11 to 30; env, N = 12 to 24; and nef, N = 11 to 20.

The N-local decoding method on sequences that traditional alignment-based methods cannot deal with – tests on non-coding LTR sequences

Forty-three of the 70 sequences (retrieved from the HIV sequence database as described in the Methods section) cover the non-coding part of the LTR (the complete non-coding LTR region or at least the 5' portion of this segment including the polyadenylation signal AATAAA). The length of this part ranges from 211 to 328 nt in the HIV-1/SIV-CPZ subset, and from 433 to 508 nt in the HIV-2/SIV-SMM subset. These short non-coding LTR segments contain many duplications, insertions and deletions, which make them difficult to handle in traditional alignment-based phylogenetic studies [5].

There is no suitable reference tree available for this non-coding LTR region. We therefore built trees based on CLUSTAL-W [18] and DIALIGN-2 [19] alignments and compared them with the tree based on the N-local decoding method. In our method (Figure 2), the two types of HIV are very well characterized in terms of their relations to their SIV origins (HIV-2/SIV-SMM and HIV-1/SIV-CPZ). The sequence relations among different subtypes of HIV-1, HIV-2, and SIV are clearly defined, except for A1 and A2, which are not distinguishable from each other. The CLUSTAL-W-based tree (Fig. 3), however, incorrectly clusters HIV-1/SIV-CPZ and the HIV-1 M, N and O groups. In contrast, the tree based on DIALIGN-2 (Fig. 4), a multiple alignment program based on homology blocks, features better HIV clades than the CLUSTAL-W-based tree: subtypes in HIV-1 and HIV-2 are well separated from each other. Nevertheless, this DIALIGN-based tree is far less satisfactory than the tree based on the N-local decoding method (Fig. 2). For instance, sub-subtypes in HIV-1 group M mix together; HIV-1 group N is located inside HIV-1 group M and is only loosely related to SIV-CPZ; and subtypes A and G of HIV-1 group M are not closely related to each other, contrary to expectation. In other words, neither the CLUSTAL-W-based nor the DIALIGN-based tree represents HIV/SIV sequence relations and distances better than the tree obtained by our method.

Figure 2. The neighbor-joining tree obtained from 43 HIV/SIV non-coding parts of LTR nucleotide sequences (distance matrix calculated for N = 11). M15390 corresponds to the HIV-2-A ROD isolate just as X05291 for Figure 1. Sequence names follow the same rule as in Figure 1. Bootstrap values (≥ 50%) are indicated.

Figure 3. The neighbor-joining tree calculated from the multiple alignment of the same 43 sequences as in Figure 2, produced by the CLUSTAL-W program. Sequence names follow the same rule as in Figure 1.

Figure 4. The neighbor-joining tree calculated from the multiple alignment of the same 43 sequences as in Figure 2, produced by the DIALIGN-2 program. Sequence names follow the same rule as in Figure 1.

In our method, bootstrap values strongly support the clustering of HIV-2 (and its subtypes), SIV-SMM, HIV-1-M, HIV-1-N, SIV-CPZ, and HIV-1-O. However, in contrast to Figure 1, the values are low within HIV-1-M. This is not surprising since these sequences are short and very similar to each other. The values of N tested here range from 10 to 21 in order to generate the most appropriate neighbor-joining trees for these 43 sequences. For individual subtypes, however, the suitable N sometimes falls into different ranges. For example, N = 9–14, 27–43 and beyond clearly define the HIV-1 subtype C cluster; N = 10–60 and beyond are needed for the HIV-1 subtype G cluster; and N = 8–60 and above are necessary to cluster the two J sequences of HIV-1 group M. Finally, N = 9–17 is needed to clearly cluster HIV-1 group M subtypes B and D together, and N = 19–25 is necessary to distinguish subtype B from subtype D.

In short, our HIV/SIV subtyping results confirm the good performance of the N-local decoding method, regardless of the complexity of the genome regions. The N values listed above are not only useful as an N-parameter reference for future applications of the N-local decoding method, but also imply different evolutionary pressures in various HIV/SIV clades.

Similarity blocks displayed by the N-local decoding method – an example of the HIV non-coding LTR sequences

In our previous study of the HIV LTR [5], a sequence alignment had to be manually constructed in order to take into account the frequent duplication, insertion and deletion events. This strategy is similar to the N-local decoding procedure described in this paper.

Here we use the NFKB binding site of the HIV non-coding LTR (GGGACTTTCCA/G) (see [5]) and its flanking sequences, obtained from 43 sequences (the same sequences used in the non-coding LTR section above), as an example to show the relationship between the similarity blocks and the N-classes.

Figure 5 shows the sequences re-written through the N-local decoding method definition (N = 11). Each letter followed by a number identifies an N-class (# indicates a number with more than one digit). Identical colours allow easy identification of repeated identifiers in the same column, and thus facilitate finding similar segments. Two kinds of similarity blocks have been identified in these 43 sequences. One exists in the NFKB binding sites, and the other is in NFKB flanking sequences.
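The display convention described above (a letter followed by its class number, with `#` standing in for any number of more than one digit) can be sketched as a small formatting helper. How the class numbers themselves are computed belongs to the N-local decoding method and is not reproduced here; the function name `format_identifiers` and the example class numbers 12 and 10 are hypothetical.

```python
def format_identifiers(recoded):
    """Render (letter, class_number) pairs in the style used in Figure 5.

    Class numbers are assumed to be non-negative integers. Numbers with
    more than one digit are replaced by '#', as in the figure.
    """
    return "".join(
        f"{letter}{number if number < 10 else '#'}" for letter, number in recoded
    )


# Single-digit class numbers are shown as they are.
short_example = format_identifiers([("G", 2), ("A", 3), ("C", 2), ("A", 4)])

# Hypothetical multi-digit class numbers (12 and 10) collapse to '#'.
long_example = format_identifiers([("C", 3), ("T", 12), ("G", 10), ("C", 5)])
```

Because `#` hides the actual number, two `#` identifiers for the same letter do not necessarily belong to the same N-class, which is presumably why colour is also used to mark identical identifiers.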

Figure 5. Similarity blocks found by applying the N-local decoding method (for N = 11) to the HIV non-coding LTR sequences. This is a nucleotide sequence alignment of the 43 non-coding LTR sequences that are used in the Results section corresponding to Figures 2-4. The alignment is focused on the NFKB binding site (GGGACTTTCCA/G) and its flanking sequences. Most often, the similarity blocks are aligned. In the left columns, the sequences are given by their database accession numbers followed by their nomenclature (HIV-1; HIV-2; SIV-CPZ-CAM 3; -CAM 5; -GAB; -US; SIV-SMM; SIV-SMM-MAC) and their groups, subtypes and sub-subtypes (for HIV-1 and HIV-2). The sequences are re-grouped according to the phylogeny as seen in Figures 1-4. The letters are re-written by applying the N-local decoding method (N = 11). When the identifier contains a number with more than one digit, this number is replaced by a #. Identical recoded letters that are in the same column are written using the same colour. The sequences are written from left to right. The coordinate of the first letter in the file hivltr.fsa [16] is indicated to the left of the first letter of each sequence displayed. When several sequences are re-grouped because they are identical (for instance 6 HIV-2-A sequences), the lowest and the highest coordinates are indicated. The sequences are often written on several lines to show similarities between sequences and parts of sequences.

(A) Similarity blocks in the NFKB binding sites reveal different duplication events of this site in various HIV clades. Each HIV-1 group M subtype has two copies of the NFKB site, with the exception of subtype C, which exhibits an extra fragment at least 10 letters long (GGG(g)CgTTCCA) with 9 letters (the upper-case letters) matching this NFKB site. Both HIV-1 group N sequences have two NFKB binding sites and one or two incomplete copies of this site (GGGACTTT), and these N-group sequences are more similar to SIV-CPZ and HIV-1-M than to other HIV clades. All HIV-1 group O sequences display two copies of a sequence fragment 30 letters long (these two copies are written on two lines to show their similarity in Figure 5). Each fragment includes an NFKB binding site, a segment G2A3C2A4 similar to that in HIV-2 subtype B, and a segment C3T#G#C5 similar to SIV-CPZ-GAB. The differences between HIV clades in the similarity blocks of the NFKB binding sites may indicate that the NFKB binding site played an important role in the independent introductions of SIV-CPZ from chimpanzees into human populations that established HIV-1 groups M, N and O.

(B) Similarity blocks in the sequences flanking the NFKB site may also provide helpful information for tracking evolutionary relationships and distances between HIV clades. One example is the pattern G#C5A1 before the NFKB site, which exists only in HIV-2 and SIV-SMM sequences. This well-preserved G#C5A1 pattern possibly participated actively in the cross-species transmission from SMM to humans that established HIV-2. The other example is the A2 A# G# G# G# pattern (in blue) at the 3' end of the HIV-2 subtype A flanking sequences. This pattern distinguishes HIV-2 subtype A from other HIV clades. This, again, may simply reflect different transmission influences in different HIV clades.
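Counting copies of the NFKB binding site in a sequence, as discussed in (A), can be approximated with a simple motif scan. This is only a sketch under stated assumptions: the site GGGACTTTCCA/G is written with the IUPAC code R (A or G), mismatches are counted position by position, and insertions or deletions (such as the extra letter in the subtype C fragment) are not handled. The function name `find_sites` and the toy sequence are made up for illustration.

```python
MOTIF = "GGGACTTTCCR"  # R = A or G (IUPAC), i.e. GGGACTTTCCA/G
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}


def find_sites(seq, motif=MOTIF, max_mismatches=0):
    """Return (start, window, mismatches) for each window matching the motif.

    start is 0-based. A window is reported when it differs from the motif
    at no more than max_mismatches positions.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        mismatches = sum(
            base not in IUPAC[code] for base, code in zip(window, motif)
        )
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Made-up toy sequence containing two exact copies of the site.
toy_seq = "AA" + "GGGACTTTCCA" + "TT" + "GGGACTTTCCG" + "CC"
exact_hits = find_sites(toy_seq)
```

Raising `max_mismatches` lets the scan pick up diverged copies, at the cost of more chance matches in short, repetitive regions.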

updated on: 12 Aug 2009