Sequences
Our test sequences include 59 HIV (Human Immunodeficiency Virus) and
7 SIV (Simian Immunodeficiency Virus) non-recombinant full length
genome sequences and 4 incomplete non-recombinant HIV-2 sequences.
Recombinants and non-recombinants are very well defined sequences [23]. These 66 complete genome sequences ranging in length from 8555 to 11443 nucleotides cover the following clades:
(1) HIV-1 group M (subtypes A-D, F-H, J, K; subtype A is further split into A1 and A2, and subtype F is divided into F1 and F2),
(2) HIV-1 group N,
(3) HIV-1 group O,
(4) HIV-2 (subtypes A, B, G),
(5) SIV-CPZ, and SIV-SMM
Four incomplete genome sequences are obtained from HIV-2 subtypes C,
D, E, F, with lengths varying from 771 to 781 nucleotides, that cover
about half of gag region. Thus those well-characterized 70
sequences, representing different HIV/SIV groups, sub-groups, subtypes,
and sub-subtypes, are good samples of the genetic variation of the
HIV/SIV world. These sequences can be retrieved from the Los Alamos HIV
sequence database [24]. Their accession numbers are given in Figure 1. The sequence files are also available from [16].
Algorithm
The sequence data described above were analysed using the method introduced in [7,8], which we call «local decoding of order N»
of a sequence. Basically, the aim of the method is to find local
similarities between (and within) the sequences of an input set. The
input consists of a set of non-aligned sequences or a single sequence.
In the input, any site of a sequence is occupied by a letter (typically
a base or an amino acid). In the output, every site is occupied by an
identifier that carries information about the letter and its
neighborhood. The details of this method are described below.
A sequence S consists of letters S1...Si... with i running from 1 to |S|, the length of the sequence. Let N be
a chosen integer parameter, which determines the size of neighborhoods
to be considered. Consider the set of all overlapping words of length N (step 1) in the sequence S. Given a site (i.e. an integer i with 1 ≤ i ≤ |S|),
consider the set of words that cover this site. Two sites are said to
be «directly related» if they appear at the same position in two
occurrences of the same word of length N (see example below).
It is easy to see that two directly related sites have to be occupied
by the same letter. Note that two sites may not be directly related but
each of them may be directly related to a third site. More generally,
two sites are said to be «related» (not necessarily directly) if they
are linked by a chain of direct relations. In mathematical terminology,
we consider the transitive closure of the direct relation. One can show
that any two (even not directly) related sites are also occupied by the
same letter. Finally, one obtains a decomposition of the set of all
sites into disjoint classes (so called «N-classes»), each
class being defined to be a maximal subset of sites that are pairwise
related. This decomposition is always finer than (or the same as) the
decomposition of sites into classes of sites occupied by a same letter.
In this sense, local decodings of sequences are more informative. This
method can be extended to a set of sequences. A site is then labelled
by a pair (sequence name, position).
Using this method, we have re-written a set of HIV sequences in a
larger alphabet. A «letter» of the new alphabet at a given site of the
sequences, can be for instance «A215» where «A» is the letter in the
input sequence at this site, and where A215 is the N-class
identifier used to distinguish «related sites» (i.e. all sites carrying
the symbol «A215») from non-related sites. This re-writing depends only
on the choice of the integer parameter N. In addition, the
re-writing of a set of sequences depends also on this sequence set
itself, and could be changed if sequences are added or removed. In
other words, «relatedness» is not an intrinsic property of sites.
An example
Figure 6a shows three sequences, named seq1, seq2, seq3, of length 30 nt, 20 nt, 20 nt, respectively. We choose N =
5. Consider the site at position 11 (letter T) in seq1. We say it is
related with the site at position 5 in seq2, and with sites at
positions 5 and 12 in seq3. Indeed, the site (seq1,11) is at position 1
in the word TGGAC. The same is true for (seq2,5) and (seq3,12). According to our «local decoding of order N»
definition, these three sites are directly related to each other. The
fourth site (seq3,5) is directly related to (seq2,5), because (seq3,5)
and (seq2,5) are both at position 5 in the word CACTT. Consequently, the four sites (occupied by the letter T) are all pairwise related. They correspond to class_8 in Figure 6b and are identified by T3 in the output sequences (Fig. 6c).
Figure 6. The «local decoding of order N» computing strategy. Figure 6a (top): Four (N =
5)-related sites (containing the letter T) are taken from three input
nucleotide sequences seq1, seq2 and seq3. Each of the four boxed
sectors (2N - 1 = 9 letters in length) has T at its center (in
bold face type) and is identified by the sequence where it is situated
and the position of T in this sequence (that is seq1,11, seq2,5, seq3,5
and seq3,12, see Figure 6a bottom). Figure 6a (bottom):
each 9-letters-long segment (identified by the corresponding site
containing a T in bold face type) is displayed with the set of
corresponding overlapping (step 1) words of length N = 5
underneath the corresponding site (boxed). The four sites are
5-related; seq1,11, seq2,5 and seq3,12, are directly 5-related by TGGAC
(in bold face type) at the position 1; seq1,11 and seq3,12 are also
directly 5-related by CTGGA at the position 2; seq3,5 is directly
5-related with only seq2,5 by CACTT at the position 5, so that it is
connected by seq2,5 with the other two sites. Figure 6b:
the symbols that identify each class containing at least two sites, are
shown together with the segments covered by the overlapping 5-words
that lie over the letter (boxed). Figure 6c: the
re-written sequences generated by the program. The identifiers
corresponding to classes containing only one site are only represented
by their corresponding letter in the input sequence; in fact, they
cannot contribute to calculating the similarities between pairwise
compared re-written sequences. Figure 6d: the double-entry table for constructing a pairwise distance matrix between the three sequences (re-written in figure 6c).
Each class identifier with at least two sites is indicated in the
corresponding row. For each row and for each of the three sequences
that label the three columns, the table gives the number of sites of
this N-class that appear in the sequence. Figure 6e: similarity matrix and the corresponding normalized dissimilarity matrix (see text) for the three sequences.
Figure 6b shows all classes of sites containing at least two members, and Figure 6c shows the output of the program based on the algorithm described in [8]. Each site in the original sequences is replaced by the identifier of the N-class
to which the site belongs. As mentioned above, the identifier consists
of the letter occupying the site in the original sequence, followed by
a number.
Similarity and dissimilarity between sequences
Among many possibilities of defining a similarity between sequences, we choose the one described in [8]. In the program output (Fig. 6c), every site in the compared sequences gets occupied by an identifier. We need an extra notation here and denote by |w|x the
number of occurrences of an identifier x in any sequence (w). Let us
first introduce a similarity score between two sequences u and v. Each
identifier x could occur in any alignment (alignments are not used by
our method) of u and v in at most min(|u|x,|v|x),
that is the smallest number of occurrences of x in the two sequences u
and v. Adding up this number over all identifiers gives an upper bound
of the maximum number of possible matches between u and v (Fig. 6d).
Finally, we normalize the obtained similarity by dividing it by the
length of the shorter sequence (this makes it possible to subtype short
sequence parts together with complete sequences, see Results first section). It is then straightforward to obtain a dissimilarity score, that is 1 minus the normalized similarity (Fig. 6e). Figure 6d–e outlines the method for calculating the dissimilarity matrix between the three compared sequences re-written in Figure 6c, by using the N-local decoding method. For instance, when comparing seq1 and seq2, the sum of all the min(|u|x,|v|x) equals 5, so that the corresponding dissimilarity is 1 - (5/20) = 0.75 (Fig. 6e).
Here, the identifiers that correspond to classes containing only one
site have been omitted, since they cannot contribute to the similarity
(Fig. 6c).
Computational programs used in this study
A program implementing the N-local decoding method was used
for constructing HIV/SIV trees. It reads a set of FASTA input
sequences, and generates a tree based on the dissimilarity matrix
described above. This program is available from [16]. A more recent program [8] written in C, and the supplementary material are available upon request from GD (address in the title page).
Other programs used in this paper are: CLUSTAL-W (version 1.8 June 1999) [18], DIALIGN-2.2.1 [19] and PHYLIP [25].
Some of these programs have been used to compute the alignment-based
dissimilarity matrices from which the trees have been constructed for
Figure 3 (CLUSTAL-W) and for Figure 4 (DIALIGN-2.2.1).
Computing bootstrap values over trees
To assess the reliability of the trees computed from the
dissimilarity matrices, we use a bootstrapping procedure similar to the
one proposed in [26]
which proceeds by resampling the columns of an alignment. Though we do
not deal with an alignment but just a set of local decoded sequences, a
very similar procedure can be applied here. Practically, from the local
decoded set of sequences, we resample a given number of bootstrap
replicates. These replicates are the sets in which each sequence of the
initial local decoded set is replaced by a sequence of the same length.
This sequence is generated by concatenating randomly selected sites in
the sequence of the initial set (some positions can be selected more
than once, others never selected). For all the resampled sets of
sequences we compute dissimilarities and associated neighbor-joining
trees. Those trees are finally combined using the «consense» program of
PHYLIP [25] to give a final tree and related bootstrap values.