Almost all existing methods for comparing biological sequences are
based on certain implicit (and necessary) assumptions about the kinds
of changes sequences undergo during their putative evolution. Under
these assumptions, some changes, such as permutations and inversions, are
often ignored (see, however, [1-3]),
while other kinds of changes (insertions/deletions, for instance) are
poorly evaluated because they do not follow regular evolutionary
models. As a result, the regions in which these changes occur are often
treated as ambiguous and are deleted from sequence alignments before
further evolutionary analysis. In addition, as no single alignment
method copes with the tremendous variety of sequence data, expert editing
(manual alignment) often becomes necessary. This is particularly
important if the sequences have undergone successive duplications/deletions [4,5] and permutations.
To solve these problems, there has been work on methods of comparing
sequences without alignments. An intuitive way of comparing sequences
without alignments is to compare their compositions, i.e. the frequency
of nucleotides or amino acids appearing in the sequences, as two
sequences with significantly different compositions cannot be closely
related. However, it is also true that two unrelated sequences can have
very similar compositions. So the next step is to consider all N-words in a set of sequences (an N-word consists of N consecutive letters in a sequence) to calculate sequence dissimilarities .
This can show some evolutionary relationships. The problem is that the
only straightforward relationship between two words is that they are
either identical or different. This can be compensated for by considering
«imperfectly matched words». The problem is then a combinatorial
explosion: as we all know, there are too many ways of being imperfect.
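As a concrete illustration of the composition-based step described above, the following sketch counts all overlapping (step 1) N-words of two sequences and compares their relative frequencies. The Euclidean distance used here is only one of several possible dissimilarity measures, and the function names are our own; the exact formula used in the literature may differ.

```python
from math import sqrt

def nword_counts(seq, n):
    """Count every overlapping (step 1) N-word in a sequence."""
    counts = {}
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        counts[word] = counts.get(word, 0) + 1
    return counts

def composition_dissimilarity(seq_a, seq_b, n):
    """Euclidean distance between relative N-word frequencies.

    This is a generic composition comparison, not the method
    introduced in this paper.
    """
    ca, cb = nword_counts(seq_a, n), nword_counts(seq_b, n)
    ta, tb = sum(ca.values()), sum(cb.values())
    words = set(ca) | set(cb)
    return sqrt(sum((ca.get(w, 0) / ta - cb.get(w, 0) / tb) ** 2
                    for w in words))
```

Two identical sequences yield a dissimilarity of zero, while two unrelated sequences of similar composition can still score low, which is exactly the limitation discussed above.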
To track the evolutionary information embedded in ambiguous regions
of sequence alignments and to alleviate the burden of manual editing,
one of us developed a method that allows comparison by considering the
«local decoding of order N» of sequences (also called the «N-local decoding method»), keeping the information of each set of overlapping (step 1) N-words that covers each site in a sequence. The method is recapitulated in a recent work describing a new algorithm that computes the «local decoding of order N» of a sequence (or a set of sequences) with a complexity linear in sequence length, both in time and memory space.
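The first ingredient of this approach, the set of overlapping N-words covering a given site, can be sketched as follows. This shows only the site-coverage step; the full N-local decoding method (described in the Methods section) goes further by partitioning sites into classes of «related sites», which this sketch does not attempt.

```python
def words_covering_site(seq, site, n):
    """Return the (start, N-word) pairs whose length-n window
    covers position `site` (0-based) in `seq`.

    Only the coverage step of N-local decoding, not the method itself.
    """
    first = max(0, site - n + 1)
    last = min(site, len(seq) - n)
    return [(start, seq[start:start + n])
            for start in range(first, last + 1)]
```

For example, in the sequence `ACGTA`, site 2 (the `G`) is covered by the three 3-words starting at positions 0, 1, and 2.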
The method was found useful for assessing a manually constructed HIV
(Human Immunodeficiency Virus) LTR (long terminal repeat) multiple
alignment, which is based on homology blocks.
«Related sites» were found to correspond to homology blocks between the
compared sequences, leading us to think that comparing the sequences'
«composition» in «locally decoded letters» could yield a new
method for calculating dissimilarities between sequences without an
alignment, and consequently without taking into account the order of
homology blocks in each sequence.
The new method has also been tested against 10 reference
alignment-based dissimilarity matrices calculated from 10 multiple
alignments of DNA or RNA sequences: it gives alignment-free
dissimilarities in very good agreement with the reference
alignment-based dissimilarity matrices (better than the correlation
obtained when the alignment-free dissimilarity matrix was calculated
from the comparison of sequence compositions in words of length N).
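The kind of agreement reported above can be quantified, for instance, by the Pearson correlation between the upper triangles of two dissimilarity matrices. This is a common choice for comparing distance matrices, though the statistic actually used in the evaluation may differ; the sketch below assumes square symmetric matrices given as nested lists.

```python
def matrix_agreement(d1, d2):
    """Pearson correlation between the upper triangles of two
    dissimilarity matrices (one common way to quantify agreement;
    not necessarily the statistic used in the paper)."""
    xs, ys = [], []
    k = len(d1)
    for i in range(k):
        for j in range(i + 1, k):
            xs.append(d1[i][j])
            ys.append(d2[i][j])
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

A correlation close to 1 indicates that the alignment-free dissimilarities rank sequence pairs almost identically to the reference alignment-based ones.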
The aims of this paper are: (A) to give a concise description of this new method (see the Methods section);
(B) to apply it to a problem of biological interest, namely the
classification of HIV/SIV, two of the most variable organisms known so
far (see Results and Discussion).