Current HIV sequence analyses usually involve obtaining a
gap-stripped multiple alignment to construct an unbiased phylogenetic
tree. The alignment is usually first generated by HMMER [20,21] and/or
other alignment software, followed by manual editing [22].
After the ambiguously aligned positions are deleted from the alignment,
the final alignment retains about half of the sequence length [13].
This is a time-consuming procedure and, most importantly, the
alignment quite often underestimates sequence variability,
especially the variability embedded in the ambiguously aligned positions.
In this paper, we introduced a tree-building method that does not
require a sequence alignment. This N-local-decoding method calculates
sequence dissimilarity matrices, based on sequence rewriting,
and re-classifies input sequences. Our HIV/SIV subtyping results showed
that the classifications produced by this method agree very well with
those obtained by a combination of standard methods. Thus, comparing
biological sequences without alignment appears to be a viable
alternative for exploring sequence relationships.
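
As a rough illustration of what an alignment-free dissimilarity matrix
looks like, the sketch below compares sequences by the words of length
n that they share, using a Jaccard distance. This is a deliberately
simpler stand-in and not the N-local-decoding method itself; the names
`kmer_set` and `dissimilarity_matrix` and the choice of Jaccard
distance are assumptions made for illustration only.

```python
from itertools import combinations


def kmer_set(seq, n):
    """Return the set of all overlapping words of length n in seq."""
    seq = seq.upper()
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def dissimilarity_matrix(seqs, n):
    """Pairwise Jaccard distance between the n-word sets of the sequences.

    Illustrative stand-in only: it does NOT implement N-local decoding.
    Sequences may have different lengths; no alignment is computed.
    """
    words = [kmer_set(s, n) for s in seqs]
    size = len(seqs)
    d = [[0.0] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):
        union = words[i] | words[j]
        shared = words[i] & words[j]
        # Two sequences shorter than n share an empty word set: distance 0.
        dist = 1.0 - len(shared) / len(union) if union else 0.0
        d[i][j] = d[j][i] = dist
    return d
```

The resulting symmetric matrix could be passed to any distance-based
tree-building procedure, such as neighbor joining.
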
However, there are some discrepancies between the trees calculated by
our N-local-decoding method and those obtained from standard methods.
The differences may simply suggest that the N parameter used in the
N-local-decoding method needs to be better defined, or they may be a
consequence of the fact that we include ambiguous regions that are
often ignored by traditional methods.
The N-local-decoding method is particularly useful for analyzing
sequence variety and for tracking sequence evolutionary events when a
good sequence alignment is not possible. The method is also
meaningful from the evolutionary point of view. Its success in
sequence subtyping relies on capturing perfect or imperfect repeats and
conserved regions in sequences, that is, similarity blocks that are
either closely or remotely related. The remotely related blocks often
go undetected by traditional methods because ambiguous alignment
regions are removed. The similarity blocks include internal repeats
within one sequence and conserved regions among sequences, and these
blocks need not appear in the same order in the original sequences.
In our method, inversions can be detected by including the reverse
complementary sequences in the sequence set.
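
Adding reverse complements to the input set can be sketched as follows.
This is a minimal sketch that assumes DNA sequences held in a dictionary
keyed by name; the helper names and the `_rc` suffix are hypothetical
and are not taken from the method's implementation.

```python
# Translation table for the four standard bases (upper and lower case).
# Any other symbol, such as the ambiguity code N, is left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(named_seqs):
    """Return a new dict holding each sequence and its reverse complement.

    The reverse complement is stored under the original name plus the
    hypothetical suffix "_rc".
    """
    augmented = {}
    for name, seq in named_seqs.items():
        augmented[name] = seq
        augmented[name + "_rc"] = reverse_complement(seq)
    return augmented
```

A similarity block shared between a sequence and the `_rc` entry of
another sequence would then point to a candidate inversion.
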
This method is also practical in terms of computing time and ease of
use. All the calculations in this paper were completed within a few
seconds on a regular PC. This speed is due to the efficiency of the
algorithm [8], whose complexity is linear in the total length of the
sequence set. The N value, the only parameter used in this method, can
be set empirically according to the N values listed in the Results section.
Our method thus provides an alternative way of constructing sequence
trees. It is helpful in tracking sequence information embedded in
ambiguous alignment regions. In addition, the possibility of comparing
sequences of different lengths suggests its direct use in detecting
sequence recombinants. Finally, the similarity blocks found by this
method could also be used as anchor points by
similarity-block-based alignment programs to refine the quality of the
alignment.