We have studied very large numbers of (AC)n microsatellite flanking regions culled from the human genome and asked to what extent these evolve in a consistent and unusual manner. Patterning is present in the form of over- and underrepresentation of bases and dinucleotide motifs at odd and even positions on either side of the microsatellite. Pattern strength is maximal around (AC)9, but appears present even around sequences as short as (AC)2 and may extend as far as 50 bases on either side. Some patterning is more or less symmetrical, but we also found several examples showing strong 5′ to 3′ asymmetry, implying that the two ends of a microsatellite are by no means equivalent. The net result is that sequences flanking microsatellites of a given length tend to be more similar to each other than to random sequences or to sequences flanking microsatellites of different lengths. Thus, there appears to be convergent evolution. Finally, we compared large numbers of homologous flanking sequences between humans and chimpanzees, and found evidence that mutation rates near microsatellites tend to be somewhat elevated.
Sequences surrounding (AC)n tracts exhibit remarkable levels of patterning, with any given dinucleotide motif tending to be much more likely to occur at even numbered positions rather than odd, or vice versa. For several reasons, we believe that the patterning arises due to the structural properties of the microsatellite (see below), becoming more pronounced as repeat number increases. These reasons include the following: the consistently central placement of microsatellites within the patterning, the dependence of the strength of patterning on AC repeat number, the similarity between microsatellites in LINE and SINE elements and those elsewhere, the weakness of the patterning around (AC)2, and the strong influence of cassette type on the form of patterning. Unfortunately, it is surprisingly difficult to eliminate the alternative hypothesis, namely that the patterning arises due to some other force and that AC repeats then either form or expand more rapidly when placed centrally within the pattern. This ambiguity is particularly relevant to the question of cassette distribution, where it seems reasonable both that (AC)n tracts might cause biased interconversion between cassettes and that certain cassettes may allow slippage more than others. For example, while the structural properties of AC repeats are known to generate mutational biases in adjacent bases (Timsit 1999) capable of changing cassette type, minisatellite mutation rate can depend critically on the presence of a particular base in the flanking sequence (Monckton et al. 1994).
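The odd/even patterning described above can be illustrated with a minimal sketch. It is not the analysis pipeline used in this study: the function names are hypothetical, and it assumes that each flanking sequence is supplied as a string whose offset 0 is the base immediately adjacent to the (AC)n tract.

```python
from collections import Counter


def motif_parity_counts(flank, motif="AT"):
    """Count motif occurrences that start at even vs odd offsets.

    Assumption: offset 0 is the base immediately adjacent to the
    microsatellite, so parity is measured relative to the repeat tract.
    """
    counts = Counter(even=0, odd=0)
    for i in range(len(flank) - len(motif) + 1):
        if flank[i:i + len(motif)] == motif:
            counts["even" if i % 2 == 0 else "odd"] += 1
    return counts


def parity_bias(flanks, motif="AT"):
    """Return (even - odd) / (even + odd) pooled over many flanks.

    A value near 0 indicates no positional preference; values near +1 or -1
    indicate that the motif is concentrated at even or odd offsets.
    """
    total = Counter(even=0, odd=0)
    for flank in flanks:
        total.update(motif_parity_counts(flank, motif))
    n = total["even"] + total["odd"]
    return (total["even"] - total["odd"]) / n if n else 0.0


if __name__ == "__main__":
    # Toy input only; real flanks would be extracted from genome sequence.
    print(motif_parity_counts("ATATGAT"))
    print(parity_bias(["ATATGAT", "GATCAT"]))
```

For the toy flank "ATATGAT", the motif AT starts at offsets 0, 2 and 5, giving two even and one odd occurrence. A pooled index of this kind says nothing about cause, so it cannot by itself separate the two hypotheses discussed above.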
The relationship between an AC microsatellite and its flanking sequences begins surprisingly early, with (AC)2 already showing a small but significant bias in the distribution of cassette types and greater similarity to other sequences flanking AC microsatellites than to random sequences. In addition, the moving window assignment test indicates that significant convergence exists even when the ten bases closest to the microsatellite are excluded. Such a wide influence around such a common, short motif is remarkable and suggests that a high proportion of the genome may be affected by these and similar forces. To illustrate, (AC)2 is expected to occur about once every 250 bases, as is (GT)2, so that one or the other is expected about once every 125 bases. Taking the sphere of influence on each side as ten bases plus half the 25-bp window (22.5 bases) yields a total of 45 bases per tract. This predicts that over 30% (approximately 45 bases of every 125) of the genome will be affected by (AC)2 on one strand or the other, a figure that will only increase with inclusion of longer arrays and other microsatellite motifs.
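The arithmetic behind this estimate can be written out as a short sketch. It assumes equal base frequencies and independence between sites, so that a four-base motif is expected once every 4^4 = 256 bases (rounded to 250 in the text), and it ignores overlap between neighbouring spheres of influence. The function and parameter names are illustrative only.

```python
def expected_spacing(motif_length, alphabet_size=4):
    """Expected distance between occurrences of a fixed motif.

    Assumes equal base frequencies and independent sites.
    """
    return alphabet_size ** motif_length


def fraction_affected(excluded=10, window=25, motif_length=4, strands=2):
    """Rough fraction of the genome within reach of a short tract.

    excluded : bases nearest the tract that were excluded in the test
    window   : width of the moving window (half of it is counted per side)
    strands  : 2 because (GT)2 is (AC)2 read on the opposite strand
    """
    per_side = excluded + window / 2               # 22.5 bases
    sphere = 2 * per_side                          # 45 bases per tract
    spacing = expected_spacing(motif_length) / strands   # 128 bases
    return sphere / spacing


if __name__ == "__main__":
    print(f"{fraction_affected():.3f}")   # 45 / 128
    print(f"{45 / 125:.3f}")              # with the rounded spacing of 125
```

With the unrounded spacing the estimate is 45/128, about 35%, and with the rounded spacing it is 45/125 = 36%. Either way the figure exceeds 30%.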
As AC repeat number increases, so does the strength of patterning, becoming pronounced by (AC)5 and peaking in strength at (AC)9. Although patterning is seen in several different dinucleotide motifs, even in the human genome there are insufficient data to study any but the commonest cassette–motif combinations over a wide range of microsatellite lengths. Focusing on the motif AT, we found evidence that the strongest patterning was due to the development of AT microsatellites abutting AC tracts. However, this is not the only effect. After removal of all (AT)2 or longer microsatellites, there remains a significant tendency for single AT motifs to appear in phase with AC tracts, suggesting that mutation bias as well as slippage is involved.
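The logic of removing (AT)2 or longer tracts before counting single AT motifs can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the helper names are hypothetical, offset 0 is taken to be the first flanking base after the AC tract, and "in phase" is taken to mean that the motif starts at an even offset.

```python
import re


def mask_at_repeats(flank):
    """Replace every run of two or more AT repeats with N characters.

    Masking rather than deleting keeps the offsets of the remaining
    bases unchanged, which matters because phase depends on position.
    """
    return re.sub(r"(?:AT){2,}", lambda m: "N" * len(m.group(0)), flank)


def single_at_phase_counts(flank):
    """Count isolated AT motifs in phase and out of phase with the tract.

    Assumption: an AT starting at an even offset continues the
    dinucleotide frame of the adjacent (AC)n tract.
    """
    masked = mask_at_repeats(flank)
    in_phase = out_of_phase = 0
    for m in re.finditer("AT", masked):
        if m.start() % 2 == 0:
            in_phase += 1
        else:
            out_of_phase += 1
    return in_phase, out_of_phase


if __name__ == "__main__":
    print(mask_at_repeats("ATATGGATCAT"))
    print(single_at_phase_counts("ATATGGATCAT"))
```

In the toy flank "ATATGGATCAT", the initial ATAT is masked, leaving one AT at offset 6 (in phase) and one at offset 9 (out of phase). An excess of in-phase counts across many flanks would correspond to the residual tendency described above.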
Given the increase in strength of patterning between (AC)2 and (AC)9, it might seem logical that the pattern would become stronger and stronger as repeat number increases further. Instead, (AC)9 appears to be the peak strength, with longer microsatellites showing lower amplitude but a broader spread of patterning. It is interesting that this peak coincides with the length at which microsatellites begin to become polymorphic: a common rule of thumb for marker development in mammals is that primers are designed for loci carrying ten or more repeats (Weber 1990). This may be mere coincidence or may reflect, for example, a change in mutation process associated with individuals who are heterozygous for alleles carrying different repeat numbers (Rubinsztein et al. 1995; Amos et al. 1996; Amos and Harwood 1998). Again, there are parallels with minisatellites, where many mutations occur by the transfer of material from one homologous chromosome to the other (Jeffreys et al. 1995).
The rich patterning we find presumably arises through local mutation biases. Previous work on mutation biases has tended to reveal either generic effects such as isochores (Bernardi 2000), where some bases are favoured over others in large regions of the genome, or specific but highly localised biases where one or two bases may influence what happens to their immediate neighbours (Blake et al. 1992; Morton et al. 1997; Goodman and Fygenson 1998; Zavolan and Kepler 2001). The patterns we find suggest a somewhat intermediate process in which mutational dependency appears to extend over distances of 30 bases or more. At the same time, the patterning is position dependent, in that it involves not just, for example, a favouring of A over other bases, but, instead, a favouring of A over other bases at even numbered sites.
The actual mechanism that causes patterning remains unclear, but our data suggest a model based on the structural properties of AC repeat tracts. Local variation in DNA structure is known to be associated with mutational biases (Morton et al. 1997) and variation in mutation rate (Petruska and Goodman 1985; Goodman and Fygenson 1998), as well as possibly influencing the mismatch repair process (Werntges et al. 1986; Marra and Schar 1999). Tracts of repeating AC motifs tend to exhibit unusual structural properties with high propeller twist and shifted base pairing (Timsit 1999), and hence may be considered prime candidates for sequences capable of influencing the evolution of their immediate surroundings. Indeed, crystallographic studies indicate that sequences like (AC)n and (A)n induce local mutation biases (Timsit 1999).
The unusual structure of microsatellite DNA may generate mutational biases in at least two ways. First, in AC repeat tracts, each base interacts unusually strongly with the neighbour of its complement base in a way that may lead to misincorporation of incoming nucleotides toward the ends of the microsatellite or in the immediate flanking region. Second, AC tract structure may influence the efficiency of the mismatch repair machinery in correcting either noncomplementary bases or loops resulting from slipped strand misalignment of repetitive DNA. Given that the mismatch repair system is strongly implicated in moderating the otherwise high rates of slippage mutation at microsatellite loci (Levinson and Gutman 1987; Schlötterer 2000), it seems possible that even a small bias in the repair of loop structures might be responsible for the patterning we observe. However, although variation in mismatch repair efficiency may depend to some extent on DNA structure, the effect of sequence context on repair is not well understood (Marra and Schar 1999). Unfortunately, with current understanding, none of these mechanisms would generate mutation biases that extend tens of bases away from the microsatellite, and hence this aspect must await further research.
An alternative explanation for some of the patterning, for example, the tendency for single AT motifs to lie in phase with the microsatellite, could be that these elements represent the remnants of a longer and now eroded (AC)n repeat tract. Under this scenario, point mutations at specific positions along the microsatellite would presumably interrupt the repeats. Given a strong bias toward transition mutations, we can explain both the existence of a strong AT pattern, with C to T transition mutations dominating over C to R (purine) or A to Y (pyrimidine) transversion mutations, and also the increase in pattern strength around longer microsatellites, with interruptions in longer arrays more likely to be internal to the repeat tract and hence to be excluded from the analysis. However, we suggest that this model is unlikely for two reasons. First, such a model fails to accommodate the strong asymmetry in patterning that is observed for some dinucleotides and specific cassette bases around the (AC)n repeat tract. Polarity has been noted for minisatellite mutations, with mutational processes differing between the two ends of the repeat tract (Armour et al. 1993; Jeffreys et al. 1994), but a microsatellite is much simpler in structure than a minisatellite and any polarity would have to affect some dinucleotides but not others. Second, the commonest and strongest patterning is observed for the dinucleotide AT, and this would require high rates of C to T transitions but effectively no A to G transitions. More generally, the microsatellite erosion model predicts that flanking sequence patterning should be dominated by purine/pyrimidine, and this is not the case (see Table 2).
The patterning we describe appears to represent an important component of the forces that shape genome evolution, both in terms of its ubiquity and the absolute strength of its effect. It follows that there are many possible practical and theoretical implications. For example, even very short microsatellites appear able to cause some level of convergent sequence evolution, and hence to confound phylogenetic analyses. Similarly, microsatellites near genes may increase local mutation rates and influence the spectrum of new mutations that arise. To explore the size of these effects we designed experiments both to measure absolute convergence and to ask about evidence for changes in mutation rate.
To measure convergence, we made various comparisons between blocks of 50 bases that were chosen randomly, that lay next to a microsatellite, or that lay near a microsatellite. We found an ordered progression of similarity from 12.77/50 bases for random–random through to a maximum of 14.31/50 bases between blocks adjacent to microsatellites 7–10 repeats long, an increase of 12% similarity. Although modest, the trends are highly significant, with all comparisons showing a dependency on microsatellite length that peaks at around 7–10 repeats. The most parsimonious explanation for these similarities is that sequences flanking AC microsatellites tend to be AT-rich and to exhibit increased simplicity. Both these characteristics would increase the chance of flanking sequences being unusually similar both to each other and to random sequences that may contain polyA tails or other sources of simplicity. At the same time, the high scores gained by (AC)7–10 both for assignment to their own class and for similarity to each other relative to random blocks provide a clear indication that convergent sequence evolution is occurring. Interestingly, any given flanking sequence tends to be more similar to a block 500 bases away than to a similarly placed block near a different microsatellite, suggesting longer range patterning such as might arise through placement within the same isochore (Bernardi 2000). Furthermore, our attempts to measure variation in mutation rate indicate reduced similarity between homologous human and chimpanzee sequences, implying a higher rate of evolution, at least for a region on the order of ten bases around the microsatellite. On a scale of blocks of 20 bases the trends are less convincing. Having said this, it seems likely that any genuine variation in mutation rate would be to some extent masked by the convergent evolution, and hence that this aspect would benefit from further investigation.
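The block comparison can be illustrated with a minimal sketch, assuming that similarity is scored as the number of matching bases between two equal-length blocks compared position by position without alignment. This scoring rule and the function names are assumptions made for illustration; the sketch also reproduces the relative increase quoted above from the two reported means.

```python
def block_similarity(a, b):
    """Number of positions at which two equal-length blocks match."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return sum(x == y for x, y in zip(a, b))


def mean_pairwise_similarity(blocks):
    """Average similarity over all unordered pairs of blocks."""
    total = 0
    pairs = 0
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            total += block_similarity(blocks[i], blocks[j])
            pairs += 1
    return total / pairs if pairs else 0.0


def relative_increase(baseline, observed):
    """Fractional increase of observed over baseline."""
    return (observed - baseline) / baseline


if __name__ == "__main__":
    # Reported means: random-random vs blocks flanking (AC)7-10.
    print(f"{relative_increase(12.77, 14.31):.1%}")
```

Under equal base frequencies, two unrelated 50-base blocks would be expected to match at 50 × 0.25 = 12.5 positions, which is close to the random–random value of 12.77. The difference between 12.77 and 14.31 is 1.54 bases, or about 12% of the baseline.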
In conclusion, previous studies of microsatellite flanking sequences have identified several features, including a tendency to harbour other microsatellites, a locally increased mutation rate, and, conversely, conservation over unexpectedly large tracts of evolutionary time. Our analyses support all these trends and provide a possible resolution for the apparent contradiction between faster evolution but at the same time greater sequence conservation. Although there is evidence that mutation rates near microsatellites are elevated, we also find evidence of convergent evolution. Consequently, the increased rate of change may be to some extent neutralised and perhaps even reversed by the tendency for similar changes to occur in related lineages. Furthermore, the greatest changes appear to occur in flanking sequences around microsatellites that are below the length used as markers, at least in humans. Overall, therefore, we have been able to formalise previous anecdotal evidence and hence to document a remarkably widespread source of directional change and nonrandom evolution that undoubtedly plays an important role in shaping the make-up of our genomes.