Dataset. Our dataset of (AC)n dinucleotide repeats was extracted from the human genome (build 33, NCBI Reference Sequences; NCBI, Bethesda, Maryland, United States) using a custom macro written in Visual Basic. Only microsatellites separated by at least 100 bp from the nearest (AC)2 or longer repeat were included in the dataset. Thus neither repeat in (AC)2AT(AC)10 would be included, whereas ACAT(AC)10 would be included as (AC)10. Flanking sequences are here defined as the 50 bases lying on either side of a microsatellite. No attempt was made to convert TG repeats into the complementary AC repeats on the opposite strand. Consequently, all our microsatellites are 5′-(AC)n-3′.
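The isolation rule can be illustrated with a minimal Python sketch. This is not the Visual Basic macro used for the study; the function name `isolated_ac_repeats`, its parameters, and the requirement that a full 50-base flank be available on both sides are assumptions made for illustration.

```python
import re

# Any run of two or more AC units counts as a repeat, as in the text.
AC_RUN = re.compile(r"(?:AC){2,}")

def isolated_ac_repeats(seq, min_gap=100, flank=50):
    """Return (AC)n runs lying at least `min_gap` bp from any other (AC)2+ run.

    Hypothetical helper: each result holds the start offset, the repeat
    number, and the `flank` bases on either side of the run.
    """
    seq = seq.upper()
    runs = [(m.start(), m.end()) for m in AC_RUN.finditer(seq)]
    kept = []
    for idx, (start, end) in enumerate(runs):
        # Distance to the previous and next runs, if they exist.
        if idx > 0 and start - runs[idx - 1][1] < min_gap:
            continue
        if idx + 1 < len(runs) and runs[idx + 1][0] - end < min_gap:
            continue
        # Assumption: skip runs too close to the sequence ends for full flanks.
        if start < flank or len(seq) - end < flank:
            continue
        kept.append({
            "start": start,
            "repeats": (end - start) // 2,
            "left_flank": seq[start - flank:start],
            "right_flank": seq[end:end + flank],
        })
    return kept
```

With this sketch, a sequence containing ACAT(AC)10 yields a single (AC)10 record, because the lone AC before AT is not a repeat, whereas (AC)2AT(AC)10 yields nothing, because the two runs lie only 2 bp apart.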
Flanking sequence base composition. For each frequency estimate, 95% confidence intervals were derived based on the binomial distribution (n ≫ 200 observations). Bases used to define cassette type were excluded from all calculations, and expected frequencies were taken as the average frequency across all positions.
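As a rough illustration of a binomial 95% confidence interval for a base frequency, the sketch below uses the normal approximation to the binomial. The text does not say whether an exact or approximate interval was computed, so the choice of approximation and the function name `binomial_ci_95` are assumptions.

```python
import math

def binomial_ci_95(successes, n):
    """Approximate 95% CI for a proportion (normal approximation, an assumption)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    # Standard error of a binomial proportion, scaled by the 97.5% normal quantile.
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Example: a base seen 50 times among 200 sequences at one flanking position.
lower, upper = binomial_ci_95(50, 200)
```

The approximation is reasonable for the large sample sizes involved here; for small counts or frequencies near 0 or 1, an exact interval would be the safer choice.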
Convergence of flanking sequence pattern: assignment test. Microsatellites were divided into 21 classes according to repeat number. Class 1 was the control set, comprising 5,000 randomly selected, non-microsatellite-associated sequences from Chromosome 1. All other classes contained flanking sequences from single-length microsatellites, except class 21, which contained combined data from microsatellites 21–25 repeats long. Analysis was restricted to the most abundant cassette class, T/A, yielding sample sizes that peaked at 1,087 for (AC)5 and declined to 175 for (AC)20 (see Figure 2).
As an index of similarity, we calculated the log likelihood of observing a given sequence based on its position-specific dinucleotide motif composition:

$$A_k = \sum_{j} \log f_{ijk}$$

where $f_{ijk}$ is the frequency of dinucleotide i, as observed in the sequence at position j (… −1 or 0, with position 0 including the microsatellite and its cassette bases), in flanking sequences of class k. To avoid bias, when a sequence was compared with its own class, its contribution to the dinucleotide frequencies was first removed. For each sequence in turn, $A_k$ was calculated for every class and the sequence was then assigned to the class that yielded the highest index value. Under convergent evolution, we expect sequences to tend to be assigned to their own or similar length classes.
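The assignment procedure can be sketched in Python as follows. This is an illustrative reading of the description, not the original implementation: the function names, the treatment of each flank as one contiguous string, and the pseudocount used to avoid taking the log of zero are all assumptions, since the text does not say how unobserved dinucleotides were handled.

```python
import math
from collections import Counter

def dinucleotide_counts(seqs):
    """Count dinucleotides at each position across equal-length sequences."""
    n_pos = len(seqs[0]) - 1
    counts = [Counter() for _ in range(n_pos)]
    for s in seqs:
        for j in range(n_pos):
            counts[j][s[j:j + 2]] += 1
    return counts

def log_likelihood(seq, counts, n_seqs, exclude_self=False, pseudocount=0.5):
    """Sum over positions of the log frequency of the observed dinucleotide."""
    own = 1 if exclude_self else 0  # remove the sequence's own contribution
    total = 0.0
    for j in range(len(seq) - 1):
        c = counts[j][seq[j:j + 2]] - own
        # Pseudocount over the 16 possible dinucleotides (an assumption).
        total += math.log((c + pseudocount) / (n_seqs - own + 16 * pseudocount))
    return total

def assign(seq, own_class, class_counts, class_sizes):
    """Assign `seq` to the class with the highest index value."""
    scores = {
        k: log_likelihood(seq, class_counts[k], class_sizes[k],
                          exclude_self=(k == own_class))
        for k in class_counts
    }
    return max(scores, key=scores.get)
```

Here `own_class` must be the class from which `seq` was drawn, so that subtracting its own counts is valid. Tabulating assigned class against true class over all sequences then shows whether assignments cluster around the true repeat number.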
Convergence of flanking sequence pattern: quantifying sequence change. Sequences were again divided into length classes 2 to 21, and each sequence contributed four 50-bp blocks of sequence: one from each side immediately adjacent to the microsatellite but excluding the cassette bases (class 1), and one from each side displaced distally by a randomly selected distance of 500–600 bases (class 2). We also generated a database of 5,000 non-microsatellite-associated sequences. When making comparisons within a class, nonindependence was avoided by randomising the sequence order and then comparing sequence 1 with sequence 2, 2 with 3, …, (n − 1) with n. Our index of similarity was simply the number of matching bases between the two aligned 50-bp blocks. A few pairs of sequences (less than 0.1%) gave high similarity scores of over 30/50 matching bases, presumably because these loci have been duplicated or lie in repetitive elements. Such sequences were discarded. As with all other analyses, sequences containing base ambiguities (marked as base N) were also discarded.
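The pairwise comparison scheme can be sketched as below. The function names are hypothetical, and for simplicity the sketch drops only the score of a pair exceeding 30 matches rather than removing the underlying sequences from the dataset, which is an assumption about how the discarding step might be approximated.

```python
import random

def matching_bases(a, b):
    """Count positions at which two aligned blocks carry the same base."""
    return sum(x == y for x, y in zip(a, b))

def consecutive_pair_scores(blocks, rng, max_score=30):
    """Shuffle blocks, then score sequence 1 vs 2, 2 vs 3, ..., (n - 1) vs n."""
    # Discard blocks containing ambiguous bases before pairing.
    usable = [b for b in blocks if "N" not in b]
    rng.shuffle(usable)
    scores = []
    for first, second in zip(usable, usable[1:]):
        score = matching_bases(first, second)
        # Very high scores suggest duplicated loci or repetitive elements.
        if score <= max_score:
            scores.append(score)
    return scores

# Usage: pass a seeded generator so the random ordering is reproducible.
# scores = consecutive_pair_scores(blocks, random.Random(1))
```

Pairing each sequence only with its neighbours in the shuffled order gives n − 1 comparisons rather than all n(n − 1)/2, which keeps any one sequence from dominating the similarity distribution.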
Rate of evolution around microsatellites. (AC)n repeat microsatellites were extracted from the available chimpanzee finished-quality high-throughput genomic sequence (NCBI) as outlined above for humans. A 300-base region 220 bases upstream from each chimpanzee microsatellite was used by Megablast (Win32 version 2.2.6, NCBI) to identify homologous human loci in the finished genome sequence. Sequences with multiple high-scoring hits were discarded, as they presumably occur because a locus is found in repetitive elements or has been duplicated. Those nonoverlapping hits with at least 280/300 matching bases and an expectation (e-value) greater than five times that of any other hit to the same sequence were thus retained, giving a dataset of 5,537 sequences.