Non-dS measures of mutation are highest in closed chromatin
In order to determine whether background mutation rates are associated with chromatin structure we first determined intergenic divergence rates, using human versus chimpanzee whole genome alignments, in regions whose chromatin environment in human lymphoblastoid cells had been determined. The majority of intergenic bases should be under little or no selection and therefore intergenic divergence should be approximately analogous to background mutation rates. As shown in Figure 1A, we found a negative correlation between intergenic divergence and chromatin structure at non-CpG sites. As open chromatin is generally more gene rich than closed (and may therefore contain more regulatory elements than intergenic regions) we also examined divergence rates in ancient repeats only. However, these also displayed the lowest divergence rates when in open chromatin (Figure 1B).
It has previously been proposed that DNA sequences nearer the centre of the nucleus may be protected from DNA damage by those on the periphery (the "bodyguard hypothesis"). Likewise, the chromosomes most enriched with open chromatin are generally situated towards the centre of a nucleus . The correlation observed between divergence rates and chromatin structure may therefore be an indirect result of these phenomena. We therefore investigated whether a correlation between intergenic divergence and chromatin structure could be observed within chromosomes. Although chromosomes themselves have been shown to display some level of polar organization (such that their most gene-poor regions are those closest to the nuclear periphery ) adjacent intergenic regions within chromosomes often have very different chromatin structures despite displaying approximately the same nuclear localisation. If the observed correlation between intergenic divergence and chromatin structure reflects the predictions of the bodyguard hypothesis we would expect to see no such correlation within chromosomes. This, however, is not the case. For example, as shown in Figure 1C, there is a significant negative correlation between intergenic divergence and chromatin structure within chromosome 1 (r2 = 0.053; p = 0.0043). The two outlier clones observed in this figure, with a divergence greater than 0.025, could represent mutational hotspots in the genome. However, the degree of difference between the divergence observed in these clones compared with the rest of the chromosome suggests to us that the alignments in these regions are more likely be of poor quality. Removal of these clones increases the significance of the correlation observed between divergence and chromatin structure (r2 = 0.113; p = 2.5e-05). In total 7 out of 22 chromosomes display a significant negative correlation (p
Another measure often used to predict mutation rate is SNP density [8,9]. It is predicted that as a large proportion of intergenic sequence is non-functional and that there has been little time for selection to act on SNPs, their density along the genome should generally reflect underlying mutation rates. A further benefit of the use of SNPs in this way is that mutation rates can be predicted without relying on sequence comparisons with other species. We consequently determined the mean intergenic SNP densities observed across chromatin categories. As shown in Figure 1D the mean SNP density was also lowest in the most open regions of the genome.
There is therefore strong evidence that mutation rates are associated with chromatin structure. Not only are intergenic, intronic (Figure 2A) and ancient repeat divergence rates highest in closed chromatin but the density of SNPs is also elevated in the most closed regions of the human genome. Thus we hypothesise that closed regions of the genome are simply less accessible to DNA repair mechanisms.
It should be noted however that chromatin structure is likely to be only one of several factors associated with neutral divergence rates in the human genome. This is most apparent on chromosome 19, and to a lesser extent chromosome 8, which show substantially higher mean intergenic divergence rates in our analysis than the other autosomes. Whereas chromosome 19 and chromosome 8 display mean intergenic divergences of 1.5% and 1.3% respectively, the divergence rates of all other autosomes fall between 1 and 1.2%. As chromosome 19 is particularly enriched with open chromatin , its high divergence levels are contrary to what is generally observed across the autosomes. The high levels of divergence observed along chromosome 19 are consequently likely to be a result of factors other than chromatin structure.
Gene distribution and chromatin structure
As shown in figure 3A, housekeeping genes are generally located in the more open regions of the genome and tissue-specific genes in the most closed regions. This is in agreement with a previous analysis that illustrated that nucleosome formation potential is negatively correlated with expression breadth . Consequently a recent study by Gazave et al. , that showed that the levels of human-chimpanzee divergence observed in the introns of housekeeping genes is significantly lower than in other genes, is in broad agreement with the analysis presented here. Although CpG dinucleotides were not excluded in Gazave et al's analysis this is only likely to have led to an increase in the estimation of divergence in housekeeping genes due to the enrichment of CpG dinucleotides in open chromatin. However, intriguingly, when human-mouse alignments are examined, the introns of tissue-specific genes have been shown to contain a greater proportion of conserved sequence than those of housekeeping genes  (in contradiction to what is observed in human-chimpanzee alignments). We believe this apparent discrepancy is likely to be the result of the difference in evolutionary distance investigated, with the examination of human-mouse alignments potentially leading to the identification of regions under (stabilising) selection. For example, we may expect that closed regions of the genome contain more DNA elements involved in regulating the surrounding chromatin structure whose conservation becomes apparent across larger evolutionary distances.
Through the use of the DAVID program  that determines those biological terms and annotations (for example GO terms) enriched among a set of genes, we identified further classes of genes most over-represented in closed chromatin, and therefore likely to be experiencing the highest mutation rates. Of the 148 genes in the most closed regions of the genome, 40 encode glycoproteins (p for enrichment: 0.000074) and 22 were associated with the G-protein coupled protein signaling pathway (p = 0.00011). Glycoproteins and G-protein coupled receptors are involved in immune response and cell signaling and it has previously been proposed that such genes are likely to evolve quickly in response to changing stimuli . Being located in closed regions of the genome (where we have observed background mutation rates (intergenic divergence and SNP density) are particularly high) will allow this more rapid evolution. Housekeeping genes, on the other hand, that are enriched in open chromatin, have previously been shown to evolve relatively slowly . The location of a gene in the genome and its subsequent local chromatin structure may therefore, at least partly, be governed by the suitability of the local mutation rate it confers.
dS, unlike dN and dN/dS, is highest in regions of open chromatin
dS has historically been used as a further surrogate measure of basal mutation rates, as synonymous sites were believed to be under little or no selection. Changes at synonymous sites, unlike at non-synonymous sites, do not affect the encoded amino acid. In addition, due to the relatively small effective population sizes of mammals, a synonymous site would have to experience relatively strong selection to evolve in a non-neutral manner . As shown in Figure 4A, the average rate of non-synonymous changes (dN) observed in human mouse alignments is 51% higher in the most closed chromatin regions of the genome than in the most open regions. Similarly, the ratio of non-synonymous to synonymous substitution rates (dN/dS), which is frequently used as a measure of selection, is 61% higher (Figure 4B). However, the average synonymous rate (dS) for genes in relatively open chromatin is higher than that for genes in a more closed chromatin structure (Figure 4C). This is consistent with the reported high Ks for human chromosome 19, the human chromosome with one of the most open chromatin structures of all . The observation by Hurst et al. of similar levels of human-mouse dS, dN and dN/dS in linked genes is likely therefore to be the result of linked genes being from similar chromatin environments. To ensure the converse is not true, and that the results observed in this study are not the result of linked genes, we randomly selected only one gene from each clone (so that all genes analysed were approximately 1 Mb apart and therefore unlinked). With this selection strategy we still observed similar correlations to those shown in Figure 3 (not shown).
Although we would expect the enrichment of housekeeping genes in relatively open regions of the genome as shown in Figure 3A (as open chromatin is likely to provide a more conducive environment for transcription), the lower average dN/dS observed in open chromatin may simply be a consequence of this higher number of housekeeping genes (which are known to evolve slowly) in these regions. The exclusion of housekeeping genes from the analysis, however, has little effect on the correlations in Figure 4 (not shown). Even the exclusion from the analysis of all genes whose 5' end is associated with a CpG island (which includes almost all housekeeping genes  and that are also enriched in open chromatin, Figure 3B) does not lead to the loss of the correlations between chromatin structure and dN, dS and dN/dS. In fact the rate of dN in CpG island genes, unlike that in genes not associated with a CpG island, is relatively constant across chromatin categories and does not show a significant correlation with chromatin. Consequently selection appears to maintain similar levels of dN in genes associated with a CpG island irrespective of their local chromatin structure.
To ensure these results were not confounded by CpG associated or sex chromosome specific factors (sex chromosomes have been shown to display abnormal rates of divergence when compared to the autosomes ), we calculated divergence rates at non-CpG, fourfold degenerate sites in genes on autosomes only. We also used human-chimp alignments instead of human-mouse alignments as the chromatin structure of the chimp genome should be more similar to that in humans (and consequently the species of origin for each change is less important). However, as shown in Figure 2A, the highest rates of divergence are still observed in genes from the most open regions of the genome.
Genes in closed chromatin display the highest levels of selection at synonymous sites
Although historically the synonymous substitution rate (dS or Ks) has been used as a measure of the rate of mutation, there is increasing evidence that selection may be occurring at synonymous sites . To investigate the potential role of any selection on synonymous sites in the disparity between dS and other measures of mutation, we analysed the rates of divergence observed across intron-exon boundaries . As shown in Figure 2, the rates of intronic divergence in open regions of the genome are comparable to those observed at corresponding exonic, fourfold degenerate sites. This would suggest that genes in open chromatin display little if any evidence for selection at their synonymous bases. However, genes in closed chromatin display markedly higher rates of divergence at their intronic sites than at corresponding fourfold degenerate sites. Genes in closed chromatin therefore, unlike those in open, display strong evidence for synonymous site selection.
Although the rate of selection against both synonymous transitions and transversions is highest in closed chromatin, only the rate of synonymous transitions is strongly positively correlated with chromatin structure (Figure 5A). The rate of transversions at fourfold degenerate sites shows no obvious trend across chromatin categories (Figure 5B) and consequently selection against transversions, unlike transitions, appears to be independent of any factors associated with chromatin structure. We are not aware of any reason for the observed association between rates of transitions at non-CpG fourfold degenerate sites and chromatin structure, but it could reflect constraint in motifs whose distribution are not uniform across the genome.
As previously shown, open regions of the genome are particularly gene dense whereas closed regions are relatively gene poor . Consequently, the use of dS as a measure of mutation rate may be appropriate for a large proportion of genes. However, the use of dS as a surrogate measure of mutation rate for genes in closed chromatin will lead to the under-estimation of the true mutation rate in these regions and also the miscalculation of the levels of selection when used to measure dN/dS.
Exonic Splice Enhancers and RNA secondary structure
It has been proposed that synonymous sites may experience constraint because they play a role in controlling splicing or RNA stability . For example, synonymous sites may be part of an exonic splice enhancer (ESE) motif or lead to a more stable base-paired RNA that is less susceptible to degradation. Although codon usage bias (resulting from unequal abundances of tRNAs and subsequent selection at synonymous sites in favour of codons corresponding to the most abundant tRNAs) has also been proposed as an explanation of synonymous site selection, the evidence for this in mammals is weak . We therefore looked at the distribution of each predicted ESE motif across chromatin categories to see if their relative densities could explain the high levels of synonymous selection in closed chromatin. The density of a large proportion of ESE hexamers (44%) displayed a significant negative correlation with chromatin structure. However, given the base composition of ESE hexamers and coding regions across chromatin categories, we actually observed far fewer hexamers displaying a negative correlation than we would expect by chance (66%). This is because coding sequence base composition is itself correlated with chromatin structure and ESEs also show biases in their base composition. As shown in Figure 5A, excluding all sites from coding regions that overlap a predicted ESE hexamer leads to only a small increase in the rate of transitions observed at fourfold degenerate sites. Consequently, either there are many ESE motifs that are yet to be determined, or selection at synonymous sites is at most only partly the result of exonic splice enhancers.
We also compared the distribution of gene types across chromatin categories. If genes whose RNA structure is important were preferentially located in closed chromatin we may expect an over-representation of non-protein coding genes in closed regions. As shown in figures 5C+D, certain classes of non-protein coding genes are indeed over-represented in closed chromatin (rRNAs and snRNAs), while the distribution of other types of genes such as miRNAs and snoRNAs show no relationship with chromatin structure.
Further analysis is therefore required to determine why protein coding genes in closed regions of the genome display such comparatively high levels of selection at their synonymous sites. If it is indeed because of a requirement for a more stable secondary structure, then we may expect that the predicted stability of mRNAs from closed regions would be greater than those in open . Future tests of this kind may help determine the reasons behind the enrichment of selection at synonymous sites in closed chromatin observed in this study.