The abundance of open chromatin fibre structure in lymphoblastoid cells, at clones spaced approximately 1 Mb apart along the human genome, was determined as previously described [1]. Relative chromatin structure was represented in this analysis by log2(open chromatin:input chromatin) values (determined by cohybridising differentially labelled "open" and input chromatin fragments to a human genomic DNA microarray). A large log2(open:input) value in this analysis indicates a region enriched with open chromatin (see Gilbert et al. for further details). Clones with similar log2(open:input) values were binned for analysis (with bin sizes adapted to the amount of data available). The 2,787 human protein-coding genes that mapped to these clones and their corresponding mouse orthologues were obtained from Ensembl (unique best reciprocal hits were taken where possible, and otherwise reciprocal hits based on synteny). Coding sequence alignments of each of these orthologous pairs were derived via protein alignments (using the MUSCLE [26] and tranalign [27] programs). The codeml program of the PAML package [28] was used to calculate dN, dS and dN/dS using the F3 × 4 codon evolution model. Gene pairs with anomalously high dS values (> 1.270, i.e. twice the median dS of all human vs. mouse pairs) were excluded [29].
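The dS exclusion rule described above can be illustrated with a minimal Python sketch. It assumes that the median is taken over whatever set of pairs is supplied; the function name `filter_by_ds`, the gene identifiers and the dS values are hypothetical and are not taken from the study.

```python
from statistics import median


def filter_by_ds(pairs, factor=2.0):
    """Drop orthologue pairs whose dS exceeds `factor` times the median dS.

    `pairs` maps a (hypothetical) gene identifier to its dS estimate.
    Returns the retained pairs and the cutoff that was applied.
    """
    cutoff = factor * median(pairs.values())
    kept = {gene: ds for gene, ds in pairs.items() if ds <= cutoff}
    return kept, cutoff


# Made-up dS values, for illustration only.
example = {"geneA": 0.50, "geneB": 0.60, "geneC": 0.70, "geneD": 1.50}
kept, cutoff = filter_by_ds(example)
```

In this toy example the cutoff is approximately 1.3, so only `geneD` is excluded.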
Gene expression breadth was determined through the analysis of the Gene Expression Atlas Affymetrix U133A dataset of Su et al. [30]. Intensity levels were averaged across arrays derived from the same tissue and all tumour-derived arrays were excluded. A gene was defined as expressed if its mean signal level across all its corresponding probes exceeded that of the data set median [12]. To identify genes potentially associated with CpG islands, the positions of predicted CpG clusters were obtained from the UCSC genome browser [31]. Of these islands, any that were less than 500 bp long, had a G+C content of less than 55% or had a CpG to expected CpG ratio of less than 0.65 were excluded [32]. Those genes whose 5' end was within 2 kb of one of these islands were classed as potential CpG island genes.
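The CpG island criteria above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the `Island` record, the half-open coordinate convention, the reading of the G+C threshold as a percentage and the treatment of "within 2 kb" as a distance of at most 2,000 bp from the nearest island base are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Island:
    start: int         # 0-based start coordinate (assumed convention)
    end: int           # end coordinate, exclusive
    gc_percent: float  # G+C content as a percentage
    obs_exp: float     # observed/expected CpG ratio


def passes_filter(island, min_len=500, min_gc=55.0, min_obs_exp=0.65):
    """Apply the length, G+C content and CpG ratio thresholds."""
    return (island.end - island.start >= min_len
            and island.gc_percent >= min_gc
            and island.obs_exp >= min_obs_exp)


def is_cpg_island_gene(five_prime, islands, window=2000):
    """True if the gene's 5' end lies within `window` bp of a retained island."""
    for isl in islands:
        if not passes_filter(isl):
            continue
        if five_prime < isl.start:
            dist = isl.start - five_prime
        elif five_prime >= isl.end:
            dist = five_prime - (isl.end - 1)
        else:
            dist = 0  # the 5' end falls inside the island
        if dist <= window:
            return True
    return False


# Made-up islands: the second is too short and is filtered out.
islands = [Island(10000, 10600, 60.0, 0.70), Island(50000, 50300, 70.0, 0.90)]
```

With these made-up islands, a gene whose 5' end lies at position 9000 would be classed as a potential CpG island gene, whereas one at position 49900 would not, because the only nearby island fails the length threshold.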
Human-chimpanzee divergence was determined through the use of the chained and netted human-chimpanzee alignments available at the UCSC website (hg17-panTro1) [33]. Ensembl gene predictions were used to identify intronic, intergenic and protein-coding regions. All exclusively intergenic and intronic regions found within clones were identified, and divergence was measured in the corresponding sections of the human-chimpanzee alignment using PAML's baseml with the REV model [28]. Before calculating divergence, all sequence from the same chromatin category was concatenated in order to minimise the problems inherent in accurately measuring low divergence levels in regions of finite length. All bases that overlapped a CG dinucleotide in either species were removed from the alignments to conservatively calculate non-CpG rates of divergence [18].
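The removal of CpG-overlapping bases can be illustrated with a short Python sketch. It assumes a pairwise alignment held as two equal-length strings and treats a CG dinucleotide as a C immediately followed by a G in adjacent alignment columns; dinucleotides interrupted by gap characters are not handled. The function name `mask_cpg_columns` is hypothetical, and the divergence estimation itself (baseml with the REV model) is not reproduced here.

```python
def mask_cpg_columns(human, chimp):
    """Remove every alignment column that overlaps a CG dinucleotide
    in either of the two aligned sequences."""
    if len(human) != len(chimp):
        raise ValueError("aligned sequences must have equal length")
    drop = set()
    for seq in (human.upper(), chimp.upper()):
        for i in range(len(seq) - 1):
            if seq[i] == "C" and seq[i + 1] == "G":
                drop.update((i, i + 1))  # both bases of the dinucleotide
    keep = [i for i in range(len(human)) if i not in drop]
    return ("".join(human[i] for i in keep),
            "".join(chimp[i] for i in keep))


# Toy alignment: a CG in the human row (columns 1-2) and one in the
# chimpanzee row (columns 4-5) are both removed from both sequences.
masked_human, masked_chimp = mask_cpg_columns("ACGTTA", "ATGTCG")
```

Columns are dropped from both sequences whenever either species carries the dinucleotide, which is what makes the resulting non-CpG divergence estimate conservative.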
Intergenic repeats were identified through UCSC's RepeatMasker annotation. Ancient repeats were defined as in Gibbs et al. [29] and Taylor et al. [34] as repeats from the same RepeatMasker subfamily conserved between mouse and human in the same orientation. Simple repeats and regions of low complexity were excluded.
The SNP Consortium (TSC) data were used to calculate SNP density across chromatin categories [35]. To ensure these densities were not biased as a result of the variety of protocols used to detect SNPs (some of which were chromosome specific), SNP densities across chromatin categories were also calculated using only SNPs randomly identified via the TSCM0019 protocol (a panel of 24 DNAs sequenced by the Sanger Centre; for more details see [36]). The location of TSC SNPs was determined by mapping their ssIds to current rsIds via data available at dbSNP.
Predicted Exonic Splice Enhancer (ESE) hexamers were obtained from Fairbrother et al. [37]. The occurrence of each of these hexamers in the coding regions of each of the genes that mapped to a 1 Mb clone was determined. In order to identify the number of hexamers we would expect to detect by chance given the base composition of the genes and hexamers, we randomly shuffled the bases in each of the coding regions 100 times and recalculated the occurrence of each of the hexamers. The distribution of non-protein coding genes across chromatin categories was determined through Ensembl annotations.
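The shuffling procedure for estimating the chance occurrence of hexamers can be sketched in Python as follows. This is a minimal illustration, assuming that overlapping occurrences are counted and that the expectation is the mean count over the shuffles; the function names, the fixed random seed and the example hexamers are hypothetical and are not taken from Fairbrother et al.

```python
import random


def count_hexamers(seq, hexamers):
    """Count overlapping occurrences of each listed hexamer in `seq`."""
    counts = {h: 0 for h in hexamers}
    for i in range(len(seq) - 5):
        window = seq[i:i + 6]
        if window in counts:
            counts[window] += 1
    return counts


def expected_by_shuffling(seq, hexamers, n_shuffles=100, seed=0):
    """Mean count of each hexamer over `n_shuffles` random permutations
    of the bases in `seq`, which preserves base composition."""
    rng = random.Random(seed)
    bases = list(seq)
    totals = {h: 0 for h in hexamers}
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        shuffled_counts = count_hexamers("".join(bases), hexamers)
        for h, c in shuffled_counts.items():
            totals[h] += c
    return {h: total / n_shuffles for h, total in totals.items()}


# Made-up coding sequence and hexamers, for illustration only.
observed = count_hexamers("GAAGAAGAAG", ["GAAGAA", "AAGAAG"])
expected = expected_by_shuffling("GAAGAAGAAG", ["GAAGAA", "AAGAAG"])
```

Comparing `observed` with `expected` for each gene indicates whether a hexamer occurs more often than base composition alone would predict. Note that shuffling individual bases preserves composition but not codon structure.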
Authors' contributions
JGDP undertook initial study design, software implementation, statistical analysis and interpretation, and drafted the initial manuscript. NG and WAB determined the chromatin structure of the 1 Mb cloneset, participated in the study design and contributed to the final manuscript. HC, MGD and CAMS participated in the final study design, coordinated the study and contributed to the final manuscript.
Acknowledgements
JGDP is funded by an MRC Bioinformatics Research Studentship (G74/93). The work by MGD and HC is funded by grants from Cancer Research UK (C348/A3758), the Medical Research Council (G0000657-53203) and the Scottish Executive Chief Scientist Office (CZB/4/94). WAB, NG and CAMS are supported by the UK Medical Research Council. WAB is a James S. McDonnell Centennial fellow and is supported in part by FP6 through funding for the Epigenome Network of Excellence under contract LSHG-CT-2004-503433.