The genome is A+T-rich (77.57%) and of broadly uniform composition, apart from the more G+C-rich repeat-dense regions (Fig. 2). On a finer scale, nucleotide composition tracks the distribution of exons (see below). Amongst dinucleotides, CpG is under-represented, not just in absolute terms but relative to its isomer GpC (the former occurring only 62% as often as the latter). This bias normally reflects cytosine methylation at CpG sequences, promoting their mutation to TpG (which is over-represented relative to GpT by 38%). Hence, these observations suggest that cytosine methylation may occur in Dictyostelium, contrary to earlier findings11.
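The isomer comparison described above (CpG against GpC, TpG against GpT) can be illustrated with a minimal sketch. It assumes overlapping dinucleotides are counted on a single strand of an unambiguous A/C/G/T sequence; the function names are hypothetical and this is not necessarily the procedure used for the figures quoted above.

```python
from collections import Counter


def at_content(seq):
    """Fraction of bases that are A or T."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides on one strand."""
    seq = seq.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def isomer_ratio(seq, dinuc):
    """Ratio of a dinucleotide to its reversed isomer, e.g. 'CG' (CpG) vs 'GC' (GpC)."""
    counts = dinucleotide_counts(seq)
    isomer = dinuc[::-1]
    if counts[isomer] == 0:
        raise ValueError(f"isomer {isomer} does not occur in the sequence")
    return counts[dinuc] / counts[isomer]
```

On a genome-scale sequence, `isomer_ratio(seq, "CG")` well below 1 together with `isomer_ratio(seq, "TG")` above 1 would correspond to the pattern reported here.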
Simple sequence repeats are abundant and unusual
Simple sequence repeats (SSRs) are more abundant in Dictyostelium than in any other sequenced genome, comprising >11% of bases (Fig. SI 2). In non-coding sequence, tracts of dinucleotides or longer motifs occur every 392 bp on average and comprise 6.4% of the bases. There is a bias towards repeat units of 3–6 bases, whereas dinucleotide tracts predominate in most other genomes. Homopolymer tracts are also abundant, comprising a further 16% of non-coding sequence. The base composition of non-coding SSRs and homopolymer tracts (99.2% A+T) is even more biased than that of their surrounding sequence, suggesting that either selection or the mechanism of repeat expansion favours A+T-rich repeats.
Notably, SSRs are also abundant in protein-coding sequence, occurring on average every 724 bp within exons. We consider these coding SSRs in further detail below in the context of proteins.
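As an illustration of how SSR density and base coverage of the kind quoted above might be measured, the following sketch scans a sequence for perfect tandem repeats. The thresholds (units of 2–6 bases, at least three copies) and the function names are assumptions made for this example, not the criteria used in this study; homopolymer tracts are deliberately excluded, since they are treated separately above.

```python
def find_ssrs(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats, shortest unit first.

    Thresholds are illustrative assumptions. Units made of a single
    repeated base (homopolymers) are skipped.
    """
    seq = seq.upper()
    n = len(seq)
    tracts = []
    i = 0
    while i < n:
        found = None
        for k in range(min_unit, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k or len(set(unit)) == 1:
                continue
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                found = (i, unit, copies)
                break
        if found:
            tracts.append(found)
            i += len(found[1]) * found[2]  # jump past the tract just recorded
        else:
            i += 1
    return tracts


def ssr_fraction(seq, **kwargs):
    """Fraction of bases lying inside the tracts reported by find_ssrs."""
    covered = sum(len(unit) * copies for _, unit, copies in find_ssrs(seq, **kwargs))
    return covered / len(seq)
```

Dividing the sequence length by `len(find_ssrs(seq))` gives a mean spacing comparable in kind to the "every 392 bp" figure, although the value obtained depends strongly on the thresholds chosen.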
Transposable elements are clustered
The genome is known to be rich in transposable elements (TEs)9, 12. Completion of the sequence confirms the earlier observation that TEs of the same type are clustered, suggesting their preferential insertion within similar resident elements. However, none of the elements appears to use a specific sequence as a target for insertion: they insert at random within other elements of the same type. Non-LTR (long terminal repeat) retrotransposons are known to insert next to tRNA genes: we find many such instances (Fig. 2), but again no specific sequences were identified as insertion targets.
tRNAs are numerous and paired by specificity
The sequenced genome encodes 390 tRNAs, a number at the upper end of the eukaryotic spectrum (e.g. Plasmodium falciparum = 43, Drosophila melanogaster = 284, human = 496). Allowing for the normal wobble rules in codon-anticodon pairing13, 14, every sense codon can be decoded, apart from the rare alanine codon GCG; we infer that the missing tRNA(s) lie in one or more gaps in the sequence. We also find a possible selenocysteine tRNA in the genome, as well as corresponding selenocysteine insertion targets in two predicted proteins (Supplementary Information; Fig. SI 3).
Dictyostelium, in common only with Acanthamoeba castellanii15, has been shown to lack certain apparently essential tRNAs in its mitochondrial genome16. It therefore seems likely that at least some chromosomally-encoded tRNAs (those for valine, threonine, asparagine and glycine, as well as one arginine and two serine tRNAs) are imported into the mitochondria.
Though the gross distribution of tRNAs is uniform, their organisation on a finer scale is striking: about 20% occur as pairs or triplets with identical anticodons (and usually 100% sequence identity), separated by Fig. 2). There are 41 such groups in the genome; a random distribution would produce few, if any. This pattern is unique amongst sequenced genomes, and suggests a wave of recent duplications. However, tRNA pairs are found in tandem, converging and diverging orientations with comparable frequencies, suggesting no straightforward duplication mechanism; nor is there usually duplication of extensive flanking sequences. Whether the preference of TRE elements for inserting adjacent to tRNAs is related to the large number and unusual distribution of the latter is unclear.
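The comparison with a random distribution can be sketched as a simple Monte Carlo simulation. This is an illustrative model only: it treats the genome as a single linear sequence, places each tRNA gene uniformly at random, and counts runs of same-anticodon genes whose neighbours lie within a chosen `window`. The window size, genome length and per-anticodon counts are parameters the caller must supply, and the function names are hypothetical.

```python
import random


def count_groups(positions_by_anticodon, window):
    """Count runs of two or more same-anticodon genes with gaps <= window."""
    groups = 0
    for positions in positions_by_anticodon.values():
        ordered = sorted(positions)
        run = 1
        for left, right in zip(ordered, ordered[1:]):
            if right - left <= window:
                run += 1
            else:
                if run >= 2:
                    groups += 1
                run = 1
        if run >= 2:
            groups += 1
    return groups


def simulate_random(anticodon_counts, genome_length, window, trials=1000, seed=0):
    """Mean number of groups when genes are placed uniformly at random."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        placed = {
            anticodon: [rng.randrange(genome_length) for _ in range(n)]
            for anticodon, n in anticodon_counts.items()
        }
        total += count_groups(placed, window)
    return total / trials
```

Calling `count_groups` on the observed gene coordinates and comparing the result with `simulate_random` for the same counts and window gives a rough sense of how far the observed clustering departs from chance; a fuller treatment would respect chromosome boundaries and sequence gaps.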
A chromosomal master copy of the extrachromosomal rDNA element
The rRNA genes lie on an 88-kb palindromic extrachromosomal element17, present at ~100 copies per nucleus (Fig. 2). Evidence also exists for chromosomal copies: at least the central 3.2 kb of the element is located17 on chromosome 4, whilst chromosome 2 carries both a partial rDNA sequence and a 5S rRNA pseudogene9, 18. In this study, two unanchored contigs assigned to chromosomes 4/5 were found to contain junctions between rDNA sequences and complex repeats, each of which would confound attempts to extend the sequence and integrate these contigs into the assemblies. We postulate that these contigs represent the junctions between a ‘master copy’ of the rDNA and the remainder of chromosome 4 (Fig. 2). One contig contains sequence matching a region of G+C-rich repeats near the centre of the palindrome, whilst the other matches sequence near the tip of the palindrome arm, adjacent to the one unclosed gap in the rDNA element sequence17. This gap is believed to represent a tandem array of short repeats, probably added post-synthetically to the extrachromosomal elements.
The structure of this master copy suggests a mechanism for generating the extrachromosomal copies by a process of transcription, hairpin formation and second-strand synthesis (Fig. 2). This process would account for the complete absence of sequence variation between the two arms of the palindrome.