Prediction of protein-coding genes (see Methods) was performed on the complete set of chromosomes and floating contigs (Table 2). In assessing the completeness and accuracy of the predictions, we find that of the 957 well-characterized D. discoideum genes that are present in the current sequence, 823 (86%) are predicted as transcripts with structures matching the experimentally determined ones. For a further 123 (13%), the predicted transcript differs from the experimentally determined one, about half of these differing only in their 5' boundary; the remaining 11 (1%), though present in the sequence, were not predicted as transcripts. Similarly, of the 128,207 qualified ESTs present in the current sequence, 127,097 (99.1%) fall within predicted transcripts. Combining our estimate of sequence coverage (above) with these estimates of the success of gene prediction, we infer that approximately 98% of all D. discoideum genes are present in the predicted set.
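The percentages quoted above follow directly from the stated counts. The short check below is only an arithmetic sketch: the variable and function names are ours, and it does not attempt to reproduce the 98% estimate, which also depends on the sequence-coverage figure given earlier.

```python
# Counts quoted in the text for the well-characterized genes present in the sequence.
matching, differing, missed = 823, 123, 11
total_genes = matching + differing + missed  # should equal the 957 genes cited


def pct(part, whole, digits=0):
    """Percentage of `whole` represented by `part`, rounded to `digits` places."""
    return round(100 * part / whole, digits)


gene_summary = {
    "structure matches": pct(matching, total_genes),
    "structure differs": pct(differing, total_genes),
    "not predicted": pct(missed, total_genes),
}

# Qualified ESTs falling within predicted transcripts.
est_within = pct(127_097, 128_207, digits=1)

print(total_genes, gene_summary, est_within)
```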
The level of over-prediction, conversely, is harder to estimate: prediction was performed generously to ensure that most true genes were represented. Of the 13,541 predicted proteins, 47.5% are represented by qualified ESTs, reflecting the inevitable bias in EST sampling. Amongst the shortest predicted proteins, fewer are represented by ESTs (e.g. 21% of those of Table 2). The same relative complexity is seen in the total numbers of amino acids encoded by each genome, which are less biased by shorter (and more dubious) predictions. Introns in Dictyostelium are few and short, and intergenic regions are small, producing a compact genome of which 62% encodes protein.
Genes are distributed approximately uniformly across the genome (Fig. 2). Although we do not see widespread clustering of genes with coordinated expression patterns (see Methods), we do find statistically significant (pFig. 2).
A+T-richness influences amino-acid composition as well as codon usage
Codon usage in Dictyostelium favours codons of the form NNT or NNA over their NNG or NNC synonyms, the bias being even greater than in the A+T-rich Plasmodium genome. Comparison of tRNA and codon frequencies (Table SI 2) reveals a similar picture to that in human (ref. 28) and other eukaryotes, suggesting that the same use is made of 'wobble' and of base modifications (for example, of adenine to inosine in some tRNAs) to expand the effective repertoire of tRNAs.
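One simple way to quantify a preference for NNT and NNA codons is the fraction of codons whose third base is A or T. The sketch below assumes an in-frame coding sequence supplied as a plain string; the function name and the toy sequence are illustrative and are not taken from the analysis reported here. A stricter measure would exclude codons without synonyms (ATG and TGG).

```python
def third_position_at_fraction(cds):
    """Fraction of complete codons in an in-frame CDS that end in A or T."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    third_bases = [cds[i + 2] for i in range(0, usable, 3)]
    if not third_bases:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "AT" for base in third_bases) / len(third_bases)


# Toy example (hypothetical sequence): codons ATG, AAA, TTT, GGC.
toy_fraction = third_position_at_fraction("ATGAAATTTGGC")
print(toy_fraction)
```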
As in Plasmodium (ref. 29), the extreme A+T-richness is reflected not just in the choice of synonymous codons but also in the amino-acid composition of the proteins. Amino acids encoded solely by codons of the form WWN (W = A or T; N = any base; these are Asn, Lys, Ile, Tyr and Phe) are much more common in Dictyostelium proteins than in human ones; the reverse is true for those encoded solely by SSN codons (S = C or G; these are Pro, Arg, Ala and Gly).
Geometry reflects phylogeny - tandem duplication in the genome
The predicted gene set of Dictyostelium is rich in relatively recent gene duplications. Of the 13,498 predicted proteins analysed, 3,663 fall into 889 families clustered by BLAST-P similarities of e−40. Most (538) families contain only two members, but 351 families contain between three and 81 proteins (Table SI 3). Hence, counting one founding member per family, 2,774 (20%) of all predicted proteins have arisen by relatively recent duplication, potentially accounting for much of Dictyostelium's excess over typical unicellular eukaryotes.
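The grouping of proteins into families, and the count of duplicated proteins derived from it, can be illustrated as follows. This is a minimal sketch that assumes single-linkage clustering over protein pairs that have already passed the similarity threshold; the clustering rule actually used is not stated in this section, and all identifiers below are hypothetical.

```python
from collections import defaultdict


def cluster_families(proteins, similar_pairs):
    """Single-linkage families (size >= 2) from pairs that passed a similarity cut-off."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the root, halving the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)
    return [members for members in groups.values() if len(members) > 1]


def duplicates_beyond_founders(families):
    """Family members in excess of one founder per family."""
    return sum(len(f) for f in families) - len(families)


# Toy input: two families, {p1, p2, p3} and {p4, p5}; p6 is a singleton.
toy_families = cluster_families(
    ["p1", "p2", "p3", "p4", "p5", "p6"],
    [("p1", "p2"), ("p2", "p3"), ("p4", "p5")],
)
print(duplicates_beyond_founders(toy_families))
```

Applied to the totals in the text, the same subtraction gives 3,663 − 889 = 2,774.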
We tried to infer the mechanisms by which such duplications arise and propagate in the genome. Where members of a family are clustered on one chromosome, the physical distance between family members often (23 of 86 families examined) correlates strongly with their evolutionary divergence (Methods). Where a family is split between different chromosomes, members on the same chromosome are often (23 of 50 families examined) more closely related to each other than to members on different chromosomes; the reverse is never observed.
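The relationship between physical distance and divergence within a family can be illustrated with a simple pairwise correlation. The sketch below uses a Pearson coefficient over all gene pairs purely as an illustration; the statistic actually applied is described in Methods and may differ, and the gene names, coordinates and divergence values are invented. Because pairwise values are not independent, a permutation-based test would be needed to assign significance.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def distance_divergence_correlation(positions, divergence):
    """Correlate pairwise chromosomal distance with pairwise divergence.

    positions:  gene -> coordinate on one chromosome
    divergence: frozenset({gene_a, gene_b}) -> divergence estimate
    """
    distances, divergences = [], []
    for a, b in combinations(sorted(positions), 2):
        distances.append(abs(positions[a] - positions[b]))
        divergences.append(divergence[frozenset((a, b))])
    return pearson(distances, divergences)


# Hypothetical family in which divergence grows in proportion to distance.
toy_positions = {"g1": 0, "g2": 10, "g3": 30}
toy_divergence = {
    frozenset(("g1", "g2")): 0.1,
    frozenset(("g1", "g3")): 0.3,
    frozenset(("g2", "g3")): 0.2,
}
toy_r = distance_divergence_correlation(toy_positions, toy_divergence)
print(toy_r)
```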
These findings suggest that three processes combine to account for most duplications in Dictyostelium: tandem duplication, local inversion, and inter-chromosomal exchange. In this model, gene families expand by tandem duplication of either single genes or blocks containing several consecutive genes, as in an earlier model (ref. 30); inversions within these expanding clusters may reverse local gene order. An elegant illustration of these two processes is provided by a cluster of acetyl-CoA synthetases on chromosome 2 (Fig. 4). The third process - exchange of segments between chromosomes - may fragment these clusters at any stage. If such an inter-chromosomal exchange splits a gene family early in its expansion, then each of the two resulting sub-families has a long subsequent period of evolution independently of the other, so similarities will be greatest between genes on the same chromosome. If, conversely, the split occurs later, then all family members - whether on the same or on different chromosomes - will tend to resemble each other equally closely. We cannot exclude the possibility of duplication occasionally creating a second copy of a gene, or group of genes, directly on a different chromosome from the first. However, all instances that we have examined can be accounted for without such intermolecular duplication.
Amino acid repeats
Tandem repeats of trinucleotides (and of motifs of 6, 9, 12, etc. bases) are unusually abundant in Dictyostelium exons, and naturally correspond to repeated sequences of amino acids. However, at the protein level the situation is even more extreme: there are many further amino-acid repeats which use different synonymous codons, and so do not arise from perfect nucleotide repeats. Amongst the predicted proteins, there are 9,582 simple-sequence repeats of amino acids (homopolymers of length ≥10, or ≥5 consecutive repeats of a motif of two or more amino acids). Of these, the most striking are polyasparagine and polyglutamine tracts of ≥20 residues, present in 2,091 of the predicted proteins. Also abundant are low-complexity regions such as QLQLQQQQQQQLQLQQ: there are 2,379 tracts of ≥15 residues composed of only two different amino acids. In total, repeats or simple-sequence tracts of amino acids (even by these conservative definitions) occur in 34% of predicted proteins and encode 3.3% of all amino acids.
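The repeat definitions given above translate naturally into pattern searches. The following is a sketch of one possible encoding, not the procedure used to obtain the reported counts: the regular expressions and function names are ours, and the two-residue scan treats a tract as a run containing at most two distinct amino acids, which is an assumption about how "only two different amino acids" was applied.

```python
import re

# The same residue at least 10 times in a row.
HOMOPOLYMER = re.compile(r"(.)\1{9,}")
# A motif of two or more residues present in at least 5 consecutive copies.
MOTIF_REPEAT = re.compile(r"(.{2,}?)\1{4,}")


def has_simple_repeat(protein):
    """True if the protein contains a homopolymer or tandem motif repeat as defined above."""
    return bool(HOMOPOLYMER.search(protein) or MOTIF_REPEAT.search(protein))


def longest_two_residue_tract(protein):
    """Length of the longest run containing at most two distinct amino acids."""
    counts = {}
    best = start = 0
    for end, residue in enumerate(protein):
        counts[residue] = counts.get(residue, 0) + 1
        # Shrink the window from the left until only two residue types remain.
        while len(counts) > 2:
            left = protein[start]
            counts[left] -= 1
            if counts[left] == 0:
                del counts[left]
            start += 1
        best = max(best, end - start + 1)
    return best


# The low-complexity example from the text, embedded in hypothetical flanking residues.
tract = "QLQLQQQQQQQLQLQQ"
print(longest_two_residue_tract("MK" + tract + "DE"))
print(has_simple_repeat("M" + "N" * 10 + "K"))
```

A protein would then be counted as carrying a two-residue tract when `longest_two_residue_tract` returns 15 or more.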
It seems likely that these repeats have arisen through nucleotide expansion, but have been selected at the protein level. Evidence for the latter is that any given trinucleotide repeat occurs predominantly in only one of the three reading frames. For example, the repeat ...ACAACAACAACA... is usually translated as polyglutamine ([CAA]n) rather than as polythreonine ([ACA]n) or polyasparagine ([AAC]n). Further evidence comes from the many trinucleotide repeats which have apparently mutated to produce only synonymous codons (e.g. ...GAT,GAC,GAT,GAT,GAC,..., translated as polyaspartate). Moreover, the distribution of repeats and simple-sequence tracts is non-random: most proteins either have no such features (66% of proteins) or have two or more (18% of proteins), suggesting that they are tolerated only in certain types of protein. Protein kinases, lipid kinases, transcription factors, RNA helicases and mRNA-binding proteins such as spliceosome components appear to be over-represented among the polyasparagine- and polyglutamine-containing proteins (Fig. SI 9). Protein kinases and transcription factors are also over-represented among the polyasparagine- and polyglutamine-containing proteins of S. cerevisiae, so it is possible that these homopolymers serve some functional role in these protein classes. A more detailed analysis of amino-acid homopolymers is given in Supplementary Information (Tables SI 4–6, Figs. SI 7–10).
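The reading-frame argument can be made concrete by translating the example repeat in all three forward frames. The sketch below builds the standard genetic code from its conventional TCAG ordering; the function name is ours, and only the two nucleotide examples from the text are used.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frames(dna):
    """Translate a DNA string in the three forward reading frames."""
    dna = dna.upper()
    frames = {}
    for offset in range(3):
        # Take complete codons only; a trailing partial codon is dropped.
        codons = [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
        frames[offset] = "".join(CODON_TABLE[c] for c in codons)
    return frames


# The trinucleotide repeat from the text: one frame each for Thr, Gln and Asn.
repeat_frames = translate_frames("ACAACAACAACA")
print(repeat_frames)

# The synonymous-codon example from the text reads as polyaspartate in frame 0.
print(translate_frames("GATGACGATGATGAC")[0])
```

The same nucleotide repeat thus encodes polythreonine, polyglutamine or polyasparagine depending only on the frame, which is why a strong preference for one frame points to selection acting on the protein.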