An international initiative to sequence the genome of Dictyostelium discoideum AX4 (references 5, 6) was launched in 1998. The high repeat-content and A+T-richness of the genome (the latter rendering large-insert bacterial clones unstable) posed severe challenges for sequencing and assembly. The response to these challenges was to use a whole-chromosome shotgun (WCS) strategy, partially purifying each chromosome electrophoretically and treating it as a separate project. This approach was supported by novel statistical tools to recover chromosome specificity from the impure WCS libraries, and by highly detailed HAPPY maps that provided a framework for sequence assembly. These approaches have enabled the completion of this difficult genome to a high standard, and are likely to be valuable in tackling the many other genomes which present challenges of composition and complexity.
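The statistical tools themselves are not described in this section. As a rough illustration of the idea only, and not the authors' method, a contig could be assigned to the chromosome whose WCS library is most over-represented among its reads, relative to that library's share of all reads. The function name `assign_contig`, the library sizes and the read counts below are all hypothetical.

```python
def assign_contig(read_counts, library_sizes):
    """Return (best_chromosome, enrichment_by_chromosome) for one contig.

    read_counts:   reads in the contig from each WCS library (hypothetical data)
    library_sizes: total reads sequenced from each WCS library (hypothetical data)
    """
    total_contig = sum(read_counts.values())
    total_library = sum(library_sizes.values())
    if total_contig == 0:
        raise ValueError("contig contains no reads")

    enrichment = {}
    for chrom, size in library_sizes.items():
        observed = read_counts.get(chrom, 0) / total_contig  # share of contig reads
        expected = size / total_library                      # share of all reads
        enrichment[chrom] = observed / expected

    # The library with the highest observed/expected ratio is the best guess.
    best = max(enrichment, key=enrichment.get)
    return best, enrichment


# Hypothetical example: three impure libraries and one contig.
library_sizes = {"chr1": 100_000, "chr2": 200_000, "chr3": 100_000}
read_counts = {"chr1": 5, "chr2": 12, "chr3": 30}
best, enrichment = assign_contig(read_counts, library_sizes)
print(best, enrichment)
```

Normalizing by library size matters because a larger library contributes more contaminating reads to every contig; raw counts alone would favour the most deeply sequenced chromosome.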
Repetitive tracts complicated assembly. For chromosomes 1, 2 and 3, inspection of polymorphisms, combined with HAPPY maps, allowed unambiguous assembly in many cases. For chromosomes 4, 5 and 6, low-coverage sequencing of AX4-derived YACs alleviated the problems by providing a local dataset within which the troublesome repeat element was present as a single copy. Nevertheless, some repeat tracts proved intractable and remain as gaps. Thirty-four unlinked (‘floating’) contigs of >1kb, totalling 225,339bp, remain unpositioned in the genome, but can be provisionally assigned to specific chromosomes based on their content of reads from the WCS libraries. Most or all of these floating contigs are bounded by repetitive regions. The chromosome 2 sequence in the current assembly supersedes that previously published (reference 9), having benefited from further HAPPY mapping and manual sequence finishing.
The six chromosomal assemblies span 33,817kb (Table 1), including ~156kb in the form of clone-, sequence- and repeat-gaps. Assuming that the majority of floating contigs lie beyond the termini of the assemblies, the total genome size is estimated at 34,042,810bp.
In estimating the completeness of the sequence, we note that of 967 well-characterized D. discoideum genes, 957 (99%) were found initially in the assemblies. Of the remaining ten, seven (cupE, trxA, trxB, trxC, staA, staB, cinB) have close matches, suggesting that their GenBank entries may contain errors or represent alternative alleles. Only three (fcpA, wasA and roco5) had no matches in the initial assemblies, though the first two of these were recovered by searches of unincorporated sequence followed by local reassembly. Of 133,168 ‘qualified’ D. discoideum AX4 ESTs (expressed sequence tags of >200bp and >20% G+C, and not matching mitochondrial sequence; reference 10 and H. Urushihara et al., unpublished), 128,207 (96.3%) are found in the assemblies. The higher proportion of missing sequences amongst the ESTs probably reflects the higher error rate inherent in EST data.
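The size and completeness figures quoted above can be re-derived with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are ours, and the assembly span is taken as the rounded 33,817kb value, so the summed genome size is approximate.

```python
# Figures quoted in the text
known_genes_total = 967
known_genes_found = 957
ests_total = 133_168
ests_found = 128_207
assembly_span_bp = 33_817_000  # 33,817 kb, rounded to the nearest kb in the text
floating_bp = 225_339          # 34 floating contigs of >1 kb

gene_fraction = known_genes_found / known_genes_total
est_fraction = ests_found / ests_total
# Assumes, as the text does, that floating contigs add to the assembled span.
approx_genome_bp = assembly_span_bp + floating_bp

print(f"well-characterized genes found: {gene_fraction:.1%}")
print(f"qualified ESTs found: {est_fraction:.1%}")
print(f"approximate genome size: {approx_genome_bp:,} bp")
```

The sum comes to 34,042,339bp, which differs from the quoted 34,042,810bp by 471bp; a difference of that size is consistent with the chromosomal span having been rounded to the nearest kb.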
We conclude that the current assembly represents >>95% of chromosomal sequence (less than 1% of which is in floating contigs) and ≥99% of genes, the majority of missing sequence comprising complex or simple repeats. The most stringent test of the medium- to long-range accuracy of the assembly comes from comparison with the HAPPY maps. This is particularly true for chromosomes 4, 5 and 6, where HAPPY markers were used to nucleate contigs but not to guide their assembly or ordering, specifically to allow such a comparison to be made without circularity of argument. As can be seen, good agreement between map and sequence confirms the accuracy of the assembly (Fig. 1).
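One simple way to express a map-versus-sequence comparison of this kind is to walk the markers in sequence order and flag any whose map position runs backwards. This is a minimal sketch under our own assumptions, not the procedure used to produce Fig. 1; the function name `discordant_markers`, the marker names and all positions are hypothetical.

```python
def discordant_markers(markers):
    """Return names of markers whose map position breaks the sequence order.

    markers: list of (name, map_position, sequence_position) tuples.
    A marker is flagged when its map position is lower than the highest
    map position already seen while walking along the sequence.
    """
    flagged = []
    highest_map = float("-inf")
    for name, map_pos, _seq_pos in sorted(markers, key=lambda m: m[2]):
        if map_pos < highest_map:
            flagged.append(name)
        else:
            highest_map = map_pos
    return flagged


# Hypothetical markers: (name, map position, sequence coordinate in bp)
markers = [
    ("m1", 0.0, 10_000),
    ("m2", 4.5, 95_000),
    ("m3", 3.9, 140_000),  # maps before m2 but lies after it in the sequence
    ("m4", 9.2, 210_000),
]
print(discordant_markers(markers))  # ['m3']
```

An empty result means the two orderings are collinear. Because the map has finite resolution, a real comparison would likely tolerate small local reversals between closely spaced markers rather than flag every one.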