Pan-genome, core-genome, and evolution of genome composition
The number of protein coding genes per genome within the various strains and species of Streptococcus is relatively similar (ranging from 1,697 to 2,376; Table 1), but the gene composition of these genomes is much more variable. Based on the gene content table obtained by OrthoMCL (Additional data file 1), three strains of S. agalactiae, S. pyogenes or S. thermophilus share about 75% of their genes, and three different species of Streptococcus share only around half of their genes (Figure 1). This latter result appears to be independent of the particular strains or species involved in the comparison and of their phylogenetic affinities. Even with the inclusion of 26 genomes, the total number of possible genes - the pan-genome - of Streptococcus appears not to have been reached, as depicted in the gene accumulation curve (Figure 2), and we estimate that the Streptococcus pan-genome probably surpasses 6,000 genes. A surprising 21% of the genes in the pan-genome of the genus Streptococcus (based on these 26 genome sequences) were represented in only one lineage, suggesting a remarkable degree of lateral gene transfer (LGT) in shaping the genomes of each of these taxa (Figure 3). Within species, the pan-genome size also remains uncertain, although our estimates suggest that the pan-genome size of S. pyogenes is smaller, and better estimated with the currently available data, than that of S. agalactiae (Figure 2).
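The gene accumulation curve and the share of genes found in a single genome can be illustrated with a minimal Python sketch. It assumes that gene content is available as one set of ortholog cluster identifiers per genome, as could be derived from a gene content table. The function names, the number of permutations, and the three toy genomes are hypothetical; this is not the procedure or the data behind Figures 2 and 3.

```python
import random
from collections import Counter


def pan_genome_curve(presence, n_permutations=100, seed=0):
    """Mean pan-genome size after adding 1, 2, ..., N genomes in random order."""
    rng = random.Random(seed)
    genomes = sorted(presence)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        order = genomes[:]
        rng.shuffle(order)
        seen = set()
        for k, genome in enumerate(order):
            seen |= presence[genome]   # union of all clusters seen so far
            totals[k] += len(seen)
    return [total / n_permutations for total in totals]


def single_genome_fraction(presence):
    """Fraction of pan-genome clusters that occur in exactly one genome."""
    counts = Counter(gene for genes in presence.values() for gene in genes)
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Toy data (hypothetical): integers stand for ortholog cluster identifiers.
toy_presence = {
    "genome_A": {1, 2, 3, 4},
    "genome_B": {1, 2, 3, 5},
    "genome_C": {1, 2, 6, 7},
}
curve = pan_genome_curve(toy_presence)
fraction = single_genome_fraction(toy_presence)
```

A curve that is still rising at its last point, as described for the 26 genomes, indicates that additional genomes would be expected to contribute new genes.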
In contrast to the pan-genome estimates, the number of genes in common between the different species within the genus Streptococcus - the core-genome - appears to reach a plateau around 600 genes (Figures 2 and 3). Next to the genome specific genes and the genes shared by only two genomes, the genes of the core-genome were the third most common genes (11%; Figure 3), suggesting they form a coherent group. Similarly, the estimated core-genome for S. pyogenes, based on the 11 available strains, plateaus around 1,400 genes. The pattern was less clear for S. agalactiae, where the estimate of core-genome size does not level out and might still be influenced by the inclusion of new genome sequences. On the whole, these analyses suggest that it is possible to delineate a core-genome at both genus and species level. We analyzed four such core-genome data sets: the Streptococcus core-genome (611 genes), and the core-genomes of S. agalactiae (1,472 genes), S. pyogenes (1,376 genes) and S. thermophilus (1,487 genes). To save computation time, the Streptococcus core-genome data set was reduced to ten taxa by keeping only two strains per species for S. agalactiae, S. pyogenes, and S. thermophilus (strains A909 and NEM316, MGAS9429 and M1 GAS, and CNRZ1066 and LMG 18311, respectively). After discarding clusters of genes containing paralogs (that is, clusters containing more than one gene per taxon), and alignments with uncertain site homologies, we obtained four data sets containing 260, 1,297, 1,212 and 1,365 genes representing the alignable core-genomes of Streptococcus, S. pyogenes, S. agalactiae, and S. thermophilus, respectively.
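The two filtering steps on ortholog clusters described above (keeping clusters present in every taxon, then discarding clusters with more than one gene per taxon) can be sketched as follows. The cluster layout, function names, and toy identifiers are assumptions made for illustration. The subsequent removal of alignments with uncertain site homologies is not covered.

```python
from collections import Counter


def core_clusters(clusters, taxa):
    """Keep clusters that contain at least one gene from every taxon."""
    taxa = set(taxa)
    return {
        cid: members
        for cid, members in clusters.items()
        if taxa <= {taxon for taxon, _ in members}
    }


def single_copy_core(clusters, taxa):
    """Keep core clusters with exactly one gene per taxon (no paralogs)."""
    kept = {}
    for cid, members in core_clusters(clusters, taxa).items():
        copies = Counter(taxon for taxon, _ in members)
        if all(n == 1 for n in copies.values()):
            kept[cid] = members
    return kept


# Toy data (hypothetical): cluster id -> list of (taxon, gene id) pairs.
toy_taxa = ["taxon_1", "taxon_2", "taxon_3"]
toy_clusters = {
    "c1": [("taxon_1", "a1"), ("taxon_2", "b1"), ("taxon_3", "c1")],
    "c2": [("taxon_1", "a2"), ("taxon_1", "a3"), ("taxon_2", "b2"), ("taxon_3", "c2")],
    "c3": [("taxon_1", "a4"), ("taxon_2", "b3")],
}
core = core_clusters(toy_clusters, toy_taxa)
alignable_candidates = single_copy_core(toy_clusters, toy_taxa)
```

In this toy example, cluster c3 is excluded from the core because taxon_3 is missing, and c2 is excluded from the single-copy set because taxon_1 contributes two genes.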
Determinations of the number of genes gained and lost on each of the lineages show considerable variation (Figure 4) and, in agreement with earlier studies, gene gain was generally considerably greater than gene loss and was particularly evident on external branches. The lineage in the interspecific analysis showing the greatest gene gain was S. suis, followed closely by S. pneumoniae and S. mutans. Even within a species, between strains, the numbers of genes gained and lost were very high, reaching, for example, values in excess of 150 for gene gain in S. agalactiae strain H36B. High levels of gene gain and loss were evident, even for closely related isolates of the same serotype in S. pyogenes (for example, M1 GAS/MGAS5005; SSI-1/MGAS315; MGAS9429/MGAS2096). Branch lengths of the S. pyogenes concatenated tree were much longer than those for S. agalactiae, suggesting the lineages might be much older; however, despite this there was generally more gene gain on the S. agalactiae branches than on S. pyogenes branches. Large values for duplications were also a feature of the lineage specific evolution (Figure 4). Phylogenetic analysis of several of these cases suggests this is a combination of lineage specific duplications as well as LGT events involving homologous sequences from other species of Streptococcus. When gene gain was penalized with respect to gene loss (for example, ), not surprisingly, it globally decreased the number of gene gains and increased the number of gene losses (Additional data file 3) and, as a consequence, increased the number of genes in the pan-genomes of ancestral nodes (data not shown). Nevertheless, even with a penalty, gene gain remained in excess of gene loss on some lineages (Additional data file 3).
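The effect of penalizing gene gain relative to gene loss can be illustrated with weighted (Sankoff) parsimony applied to the presence or absence of a single gene on a rooted tree. This is a generic sketch under stated assumptions, not the reconstruction method used for Figure 4: the tree, the leaf names, the cost values, and the free choice of root state are all hypothetical.

```python
INF = float("inf")


def step_cost(parent, child, gain_cost, loss_cost):
    """Cost of a branch going from parent state to child state (0 absent, 1 present)."""
    if parent == child:
        return 0.0
    return gain_cost if child == 1 else loss_cost


def score(node, present, gain_cost, loss_cost):
    """Bottom-up pass: return (costs, scored_children), where costs[s] is the
    minimum cost of the subtree when this node is in state s."""
    if isinstance(node, str):  # leaf: only the observed state is allowed
        return ([INF, 0.0] if node in present else [0.0, INF], [])
    children = [score(child, present, gain_cost, loss_cost) for child in node]
    costs = [
        sum(
            min(step_cost(s, t, gain_cost, loss_cost) + child_costs[t] for t in (0, 1))
            for child_costs, _ in children
        )
        for s in (0, 1)
    ]
    return (costs, children)


def count_events(scored, state, gain_cost, loss_cost):
    """Top-down pass: count gains and losses along one minimum-cost reconstruction."""
    gains = losses = 0
    for child in scored[1]:
        child_costs = child[0]
        best = min(
            (0, 1),
            key=lambda t: step_cost(state, t, gain_cost, loss_cost) + child_costs[t],
        )
        if best != state:
            if best == 1:
                gains += 1
            else:
                losses += 1
        sub_gains, sub_losses = count_events(child, best, gain_cost, loss_cost)
        gains += sub_gains
        losses += sub_losses
    return gains, losses


def reconstruct(tree, present, gain_cost=1.0, loss_cost=1.0):
    """Return (total cost, number of gains, number of losses) for one gene."""
    scored = score(tree, present, gain_cost, loss_cost)
    root_state = min((0, 1), key=lambda s: scored[0][s])  # root state is unconstrained
    gains, losses = count_events(scored, root_state, gain_cost, loss_cost)
    return scored[0][root_state], gains, losses


# Toy example (hypothetical): nested tuples are internal nodes, strings are leaves.
toy_tree = (("A", "B"), ("C", ("D", "E")))
toy_present = {"A", "D"}  # the gene is observed only in leaves A and D
equal_costs = reconstruct(toy_tree, toy_present, gain_cost=1.0, loss_cost=1.0)
penalized = reconstruct(toy_tree, toy_present, gain_cost=2.0, loss_cost=1.0)
```

With equal costs, the toy gene is explained by two independent gains. With a gain cost of 2, the cheapest explanation places the gene at the root and invokes three losses, which mirrors the reported shift toward fewer gains, more losses, and larger ancestral gene sets.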
Between species of Streptococcus
The results of the approximately unbiased (AU) test indicated that 39 out of 260 genes rejected the concatenated tree. The p value heatmap (Figure 5a) indicates that some gene trees showed the same or very similar histories, depicted by groups of topologies with a similar p value pattern (for example, topologies 1 to 47, and 48 to 65). On the other hand, a small group of genes rejected most topologies (that is, genes 230 to 260, read horizontally in Figure 5a), and at the same time, their trees were rejected by most of the genes (that is, topologies 230 to 260, read vertically in Figure 5a). Although different topologies were supported by various groups of genes, the majority of genes did not reject the concatenated tree and only a small subset of genes proposed significantly different trees. The analysis of bipartitions (Figure 5b) demonstrated that the vast majority of genes supported three distinct bipartitions, corresponding to the monophyly of S. pyogenes, S. pneumoniae and S. thermophilus (bipartitions 28, 29, and 30, respectively). Also generally supported were the monophyly of S. agalactiae, the monophyly of the group S. pneumoniae + S. suis, and the monophyly of the group S. agalactiae + S. pyogenes (bipartitions 27, 26 and 25, respectively). Several other bipartitions were only supported by some genes (for example, bipartition 19, corresponding to the grouping of S. pneumoniae with S. thermophilus), while others were only supported by one or a few genes (for example, bipartitions 10 and 11). The figure of well supported conflicting bipartitions (Figure 5c) summarizes the p value heatmap (Figure 5a) and the bipartition analyses (Figure 5b). A majority of the genes (around 150 out of 260) show no conflict with each other. Most of them support the monophyly of the different species and the lineage S. pneumoniae + S. suis, and most of them do not reject the concatenated gene tree.
Another set of genes showed some instances of conflict with the aforementioned set of 150, but most of them were in conflict with each other. They tend to support the same principal groups as the set of 150, with a few additional bipartitions that are conflicting. A final group of genes conflict with the first and the second group, as well as with each other, corresponding to genes that rejected most of the other gene trees in the AU test (Figure 5a) and that provide support for rare bipartitions; genes of this set have strongly incongruent histories with the other genes (for a detailed list, see Additional data file 4). The topologies used to test for positive selection were the concatenated gene tree for the genes that do not reject it, and individual gene trees for those loci that do reject the concatenated tree.
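What it means for two bipartitions to conflict can be made concrete with the standard compatibility criterion: two bipartitions of the same taxon set can occur in one tree only if at least one of the four pairwise intersections of their sides is empty. The sketch below applies that criterion; the taxon labels are placeholders rather than the ten taxa actually analyzed, and the bootstrap threshold that defines a well supported bipartition is not modeled.

```python
def in_conflict(side_a, side_b, taxa):
    """True if two bipartitions, each given by one of its sides, cannot
    both be present in the same tree over the full taxon set."""
    taxa = frozenset(taxa)
    a, b = frozenset(side_a), frozenset(side_b)
    # The four intersections: A and B, A only, B only, neither A nor B.
    return all([a & b, a - b, b - a, taxa - (a | b)])


# Toy taxon set (hypothetical labels).
toy_taxa = {
    "agalactiae_1", "agalactiae_2", "pyogenes_1", "pyogenes_2",
    "thermophilus_1", "thermophilus_2", "pneumoniae_1", "pneumoniae_2",
    "suis", "mutans",
}
pneumoniae_suis = {"pneumoniae_1", "pneumoniae_2", "suis"}
pneumoniae_thermophilus = {
    "pneumoniae_1", "pneumoniae_2", "thermophilus_1", "thermophilus_2",
}
pyogenes = {"pyogenes_1", "pyogenes_2"}

conflicting = in_conflict(pneumoniae_suis, pneumoniae_thermophilus, toy_taxa)
compatible = not in_conflict(pneumoniae_suis, pyogenes, toy_taxa)
```

Under this criterion, a gene tree grouping S. pneumoniae with S. thermophilus conflicts with one grouping S. pneumoniae with S. suis, whereas the monophyly of S. pyogenes is compatible with either.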
Within S. agalactiae
The concatenated gene tree was rejected by 750 genes of the core-genome of S. agalactiae. On the whole, most genes rejected most of the other gene trees (Figure 6a), although there were also some genes that did not reject the majority of gene trees. There were no commonly well supported bipartitions across the genes (Figure 6b). Around half of the genes provided either no, or only weak, bootstrap support for any bipartition (genes 1 to 560; Figure 6b), while the rest of the genes supported different sets of bipartitions. The most commonly supported groups of strains were 515+NEM316, A909+H36B, 515+NEM316+COH1, A909+CJB111+H36B, A909+CJB111+H36B, and 515+COH1 (bipartitions 75 to 70, respectively; Figure 6b). Numerous additional bipartitions were supported by only one or a few genes. Because their phylogenetic signal was too limited, around half of the genes (genes 1 to 560) showed no conflict with any of the other genes (Figure 6c). Although the AU test suggested that some of these genes have different histories, it is difficult to reach any definitive conclusions about the congruence of these gene histories since phylogenetic signal was so limited or absent (genes with no sequence divergence between strains).
The second half of the core-genome can be split into two groups. The first group contained genes that have some conflict with each other, and that tend to support the six bipartitions described earlier, plus three additional ones. The second group contained genes that were largely in conflict with each other, and with the preceding group. This latter group provided support for a number of rarely supported bipartitions. While the first group contained genes that had only partly incongruent histories (only a few bipartitions in conflict), genes of the second group had more incongruent gene histories (greater number of bipartitions in conflict). Given these results, and the ambiguity of defining which genes had the same history, we analyzed each gene with its own gene tree in the subsequent positive selection analyses.
Within S. pyogenes
As for S. agalactiae, although a few genes rejected no trees, the majority of genes rejected the other gene trees (Figure 7a). Three bipartitions were generally supported, although not always, and with various bootstrap scores, corresponding with serotype groupings: MGAS5005+M1 GAS, MGAS315+SSI-1, and MGAS2096+MGAS9429 (bipartitions 131 to 129, respectively; Figure 7b). A total of 434 genes tended to also provide support for various unique bipartitions. Around half of the genes had weak or no phylogenetic signal, and, as a consequence, had no conflict with any other trees (Figure 7b). A set of around 200 genes, most of which supported the three bipartitions detailed above, tended not to conflict with each other, but occasionally conflicted with the final grouping of genes. This latter group was composed of the 434 genes mentioned above, which supported a variety of different bipartitions, and thus tended to be in conflict with each other. Overall, the S. pyogenes core-genome is composed of genes that are largely congruent for a portion of relatively recent history (that is, the serotype monophyly), while one-third of the core-genome appears to have strongly incongruent histories for older events. Because it appeared difficult to define which genes were likely to have the same history, we analyzed each gene with its own gene tree in the subsequent positive selection analyses.
Substitution analysis of recombination
The pairwise homoplasy index (PHI) approach suggested that around 20% of the genes were recombinant within the core-genome of Streptococcus and S. pyogenes, while within S. agalactiae only about 3% of the genes were recombinant (Table 2). Employing a more conservative approach that considers as recombinant only those genes found by three different substitution approaches (PHI, MaxChi and neighbor similarity score (NSS)), these proportions were reduced, but the relative differences between the data sets remained (Table 2). With the phylogenetic approach detailed above, numerous genes had weak phylogenetic signal, and several groups of genes were only partially incongruent; therefore, it can be difficult to define clearly which genes have different histories. It is, however, possible to adopt a conservative approach that considers as putative recombinants only those genes showing strong phylogenetic incongruence (SPI) with most of the other genes. Nevertheless, only a small proportion of genes was identified by both PHI and SPI approaches as putative recombinants (Table 2), suggesting that each approach tends to identify different types of recombination event. We therefore propose that an estimate of the complete set of putative recombinants can best be considered as the set of genes identified by SPI plus the genes identified by all three substitution recombination methods (Table 2). This yields an estimate of 18% of the core-genome for S. agalactiae as putative recombinants, 19% for the genus Streptococcus, and 37% for S. pyogenes.
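The combination rule proposed above (genes identified by SPI, plus genes identified by all three substitution methods) amounts to a union of one set with a three-way intersection. A minimal sketch, assuming each method's output has already been reduced to a set of gene identifiers; the function name and toy identifiers are hypothetical:

```python
def putative_recombinants(phi, maxchi, nss, spi):
    """Genes flagged by SPI, plus genes flagged by all of PHI, MaxChi and NSS."""
    all_three_substitution = set(phi) & set(maxchi) & set(nss)
    return set(spi) | all_three_substitution


# Toy gene identifiers (hypothetical).
toy_phi = {"g1", "g2", "g3"}
toy_maxchi = {"g2", "g3", "g4"}
toy_nss = {"g3", "g5"}
toy_spi = {"g1", "g6"}

combined = putative_recombinants(toy_phi, toy_maxchi, toy_nss, toy_spi)
found_by_phi_and_spi = toy_phi & toy_spi
```

Dividing the size of the combined set by the number of genes in the corresponding alignable core-genome gives a proportion of the kind reported for each data set.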
Positive selection analysis
The number of genes that showed evidence for positive selection was particularly high within the Streptococcus core-genome (between 10% and 40%; Table 3). The S. pneumoniae and S. suis lineages, and the ancestral lineage leading to these two species, exhibited the greatest proportion of the core-genome evolving under positive selection (28%, 34% and 32%, respectively; Table 3). Approximately one-third of the genes showed positive selection on only one lineage, and no gene was selected in all possible lineages (Figure 8). There were, however, many examples of genes selected on multiple lineages, including several genes selected on as many as 5 (12 genes) or 6 (4 genes) different lineages (Figures 8 and 9; see Additional data file 5 for a complete list of all genes and lineages under positive selection). A significant proportion of positively selected genes for S. suis, S. pneumoniae, and S. thermophilus was uniquely selected on each of these lineages (21%, 19%, and 24%, respectively), in contrast to that for S. agalactiae, S. pyogenes, and S. mutans, which had either no uniquely selected loci (S. agalactiae), or a very small proportion (Figure 9). Analysis of variance of genes under positive selection pressure supported a significant effect of both lineage and biochemical main role category (Table 4). Post hoc multiple comparisons showed that the main effect was due to two categories, 'DNA metabolism' and 'Transcription'. Less strongly supported, but still significant, was the interaction between lineages and main role categories (Table 4). This interaction appeared mainly due to an increase of genes under positive selection for loci involved in transcription, protein fate, protein synthesis and DNA metabolism for the S. pneumoniae-S. suis ancestral lineage and the S. suis lineage.
In addition to identifying genes and lineages under positive selection, the branch-site test also identifies sites using a Bayes empirical Bayes approach. For 91% of the genes under positive selection, specific sites were proposed (posterior probability >0.95). Interestingly, when a gene was independently selected on different lineages, the sites under positive selection were generally not the same across lineages, arguing for different selection pressures acting at different sites. In contrast to the interspecific comparisons, positive selection was evident for only a few genes within the core-genome, across strains of the different Streptococcus species (Table 3, Additional data file 5), including a few lineages that showed slightly increased levels of positive selection relative to the rest. For S. agalactiae the exceptional lineage was COH1, for S. pyogenes the exceptional lineages were MGAS10270 and that leading to SSI-1/MGAS315, and for S. thermophilus it was LMD-9. A significant number of genes evolving under positive selection were also judged as putative recombinants (Table 5). This was particularly true for the S. pyogenes genome, where 78% of the genes under positive selection were putative recombinants. Approximately half of these genes were identified as recombinants by the substitution based recombination methods, and the other half by the phylogenetic approach.