Twenty six genomes of Streptococcus were downloaded from GenBank (Table 1), representing six different species. Coding sequences were extracted from GenBank files, and orthologs were determined using OrthoMCL . This program first makes an all-against-all BLASTp, and then defines putative pairs of orthologs or recent paralogs based on reciprocal BLAST. Recent paralogs are identified as genes within the same genome that are reciprocally more similar to each other than any sequence from another genome. OrthoMCL then converts the reciprocal BLAST p values to a normalized similarity matrix that is analyzed by a Markov Cluster algorithm (MCL) . In return, the MCL yields a set of clusters, with each cluster containing a set of orthologs and/or recent paralogs. OrthoMCL was run with a BLAST E-value cut-off of 1e-5, and an inflation parameter of 1.5. We used the OrthoMCL output to construct a table describing genome gene content (Additional data file 1). Genes that were not included in a cluster were considered taxon specific genes only if they were at least 50 amino acids long and had no BLAST hit with any other protein (E-value ≤ 1e-10). Preliminary analysis indicated that many truncated proteins found at the ends of contigs of the incomplete genomes, although exhibiting clear evidence of homology, were not included in any cluster because they had weak or no reciprocal BLAST hit. This table was used to plot venn diagrams with R 2.2.1  and to construct four core-genome data-sets corresponding to the following taxa: genus Streptococcus, S. agalactiae, S. pyogenes, and S. thermophilus.
Gene loss, acquisition, and duplication were determined on all branches of trees involving these taxa using the parsimony criterion with the DelTran option, implemented in Paup 4.0b10 . Ancestral gene state reconstruction was run with two different step-matrices: the first one assesses gene gain, loss and duplication as being equally likely (for example, ), while the second penalizes gene gain by assuming a double cost compared to loss and duplication (for example, ). The gene content table was also used to perform gene accumulation curves using R, which describe the number of new genes and genes in common, with the addition of new comparative genomes. The procedure was repeated 100 times by randomly modifying genome insertion order to obtain means and standard errors.
Orthologs were first aligned at the DNA level with ClustalW 1.82 . To ensure homology, alignments that contained less than 35% conserved sites for the Streptococcus genus data set, and 50% for the Streptococcus species data sets, were discarded. A preliminary analysis of the data revealed that many sequences identified as under positive selection contained frameshifts that disrupted the reading frame (a single insertion or deletion that modified the reading frame), resulting in high non-synonymous substitution rates. Unfortunately, it is not possible to accurately discriminate sequencing errors from actual insertions or deletions; however, most of these frameshifts were found within the unclosed genomes and appeared at the beginning or end of the contigs, where sequencing errors are more probable. As described by Perrodou et al. , most of these frameshifts are probably sequencing errors, although it is possible that some are not . We chose the conservative approach of removing all codons appearing before or after the frameshift when located at the beginning or end of the coding sequence, respectively. For that purpose, Perl scripts were developed to find frameshifts on the DNA alignments, and the sequences were edited manually. A second alignment step was then used to refine all alignments by translating sequences to amino acids, aligning them with ClustalW, and then back-translating to DNA, using the script transAlign . Finally, amino acid alignments showing a low percentage of conserved sites were manually inspected, and removed if the alignment was ambiguous.
Both phylogenetic and substitution pattern methods were use to detect recombination events. The phylogenetic methods rely on the examination of phylogenetic congruence among genes (for example, [62,63]). If a gene has been laterally transferred, the phylogenetic relationships depicted by this gene will be different from the species tree. We first applied the method suggested by Susko et al. , which tests for the rejection of a set of topologies by a set of orthologous genes using the AU test . When possible (that is, the number of taxa ≤ 7), the trees tested are all the possible unrooted trees (for example, ). When a gene rejects a tree that is supported by the majority of the other genes, this gene is considered to have been laterally transferred. We applied this approach to our data sets (except for S. thermophilus, which contains only three taxa), by using as tested topologies the individual gene trees obtained by phyML (general time reversible (GTR) +Γ4+I model of evolution with a BIONJ starting tree) , with the addition of the tree obtained with Paup (GTR+Γ4+I model of evolution, neighbor-joining (NJ) starting tree, and a tree-bisection-reconnection (TBR) branch-swapping algorithm) reconstructed from the concatenation of all genes. The site likelihood of each tree was than computed by the program baseml (PAML package)  using a GTR+Γ4 model of evolution. The AU test was then applied using Consel .
As suggested by Susko et al. , results (p values for the rejection of each tree) were plotted using heatmaps obtained with R. This approach has the disadvantage of been developed with a test (the AU test) that assesses the rejection of a tree by a gene and not for its acceptance . As a result it is not possible to say if a tree is not rejected because it is not significantly different or because it is simply unresolved. This sort of situation is particularly expected with weakly divergent alignments, and at the opposite spectrum, with saturated alignments. We thus developed and performed a second set of analyses to complement the Susko approach and intended to quantify the amount of supported and incongruent phylogenetic signal between two gene trees. This approach relies on the discovery of well supported conflicting bipartitions (that is, branches that can not be observed in the same tree), as measured by non-parametric bootstrap analysis , thus revealing incongruence between gene histories. Support for each bipartition was obtained by bootstrapping a maximum likelihood (ML) tree search using Paup (GTR+Γ4+I model of evolution). Custom-made scripts were then used to find and count well supported (≥90% bootstrap support) conflicting bipartitions between gene trees. Additional to phylogenetic recombination detection, we employed methods specifically developed to detect homologous intragenic recombination. We used the compatibility approach between site histories, based on the pairwise homoplasy index (PHI), developed by Bruen et al.  and implemented within the program PhiPack . Bruen et al. have suggested PHI is a more robust and sensitive method than many of the earlier approaches. Additional to the PHI statistic, for comparative purposes, we computed MaxChi  and NSS p values for recombination using PhiPack (employing 1,000 permutations).
Positive selection analysis
We employed the branch-site test of Yang and Nielsen [20,74] implemented in the program Codeml of the package PAML  to assess positive selection at particular sites and lineages. Briefly, the likelihood of a model that does not allow positive selection is compared to one allowing positive selection on some specified lineages. The model allowing positive selection is tested using a likelihood ratio test (LRT) that is compared to a χ2 statistic with two degrees of freedom. Likelihoods were estimated on the genes or species trees. For the Streptococcus data set each lineage leading to the six different species was tested. For the species analyses each of the lineages corresponding with the different strains was tested, as well as several internal branches supported by the majority of genes. Finally, p values were corrected for multiple hypothesis testing using the Benjamini and Yekutieli method . The effect of lineages, and genes' TIGR main role categories, and their interaction were tested using an analysis of variance on the LRT, using R. Role categories containing less than ten genes were merged. If F-statistic was significant, Tukey's 'honest significant difference' multi-comparison method was used to discriminate lineages and role categories associated with different LRTs.