Genome-wide identification of genes encoding RNase H and dsRHbd
Complete genomes of 326 strains from 235 bacterial species and 27 strains from 27 archaeal species, together with the corresponding GenBank files, were downloaded from the National Center for Biotechnology Information (NCBI) GenBank FTP site; their accession numbers are summarized in Additional file 1. Two strategies were applied to identify sequences of RNase H and of double-stranded RNA and RNA-DNA hybrid-binding domains (dsRHbd) in the complete genomes: a remote homology search with the PSI-BLAST software, and a protein domain search based on hidden Markov model (HMM) profiles.
For the PSI-BLAST search, a non-redundant peptide sequence database was downloaded from the NCBI BLAST FTP site. From this database, peptide sequences of prokaryotes and eukaryotes were extracted by using taxonomy information obtained from the NCBI Taxonomy FTP site. To construct a position-specific scoring matrix, a PGP-BLAST search was carried out against the 3 506 454 extracted peptide sequences, with an E-value threshold of 0.002 and four iterations. The amino acid and nucleotide sequences corresponding to the RNase HI domain of E. coli K12 [GenBank: AAC73319], the RNase HII domain of E. coli K12 [GenBank: AAC73294], the RNase HIII domain of B. subtilis subsp. subtilis str. 168 [Swissprot: P94541], and the dsRHbd of B. halodurans C-125 [Swissprot: Q9KEI9] were used as queries. Using the resulting matrix, PSI-TBLASTN searches were conducted against the 353 complete genomes with an E-value threshold of 0.2.
For the HMM profile analysis, the profiles of RNase HI and RNase HII were downloaded from the Sanger Institute's Pfam Web site, and the HMM profile of dsRHbd was newly built with the hmmbuild module of the HMMER 2.3.2 software on the basis of the results of the PSI-BLAST search. The 353 complete genomes were translated in all six reading frames into amino acid sequences. Using these HMM profiles as queries, protein domain searches were performed with the hmmpfam module of the HMMER 2.3.2 software against the translated complete genomes, with an E-value threshold of 1 × 10⁻⁶.
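The six-frame translation step can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names are hypothetical, the standard codon assignments are assumed (the bacterial and archaeal code, NCBI table 11, differs from the standard code only in its start codons), and codons containing ambiguous bases are simply written as X.

```python
from itertools import product

# Standard codon assignments, listed in TCAG order (as in NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptide strings."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` gives `"MA*"`. In a search such as the one described above, each frame would typically be written out as a separate sequence so that a domain hit can be mapped back to genomic coordinates and strand.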
On the basis of the outputs of the PSI-BLAST and HMM searches, coding sequences that included regions homologous to RNase H or dsRHbd were obtained from the GenBank files by using G-language Perl modules. When a search hit fell in an unannotated genomic region, we manually checked for the existence of an open reading frame (ORF) near that region. To distinguish genes encoding RNase HII from those encoding RNase HIII in the datasets, a PGP-BLAST search was conducted against the Conserved Domain Database (a subset of domains from SMART, Pfam, COG, and CD) downloaded from the NCBI CDD FTP site.
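The mapping of a search hit onto an annotated coding sequence can be sketched as a simple interval-overlap test. The Python below is an illustration only and does not reproduce the G-language Perl modules used here; the `Feature` type, the function names, and the choice of the largest same-strand overlap as the selection rule are all assumptions. A hit that overlaps no annotated coding sequence returns `None`, which corresponds to the case that was inspected manually for an ORF.

```python
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    """A hypothetical annotated coding sequence (1-based, inclusive coordinates)."""
    name: str
    start: int
    end: int
    strand: int  # +1 or -1


def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def assign_hit_to_cds(hit_start, hit_end, hit_strand,
                      cds_list: List[Feature]) -> Optional[Feature]:
    """Return the same-strand CDS with the largest overlap, or None."""
    best, best_len = None, 0
    for cds in cds_list:
        if cds.strand != hit_strand:
            continue
        n = overlap_length(hit_start, hit_end, cds.start, cds.end)
        if n > best_len:
            best, best_len = cds, n
    return best
```

The strand check matters because a translated search reports hits on both strands, and a hit on the strand opposite to an annotated gene should not be assigned to that gene.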
The amino acid and nucleotide sequences of the DNA gyrase subunit B gene (gyrB) were retrieved in a similar way. The CodonAlign 2.0 software (Barry G. Hall, Rochester, NY, USA) was used to align the nucleotide sequences on the basis of alignments of the corresponding amino acid sequences produced with the ClustalW 1.8.3 software. The Modeltest 3.7 software was applied to the output of the PAUP* Version 4.0 software to select an appropriate substitution model by using hierarchical likelihood-ratio tests and the Akaike Information Criterion. Phylogenetic trees were estimated by Bayesian methods with the MRBAYES Version 3.1.2 software under the General Time Reversible model with gamma correction and a proportion of invariable sites. In the Bayesian analysis, the Markov chain Monte Carlo search was run for 1 000 000 generations with four chains, trees were sampled every 100 generations, and a consensus tree was estimated after the first 2500 sampled trees were discarded as burn-in. TreeView software for Power Macintosh was used for viewing and editing the tree.
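The idea of aligning nucleotide sequences according to a protein alignment can be illustrated with a short Python sketch. This is not the CodonAlign implementation: the function name is hypothetical, and the sketch assumes that the coding sequence is in frame, that its length is a multiple of three, and that any codons beyond the aligned residues (for example a terminal stop codon) can be dropped. Each residue in the aligned protein is replaced by its codon and each gap by three gap characters.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an in-frame coding sequence onto a gapped protein alignment row."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(codons) < n_residues:
        raise ValueError("coding sequence is shorter than the protein")
    codon_iter = iter(codons)
    out = []
    for ch in aligned_protein:
        # A protein gap becomes a three-column gap; a residue consumes one codon.
        out.append(gap * 3 if ch == gap else next(codon_iter))
    return "".join(out)
```

For example, `thread_codons("M-A", "ATGGCC")` returns `"ATG---GCC"`. Because gaps are inserted only in multiples of three, the reading frame is preserved across the whole nucleotide alignment.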