Identification of co-orthologs in the Fugu genome
Studies into fish-specific duplicated genes have identified a number of examples in the Fugu genome (for example, [15,39]). As with most genes in general, few of these Fugu specific duplicates have CNEs in their vicinity. Suitable gene candidates for study of CNE evolution between teleost-specific gene paralogs were initially identified using 2,330 CNEs derived from a whole-genome comparison of the non-coding portions of the human and Fugu genome . CNE clusters that mapped to the vicinity of a single human genomic region but were derived from two non-contiguous Fugu scaffolds were considered further. We selected seven genomic regions in human that fitted this criterion, each containing clusters of CNEs in the vicinity of a single gene implicated in developmental regulation: BCL11A (transcription factor B-cell lymphoma/leukemia 11A), EBF1 (early B-cell factor 1), FIGN (fidgetin), PAX2 (paired box transcription factor Pax2), SOX1 (HMG box transcription factor Sox1), UNC4.1 (homeobox gene Unc4.1) and ZNF503 (zinc-finger gene Znf503). Some of these genes have relatively well characterized roles in early development, such as PAX2 (which plays critical roles in eye, ear, central nervous system and urogenital tract development [40-42], SOX1 (involved in neural and lens development [43,44], BCL11A (thought to play important roles in leukaemogenesis and haematopoiesis ) and EBF1 (important for B-cell, neuronal and adipocyte development [46,47]. FIGN, UNC4.1 and ZNF503 are less well characterized, although studies of their orthologs in mouse or rat indicate important roles in retinal, skeletal and neuronal development [48-51].
For each CNE cluster region in the human genome, we identified homologs to the human trans-dev protein on each Fugu scaffold, suggesting the presence of co-orthologous genes. To confirm this, we carried out a phylogeny of these protein sequences together with tetrapod orthologs and all available co-orthologs from the zebrafish genome. In addition, two outgroups utilizing the closest in-paralog as well as an invertebrate ortholog were included in each alignment to help resolve the phylogeny (Figure 1). In all cases where a close paralog could be identified, the Fugu co-ortholog candidates branch with strong bootstrap values with tetrapod orthologs of the target trans-dev gene, rather than the closest paralog, confirming these genes are true co-orthologs. Furthermore, for all phylogenies, the Fugu and zebrafish/medaka sequences branch together after the split with tetrapods, confirming they derive from a fish-specific duplication event. In only one out of three cases (pax2) where two co-orthologous proteins could also be identified in zebrafish does each Fugu copy branch directly with each zebrafish copy, indicating their proteins have followed similar evolutionary paths (Figure 1d). In contrast, the other two cases (sox1 and unc4.1) exhibit a different topology in that both zebrafish co-orthologs are more similar to one of the Fugu co-orthologs than the other (although weak bootstrap values for the fish unc4.1 may suggest alternative phylogenies). This is most likely due to species-specific asymmetrical rates of evolution seen between many genes in teleost fish , as well as elevated rates of evolution in duplicated genes in general, and pufferfish in particular , which may have obscured the true phylogenies in these cases. The given names of the Fugu co-orthologs used in this study (see Materials and methods for more details on nomenclature), their location in the Fugu genome and protein sequence accession codes can be found in Table 1.
CNE distribution and changes in genomic environment around Fugu co-orthologs
CNEs were independently identified within each Fugu co-orthologous region by carrying out a combination of multiple and pairwise alignment with the same orthologous sequence from human, mouse and rat (the entire dataset from this study can be accessed and queried through the web-based CONDOR database ). The regions in which CNEs were located for each co-ortholog together with surrounding gene environment can be seen in Figure 2.
All but one of the CNE regions in human are located in gene-poor regions termed 'gene deserts' that flank or surround the trans-dev gene and are characteristic of regions thought to contain large numbers of cis-regulatory elements . These gene deserts appear to have been conserved to some degree in both Fugu copies (albeit in a highly compact form). For example, a large gene desert of approximately 2.2 Mb is located downstream of BCL11A up to the ubiquitin ligase gene FANCL in human, and similar (compacted) versions of this gene desert are present in both Fugu regions, although downstream of bcl11a.2 it is almost a quarter of the size compared to the same region in bcl11a.1 (98 kb versus 380 kb). In the majority of regions under study (five out of seven), CNEs extend purely within these large intergenic regions directly flanking or within the introns of the trans-dev gene. In those regions in which CNEs extend beyond or within the genes neighboring the trans-dev gene (that is, bcl11a.1, znf503.1 and znf503.2) the gene order and orientation between Fugu and human has remained largely conserved, spanning three to five genes, something that is relatively rare within the Fugu genome [54,55]. This may be due to functional constraints on these regions whereby it is necessary to maintain the CRM and associated gene in cis [34,56]. For the remaining co-orthologous regions the degree of synteny varies widely. For instance, neither Fugu pax2 region has conserved gene order with the human genome. Two orthologs of NDUFB8 and HIF1AN (upstream of human PAX2) are partitioned and rearranged so that hif1an is downstream of pax2.1 and ndufb8 is downstream of pax2.2 (Figure 2).
The preservation of 98.5% of the CNEs (796/811) as well as both trans-dev genes in the same orientation and order along the sequence between human and Fugu, in contrast to the rearrangement of surrounding genes, confirms the likelihood that the CNEs and trans-dev genes identified are associated with each other.
Pattern of CNE retention/partitioning between co-orthologs
The DDC model for the retention of gene duplicates over evolution states that following duplication, genes undergo complementary degenerative loss of subfunctions or, on the regulatory level, expression domains. Based on the assumption that CNEs represent putative autonomous CRMs that control gene expression to one or more specific expression domains, we would predict that this process of regulatory subfunctionalization would involve the degeneration or loss of these elements between gene duplicates so that the ancestral CRMs were to some degree partitioned between the two genes. We identified 811 CNEs in total for all 14 regions in Fugu with lengths ranging from 30-562 bp (mean = 117 bp, median = 85 bp) and human-Fugu percent identities ranging from 60-94% (mean = 74%). CNEs from each co-ortholog were defined as 'overlapping' if there was conservation between them to at least part of the same single sequence in human. CNEs that were conserved between human and only one Fugu co-ortholog with no significant overlap to CNEs in the counterpart co-ortholog were defined as 'distinct'. Figure 3 illustrates the definition of overlapping and distinct CNEs identified in a multiple alignment between Fugu regions around pax2.1 and pax2.2, against the reference human PAX2 region.
Similar to other trans-dev gene regions identified previously (for example, ), the co-orthologs under study have highly variable numbers of CNEs conserved in their vicinity, ranging from 11 CNEs in sox1.2 to 156 in znf503.1 (Figure 4). Comparison of the overall number of CNEs conserved between co-orthologous copies revealed three sets, bcl11a.1/2, ebf1.1/2 and znf503.1/.2, that have notably different overall numbers of CNEs located in their vicinity, indicating a large-scale loss of elements in one co-ortholog compared to its counterpart since duplication (Figure 4). In the cases of bcl11a.1/2 and znf503.1/2, this large-scale asymmetrical loss of elements in one co-ortholog copy correlates to a large decrease in genomic sequence within the same region (Additional data file 2). Many of the co-orthologs have also undergone substantial partitioning of elements, as indicated by the large proportion of the identified CNEs classified as 'distinct' in each co-ortholog. For example, fign.1 and fign.2 have a similar number of CNEs in their vicinity (47 and 50, respectively) but 42% and 56% of these CNEs, respectively, are distinct to each co-ortholog. The extent of distinct CNEs as a proportion of total CNEs differs significantly between sets of co-orthologs, ranging from 24.5% (13/53) in pax2.1 to 83% (34/41) in ebf1.1 (Figure 4). For co-orthologs of BCL11A and EBF1 the majority of CNEs in both genes are distinct. Only in co-orthologs of PAX2 are the majority of CNEs in both genes found to be overlapping (Figures 3 and 4), suggesting a high level of retention of regulatory domains in both genes since duplication. In the majority of gene pairs, namely co-orthologs of FIGN, SOX1, UNC4.1 and ZNF503, one copy has the majority of its CNEs as distinct while the other has a majority of its CNEs overlapping with that of its counterpart co-ortholog, suggesting an asymmetrical rate of element partition.
The accuracy of these results depends heavily on ensuring that the loss of elements in one co-ortholog is the result of subfunctionalization rather than lack of sequence coverage in the genomic sequence. The proportion of 'N's (sections of unfinished sequence) within each Fugu genomic sequence can be seen in Table 1. We found that only one of the gene regions, sox1.2, contains a significant proportion of unfinished sequence (8.9%), suggesting some of the CNEs defined as 'distinct' in sox1.1 may have overlapping counterparts in sox1.2. However, closer examination of the positioning of the unfinished sequence reveals that the vast majority occurs in a region easily defined by two flanking overlapping CNEs that contains just a single distinct CNE in its counterpart co-ortholog. The region in sox1.2 potentially containing counterparts to most of the distinct CNEs in sox1.1 contains less than 3% unfinished sequence, suggesting most, if not all, of these distinct CNEs are defined correctly. Without 100% finished sequence in all cases it is, of course, possible that a small proportion of the CNEs identified as distinct in these co-orthologs may have an overlapping counterpart within unfinished sequence, but given the high levels of finished sequence in most of the gene regions, this is unlikely to account for a significant number.
Evolution of overlapping CNEs since duplication
Overlapping CNEs comprise a large proportion and, in some cases, the majority of CNEs identified around many of the gene pairs and have, therefore, remained to some extent under positive selection in both co-orthologs. The distribution of lengths and percent identities for 381 overlapping CNEs versus 430 distinct CNEs is significantly different for both lengths (p -16) and percent identities (p = 1.1-8). Overlapping CNEs have significantly higher average lengths (mean = 149.6 bp, median = 116.1 bp) than distinct CNEs (mean = 87.6 bp, median = 62 bp) as well as slightly higher percent-identities (mean = 75.2% and median = 75% for overlapping versus mean = 72.4% and median = 71.7% for distinct). Only 4 of the distinct CNEs overlap to some degree but by less than the arbitrary 20 bp cut-off required for CNEs to be defined as overlapping. Removing these leaves the mean lengths and percent-identities virtually unchanged, confirming that the cut-off did not significantly bias the distribution of distinct elements towards smaller elements.
We studied two aspects to gauge evolutionary changes occurring in these elements since duplication: changes in element length and changes in substitution rate between overlapping CNEs in Fugu.
A total of 182 pairs of overlapping CNEs were identified across all co-ortholog pairs with a one-to-one relationship. The length of the overlap in the human sequence between co-orthologous CNEs ranged from 24-460 bp (mean = 107.5 bp ± 2.27 standard error of the mean). For each overlapping pair, we calculated the proportion of the overlapping sequence as a function of the full length Fugu-human conserved sequence in each co-ortholog. We found 62% of the pairs to have undergone significant degeneration in element length in one of the copies compared to its counterpart (Figures 5 and 6); 30% of pairs overlapped over the majority of both elements, suggesting little evolution of element length since duplication, and approximately 8% have undergone a significant level of degeneration in element length in both copies at their edges. These results suggest the process of subfunctionalization may also be occurring, at least in some of these cases, through the partial loss of function in both copies, allowing gene preservation through quantitative complementation (as suggested in ). It is also possible that sequence loss could causes changes in module function through the change in binding site combinations present. In genes such as pax2.1 and pax2.2 that have the majority of their CNEs overlapping in both genes, this presents an additional mechanism by which both copies may be preserved. In addition to overlapping CNEs that have undergone evolution at their edges, 29 overlapping CNEs have undergone evolution at the centre of the element, essentially creating a split element (that is, a CNE in one co-ortholog overlaps two or more CNEs from the other co-ortholog).
CNE sequence evolution
Overlapping CNEs are conserved to the same human sequence across the length of the overlap. However, it is possible that elements have undergone differential evolution, with one element containing a significantly greater number of independent substitutions than the other, indicative of either subfunctionalization or neofunctionalization. To measure whether the sequence of one CNE has diverged faster than its counterpart, we used the Tajima relative rate test  with the human sequence as the outgroup (or ancestral) sequence. The Tajima relative rate test measures the significance in the difference of independent substitutions in each sequence relative to the outgroup sequence using a chi-squared statistic (see Additional file 3 for the results of relative rate tests for all overlapping CNEs). The percentages of overlapping CNEs that show a statistically significant difference in substitution rate in one copy over another range from 17% in sox1 to 26% in znf503 (Table 2). One of the most significant examples within this set was found in a pair of CNEs upstream of co-orthologs of UNC4.1 and can be seen in Figure 6. These results suggest that a substantial number of the elements appear to have undergone an asymmetrical rate of evolution since duplication, something we would expect under the DDC model. Alternatively, if these changes were positively selected it may indicate a process of neofunctionalization whereby co-orthologs have evolved novel regulatory patterns to that of the ancestral copy.
A history of duplications: some co-orthologous CNEs were duplicated in ancient events at the origin of vertebrates
In addition to being involved in a teleost-specific duplication event, a number of the CNEs identified around the trans-dev genes in this study have been previously retained from ancient duplications thought to have occurred at the origin of vertebrates. While the majority of CNEs are single copy in the human genome, a recent study identified 124 families of CNEs genome-wide that have more than one copy across all available vertebrate genomes and are referred to as 'duplicated CNEs' (dCNEs) . dCNEs are associated with nearby trans-dev paralogs and a number have been shown to act as enhancers that drive in vivo reporter-gene expression to similar domains . The absence of these sequences in non-vertebrate chordate genomes and their association with paralogs that arose from whole-genome duplication events at the origin of vertebrates  places their origins sometime prior to this event more than 550 million years ago. The conservation of these elements over such extreme evolutionary distances suggests they play critical roles in the regulation of paralogs that have since undergone neofunctionalization. We found 30 non-redundant human CNEs (conserved to 52 co-orthologous CNEs in Fugu) to be dCNEs in the vicinity of one or more paralogs of the nearby trans-dev gene (Table 3). This further confirms the tight association of these CNEs with their nearby trans-dev genes as dCNEs resolve the CNE-gene association more clearly . These dCNEs were identified in five of the seven co-orthologous regions with some dCNEs associated with more than one paralog (for example, PAX2 associated dCNEs located in the vicinity of PAX5 and PAX8; Table 3; Figure 3). 80% of the co-ortholog CNEs identified as dCNEs (42/52) are conserved in both co-ortholog regions in Fugu, a two-fold enrichment (p