miRNA sequences Mature human and mouse miRNA sequences were obtained from the RFAM miRNA registry (Griffiths-Jones 2004). To cover cases of incomplete data, any mouse miRNA sequence not (yet) described in humans was assumed to be present in human, with the same sequence, and vice versa. Similarly, all mouse miRNAs were assumed to be identical and present in the rat genome. These assumptions are reasonable as sequence identity for known orthologous pairs in human and mouse is, on average, 98% (with 110 out of 146 orthologous sequences being identical). In total, 218 mammalian miRNAs were used. For human target searches, 162 native miRNA sequences were available plus 17 mouse and 39 rat miRNA sequences; for mouse, 191 native, 14 human, and 13 rat sequences; and for rat, 45 native, 159 mouse, and 14 human miRNA sequences.
Mature miRNA sequences for zebrafish and fugu were predicted starting from known human and mouse miRNA precursor sequences (Ambros et al. 2003a). Each precursor sequence was used, in a scan against the zebrafish supercontigs (release 18.2.1) using NCBI BLASTN (version 2.2.6; E-value cutoff, 2.0) (Altschul et al. 1990), to identify a sequence segment containing the potential zebrafish miRNA. The mammalian and fish segments were then realigned using a global alignment protocol (ALIGN in the FASTA package, version 2u65; Pearson and Lipman 1988). After testing the potential fish miRNA precursors for foldback structures (Zuker 2003), the final set of 225 predicted zebrafish miRNAs was selected. The same set of sequences was used for fugu.
3′ UTR sequences The Ensembl database (Birney et al. 2004) served as the source of genomic data. The Ensembl BioPerl application user interface was used to generate 3′ UTR sequences for all transcripts of all genes from each genome. Some transcripts are alternatively spliced from the same gene, so the total number of genes is smaller than the number of transcripts (Table 3). When no Ensembl annotated 3′ UTR sequences were available, we predicted 3′ UTRs by taking 4,000 bp of genomic sequence downstream of the end of the last exon of a transcript (Table 3). If this predicted region overlapped coding sequence on either strand, we halted 3′ UTR extension at that point.
UTR orthology and alignment Orthology mappings between genes from different genomes were obtained using “orthologue tables” from the EnsMart (Kasprzyk et al. 2004) feature of the Ensembl database. Pairs of orthologous UTRs were aligned with each other using the AVID (Bray et al. 2003) alignment algorithm to facilitate analysis of conservation of position and sequence of target sites. In total, 26,205 human transcripts, representing 15,869 genes, were mapped to both mouse and rat transcripts. For zebrafish, 11,442 transcripts, representing 10,909 genes, were mapped to fugu transcripts and 11,306 transcripts mapped to human transcripts (10,063 genes).
miRNA target prediction The miRanda algorithm (version 1.0; Enright et al. 2003) was used to scan all available miRNA sequences for a given genome against 3′ UTR sequences of that genome derived from the Ensembl database and—tabulated separately—against all cDNA sequences and coding regions. The algorithm uses dynamic programming to search for maximal local complementarity alignments, corresponding to a double-stranded antiparallel duplex. A score of +5 was assigned for G:C and A:T pairs, +2 for G:U wobble pairs, and −3 for mismatch pairs, and the gap-open and gap-elongation parameters were set to −8.0 and −2.0, respectively. To significantly increase the speed of miRanda runs, in calculating the optimal alignment score at positions i, j in the alignment scoring matrix, the gap-elongation parameter was used only if the extension to i, j of a given stretch of gaps ending at positions i–1, j or j–1, i (but not of stretches of gaps ending at i–k, j or j, i–k for k > 1) resulted in a higher score than the addition of a nucleotide–nucleotide match at positions i, j. Removal of this restriction with the availability of more computing power would result in a moderate increase in average loop length, but the advantages of this would probably be superceded by overall refinement of target prediction rules. Importantly, complementarity scores at the first eleven positions, counting from the miRNA 5′ end, were multiplied by a scaling factor of 2.0, so as to approximately reflect the experimentally observed 5′–3′ asymmetry; for example, G:C and A:T base pairs contributed +10 to the match score in these positions. The value of the scaling factor at each position is an adjustable parameter subject to optimization as more experimental information becomes available. Because of the ongoing discussion about the rules for target prediction, target genes (a total of 490) that contained target sites with more than one G:U wobble in the 5′ end are flagged in the Table S2. The thresholds for candidate target sites were S > 90 and ΔG S is the sum of single-residue-pair match scores over the alignment trace and ΔG is the free energy of duplex formation from a completely dissociated state, calculated using the Vienna package as in Enright et al. (2003).
After finding optimal local matches above these thresholds between a particular miRNA and the set of 3′ UTRs in each genome, we asked whether target site position and sequence for this miRNA were conserved in the 3′ UTRs of orthologous genes, i.e., between human and mouse or rat, or between fugu and zebrafish. The alignments of target sites were generated transitively (UTR
miRNA
UTR) via a shared (or homologous) miRNA. We required that the positions of pairs of target sites in two species fall within ±10 residues in the aligned 3′ UTRs. Conserved target sites with sequence identity of 90% or more (human versus mouse or rat) and 70% or more (zebrafish versus fugu) were selected as candidate miRNA target sites and stored in a MySQL database. Using human as the reference species, we predicted 10,572 conserved target sites (conserved in either mouse or rat) in 4,463 human transcripts, of which 2,307 transcripts of 2,273 genes contained more than one target site. Similarly, using zebrafish as a reference species, we predicted 7,057 conserved target sites (conserved in fugu) in 4,820 zebrafish transcripts.
To focus on the strongest predictions, conserved target sites for each miRNA were sorted according to alignment score, with free energy as the secondary sort criterion. In cases where multiple miRNAs targeted the same site on a transcript (or within 25 nt of a site), only the highest scoring, lowest energy miRNA was reported for that site.
Functional analysis of targets. To facilitate surveys of target function and analysis of functional enrichment, InterPro domain assignments (Mulder et al. 2003) and GO (molecular function hierarchy) mappings (Ashburner and Lewis 2002) for all human genes were obtained using EnsMart. For each functional class derived from either source, we calculated its degree of under- or overrepresentation, Fclass, using the log-odds ratio of the fraction of annotated target genes with the same class (F1) and the fraction of all annotated Ensembl human genes with that class (F2):
Here, N represents the number of genes of a given functional class for either target genes (Ntar) or all genes (Nall), and C represents the total number of functional classes. To eliminate bias from small counts we did not report assignments that were present in less than 1% of all annotated target genes (F1
0.01 or F2
0.01).
Randomized trials For each random experiment all miRNAs were shuffled by randomly swapping two bases of a miRNA 1,000 times. These shuffled sequences were then searched against human, mouse, and rat 3′ UTR sequences in the same way described for the main analysis, including analysis of conservation of target site sequence and position in orthologous 3′ UTRs. A total of ten randomized experiments were performed. Counts were averaged across all experiments, and the standard deviation and other statistical measures were calculated.
Analysis of FMRP-associated mRNAs We compiled a list of 464 gene identifiers of FMRP-associated mRNAs from five different publications (Brown et al. 2001; Chen et al. 2003; Denman 2003; Miyashiro et al. 2003; Waggoner and Liebhaber 2003). Among the 464 gene identifiers, 397 identifiers were mapped to the corresponding genes in our 3′ UTR dataset. The remaining 67 genes were not mapped because their published identifiers were obsolete, primarily because of their Affymetrix probeset identification numbers. To identify miRNA regulation of the 397 FMRP-associated mRNAs, these genes were then compared with the set of predicted miRNA targets.
CPE motif prediction. We predicted CPE motifs in human, mouse, and rat UTRs. We used a search pattern using four criteria: (1) presence of the CPE motif UUUUAU, (2) presence of the hexanucleotide AAUAAA, (3) the CPE and the hexanucleotide motif being within 100 nucleotides of each other, and (4) the conservation of these motifs and the positions of the motifs in the mouse ortholog (Mendez and Richter 2001).