Translating ribosomes decode three nucleotides per codon into peptides. Their movement along mRNA, captured by ribosome profiling, produces the footprints exhibiting characteristic triplet periodicity. This protocol describes how to use RiboCode to decipher this prominent feature from ribosome profiling data to identify actively translated open reading frames at the whole-transcriptome level.
Identification of open reading frames (ORFs), especially those encoding small peptides and being actively translated under specific physiological contexts, is critical for comprehensive annotations of context-dependent translatomes. Ribosome profiling, a technique for detecting the binding locations and densities of translating ribosomes on RNA, offers an avenue to rapidly discover where translation is occurring at the genome-wide scale. However, efficiently and comprehensively identifying the translating ORFs from ribosome profiling data is not a trivial bioinformatics task. Described here is an easy-to-use package, named RiboCode, designed to search for actively translating ORFs of any size from distorted and ambiguous signals in ribosome profiling data. Taking our previously published dataset as an example, this article provides step-by-step instructions for the entire RiboCode pipeline, from preprocessing of the raw data to interpretation of the final output files. Furthermore, for evaluating the translation rates of the annotated ORFs, procedures for visualization and quantification of ribosome densities on each ORF are also described in detail. In summary, the present article provides useful and timely instructions for research fields related to translation, small ORFs, and peptides.
Recently, a growing body of studies has revealed widespread production of peptides translated from ORFs of coding genes and of genes previously annotated as noncoding, such as long noncoding RNAs (lncRNAs)1,2,3,4,5,6,7,8. These translated ORFs are regulated or induced by cells to respond to environmental changes, stress, and cell differentiation1,8,9,10,11,12,13. The translation products of some ORFs have been demonstrated to play important regulatory roles in diverse biological processes in development and physiology. For example, Chng et al.14 discovered a peptide hormone named Elabela (Ela, also known as Apela/Ende/Toddler), which is critical for cardiovascular development. Pauli et al. suggested that Ela also acts as a mitogen that promotes cell migration in the early fish embryo15. Magny et al. reported two micropeptides of less than 30 amino acids regulating calcium transport and affecting regular muscle contraction in the Drosophila heart10.
It remains unclear how many such peptides are encoded by the genome and whether they are biologically relevant. Therefore, systematic identification of these potentially coding ORFs is highly desirable. However, directly determining the products of these ORFs (i.e., protein or peptide) using traditional approaches such as evolutionary conservation16,17 and mass spectrometry18,19 is challenging because the detection efficiency of both approaches is dependent on the length, abundance, and amino acid composition of the produced proteins or peptides. The advent of ribosome profiling, a technique for identifying the ribosome occupancy on mRNAs at nucleotide resolution, has provided a precise way to evaluate the coding potential of different transcripts3,20,21, irrespective of their length and composition. An important and frequently used feature for identifying actively translating ORFs using ribosome profiling is the three-nucleotide (3-nt) periodicity of the ribosome's footprints on mRNA from the start codon to the stop codon. However, ribosome profiling data often have several issues, including low and sparse sequencing reads along ORFs, high sequencing noise, and ribosomal RNA (rRNA) contaminations. Thus, the distorted and ambiguous signals generated by such data weaken the 3-nt periodicity patterns of ribosomes' footprints on mRNA, which ultimately makes the identification of the high-confidence translated ORFs difficult.
A package named "RiboCode" adopted a modified Wilcoxon signed-rank test and a P-value integration strategy to examine whether an ORF has significantly more in-frame ribosome-protected fragments (RPFs) than off-frame RPFs22. It was demonstrated to be highly efficient, sensitive, and accurate for de novo annotation of the translatome in simulated and real ribosome profiling data. Here, we describe how to use this tool to detect potentially translating ORFs from the raw ribosome profiling sequencing datasets generated in a previous study23. These datasets had been used to explore the function of EIF3 subunit "E" (EIF3E) in translation by comparing the ribosome occupancy profiles of MCF-10A cells transfected with control (si-Ctrl) and EIF3E (si-eIF3e) small-interfering RNAs (siRNAs). By applying RiboCode to these example datasets, we detected 5,633 novel ORFs potentially encoding small peptides or proteins. These ORFs were categorized into various types based on their locations relative to the coding regions, including upstream ORFs (uORFs), downstream ORFs (dORFs), overlapped ORFs, ORFs from novel protein-coding genes (novel PCGs), and ORFs from novel nonprotein-coding genes (novel NonPCGs). The RPF read densities on uORFs were significantly increased in EIF3E-deficient cells compared to control cells, which might be at least partially caused by the enrichment of actively translating ribosomes. The localized ribosome accumulation in the region from the 25th to the 75th codon in EIF3E-deficient cells indicated a blockage of translation elongation at an early stage. This protocol also shows how to visualize the RPF density of a desired region to examine the 3-nt periodicity patterns of ribosome footprints on identified ORFs. These analyses demonstrate the powerful role of RiboCode in identifying translating ORFs and studying the regulation of translation.
1. Environment setup and RiboCode installation
2. Data preparation
3. Trim adapters and remove rRNA contamination
4. Align the clean reads to the genome
5. Size selection of RPFs and identification of their P-sites
6. De novo annotate translating ORFs
7. (Optional) ORF quantification and statistics
8. (Optional) Visualization of the predicted ORFs
9. (Optional) Metagene analysis using RiboMiner
NOTE: Perform the metagene analysis to assess the influence of EIF3E knockdown on the translation of identified annotated ORFs, following the steps below:
The example ribosome profiling datasets were deposited in the GEO database under the accession number GSE131074. All the files and codes used in this protocol are available from Supplemental files 1–4. By applying RiboCode to a set of published ribosome profiling datasets23, we identified the novel ORFs actively translated in MCF-10A cells treated with control and EIF3E siRNAs. To select the RPF reads that are most likely bound by translating ribosomes, the lengths of the sequencing reads were examined, and a metagene analysis was performed using the RPFs that mapped to known translated genes. The frequency distribution of the read lengths showed that most RPFs were 25-35 nt (Figure 1A), corresponding to the length of nucleotide sequence covered by a ribosome, as expected. The P-site locations for RPFs of different lengths were determined by examining the distances from their 5' ends to the annotated start and stop codons, respectively (Figure 1B). The RPF reads of 28-32 nt displayed strong 3-nt periodicity, and their P-sites were at the +12 nt position (Supplemental file 1).
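The read selection and P-site assignment described above can be illustrated with a minimal Python sketch. It is not RiboCode's implementation; the offset table `PSITE_OFFSETS` and the function `psite_position` are hypothetical names, and the sketch assumes that "+12" means adding 12 to the coordinate of the read's 5' end on the transcript.

```python
# Hypothetical offset table: read length (nt) -> distance from the 5' end to the P-site.
# Values follow the text (28-32 nt reads, P-site at +12); inspect your own metaplots.
PSITE_OFFSETS = {length: 12 for length in range(28, 33)}


def psite_position(five_prime_pos, read_length, offsets=PSITE_OFFSETS):
    """Return the transcript coordinate of the P-site.

    Returns None when the read length is not among the selected lengths,
    which mimics discarding reads without clear 3-nt periodicity.
    """
    offset = offsets.get(read_length)
    if offset is None:
        return None
    return five_prime_pos + offset


# Each read is (5' end coordinate on the transcript, read length).
reads = [(100, 29), (101, 25), (205, 32)]
psites = [psite_position(pos, length) for pos, length in reads]
print(psites)  # [112, None, 217]
```

In practice, the selected lengths and offsets may differ between samples, so the table would be filled in per sample from the metaplots rather than hard-coded.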
RiboCode searches for the candidate ORFs from a canonical start codon (AUG) or alternative start codons (optional, e.g., CUG and GUG) to the next stop codon. Then, based on the mapping results of RPFs within the defined range, RiboCode assesses the 3-nt periodicity by evaluating whether the number of in-frame RPFs (i.e., their P-sites allocated on the first nucleotide of each codon) is greater than the number of out-of-frame RPFs (i.e., their P-sites allocated on the second or third nucleotide of each codon). We identified 13,120 genes harboring potentially translating ORFs with p < 0.05. Among them were 10,394 genes (70.8%) encoding annotated ORFs, 168 genes (1.1%) encoding dORFs, 509 genes (3.5%) encoding uORFs, 939 genes (6.4%) encoding upstream or downstream ORFs overlapped with known annotated ORFs (Overlapped), 68 protein-coding genes (0.5%) encoding novel ORFs, and 2,601 genes (17.7%) previously assigned as noncoding that encode novel ORFs (Figure 2 and Supplemental file 3). The percentages are relative to the 14,679 gene-category assignments, as a gene can harbor more than one ORF type.
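The in-frame versus out-of-frame comparison can be sketched in Python as follows. This is a simplified illustration, not RiboCode's modified Wilcoxon signed-rank test: it only counts P-sites per frame and tallies, codon by codon, how often frame 0 exceeds each of the other two frames. The function names and the half-open coordinate convention are assumptions made for this example.

```python
from collections import Counter


def frame_counts(psites, orf_start, orf_end):
    """Count P-sites by reading frame (0 = in-frame) within [orf_start, orf_end)."""
    counts = Counter({0: 0, 1: 0, 2: 0})
    for pos in psites:
        if orf_start <= pos < orf_end:
            counts[(pos - orf_start) % 3] += 1
    return counts


def codons_favoring_frame0(psites, orf_start, orf_end):
    """Per codon, tally whether frame 0 has more P-sites than frame 1 and than frame 2."""
    n_codons = (orf_end - orf_start) // 3
    per_codon = [[0, 0, 0] for _ in range(n_codons)]
    for pos in psites:
        if orf_start <= pos < orf_start + 3 * n_codons:
            rel = pos - orf_start
            per_codon[rel // 3][rel % 3] += 1
    wins_vs_f1 = sum(c[0] > c[1] for c in per_codon)
    wins_vs_f2 = sum(c[0] > c[2] for c in per_codon)
    return wins_vs_f1, wins_vs_f2


# Toy ORF spanning transcript positions 10-21 (4 codons).
toy_psites = [10, 10, 13, 14, 16, 16, 19, 21]
print(frame_counts(toy_psites, 10, 22))            # frame 0: 6, frame 1: 1, frame 2: 1
print(codons_favoring_frame0(toy_psites, 10, 22))  # (3, 3)
```

A real test would turn the two paired comparisons (frame 0 vs. frame 1, frame 0 vs. frame 2) into P-values and integrate them, as described for RiboCode in the text.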
Comparing the sizes of different ORFs showed that uORFs and overlapped ORFs are shorter (195 and 188 nt on average, respectively) than annotated ORFs (~1,771 nt). The same trend was also observed for novel ORFs (670 and 385 nt on average for novel PCGs and novel NonPCGs, respectively) and dORFs (~671 nt) (Figure 3). Together, the noncanonical (unannotated) ORFs identified by RiboCode tended to encode peptides that are smaller than those encoded by known annotated ORFs.
Relative RPF counts were calculated for each ORF to assess the function of EIF3 in the processes of translation. The results suggested that the ribosome densities of uORFs were significantly higher in EIF3E-deficient cells than in control cells (Figure 4). As many uORFs were reported to exert inhibitory effects on the translation of downstream coding ORFs, we further examined whether the EIF3E knockdown alters the global densities of RPFs downstream of the start codons (Figure 5). The metagene analysis, in which many ORFs' profiles were aligned and then averaged, revealed that a mass of ribosomes stalled between codons 25 and 75 downstream of the start codon, suggesting that translation elongation might be blocked early in EIF3E-deficient cells. Further investigations are warranted to determine whether the increase in uORF RPKM and the accumulation of ribosomes between codons 25 and 75 in the absence of EIF3E reflect 1) a difference in signal-to-noise ratio (less contamination or better library quality) or 2) a change in the translation of the ORFs (active translation or ribosome pausing).
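The "align and then average" idea behind the metagene analysis can be sketched in a few lines of Python. This is an illustrative sketch and not RiboMiner's implementation; the function name `metagene_profile`, the per-ORF mean normalization, and the rule for skipping short or empty ORFs are assumptions made for the example.

```python
def metagene_profile(orf_profiles, n_codons):
    """Average per-codon densities across ORFs aligned at the start codon.

    Each profile is a list of P-site counts per codon, starting at the start codon.
    Every profile is scaled by its own mean so that highly expressed ORFs do not
    dominate (an assumption of this sketch). ORFs shorter than n_codons or
    without any reads are skipped.
    """
    totals = [0.0] * n_codons
    used = 0
    for profile in orf_profiles:
        if len(profile) < n_codons:
            continue
        mean = sum(profile) / len(profile)
        if mean == 0:
            continue
        for i in range(n_codons):
            totals[i] += profile[i] / mean
        used += 1
    if used == 0:
        return []
    return [t / used for t in totals]


# Toy input: two usable ORFs, one ORF without reads, one ORF that is too short.
profiles = [[2, 4, 6], [1, 1, 1], [0, 0, 0], [5, 5]]
print(metagene_profile(profiles, 3))  # [0.75, 1.0, 1.25]
```

Comparing such averaged profiles between si-Ctrl and si-eIF3e samples is what would reveal a local excess of density, such as the one reported between codons 25 and 75.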
Finally, RiboCode also provides visualization of the P-site densities of RPFs on a desired ORF, which can help users examine the 3-nt periodicity patterns and densities of RPFs. For example, Figure 6 presents the RPF densities on a uORF of PSMA6 and a dORF of SENP3-EIF4A1; both were validated by published proteomics data23 (data not shown).
Figure 1: Assessment of sequencing reads and the P-site positions. (A) Length distribution of ribosome-protected fragments (RPFs) in EIF3E-deficient cells in replicate 1 (si-eIF3e-1); (B) Inferred P-site position of 29-nt RPFs based on their densities around the known start (top) and stop (bottom) codons.
Figure 2: Percentages of genes harboring different types of ORFs identified by RiboCode using all samples together. Abbreviations: ORF = open reading frame; dORF = downstream ORF; PCG = protein-coding gene; NonPCG = nonprotein-coding gene; uORF = upstream ORF.
Figure 3: Length distributions of different ORF types. Abbreviations: ORF = open reading frame; dORF = downstream ORF; PCG = protein-coding gene; NonPCG = nonprotein-coding gene; uORF = upstream ORF; nt = nucleotide.
Figure 4: Comparison of normalized read counts for different ORF types between control and EIF3E-deficient cells. P-values were determined by the Wilcoxon signed-rank test. Abbreviations: ORF = open reading frame; dORF = downstream ORF; PCG = protein-coding gene; NonPCG = nonprotein-coding gene; uORF = upstream ORF; RPKM = reads per kilobase per million mapped reads; siRNA = small-interfering RNA; si-Ctrl = control siRNA; si-eIF3e = siRNA targeting EIF3E.
Figure 5: Metagene analysis showing the stalling of ribosomes between the 25th and 75th codons downstream of the start codon of annotated ORFs. Abbreviations: ORF = open reading frame; siRNA = small-interfering RNA; si-Ctrl = control siRNA; si-eIF3e = siRNA targeting EIF3E; A.U. = arbitrary units.
Figure 6: P-site density profiles of example ORFs encoding micropeptides. (A) P-site densities of the predicted uORF and its position relative to the annotated CDS on transcript ENST00000622405; (B) same as in A but for the predicted dORF on transcript ENST00000614237. The bottom panels show enlarged views of the predicted uORF (A) or dORF (B). Red bars = in-frame reads; green and blue bars = off-frame reads. Abbreviations: ORF = open reading frame; dORF = downstream ORF; uORF = upstream ORF; CDS = coding sequence.
Supplemental Information: Evaluation of the dependence between two p-values and explanation of RiboCode results (uORF of ATF4 as an example).
Supplemental File 1: The configuration file for RiboCode defining the selected lengths of RPFs and P-site positions.
Supplemental File 2: RiboCode output file containing the information of predicted ORFs.
Supplemental File 3: R script file for performing basic statistics of RiboCode output.
Supplemental File 4: The configuration file (for RiboMiner) modified from Supplemental File 1.
Ribosome profiling offers an unprecedented opportunity to study the ribosomes’ action in cells at a genome scale. Precisely deciphering the information carried by the ribosome profiling data could provide insight into which regions of genes or transcripts are actively translating. This step-by-step protocol provides guidance on how to use RiboCode to analyze ribosome profiling data in detail, including package installation, data preparation, command execution, result explanation, and data visualization. The analysis results of RiboCode indicated that translation is pervasive and occurs on unannotated ORFs of coding genes and many transcripts previously assumed to be noncoding. The downstream analyses provided evidence that the ribosomes move along the predicted ORFs in 3-nucleotide steps as translation occurs; however, it remains unclear whether the process of translation or the produced peptides serve any function. Nevertheless, accurate annotations of translating ORFs on the genome can give rise to exciting opportunities to identify the functions of previously uncharacterized transcripts31.
The prediction of coding potential for each ORF using ribosome profiling data relies heavily on the 3-nt periodicity of the P-site densities on each codon from the start codon to the stop codon. Therefore, it requires precise detection of the P-site locations of reads of different lengths. Such information is not directly provided by ribosome profiling data but can be inferred from the distances between the 5′ end of RPFs and annotated start or stop codons (protocol step 5.3). A lack of annotated start/stop codons in the GTF file, as for newly assembled genomes, may cause RiboCode to fail to execute the downstream steps unless the exact P-site locations of the reads are determined by other means. In most cases, the size of ribosome-bound fragments and their P-site locations are constant, for example, 28-30 nt long and at the +12 nt from the 5′ end of reads in human cells. RiboCode allows the selection of reads in a specific length range and the definition of their P-site positions based on experience. However, both the lengths of RPF reads and the positions of their P-sites might differ when the environmental conditions (e.g., stress or stimulus) or the experimental procedure (e.g., nuclease, buffer, library preparation, and sequencing) change. Therefore, we recommend generating the metaplots (protocol step 5.3) for each sample to extract the most high-confidence RPFs (i.e., reads displaying 3-nt periodicity patterns) and determine their P-site positions under different conditions. Although these operations can be done automatically using the metaplots function, often only a minority of reads showing near-perfect framing or phasing pass the rigorous selection criteria and statistical test.
Therefore, it is still necessary to loosen certain parameters, especially “-f0_percent,” and then visually inspect the 3-nt periodicity of reads at each length and manually edit the configuration file to include more reads accordingly, especially when the library quality is poor (protocol step 5.3).
RiboCode searches for the candidate ORFs from canonical or noncanonical start codons (NUGs) to the next stop codon. For transcripts with multiple start codons upstream of a stop codon, the most likely start codon is determined by assessing the 3-nt periodicity of the RPF reads mapped between two neighboring start codons or by simply choosing the upstream start codon having more in-frame than off-frame RPF reads. A limitation of such a strategy is that the actual start codon might be misidentified if reads aligned to the start codon regions are sparse or absent. Fortunately, recent strategies, such as global translation initiation sequencing (GTI-seq)32 and quantitative translation initiation sequencing (QTI-seq)33, provide more direct ways of locating translation initiation sites. For NUGs, more studies are still required to investigate their validity as efficient start codons.
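The candidate search from a start codon to the next in-frame stop codon can be sketched in Python as below. This is a minimal illustration under stated assumptions, not RiboCode's code: `find_candidate_orfs`, its `min_codons` parameter, and the 0-based half-open coordinates are hypothetical choices, and the sketch reports every start codon rather than selecting the most likely one.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_candidate_orfs(seq, start_codons=("AUG",), min_codons=1):
    """Return (start, end) pairs for candidate ORFs on one transcript.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the codons before the stop codon (hypothetical parameter).
    """
    seq = seq.upper().replace("T", "U")
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in start_codons:
            continue
        # Walk codon by codon in the frame set by this start codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break  # stop at the first in-frame stop codon
    return orfs


transcript = "GGAUGAAACUGCCCUAAGG"
print(find_candidate_orfs(transcript))                               # [(2, 17)]
print(find_candidate_orfs(transcript, start_codons=("AUG", "CUG")))  # [(2, 17), (8, 17)]
```

In the second call, both candidates share the same stop codon, which is exactly the situation in which the RPF periodicity between neighboring start codons is used to choose among them.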
We also released a new update for RiboCode that adds three new features: 1) it reports the other potential ORF types assigned according to their locations relative to transcripts other than the longest one; 2) it provides an option for adjusting the combined p-values if the tests of RPF reads in the two off-frames are not independent (see the more detailed explanation in Supplemental Information); 3) it performs p-value correction for multiple testing, allowing for more stringent screening of translating ORFs.
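As an illustration of what a multiple-testing correction does to a list of ORF p-values, the following Python sketch implements the Benjamini-Hochberg procedure. The text does not state which correction method RiboCode applies, so the choice of Benjamini-Hochberg here is an assumption made only for this example, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


orf_pvalues = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(orf_pvalues, benjamini_hochberg(orf_pvalues)):
    print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

Filtering on the adjusted values instead of the raw p-values keeps fewer ORFs, which is the more stringent screening mentioned above.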
As RiboCode identifies the actively translating ORFs by evaluating the 3-nt periodicity of the RPF read densities, it has certain limitations for ORFs that are extremely short (e.g., less than 3 codons). Spealman et al. compared the performance of RiboCode with uORF-seqr and reported that no uORFs shorter than 60 nt were predicted by RiboCode in their dataset34. We argue that the parameter for ORF size selection (-m) in the previous version of RiboCode was not properly set. We have changed the default value of this argument to 5 in the updated RiboCode.
RiboCode reports the identified ORFs in two files: “RiboCode_ORFs_result.txt” containing all ORFs, including redundant ORFs from different transcripts of the same gene; “RiboCode_ORFs_result_collapsed.txt” (Supplemental File 2) integrating the overlapping ORFs with the same stop codon but different start codons, i.e., the one harboring the most upstream start codon in the same reading frame will be retained. In both files, the detected ORFs are classified into either “novel” translating ORFs or other different types according to their relative locations to known CDS (see a detailed explanation of ORF types from RiboCode paper22 or at RiboCode website35). We illustrated how to interpret the RiboCode outputs using a predicted uORF of gene ATF4 as an example (Supplemental Information). RiboCode also counts the number of genes containing different types of ORFs and plots them along with their percentages (Figure 2).
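The collapsing rule described above (among ORFs sharing a stop codon, retain the one with the most upstream start codon) can be sketched in Python. This is an illustrative sketch, not the code that produces “RiboCode_ORFs_result_collapsed.txt”; the tuple layout and the function name `collapse_orfs` are assumptions, and the sketch works per transcript only.

```python
def collapse_orfs(orfs):
    """Keep, for each (transcript, stop) pair, the ORF with the most upstream start.

    orfs: iterable of (transcript_id, start, stop) tuples in transcript coordinates.
    ORFs that share a stop codon and have lengths divisible by 3 are in the same frame.
    """
    best = {}
    for tx, start, stop in orfs:
        key = (tx, stop)
        if key not in best or start < best[key][1]:
            best[key] = (tx, start, stop)
    return sorted(best.values())


predicted = [
    ("tx1", 30, 90),    # same stop as the next entry, downstream start
    ("tx1", 12, 90),    # most upstream start: retained
    ("tx1", 100, 160),
    ("tx2", 30, 90),    # different transcript: kept separately
]
print(collapse_orfs(predicted))
# [('tx1', 12, 90), ('tx1', 100, 160), ('tx2', 30, 90)]
```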
A study reported that some expressed but translationally quiescent genes can be activated to translate into peptides upon oxidative stress12, indicating there are probably other ORFs that might be only translated in a condition-dependent manner. RiboCode can be performed for different experimental conditions separately (e.g., si-Ctrl or si-eIF3e) or jointly, as demonstrated in this protocol (steps 5.4 and 6.1). Multiplexing multiple samples into a single run by defining the lengths and P-site positions of selected reads in “merged_config.txt” has several advantages over processing each sample individually. First, it reduces the biases present in a single sample; second, it saves the program running time; lastly, it provides enough data to carry out the statistics. Thus, it theoretically works better than the single-sample mode, especially for the samples with low sequencing coverage and high background noise. Further quantification and comparison of numbers of RPFs assigned to predicted ORFs between different conditions (e.g., si-eIF3e vs. si-Ctrl) allow us to discover context-dependent ORFs or explore the translational regulation of the ORFs.
Note that due to the accumulation of ribosomes at the beginning and end of ORFs, a phenomenon called “translation ramp,” the RPFs assigned to the first 15 codons and the last 5 codons should be excluded from read counting to avoid biasing the analysis of differential ORF translation toward differences in initiation rates3,5,36. The results presented here suggested that the RPF abundance on uORFs is higher in EIF3E-deficient cells than in control cells, which might be caused, at least partially, by elevated levels of actively translating ribosomes. The metagene analysis of RPF densities around the start codons also suggested that early translation elongation is regulated by EIF3E. Note that simply counting the RPF reads in an ORF is not accurate for translation quantification, especially when translation elongation is severely blocked.
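The effect of excluding the first 15 and last 5 codons on an RPKM value can be illustrated with a short Python sketch. It is not the counting code used in the protocol; the function `trimmed_rpkm`, its parameter names, and the half-open coordinate convention are assumptions for this example, and the ORF length used for normalization is taken to be the trimmed length.

```python
def trimmed_rpkm(psites, orf_start, orf_end, total_mapped_reads,
                 skip_first_codons=15, skip_last_codons=5):
    """RPKM over the ORF interior, excluding codons near the start and the stop.

    orf_start/orf_end are transcript coordinates (end-exclusive).
    Returns None if nothing remains after trimming or no reads were mapped.
    """
    inner_start = orf_start + 3 * skip_first_codons
    inner_end = orf_end - 3 * skip_last_codons
    length_nt = inner_end - inner_start
    if length_nt <= 0 or total_mapped_reads <= 0:
        return None
    count = sum(inner_start <= p < inner_end for p in psites)
    # Reads per kilobase of (trimmed) ORF per million mapped reads.
    return count * 1e9 / (length_nt * total_mapped_reads)


# Toy ORF of 120 codons (positions 0-359); the counted interior is 45-344.
example_psites = [10, 50, 60, 200, 350]
print(trimmed_rpkm(example_psites, 0, 360, total_mapped_reads=1_000_000))  # 10.0
```

Short ORFs can lose their entire length under these defaults (the function then returns None), so the trimming would need to be relaxed for uORFs and other micropeptide-encoding ORFs.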
In summary, this protocol shows that RiboCode could be easily applied to identify novel translated ORFs of any size, including those encoding micropeptides. It would be a valuable tool for the research community to discover various types of ORFs in different physiological contexts or experimental conditions. Further validation of the protein or peptide products from these ORFs would be useful for the development of future applications of ribosome profiling.
The authors have nothing to disclose.
The authors would like to acknowledge the support from the computational resources provided by the HPCC platform of Xi'an Jiaotong University. Z.X. gratefully thanks the Young Topnotch Talent Support Plan of Xi'an Jiaotong University.
Name | Company | Catalog Number | Comments |
A computer/server running Linux | Any | – | – |
Anaconda or Miniconda | Anaconda | – | Anaconda: https://www.anaconda.com; Miniconda: https://docs.conda.io/en/latest/miniconda.html |
R | R Foundation | – | https://www.r-project.org/ |
RStudio | RStudio | – | https://www.rstudio.com/ |