<em>De novo</em> Identification of Actively Translated Open Reading Frames with Ribosome Profiling Data

Yanan Zhu; Fajin Li; Xuerui Yang; Zhengtao Xiao

doi:10.3791/63366

Published: February 18, 2022

Yanan Zhu*¹, Fajin Li*²,³, Xuerui Yang³, Zhengtao Xiao

¹School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, ²MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, ³Joint Graduate Program of Peking-Tsinghua-National Institute of Biological Science

Summary

Translating ribosomes decode three nucleotides per codon into peptides. Their movement along mRNA, captured by ribosome profiling, produces the footprints exhibiting characteristic triplet periodicity. This protocol describes how to use RiboCode to decipher this prominent feature from ribosome profiling data to identify actively translated open reading frames at the whole-transcriptome level.

Abstract

Identification of open reading frames (ORFs), especially those encoding small peptides and being actively translated under specific physiological contexts, is critical for comprehensive annotations of context-dependent translatomes. Ribosome profiling, a technique for detecting the binding locations and densities of translating ribosomes on RNA, offers an avenue to rapidly discover where translation is occurring at the genome-wide scale. However, efficiently and comprehensively identifying translating ORFs from ribosome profiling data is not a trivial bioinformatics task. Described here is an easy-to-use package, named RiboCode, designed to search for actively translating ORFs of any size from distorted and ambiguous signals in ribosome profiling data. Taking our previously published dataset as an example, this article provides step-by-step instructions for the entire RiboCode pipeline, from preprocessing of the raw data to interpretation of the final output files. Furthermore, for evaluating the translation rates of the annotated ORFs, procedures for visualization and quantification of ribosome densities on each ORF are also described in detail. In summary, the present article is a useful and timely instruction for the research fields related to translation, small ORFs, and peptides.

Introduction

Recently, a growing body of studies has revealed widespread production of peptides translated from ORFs of coding genes and of genes previously annotated as noncoding, such as long noncoding RNAs (lncRNAs)¹,²,³,⁴,⁵,⁶,⁷,⁸. These translated ORFs are regulated or induced by cells to respond to environmental changes, stress, and cell differentiation¹,⁸,⁹,¹⁰,¹¹,¹²,¹³. The translation products of some ORFs have been demonstrated to play important regulatory roles in diverse biological processes in development and physiology. For example, Chng et al.¹⁴ discovered a peptide hormone named Elabela (Ela, also known as Apela/Ende/Toddler), which is critical for cardiovascular development. Pauli et al. suggested that Ela also acts as a mitogen that promotes cell migration in the early fish embryo¹⁵. Magny et al. reported two micropeptides of less than 30 amino acids regulating calcium transport and affecting regular muscle contraction in the Drosophila heart¹⁰.

It remains unclear how many such peptides are encoded by the genome and whether they are biologically relevant. Therefore, systematic identification of these potentially coding ORFs is highly desirable. However, directly determining the products of these ORFs (i.e., protein or peptide) using traditional approaches such as evolutionary conservation¹⁶,¹⁷ and mass spectrometry¹⁸,¹⁹ is challenging because the detection efficiency of both approaches is dependent on the length, abundance, and amino acid composition of the produced proteins or peptides. The advent of ribosome profiling, a technique for identifying the ribosome occupancy on mRNAs at nucleotide resolution, has provided a precise way to evaluate the coding potential of different transcripts³,²⁰,²¹, irrespective of their length and composition. An important and frequently used feature for identifying actively translating ORFs using ribosome profiling is the three-nucleotide (3-nt) periodicity of the ribosome's footprints on mRNA from the start codon to the stop codon. However, ribosome profiling data often have several issues, including low and sparse sequencing reads along ORFs, high sequencing noise, and ribosomal RNA (rRNA) contamination. Thus, the distorted and ambiguous signals generated by such data weaken the 3-nt periodicity patterns of ribosomes' footprints on mRNA, which ultimately makes the identification of high-confidence translated ORFs difficult.

A package named "RiboCode" applies a modified Wilcoxon signed-rank test and a P-value integration strategy to examine whether an ORF has significantly more in-frame ribosome-protected fragments (RPFs) than off-frame RPFs²². It was demonstrated to be highly efficient, sensitive, and accurate for de novo annotation of the translatome in simulated and real ribosome profiling data. Here, we describe how to use this tool to detect potentially translating ORFs from the raw ribosome profiling sequencing datasets generated by a previous study²³. These datasets had been used to explore the function of EIF3 subunit "E" (EIF3E) in translation by comparing the ribosome occupancy profiles of MCF-10A cells transfected with control (si-Ctrl) and EIF3E (si-eIF3e) small-interfering RNAs (siRNAs). By applying RiboCode to these example datasets, we detected 5,633 novel ORFs potentially encoding small peptides or proteins. These ORFs were categorized into various types based on their locations relative to the coding regions, including upstream ORFs (uORFs), downstream ORFs (dORFs), overlapped ORFs, ORFs from novel protein-coding genes (novel PCGs), and ORFs from novel nonprotein-coding genes (novel NonPCGs). The RPF read densities on uORFs were significantly increased in EIF3E-deficient cells compared to control cells, which might be at least partially caused by the enrichment of actively translating ribosomes. The localized ribosome accumulation in the region from the 25th to the 75th codon in EIF3E-deficient cells indicated a blockage of translation elongation in the early stage. This protocol also shows how to visualize the RPF density of the desired region for examining the 3-nt periodicity patterns of ribosome footprints on identified ORFs. These analyses demonstrate the powerful role of RiboCode in identifying translating ORFs and studying the regulation of translation.

Protocol

1. Environment setup and RiboCode installation

Open a Linux terminal window and create a conda environment:
conda create -n RiboCode python=3.8
Switch to the created environment and install RiboCode and dependencies:
conda activate RiboCode
conda install -c bioconda ribocode ribominer sra-tools fastx_toolkit cutadapt bowtie star samtools

2. Data preparation

Get genome reference files.
1. For the reference sequence, go to the Ensembl website at https://www.ensembl.org/index.html, click the top menu Download and left-side menu FTP Download. In the presented table, click FASTA in the column DNA (FASTA) and the row where Species is Human. In the opened page, copy the link of Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, then download and unzip it in the terminal:
  wget -c http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
  gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
2. For reference annotation, right-click GTF in the column Gene sets in the last-opened web page. Copy the link of Homo_sapiens.GRCh38.104.gtf.gz and download it using:
  wget -c http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz
  gzip -d Homo_sapiens.GRCh38.104.gtf.gz
  NOTE: It is recommended to get the GTF file from the Ensembl website as it contains genome annotations organized in a three-level hierarchy, i.e., each gene contains transcripts that contain exons and optional translations (e.g., coding sequences [CDS], translation start site, translation end site). When a gene's or transcript's annotations are missing, for instance in a GTF file obtained from UCSC or NCBI, use GTFupdate to generate an updated GTF with complete parent-child hierarchy annotations: GTFupdate original.gtf > updated.gtf. For an annotation file in the .gff format, use the AGAT toolkit²⁴ or any other tool to convert it to the .gtf format.
Get rRNA sequences.
1. Open UCSC Genome Browser at https://genome.ucsc.edu and click Tools | Table Browser in the dropdown list.
2. On the opened page, specify Mammal for clade, Human for genome, All Tables for group, rmsk for table, and genome for region. For filter, click Create to go to a new page and set repClass as does match rRNA.
3. Click Submit and then set the output format to sequence and output filename as hg38_rRNA.fa. Finally, click Get output | Get sequence to retrieve the rRNA sequence.
Get ribosome profiling datasets from Sequence Read Archive (SRA).
1. Download the replicate samples of the si-eIF3e treatment group and rename them:
  fastq-dump SRR9047190 SRR9047191 SRR9047192
  mv SRR9047190.fastq si-eIF3e-1.fastq
  mv SRR9047191.fastq si-eIF3e-2.fastq
  mv SRR9047192.fastq si-eIF3e-3.fastq
2. Download the replicate samples of the control group and rename them:
  fastq-dump SRR9047193 SRR9047194 SRR9047195
  mv SRR9047193.fastq si-Ctrl-1.fastq
  mv SRR9047194.fastq si-Ctrl-2.fastq
  mv SRR9047195.fastq si-Ctrl-3.fastq
  NOTE: The SRA accession IDs for these example datasets were obtained from Gene Expression Omnibus (GEO) website²⁵ by searching for GSE131074.

3. Trim adapters and remove rRNA contamination

(Optional) Remove adapters from the sequencing data. Skip this step if the adapter sequences have already been trimmed, as in this case. Otherwise, use cutadapt to trim the adapters from reads.
for i in si-Ctrl-1 si-Ctrl-2 si-Ctrl-3 si-eIF3e-1 si-eIF3e-2 si-eIF3e-3
do
cutadapt -m 15 --match-read-wildcards -a CTGTAGGCACCATCAAT \
-o ${i}_trimmed.fastq ${i}.fastq
done
NOTE: The adapter sequence after the -a parameter depends on the cDNA library preparation. Reads shorter than 15 nt (given by -m) are discarded because ribosome-protected fragments are usually longer than this.
Remove rRNA contamination using the following steps:
1. Index rRNA reference sequences:
  bowtie-build -f hg38_rRNA.fa hg38_rRNA
2. Align the reads to rRNA reference to rule out the reads originating from rRNA:
  for i in si-Ctrl-1 si-Ctrl-2 si-Ctrl-3 si-eIF3e-1 si-eIF3e-2 si-eIF3e-3
  do
  bowtie -n 0 -y -a --norc --best --strata -S -p 4 -l 15 \
  --un=./${i}_noncontam.fastq hg38_rRNA -q ${i}.fastq ${i}.aln
  done
  NOTE: -p specifies the number of threads used to run the tasks in parallel. Because RPF reads are relatively short, other arguments (e.g., -n, -y, -a, --norc, --best, --strata, and -l) should be specified to guarantee that the reported alignments are the best ones. For more details, refer to the Bowtie website²⁶.

4. Align the clean reads to the genome

Create a genome index.
mkdir STAR_hg38_genome
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir ./STAR_hg38_genome --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh38.104.gtf
Align the clean reads (no rRNA contamination) to the created reference.
for i in si-Ctrl-1 si-Ctrl-2 si-Ctrl-3 si-eIF3e-1 si-eIF3e-2 si-eIF3e-3
do
STAR --runThreadN 8 --outFilterType Normal --outWigType wiggle --outWigStrand Stranded --outWigNorm RPM --outFilterMismatchNmax 1 --outFilterMultimapNmax 1 --genomeDir STAR_hg38_genome --readFilesIn ${i}_noncontam.fastq --outFileNamePrefix ${i}. --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts --outSAMattributes All
done
NOTE: An untemplated nucleotide is frequently added to the 5' end of each read by the reverse transcriptase²⁷, which will be efficiently trimmed off by STAR as it performs soft-clipping by default. The parameters for STAR are described in STAR manual²⁸.
Sort and index alignment files.
for i in si-Ctrl-1 si-Ctrl-2 si-Ctrl-3 si-eIF3e-1 si-eIF3e-2 si-eIF3e-3
do
samtools sort -T ${i}.Aligned.toTranscriptome.out.sorted \
-o ${i}.Aligned.toTranscriptome.out.sorted.bam \
${i}.Aligned.toTranscriptome.out.bam
samtools index ${i}.Aligned.toTranscriptome.out.sorted.bam
samtools index ${i}.Aligned.sortedByCoord.out.bam
done

5. Size selection of RPFs and identification of their P-sites

Prepare the transcript annotations.
prepare_transcripts -g Homo_sapiens.GRCh38.104.gtf \
-f Homo_sapiens.GRCh38.dna.primary_assembly.fa -o RiboCode_annot
NOTE: This command collects required information of mRNA transcripts from the GTF file and extracts the sequences for all mRNA transcripts from the FASTA file (each transcript is assembled by merging the exons according to the structures defined in the GTF file).
Select RPFs of specific lengths and identify their P-site positions.
for i in si-Ctrl-1 si-Ctrl-2 si-Ctrl-3 si-eIF3e-1 si-eIF3e-2 si-eIF3e-3
do
metaplots -a RiboCode_annot -r ${i}.Aligned.toTranscriptome.out.bam \
-o ${i} -f0_percent 0.35 -pv1 0.001 -pv2 0.001
done
NOTE: This command plots the aggregate profiles of the 5' end of the aligned reads of each length around annotated translation start (or stop) codons. The read length-dependent P-site can be manually determined by examining the distribution plots (e.g., Figure 1B) of offset distances between 5' ends of the major reads and the start codon. RiboCode also generates a configuration file for each sample, in which the P-site positions of reads displaying significant 3-nt periodicity patterns are automatically determined. The parameters -f0_percent, -pv1, and -pv2 define the proportion threshold and p-value cutoffs for selecting the RPF reads enriched in the reading frame. In this example, the +12, +13, and +13 nucleotides from the 5' end of the 29, 30, and 31 nt reads are manually defined in each configuration file.
Edit the configuration files for each sample and merge them.
NOTE: To generate a consensus set of unique ORFs and ensure sufficient coverage of reads to perform subsequent analysis, the selected reads of all samples in the previous step are merged. The reads of specific lengths defined in merged_config.txt file (Supplemental File 1) and their P-site information are used for evaluating the translation potential of ORFs in the next step.
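
As an illustration of how read length-specific P-site offsets translate into reading-frame assignments, the following Python sketch shifts the 5' end of each read by its offset and reports the fraction of P-sites falling in each frame relative to the start codon. It is not RiboCode code: the function name, the 0-based coordinate convention, the reading of an offset of 12 as "the P-site begins 12 nt downstream of the read's 5' end", and the toy reads are all assumptions made for illustration. Only the offsets (+12, +13, and +13 for 29, 30, and 31 nt reads) come from this example.

```python
from collections import Counter

# Read length -> P-site offset from the 5' end (values used in this protocol's example).
PSITE_OFFSETS = {29: 12, 30: 13, 31: 13}

def frame_fractions(reads, cds_start, offsets=PSITE_OFFSETS):
    """Hypothetical helper, not part of RiboCode.

    reads: iterable of (five_prime_pos, read_length), 0-based transcript coordinates.
    cds_start: 0-based position of the first nucleotide of the start codon.
    Returns ({frame: fraction}, number_of_reads_used).
    """
    counts = Counter()
    for five_prime, length in reads:
        if length not in offsets:
            continue  # read lengths that were not selected are ignored
        psite = five_prime + offsets[length]
        counts[(psite - cds_start) % 3] += 1  # frame 0 = first nucleotide of a codon
    total = sum(counts.values())
    if total == 0:
        return {0: 0.0, 1: 0.0, 2: 0.0}, 0
    return {frame: counts[frame] / total for frame in (0, 1, 2)}, total

# Toy reads (made up): the start codon begins at transcript position 100.
toy_reads = [(88, 29), (91, 29), (90, 30), (93, 31), (95, 29), (60, 27)]
fractions, used = frame_fractions(toy_reads, cds_start=100)
print(fractions, used)
```

In this toy case, four of the five usable P-sites land in frame 0 (the 27 nt read is skipped), so the in-frame fraction is 0.8, which would exceed a proportion cutoff such as the 0.35 passed to -f0_percent.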

6. De novo annotate translating ORFs

Run RiboCode.
RiboCode -a RiboCode_annot -c merged_config.txt -l yes -g \
-o RiboCode_ORFs_result -s ATG -m 5 -A CTG,GTG,TTG
Where the important parameters of this command are as follows:
-c, configuration file containing the path of input files and the information of selected reads and their P-sites.
-l, for transcripts having multiple start codons upstream of the stop codons, whether the longest ORFs (the region from the most distal start codon to stop codon) are used for evaluating their translation potential. If set to no, the starting codons will be automatically determined.
-s, the canonical start codon(s) used for ORFs identification.
-A, (optionally) the noncanonical start codons (e.g., CTG, GTG, and TTG for human) used for ORF identification, which may differ in mitochondria or nucleus of other species²⁹.
-m, the minimum length (i.e., amino acids) of ORFs.
-o, the prefix of output filename containing the details of predicted ORFs (Supplemental File 2).
-g and -b, output the predicted ORFs to gtf or bed format, respectively.
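
To show what the start-codon and length parameters control, the following Python sketch enumerates candidate ORFs on a transcript sequence from a start codon to the next in-frame stop codon. It is a simplified stand-in, not RiboCode's implementation: the function and argument names are hypothetical, the minimum length is assumed to count codons from the start codon up to (but excluding) the stop codon, and with longest=False the sketch simply returns every candidate start, whereas RiboCode chooses among them using the RPF data.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, starts=("ATG",), alt_starts=(), min_aa=5, longest=True):
    """Hypothetical ORF enumerator (illustration only).

    Returns sorted (start, end, start_codon) tuples in 0-based coordinates;
    end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    allowed = set(starts) | set(alt_starts)
    orfs = []
    for frame in range(3):
        candidates = []  # start codons seen since the last in-frame stop codon
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                # longest=True keeps only the most upstream start (cf. -l yes)
                chosen = candidates[:1] if longest else candidates
                for start in chosen:
                    if (pos - start) // 3 >= min_aa:  # cf. -m
                        orfs.append((start, pos + 3, seq[start:start + 3]))
                candidates = []
            elif codon in allowed:
                candidates.append(pos)
    return sorted(orfs)

# Toy transcript (made up): two in-frame ATG codons upstream of one TAA stop codon.
toy = "GG" + "ATG" + "AAA" * 2 + "ATG" + "CCC" * 4 + "TAA" + "GG"
print(find_orfs(toy))                 # only the most upstream start is kept
print(find_orfs(toy, longest=False))  # both candidate starts are reported
```

Passing alt_starts=("CTG", "GTG", "TTG") would mimic the -A setting used above.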

7. (Optional) ORF quantification and statistics

Count RPF reads in each ORF.
for i in si-Ctrl-1 si-Ctrl-2 si-Ctrl-3 si-eIF3e-1 si-eIF3e-2 si-eIF3e-3
do
ORFcount -g RiboCode_ORFs_result_collapsed.gtf \
-r ${i}.Aligned.sortedByCoord.out.bam -f 15 -l 5 -m 25 -M 35 \
-o ${i}_ORF.counts -s yes -c intersection-strict
done
NOTE: To exclude ribosomes potentially accumulating around the start and end of ORFs, reads allocated to the first 15 codons (specified by -f) and the last 5 codons (specified by -l) are not counted. Optionally, the lengths of counted RPFs are restricted to the range from 25 to 35 nt (common sizes of RPFs).
Calculate basic statistics of the detected ORFs using RiboCode:
Rscript RiboCode_utils.R
NOTE: RiboCode_utils.R (Supplemental File 3) provides a series of statistics for the RiboCode output, e.g., counting the number of identified ORFs, viewing the distribution of ORF lengths, and calculating the normalized RPF densities (i.e., RPKM, reads per kilobase per million mapped reads).
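
For readers unfamiliar with the RPKM normalization mentioned above, a minimal Python sketch of the arithmetic follows. It is not taken from RiboCode_utils.R: the function names are hypothetical, and whether the script divides by the full ORF length or by the length remaining after the first 15 and last 5 codons are excluded is not stated here, so the length is passed explicitly.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase per million mapped reads."""
    if length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (length_nt * total_mapped_reads)

def counted_length_nt(orf_length_nt, skip_first_codons=15, skip_last_codons=5):
    """Length left after dropping codons at both ends (assumed to mirror -f 15 -l 5)."""
    return max(orf_length_nt - 3 * (skip_first_codons + skip_last_codons), 0)

# Toy numbers (made up): a 660 nt ORF with 120 counted reads in a 20-million-read library.
length = counted_length_nt(660)
print(length, rpkm(120, length, 20_000_000))
```

Here the counted region is 600 nt and the resulting density is 10 RPKM. An ORF of 20 codons or fewer has no counted region under these settings, which is one reason to adjust -f and -l when quantifying very short ORFs.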

8. (Optional) Visualization of the predicted ORFs

Obtain the relative positions of the start and stop codons for the desired ORF (e.g., ENSG00000100902_35292349_35292552_67) on its transcript from RiboCode_ORFs_result_collapsed.txt (Supplemental File 2). Then, plot the density of RPF reads in the ORF:
plot_orf_density -a RiboCode_annot -c merged_config.txt -t ENST00000622405 \
-s 33 -e 236 --start-codon ATG -o ENSG00000100902_35292349_35292552_67
Where -s and -e specify the translation start and stop positions of the ORF to plot; --start-codon defines the start codon of the ORF, which will appear in the figure title; -o defines the prefix of the output file name.

9. (Optional) Metagene analysis using RiboMiner

NOTE: Perform the metagene analysis to assess the influence of EIF3E knockdown on the translation of identified annotated ORFs, following the steps below:

Generate transcript annotations for RiboMiner, which extracts the longest transcript for each gene based on the annotation file generated by RiboCode (step 5.1).
OutputTranscriptInfo -c RiboCode_annot/transcripts_cds.txt \
-g Homo_sapiens.GRCh38.104.gtf -f RiboCode_annot/transcripts_sequence.fa \
-o longest.transcripts.info.txt -O all.transcripts.info.txt
Prepare the configuration file for RiboMiner. Copy the configuration file generated by the metaplots command of RiboCode (step 5.4) and rename it "RiboMiner_config.txt." Then, modify it according to the format shown in Supplemental file 4.
Metagene analyses using RiboMiner
1. Use MetageneAnalysis to generate an aggregate and averaged profile of RPFs' densities across transcripts.
  MetageneAnalysis -f RiboMiner_config.txt -c longest.transcripts.info.txt \
  -o MA_normed -U codon -M RPKM -u 100 -d 400 -l 100 -n 10 -m 1 -e 5 --norm yes \
  -y 100 --type UTR
  Where the important parameters are: --type, whether to analyze CDS or UTR regions; --norm, whether to normalize the read density; -y, the number of codons used for each transcript; -U, whether to plot the RPF density at the codon or nt level; -u and -d, the range of the analyzed region relative to the start or stop codon; -l, the minimum length (i.e., the number of codons) of the CDS; -M, the mode for transcript filtering, either counts or RPKM; -n, the minimum counts or RPKM in the CDS for analysis; -m, the minimum counts or RPKM of the CDS in the normalized region; -e, the number of codons excluded from the normalized region.
2. Generate a set of PDF files for comparing the ribosome occupancies on mRNA in control cells and EIF3E-deficient cells.
  PlotMetageneAnalysis -i MA_normed_dataframe.txt -o MA_normed \
  -g si-Ctrl,si-eIF3e -r si-Ctrl-1,si-Ctrl-2,si-Ctrl-3__si-eIF3e-1,si-eIF3e-2,si-eIF3e-3 -u 100 -d 400 --mode mean
  NOTE: PlotMetageneAnalysis generates a set of PDF files. Details about the usage of MetageneAnalysis and PlotMetageneAnalysis are available at the RiboMiner website³⁰.
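
The idea behind an aggregate, averaged profile can be sketched in a few lines of Python. This is an illustration under stated assumptions, not RiboMiner's code: the function name is hypothetical, profiles are assumed to be per-codon densities already aligned at the start codon, and normalization is assumed to mean dividing each transcript by its own mean density, which may differ from what --norm does.

```python
def metagene_profile(profiles, n_codons, normalize=True):
    """Average the first n_codons of per-codon density profiles across transcripts.

    profiles: list of lists, one per transcript, aligned at the start codon.
    Transcripts shorter than n_codons, or with zero mean density when
    normalizing, are skipped.
    """
    kept = []
    for density in profiles:
        if len(density) < n_codons:
            continue
        window = density[:n_codons]
        if normalize:
            mean = sum(density) / len(density)
            if mean == 0:
                continue
            window = [value / mean for value in window]
        kept.append(window)
    if not kept:
        return []
    # Position-wise mean across the retained transcripts.
    return [sum(column) / len(kept) for column in zip(*kept)]

# Toy profiles (made up); the third transcript is too short and is dropped.
toy_profiles = [[2, 4, 6, 4], [10, 10, 30, 30], [1, 1]]
print(metagene_profile(toy_profiles, n_codons=3))
print(metagene_profile(toy_profiles, n_codons=3, normalize=False))
```

Normalizing first keeps a few highly expressed transcripts from dominating the average, which is why the two printed profiles differ in shape.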

Representative Results

The example ribosome profiling datasets were deposited in the GEO database under the accession number GSE131074. All the files and codes used in this protocol are available from Supplemental files 1–4. By applying RiboCode to a set of published ribosome profiling datasets²³, we identified the novel ORFs actively translated in MCF-10A cells treated with control and EIF3E siRNAs. To select the RPF reads that are most likely bound by the translating ribosomes, the lengths of the sequencing reads were examined, and a metagene analysis was performed using the RPFs that mapped to known protein-coding genes. The frequency distribution of the read lengths showed that most RPFs were 25-35 nt (Figure 1A), corresponding to the nucleotide sequence covered by a ribosome, as expected. The P-site locations for different lengths of RPFs were determined by examining the distances from their 5' ends to the annotated start and stop codons, respectively (Figure 1B). The RPF reads of 28-32 nt displayed strong 3-nt periodicity, and their P-sites were at the +12 nt position (Supplemental file 1).

RiboCode searches for the candidate ORFs from a canonical start codon (AUG) or alternative start codons (optional, e.g., CUG and GUG) to the next stop codon. Then, based on the mapping results of RPFs within the defined range, RiboCode assesses the 3-nt periodicity by evaluating whether the number of in-frame RPFs (i.e., their P-sites allocated on the first nucleotide of each codon) is greater than the number of out-of-frame RPFs (i.e., their P-sites allocated on the second or third nucleotide of each codon). We identified 13,120 genes harboring potentially translated ORFs with p < 0.05: 10,394 genes (70.8%) encoding annotated ORFs, 168 genes (1.1%) encoding dORFs, 509 genes (3.5%) encoding uORFs, 939 genes (6.4%) encoding upstream or downstream ORFs overlapping known annotated ORFs (Overlapped), 68 protein-coding genes (0.5%) encoding novel ORFs, and 2,601 genes (17.7%) previously assigned as noncoding encoding novel ORFs (Figure 2 and Supplemental file 3). The six category counts sum to 14,679, more than 13,120, because a gene can harbor more than one ORF type; the percentages are fractions of this sum.
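
The in-frame versus out-of-frame comparison can be sketched in Python as follows. This is a textbook stand-in, not RiboCode's modified test: it uses a one-sided Wilcoxon signed-rank test with a normal approximation (no tie or continuity correction) to ask whether frame-0 counts exceed frame-1 and frame-2 counts codon by codon, then combines the two p-values with Fisher's method, which assumes the two tests are independent. The function names and toy counts are hypothetical, and the combination rule RiboCode actually applies may differ.

```python
import math

def wilcoxon_greater(x, y):
    """One-sided Wilcoxon signed-rank p-value for H1: x > y (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to ties in |difference|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return 0.5 * math.erfc((w_plus - mean) / sd / math.sqrt(2))

def fisher_combine(p1, p2):
    """Fisher's method for two independent p-values (chi-square, 4 d.f., closed form)."""
    product = p1 * p2
    return 0.0 if product == 0 else product * (1 - math.log(product))

def periodicity_pvalue(psite_counts):
    """psite_counts: P-site counts per nucleotide, starting at the start codon."""
    f0, f1, f2 = psite_counts[0::3], psite_counts[1::3], psite_counts[2::3]
    return fisher_combine(wilcoxon_greater(f0, f1), wilcoxon_greater(f0, f2))

# Toy P-site counts for 8 codons (made up): frame 0 dominates in every codon.
periodic = [9, 1, 0, 8, 2, 1, 10, 0, 2, 7, 1, 1, 12, 3, 0, 9, 0, 1, 11, 2, 2, 8, 1, 0]
shifted = periodic[1:] + periodic[:1]  # same counts, moved out of frame
flat = [3, 3, 3] * 8                   # no frame preference at all
for name, counts in (("periodic", periodic), ("shifted", shifted), ("flat", flat)):
    print(name, periodicity_pvalue(counts))
```

Only the periodic profile yields a small combined p-value; moving the same counts out of frame, or removing the frame preference, does not.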

Comparing the sizes of different ORFs showed that uORFs and overlapped ORFs are shorter (195 and 188 nt on average, respectively) than annotated ORFs (~1,771 nt). The same trend was also observed for novel ORFs (670 and 385 nt on average for novel PCGs and novel NonPCGs, respectively) and dORFs (~671 nt) (Figure 3). Together, the noncanonical (unannotated) ORFs identified by RiboCode tended to encode peptides smaller than those encoded by known annotated ORFs.

Relative RPF counts were calculated for each ORF to assess the function of EIF3 in the processes of translation. The results suggested that the ribosome densities of uORFs were significantly higher in EIF3E-deficient cells than in control cells (Figure 4). As many uORFs were reported to exert inhibitory effects on the translation of downstream coding ORFs, we further examined whether the EIF3E knockdown alters the global densities of RPFs downstream of the start codons (Figure 5). The metagene analysis, in which many ORFs' profiles were aligned and then averaged, revealed that a mass of ribosomes stalled between codons 25 and 75 downstream of the start codon, suggesting that translation elongation might be blocked early in EIF3E-deficient cells. Further investigations are warranted to examine whether the signal-to-noise ratio or changes in the translation efficiency of ORFs contribute to the increase in uORF RPKM and the accumulation of ribosomes between codons 25 and 75 in the absence of EIF3E, that is, whether 1) lower contamination (or better library quality) or 2) active translation (or ribosome pausing) in the samples without EIF3E results in more reads in uORFs and in the region between the 25th and 75th codons.

Finally, RiboCode also provides visualization of the densities of the P-sites of RPFs on a desired ORF, which can help users examine the 3-nt periodicity patterns and densities of RPFs. For example, Figure 6 presents the RPF densities on a uORF of PSMA6 and a dORF of SENP3-EIF4A1; both were validated by published proteomics data²³ (data not shown).

Figure 1: Assessment of sequencing reads and the P-site positions. (A) Length distribution of ribosome-protected fragments (RPFs) in EIF3E-deficient cells in replicate 1 (si-eIF3e-1); (B) Inferring the P-site position of RPFs of 29 nt based on their densities around the known start (top) and stop codons (bottom).

Figure 2: Percentages of genes harboring different types of ORFs identified by RiboCode using all samples together. Abbreviations: ORF = open reading frame; dORF = downstream ORF; PCG = protein-coding gene; NonPCG = nonprotein-coding gene; uORF = upstream ORF.

Figure 3: Length distributions of different ORF types. Abbreviations: ORF = open reading frame; dORF = downstream ORF; PCG = protein-coding gene; NonPCG = nonprotein-coding gene; uORF = upstream ORF; nt = nucleotide.

Figure 4: Comparison of normalized read counts for different ORF types between control and EIF3E-deficient cells. p-values were determined by the Wilcoxon signed-rank test. Abbreviations: ORF = open reading frame; dORF = downstream ORF; PCG = protein-coding gene; NonPCG = nonprotein-coding gene; uORF = upstream ORF; RPKM = reads per kilobase per million mapped reads; siRNA = small-interfering RNA; si-Ctrl = control siRNA; si-eIF3e = siRNA targeting EIF3E.

Figure 5: Metagene analysis showing the stalling of ribosomes between the 25th and 75th codons downstream of the start codon of annotated ORFs. Abbreviations: ORF = open reading frame; siRNA = small-interfering RNA; si-Ctrl = control siRNA; si-eIF3e = siRNA targeting EIF3E; A.U. = arbitrary units.

Figure 6: P-site density profiles of example ORFs encoding micropeptides. (A) P-site densities of the predicted uORF and its position relative to the annotated CDS on transcript ENST00000622405; (B) same as in A but for the predicted dORF on transcript ENST00000614237. The bottom panel shows an enlarged view of the predicted uORF (A) or dORF (B). Red bars = in-frame reads; green and blue bars = off-frame reads. Abbreviations: ORF = open reading frame; dORF = downstream ORF; uORF = upstream ORF; CDS = coding sequences.

Supplemental Information: Evaluation of the dependence between two p-values and explanation of RiboCode results (uORF of ATF4 as an example).

Supplemental File 1: The configuration file for RiboCode defining the selected lengths of RPFs and P-site positions.

Supplemental File 2: RiboCode output file containing the information of predicted ORFs.

Supplemental File 3: R script file for performing basic statistics of RiboCode output.

Supplemental File 4: The configuration file (for RiboMiner) modified from Supplemental File 1.

Discussion

Ribosome profiling offers an unprecedented opportunity to study the ribosomes’ action in cells at a genome scale. Precisely deciphering the information carried by the ribosome profiling data could provide insight into which regions of genes or transcripts are actively translating. This step-by-step protocol provides guidance on how to use RiboCode to analyze ribosome profiling data in detail, including package installation, data preparation, command execution, result explanation, and data visualization. The analysis results of RiboCode indicated that translation is pervasive and occurs on unannotated ORFs of coding genes and many transcripts previously assumed to be noncoding. The downstream analyses provided evidence that the ribosomes move along the predicted ORFs in 3-nucleotide steps as translation occurs; however, it remains unclear whether the process of translation or the produced peptides serve any function. Nevertheless, accurate annotations of translating ORFs on the genome can give rise to exciting opportunities to identify the functions of previously uncharacterized transcripts³¹.

The prediction of coding potential for each ORF using ribosome profiling data relies heavily on the 3-nt periodicity of the P-site densities on each codon from the start to the stop codon of the ORF. Therefore, it requires precise detection of the P-site locations of reads of different lengths. Such information is not directly provided by ribosome profiling data but can be inferred from the distances between the 5′ end of RPFs and annotated start or stop codons (protocol step 5.3). Lacking annotations of known start/stop codons in the GTF file, as for newly assembled genomes, may cause RiboCode to fail to execute the downstream steps unless the exact P-site locations of the reads are determined by other means. In most cases, the size of ribosome-bound fragments and their P-site locations are constant, for example, 28-30 nt long and at +12 nt from the 5′ end of reads in human cells. RiboCode allows the selection of the reads in a specific range to define P-site positions based on experience. However, both the lengths of RPF reads and the positions of their P-sites might differ when the environmental conditions (e.g., stress or stimulus) or the experimental procedure (e.g., nuclease, buffer, library preparation, and sequencing) change. Therefore, we recommend running metaplots (protocol step 5.3) for each sample to extract the highest-confidence RPFs (i.e., reads displaying 3-nt periodicity patterns) and determine their P-site positions in different conditions. Although these operations can be done automatically using the metaplots function, often only a minority of reads showing near-perfect framing or phasing pass the rigorous selection criteria and statistical test. Therefore, it is still necessary to loosen certain parameters, especially “-f0_percent,” and then visually inspect the 3-nt periodicity of reads at each length and manually edit the configuration file to include more reads accordingly, especially when the library quality is poor (protocol step 5.3).

RiboCode searches for the candidate ORFs from canonical or noncanonical start codons (NUGs) to the next stop codon. For transcripts with multiple start codons upstream of the stop codon, the most likely start codon is determined by assessing the 3-nt periodicity of the RPF reads mapped between two neighboring start codons or simply by choosing the upstream start codon having more in-frame than off-frame RPF reads. A limitation of such a strategy is that the actual start codon might be misidentified if reads aligned to the start codon regions are sparse or absent. Fortunately, recent strategies, such as global translation initiation sequencing (GTI-seq)³² and quantitative translation initiation sequencing (QTI-seq)³³, provide more direct ways of locating the translation initiation sites. For NUGs, more studies are still required to investigate their validity as efficient start codons.

We also released a new update for RiboCode that adds three features: 1) it reports the other potential ORF types assigned according to their locations relative to transcripts other than the longest one; 2) it provides an option for adjusting the combined p-values if the tests of RPF reads in the two out-of-frame positions are not independent (see the more detailed explanation in Supplemental Information); 3) it performs p-value correction for multiple testing, allowing more stringent screening of translating ORFs.
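
As a reminder of what a multiple-testing correction does to a list of ORF p-values, the following Python sketch implements the Benjamini-Hochberg adjustment. This procedure is chosen purely as a familiar example: the text above does not say which correction RiboCode applies, and the function name and toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Toy p-values (made up) for four candidate ORFs.
toy_pvalues = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(toy_pvalues))
```

With these toy values, three candidates pass an unadjusted cutoff of 0.05, but only the first remains below 0.05 after adjustment, which illustrates how the correction makes screening more stringent.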

As RiboCode identifies the actively translating ORFs by evaluating the 3-nt periodicity of the RPF read densities, it has certain limitations for ORFs that are extremely short (e.g., less than 3 codons). Spealman et al. compared the performance of RiboCode with uORF-seqr and reported that no uORFs shorter than 60 nt were predicted by RiboCode in their dataset³⁴. We argue that the parameter for ORF size selection (-m) was not properly set in the previous version of RiboCode. We have changed the default value of this argument to 5 in the updated RiboCode.
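
One way to see why very short ORFs are hard to call from periodicity alone is to look at the smallest p-value a paired rank test can return. In the textbook exact one-sided Wilcoxon signed-rank test with n untied pairs, the most extreme outcome (every codon favoring frame 0) has probability 1/2^n. The Python sketch below works through that arithmetic; it describes the unmodified textbook test only and makes no claim about RiboCode's modified test or its p-value combination.

```python
def min_one_sided_p(n_codons):
    """Smallest p-value of an exact one-sided signed-rank test with n untied pairs."""
    return 0.5 ** n_codons

def min_codons_for(alpha):
    """Fewest codons for which that smallest p-value drops below alpha."""
    n = 1
    while min_one_sided_p(n) >= alpha:
        n += 1
    return n

for n in (3, 4, 5):
    print(n, min_one_sided_p(n))
print(min_codons_for(0.05))
```

Under these textbook assumptions, a 3-codon ORF can never reach p < 0.05 in a single frame comparison (its minimum is 0.125), whereas 5 codons can (0.03125).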

RiboCode reports the identified ORFs in two files: “RiboCode_ORFs_result.txt,” which contains all ORFs, including redundant ORFs from different transcripts of the same gene, and “RiboCode_ORFs_result_collapsed.txt” (Supplemental File 2), which integrates overlapping ORFs that share the same stop codon but have different start codons, i.e., only the ORF harboring the most upstream start codon in the same reading frame is retained. In both files, the detected ORFs are classified as either “novel” translating ORFs or other types according to their locations relative to known CDSs (see the detailed explanation of ORF types in the RiboCode paper²² or on the RiboCode website³⁵). We illustrate how to interpret the RiboCode outputs using a predicted uORF of the gene ATF4 as an example (Supplemental Information). RiboCode also counts the number of genes containing different types of ORFs and plots them along with their percentages (Figure 2).
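
The collapsing rule can be sketched as follows. This is a simplified illustration, not the code that writes “RiboCode_ORFs_result_collapsed.txt”: the `Orf` record and its fields are hypothetical, coordinates are treated as plain genomic positions of the start and stop codons, and splicing and reading-frame checks are ignored.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Orf(NamedTuple):
    """Hypothetical ORF record used only for this sketch."""
    orf_id: str
    chrom: str
    strand: str  # "+" or "-"
    start: int   # genomic position of the start codon
    stop: int    # genomic position of the stop codon


def collapse_orfs(orfs: Iterable[Orf]) -> List[Orf]:
    """Keep, for each shared stop codon, the ORF with the most upstream start."""
    best: Dict[Tuple[str, str, int], Orf] = {}
    for orf in orfs:
        key = (orf.chrom, orf.strand, orf.stop)
        kept = best.get(key)
        if kept is None:
            best[key] = orf
            continue
        # Upstream means a smaller coordinate on "+" and a larger one on "-".
        if orf.strand == "+":
            more_upstream = orf.start < kept.start
        else:
            more_upstream = orf.start > kept.start
        if more_upstream:
            best[key] = orf
    return list(best.values())


orfs = [
    Orf("a", "chr1", "+", 100, 400),
    Orf("b", "chr1", "+", 40, 400),
    Orf("c", "chr1", "-", 900, 600),
    Orf("d", "chr1", "-", 960, 600),
    Orf("e", "chr2", "+", 10, 100),
]
kept_ids = sorted(o.orf_id for o in collapse_orfs(orfs))
```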

A study reported that some expressed but translationally quiescent genes can be activated to translate into peptides upon oxidative stress¹², indicating that other ORFs are probably translated only in a condition-dependent manner. RiboCode can be run on different experimental conditions separately (e.g., si-Ctrl or si-eIF3e) or jointly, as demonstrated in this protocol (steps 5.4 and 6.1). Multiplexing multiple samples into a single run by defining the lengths and P-site positions of selected reads in “merged_config.txt” has several advantages over processing each sample individually. First, it reduces the biases present in a single sample; second, it saves program running time; lastly, it provides enough data to carry out the statistics. Thus, it theoretically works better than the single-sample mode, especially for samples with low sequencing coverage and high background noise. Further quantification and comparison of the numbers of RPFs assigned to predicted ORFs between different conditions (e.g., si-eIF3e vs. si-Ctrl) allow us to discover context-dependent ORFs or explore the translational regulation of the ORFs.

Note that due to the accumulation of ribosomes at the beginnings and ends of ORFs, a phenomenon called “translation ramp,” the RPFs assigned to the first 15 codons and last 5 codons should be excluded from read counting so that the analysis of differential ORF translation is not biased by differences in initiation rates³,⁵,³⁶. These results suggested that the abundance of uORF types is higher in EIF3E-deficient cells than in control cells, which might be caused, at least partially, by the elevated levels of actively translating ribosomes. The meta-analysis of RPF densities around the start codons also suggested that early translation elongation is regulated by EIF3E. Note that simply counting the RPF reads in an ORF is not accurate for translation quantification, especially when translation elongation is severely blocked.
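
The trimming rule for read counting can be written as a short Python function. This is a minimal sketch under stated assumptions, not the counting code used in the protocol: the function name is hypothetical, and the input is assumed to be a list of per-nucleotide P-site counts spanning the ORF from the first nucleotide of the start codon to the last nucleotide of the final codon.

```python
from typing import Sequence


def count_rpfs_trimmed(
    psite_counts: Sequence[int],
    skip_start_codons: int = 15,
    skip_end_codons: int = 5,
) -> int:
    """Sum P-site counts after dropping codons at both ends of an ORF.

    The defaults follow the 15-codon and 5-codon exclusion described in
    the text; the per-nucleotide input layout is an assumption.
    """
    if len(psite_counts) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3 nucleotides")
    first = skip_start_codons * 3
    last = len(psite_counts) - skip_end_codons * 3
    if last <= first:
        return 0  # ORF too short to retain any codon after trimming
    return sum(psite_counts[first:last])


# A 30-codon ORF with one read per nucleotide keeps 10 codons (30 nt).
trimmed_total = count_rpfs_trimmed([1] * 90)
```

An ORF of 20 codons or fewer yields zero counted reads under these defaults, so for short ORFs such as many uORFs the trimming windows would need to be reconsidered.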

In summary, this protocol shows that RiboCode could be easily applied to identify novel translated ORFs of any size, including those encoding micropeptides. It would be a valuable tool for the research community to discover various types of ORFs in different physiological contexts or experimental conditions. Further validation of the protein or peptide products from these ORFs would be useful for the development of future applications of ribosome profiling.

Disclosures

The authors have nothing to disclose.

Acknowledgements

The authors would like to acknowledge the support from the computational resources provided by the HPCC platform of Xi'an Jiaotong University. Z.X. gratefully thanks the Young Topnotch Talent Support Plan of Xi'an Jiaotong University.

Materials

Name	Company	Catalog Number	Comments
A computer/server running Linux	Any	–	–
Anaconda or Miniconda	Anaconda	–	Anaconda: https://www.anaconda.com; Miniconda: https://docs.conda.io/en/latest/miniconda.html
R	R Foundation	–	https://www.r-project.org/
Rstudio	Rstudio	–	https://www.rstudio.com/

References

Eisenberg, A. R., et al. Translation Initiation Site Profiling Reveals Widespread Synthesis of Non-AUG-Initiated Protein Isoforms in Yeast. Cell Systems. 11 (2), 145-160 (2020).
Spealman, P., et al. Conserved non-AUG uORFs revealed by a novel regression analysis of ribosome profiling data. Genome Research. 28 (2), 214-222 (2018).
Ingolia, N. T., Lareau, L. F., Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 147 (4), 789-802 (2011).
Bazzini, A. A., et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. The EMBO Journal. 33 (9), 981-993 (2014).
Ingolia, N. T., et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Reports. 8 (5), 1365-1379 (2014).
Chew, G. L., Pauli, A., Schier, A. F. Conservation of uORF repressiveness and sequence features in mouse, human and zebrafish. Nature Communications. 7, 11663 (2016).
Zhang, H., et al. Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nature Communications. 12 (1), 1076 (2021).
Guenther, U. P., et al. The helicase Ded1p controls use of near-cognate translation initiation codons in 5′ UTRs. Nature. 559 (7712), 130-134 (2018).
Goldsmith, J., et al. Ribosome profiling reveals a functional role for autophagy in mRNA translational control. Communications Biology. 3 (1), 388 (2020).
Magny, E. G., et al. Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames. Science. 341 (6150), 1116-1120 (2013).
Stumpf, C. R., Moreno, M. V., Olshen, A. B., Taylor, B. S., Ruggero, D. The translational landscape of the mammalian cell cycle. Molecular Cell. 52 (4), 574-582 (2013).
Gerashchenko, M. V., Lobanov, A. V., Gladyshev, V. N. Genome-wide ribosome profiling reveals complex translational regulation in response to oxidative stress. Proceedings of the National Academy of Sciences of the United States of America. 109 (43), 17394-17399 (2012).
Andreev, D. E., et al. Oxygen and glucose deprivation induces widespread alterations in mRNA translation within 20 minutes. Genome Biology. 16, 90 (2015).
Chng, S. C., Ho, L., Tian, J., Reversade, B. ELABELA: a hormone essential for heart development signals via the apelin receptor. Developmental Cell. 27 (6), 672-680 (2013).
Pauli, A., et al. Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science. 343 (6172), 1248636 (2014).
Stark, A., et al. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature. 450 (7167), 219-232 (2007).
Lin, M. F., Jungreis, I., Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 27 (13), 275-282 (2011).
Slavoff, S. A., et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nature Chemical Biology. 9 (1), 59-64 (2013).
Schwaid, A. G., et al. Chemoproteomic discovery of cysteine-containing human short open reading frames. Journal of the American Chemical Society. 135 (45), 16750-16753 (2013).
Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M., Weissman, J. S. Genome-wide annotation and quantitation of translation by ribosome profiling. Current Protocols in Molecular Biology. , 1-19 (2013).
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R., Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 324 (5924), 218-223 (2009).
Xiao, Z., et al. De novo annotation and characterization of the translatome with ribosome profiling data. Nucleic Acids Research. 46 (10), 61 (2018).
Lin, Y., et al. eIF3 Associates with 80S Ribosomes to Promote Translation Elongation, Mitochondrial Homeostasis, and Muscle Health. Molecular Cell. 79 (4), 575-587 (2020).
AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. Available from: https://agat.readthedocs.io/en/latest/gff_to_gtf.html (2020).
Gene Expression Omnibus. Available from: https://www.ncbi.nlm.nih.gov/geo (2002).
Ingolia, N. T., Brar, G. A., Rouskin, S., McGeachy, A. M., Weissman, J. S. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nature Protocols. 7 (8), 1534-1550 (2012).
STAR manual. Available from: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf (2022).
The genetic codes. Available from: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (2019).
RiboMiner. Available from: https://github.com/xryanglab/RiboMiner (2020).
Ingolia, N. T., Hussmann, J. A., Weissman, J. S. Ribosome profiling: global views of translation. Cold Spring Harbor Perspectives in Biology. 11 (5), 032698 (2018).
Lee, S., et al. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proceedings of the National Academy of Sciences of the United States of America. 109 (37), 2424-2432 (2012).
Gao, X., et al. Quantitative profiling of initiating ribosomes in vivo. Nature Methods. 12 (2), 147-153 (2015).
Spealman, P., Naik, A., McManus, J. uORF-seqr: A Machine Learning-Based approach to the identification of upstream open reading frames in yeast. Methods in Molecular Biology. 2252, 313-329 (2021).
RiboCode. Available from: https://github.com/xryanglab/RiboCode (2018).
Sharma, P., Wu, J., Nilges, B. S., Leidel, S. A. Humans and other commonly used model organisms are resistant to cycloheximide-mediated biases in ribosome profiling experiments. Nature Communications. 12 (1), 5094 (2021).

Cite This Article

Zhu, Y., Li, F., Yang, X., Xiao, Z. De novo Identification of Actively Translated Open Reading Frames with Ribosome Profiling Data. J. Vis. Exp. (180), e63366, doi:10.3791/63366 (2022).