An Integrated Approach for Microprotein Identification and Sequence Analysis

Published: July 12, 2022

doi:

Omar Brito-Estrada*¹, Keira R. Hassel*¹, Catherine A. Makarewich²

¹The Heart Institute, Division of Molecular Cardiovascular Biology, Cincinnati Children’s Hospital Medical Center, ²Department of Pediatrics, University of Cincinnati College of Medicine

Summary

The protocol described here provides detailed instructions on how to analyze genomic regions of interest for microprotein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser. Additionally, several tools and resources are recommended to further investigate sequence characteristics of identified microproteins to gain insight into their putative functions.

Abstract

Next-generation sequencing (NGS) has propelled the field of genomics forward and produced whole genome sequences for numerous animal species and model organisms. However, despite this wealth of sequence information, comprehensive gene annotation efforts have proven challenging, especially for small proteins. Notably, conventional protein annotation methods were designed to intentionally exclude putative proteins encoded by short open reading frames (sORFs) less than 300 nucleotides in length to filter out the exponentially higher number of spurious noncoding sORFs throughout the genome. As a result, hundreds of functional small proteins called microproteins (<100 amino acids in length) have been incorrectly classified as noncoding RNAs or overlooked entirely.

Here we provide a detailed protocol to leverage free, publicly available bioinformatic tools to query genomic regions for microprotein-coding potential based on evolutionary conservation. Specifically, we provide step-by-step instructions on how to examine sequence conservation and coding potential using Phylogenetic Codon Substitution Frequencies (PhyloCSF) on the user-friendly University of California Santa Cruz (UCSC) Genome Browser. Additionally, we detail steps to efficiently generate multiple species alignments of identified microprotein sequences to visualize amino acid sequence conservation and recommend resources to analyze microprotein characteristics, including predicted domain structures. These powerful tools can be used to help identify putative microprotein-coding sequences in noncanonical genomic regions or to rule out the presence of a conserved coding sequence with translational potential in a noncoding transcript of interest.

Introduction

The identification of the complete set of coding elements in the genome has been a major goal since the initiation of the Human Genome Project, and remains a central objective toward the understanding of biological systems and the etiology of genetic-based diseases¹^,²^,³^,⁴. Advances in NGS techniques have led to the production of whole genome sequences for an extensive number of organisms, including vertebrates, invertebrates, yeast, and plants⁵. Additionally, high-throughput transcriptional sequencing methods have further revealed the complexity of the cellular transcriptome, and identified thousands of novel RNA molecules with both protein-coding and noncoding functions⁶^,⁷. Decoding this vast amount of sequence information is an ongoing process, and challenges remain with comprehensive gene annotation efforts⁸.

The recent development of translational profiling methods, including ribosome profiling⁹^,¹⁰ and poly-ribosome sequencing¹¹, has provided evidence indicating that hundreds of noncanonical translation events map to currently unannotated sORFs throughout the genome, with the potential to generate small proteins called microproteins or micropeptides¹²^,¹³^,¹⁴^,¹⁵^,¹⁶^,¹⁷. Microproteins have emerged as a novel class of versatile proteins previously overlooked by standard gene annotation methods due to their small size (<100 amino acids) and lack of classical protein-coding gene characteristics⁸^,¹²^,¹⁸^,¹⁹^,²⁰. Microproteins have been described in virtually all organisms, including yeast²¹^,²², flies¹⁷^,²³^,²⁴, and mammals²⁵^,²⁶^,²⁷^,²⁸, and have been shown to play critical roles in diverse processes, including development, metabolism, and stress signaling¹⁹^,²⁰^,²⁹^,³⁰^,³¹^,³²^,³³^,³⁴. Thus, it is imperative to continue to mine the genome for additional members of this long-overlooked class of functional small proteins.

Despite the widespread recognition of the biological importance of microproteins, this class of genes remains vastly underrepresented in genome annotations, and their accurate identification continues to be an ongoing challenge that has hindered progress in the field. Various computational tools and experimental methods have recently been developed to overcome the difficulties associated with identifying microprotein-coding sequences (discussed extensively in several comprehensive reviews⁸^,³⁵^,³⁶^,³⁷). Many recent microprotein identification studies³⁸^,³⁹^,⁴⁰^,⁴¹^,⁴²^,⁴³^,⁴⁴^,⁴⁵^,⁴⁶^,⁴⁷ have relied heavily on the use of one such algorithm called PhyloCSF⁴⁸^,⁴⁹, a powerful comparative genomics approach that can be leveraged to distinguish conserved protein-coding regions of the genome from those that are noncoding.

PhyloCSF compares codon substitution frequencies (CSF) using multi-species nucleotide alignments and phylogenetic models to detect evolutionary signatures of protein-coding genes. This empirical model-based approach relies on the premise that proteins are primarily conserved at the amino acid level rather than the nucleotide sequence. Therefore, synonymous codon substitutions, which encode the same amino acid, or codon substitutions to amino acids with conserved properties (i.e., charge, hydrophobicity, polarity) are scored positively, while non-synonymous substitutions, including missense and nonsense substitutions, score negatively. PhyloCSF is trained on whole-genome data and has proven to be effective in scoring short portions of a coding sequence (CDS) in isolation from the full sequence, which is necessary when analyzing microproteins or individual exons of standard protein-coding genes⁴⁸^,⁴⁹.
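The distinction between synonymous and non-synonymous codon substitutions can be illustrated with a minimal Python sketch. This is not the PhyloCSF scoring model, which relies on empirical codon substitution frequencies and phylogenetic models; it only labels a single codon change using the standard genetic code. The function name classify_substitution and the "stop-loss" label are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change (hypothetical helper, not part of PhyloCSF)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop to stop)
    if alt_aa == "*":
        return "nonsense"     # sense codon replaced by a stop codon
    if ref_aa == "*":
        return "stop-loss"    # stop codon replaced by a sense codon
    return "missense"         # different amino acid


if __name__ == "__main__":
    for ref, alt in [("CTG", "CTA"), ("GAA", "GAC"), ("TGG", "TGA")]:
        print(ref, "->", alt, classify_substitution(ref, alt))
```

A conserved coding region is expected to show mostly synonymous or property-conserving changes across species, whereas a noncoding region accumulates changes without regard to codon structure.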

Notably, the recent integration of the PhyloCSF track hubs in the University of California Santa Cruz (UCSC) Genome Browser⁴⁹^,⁵⁰^,⁵¹ enables investigators of all backgrounds to easily access a user-friendly interface to query genomic regions of interest for protein-coding potential. The protocol outlined below provides detailed instruction on how to load the PhyloCSF track hubs on the UCSC Genome Browser and subsequently interrogate genomic regions of interest to probe for high-confidence protein-coding regions (or the lack thereof). Additionally, in the case where a positive PhyloCSF score is observed, steps are delineated to further analyze microprotein-coding potential and efficiently generate multiple species alignments of the identified amino acid sequences to illustrate cross-species sequence conservation. Lastly, several additional publicly available resources and tools are introduced in the discussion to survey identified microprotein characteristics, including predicted domain structures and insight into putative microprotein function.

Protocol

The protocol outlined below details steps to load and navigate the PhyloCSF browser tracks on the UCSC Genome Browser (generated by Mudge et al.⁴⁹). For general questions regarding the UCSC Genome Browser, an extensive Genome Browser User's Guide can be found here: https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html.

1. Loading the PhyloCSF Track Hub to the UCSC Genome Browser

Open an internet browser window and navigate to the UCSC Genome Browser (https://genome.ucsc.edu/).
Under the Our tools heading, select the Track Hubs option.
NOTE: The Track Hubs option can also be found under the My Data tab.
In the Public Hubs tab, type PhyloCSF into the Search terms box. Click on the Search Public Hubs button.
Connect to PhyloCSF by clicking on the Connect button for the Hub Name PhyloCSF (Description: Evolutionary protein-coding potential as measured by PhyloCSF).
NOTE: This Track Hub will load to numerous assemblies, including human (hg19 and hg38) and mouse (mm10 and mm39).
After clicking on connect, wait to be redirected to the UCSC Genome Browser Gateway page (https://genome.ucsc.edu/cgi-bin/hgGateway).

2. Navigating to genes of interest using Gene Identifiers

Select the species and genome assembly to query. To query a different species (e.g., mouse), select the species of interest under the Browse/Select Species heading by clicking on the appropriate icon, or type the species into the text box that says, Enter species, common name or assembly ID.
NOTE: The assembly is listed directly under the Find Position heading. Typically, the default is the Human Assembly (e.g., Dec. 2009 [GRCh37/hg19]).
Choose the assembly to search under the Find Position heading using the dropdown menu.
Enter the position, gene symbol, or search terms in the Position/Search Term box and click on Go to navigate to a gene of interest on the Genome Browser.
If the search resulted in multiple matches, wait to be redirected to a page that requires the selection of a position of interest. Click on the appropriate gene of interest.

3. Navigating to genomic regions of interest using sequence information

Navigate to the UCSC Genome Browser (https://genome.ucsc.edu/) and select the BLAST-Like Alignment Tool (BLAT) under the Our tools heading to query a specific DNA or protein sequence. Alternatively, hover the cursor over the Tools tab and select the Blat option or follow this link: https://genome.ucsc.edu/cgi-bin/hgBlat.
Select the species (Genome) and Assembly of interest using the dropdown menus.
Define the Query type using the dropdown menu.
Paste the sequence of interest into the BLAT Search Genome text box and click Submit.
Click on the browser link under the ACTIONS heading to navigate to the genomic region of interest.

4. Identifying conserved sORFs using PhyloCSF Track Data

Visually scan the genomic area of interest for positively scoring PhyloCSF regions (Figure 1).
NOTE: For a detailed explanation of how to visually interpret PhyloCSF scores on the UCSC Genome Browser, see the representative results section below.
Use the zoom feature to magnify regions of interest to examine sequence characteristics and search for start/stop codons. To zoom in manually, hold the shift key and click and hold the mouse button while dragging along the region of interest. Alternatively, use the zoom in and zoom out buttons at the top of the page to navigate (1.5x, 3x, 10x, or base zoom options are available).
NOTE: Before using the zoom in/zoom out buttons, it is necessary to reposition the gene so that the region of interest is in the middle of the screen. To perform this action, click on the image and drag it left or right to move the genomic region horizontally as desired or use the move arrows at the top of the page.
Zoom in until the nucleotide (base) sequence is visible.
NOTE: The nucleotide sequence will appear directly above the +1 Smoothed PhyloCSF score.
Visually scan the nucleotide sequence near the beginning and end of the positively scoring PhyloCSF regions to identify putative start (ATG) and stop (TGA/TAA/TAG) codons.
NOTE: If the gene of interest is on the minus strand of DNA, the start and stop codons will be the reverse complement (i.e., CAT for the start codon and TCA/TTA/CTA for the stop codon).
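For readers who prefer to check start and stop codons programmatically once a nucleotide sequence has been copied from the browser, the visual scan in this section can be approximated with a short Python sketch. It is only an illustration under stated assumptions: the helper names (reverse_complement, find_sorfs) and the 100-codon cutoff are hypothetical choices, the scan reports the first in-frame ATG before each stop codon, and it does not use PhyloCSF scores in any way.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sorfs(seq: str, max_codons: int = 100):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns tuples (strand, start, end, orf_sequence), where start/end are
    0-based, end-exclusive coordinates on the plus strand of the input and
    orf_sequence is written 5'->3' on the coding strand, stop codon included.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame start codon
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3                    # include the stop codon
                    n_codons = (end - start) // 3 - 1  # amino acids, no stop
                    if n_codons <= max_codons:
                        lo, hi = (start, end) if strand == "+" else (n - end, n - start)
                        hits.append((strand, lo, hi, s[start:end]))
                    start = None
    return hits


if __name__ == "__main__":
    # Toy minus-strand example: the plus strand shows TCA ... CAT.
    print(find_sorfs("TCATTTCAT"))
```

In the toy example, the plus-strand sequence begins with TCA and ends with CAT, which is how a minus-strand ORF (ATG ... TGA) appears when read left to right in the browser.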

5. Viewing homologous regions in other genomes

Hover the mouse over the View heading at the top of the page and click on the In Other Genomes (Convert) option.
Define the genome of interest using the dropdown menu below the New Genome heading.
Select the genomic assembly of interest using the dropdown menu under the New Assembly heading, then click the Submit button.
Once the browser returns a list of regions in the new assembly with similarity, click on the chromosome position link to navigate to the homologous region of interest.
NOTE: The percentage of total bases (nucleotides) and the span that are covered by the region will be defined for each region listed. The higher the percentage of matching bases, the higher the conservation is for the region of interest.
Follow the same navigational strategies detailed in Section 4 to analyze the sequence.

6. Generating multi-species sequence alignments for microproteins of interest

Click on the gene of interest in the GENCODE track on the UCSC Genome Browser (indicated in Figure 1A with a blue box) to navigate to the gene description page.
Under the Sequence and Links to Tools and Databases heading, click on the link in the table that reads Other Species FASTA.
Click on the boxes associated with the species of interest to select them. Click on Submit. Copy and paste the sequences appearing at the bottom of the page in FASTA format into a word processing document.
Open a second browser window and navigate to the Clustal Omega Multiple Sequence Alignment tool⁵² on the European Bioinformatics Institute (EMBL-EBI) website⁵³^,⁵⁴: https://www.ebi.ac.uk/Tools/msa/clustalo/.
Paste the sequence files that are still on the clipboard into the box in STEP 1 that reads sequences in any supported format. Scroll to the bottom of the page and click on Submit. Look below the aligned results (in black font) for symbols that indicate the degree of conservation of each amino acid (symbols are defined in Table 1).
NOTE: It may take several minutes to generate the alignment.
To view the amino acid properties in color, click on the Show Colors link directly above the sequences to color the amino acids according to their properties (defined in Table 2).
Copy and paste the sequence alignment into a word processing or slideshow program to generate a figure or illustration file (e.g., Figure 2).
NOTE: Use a monospaced font for the alignment such as Courier.
To view other outputs from the Clustal Omega results page, click on the appropriate tabs (i.e., Guide Tree or Phylogenetic Tree).
Click on the Results Viewers tab for options to view the sequence information using Jalview, a free program that specializes in multiple sequence alignment editing, visualization, and analysis⁵⁵, or to access direct links to MView and Simple Phylogeny⁵⁶.

Representative Results

Here we will use the validated microprotein mitoregulin (Mtln) as an example to demonstrate how a conserved sORF will generate a positive PhyloCSF score that can be easily visualized and analyzed on the UCSC Genome Browser. Mitoregulin was previously annotated as a noncoding RNA (formerly human gene ID LINC00116 and mouse gene ID 1500011K16Rik). Comparative genomics and sequence conservation analysis methods played a critical role in its initial discovery⁴⁰^,⁵⁷^,⁵⁸^,⁵⁹^,⁶⁰^,⁶¹, highlighting the strength of these methods. For this example, the mouse GRCm38/mm10 (Dec. 2011) assembly will be used. The search can be performed using the gene identifiers (mitoregulin, Mtln) or the gene position (chr2:127,791,364-127,792,496) as described in protocol section 2. Alternatively, the amino acid sequence for mitoregulin (shown in Figure 2) can be searched using the BLAT tool (described in protocol section 3).
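When positions such as the one given above are recorded in notes or scripts, it can help to parse them into numeric coordinates. The following Python sketch is one way to do this; the function names are hypothetical, and it assumes the position string uses the 1-based, inclusive coordinates shown in the Genome Browser position box.

```python
import re


def parse_ucsc_position(position: str):
    """Split a browser-style position such as 'chr2:127,791,364-127,792,496'
    into (chromosome, start, end). Coordinates are assumed to be 1-based
    and inclusive, as displayed in the browser position box."""
    match = re.fullmatch(r"\s*(\w+):([\d,]+)-([\d,]+)\s*", position)
    if match is None:
        raise ValueError(f"Unrecognized position string: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("Start coordinate is larger than end coordinate")
    return chrom, start, end


def span_length(start: int, end: int) -> int:
    """Number of bases covered by a 1-based, inclusive interval."""
    return end - start + 1


if __name__ == "__main__":
    chrom, start, end = parse_ucsc_position("chr2:127,791,364-127,792,496")
    print(chrom, start, end, span_length(start, end))
```

Under this assumption, the Mtln region quoted above spans 1,133 bases.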

A screen similar to the one depicted in Figure 1A will appear with the PhyloCSF Track Hub visible at the top of the screen. The Smoothed PhyloCSF tracks (smoothed with a hidden Markov model defining a probability that each codon is coding) are depicted as six total tracks, with three tracks corresponding to the plus strand of DNA (depicted in green as PhyloCSF +1, +2 and +3) and three tracks corresponding to the minus strand of DNA (depicted in red as PhyloCSF -1, -2 and -3). These tracks represent the three potential reading frames for the gene of interest in each direction. On the browser window, exons are depicted as blue rectangles connected by thin blue horizontal lines, which represent the introns. The arrowheads on the intronic regions indicate which direction the gene is transcribed in (and thus, which strand to focus on for the PhyloCSF score). For the example of Mtln in Figure 1, the intronic arrowheads are pointing to the left. Therefore, the Mtln gene is transcribed from the minus strand of DNA, and the relevant PhyloCSF score is depicted in the -1, -2, and -3 tracks (in red).

Each PhyloCSF track is depicted as a thin black line with negative scoring regions depicted in light green/red below the line and positive scoring regions indicated in dark green/red above the line. As described in the introduction, a positive PhyloCSF score indicates a conserved region that is likely coding. Note that protein-coding regions with particularly high sequence conservation often also score positively on the antisense strand; however, the PhyloCSF score is usually higher on the correct strand. For example, this can be seen in Figure 1 for Mtln, where the correct coding sequence scores very highly in the PhyloCSF -1 track, and the antisense strand (PhyloCSF +2 track) also generates a positive score. As seen in Figure 1A (indicated with black box), there is a region in the first exon of Mtln that scores very highly on the PhyloCSF -1 track, suggesting this may correspond to a coding region. To examine this region in further detail, it is helpful to zoom in and magnify the region (Figure 1B). As shown in Figure 1C,D, the positively scoring region in the first exon of Mtln begins directly over a start codon (Figure 1C) and terminates at a stop codon (Figure 1D), which indicates this ORF is highly conserved and strongly suggests it is a coding ORF. As Mtln is on the minus strand of DNA, the start and stop codons are shown as the reverse complement of the codon (i.e., the ATG start codon is shown as CAT [Figure 1C] and the TGA stop codon is shown as TCA [Figure 1D]).

In addition to using PhyloCSF to search for conserved regions with microprotein-coding potential, this technique can also be applied as a first-pass analysis of putative noncoding RNAs to rule out the presence of a conserved ORF, thus providing support for a noncoding annotation. For example, analysis of the well-characterized lncRNA HOTAIR⁶²^,⁶³ using PhyloCSF shows a negative score throughout the entire gene across all six tracks (Figure 3), strongly indicating a lack of sequence conservation and providing support that HOTAIR is correctly annotated as a noncoding RNA.

As clearly seen in Figure 1, the entire coding ORF for mitoregulin is located within a single exon, thereby producing a simple and straightforward readout by PhyloCSF with a single, uninterrupted, positively scoring region. However, PhyloCSF track hub data is not always as clear-cut and easy to interpret. For example, the mitolamban/Stmp1/Mm47 microprotein encoded by the mouse 1810058I24Rik gene⁴⁷^,⁶⁴^,⁶⁵ depicts a conserved ORF that spans three exons (Figure 4A), and the positive PhyloCSF score jumps from the +2 track in exon 1 (Figure 4B) to the +3 track in exon 2 (Figure 4C), and then back to the +2 track in exon 3 (Figure 4D). While at first glance this looks confusing, the explanation is quite straightforward. PhyloCSF scores the six potential reading frames (three on the plus strand of DNA and three on the minus strand) of genomic regions without considering the specific exon/intron architecture for each gene. Therefore, it retains the intronic sequence information in the 3-nucleotide periodicity of the reading frames. Thus, if an intron contains a number of nucleotides that is not divisible by three (i.e., three nucleotides/codon), the PhyloCSF reading frame will jump from one track to another.
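The track jump described above follows from simple modular arithmetic, sketched here in Python for a plus-strand gene. The function name track_after_intron is hypothetical, the intron lengths in the example are invented for illustration (they are not the actual 1810058I24Rik intron lengths), and the direction in which the track number changes depends on how the +1, +2, and +3 tracks are anchored to genomic coordinates, which is assumed here rather than taken from the PhyloCSF documentation. What does not depend on that assumption is that the track changes only when the intron length is not a multiple of three.

```python
def track_after_intron(track: int, intron_length: int) -> int:
    """Predict which plus-strand track (1, 2, or 3) carries the coding signal
    in the next exon, given the track in the current exon and the length of
    the intervening intron.

    Assumption: track k contains codons whose first base lies at genomic
    positions congruent to k modulo 3, so skipping an intron of length L
    moves the signal by L mod 3 tracks.
    """
    if track not in (1, 2, 3):
        raise ValueError("track must be 1, 2, or 3")
    if intron_length < 0:
        raise ValueError("intron_length must be non-negative")
    return (track - 1 + intron_length) % 3 + 1


if __name__ == "__main__":
    track = 2
    for intron_length in (301, 302):   # hypothetical intron lengths
        new_track = track_after_intron(track, intron_length)
        print(f"+{track} --intron of {intron_length} nt--> +{new_track}")
        track = new_track
```

With these invented lengths, the signal moves from the +2 track to the +3 track and back to the +2 track, the same pattern as described for the three-exon example above.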

Lastly, PhyloCSF can also be effectively used to identify multiple distinct coding ORFs within a single RNA molecule. For example, the MIEF1 microprotein (MIEF1-MP) is encoded within the 5' UTR of mitochondrial elongation factor 1 (MIEF1)⁶⁶ (Figure 5). When the MIEF1 genomic region is analyzed by PhyloCSF, a discrete positive PhyloCSF score corresponding to the MIEF1-MP (Figure 5C) can be readily observed upstream of the main CDS for MIEF1 (Figure 5B). Further discussion on MIEF1 and its associated microprotein (MIEF1-MP) is provided below in the discussion along with a summary of the strengths and weaknesses of the methods and protocols outlined in this article.

Figure 1: PhyloCSF analysis of the mitoregulin (Mtln) gene indicates a region of high sequence conservation corresponding to a validated microprotein. (A) Screenshots of the UCSC Genome Browser and PhyloCSF Tracks show that Mtln contains two exons and a single intron. The arrowheads within the intron point to the left, indicating the Mtln gene is transcribed from the minus strand of DNA, and the relevant PhyloCSF scores are therefore shown in the -1, -2, and -3 tracks (in red). The complete mitoregulin coding sequence is contained within Exon 1 and scores highly on the PhyloCSF -1 track (B). A conserved start codon can be clearly observed at the beginning of the positively scoring region in the PhyloCSF -1 track (C), which is highlighted with a green box (CAT, reverse complement ATG). Additionally, a conserved stop codon (TCA, reverse complement TGA) is indicated with a red box in panel (D), which aligns with the end of the positively scoring PhyloCSF region. Detailed information about the Mtln gene can be found by clicking on the Mtln gene identifier within the blue box (shown in panel A). Of note, highly conserved protein-coding regions often also score positively on the antisense strand (seen here in the PhyloCSF +2 track for Mtln). However, the PhyloCSF score is typically higher on the correct strand (the PhyloCSF -1 track in this example).

Figure 2: Multiple species sequence alignment of the microprotein mitoregulin generated using the Clustal Omega program. The mitoregulin amino acid sequences for the eight species indicated were extracted as detailed in protocol section 6 and aligned with the Clustal Omega multiple sequence alignment tool. The properties of the amino acids are indicated by color (red, small/hydrophobic; blue, acidic; magenta, basic; green, hydroxyl/sulfhydryl/amine) (further defined in Table 2). The symbols below the amino acids indicate the degree of conservation (asterisks, fully-conserved residues; colons, amino acids with strongly similar properties; periods, conservation between groups of weakly similar properties) (detailed extensively in Table 1).

Figure 3: A screenshot of the PhyloCSF tracks for the validated long noncoding RNA Hotair shows a lack of sequence conservation throughout its genomic locus. The arrowheads in the intronic region of Hotair are pointing left, indicating that the lncRNA is transcribed from the negative strand of DNA, and therefore the PhyloCSF -1, -2, and -3 tracks should be the focus of analysis. Note that the PhyloCSF score is negative throughout the entire gene (for all six tracks), indicating a lack of sequence conservation, which supports its proper annotation as a noncoding RNA.

Figure 4: PhyloCSF analysis of the mouse 1810058I24Rik gene, which encodes the microprotein mitolamban/Stmp1/Mm47. (A) The mouse 1810058I24Rik gene is composed of three exons, and the arrowheads in the intronic regions point right, indicating it is transcribed on the plus strand of DNA and therefore the PhyloCSF +1, +2, and +3 tracks should be analyzed. The conserved microprotein coding sequence spans all three exons, starting in exon 1 (B), reading through exon 2 (C), and ending in exon 3 (D). Note that the positive PhyloCSF score is found on the +2 track in exon 1, the +3 track in exon 2, and the +2 track in exon 3. The reason for the movement of the positive score from one track to the other is that PhyloCSF analyzes the six potential reading frames of the DNA sequence independent of the gene's exon/intron structure. Therefore, an intron containing a number of nucleotides that is not divisible by three (three nucleotides/codon) will cause a shift in the reading frame to a different track.

Figure 5: Analysis of the Mief1 genomic locus with PhyloCSF identifies a region with protein-coding potential in the 5' UTR that is independent of the main Mief1 CDS on the shared RNA. This conserved upstream ORF (uORF) has been shown to encode a microprotein named Mief1-MP. (A) Overview of the Mief1 genomic locus. The arrowheads in the introns point to the right, indicating Mief1 is transcribed from the plus strand of DNA (focus on the PhyloCSF +1, +2, and +3 tracks to determine coding potential). The main Mief1 CDS encodes a 463 amino acid protein and is shown in panel (B). However, there is also a distinct conserved upstream ORF within the 5' UTR of Mief1 that encodes a unique 70 amino acid microprotein called Mief1-MP (C). As seen in Panel C, the Mief1-MP has its own conserved start and stop codon within the Mief1 5' UTR, and the ORF scores very highly on the PhyloCSF +1 track, providing strong evidence that it encodes a functional microprotein. Abbreviations: ORF = open reading frame; uORF = upstream ORF; UTR = untranslated region; CDS = coding sequence.

Symbol	Level of Amino Acid Conservation	Grouped Amino Acids
Asterisk (*)	Fully-conserved residue	Not applicable (single, fully-conserved residue)
Colon (:)	Groups with strongly similar properties	STA; NEQK; NHQK; NDEQ; QHRK; MILV; MILF; HY; FYW
Period (.)	Groups with weakly similar properties	CSA; ATV; SAG; STNK; STPA; SGND; SNDEQK; NDEQHK; NEQHRK; FVLIM; HFY
Space (no symbol)	No similarity	Not applicable (no similarity)

Table 1: Definitions of consensus symbols for Multiple Sequence Alignments generated by Clustal Omega. The multiple species sequence alignment shown in Figure 2 was generated using Clustal Omega⁵². Abbreviations: serine (S), threonine (T), alanine (A), asparagine (N), glutamic acid (E), glutamine (Q), lysine (K), aspartic acid (D), arginine (R), methionine (M), isoleucine (I), leucine (L), phenylalanine (F), histidine (H), tyrosine (Y), tryptophan (W), cysteine (C), valine (V), glycine (G), proline (P).
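The symbol definitions in Table 1 can be expressed as a short Python sketch that derives the consensus line for an existing alignment. This is an illustration only and not Clustal Omega's own implementation: the function names are hypothetical, the toy sequences are invented (they are not mitoregulin sequences), and treating any column that contains a gap as "no similarity" is an assumption.

```python
# Residue groups copied from Table 1.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
               "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column: str) -> str:
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "                      # assumption: gapped columns get no symbol
    if len(residues) == 1:
        return "*"                      # fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"                      # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."                      # all residues fall within one weak group
    return " "                          # no similarity


def consensus_line(aligned_sequences):
    """Build the consensus symbol line for equal-length aligned sequences."""
    if len({len(s) for s in aligned_sequences}) != 1:
        raise ValueError("All aligned sequences must have the same length")
    return "".join(column_symbol("".join(col)) for col in zip(*aligned_sequences))


if __name__ == "__main__":
    toy_alignment = ["MSKA", "MTRG", "MARW"]   # invented sequences
    for seq in toy_alignment:
        print(seq)
    print(consensus_line(toy_alignment))
```

Each column is checked against the strong groups before the weak groups, so a column that satisfies both receives the colon.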

Font Color	Property	Amino Acid Residue [Abbreviation]
Red	Small, hydrophobic	alanine [A], valine [V], phenylalanine [F], proline [P], methionine [M], isoleucine [I], leucine [L], tryptophan [W]
Blue	Acidic	aspartic acid [D], glutamic acid [E]
Magenta	Basic	arginine [R], lysine [K]
Green	Hydroxyl, sulfhydryl, amine, +G	serine [S], threonine [T], tyrosine [Y], histidine [H], cysteine [C], asparagine [N], glycine [G], glutamine [Q]

Table 2: Properties of the amino acids depicted in Figure 2. Clustal Omega⁵² was used to generate the multiple sequence alignment shown in Figure 2.
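For users who wish to reproduce the coloring in Table 2 outside of the Clustal Omega results page (for example, when preparing a figure with a script), the table can be encoded as a simple lookup. The Python sketch below is a minimal illustration; the names are hypothetical, and returning "none" for gaps or non-standard characters is an assumption.

```python
# Color groups copied from Table 2.
COLOR_GROUPS = {
    "red": "AVFPMILW",      # small, hydrophobic
    "blue": "DE",           # acidic
    "magenta": "RK",        # basic
    "green": "STYHCNGQ",    # hydroxyl, sulfhydryl, amine, plus glycine
}

# Invert the table into a residue -> color lookup.
RESIDUE_COLOR = {
    residue: color
    for color, residues in COLOR_GROUPS.items()
    for residue in residues
}


def color_residues(sequence: str):
    """Pair each character of a sequence with its Table 2 color.

    Characters outside the 20 standard amino acids (e.g., gaps) are
    labeled 'none' (an assumption made for this sketch)."""
    return [(aa, RESIDUE_COLOR.get(aa.upper(), "none")) for aa in sequence]


if __name__ == "__main__":
    print(color_residues("MKD-S"))
```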

Discussion

The protocol presented here provides detailed instructions on how to interrogate genomic regions of interest for microprotein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser⁴⁸^,⁴⁹^,⁵⁰^,⁵¹. As detailed above, PhyloCSF is a powerful comparative genomics algorithm that integrates phylogenetic models and codon substitution frequencies to identify evolutionary signatures that are typical of protein-coding genes⁴⁸^,⁴⁹. PhyloCSF has been widely used to identify functional microproteins in genomic regions previously annotated as noncoding³⁸^,³⁹^,⁴⁰^,⁴¹^,⁴²^,⁴³^,⁴⁴^,⁴⁵^,⁴⁶^,⁴⁷, and this approach has been shown to outperform other comparative genomics methods for short sequences such as microproteins as small as 13 amino acids and for small exons of canonical proteins³⁵^,⁴⁸^,⁴⁹. Notably, the utility of PhyloCSF as a robust method to identify functional protein-coding sequences via evolutionary conservation extends beyond that of vertebrate and invertebrate species and has even been recently applied to viral genomes to successfully interrogate the protein-coding capacity of the SARS-CoV-2 genome⁶⁷.

In addition to identifying putative coding sequences within annotated noncoding RNAs, an advantage of PhyloCSF is that it can also reliably detect conserved microproteins encoded by ORFs within annotated untranslated regions (UTRs) of canonical protein-coding genes, including both 5' upstream and 3' downstream ORFs (uORFs and dORFs, respectively)⁸^,¹⁹^,⁶⁶^,⁶⁸. For example, the MIEF1 microprotein (MIEF1-MP) is encoded in the 5' UTR of mitochondrial elongation factor 1 (MIEF1)⁶⁶. In the case of MIEF1-MP, a discrete positive PhyloCSF score corresponding to the MIEF1-MP is observed upstream of the ORF that encodes MIEF1 (Figure 5). While some uORF-encoded microproteins directly interact with the downstream canonical proteins on their shared mRNA (e.g., MIEF1-MP and MIEF1), others function independently of the protein encoded by the main CDS⁶⁶^,⁶⁸. Therefore, when characterizing uORF-encoded microproteins, it should not be assumed that they function via direct regulation of their downstream protein product.

While PhyloCSF has many clear strengths as a tool for the identification of conserved microprotein-coding sequences, it is important to recognize several limitations of this method. First, while sequence conservation strongly suggests that a genomic region has undergone functional selection and is thus coding, a lack of robust conservation and a resultant negative PhyloCSF score does not definitively rule out coding potential for a given sequence. In other words, relying exclusively on PhyloCSF may result in the oversight of translated ORFs that are not strongly conserved but still produce functional microproteins. Notably, genomic regions with low conservation or negative conservation scores could correspond to species-specific coding regions or those of evolutionarily "young" genes via sequence divergence or de novo gene birth⁴⁶^,⁶⁹^,⁷⁰^,⁷¹^,⁷²^,⁷³^,⁷⁴. For example, the microprotein ASAP, which is encoded by what was formerly thought to be the human noncoding RNA LINC00467, is not scored positively by PhyloCSF because the amino acid sequence is only conserved in higher mammals⁷⁵. Additionally, recent studies identified several human-specific microproteins, including one encoded by the intergenic lncRNA RP3-527G5.1, that does not generate a positive PhyloCSF score⁶⁸^,⁷². In this regard, the absence of a positive PhyloCSF score cannot be interpreted as proof of a noncoding region and should be interpreted with caution.

A second consideration to keep in mind when using PhyloCSF is that even though a positive score is highly suggestive of functional selection and protein-coding capacity, this line of evidence cannot stand alone and must be experimentally validated. Examples of methods that can be used to generate supporting evidence for stable microprotein expression include the detection of the putative protein by mass spectrometry or western blotting using an antibody raised against the microprotein sequence of interest. Alternatively, since it can be challenging to generate reliable antibodies for microproteins due to the lack of sequence choices for optimal antigenicity, it is also possible to use CRISPR/Cas9 and the homology-directed repair (HDR) pathway to introduce an epitope tag into the endogenous locus in frame with the putative microprotein sequence, thereby facilitating the detection of the protein of interest using a high-affinity antibody (e.g., FLAG, HA, V5, Myc)¹⁸. A final limitation of PhyloCSF to acknowledge is that although it is currently integrated into many of the commonly used genomic assemblies, including Homo sapiens (human hg19, hg38), Mus musculus (mouse mm10, mm39), Gallus gallus (chicken, galGal4, galGal6), Drosophila melanogaster (fruit fly, dm6), Caenorhabditis elegans (nematodes, ce11), and SARS-CoV-2 (wuhCor1), there are still many species that cannot currently be queried directly on the UCSC Genome Browser.

The identification of conserved domains or sequence characteristics within identified microproteins can help increase confidence in their functional relevance and provide some insight into their putative function. Here we provide recommendations for specific tools and resources that can be used to analyze identified microprotein amino acid sequences in further detail to gain such insight. The specific tools listed below (and summarized in the Table of Materials) are freely available to the public, and we have found them to be particularly user-friendly and robust in microprotein studies¹⁸^,³⁸^,³⁹^,⁴⁰^,⁴¹^,⁴⁷. Beyond the tools described here, there are a multitude of additional resources that can be found in bioinformatics resource portals such as Expasy (https://www.expasy.org) and EMBL-EBI (https://www.ebi.ac.uk/services/all). However, detailing the specifics for each of the tools within these repositories is beyond the scope of this article. Here we recommend the following resources.

First, TMHMM⁷⁶ (https://services.healthtech.dtu.dk/service.php?TMHMM-2.0) analyzes protein sequences of interest for the presence of transmembrane domains. Notably, a number of microproteins that have been functionally characterized thus far contain single-pass transmembrane domains, which facilitate their localization to membrane regions and enable their direct regulation of ion channels, exchangers, and membrane-associated enzymes³⁰. Second, the National Center for Biotechnology Information (NCBI) Conserved Domain Search⁷⁷ (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) is a popular tool used to identify conserved domains within protein or coding nucleotide sequences. Third, the Protein family (Pfam)⁷⁸ database (http://pfam.xfam.org) provides alignments and classifications of protein families and domains. Fourth, WoLF PSORT⁷⁹ (https://wolfpsort.hgc.jp/) is a tool that can be employed to predict subcellular protein localization. Fifth, COXPRESdb⁸⁰ is a gene co-expression database (https://coxpresdb.jp) that provides co-regulated gene relationships to estimate gene functions. Finally, SignalP 6.0⁸¹ is a widely used prediction program (https://services.healthtech.dtu.dk/service.php?SignalP) that recognizes the presence of a signal peptide sequence and predicts the location of the cleavage site.
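Before submitting a sequence to TMHMM, it can be helpful to run a quick local pre-screen for strongly hydrophobic stretches. The sketch below is not the hidden Markov model used by TMHMM; it swaps in a much simpler sliding-window average over the Kyte-Doolittle hydropathy scale, using the conventional 19-residue window and a cutoff of 1.6 as assumed defaults. The function names and the synthetic test sequence are hypothetical and serve only to illustrate the idea.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence.

    Raises KeyError if the sequence contains a nonstandard residue.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Return 1-based inclusive (start, end) ranges spanned by consecutive
    windows whose mean hydropathy is at or above the threshold."""
    segments = []
    start = None
    end = None
    for i, mean in enumerate(hydropathy_windows(seq, window)):
        if mean >= threshold:
            if start is None:
                start = i
            end = i + window  # last residue of this window, 1-based
        elif start is not None:
            segments.append((start + 1, end))
            start = None
    if start is not None:
        segments.append((start + 1, end))
    return segments


if __name__ == "__main__":
    # Synthetic sequence: charged flanks around a 19-residue leucine stretch.
    synthetic = "DEKR" * 5 + "L" * 19 + "DEKR" * 5
    print(candidate_tm_segments(synthetic))
```

Because the reported range is the union of all windows that pass the cutoff, it typically extends a few residues beyond the hydrophobic core. Any candidate found this way should still be checked with TMHMM, and sequences shorter than the window return no segments.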

In summary, the methods described here can be used to effectively analyze genomic regions of interest for protein-coding potential using PhyloCSF on the UCSC Genome Browser. These methods are highly accessible and can be easily learned and efficiently applied by individuals without prior training or expertise in bioinformatics or comparative genomics. As demonstrated here in detail, PhyloCSF is a powerful tool that can be applied as a first-pass analysis to help distinguish protein-coding from noncoding genes in vertebrate, invertebrate, and viral genomes, and the strengths of this approach heavily outweigh the noted weaknesses.

Disclosures

The authors have nothing to disclose.

Acknowledgements

This work was supported by grants from the National Institutes of Health (HL-141630 and HL-160569) and Cincinnati Children's Research Foundation (Trustee Award).

Materials

Website	Website Address	Requirements	Description
Clustal Omega Multiple Sequence Alignment Tool	https://www.ebi.ac.uk/Tools/msa/clustalo/	Web browser	Multiple sequence alignment program for the efficient alignment of FASTA sequences (i.e. for cross-species comparison of identified microproteins)
COXPRESdb	https://coxpresdb.jp	Web browser	Provides co-regulated gene relationships to estimate gene functions
EMBL-EBI Bioinformatics Tools FAQs	https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ	Web browser	Frequently Asked Questions (FAQs) for EMBL-EBI tools. Includes the color coding key for protein sequence alignments
European Bioinformatics Institute (EMBL-EBI), Tools and Data Resources	https://www.ebi.ac.uk/services/all	Web browser	Comprehensive list of freely available websites, tools and data resources
Expasy – Swiss Bioinformatics Resource Portal	https://www.expasy.org	Web browser	Suite of bioinformatic tools and resources for protein sequence analysis that is maintained by the Swiss Institute of Bioinformatics (SIB)
National Center for Biotechnology Information (NCBI) Conserved Domain Search	https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi	Web browser	Search tool to identify conserved domains within protein or coding nucleotide sequences
Pfam 35	http://pfam.xfam.org	Web browser	Protein family (Pfam) database, provides alignments and classification of protein families and domains
PhyloCSF Track Hub Description	https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=1267045267_TEc99h2oW5QedaCd4ir8aZ65ryaD&db=mm10&c=chr2&g=hub_109801_PhyloCSF_smooth	Web browser	Detailed description of the Smoothed PhyloCSF tracks and PhyloCSF Track Hub
SignalP 6.0	https://services.healthtech.dtu.dk/service.php?SignalP-6.0	Web browser	Predicts the presence of signal peptides and the location of their cleavage sites
TMHMM – 2.0	https://services.healthtech.dtu.dk/service.php?TMHMM-2.0	Web browser	Prediction of transmembrane helices in proteins
UCSC Genome Browser BLAT Search	https://genome.ucsc.edu/cgi-bin/hgBlat	Web browser	Tool used to find genomic regions using DNA or protein sequence information
UCSC Genome Browser Gateway	https://genome.ucsc.edu/cgi-bin/hgGateway	Web browser	Direct link to the UCSC Genome Browser Gateway
UCSC Genome Browser Home	https://genome.ucsc.edu/	Web browser	Home website for the UCSC Genome Browser
UCSC Genome Browser Track Data Hubs	https://genome.ucsc.edu/cgi-bin/hgHubConnect#publicHubs	Web browser	Direct link to Track Data Hubs/Public Hubs database to search for and load the PhyloCSF Tracks
UCSC Genome Browser User Guide	https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html	Web browser	Comprehensive user guide detailing how to navigate the UCSC Genome Browser
WoLF PSORT	https://wolfpsort.hgc.jp	Web browser	Protein subcellular localization prediction tool

References

Collins, F. S., Morgan, M., Patrinos, A. The human genome project: lessons from large-scale biology. Science. 300 (5617), 286-290 (2003).
Lander, E. S., et al. Initial sequencing and analysis of the human genome. Nature. 409 (6822), 860-921 (2001).
Sachidanandam, R., et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 409 (6822), 928-933 (2001).
Venter, J. C., et al. The sequence of the human genome. Science. 291 (5507), 1304-1351 (2001).
Fuentes-Pardo, A. P., Ruzzante, D. E. Whole-genome sequencing approaches for conservation biology: Advantages, limitations and practical recommendations. Molecular Ecology. 26 (20), 5369-5406 (2017).
Carninci, P., et al. The transcriptional landscape of the mammalian genome. Science. 309 (5740), 1559-1563 (2005).
Maeda, N., et al. Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genetics. 2 (4), 62 (2006).
Schlesinger, D., Elsasser, S. J. Revisiting sORFs: overcoming challenges to identify and characterize functional microproteins. The FEBS Journal. 289 (1), 53-74 (2022).
Ingolia, N. T., et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Reports. 8 (5), 1365-1379 (2014).
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R., Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 324 (5924), 218-223 (2009).
Aspden, J. L., et al. Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. Elife. 3, 03528 (2014).
Andrews, S. J., Rothnagel, J. A. Emerging evidence for functional peptides encoded by short open reading frames. Nature Reviews Genetics. 15 (3), 193-204 (2014).
Mackowiak, S. D., et al. Extensive identification and analysis of conserved small ORFs in animals. Genome Biology. 16 (1), 1-21 (2015).
Ruiz-Orera, J., Messeguer, X., Subirana, J. A., Alba, M. M. Long non-coding RNAs as a source of new peptides. Elife. 3, 03523 (2014).
Basrai, M. A., Hieter, P., Boeke, J. D. Small open reading frames: beautiful needles in the haystack. Genome Research. 7 (8), 768-771 (1997).
Frith, M. C., et al. The abundance of short proteins in the mammalian proteome. PLoS Genetics. 2 (4), 52 (2006).
Ladoukakis, E., Pereira, V., Magny, E. G., Eyre-Walker, A., Couso, J. P. Hundreds of putatively functional small open reading frames in Drosophila. Genome Biology. 12 (11), 118 (2011).
Makarewich, C. A., Olson, E. N. Mining for Micropeptides. Trends in Cell Biology. 27 (9), 685-696 (2017).
Wright, B. W., Yi, Z., Weissman, J. S., Chen, J. The dark proteome: translation from noncanonical open reading frames. Trends in Cell Biology. , (2021).
Saghatelian, A., Couso, J. P. Discovery and characterization of smORF-encoded bioactive polypeptides. Nature Chemical Biology. 11 (12), 909-916 (2015).
Kastenmayer, J. P., et al. Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Research. 16 (3), 365-373 (2006).
Smith, J. E., et al. Translation of small open reading frames within unannotated RNA transcripts in Saccharomyces cerevisiae. Cell Reports. 7 (6), 1858-1866 (2014).
Lin, M. F., et al. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Research. 17 (12), 1823-1836 (2007).
Magny, E. G., et al. Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames. Science. 341 (6150), 1116-1120 (2013).
Bazzini, A. A., et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33 (9), 981-993 (2014).
Ingolia, N. T., Lareau, L. F., Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 147 (4), 789-802 (2011).
Ma, J., et al. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J Proteome Res. 13 (3), 1757-1765 (2014).
Slavoff, S. A., et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nature Chemical Biology. 9 (1), 59-64 (2013).
Khitun, A., Ness, T. J., Slavoff, S. A. Small open reading frames and cellular stress responses. Molecular Omics. 15 (2), 108-116 (2019).
Makarewich, C. A. The hidden world of membrane microproteins. Experimental Cell Research. 388 (2), 111853 (2020).
Pueyo, J. I., Magny, E. G., Couso, J. P. New peptides under the s(ORF)ace of the genome. Trends in Biochemical Sciences. 41 (8), 665-678 (2016).
Pauli, A., et al. Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science. 343 (6172), 1248636 (2014).
Chng, S. C., Ho, L., Tian, J., Reversade, B. ELABELA: a hormone essential for heart development signals via the apelin receptor. Developmental Cell. 27 (6), 672-680 (2013).
Lee, C., et al. The mitochondrial-derived peptide MOTS-c promotes metabolic homeostasis and reduces obesity and insulin resistance. Cell Metabolism. 21 (3), 443-454 (2015).
Pauli, A., Valen, E., Schier, A. F. Identifying (non-)coding RNAs and small peptides: challenges and opportunities. Bioessays. 37 (1), 103-112 (2015).
Plaza, S., Menschaert, G., Payre, F. In search of lost small peptides. Annual Review of Cell and Developmental Biology. 33, 391-416 (2017).
Kiniry, S. J., Michel, A. M., Baranov, P. V. Computational methods for ribosome profiling data analysis. Wiley Interdisciplinary Reviews: RNA. 11 (3), 1577 (2020).
Anderson, D. M., et al. A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell. 160 (4), 595-606 (2015).
Anderson, D. M., et al. Widespread control of calcium signaling by a family of SERCA-inhibiting micropeptides. Science Signaling. 9 (457), (2016).
Makarewich, C. A., et al. MOXI Is a mitochondrial micropeptide that enhances fatty acid beta-oxidation. Cell Reports. 23 (13), 3701-3709 (2018).
Nelson, B. R., et al. A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science. 351 (6270), 271-275 (2016).
Chu, Q., et al. Regulation of the ER stress response by a mitochondrial microprotein. Nat Commun. 10 (1), 4883 (2019).
Senis, E., et al. TUNAR lncRNA encodes a microprotein that regulates neural differentiation and neurite formation by modulating calcium dynamics. Frontiers in Cell and Developmental Biology. 9, 747667 (2021).
Li, M., et al. A putative long noncoding RNA-encoded micropeptide maintains cellular homeostasis in pancreatic beta cells. Molecular Therapy-Nucleic Acids. 26, 307-320 (2021).
Martinez, T. F., et al. Accurate annotation of human protein-coding small open reading frames. Nature Chemical Biology. 16 (4), 458-468 (2020).
van Heesch, S., et al. The translational landscape of the human heart. Cell. 178 (1), 242-260 (2019).
Makarewich, C. A., et al. The cardiac-enriched microprotein mitolamban regulates mitochondrial respiratory complex assembly and function in mice. Proceedings of the National Academy of Sciences of the United States of America. 119 (6), 2120476119 (2022).
Lin, M. F., Jungreis, I., Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 27 (13), 275-282 (2011).
Mudge, J. M., et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research. 29 (12), 2073-2087 (2019).
Kent, W. J., et al. The human genome browser at UCSC. Genome Research. 12 (6), 996-1006 (2002).
Raney, B. J., et al. Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser. Bioinformatics. 30 (7), 1003-1005 (2014).
Sievers, F., et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 7 (1), 539 (2011).
Goujon, M., et al. A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Research. 38 (2), 695-699 (2010).
Harte, N., et al. Public web-based services from the European Bioinformatics Institute. Nucleic Acids Research. 32 (2), 3-9 (2004).
Waterhouse, A. M., Procter, J. B., Martin, D. M., Clamp, M., Barton, G. J. Jalview Version 2-a multiple sequence alignment editor and analysis workbench. Bioinformatics. 25 (9), 1189-1191 (2009).
Madeira, F., et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Research. 47 (1), 636-641 (2019).
Friesen, M., et al. Mitoregulin controls beta-oxidation in human and mouse adipocytes. Stem Cell Reports. 14 (4), 590-602 (2020).
Stein, C. S., et al. Mitoregulin: A lncRNA-Encoded microprotein that supports mitochondrial supercomplexes and respiratory efficiency. Cell Reports. 23 (13), 3710-3720 (2018).
Chugunova, A., et al. LINC00116 codes for a mitochondrial peptide linking respiration and lipid metabolism. Proceedings of the National Academy of Sciences of the United States of America. 116 (11), 4940-4945 (2019).
Lin, Y. F., et al. A novel mitochondrial micropeptide MPM enhances mitochondrial respiratory activity and promotes myogenic differentiation. Cell Death and Disease. 10 (7), 528 (2019).
Wang, L., et al. The micropeptide LEMP plays an evolutionarily conserved role in myogenesis. Cell Death and Disease. 11 (5), 357 (2020).
He, S., Liu, S., Zhu, H. The sequence, structure and evolutionary features of HOTAIR in mammals. BMC Evolutionary Biology. 11 (1), 1-14 (2011).
Rinn, J. L., et al. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell. 129 (7), 1311-1323 (2007).
Bhatta, A., et al. A Mitochondrial micropeptide is required for activation of the Nlrp3 inflammasome. Journal of Immunology. 204 (2), 428-437 (2020).
Zhang, D., et al. Functional prediction and physiological characterization of a novel short trans-membrane protein 1 as a subunit of mitochondrial respiratory complexes. Physiological Genomics. 44 (23), 1133-1140 (2012).
Rathore, A., et al. MIEF1 microprotein regulates mitochondrial translation. Biochemistry. 57 (38), 5564-5575 (2018).
Jungreis, I., Sealfon, R., Kellis, M. SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nature Communications. 12 (1), 2642 (2021).
Chen, J., et al. Pervasive functional translation of noncanonical human open reading frames. Science. 367 (6482), 1140-1146 (2020).
Ruiz-Orera, J., Verdaguer-Grau, P., Villanueva-Canas, J. L., Messeguer, X., Alba, M. M. Translation of neutrally evolving peptides provides a basis for de novo gene evolution. Nature Ecology and Evolution. 2 (5), 890-896 (2018).
Blevins, W. R., et al. Uncovering de novo gene birth in yeast using deep transcriptomics. Nature Communications. 12 (1), 604 (2021).
Papadopoulos, C., et al. Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution. Genome Research. , (2021).
Vakirlis, N., Duggan, K. M., McLysaght, A. De novo birth of functional, human-specific microproteins. bioRxiv. , 462744 (2021).
Van Oss, S. B., Carvunis, A. R. De novo gene birth. PLoS Genetics. 15 (5), 1008160 (2019).
Andersson, D. I., Jerlstrom-Hultqvist, J., Nasvall, J. Evolution of new functions de novo and from preexisting genes. Cold Spring Harbor Perspectives in Biology. 7 (6), 017996 (2015).
Ge, Q., et al. Micropeptide ASAP encoded by LINC00467 promotes colorectal cancer progression by directly modulating ATP synthase activity. Journal of Clinical Investigation. 131 (22), (2021).
Sonnhammer, E. L., von Heijne, G., Krogh, A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proceedings. International Conference on Intelligent Systems for Molecular Biology. 6, 175-182 (1998).
Lu, S., et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Research. 48, 265-268 (2020).
Mistry, J., et al. Pfam: The protein families database in 2021. Nucleic Acids Research. 49, 412-419 (2021).
Horton, P., et al. PSORT: protein localization predictor. Nucleic Acids Research. 35 (2), 585-587 (2007).
Obayashi, T., Kagaya, Y., Aoki, Y., Tadaka, S., Kinoshita, K. COXPRESdb v7: a gene coexpression database for 11 animal species supported by 23 coexpression platforms for technical evaluation and evolutionary inference. Nucleic Acids Research. 47, 55-62 (2019).
Teufel, F., et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nature Biotechnology. , 01156 (2022).

Cite This Article

Brito-Estrada, O., Hassel, K. R., Makarewich, C. A. An Integrated Approach for Microprotein Identification and Sequence Analysis. J. Vis. Exp. (185), e63841, doi:10.3791/63841 (2022).
