The protocol described here provides detailed instructions on how to analyze genomic regions of interest for microprotein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser. Additionally, several tools and resources are recommended to further investigate sequence characteristics of identified microproteins to gain insight into their putative functions.
Next-generation sequencing (NGS) has propelled the field of genomics forward and produced whole genome sequences for numerous animal species and model organisms. However, despite this wealth of sequence information, comprehensive gene annotation efforts have proven challenging, especially for small proteins. Notably, conventional protein annotation methods were designed to intentionally exclude putative proteins encoded by short open reading frames (sORFs) less than 300 nucleotides in length to filter out the exponentially higher number of spurious noncoding sORFs throughout the genome. As a result, hundreds of functional small proteins called microproteins (<100 amino acids in length) have been incorrectly classified as noncoding RNAs or overlooked entirely.
Here we provide a detailed protocol to leverage free, publicly available bioinformatic tools to query genomic regions for microprotein-coding potential based on evolutionary conservation. Specifically, we provide step-by-step instructions on how to examine sequence conservation and coding potential using Phylogenetic Codon Substitution Frequencies (PhyloCSF) on the user-friendly University of California Santa Cruz (UCSC) Genome Browser. Additionally, we detail steps to efficiently generate multiple species alignments of identified microprotein sequences to visualize amino acid sequence conservation and recommend resources to analyze microprotein characteristics, including predicted domain structures. These powerful tools can be used to help identify putative microprotein-coding sequences in noncanonical genomic regions or to rule out the presence of a conserved coding sequence with translational potential in a noncoding transcript of interest.
The identification of the complete set of coding elements in the genome has been a major goal since the initiation of the Human Genome Project, and remains a central objective toward the understanding of biological systems and the etiology of genetic-based diseases1,2,3,4. Advances in NGS techniques have led to the production of whole genome sequences for an extensive number of organisms, including vertebrates, invertebrates, yeast, and plants5. Additionally, high-throughput transcriptional sequencing methods have further revealed the complexity of the cellular transcriptome, and identified thousands of novel RNA molecules with both protein-coding and noncoding functions6,7. Decoding this vast amount of sequence information is an ongoing process, and challenges remain with comprehensive gene annotation efforts8.
The recent development of translational profiling methods, including ribosome profiling9,10 and poly-ribosome sequencing11, has provided evidence indicating that hundreds of noncanonical translation events map to currently unannotated sORFs throughout the genome, with the potential to generate small proteins called microproteins or micropeptides12,13,14,15,16,17. Microproteins have emerged as a novel class of versatile proteins previously overlooked by standard gene annotation methods due to their small size (<100 amino acids) and lack of classical protein-coding gene characteristics8,12,18,19,20. Microproteins have been described in virtually all organisms, including yeast21,22, flies17,23,24, and mammals25,26,27,28, and have been shown to play critical roles in diverse processes, including development, metabolism, and stress signaling19,20,29,30,31,32,33,34. Thus, it is imperative to continue to mine the genome for additional members of this long-overlooked class of functional small proteins.
Despite the widespread recognition of the biological importance of microproteins, this class of genes remains vastly underrepresented in genome annotations, and their accurate identification continues to be an ongoing challenge that has hindered progress in the field. Various computational tools and experimental methods have recently been developed to overcome the difficulties associated with identifying microprotein-coding sequences (discussed extensively in several comprehensive reviews8,35,36,37). Many recent microprotein identification studies38,39,40,41,42,43,44,45,46,47 have relied heavily on the use of one such algorithm called PhyloCSF48,49, a powerful comparative genomics approach that can be leveraged to distinguish conserved protein-coding regions of the genome from those that are noncoding.
PhyloCSF compares codon substitution frequencies (CSF) using multi-species nucleotide alignments and phylogenetic models to detect evolutionary signatures of protein-coding genes. This empirical model-based approach relies on the premise that proteins are primarily conserved at the amino acid level rather than the nucleotide sequence. Therefore, synonymous codon substitutions, which encode the same amino acid, or codon substitutions to amino acids with conserved properties (i.e., charge, hydrophobicity, polarity) are scored positively, while non-synonymous substitutions, including missense and nonsense substitutions, score negatively. PhyloCSF is trained on whole-genome data and has proven to be effective in scoring short portions of a coding sequence (CDS) in isolation from the full sequence, which is necessary when analyzing microproteins or individual exons of standard protein-coding genes48,49.
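The distinction between synonymous, missense, and nonsense codon substitutions can be illustrated with a short Python sketch. This is not the PhyloCSF algorithm, which scores substitutions with empirically trained phylogenetic models; it only classifies a single codon pair using the standard genetic code. The function name `classify_substitution` is a hypothetical helper introduced for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_a, codon_b):
    """Classify a codon substitution (hypothetical helper, not PhyloCSF scoring)."""
    aa_a = CODON_TABLE[codon_a.upper()]
    aa_b = CODON_TABLE[codon_b.upper()]
    if aa_a == aa_b:
        return "synonymous"  # same amino acid (or stop) is encoded
    if "*" in (aa_a, aa_b):
        return "nonsense"    # a stop codon is gained or lost
    return "missense"        # a different amino acid is encoded


if __name__ == "__main__":
    for pair in [("CTG", "CTA"), ("GAT", "GAA"), ("TGG", "TGA")]:
        print(pair, classify_substitution(*pair))
```

In a conserved coding region, alignments across species are expected to be enriched for the first category relative to the other two, which is the signature that PhyloCSF quantifies.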
Notably, the recent integration of the PhyloCSF track hubs in the University of California Santa Cruz (UCSC) Genome Browser49,50,51 enables investigators of all backgrounds to easily access a user-friendly interface to query genomic regions of interest for protein-coding potential. The protocol outlined below provides detailed instruction on how to load the PhyloCSF track hubs on the UCSC Genome Browser and subsequently interrogate genomic regions of interest to probe for high-confidence protein-coding regions (or the lack thereof). Additionally, in the case where a positive PhyloCSF score is observed, steps are delineated to further analyze microprotein-coding potential and efficiently generate multiple species alignments of the identified amino acid sequences to illustrate cross-species sequence conservation. Lastly, several additional publicly available resources and tools are introduced in the discussion to survey identified microprotein characteristics, including predicted domain structures and insight into putative microprotein function.
The protocol outlined below details steps to load and navigate the PhyloCSF browser tracks on the UCSC Genome Browser (generated by Mudge et al.49). For general questions regarding the UCSC Genome Browser, an extensive Genome Browser User's Guide can be found here: https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html.
1. Loading the PhyloCSF Track Hub to the UCSC Genome Browser
2. Navigating to genes of interest using Gene Identifiers
3. Navigating to genomic regions of interest using sequence information
4. Identifying conserved sORFs using PhyloCSF Track Data
5. Viewing homologous regions in other genomes
6. Generating multi-species sequence alignments for microproteins of interest
Here we will use the validated microprotein mitoregulin (Mtln) as an example to demonstrate how a conserved sORF will generate a positive PhyloCSF score that can be easily visualized and analyzed on the UCSC Genome Browser. Mitoregulin was previously annotated as a noncoding RNA (formerly human gene ID LINC00116 and mouse gene ID 1500011K16Rik). Comparative genomics and sequence conservation analysis methods played a critical role in its initial discovery40,57,58,59,60,61, highlighting the strength of these methods. For this example, the mouse GRCm38/mm10 (Dec. 2011) assembly will be used. The search can be performed using the gene identifiers (mitoregulin, Mtln) or the gene position (chr2:127,791,364-127,792,496) as described in protocol section 2. Alternatively, the amino acid sequence for mitoregulin (shown in Figure 2) can be searched using the BLAT tool (described in protocol section 3).
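For readers who prefer to open the example region directly, the gene position quoted above can be converted into a browser link. The sketch below assumes the `db` and `position` URL parameters of the UCSC `hgTracks` page; the helper names `parse_position` and `browser_url` are hypothetical, and the PhyloCSF Track Hub still has to be connected as described in protocol section 1.

```python
from urllib.parse import urlencode


def parse_position(position):
    """Split 'chr2:127,791,364-127,792,496' into (chrom, start, end)."""
    chrom, span = position.replace(",", "").split(":")
    start, end = (int(value) for value in span.split("-"))
    return chrom, start, end


def browser_url(db, position):
    """Build a UCSC Genome Browser link for an assembly and a position."""
    chrom, start, end = parse_position(position)
    # Keep ':' unescaped so the position stays readable in the URL.
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"}, safe=":")
    return f"https://genome.ucsc.edu/cgi-bin/hgTracks?{query}"


if __name__ == "__main__":
    mtln = "chr2:127,791,364-127,792,496"
    chrom, start, end = parse_position(mtln)
    print(f"{chrom} spans {end - start + 1} bp")
    print(browser_url("mm10", mtln))
```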
A screen similar to the one depicted in Figure 1A will appear with the PhyloCSF Track Hub visible at the top of the screen. The Smoothed PhyloCSF tracks (smoothed with a hidden Markov model defining a probability that each codon is coding) are depicted as six total tracks, with three tracks corresponding to the plus strand of DNA (depicted in green as PhyloCSF +1, +2 and +3) and three tracks corresponding to the minus strand of DNA (depicted in red as PhyloCSF -1, -2 and -3). These tracks represent the three potential reading frames for the gene of interest in each direction. On the browser window, exons are depicted as blue rectangles connected by thin blue horizontal lines, which represent the introns. The arrowheads on the intronic regions indicate the direction in which the gene is transcribed (and thus, which strand to focus on for the PhyloCSF score). For the example of Mtln in Figure 1, the intronic arrowheads are pointing to the left. Therefore, the Mtln gene is transcribed from the minus strand of DNA, and the relevant PhyloCSF score is depicted in the -1, -2, and -3 tracks (in red).
Each PhyloCSF track is depicted as a thin black line with negative scoring regions depicted in light green/red below the line and positive scoring regions indicated in dark green/red above the line. As described in the introduction, a positive PhyloCSF score indicates a conserved region that is likely coding. Note that protein-coding regions with particularly high sequence conservation often also score positively on the antisense strand; however, the PhyloCSF score is usually higher on the correct strand. For example, this can be seen in Figure 1 for Mtln, where the correct coding sequence scores very highly in the PhyloCSF -1 track, and the antisense strand (PhyloCSF +2 track) also generates a positive score. As seen in Figure 1A (indicated with black box), there is a region in the first exon of Mtln that scores very highly on the PhyloCSF -1 track, suggesting this may correspond to a coding region. To examine this region in further detail, it is helpful to zoom in and magnify the region (Figure 1B). As shown in Figure 1C,D, the positively scoring region in the first exon of Mtln begins directly over a start codon (Figure 1C) and terminates at a stop codon (Figure 1D), which indicates this ORF is highly conserved and strongly suggests it is a coding ORF. As Mtln is on the minus strand of DNA, the start and stop codons are shown as the reverse complement of the codon (i.e., the ATG start codon is shown as CAT [Figure 1C] and the TGA stop codon is shown as TCA [Figure 1D]).
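Because the browser always displays the plus-strand sequence, it can help to compute how start and stop codons of a minus-strand gene will appear on screen. The following is a minimal Python sketch; `reverse_complement` is a hypothetical helper name, and only unambiguous A/C/G/T bases are handled.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Start codon and the three stop codons as displayed for a minus-strand gene.
    for codon in ["ATG", "TGA", "TAA", "TAG"]:
        print(codon, "->", reverse_complement(codon))
```

For a minus-strand gene such as Mtln, the start codon is therefore found as CAT at the right-hand edge of the positively scoring region, and the stop codon as TCA, TTA, or CTA at the left-hand edge.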
In addition to using PhyloCSF to search for conserved regions with microprotein-coding potential, this technique can also be applied as a first-pass analysis of putative noncoding RNAs to rule out the presence of a conserved ORF, thus providing support for a noncoding annotation. For example, analysis of the well-characterized lncRNA HOTAIR62,63 using PhyloCSF shows a negative score throughout the entire gene across all six tracks (Figure 3), strongly indicating a lack of sequence conservation and providing support that HOTAIR is correctly annotated as a noncoding RNA.
As clearly seen in Figure 1, the entire coding ORF for mitoregulin is located within a single exon, thereby producing a simple and straightforward readout by PhyloCSF with a single, uninterrupted, positively scoring region. However, PhyloCSF track hub data are not always as clear-cut and easy to interpret. For example, the mitolamban/Stmp1/Mm47 microprotein encoded by the mouse 1810058I24Rik gene47,64,65 is translated from a conserved ORF that spans three exons (Figure 4A), and the positive PhyloCSF score jumps from the +2 track in exon 1 (Figure 4B) to the +3 track in exon 2 (Figure 4C), and then back to the +2 track in exon 3 (Figure 4D). While at first glance this looks confusing, the explanation is quite straightforward. PhyloCSF scores the six potential reading frames (three on the plus strand of DNA and three on the minus strand) of genomic regions without considering the specific exon/intron architecture of each gene. Therefore, intronic nucleotides are counted in the 3-nucleotide periodicity of the reading frames. Thus, if an intron contains a number of nucleotides that is not divisible by three (the number of nucleotides per codon), the positive PhyloCSF score will jump from one track to another.
Lastly, PhyloCSF can also be effectively used to identify multiple distinct coding ORFs within a single RNA molecule. For example, the MIEF1 microprotein (MIEF1-MP) is encoded within the 5' UTR of mitochondrial elongation factor 1 (MIEF1)66 (Figure 5). When the MIEF1 genomic region is analyzed by PhyloCSF, a discrete positive PhyloCSF score corresponding to the MIEF1-MP (Figure 5C) can be readily observed upstream of the main CDS for MIEF1 (Figure 5B). Further discussion on MIEF1 and its associated microprotein (MIEF1-MP) is provided below in the discussion along with a summary of the strengths and weaknesses of the methods and protocols outlined in this article.
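As a complement to the conservation-based view, a transcript sequence can be scanned for all ATG-initiated ORFs to see whether more than one candidate exists on the same RNA. The following Python sketch is not part of PhyloCSF or the UCSC Genome Browser; `find_orfs` is a hypothetical helper, and the toy transcript is invented for illustration (it is not the Mief1 sequence).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end, orf_sequence) for each ATG-to-stop ORF on the given strand.

    Coordinates are 0-based and end-exclusive. In each of the three frames,
    the first ATG after the previous stop codon is used as the start.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    # Toy transcript: a short upstream ORF followed by a second ORF in another frame.
    transcript = "CC" + "ATGGCTTAA" + "GG" + "ATGCCCGGGTGA" + "AA"
    for start, end, orf in find_orfs(transcript):
        print(start, end, orf)
```

An ORF found this way is only a candidate; a positive PhyloCSF score that begins at its start codon and ends at its stop codon provides the evolutionary evidence that it is coding.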
Figure 1: PhyloCSF analysis of the mitoregulin (Mtln) gene indicates a region of high sequence conservation corresponding to a validated microprotein. (A) Screenshots of the UCSC Genome Browser and PhyloCSF Tracks show that Mtln contains two exons and a single intron. The arrowheads within the intron point to the left, indicating the Mtln gene is transcribed from the minus strand of DNA, and the relevant PhyloCSF scores are therefore shown in the -1, -2, and -3 tracks (in red). The complete mitoregulin coding sequence is contained within Exon 1 and scores highly on the PhyloCSF -1 track (B). A conserved start codon can be clearly observed at the beginning of the positively scoring region in the PhyloCSF -1 track (C), which is highlighted with a green box (CAT, reverse complement ATG). Additionally, a conserved stop codon (TCA, reverse complement TGA) is indicated with a red box in panel (D), which aligns with the end of the positively scoring PhyloCSF region. Detailed information about the Mtln gene can be found by clicking on the Mtln gene identifier within the blue box (shown in panel A). Of note, highly conserved protein-coding regions often also score positively on the antisense strand (seen here in the PhyloCSF +2 track for Mtln). However, the PhyloCSF score is typically higher on the correct strand (the PhyloCSF -1 track in this example).
Figure 2: Multiple species sequence alignment of the microprotein mitoregulin generated using the Clustal Omega program. The mitoregulin amino acid sequences for the eight species indicated were extracted as detailed in protocol section 6 and aligned with the Clustal Omega multiple sequence alignment tool. The properties of the amino acids are indicated by color (red, small/hydrophobic; blue, acidic; magenta, basic; green, hydroxyl/sulfhydryl/amine) (further defined in Table 2). The symbols below the amino acids indicate the degree of conservation (asterisks, fully-conserved residues; colons, amino acids with strongly similar properties; periods, conservation between groups of weakly similar properties) (detailed extensively in Table 1).
Figure 3: A screenshot of the PhyloCSF tracks for the validated long noncoding RNA Hotair shows a lack of sequence conservation throughout its genomic locus. The arrowheads in the intronic region of Hotair are pointing left, indicating that the lncRNA is transcribed from the negative strand of DNA, and therefore the PhyloCSF -1, -2, and -3 tracks should be the focus of analysis. Note that the PhyloCSF score is negative throughout the entire gene (for all six tracks), indicating a lack of sequence conservation, which supports its proper annotation as a noncoding RNA.
Figure 4: PhyloCSF analysis of the mouse 1810058I24Rik gene, which encodes the microprotein mitolamban/Stmp1/Mm47. (A) The mouse 1810058I24Rik gene comprises three exons, and the arrowheads in the intronic regions point right, indicating it is transcribed on the plus strand of DNA and therefore the PhyloCSF +1, +2, and +3 tracks should be analyzed. The conserved microprotein coding sequence spans all three exons, starting in exon 1 (B), reading through exon 2 (C), and ending in exon 3 (D). Note that the positive PhyloCSF score is found on the +2 track in exon 1, the +3 track in exon 2, and the +2 track in exon 3. The reason for the movement of the positive score from one track to the other is that PhyloCSF analyzes the six potential reading frames of the DNA sequence independent of the gene's exon/intron structure. Therefore, an intron containing a number of nucleotides that is not divisible by three (three nucleotides/codon) will cause a shift in the reading frame to a different track.
Figure 5: Analysis of the Mief1 genomic locus with PhyloCSF identifies a region with protein-coding potential in the 5' UTR that is independent of the main Mief1 CDS on the shared RNA. This conserved upstream ORF (uORF) has been shown to encode a microprotein named Mief1-MP. (A) Overview of the Mief1 genomic locus. The arrowheads in the introns point to the right, indicating Mief1 is transcribed from the plus strand of DNA (focus on the PhyloCSF +1, +2, and +3 tracks to determine coding potential). The main Mief1 CDS encodes a 463 amino acid protein and is shown in panel (B). However, there is also a distinct conserved upstream ORF within the 5' UTR of Mief1 that encodes a unique 70 amino acid microprotein called Mief1-MP (C). As seen in Panel C, the Mief1-MP has its own conserved start and stop codon within the Mief1 5' UTR, and the ORF scores very highly on the PhyloCSF +1 track, providing strong evidence that it encodes a functional microprotein. Abbreviations: ORF = open reading frame; uORF = upstream ORF; UTR = untranslated region; CDS = coding sequence.
Symbol | Level of Amino Acid Conservation | Grouped Amino Acids |
Asterisk (*) | Fully-conserved residue | Not applicable (single, fully-conserved residue) |
Colon (:) | Groups with strongly similar properties | STA; NEQK; NHQK; NDEQ; QHRK; MILV; MILF; HY; FYW |
Period (.) | Groups with weakly similar properties | CSA; ATV; SAG; STNK; STPA; SGND; SNDEQK; NDEQHK; NEQHRK; FVLIM; HFY |
Space (no symbol) | No similarity | Not applicable (no similarity) |
Table 1: Definitions of consensus symbols for Multiple Sequence Alignments generated by Clustal Omega. The multiple species sequence alignment shown in Figure 2 was generated using Clustal Omega52. Abbreviations: serine (S), threonine (T), alanine (A), asparagine (N), glutamic acid (E), glutamine (Q), lysine (K), aspartic acid (D), arginine (R), methionine (M), isoleucine (I), leucine (L), phenylalanine (F), histidine (H), tyrosine (Y), tryptophan (W), cysteine (C), valine (V), glycine (G), proline (P).
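The consensus symbols in Table 1 can be reproduced for a single alignment column with a few lines of Python. This is a minimal sketch based only on the groups listed in the table, not Clustal Omega's own code; the function names are hypothetical, the three short aligned sequences are invented, and treating any column that contains a gap as "no similarity" is an assumption.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def consensus_symbol(column):
    """Return '*', ':', '.', or ' ' for one column of aligned residues."""
    residues = set(column.upper())
    if "-" in residues:
        return " "  # assumption: columns containing gaps get no symbol
    if len(residues) == 1:
        return "*"  # fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strongly similar group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weakly similar group
    return " "


def consensus_line(aligned_sequences):
    """Build the consensus line for equal-length aligned sequences."""
    return "".join(consensus_symbol("".join(col)) for col in zip(*aligned_sequences))


if __name__ == "__main__":
    toy_alignment = ["MSLC", "MTIS", "MAVA"]  # invented sequences
    for seq in toy_alignment:
        print(seq)
    print(consensus_line(toy_alignment))
```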
Font Color | Property | Amino Acid Residue [Abbreviation] |
Red | Small, hydrophobic | alanine [A], valine [V], phenylalanine [F], proline [P], methionine [M], isoleucine [I], leucine [L], tryptophan [W] |
Blue | Acidic | aspartic acid [D], glutamic acid [E] |
Magenta | Basic | arginine [R], lysine [K] |
Green | Hydroxyl, sulfhydryl, amine, +G | serine [S], threonine [T], tyrosine [Y], histidine [H], cysteine [C], asparagine [N], glycine [G], glutamine [Q] |
Table 2: Properties of the amino acids depicted in Figure 2. Clustal Omega52 was used to generate the multiple sequence alignment shown in Figure 2.
The protocol presented here provides detailed instructions on how to interrogate genomic regions of interest for microprotein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser48,49,50,51. As detailed above, PhyloCSF is a powerful comparative genomics algorithm that integrates phylogenetic models and codon substitution frequencies to identify evolutionary signatures that are typical of protein-coding genes48,49. PhyloCSF has been widely used to identify functional microproteins in genomic regions previously annotated as noncoding38,39,40,41,42,43,44,45,46,47, and this approach has been shown to outperform other comparative genomics methods for short sequences such as microproteins as small as 13 amino acids and for small exons of canonical proteins35,48,49. Notably, the utility of PhyloCSF as a robust method to identify functional protein-coding sequences via evolutionary conservation extends beyond that of vertebrate and invertebrate species and has even been recently applied to viral genomes to successfully interrogate the protein-coding capacity of the SARS-CoV-2 genome67.
In addition to identifying putative coding sequences within annotated noncoding RNAs, an advantage of PhyloCSF is that it can also reliably detect conserved microproteins encoded by ORFs within annotated untranslated regions (UTRs) of canonical protein-coding genes, including both 5' upstream and 3' downstream ORFs (uORFs and dORFs, respectively)8,19,66,68. For example, the MIEF1 microprotein (MIEF1-MP) is encoded in the 5' UTR of mitochondrial elongation factor 1 (MIEF1)66. In the case of MIEF1-MP, a discrete positive PhyloCSF score corresponding to the MIEF1-MP is observed upstream of the ORF that encodes MIEF1 (Figure 5). While some uORF-encoded microproteins directly interact with the downstream canonical proteins on their shared mRNA (e.g., MIEF1-MP and MIEF1), others function independently of the protein encoded by the main CDS66,68. Therefore, when characterizing uORF-encoded microproteins, it should not be assumed that they function via direct regulation of their downstream protein product.
While PhyloCSF has many clear strengths as a tool for the identification of conserved microprotein-coding sequences, it is important to recognize several limitations of this method. First, while sequence conservation strongly suggests that a genomic region has undergone functional selection and is thus coding, a lack of robust conservation and a resultant negative PhyloCSF score does not definitively rule out coding potential for a given sequence. In other words, relying exclusively on PhyloCSF may result in the oversight of translated ORFs that are not strongly conserved but still produce functional microproteins. Notably, genomic regions with low conservation or negative conservation scores could correspond to species-specific coding regions or those of evolutionarily "young" genes that arose via sequence divergence or de novo gene birth46,69,70,71,72,73,74. For example, the microprotein ASAP, which is encoded by what was formerly thought to be the human noncoding RNA LINC00467, is not scored positively by PhyloCSF because the amino acid sequence is only conserved in higher mammals75. Additionally, recent studies identified several human-specific microproteins, including one encoded by the intergenic lncRNA RP3-527G5.1, that do not generate a positive PhyloCSF score68,72. In this regard, the absence of a positive PhyloCSF score cannot be interpreted as proof of a noncoding region and should be interpreted with caution.
A second consideration to keep in mind when using PhyloCSF is that even though a positive score is highly suggestive of functional selection and protein-coding capacity, this line of evidence cannot stand alone and must be experimentally validated. Examples of methods that can be used to generate supporting evidence for stable microprotein expression include the detection of the putative protein by mass spectrometry or western blotting using an antibody raised against the microprotein sequence of interest. Alternatively, since it can be challenging to generate reliable antibodies for microproteins due to the lack of sequence choices for optimal antigenicity, it is also possible to use CRISPR/Cas9 and the homology-directed repair (HDR) pathway to introduce an epitope tag into the endogenous locus in frame with the putative microprotein sequence, thereby facilitating the detection of the protein of interest using a high-affinity antibody (e.g., FLAG, HA, V5, Myc)18. A final limitation of PhyloCSF to acknowledge is that although it is currently integrated into many of the commonly used genomic assemblies, including Homo sapiens (human hg19, hg38), Mus musculus (mouse mm10, mm39), Gallus gallus (chicken, galGal4, galGal6), Drosophila melanogaster (fruit fly, dm6), Caenorhabditis elegans (nematodes, ce11), and SARS-CoV-2 (wuhCor1), there are still many species that cannot currently be queried directly on the UCSC Genome Browser.
The identification of conserved domains or sequence characteristics within identified microproteins can help increase confidence in their functional relevance and provide some insight into their putative function. Here we provide recommendations for specific tools and resources that can be used to analyze identified microprotein amino acid sequences in further detail to gain such insight. The specific tools listed below (and summarized in the Table of Materials) are freely available to the public, and we have found them to be particularly user-friendly and robust in microprotein studies18,38,39,40,41,47. Beyond the tools described here, there are a multitude of additional resources that can be found in bioinformatics resource portals such as Expasy (https://www.expasy.org) and EMBL-EBI (https://www.ebi.ac.uk/services/all). However, detailing the specifics for each of the tools within these repositories is beyond the scope of this article. Here we recommend the following resources.
First, TMHMM76 (https://services.healthtech.dtu.dk/service.php?TMHMM-2.0) analyzes protein sequences of interest for the presence of transmembrane domains. Notably, a number of microproteins that have been functionally characterized thus far contain single-pass transmembrane domains, which facilitate their localization to membrane regions and enable their direct regulation of ion channels, exchangers, and membrane-associated enzymes30. Second, the National Center for Biotechnology Information (NCBI) Conserved Domain Search77 (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) is a popular tool used to identify conserved domains within protein or coding nucleotide sequences. Third, the Protein family (Pfam)78 database (http://pfam.xfam.org) provides alignments and classifications of protein families and domains. Fourth, WoLF PSORT79 (https://wolfpsort.hgc.jp/) is a tool that can be employed to predict subcellular protein localization. Fifth, COXPRESdb80 is a gene co-expression database (https://coxpresdb.jp) that provides co-regulated gene relationships to estimate gene functions. Finally, SignalP 6.081 is a widely used prediction program (https://services.healthtech.dtu.dk/service.php?SignalP) that recognizes the presence of a signal peptide sequence and predicts the location of the cleavage site.
In summary, the methods described here can be used to effectively analyze genomic regions of interest for protein-coding potential using PhyloCSF on the UCSC Genome Browser. These methods are highly accessible and can be easily learned and efficiently applied by individuals without prior training or expertise in bioinformatics or comparative genomics. As demonstrated here in detail, PhyloCSF is a powerful tool that can be applied as a first-pass analysis to help distinguish protein-coding versus noncoding genes in vertebrate, invertebrate, and viral genomes, and the strengths of this approach heavily outweigh the noted weaknesses.
The authors have nothing to disclose.
This work was supported by grants from the National Institutes of Health (HL-141630 and HL-160569) and Cincinnati Children's Research Foundation (Trustee Award).
Website | Website Address | Requirements | Description |
Clustal Omega Multiple Sequence Alignment Tool | https://www.ebi.ac.uk/Tools/msa/clustalo/ | Web browser | Multiple sequence alignment program for the efficient alignment of FASTA sequences (i.e. for cross-species comparison of identified microproteins) |
COXPRESdb | https://coxpresdb.jp | Web browser | Provides co-regulated gene relationships to estimate gene functions |
EMBL-EBI Bioinformatics Tools FAQs | https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ | Web browser | Frequently Asked Questions (FAQs) for EMBL-EBI tools. Includes the color coding key for protein sequence alignments |
European Bioinformatics Institute (EMBL-EBI), Tools and Data Resources | https://www.ebi.ac.uk/services/all | Web browser | Comprehensive list of freely available websites, tools and data resources |
Expasy – Swiss Bioinformatics Resource Portal | https://www.expasy.org | Web browser | Suite of bioinformatic tools and resources for protein sequence analysis that is maintained by the Swiss Institute of Bioinformatics (SIB) |
National Center for Biotechnology Information (NCBI) Conserved Domain Search | https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi | Web browser | Search tool to identify conserved domains within protein or coding nucleotide sequences |
Pfam 35 | http://pfam.xfam.org | Web browser | Protein family (Pfam) database, provides alignments and classification of protein families and domains |
PhyloCSF Track Hub Description | https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=1267045267_TEc99h2oW5QedaCd4ir8aZ65ryaD&db=mm10&c=chr2&g=hub_109801_PhyloCSF_smooth | Web browser | Detailed description of the Smoothed PhyloCSF tracks and PhyloCSF Track Hub |
SignalP 6.0 | https://services.healthtech.dtu.dk/service.php?SignalP-6.0 | Web browser | Predicts the presence of signal peptides and the location of their cleavage sites |
TMHMM – 2.0 | https://services.healthtech.dtu.dk/service.php?TMHMM-2.0 | Web browser | Prediction of transmembrane helices in proteins |
UCSC Genome Browser BLAT Search | https://genome.ucsc.edu/cgi-bin/hgBlat | Web browser | Tool used to find genomic regions using DNA or protein sequence information |
UCSC Genome Browser Gateway | https://genome.ucsc.edu/cgi-bin/hgGateway | Web browser | Direct link to the UCSC Genome Browser Gateway |
UCSC Genome Browser Home | https://genome.ucsc.edu/ | Web browser | Home website for the UCSC Genome Browser |
UCSC Genome Browser Track Data Hubs | https://genome.ucsc.edu/cgi-bin/hgHubConnect#publicHubs | Web browser | Direct link to Track Data Hubs/Public Hubs database to search for and load the PhyloCSF Tracks |
UCSC Genome Browser User Guide | https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html | Web browser | Comprehensive user guide detailing how to navigate the UCSC Genome Browser |
WoLF PSORT | https://wolfpsort.hgc.jp | Web browser | Protein subcellular localization prediction tool |