Summary

一种用于微蛋白鉴定和序列分析的综合方法

Published: July 12, 2022
doi:

Summary

这里描述的协议提供了有关如何在用户友好的UCSC基因组浏览器上使用PhyloCSF分析微蛋白编码潜力的感兴趣基因组区域的详细说明。此外,建议使用几种工具和资源来进一步研究已鉴定的微蛋白的序列特征,以深入了解其假定的功能。

Abstract

下一代测序(NGS)推动了基因组学领域的发展,并为许多动物物种和模式生物产生了全基因组序列。然而,尽管有如此丰富的序列信息,但全面的基因注释工作已被证明具有挑战性,特别是对于小蛋白质。值得注意的是,传统的蛋白质注释方法被设计成有意排除由长度小于300个核苷酸的短开放阅读框(sORFs)编码的推定蛋白质,以过滤掉整个基因组中呈指数级增长的虚假非编码SORF。结果,数百种称为微蛋白的功能性小蛋白质(长度<100个氨基酸)被错误地归类为非编码RNA或完全被忽略。

在这里,我们提供了一个详细的协议,以利用免费的,公开可用的生物信息学工具,根据进化保护来查询基因组区域的微蛋白编码潜力。具体而言,我们提供了有关如何在用户友好的加州大学圣克鲁兹分校(UCSC)基因组浏览器上使用系统发育密码子替换频率(PhyloCSF)检查序列保存和编码潜力的分步说明。此外,我们还详细介绍了有效生成已鉴定微蛋白序列的多个物种比对的步骤,以可视化氨基酸序列保存,并推荐分析微蛋白特征的资源,包括预测的结构域结构。这些强大的工具可用于帮助鉴定非正统基因组区域中的推定微蛋白编码序列,或排除在感兴趣的非编码转录本中存在具有翻译潜力的保守编码序列。

Introduction

自人类基因组计划启动以来,基因组中全套编码元件的鉴定一直是一个主要目标,并且仍然是了解生物系统和基于遗传的疾病的病因学的中心目标1234。NGS技术的进步导致为大量生物体(包括脊椎动物,无脊椎动物,酵母和植物)生成全基因组序列5.此外,高通量转录测序方法进一步揭示了细胞转录组的复杂性,并鉴定了数千种具有蛋白质编码和非编码功能的新型RNA分子67。解码如此大量的序列信息是一个持续的过程,全面的基因注释工作仍然存在挑战8.

最近开发的翻译分析方法,包括核糖体分析910和多核糖体测序11,已经提供了证据,表明数百个非规范翻译事件映射到整个基因组中当前未注释的SORFs,有可能产生称为微蛋白或微肽的小蛋白质121314151617. 微蛋白已成为一类新颖的多功能蛋白质,由于其体积小(<100个氨基酸)和缺乏经典的蛋白质编码基因特征而被标准基因注释方法所忽视812181920。微蛋白已经在几乎所有生物体中都有描述,包括酵母2122,苍蝇172324和哺乳动物25262728,并且已被证明在各种过程中起关键作用,包括发育,代谢和应激信号传导1920293031323334.因此,必须继续为这一类长期被忽视的功能小蛋白质的其他成员挖掘基因组。

尽管人们普遍认识到微蛋白的生物学重要性,但这类基因在基因组注释中仍然远远不足,它们的准确鉴定仍然是一个持续的挑战,阻碍了该领域的进展。最近开发了各种计算工具和实验方法来克服与识别微蛋白编码序列相关的困难(在几个综合综述835,3637中广泛讨论)。最近的许多微蛋白鉴定研究38394041424344454647都严重依赖使用一种称为PhyloCSF4849的算法。,一种强大的比较基因组学方法,可用于区分基因组的保守蛋白质编码区域和非编码区域。

PhyloCSF使用多物种核苷酸比对和系统发育模型来比较密码子取代频率(CSF),以检测蛋白质编码基因的进化特征。这种基于经验模型的方法依赖于这样一个前提,即蛋白质主要在氨基酸水平而不是核苷酸序列上是保守的。因此,编码相同氨基酸的同义词密码子取代或对具有保守性质(即电荷,疏水性,极性)的氨基酸的密码子取代被评分为正,而非同义取代,包括错义和无意义取代,得分为负数。PhyloCSF在全基因组数据上进行训练,并已被证明在分离于完整序列的编码序列(CDS)的短部分方面是有效的,这在分析微蛋白或标准蛋白质编码基因的单个外显子4849时是必需的。

值得注意的是,最近在加州大学圣克鲁兹分校(UCSC)基因组浏览器495051中集成了PhyloCSF跟踪中心,使所有背景的研究人员都可以轻松访问用户友好的界面,以查询感兴趣的基因组区域,以获得蛋白质编码潜力。下面概述的协议提供了有关如何在UCSC基因组浏览器上加载PhyloCSF跟踪中心并随后询问感兴趣的基因组区域以探测高置信度蛋白质编码区域(或缺乏高置信度蛋白质编码区域)的详细说明。此外,在观察到阳性PhyloCSF评分的情况下,描述步骤以进一步分析微蛋白编码潜力并有效地生成已鉴定氨基酸序列的多个物种比对,以说明跨物种序列守恒。最后,在讨论中引入了几个额外的公开资源和工具,以调查已鉴定的微蛋白特征,包括预测的结构域结构和对假定微蛋白功能的见解。

Protocol

下面概述的协议详细介绍了在UCSC基因组浏览器(由Mudge等人生成)上加载和导航PhyloCSF浏览器轨道的步骤。有关UCSC基因组浏览器的一般问题,可以在此处找到广泛的基因组浏览器用户指南:https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html。 1. 将系统级社群跟踪中心加载到 UCSC 基因组浏览器 <l…

Representative Results

在这里,我们将使用经过验证的微蛋白米托雷古林(Mtln)作为示例,以演示保守的sORF将如何产生阳性的PhyloCSF评分,该评分可以在UCSC基因组浏览器上轻松可视化和分析。米托瑞古林以前被注释为非编码RNA(以前是人类基因ID LINC00116和小鼠基因ID 1500011K16Rik)。比较基因组学和序列保存分析方法在其初步发现40、57、58、…

Discussion

这里介绍的协议提供了有关如何在用户友好的UCSC基因组浏览器48495051上使用PhyloCSF询问感兴趣的基因组区域以识别微蛋白编码潜力的详细说明。如上所述,PhyloCSF是一种强大的比较基因组学算法,它集成了系统发育模型和密码子替换频率,以识别蛋白质编码基因48<sup …

Disclosures

The authors have nothing to disclose.

Acknowledgements

这项工作得到了美国国立卫生研究院(HL-141630和HL-160569)和辛辛那提儿童研究基金会(受托人奖)的资助。

Materials

Website Website Address Requirements
Clustal Omega Multiple Sequence Alignment Tool https://www.ebi.ac.uk/Tools/msa/clustalo/ Web browser Multiple sequence alignment program for the efficient alignment of FASTA sequences (i.e. for cross-species comparison of identified microproteins)
COXPRESSdb https://coxpresdb.jp Web browser Provides co-regulated gene relationships to estimate gene functions
EMBL-EBI Bioinformatics Tools FAQs https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ Web browser Frequently Asked Questions (FAQs) for EMBL-EBI tools. Includes the color coding key for protein sequence alignments
European Bioinformatics Institute (EMBL-EBI),
Tools and Data Resources
https://www.ebi.ac.uk/services/all Web browser Comprehensive list of freely available websites, tools and data resources
Expasy – Swiss Bioinformatics Resource Portal https://www.expasy.org Web browser Suite of bioinformatic tools and resources for protein sequence analysis that is maintained by the Swiss Institute of Bioinformatics (SIB)
National Center for Biotechnology Information (NCBI)
Conserved Domain Search
https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi Web browser Search tool to identify conserved domains within protein or coding nucleotide sequences
Pfam 35 http://pfam.xfam.org Web browser Protein family (Pfam) database, provides alignments and classification of protein families and domains
PhyloCSF Track Hub Description https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=1267045267_TEc99h2oW5Q
edaCd4ir8aZ65ryaD&db=mm10
&c=chr2&g=hub_109801_
PhyloCSF_smooth
Web browser Detailed description of the Smoothed PhyloCSF tracks and PhyloCSF Track Hub
   
   
   
   
   
SignalP 6.0 https://services.healthtech.dtu.dk/service.php?SignalP-6.0 Web browser Predicts the presence of signal peptides and the location of their cleavage sites
TMHMM – 2.0 https://services.healthtech.dtu.dk/service.php?TMHMM-2.0 Web browser Prediction of transmembrane helices in proteins
UCSC Genome Browser BLAT Search https://genome.ucsc.edu/cgi-bin/hgBlat Web browser Tool used to find genomic regions using DNA or protein sequence information
UCSC Genome Browser Gateway https://genome.ucsc.edu/cgi-bin/hgGateway Web browser Direct link to the UCSC Genome Browser Gateway
UCSC Genome Browser Home https://genome.ucsc.edu/ Web browser Home website for the UCSC Genome Browser
UCSC Genome Browser Track Data Hubs https://genome.ucsc.edu/cgi-bin/hgHubConnect#publicHubs Web browser Direct link to Track Data Hubs/Public Hubs database to search for and load the PhyloCSF Tracks
UCSC Genome Browser User Guide https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html Web browser Comprehensive user guide detailing how to navigate the UCSC Genome Browser
WoLF PSORT https://wolfpsort.hgc.jp Web browser Protein subcellular localization prediction tool

References

  1. Collins, F. S., Morgan, M., Patrinos, A. The human genome project: lessons from large-scale biology. Science. 300 (5617), 286-290 (2003).
  2. Lander, E. S., et al. Initial sequencing and analysis of the human genome. Nature. 409 (6822), 860-921 (2001).
  3. Sachidanandam, R., et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 409 (6822), 928-933 (2001).
  4. Venter, J. C., et al. The sequence of the human genome. Science. 291 (5507), 1304-1351 (2001).
  5. Fuentes-Pardo, A. P., Ruzzante, D. E. Whole-genome sequencing approaches for conservation biology: Advantages, limitations and practical recommendations. Molecular Ecology. 26 (20), 5369-5406 (2017).
  6. Carninci, P., et al. The transcriptional landscape of the mammalian genome. Science. 309 (5740), 1559-1563 (2005).
  7. Maeda, N., et al. Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genetics. 2 (4), 62 (2006).
  8. Schlesinger, D., Elsasser, S. J. Revisiting sORFs: overcoming challenges to identify and characterize functional microproteins. The FEBS Journal. 289 (1), 53-74 (2022).
  9. Ingolia, N. T., et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Reports. 8 (5), 1365-1379 (2014).
  10. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R., Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 324 (5924), 218-223 (2009).
  11. Aspden, J. L., et al. Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. Elife. 3, 03528 (2014).
  12. Andrews, S. J., Rothnagel, J. A. Emerging evidence for functional peptides encoded by short open reading frames. Nature Reviews Genetics. 15 (3), 193-204 (2014).
  13. Mackowiak, S. D., et al. Extensive identification and analysis of conserved small ORFs in animals. Genome Biology. 16 (1), 1-21 (2015).
  14. Ruiz-Orera, J., Messeguer, X., Subirana, J. A., Alba, M. M. Long non-coding RNAs as a source of new peptides. Elife. 3, 03523 (2014).
  15. Basrai, M. A., Hieter, P., Boeke, J. D. Small open reading frames: beautiful needles in the haystack. Genome Research. 7 (8), 768-771 (1997).
  16. Frith, M. C., et al. The abundance of short proteins in the mammalian proteome. PLoS Genetics. 2 (4), 52 (2006).
  17. Ladoukakis, E., Pereira, V., Magny, E. G., Eyre-Walker, A., Couso, J. P. Hundreds of putatively functional small open reading frames in Drosophila. Genome Biology. 12 (11), 118 (2011).
  18. Makarewich, C. A., Olson, E. N. Mining for Micropeptides. Trends in Cell Biology. 27 (9), 685-696 (2017).
  19. Wright, B. W., Yi, Z., Weissman, J. S., Chen, J. The dark proteome: translation from noncanonical open reading frames. Trends in Cell Biology. , (2021).
  20. Saghatelian, A., Couso, J. P. Discovery and characterization of smORF-encoded bioactive polypeptides. Nature Chemical Biology. 11 (12), 909-916 (2015).
  21. Kastenmayer, J. P., et al. Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Research. 16 (3), 365-373 (2006).
  22. Smith, J. E., et al. Translation of small open reading frames within unannotated RNA transcripts in Saccharomyces cerevisiae. Cell Reports. 7 (6), 1858-1866 (2014).
  23. Lin, M. F., et al. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Research. 17 (12), 1823-1836 (2007).
  24. Magny, E. G., et al. Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames. Science. 341 (6150), 1116-1120 (2013).
  25. Bazzini, A. A., et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33 (9), 981-993 (2014).
  26. Ingolia, N. T., Lareau, L. F., Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 147 (4), 789-802 (2011).
  27. Ma, J., et al. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J Proteome Res. 13 (3), 1757-1765 (2014).
  28. Slavoff, S. A., et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nature Chemical Biology. 9 (1), 59-64 (2013).
  29. Khitun, A., Ness, T. J., Slavoff, S. A. Small open reading frames and cellular stress responses. Molecular Omics. 15 (2), 108-116 (2019).
  30. Makarewich, C. A. The hidden world of membrane microproteins. Experimental Cell Research. 388 (2), 111853 (2020).
  31. Pueyo, J. I., Magny, E. G., Couso, J. P. New peptides under the s(ORF)ace of the genome. Trends in Biochemical Sciences. 41 (8), 665-678 (2016).
  32. Pauli, A., et al. Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science. 343 (6172), 1248636 (2014).
  33. Chng, S. C., Ho, L., Tian, J., Reversade, B. ELABELA: a hormone essential for heart development signals via the apelin receptor. Developmental Cell. 27 (6), 672-680 (2013).
  34. Lee, C., et al. The mitochondrial-derived peptide MOTS-c promotes metabolic homeostasis and reduces obesity and insulin resistance. Cell Metabolism. 21 (3), 443-454 (2015).
  35. Pauli, A., Valen, E., Schier, A. F. Identifying (non-)coding RNAs and small peptides: challenges and opportunities. Bioessays. 37 (1), 103-112 (2015).
  36. Plaza, S., Menschaert, G., Payre, F. In search of lost small peptides. Annual Review of Cell and Developmental Biology. 33, 391-416 (2017).
  37. Kiniry, S. J., Michel, A. M., Baranov, P. V. Computational methods for ribosome profiling data analysis. Wiley Interdisciplinary Reviews: RNA. 11 (3), 1577 (2020).
  38. Anderson, D. M., et al. A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell. 160 (4), 595-606 (2015).
  39. Anderson, D. M., et al. Widespread control of calcium signaling by a family of SERCA-inhibiting micropeptides. Science Signaling. 9 (457), (2016).
  40. Makarewich, C. A., et al. MOXI Is a mitochondrial micropeptide that enhances fatty acid beta-oxidation. Cell Reports. 23 (13), 3701-3709 (2018).
  41. Nelson, B. R., et al. A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science. 351 (6270), 271-275 (2016).
  42. Chu, Q., et al. Regulation of the ER stress response by a mitochondrial microprotein. Nat Commun. 10 (1), 4883 (2019).
  43. Senis, E., et al. TUNAR lncRNA encodes a microprotein that regulates neural differentiation and neurite formation by modulating calcium dynamics. Frontiers in Cell and Developmental Biology. 9, 747667 (2021).
  44. Li, M., et al. A putative long noncoding RNA-encoded micropeptide maintains cellular homeostasis in pancreatic beta cells. Molecular Therapy-Nucleic Acids. 26, 307-320 (2021).
  45. Martinez, T. F., et al. Accurate annotation of human protein-coding small open reading frames. Nature Chemical Biology. 16 (4), 458-468 (2020).
  46. van Heesch, S., et al. The translational landscape of the human heart. Cell. 178 (1), 242-260 (2019).
  47. Makarewich, C. A., et al. The cardiac-enriched microprotein mitolamban regulates mitochondrial respiratory complex assembly and function in mice. Proceedings of the National Academy of Sciences of the United States of America. 119 (6), 2120476119 (2022).
  48. Lin, M. F., Jungreis, I., Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 27 (13), 275-282 (2011).
  49. Mudge, J. M., et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research. 29 (12), 2073-2087 (2019).
  50. Kent, W. J., et al. The human genome browser at UCSC. Genome Research. 12 (6), 996-1006 (2002).
  51. Raney, B. J., et al. Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser. Bioinformatics. 30 (7), 1003-1005 (2014).
  52. Sievers, F., et al. scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 7 (1), 539 (2011).
  53. Goujon, M., et al. A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Research. 38 (2), 695-699 (2010).
  54. Harte, N., et al. Public web-based services from the European Bioinformatics Institute. Nucleic Acids Research. 32 (2), 3-9 (2004).
  55. Waterhouse, A. M., Procter, J. B., Martin, D. M., Clamp, M., Barton, G. J. Jalview Version 2-a multiple sequence alignment editor and analysis workbench. Bioinformatics. 25 (9), 1189-1191 (2009).
  56. Madeira, F., et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Research. 47 (1), 636-641 (2019).
  57. Friesen, M., et al. Mitoregulin controls beta-oxidation in human and mouse adipocytes. Stem Cell Reports. 14 (4), 590-602 (2020).
  58. Stein, C. S., et al. Mitoregulin: A lncRNA-Encoded microprotein that supports mitochondrial supercomplexes and respiratory efficiency. Cell Reports. 23 (13), 3710-3720 (2018).
  59. Chugunova, A., et al. LINC00116 codes for a mitochondrial peptide linking respiration and lipid metabolism. Proceedings of the Nationall Academy of Sciences of the United States of America. 116 (11), 4940-4945 (2019).
  60. Lin, Y. F., et al. A novel mitochondrial micropeptide MPM enhances mitochondrial respiratory activity and promotes myogenic differentiation. Cell Death and Disease. 10 (7), 528 (2019).
  61. Wang, L., et al. The micropeptide LEMP plays an evolutionarily conserved role in myogenesis. Cell Death and Disease. 11 (5), 357 (2020).
  62. He, S., Liu, S., Zhu, H. The sequence, structure and evolutionary features of HOTAIR in mammals. BMC Evolutionary Biology. 11 (1), 1-14 (2011).
  63. Rinn, J. L., et al. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell. 129 (7), 1311-1323 (2007).
  64. Bhatta, A., et al. A Mitochondrial micropeptide is required for activation of the Nlrp3 inflammasome. Journal of Immunology. 204 (2), 428-437 (2020).
  65. Zhang, D., et al. Functional prediction and physiological characterization of a novel short trans-membrane protein 1 as a subunit of mitochondrial respiratory complexes. Physiological Genomics. 44 (23), 1133-1140 (2012).
  66. Rathore, A., et al. MIEF1 microprotein regulates mitochondrial translation. Biochemistry. 57 (38), 5564-5575 (2018).
  67. Jungreis, I., Sealfon, R., Kellis, M. SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nature Communications. 12 (1), 2642 (2021).
  68. Chen, J., et al. Pervasive functional translation of noncanonical human open reading frames. Science. 367 (6482), 1140-1146 (2020).
  69. Ruiz-Orera, J., Verdaguer-Grau, P., Villanueva-Canas, J. L., Messeguer, X., Alba, M. M. Translation of neutrally evolving peptides provides a basis for de novo gene evolution. Nature Ecology and Evolution. 2 (5), 890-896 (2018).
  70. Blevins, W. R., et al. Uncovering de novo gene birth in yeast using deep transcriptomics. Nature Communications. 12 (1), 604 (2021).
  71. Papadopoulos, C., et al. Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution. Genome Research. , (2021).
  72. Vakirlis, N., Duggan, K. M., McLysaght, A. De novo birth of functional, human-specific microproteins. bioRxiv. , 462744 (2021).
  73. Van Oss, S. B., Carvunis, A. R. De novo gene birth. PLoS Genetics. 15 (5), 1008160 (2019).
  74. Andersson, D. I., Jerlstrom-Hultqvist, J., Nasvall, J. Evolution of new functions de novo and from preexisting genes. Cold Spring Harbor Perspectives in Biology. 7 (6), 017996 (2015).
  75. Ge, Q., et al. Micropeptide ASAP encoded by LINC00467 promotes colorectal cancer progression by directly modulating ATP synthase activity. Journal of Clinical Investigations. 131 (22), (2021).
  76. Sonnhammer, E. L., von Heijne, G., Krogh, A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proceedings. International Conference on Intelligent Syststems for Molecular Biology. 6, 175-182 (1998).
  77. Lu, S., et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Research. 48, 265-268 (2020).
  78. Mistry, J., et al. Pfam: The protein families database in 2021. Nucleic Acids Research. 49, 412-419 (2021).
  79. Horton, P., et al. PSORT: protein localization predictor. Nucleic Acids Research. 35 (2), 585-587 (2007).
  80. Obayashi, T., Kagaya, Y., Aoki, Y., Tadaka, S., Kinoshita, K. COXPRESdb v7: a gene coexpression database for 11 animal species supported by 23 coexpression platforms for technical evaluation and evolutionary inference. Nucleic Acids Research. 47, 55-62 (2019).
  81. Teufel, F., et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nature Biotechnology. , 01156 (2022).
check_url/63841?article_type=t

Play Video

Cite This Article
Brito-Estrada, O., Hassel, K. R., Makarewich, C. A. An Integrated Approach for Microprotein Identification and Sequence Analysis. J. Vis. Exp. (185), e63841, doi:10.3791/63841 (2022).

View Video