Selecting Multiple Biomarker Subsets with Similarly Effective Binary Classification Performances

Published: October 11, 2018
doi: 10.3791/57738

Summary

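Existing algorithms generate one solution for a biomarker detection dataset. This protocol demonstrates that multiple similarly effective solutions exist and presents user-friendly software to help biomedical researchers investigate their datasets for this challenge. Computer scientists may also provide this feature in their biomarker detection algorithms.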

Abstract

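Biomarker detection is one of the more important biomedical questions for high-throughput "omics" researchers, and almost all existing biomarker detection algorithms generate one biomarker subset with the optimized performance measurement for a given dataset. However, a recent study demonstrated the existence of multiple biomarker subsets with similarly effective or even identical classification performances. This protocol presents a simple and straightforward methodology for detecting biomarker subsets whose binary classification performances are better than a user-defined cutoff. The protocol consists of data preparation and loading, baseline information summarization, parameter tuning, biomarker screening, result visualization and interpretation, biomarker gene annotation, and export of results and visualizations at publication quality. The proposed biomarker screening strategy is intuitive and demonstrates a general rule for developing biomarker detection algorithms. A user-friendly graphical user interface (GUI) was developed in the programming language Python, giving biomedical researchers direct access to their results. The source code and manual of kSolutionVis can be downloaded from http://www.healthinformaticslab.org/supp/resources.php.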

Introduction

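Binary classification, one of the most commonly investigated and most challenging data mining problems in the biomedical area, is used to build a classification model trained on two groups of samples with the most accurate discrimination power [1,2,3,4,5,6,7]. However, the big data generated in the biomedical field follows the inherent "large p small n" paradigm, in which the number of features is usually much larger than the number of samples [6,8,9]. Therefore, biomedical researchers have to reduce the feature dimension before applying classification algorithms, to avoid the overfitting problem [8,9]. Diagnostic biomarkers are defined as a subset of detected features that separates patients with a given disease from healthy control samples [10,11]. Patients are usually defined as the positive samples, and healthy controls as the negative samples [12].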

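Recent studies have suggested that more than one solution with identical or similarly effective classification performance exists for a biomedical dataset [5]. Almost all feature selection algorithms are deterministic and produce only one solution for the same dataset. Genetic algorithms may simultaneously generate multiple solutions with similar performances, but they still try to select the one solution with the best fitness function as the output for a given dataset [13,14].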

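Feature selection algorithms can be roughly grouped as either filters or wrappers [12]. A filter algorithm chooses the top-k features ranked by their significant individual association with the binary class labels, based on the assumption that features are independent of each other [15,16,17]. Although this assumption does not hold for almost any real-world dataset, the heuristic filter rule performs well in many cases. Examples include the mRMR (Minimum Redundancy and Maximum Relevance) algorithm, the Wilcoxon test based feature filtering (WRank) algorithm, and the ROC (Receiver Operating Characteristic) plot based filtering (ROCRank) algorithm. mRMR is an efficient filter algorithm because, compared with the maximum-dependency feature selection algorithm, it approximates the combinatorial estimation problem with a series of much smaller problems, each of which involves only two variables, and therefore uses pairwise joint probabilities, which are more robust [18,19]. However, mRMR may underestimate the usefulness of some features because it does not measure interactions between features that can increase relevancy, and thus it misses feature combinations that are individually useless but useful when combined. The WRank algorithm calculates a non-parametric score of how discriminative a feature is between two classes of samples, and is known for its robustness to outliers [20,21]. Furthermore, the ROCRank algorithm evaluates how significant the Area Under the ROC Curve (AUC) of a specific feature is for the investigated binary classification performance [22,23].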
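The filter idea described above can be illustrated with a minimal sketch. It ranks each feature independently by its single-feature AUC, computed through the Mann-Whitney U statistic (the rank-sum statistic that also underlies the Wilcoxon test), and keeps the top-k. This is not the WRank or ROCRank implementation used in the study; the function names, the |AUC - 0.5| score, and the toy data are assumptions made for illustration.

```python
from itertools import product


def auc_score(pos, neg):
    """AUC of one feature: the probability that a random positive value
    exceeds a random negative one (ties count one half). This equals the
    Mann-Whitney U statistic divided by len(pos) * len(neg)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def rank_features(matrix, labels, k):
    """Return the top-k (feature_name, score) pairs.

    matrix: dict mapping feature name -> list of values, one per sample.
    labels: list of 1 (positive) / 0 (negative), aligned with the values.
    The score |AUC - 0.5| treats higher and lower values in the positive
    class symmetrically (an illustrative choice, not taken from the study).
    """
    scored = []
    for name, values in matrix.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scored.append((name, abs(auc_score(pos, neg) - 0.5)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: 3 features measured on 6 samples.
labels = [1, 1, 1, 0, 0, 0]
matrix = {
    "geneA": [5.1, 4.8, 5.5, 1.2, 0.9, 1.4],  # separates the classes fully
    "geneB": [2.0, 3.1, 1.0, 2.2, 2.9, 1.1],  # close to random
    "geneC": [0.5, 0.7, 2.5, 2.4, 2.6, 2.8],  # mostly lower in positives
}
top2 = rank_features(matrix, labels, k=2)
print(top2)  # geneA ranks first, followed by geneC
```

Because every feature is scored on its own, this kind of ranking cannot detect features that are only useful in combination, which is the limitation noted above for filters.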

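A wrapper, on the other hand, evaluates the performance of a pre-defined classifier on a given feature subset, which is generated iteratively by a heuristic rule, and returns the feature subset with the best performance measurement [24]. A wrapper usually outperforms a filter in classification performance but runs slower [25]. For example, the Regularized Random Forest (RRF) [26,27] algorithm uses a greedy rule: it evaluates the features on a subset of the training data at each random forest node, with feature importance scores evaluated by the Gini index. The choice of a new feature is penalized if its information gain does not improve on that of the already chosen features. The Prediction Analysis for Microarrays (PAM) [28,29] algorithm, also a wrapper, calculates a centroid for each class label and then selects features by shrinking the gene centroids toward the overall class centroid. PAM is robust to outlying features.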
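The wrapper principle above (score candidate subsets with a classifier, grow the subset by a heuristic rule) can be sketched as a greedy forward selection. This is neither RRF nor PAM: the nearest-centroid classifier, the leave-one-out accuracy, the stopping rule, and all names and data below are assumptions chosen to keep the example short and dependency-free.

```python
def centroid_predict(train_rows, train_labels, row):
    """Nearest-centroid classifier: return the label of the closer class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(row, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy using only the given feature indices."""
    sub = [[r[j] for j in features] for r in rows]
    hits = 0
    for i in range(len(sub)):
        train = sub[:i] + sub[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += centroid_predict(train, train_y, sub[i]) == labels[i]
    return hits / len(sub)


def forward_select(rows, labels, max_features):
    """Greedy wrapper: repeatedly add the feature that most improves the
    accuracy; stop when nothing improves it or the size limit is reached."""
    chosen, best = [], 0.0
    n_features = len(rows[0])
    while len(chosen) < min(max_features, n_features):
        candidates = [(loo_accuracy(rows, labels, chosen + [j]), j)
                      for j in range(n_features) if j not in chosen]
        # Highest accuracy wins; ties go to the lower feature index.
        acc, j = max(candidates, key=lambda c: (c[0], -c[1]))
        if acc <= best:
            break
        chosen, best = chosen + [j], acc
    return chosen, best


# Hypothetical toy data: rows are samples, columns are features.
rows = [[0.3, 5.0, 1.0], [0.9, 5.2, 0.2], [0.1, 4.9, 0.8],
        [0.8, 1.0, 0.3], [0.2, 1.1, 0.9], [0.7, 0.8, 0.5]]
labels = [1, 1, 1, 0, 0, 0]
chosen, best = forward_select(rows, labels, max_features=2)
print(chosen, best)
```

Each candidate subset costs one full round of classifier evaluations, which is why wrappers tend to run slower than filters.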

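Multiple solutions with top classification performance may be needed for any given dataset. First, the optimization goal of a deterministic algorithm is defined by a mathematical formula, e.g., the minimum error rate [30], which is not necessarily ideal for biological samples. Second, a dataset may have multiple, significantly different solutions with similarly effective or even identical performances. Almost all existing feature selection algorithms will randomly select one of these solutions as the output [31].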
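To make the multi-solution argument concrete, the sketch below enumerates every feature subset of a fixed size and keeps all of those whose accuracy reaches a user-defined cutoff, rather than returning a single optimum. It is an illustration under stated assumptions, not the screening procedure implemented in kSolutionVis: the exhaustive enumeration, the nearest-centroid classifier with leave-one-out accuracy, and the function names and toy data are all hypothetical.

```python
from itertools import combinations


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted
    to the given feature indices."""
    classes = sorted(set(labels))
    hits = 0
    for i, (row, truth) in enumerate(zip(rows, labels)):
        dists = {}
        for label in classes:
            # Class members, excluding the held-out sample i.
            members = [r for k, (r, y) in enumerate(zip(rows, labels))
                       if y == label and k != i]
            dists[label] = sum(
                (row[j] - sum(m[j] for m in members) / len(members)) ** 2
                for j in features)
        hits += min(dists, key=dists.get) == truth
    return hits / len(rows)


def solutions_above_cutoff(rows, labels, subset_size, cutoff):
    """Return every (accuracy, subset) pair with accuracy >= cutoff,
    best first, so that equally good alternatives stay visible."""
    n_features = len(rows[0])
    found = [(loo_accuracy(rows, labels, s), s)
             for s in combinations(range(n_features), subset_size)]
    return sorted((f for f in found if f[0] >= cutoff),
                  key=lambda f: (-f[0], f[1]))


# Hypothetical toy data: features 0 and 1 are redundant markers,
# features 2 and 3 carry little signal.
rows = [[5.0, 9.1, 0.3, 1.0], [5.2, 8.8, 0.9, 0.2], [4.9, 9.3, 0.1, 0.8],
        [1.0, 2.0, 0.8, 0.3], [1.1, 2.2, 0.2, 0.9], [0.8, 1.9, 0.7, 0.5]]
labels = [1, 1, 1, 0, 0, 0]
single = solutions_above_cutoff(rows, labels, subset_size=1, cutoff=0.9)
pairs = solutions_above_cutoff(rows, labels, subset_size=2, cutoff=0.9)
print(single)  # two different single-feature solutions reach the cutoff
print(pairs)
```

Exhaustive enumeration grows combinatorially with the number of features, so on a real dataset with thousands of features the candidate pool would have to be restricted first, for example to top-ranked features.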

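This study introduces an informatics analytic protocol for generating multiple feature selection solutions with similar performances for any given binary classification dataset. Considering that most biomedical researchers are not familiar with informatics techniques or computer coding, a user-friendly graphical user interface (GUI) was developed to facilitate rapid analysis of biomedical binary classification datasets. The analytic protocol consists of data loading and summarization, parameter tuning, pipeline execution, and result interpretation. With a simple click, the researcher can generate the biomarker subsets and publication-quality visualization plots. The protocol was tested using the transcriptomes of two binary classification datasets of Acute Lymphoblastic Leukemia (ALL), i.e., ALL1 and ALL2 [12]. The datasets ALL1 and ALL2 were downloaded from the Broad Institute Genome Data Analysis Center, available at http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. ALL1 contains 128 samples with 12,625 features. Of these samples, 95 are B-cell ALL and 33 are T-cell ALL. ALL2 contains 100 samples, also with 12,625 features. Of these samples, 65 patients suffered relapse and 35 did not. ALL1 is an easy binary classification dataset: the minimum accuracy across four filters and four wrappers is 96.7%, and 6 of the 8 feature selection algorithms achieve 100% [12]. ALL2 is a more difficult dataset, on which the same 8 feature selection algorithms achieve at best 83.7% accuracy [12]. This best accuracy was achieved with 56 features detected by the wrapper algorithm Correlation-based Feature Selection (CFS).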

Protocol

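NOTE: The following protocol describes the details of the informatics analytic procedure and the pseudo-codes of the major modules. The automatic analysis system was developed using Python version 3.6.0 and the Python modules pandas, abc, numpy, scipy, sklearn, sys, PyQt5, mRMR, math, and matplotlib. The materials used in this study are listed in the Table of Materials.

1. Prepare the Data Matrix and Class Labels

Prepare the data matrix file as a TAB- or comma-delimited matrix file, as illustrated in Figure 1A. …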
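As a rough illustration of the data preparation step, the sketch below parses a TAB- or comma-delimited matrix and a class label table using only the Python standard library. The exact layout shown in Figure 1A is not reproduced here, so the layout is an assumption: a header row with sample names after an ID cell, one row per feature, and a two-column label table (sample name, class label). The function names and the "P"/"N" labels are hypothetical and are not part of kSolutionVis.

```python
import csv
import io


def detect_delimiter(text):
    """Use TAB if the first line contains one, otherwise a comma."""
    return "\t" if "\t" in text.splitlines()[0] else ","


def load_matrix(text):
    """Parse the matrix into (sample_names, {feature_name: [values]}).

    Assumed layout: the header row is an ID cell followed by sample names,
    and every later row is a feature name followed by one number per sample.
    """
    reader = csv.reader(io.StringIO(text), delimiter=detect_delimiter(text))
    header = next(reader)
    samples = header[1:]
    features = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"row {row[0]!r} has {len(row) - 1} values, "
                             f"expected {len(samples)}")
        features[row[0]] = [float(v) for v in row[1:]]
    return samples, features


def load_labels(text, samples):
    """Parse (sample name, class label) rows and return the labels in the
    same order as `samples`, failing loudly if a sample has no label."""
    reader = csv.reader(io.StringIO(text), delimiter=detect_delimiter(text))
    table = {row[0]: row[1] for row in reader if row}
    missing = [s for s in samples if s not in table]
    if missing:
        raise ValueError(f"no class label for samples: {missing}")
    return [table[s] for s in samples]


# Hypothetical file contents, inlined so that the sketch runs on its own.
matrix_text = "ID\tS1\tS2\tS3\ngeneA\t5.1\t4.8\t1.2\ngeneB\t2.0\t3.1\t2.2\n"
label_text = "S1,P\nS2,P\nS3,N\n"
samples, features = load_matrix(matrix_text)
labels = load_labels(label_text, samples)
print(samples, features, labels)
```

Checking row lengths and missing labels at load time surfaces formatting problems before any feature screening starts.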

Representative Results

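The goal of this workflow (Figure 6) is to detect multiple biomarker subsets with similar efficiencies for a binary classification dataset. The whole process is illustrated with two example datasets, ALL1 and ALL2, extracted from a recently published biomarker detection study [12,48]. A user may install kSolutionVis by following the instructions in the supplementary materials. The dataset ALL1 profiled 12,625 transcriptomic features of 95 …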

Discussion

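This study presents an easy-to-follow, multi-solution biomarker detection and characterization protocol for a user-specified binary classification dataset. The software emphasizes user-friendliness and flexible import/export interfaces for various file formats, allowing a biomedical researcher to investigate a dataset easily through the GUI. This study also highlights the necessity of generating more than one solution with similarly effective modeling performance, which many existing biomarker detection algorithms have ignored. In the future, newly developed biomarker detection algorithms may …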

Disclosures

The authors have nothing to disclose.

Acknowledgements

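This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040400) and a startup grant from Jilin University. The authors thank the anonymous reviewers and the biomedical testing users for their constructive comments on improving the usability and functionality of kSolutionVis.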

Materials

Name | Company | Catalog Number | Comments
Hardware
laptop | Lenovo | X1 carbon | Any computer works. Recommended minimum configuration: 1 GB extra hard disk space, 1 GB memory, 2.0 GHz CPU
Software
Python 3.0 | WingWare | Wing Personal | Any Python programming and running environment that supports Python version 3.0 or above

References

  1. Heckerman, D., et al. Genetic variants associated with physical performance and anthropometry in old age: a genome-wide association study in the ilSIRENTE cohort. Scientific Reports. 7, 15879 (2017).
  2. Li, Z., et al. Genome-wide association analysis identifies 30 new susceptibility loci for schizophrenia. Nature Genetics. 49, 1576-1583 (2017).
  3. Winkler, T. W., et al. Quality control and conduct of genome-wide association meta-analyses. Nature Protocols. 9, 1192-1212 (2014).
  4. Harrison, R. N. S., et al. Development of multivariable models to predict change in Body Mass Index within a clinical trial population of psychotic individuals. Scientific Reports. 7, 14738 (2017).
  5. Liu, J., et al. Multiple similarly-well solutions exist for biomedical feature selection and classification problems. Scientific Reports. 7, 12830 (2017).
  6. Ye, Y., Zhang, R., Zheng, W., Liu, S., Zhou, F. RIFS: a randomly restarted incremental feature selection algorithm. Scientific Reports. 7, 13013 (2017).
  7. Zhou, F. F., Xue, Y., Chen, G. L., Yao, X. GPS: a novel group-based phosphorylation predicting and scoring method. Biochemical and Biophysical Research Communications. 325, 1443-1448 (2004).
  8. Sanchez, B. N., Wu, M., Song, P. X., Wang, W. Study design in high-dimensional classification analysis. Biostatistics. 17, 722-736 (2016).
  9. Shujie, M. A., Carroll, R. J., Liang, H., Xu, S. Estimation and Inference in Generalized Additive Coefficient Models for Nonlinear Interactions with High-Dimensional Covariates. Annals of Statistics. 43, 2102-2131 (2015).
  10. Li, J. H., et al. MiR-205 as a promising biomarker in the diagnosis and prognosis of lung cancer. Oncotarget. 8, 91938-91949 (2017).
  11. Lyskjaer, I., Rasmussen, M. H., Andersen, C. L. Putting a brake on stress signaling: miR-625-3p as a biomarker for choice of therapy in colorectal cancer. Epigenomics. 8, 1449-1452 (2016).
  12. Ge, R., et al. McTwo: a two-step feature selection algorithm based on maximal information coefficient. BMC Bioinformatics. 17, 142 (2016).
  13. Tumuluru, J. S., McCulloch, R. Application of Hybrid Genetic Algorithm Routine in Optimizing Food and Bioengineering Processes. Foods. 5, (2016).
  14. Gen, M., Cheng, R., Lin, L. Network models and optimization: Multiobjective genetic algorithm approach. (2008).
  15. Radovic, M., Ghalwash, M., Filipovic, N., Obradovic, Z. Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics. 18, 9 (2017).
  16. Ciuculete, D. M., et al. A methylome-wide mQTL analysis reveals associations of methylation sites with GAD1 and HDAC3 SNPs and a general psychiatric risk score. Translational Psychiatry. 7, e1002 (2017).
  17. Lin, H., et al. Methylome-wide Association Study of Atrial Fibrillation in Framingham Heart Study. Scientific Reports. 7, 40377 (2017).
  18. Wang, S., Li, J., Yuan, F., Huang, T., Cai, Y. D. Computational method for distinguishing lysine acetylation, sumoylation, and ubiquitination using the random forest algorithm with a feature selection procedure. Combinatorial Chemistry & High Throughput Screening. (2017).
  19. Zhang, Q., et al. Predicting Citrullination Sites in Protein Sequences Using mRMR Method and Random Forest Algorithm. Combinatorial Chemistry & High Throughput Screening. 20, 164-173 (2017).
  20. Cuena-Lombrana, A., Fois, M., Fenu, G., Cogoni, D., Bacchetta, G. The impact of climatic variations on the reproductive success of Gentiana lutea L. in a Mediterranean mountain area. International Journal of Biometeorology. (2018).
  21. Coghe, G., et al. Fatigue, as measured using the Modified Fatigue Impact Scale, is a predictor of processing speed improvement induced by exercise in patients with multiple sclerosis: data from a randomized controlled trial. Journal of Neurology. , (2018).
  22. Hong, H., et al. Applying genetic algorithms to set the optimal combination of forest fire related variables and model forest fire susceptibility based on data mining models. The case of Dayu County, China. Science of the Total Environment. 630, 1044-1056 (2018).
  23. Borges, D. L., et al. Photoanthropometric face iridial proportions for age estimation: An investigation using features selected via a joint mutual information criterion. Forensic Science International. 284, 9-14 (2018).
  24. Kohavi, R., John, G. H. Wrappers for feature subset selection. Artificial Intelligence. 97, 273-324 (1997).
  25. Yu, L., Liu, H. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research. 5, 1205-1224 (2004).
  26. Wexler, R. B., Martirez, J. M. P., Rappe, A. M. Chemical Pressure-Driven Enhancement of the Hydrogen Evolving Activity of Ni2P from Nonmetal Surface Doping Interpreted via Machine Learning. Journal of the American Chemical Society. (2018).
  27. Wijaya, S. H., Batubara, I., Nishioka, T., Altaf-Ul-Amin, M., Kanaya, S. Metabolomic Studies of Indonesian Jamu Medicines: Prediction of Jamu Efficacy and Identification of Important Metabolites. Molecular Informatics. 36, (2017).
  28. Shangkuan, W. C., et al. Risk analysis of colorectal cancer incidence by gene expression analysis. PeerJ. 5, e3003 (2017).
  29. Chu, C. M., et al. Gene expression profiling of colorectal tumors and normal mucosa by microarrays meta-analysis using prediction analysis of microarray, artificial neural network, classification, and regression trees. Disease Markers. , 634123 (2014).
  30. Fleuret, F. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research. 5, 1531-1555 (2004).
  31. Pacheco, J., Alfaro, E., Casado, S., Gámez, M., García, N. A GRASP method for building classification trees. Expert Systems with Applications. 39, 3241-3248 (2012).
  32. Jiao, X., et al. DAVID-WS: a stateful web service to facilitate gene/protein list analysis. Bioinformatics. 28, 1805-1806 (2012).
  33. Rappaport, N., et al. Rational confederation of genes and diseases: NGS interpretation via GeneCards, MalaCards and VarElect. Biomedical Engineering OnLine. 16, 72 (2017).
  34. Rebhan, M., Chalifa-Caspi, V., Prilusky, J., Lancet, D. GeneCards: integrating information about genes, proteins and diseases. Trends in Genet. 13, 163 (1997).
  35. Joosten, R. P., Long, F., Murshudov, G. N., Perrakis, A. The PDB_REDO server for macromolecular structure model optimization. IUCrJ. 1, 213-220 (2014).
  36. Maglott, D., Ostell, J., Pruitt, K. D., Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research. 39, D52-D57 (2011).
  37. Amberger, J. S., Bocchini, C. A., Schiettecatte, F., Scott, A. F., Hamosh, A. OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders. Nucleic Acids Research. 43, D789-D798 (2015).
  38. Boutet, E., et al. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. Methods in Molecular Biology. 1374, 23-54 (2016).
  39. Zerbino, D. R., et al. Ensembl 2018. Nucleic Acids Res. , (2017).
  40. McKusick, V. A., Amberger, J. S. The morbid anatomy of the human genome: chromosomal location of mutations causing disease. Journal of Medical Genetics. 30, 1-26 (1993).
  41. Finn, R. D., et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research. 44, D279-D285 (2016).
  42. Xue, Y., et al. GPS: a comprehensive www server for phosphorylation sites prediction. Nucleic Acids Research. 33, W184-W187 (2005).
  43. Deng, W., et al. GPS-PAIL: prediction of lysine acetyltransferase-specific modification sites from protein sequences. Scientific Reports. 6, 39787 (2016).
  44. Zhao, Q., et al. GPS-SUMO: a tool for the prediction of sumoylation sites and SUMO-interaction motifs. Nucleic Acids Research. 42, W325-W330 (2014).
  45. Wan, S., Duan, Y., Zou, Q. HPSLPred: An Ensemble Multi-Label Classifier for Human Protein Subcellular Location Prediction with Imbalanced Source. Proteomics. 17, (2017).
  46. Zhang, H., Zhu, L., Huang, D. S. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Scientific Reports. 7, 3217 (2017).
  47. Szklarczyk, D., et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research. 43, D447-D452 (2015).
  48. Chiaretti, S., et al. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood. 103, 2771-2778 (2004).
  49. Rowley, J. D., et al. Mapping chromosome band 11q23 in human acute leukemia with biotinylated probes: identification of 11q23 translocation breakpoints with a yeast artificial chromosome. Proceedings of the National Academy of Sciences of the United States of America. 87, 9358-9362 (1990).
  50. Rabbitts, T. H., et al. The chromosomal location of T-cell receptor genes and a T cell rearranging gene: possible correlation with specific translocations in human T cell leukaemia. Embo Journal. 4, 1461-1465 (1985).
  51. Yin, L., et al. SH2D1A mutation analysis for diagnosis of XLP in typical and atypical patients. Human Genetics. 105, 501-505 (1999).
  52. Brandau, O., et al. Epstein-Barr virus-negative boys with non-Hodgkin lymphoma are mutated in the SH2D1A gene, as are patients with X-linked lymphoproliferative disease (XLP). Human Molecular Genetics. 8, 2407-2413 (1999).
  53. Burnett, R. C., Thirman, M. J., Rowley, J. D., Diaz, M. O. Molecular analysis of the T-cell acute lymphoblastic leukemia-associated t(1;7)(p34;q34) that fuses LCK and TCRB. Blood. 84, 1232-1236 (1994).
  54. Taylor, G. M., et al. Genetic susceptibility to childhood common acute lymphoblastic leukaemia is associated with polymorphic peptide-binding pocket profiles in HLA-DPB1*0201. Human Molecular Genetics. 11, 1585-1597 (2002).
  55. Wadia, P. P., et al. Antibodies specifically target AML antigen NuSAP1 after allogeneic bone marrow transplantation. Blood. 115, 2077-2087 (2010).
  56. Wilson, D. M. 3rd, et al. Hex1: a new human Rad2 nuclease family member with homology to yeast exonuclease 1. Nucleic Acids Research. 26, 3762-3768 (1998).
  57. O’Sullivan, R. J., et al. Rapid induction of alternative lengthening of telomeres by depletion of the histone chaperone ASF1. Nature Structural & Molecular Biology. 21, 167-174 (2014).
  58. Lee-Sherick, A. B., et al. Aberrant Mer receptor tyrosine kinase expression contributes to leukemogenesis in acute myeloid leukemia. Oncogene. 32, 5359-5368 (2013).
  59. Guyon, I., Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research. 3, 1157-1182 (2003).
  60. John, G. H., Kohavi, R., Pfleger, K. Machine Learning: Proceedings of the Eleventh International Conference. 121-129 (1994).
  61. Jain, A., Zongker, D. Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence. 19, 153-158 (1997).
  62. Taylor, S. L., Kim, K. A jackknife and voting classifier approach to feature selection and classification. Cancer Informatics. 10, 133-147 (2011).
  63. Andresen, K., et al. Novel target genes and a valid biomarker panel identified for cholangiocarcinoma. Epigenetics. 7, 1249-1257 (2012).
  64. Guo, P., et al. Gene expression profile based classification models of psoriasis. Genomics. 103, 48-55 (2014).
  65. Xie, J., Wang, C. Using support vector machines with a novel hybrid feature selection method for diagnosis of erythemato-squamous diseases. Expert Systems with Applications. 38, 5809-5815 (2011).
  66. Zou, Q., Zeng, J., Cao, L., Ji, R. A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing. 173, 346-354 (2016).

Cite this Article
Feng, X., Wang, S., Liu, Q., Li, H., Liu, J., Xu, C., Yang, W., Shu, Y., Zheng, W., Yu, B., Qi, M., Zhou, W., Zhou, F. Selecting Multiple Biomarker Subsets with Similarly Effective Binary Classification Performances. J. Vis. Exp. (140), e57738, doi:10.3791/57738 (2018).
