Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Perspective
  • Published:

Machine learning in rare disease

Abstract

High-throughput profiling methods (such as genomics or imaging) have accelerated basic research and made deep molecular characterization of patient samples routine. These approaches provide a rich portrait of genes, molecular pathways and cell types involved in disease phenotypes. Machine learning (ML) can be a useful tool for extracting disease-relevant patterns from high-dimensional datasets. However, depending upon the complexity of the biological question, machine learning often requires many samples to identify recurrent and biologically meaningful patterns. Rare diseases are inherently limited in clinical cases, leading to few samples to study. In this Perspective, we outline the challenges and emerging solutions for using ML for small sample sets, specifically in rare diseases. Advances in ML methods for rare diseases are likely to be informative for applications beyond rare diseases for which few samples exist with high-dimensional data. We propose that the method community prioritize the development of ML techniques for rare disease research.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Combining datasets to increase data for training ML models.
Fig. 2: Representation learning can extract useful features from high-dimensional data.
Fig. 3: Strategies to reduce misinterpretation of ML model output in rare disease.
Fig. 4: Application of KGs can improve ML in rare disease.
Fig. 5: Feature-representation-transfer approaches learn representations from a source domain and apply them to a target domain.
Fig. 6: Combining multiple strategies strengthens the performance of ML models in rare disease.

Similar content being viewed by others

References

  1. Schaefer, J., Lehne, M., Schepers, J., Prasser, F. & Thun, S. The use of machine learning in rare diseases: a scoping review. Orphanet J. Rare Dis. 15, 145 (2020).

    Google Scholar 

  2. Decherchi, S., Pedrini, E., Mordenti, M., Cavalli, A. & Sangiorgi, L. Opportunities and challenges for machine learning in rare diseases. Front. Med. 8, 747612 (2021).

    Google Scholar 

  3. Li, A. et al. Unsupervised analysis of transcriptomic profiles reveals six glioma subtypes. Cancer Res. 69, 2091–2099 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  4. Senate and House of Representatives of the United States of America in Congress. Orphan Drug Act (1983).

  5. Agarwal, V. et al. Learning statistical models of phenotypes using noisy labeled training data. J. Am. Med. Inform. Assoc. 23, 1166–1173 (2016).

    PubMed  PubMed Central  Google Scholar 

  6. Frénay, B. & Verleysen, M. Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst. 25, 845–869 (2014).

    PubMed  Google Scholar 

  7. Toh, T. S., Dondelinger, F. & Wang, D. Looking beyond the hype: applied AI and machine learning in translational medicine. EBioMedicine 47, 607–615 (2019).

    PubMed  PubMed Central  Google Scholar 

  8. Clarke, R. et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat. Rev. Cancer 8, 37–49 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  9. Altman, N. & Krzywinski, M. The curse(s) of dimensionality. Nat. Methods 15, 399–400 (2018).

    CAS  PubMed  Google Scholar 

  10. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

    PubMed  Google Scholar 

  11. Leek, J. T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).

    PubMed  PubMed Central  Google Scholar 

  12. Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).

    PubMed  PubMed Central  Google Scholar 

  13. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).

    PubMed  PubMed Central  Google Scholar 

  14. Dorrity, M. W., Saunders, L. M., Queitsch, C., Fields, S. & Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 11, 1537 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  15. Chellappa, R. & Turaga, P. Feature selection. In Computer Vision: a Reference Guide 1–5 (Springer International, 2020).

  16. Chen, C.-H., Härdle, W. & Unwin, A. Handbook of Data Visualization (Springer, 2008).

  17. Jolliffe, I. T. & Cadima, J. Principal component analysis: a review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 374, 20150202 (2016).

    PubMed  PubMed Central  Google Scholar 

  18. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://doi.org/10.48550/arXiv.1802.03426 (2018).

  19. Nguyen, L. H. & Holmes, S. Ten quick tips for effective dimensionality reduction. PLoS Comput. Biol. 15, e1006907 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  20. Wattenberg, M., Viégas, F. & Johnson, I. How to use t-SNE effectively. Distill 1, https://doi.org/10.23915/distill.00002 (2016).

  21. Way, G. P., Zietz, M., Rubinetti, V., Himmelstein, D. S. & Greene, C. S. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol. 21, 109 (2020).

    PubMed  PubMed Central  Google Scholar 

  22. de Souto, M. C. P., Costa, I. G., de Araujo, D. S. A., Ludermir, T. B. & Schliep, A. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics 9, 497 (2008).

    Google Scholar 

  23. Kothari, S. et al. Removing batch effects from histopathological images for enhanced cancer diagnosis. IEEE J. Biomed. Health Inform. 18, 765–772 (2014).

    PubMed  PubMed Central  Google Scholar 

  24. Dwivedi, S. K., Tjärnberg, A., Tegnér, J. & Gustafsson, M. Deriving disease modules from the compressed transcriptional space embedded in a deep autoencoder. Nat. Commun. 11, 856 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  25. Fertig, E. J., Ding, J., Favorov, A. V., Parmigiani, G. & Ochs, M. F. CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics 26, 2792–2793 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  26. Quellec, G., Lamard, M., Conze, P.-H., Massin, P. & Cochener, B. Automatic detection of rare pathologies in fundus photographs using few-shot learning. Med. Image Anal. 61, 101660 (2020).

    PubMed  Google Scholar 

  27. Arvaniti, E. & Claassen, M. Sensitive detection of rare disease-associated cell subsets via representation learning. Nat. Commun. 8, 14825 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  28. Chaabane, I., Guermazi, R. & Hammami, M. Enhancing techniques for learning decision trees from imbalanced data. Adv. Data Anal. Classif. 14, 677–745 (2020).

    Google Scholar 

  29. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

    Google Scholar 

  30. Köpcke, F. et al. Evaluating predictive modeling algorithms to assess patient eligibility for clinical trials from routine data. BMC Med. Inform. Decis. Mak. 13, 134 (2013).

    PubMed  PubMed Central  Google Scholar 

  31. Banerjee, J. et al. Integrative analysis identifies candidate tumor microenvironment and intracellular signaling pathways that define tumor heterogeneity in NF1. Genes 11, 226 (2020).

  32. Colbaugh, R., Glass, K., Rudolf, C., & Tremblay, M. Learning to identify rare disease patients from electronic health records. AMIA Annu. Symp. Proc. 2018, 340–347 (2018).

    PubMed  PubMed Central  Google Scholar 

  33. Heiselet, B., Serre, T., Pontil, M. & Poggio, T. Component-based face detection. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition I (CPRV, 2001).

  34. Kasinski, A. & Schmidt, A. The architecture of the face and eyes detection system based on cascade classifiers. In Computer Recognition Systems 2 (ed. Kurzynski, M. et al.) 124–131 (Springer, 2007).

  35. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at arXiv https://doi.org/10.48550/arXiv.1301.3781 (2013).

  36. Han, S., Williamson, B. D. & Fong, Y. Improving random forest predictions in small datasets from two-phase sampling designs. BMC Med. Inform. Decis. Mak. 21, 322 (2021).

    PubMed  PubMed Central  Google Scholar 

  37. Ambert, K. H. & Cohen, A. M. A system for classifying disease comorbidity status from medical discharge summaries using automated hotspot and negated concept detection. J. Am. Med. Inform. Assoc. 16, 590–595 (2009).

    PubMed  PubMed Central  Google Scholar 

  38. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).

    Google Scholar 

  39. More, A. Survey of resampling techniques for improving classification performance in unbalanced datasets. Preprint at arXiv https://doi.org/10.48550/arXiv.1608.06048 (2016).

  40. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT, 2016).

  41. Futoma, J., Simons, M., Doshi-Velez, F. & Kamaleswaran, R. Generalization in clinical prediction models: the blessing and curse of measurement indicator variables. Crit. Care Explor. 3, e0453 (2021).

    Google Scholar 

  42. Okser, S. et al. Regularized machine learning in the genetic prediction of complex traits. PLoS Genet. 10, e1004754 (2014).

    PubMed  PubMed Central  Google Scholar 

  43. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B Stat. Methodol. 67, 301–320 (2005).

    Google Scholar 

  44. Founta, K. et al. Gene targeting in amyotrophic lateral sclerosis using causality-based feature selection and machine learning. Mol. Med. 29, 12 (2023).

    CAS  PubMed  PubMed Central  Google Scholar 

  45. Torang, A., Gupta, P. & Klinke, D. J. 2nd An elastic-net logistic regression approach to generate classifiers and gene signatures for types of immune cells and T helper cell subsets. BMC Bioinformatics 20, 433 (2019).

    Google Scholar 

  46. Dincer, A. B., Celik, S., Hiranuma, N. & Lee, S.-I. DeepProfile: deep learning of cancer molecular profiles for precision medicine. Preprint at bioRxiv https://doi.org/10.1101/278739 (2018).

  47. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at arXiv https://doi.org/10.48550/arXiv.1312.6114 (2013).

  48. Sánchez Fernández, I. et al. Deep learning in rare disease. Detection of tubers in tuberous sclerosis complex. PLoS ONE 15, e0232376 (2020).

    PubMed  PubMed Central  Google Scholar 

  49. Mungall, C. J. et al. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 45, D712–D722 (2017).

    CAS  PubMed  Google Scholar 

  50. Himmelstein, D. S. et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife 6, e26726 (2017).

    PubMed  PubMed Central  Google Scholar 

  51. Callahan, T. J., Tripodi, I. J., Hunter, L. E. & Baumgartner, W. A. A framework for automated construction of heterogeneous large-scale biomedical knowledge graphs. Preprint at bioRxiv https://doi.org/10.1101/2020.04.30.071407 (2020).

  52. Percha, B. & Altman, R. B. A global network of biomedical relationships derived from text. Bioinformatics 34, 2614–2624 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  53. Orphanet https://www.orpha.net/consor/cgi-bin/index.php (2023).

  54. Queralt-Rosinach, N. et al. Structured reviews for data and knowledge-driven research. Database 2020, baaa015 (2020).

    PubMed  PubMed Central  Google Scholar 

  55. Moon, C. et al. Learning drug–disease–target embedding (DDTE) from knowledge graphs to inform drug repurposing hypotheses. J. Biomed. Inform. 119, 103838 (2021).

    PubMed  Google Scholar 

  56. Li, X. et al. Improving rare disease classification using imperfect knowledge graph. BMC Med. Inform. Decis. Mak. 19, 238 (2019).

    PubMed  PubMed Central  Google Scholar 

  57. Sosa, D. N. et al. A literature-based knowledge graph embedding method for identifying drug repurposing opportunities in rare diseases. In Biocomputing 2020 463–474 (World Scientific, 2019).

  58. Shen, F. et al. Rare disease knowledge enrichment through a data-driven approach. BMC Med. Inform. Decis. Mak. 19, 32 (2019).

    PubMed  PubMed Central  Google Scholar 

  59. Rao, A. et al. Phenotype-driven gene prioritization for rare diseases using graph convolution on heterogeneous networks. BMC Med. Genomics 11, 57 (2018).

    Google Scholar 

  60. Köhler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207–D1217 (2021).

    PubMed  Google Scholar 

  61. Rolland, T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  62. Martens, M. et al. WikiPathways: connecting communities. Nucleic Acids Res. 49, D613–D621 (2021).

    CAS  PubMed  Google Scholar 

  63. Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).

    Google Scholar 

  64. Lee, S.-I. et al. A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia. Nat. Commun. 9, 42 (2018).

    PubMed  PubMed Central  Google Scholar 

  65. Mao, W., Zaslavsky, E., Hartmann, B. M., Sealfon, S. C. & Chikina, M. Pathway-level information extractor (PLIER) for gene expression data. Nat. Methods 16, 607–610 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  66. Taroni, J. N. et al. MultiPLIER: a transfer learning framework for transcriptomics reveals systemic features of rare disease. Cell Syst. 8, 380–394 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  67. Greene, D., NIHR BioResource, Richardson, S. & Turro, E. Phenotype similarity regression for identifying the genetic determinants of rare diseases. Am. J. Hum. Genet. 98, 490–499 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  68. Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).

    CAS  PubMed  PubMed Central  Google Scholar 

  69. Ionita-Laza, I., Capanu, M., De Rubeis, S., McCallum, K. & Buxbaum, J. D. Identification of rare causal variants in sequence-based studies: methods and applications to VPS13B, a gene involved in Cohen syndrome and autism. PLoS Genet. 10, e1004729 (2014).

    PubMed  PubMed Central  Google Scholar 

  70. Greene, D., NIHR BioResource, Richardson, S. & Turro, E. A fast association test for identifying pathogenic variants involved in rare diseases. Am. J. Hum. Genet. 101, 104–114 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  71. Boycott, K. M., Vanstone, M. R., Bulman, D. E. & MacKenzie, A. E. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nat. Rev. Genet. 14, 681–691 (2013).

    CAS  PubMed  Google Scholar 

  72. Wright, C. F., FitzPatrick, D. R. & Firth, H. V. Paediatric genomics: diagnosing rare disease in children. Nat. Rev. Genet. 19, 253–268 (2018).

    CAS  PubMed  Google Scholar 

  73. Adams, D. R. & Eng, C. M. Next-generation sequencing to diagnose suspected genetic disorders. N. Engl. J. Med. 379, 1353–1362 (2018).

    CAS  PubMed  Google Scholar 

  74. Byrd, J. B., Greene, A. C., Prasad, D. V., Jiang, X. & Greene, C. S. Responsible, practical genomic data sharing that accelerates research. Nat. Rev. Genet. 21, 615–629 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  75. Rieke, N. et al. The future of digital health with federated learning. NPJ Digit. Med. 3, 119 (2020).

    PubMed  PubMed Central  Google Scholar 

  76. Yan, Y. et al. A continuously benchmarked and crowdsourced challenge for rapid development and evaluation of models to predict COVID-19 diagnosis and hospitalization. JAMA Netw. Open 4, e2124946 (2021).

    PubMed  PubMed Central  Google Scholar 

  77. Lundberg, S. M. et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2, 749–760 (2018).

    PubMed  PubMed Central  Google Scholar 

  78. Zhou, G., Zhang, J., Su, J., Shen, D. & Tan, C. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics 20, 1178–1190 (2004).

    CAS  PubMed  Google Scholar 

  79. Blitzer, J., McDonald, R. & Pereira, F. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (eds. Jurafsky, D. & Gaussier, E.) 120–128 (Association for Computational Linguistics, 2006).

  80. Wang, C. & Mahadevan, S. Heterogeneous domain adaptation using manifold alignment. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence 2 (ed. Walsh, T.) 1541–1546 (AAAI, 2011).

  81. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  82. Collado-Torres, L. et al. Reproducible RNA-seq analysis using recount2. Nat. Biotechnol. 35, 319–321 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  83. Kuhn, M. & Johnson, K. Applied Predictive Modeling (Springer, 2013).

  84. Davis, J. & Goadrich, M. The relationship between precision–recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (eds. Cohen, W. W. & Moore, A.) 233–240 (Association for Computing Machinery, 2006).

  85. Hastie, T., Friedman, J. & Tibshirani, R. The Elements of Statistical Learning (Springer, 2001).

  86. Shin, H.-C. et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016).

    PubMed  Google Scholar 

Download references

Acknowledgements

This work was supported in part by Alex’s Lemonade Stand Foundation, the National Human Genome Research Institute (R01 HG010067), the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01 HD109765), the Neurofibromatosis Therapeutic Acceleration Program, the Children’s Tumor Foundation and the Gilbert Family Foundation.

Author information

Authors and Affiliations

Authors

Contributions

Authorship was determined using ICMJE recommendations. Conceptualization: J.B., J.N.T., R.J.A., C.G., J.G; data curation: not applicable; formal analysis: not applicable; funding acquisition: R.J.A.; investigation: J.B., J.N.T., R.J.A.; methodology: J.B., J.N.T., R.J.A.; project administration: J.B.; resources: J.B., J.N.T., R.J.A.; software: not applicable; supervision: J.B., C.G.; validation: not applicable; visualization: D.V.P.; writing (original draft): J.B., J.N.T., R.J.A.; writing (review and editing): J.B., J.N.T., R.J.A., C.G., J.G.

Corresponding author

Correspondence to Casey Greene.

Ethics declarations

Competing interests

J.G. is currently employed at Tempus Labs, a precision medicine company. J.N.T. and D.V.P. are employed with Alex’s Lemonade Stand Foundation, a research funder. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Pejman Mohammadi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Banerjee, J., Taroni, J.N., Allaway, R.J. et al. Machine learning in rare disease. Nat Methods 20, 803–814 (2023). https://doi.org/10.1038/s41592-023-01886-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41592-023-01886-z

This article is cited by

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing