Biotechnology Advances

Volume 49, July–August 2021, 107739
Research review paper
Using machine learning approaches for multi-omics data analysis: A review

https://doi.org/10.1016/j.biotechadv.2021.107739

Highlights

  • Machine learning methods are novel techniques to integrate omics datasets

  • Recently, publications based on ‘multi-omics integration’ have gained popularity

  • Integration of omics data using concatenation, model- or transformation-based methods

  • Multi-omics studies offer a more comprehensive view of complex diseases

  • Recommendation flowchart included for interdisciplinary professionals

Abstract

With the development of modern high-throughput omic measurement platforms, it has become essential for biomedical studies to undertake an integrative (combined) approach to fully utilise these data to gain insights into biological systems. Data from various omics sources such as genetics, proteomics, and metabolomics can be integrated to unravel the intricate workings of systems biology using machine learning-based predictive algorithms. Machine learning methods offer novel techniques to integrate and analyse the various omics data, enabling the discovery of new biomarkers. These biomarkers have the potential to help in accurate disease prediction, patient stratification and delivery of precision medicine. This review paper explores different integrative machine learning methods which have been used to provide an in-depth understanding of biological systems during normal physiological functioning and in the presence of a disease. It provides insight and recommendations for interdisciplinary professionals who envisage employing machine learning skills in multi-omics studies.

Introduction

Digital information is growing rapidly, in terms of five V’s (volume, velocity, veracity, variety and value), and hence this is hailed as the big data era (BCS, 2014; Bellazzi, 2014; Lee and Yoon, 2017). Health-based big data including linked information for patients, such as their clinical data (for example gender, age, pathological and physiological history) and omics data (such as genetics, proteomics and metabolomics) has now become more widely available (Canuel et al., 2015; Singhal et al., 2016). Recently, such data has been used for precision (also called personalised or stratified) medicine to provide customised healthcare, i.e. providing a bespoke treatment for individuals (Gibson et al., 2015; Kalaitzopoulos, 2016; Malod-Dognin et al., 2017). There has been unprecedented growth in the development of precision medicine supported by ML (machine learning) approaches (Delavan et al., 2017; Peterson et al., 2013; Zou et al., 2017) and data mining tools (Chawla and Davis, 2013; Cheng et al., 2015; Margolies et al., 2016). These techniques have also helped to discover novel omics biological markers which can identify the molecular cause of a disease.

A biomarker is a substance, structure, or process that can be measured in the human body or its products and can provide surrogate information about the presence of a disease/condition (Strimbu and Tavel, 2010). Molecular biomarkers are discovered by analysing the cascade of information provided by different omics (Debnath et al., 2010). For example, the high-sensitivity C-Reactive protein test provides an accurate and quantitative risk assessment for cardiovascular disease (Pfützner and Forst, 2006; Shrivastava et al., 2015). Biomarkers play a significant role in planning preventive measures and decisions for patients (Nielsen, 2017) and can be classified as either diagnostic, prognostic or predictive (Le et al., 2016; Shaw et al., 2015). Diagnostic biomarkers are used for determining the presence of disease in a patient, while prognostic biomarkers provide information on the overall outcome with or without the standard treatment (Carlomagno et al., 2017). Predictive biomarkers are used to identify who is at risk of an outcome (Nalejska et al., 2014). All of these biomarkers can also be used to identify which treatment will be most suitable for a given patient. For example, the ADNI (Alzheimer's Disease Neuroimaging Initiative) study used a combination of neuroimaging, biochemical and genetic biomarkers to discriminate early Alzheimer’s patients from healthy volunteers with an accuracy of 98% (Gupta et al., 2019). Similarly, different forms of Parkinson's syndromes have been investigated by developing an automated tool that fuses multi-site diffusion-weighted MRI imaging biomarkers and disease rating score (MDS-UPDRS III) (Archer et al., 2019). Biomarkers can help identify high-risk individuals before their physiological symptoms are evident. Moreover, they also help in measuring disease progression (Mandel et al., 2010).

In the context of precision medicine, ML has been used to develop diagnostic, prognostic and predictive tools from single omics data (Dias-Audibert et al., 2020; Mamoshina et al., 2018; Sonsare and Gunavathi, 2019). However, ML performance may deteriorate for certain single omics, such as gene data, due to their inherent characteristics (Kim et al., 2020). ML methods are now also being applied to multi-omics data (Bersanelli et al., 2016) to investigate and interpret the relationships between data and phenotypes (Kim and Tagkopoulos, 2018). Although ML analysis of multi-omics is still in its embryonic stage, it has already been explored for a wide range of applications, as reported in recent reviews on brain diseases (Garali et al., 2018; Young et al., 2013), diabetes (Kavakiotis et al., 2017), cancers (Borad and LoRusso, 2017; Chaudhary et al., 2017; Wong et al., 2016), cardiovascular disease (Weng et al., 2017), medical imaging (Erickson et al., 2017), single-cell analysis in humans (Cao et al., 2020; Ma et al., 2020a) and plant science studies (Acharjee et al., 2011). Currently, many multi-omic reviews focus on individual sub-topics, for example designing studies (Haas et al., 2017; Hasin et al., 2017), setting up workflows (Kohl et al., 2014), choosing software tools (Misra et al., 2019) and evaluating overfitted performance (McCabe et al., 2020).

In contrast, this review takes a broader focus, presenting an interdisciplinary perspective to readers new to this domain by providing a background on multi-omics and ML. It takes forward the integration terminologies introduced by Ritchie et al. (2015) and summarises the recent state-of-the-art integrative approaches. We aim to cover various integration methods concisely and include a recommendation flowchart enabling interdisciplinary scientists to have a quick head start in this domain (Bersanelli et al., 2016; Nguyen and Wang, 2020).

Scope of this review: This review investigates the two primary learning strategies in ML, i.e. supervised and unsupervised, which are commonly used within the context of multi-omics integration. This review considers multi-omics integration as a process of combining different single omics. Although various ML specialisations such as reinforcement (Coronato et al., 2020), hybrid (Zhou et al., 2019), multi-view (Zhao et al., 2017) and self-supervised learning (Chen et al., 2019) are now emerging in generic healthcare applications, they have not yet gained enough momentum in multi-omics analysis, hence they remain beyond the scope of this review.
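To make the idea of "combining different single omics" concrete, the following is a minimal sketch of concatenation-based integration, assuming each omics block is a mapping from sample ID to a list of feature values. The function names (`zscore_columns`, `concatenate_omics`), the sample IDs and the toy values are hypothetical and are not taken from the review; standard-library Python is used for illustration only. The sketch keeps only samples measured in every block, standardises each feature within its block, and joins the blocks into one feature vector per sample, which could then be passed to a supervised or unsupervised learner.

```python
from statistics import mean, pstdev


def zscore_columns(block):
    """Standardise each feature (column) of one omics block to mean 0, SD 1."""
    samples = sorted(block)
    n_features = len(block[samples[0]])
    scaled = {s: [] for s in samples}
    for j in range(n_features):
        column = [block[s][j] for s in samples]
        mu, sd = mean(column), pstdev(column)
        for s in samples:
            # A constant feature carries no information, so map it to 0.
            scaled[s].append((block[s][j] - mu) / sd if sd > 0 else 0.0)
    return scaled


def concatenate_omics(blocks):
    """Join omics blocks on shared sample IDs into one vector per sample."""
    shared = set.intersection(*(set(b) for b in blocks.values()))
    if not shared:
        raise ValueError("no sample is present in every omics block")
    scaled = {
        name: zscore_columns({s: block[s] for s in shared})
        for name, block in blocks.items()
    }
    # Blocks are concatenated in alphabetical order of their names.
    return {
        s: [v for name in sorted(blocks) for v in scaled[name][s]]
        for s in sorted(shared)
    }


# Toy data (hypothetical values): sample p3 lacks metabolomics, p4 lacks proteomics.
blocks = {
    "proteomics": {"p1": [1.0, 10.0], "p2": [3.0, 10.0], "p3": [5.0, 10.0]},
    "metabolomics": {"p1": [0.2], "p2": [0.4], "p4": [0.9]},
}
combined = concatenate_omics(blocks)
```

Scaling within each block before joining is one common design choice, since omics platforms report values on very different scales; dropping incomplete samples is the simplest way to handle missing blocks and is only one of several possible strategies.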

This paper is organised as follows. Section 2 provides a short background on multi-omics and ML. Section 3 describes how ML is employed for multi-omics analysis and the real-world challenges involved. Section 4 presents details of different multi-omics integration approaches. Section 5 discusses published multi-omics studies that use ML methods. Section 6 describes a recommendation flowchart for choosing an appropriate method for multi-omics integration. Conclusions are provided in Section 7.

Section snippets

Multi-omics

In living beings, genetic information in the cells flows from DNA (deoxyribonucleic acid) to mRNA (messenger ribonucleic acid) to protein, as described by the central dogma of molecular biology (Lodish et al., 2000). This flow of information is often considered analogous to a computer system, an analogy that has facilitated the understanding of biological information processing (Wang and Gribskov, 2005; D’Onofrio and An, 2010).
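The computer-system analogy can be illustrated with a toy sketch of the DNA to mRNA to protein flow. The simplifications here are assumptions made for illustration: transcription is modelled as rewriting the coding strand with U in place of T, and `CODON_TABLE` is only a small subset of the standard genetic code, sufficient for the example sequence. The function names and the input sequence are hypothetical.

```python
# Subset of the standard genetic code ("*" marks a stop codon).
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCC": "A",  # alanine
    "UGG": "W",  # tryptophan
    "UAA": "*",
    "UAG": "*",
    "UGA": "*",
}


def transcribe(coding_dna):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")
protein = translate(mrna)
print(mrna, protein)  # AUGGCCUGGUAA MAW
```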

The study of DNA, mRNA and proteins is broadly denoted as genomics,

Challenges in multi-omics analysis using machine learning

The use of ML to analyse multi-omic data generated by high-throughput platforms poses several distinctive challenges. They can be summarised as follows.

Data integration methods for multi-omics

In recent years, various new data integration methods have been introduced from the modern developments in mathematical, statistical and computational sciences. For the benefit of the readers, Table 5 includes a summary of a few reviews which cover the breadth of multi-omics integration for generic as well as specialised domains such as oncology (Buescher and Driggers, 2016; Nicora et al., 2020) and toxicology (Canzler et al., 2020). Most of these reviews have strived to introduce different

Application of integrative methods in multi-omics studies

The availability of high-throughput omics provides a unique opportunity to explore the complex relationships between different omics and phenotypic targets instead of mono-omics evaluation. This section describes various multi-omics studies which deployed methods investigated in the previous section. Table 8 summarises different phenotypic target-based, multi-omics studies published and tabulates them across the span of 7 main omics namely, genomics, transcriptomics, metabolomics, proteomics,

Recommendations

Today a plethora of multi-omic integration methods are available for both supervised and unsupervised learning as evident in the current review. This information can overwhelm interdisciplinary scientists and would require a time-consuming effort to understand the challenging mathematical and computational concepts behind them. Hence, we suggest that interdisciplinary teams working on multi-omics always include ML practitioners to assist with the choice of methods, the development of solutions,

Conclusions

This paper reviewed various ML approaches used for the integration of multi-omics data for analysis. A concise background of multi-omics and ML was presented. It examined the concatenation-, model- and transformation-based integration methods, employed for multi-omics data along with their advantages and disadvantages. Also, various existing multi-omics studies have been summarised. Finally, a recommendation flowchart is presented for interdisciplinary professionals to choose an appropriate

Disclosure

The authors have nothing to disclose.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 633983. Ewan Pearson and Emanuele Trucco would like to acknowledge the National Institute for Health Research (NIHR) global health research unit on global diabetes outcomes research at the University of Dundee (INSPIRED project, Award number 16/136/102) for useful discussions.

References (352)

  • M. Caffrey et al.

    LIPIDAT: A database of lipid phase transition temperatures and enthalpy changes. DMPC data subset analysis

    Chem. Phys. Lipids

    (1992)
  • M.P. Campbell et al.

    Validation of the curation pipeline of UniCarb-DB: Building a global glycan reference MS/MS repository

    Biochim. Biophys. Acta BBA - Proteins Proteomics, Computational Proteomics in the Post-Identification Era

    (2014)
  • L. Chen et al.

    Self-supervised learning for medical image analysis using image context restoration

    Med. Image Anal.

    (2019)
  • J.G. Cleary et al.

    K*: An Instance-based Learner Using an Entropic Distance Measure, in: Proceedings of the Twelfth International Conference on Machine Learning, ICML’95

    (1995)
  • B.M. de Andrade et al.

    Comparison of the performance of multiclass classifiers in chemical data: Addressing the problem of overfitting with the permutation test

    Chemom. Intell. Lab. Syst.

    (2020)
  • R. Domingues et al.

    A comparative evaluation of outlier detection algorithms: Experiments and analyses

    Pattern Recogn.

    (2018)
  • T.M.D. Ebbels et al.

    Bioinformatic methods in NMR-based metabolic profiling

    Prog. Nucl. Magn. Reson. Spectrosc.

    (2009)
  • A. Acharjee et al.

    Integration of multi-omics data for prediction of phenotypic traits using random forest

    BMC Bioinformat.

    (2016)
  • I. Agache et al.

    Asthma biomarkers: do they bring precision medicine closer to the clinic?

    Allergy, Asthma Immunol. Res.

    (2017)
  • B. Alberts et al.

    Molecular Biology of the Cell 5E

    (2008)
  • E.K. Alidjinou et al.

    RNA and DNA Sanger sequencing versus next-generation sequencing for HIV-1 drug resistance testing in treatment-naive patients

    J. Antimicrob. Chemother.

    (2017)
  • N.S. Altman

    An introduction to kernel and nearest-neighbor nonparametric regression

    Am. Stat.

    (1992)
  • D.R. Amancio et al.

    A systematic comparison of supervised classifiers

    PLoS One

    (2014)
  • Amazon EC2

    Amaz. Web Serv. Inc

  • J. Antonelli et al.

    Statistical workflow for feature selection in human metabolomics data

    Metabolites

    (2019)
  • R. Argelaguet et al.

    Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets

    Mol. Syst. Biol.

    (2018)
  • M. Armbrust et al.

    A view of cloud computing

    Commun. ACM

    (2010)
  • B. Aslam et al.

    Proteomics: technologies and their applications

    J. Chromatogr. Sci.

    (2017)
  • N. Auslander et al.

    A joint analysis of transcriptomic and metabolomic data uncovers enhanced enzyme-metabolite coupling in breast cancer

    Sci. Rep.

    (2016)
  • M. Awad et al.

    Support vector regression

  • M.J. Azur et al.

    Multiple imputation by chained equations: what is it and how does it work?

    Int. J. Methods Psychiatr. Res.

    (2011)
  • S. Badillo et al.

    An introduction to machine learning

    Clin. Pharmacol. Ther.

    (2020)
  • J.W. Barnes et al.

    Novel methods in pulmonary hypertension phenotyping in the age of precision medicine (2015 Grover Conference series)

    Pulm. Circ.

    (2016)
  • Z. Barnett-Itzhaki et al.

    Machine learning vs. classic statistics for the prediction of IVF outcomes

    J. Assist. Reprod. Genet.

    (2020)
  • E. Bavafaye Haghighi et al.

    Hierarchical classification of cancers of unknown primary using multi-omics data

    Cancer Informat.

    (2019)
  • BCS et al.

    Big Data: Opportunities and challenges

    (2014)
  • R. Bellazzi

    Big data and biomedical informatics: a challenging opportunity

    Yearb. Med. Inform.

    (2014)
  • S. Benjamens et al.

    The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database

    Npj Digit. Med.

    (2020)
  • D.A. Benson et al.

    GenBank

    Nucleic Acids Res.

    (2011)
  • M. Bersanelli et al.

    Methods for the integration of multi-omics data: mathematical aspects

    BMC Bioinformat.

    (2016)
  • A. Bhardwaj et al.

    Multi-omics data and analytics integration in ovarian cancer

    Artif. Intell. Appl. Innov.

    (2020)
  • C.M. Bishop

    Neural Networks for Pattern Recognition

    (1995)
  • C.M. Bishop

    Pattern recognition and machine learning, Information science and statistics

    (2006)
  • J.T. Bjerrum et al.

    Integration of transcriptomics and metabonomics: improving diagnostics, biomarker identification and phenotyping in ulcerative colitis

    Metabolomics Off. J. Metabolomic Soc.

    (2014)
  • Black box medicine and transparency (Executive Summary)

    PHG Foundation (University of Cambridge)

    (2020)
  • S. Boellner et al.

    Reverse phase protein arrays—quantitative assessment of multiple biomarkers in biopsies for clinical use

    Microarrays

    (2015)
  • E. Bonnet et al.

    Integrative multi-omics module network inference with lemon-tree

    PLoS Comput. Biol.

    (2015)
  • C. Bowd et al.

    Relevance vector machine and support vector machine classifier analysis of scanning laser polarimetry retinal nerve fiber layer measurements

    Invest. Ophthalmol. Vis. Sci.

    (2005)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)

1 These authors contributed equally to this study (shared first authorship).
