Research review paper
Using machine learning approaches for multi-omics data analysis: A review
Introduction
Digital information is growing rapidly in terms of the five V’s (volume, velocity, veracity, variety and value), and the present period is therefore hailed as the big data era (BCS, 2014; Bellazzi, 2014; Lee and Yoon, 2017). Health-based big data, including linked patient information such as clinical data (for example gender, age, and pathological and physiological history) and omics data (such as genetics, proteomics and metabolomics), has now become more widely available (Canuel et al., 2015; Singhal et al., 2016). Recently, such data has been used in precision (also called personalised or stratified) medicine to provide customised healthcare, i.e. a bespoke treatment for each individual (Gibson et al., 2015; Kalaitzopoulos, 2016; Malod-Dognin et al., 2017). There has been unprecedented growth in the development of precision medicine supported by ML (machine learning) approaches (Delavan et al., 2017; Peterson et al., 2013; Zou et al., 2017) and data mining tools (Chawla and Davis, 2013; Cheng et al., 2015; Margolies et al., 2016). These techniques have also helped to discover novel omics biological markers which can identify the molecular cause of a disease.
A biomarker is a substance, structure, or process that can be measured in the human body or its products and can provide surrogate information about the presence of a disease/condition (Strimbu and Tavel, 2010). Molecular biomarkers are discovered by analysing the cascade of information provided by different omics (Debnath et al., 2010). For example, the high-sensitivity C-Reactive protein test provides an accurate and quantitative risk assessment for cardiovascular disease (Pfützner and Forst, 2006; Shrivastava et al., 2015). Biomarkers play a significant role in planning preventive measures and decisions for patients (Nielsen, 2017) and can be classified as diagnostic, prognostic or predictive (Le et al., 2016; Shaw et al., 2015). Diagnostic biomarkers are used for determining the presence of disease in a patient, while prognostic biomarkers provide information on the overall outcome with or without the standard treatment (Carlomagno et al., 2017). Predictive biomarkers are used to identify who is at risk of an outcome (Nalejska et al., 2014). All of these biomarkers can also be used to identify which treatment will be most suitable for a given patient. For example, the ADNI (Alzheimer's Disease Neuroimaging Initiative) study used a combination of neuroimaging, biochemical and genetic biomarkers to discriminate early Alzheimer’s patients from healthy volunteers with an accuracy of 98% (Gupta et al., 2019). Similarly, different forms of Parkinson's syndromes have been investigated by developing an automated tool that fuses multi-site diffusion-weighted MRI imaging biomarkers and a disease rating score (MDS-UPDRS III) (Archer et al., 2019). Biomarkers can help identify high-risk individuals before their physiological symptoms are evident. Moreover, they also help in measuring disease progression (Mandel et al., 2010).
In the context of precision medicine, ML has been used to develop diagnostic, prognostic and predictive tools from single omics data (Dias-Audibert et al., 2020; Mamoshina et al., 2018; Sonsare and Gunavathi, 2019). However, ML performance may deteriorate for certain single omics, such as gene data, owing to their inherent characteristics (Kim et al., 2020). ML methods are now also being applied to multi-omics data (Bersanelli et al., 2016) to investigate and interpret the relationships between data and phenotypes (Kim and Tagkopoulos, 2018). Although ML analysis of multi-omics is still in its embryonic stage, it has already been explored for a wide range of applications, as reported in recent reviews on brain diseases (Garali et al., 2018; Young et al., 2013), diabetes (Kavakiotis et al., 2017), cancers (Borad and LoRusso, 2017; Chaudhary et al., 2017; Wong et al., 2016), cardiovascular disease (Weng et al., 2017), medical imaging (Erickson et al., 2017), single-cell analysis in humans (Cao et al., 2020; Ma et al., 2020a) and plant science studies (Acharjee et al., 2011). Currently, many of the multi-omic reviews focus on individual sub-topics, for example designing studies (Haas et al., 2017; Hasin et al., 2017), setting up workflows (Kohl et al., 2014), choosing software tools (Misra et al., 2019) and evaluating overfitted performance (McCabe et al., 2020).
In contrast, this review takes a broader focus, presenting an interdisciplinary perspective to readers new to this domain by providing a background on multi-omics and ML. It takes forward the integration terminologies introduced by Ritchie et al. (2015) and summarises the recent state-of-the-art integrative approaches. We aim to cover various integration methods concisely and include a recommendation flowchart to give interdisciplinary scientists a quick head start in this domain (Bersanelli et al., 2016; Nguyen and Wang, 2020).
Scope of this review: This review investigates the two primary learning strategies in ML, i.e. supervised and unsupervised, which are commonly used within the context of multi-omics integration. This review considers multi-omics integration as a process of combining different single omics. Although various ML specialisations such as reinforcement (Coronato et al., 2020), hybrid (Zhou et al., 2019), multi-view (Zhao et al., 2017) and self-supervised learning (Chen et al., 2019) are now emerging in generic healthcare applications, they have not yet gained enough momentum in multi-omics analysis, hence they remain beyond the scope of this review.
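To make the two learning strategies concrete, the following minimal sketch combines two single-omics blocks measured on the same samples by simple concatenation, and then applies a supervised learner (nearest centroid) and an unsupervised learner (two-cluster k-means) to the combined matrix. The toy values, the sample labels and all function names are assumptions made for illustration only; they are not taken from any study cited here, and real analyses would normally rely on an established ML library.

```python
from statistics import mean, pstdev


def zscore_block(block):
    """Scale every feature (column) of one omics block to mean 0 and unit variance."""
    scaled_columns = []
    for column in zip(*block):
        mu, sd = mean(column), pstdev(column)
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def concatenate_blocks(blocks):
    """Join the scaled feature vectors of each sample across all omics blocks."""
    scaled = [zscore_block(block) for block in blocks]
    n_samples = len(scaled[0])
    return [[v for block in scaled for v in block[i]] for i in range(n_samples)]


def centroid(rows):
    """Feature-wise mean of a group of samples."""
    return [mean(column) for column in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_nearest_centroid(X, y):
    """Supervised: learn one centroid per phenotype label."""
    return {lab: centroid([x for x, t in zip(X, y) if t == lab]) for lab in sorted(set(y))}


def predict(centroids, x):
    """Assign a sample to the label of its closest centroid."""
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], x))


def two_means(X, n_iter=10):
    """Unsupervised: k-means with k=2, initialised on the first and last sample."""
    centers = [X[0], X[-1]]
    assignment = []
    for _ in range(n_iter):
        assignment = [min((0, 1), key=lambda c: sq_dist(centers[c], x)) for x in X]
        for c in (0, 1):
            members = [x for x, a in zip(X, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return assignment


# Hypothetical toy data: 4 samples, 2 transcript features and 1 metabolite feature.
transcriptomics = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.1]]
metabolomics = [[100.0], [110.0], [300.0], [310.0]]
labels = ["control", "control", "case", "case"]

X = concatenate_blocks([transcriptomics, metabolomics])
model = fit_nearest_centroid(X, labels)      # uses the phenotype labels
predictions = [predict(model, x) for x in X]
clusters = two_means(X)                      # ignores the phenotype labels

print(predictions)
print(clusters)
```

Each block is scaled before joining so that omics measured on larger numeric ranges (here the metabolite values) do not dominate the distance calculations.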
This paper is organised as follows. Section 2 provides a short background on multi-omics and ML. Section 3 describes how ML is employed for multi-omics analysis and the real-world challenges this involves. Section 4 presents details of the different multi-omics integration approaches. Section 5 discusses published multi-omics studies that use ML methods. Section 6 describes a recommendation flowchart for choosing an appropriate method for multi-omics integration. Conclusions are provided in Section 7.
Multi-omics
In living beings, genetic information in the cell flows from DNA (deoxyribonucleic acid) to mRNA (messenger ribonucleic acid) to protein, as described by the central dogma of molecular biology (Lodish et al., 2000). This flow of information is often considered analogous to a computer system, an analogy that has facilitated the understanding of biological information processing (Wang and Gribskov, 2005; D’Onofrio and An, 2010).
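The computer-system analogy can be illustrated with a short sketch that treats the flow from DNA to mRNA to protein as two string transformations. This is a deliberate simplification under stated assumptions: the input is taken to be the coding strand beginning at a start codon, the codon table is only a small subset of the standard genetic code, and the example sequence and function names are hypothetical.

```python
# Subset of the standard genetic code; only a few codons are listed for brevity.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "UAA": "*",  # stop
    "UAG": "*",  # stop
    "UGA": "*",  # stop
}


def transcribe(coding_dna):
    """DNA coding strand -> mRNA: the same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # KeyError for codons outside the subset
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGTTTGGCTAA"          # hypothetical coding sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(dna, "->", mrna, "->", protein)
```

In this view, genomics, transcriptomics and proteomics measure the input, the intermediate and the output of the same pipeline.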
The study of DNA, mRNA and proteins is broadly denoted as genomics,
Challenges in multi-omics analysis using machine learning
The use of ML to analyse multi-omic data generated by high-throughput technologies poses several unique challenges, which can be summarised as follows.
Data integration methods for multi-omics
In recent years, various new data integration methods have emerged from modern developments in the mathematical, statistical and computational sciences. For the benefit of readers, Table 5 summarises a few reviews that cover the breadth of multi-omics integration for generic as well as specialised domains such as oncology (Buescher and Driggers, 2016; Nicora et al., 2020) and toxicology (Canzler et al., 2020). Most of these reviews have strived to introduce different
Application of integrative methods in multi-omics studies
The availability of high-throughput omics provides a unique opportunity to explore the complex relationships between different omics and phenotypic targets, rather than evaluating each omics in isolation. This section describes various multi-omics studies that deployed the methods examined in the previous section. Table 8 summarises published multi-omics studies by phenotypic target and tabulates them across seven main omics, namely genomics, transcriptomics, metabolomics, proteomics,
Recommendations
Today a plethora of multi-omic integration methods are available for both supervised and unsupervised learning as evident in the current review. This information can overwhelm interdisciplinary scientists and would require a time-consuming effort to understand the challenging mathematical and computational concepts behind them. Hence, we suggest that interdisciplinary teams working on multi-omics always include ML practitioners to assist with the choice of methods, the development of solutions,
Conclusions
This paper reviewed various ML approaches used for the integration of multi-omics data for analysis. A concise background of multi-omics and ML was presented. It examined the concatenation-, model- and transformation-based integration methods, employed for multi-omics data along with their advantages and disadvantages. Also, various existing multi-omics studies have been summarised. Finally, a recommendation flowchart is presented for interdisciplinary professionals to choose an appropriate
Disclosure
The authors have nothing to disclose.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 633983. Ewan Pearson and Emanuele Trucco would like to acknowledge the National Institute for Health Research (NIHR) global health research unit on global diabetes outcomes research at the University of Dundee (INSPIRED project, Award number 16/136/102) for useful discussions.
References (352)
- Data integration and network reconstruction with ~omics data using Random Forest regression in potato. Anal. Chim. Acta (2011)
- An integrated approach to uncover drivers of cancer. Cell (2010)
- Network-constrained forest for regularized classification of omics data
- Development and validation of the automated imaging differentiation in parkinsonism (AID-P): a multicentre machine learning study. Lancet Digit. Health (2019)
- Self-driving cars: a survey. Expert Syst. Appl. (2021)
- Multi-omics-based identification of SARS-CoV-2 infection biology and candidate drugs against COVID-19. Comput. Biol. Med. (2020)
- Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion (2020)
- Implementing ReliefF filters to extract meaningful features from genetic lifetime datasets. J. Biomed. Inform. (2011)
- Applications and analysis of targeted genomic sequencing in cancer studies. Comput. Struct. Biotechnol. J. (2019)
- Twenty-first century precision medicine in oncology: genomic profiling in patients with cancer. Mayo Clin. Proc. (2017)
- LIPIDAT: A database of lipid phase transition temperatures and enthalpy changes. DMPC data subset analysis. Chem. Phys. Lipids
- Validation of the curation pipeline of UniCarb-DB: Building a global glycan reference MS/MS repository. Biochim. Biophys. Acta BBA - Proteins Proteomics, Computational Proteomics in the Post-Identification Era
- Self-supervised learning for medical image analysis using image context restoration. Med. Image Anal.
- K*: An Instance-based Learner Using an Entropic Distance Measure, in: Proceedings of the Twelfth International Conference on International Conference on Machine Learning, ICML’95
- Comparison of the performance of multiclass classifiers in chemical data: Addressing the problem of overfitting with the permutation test. Chemom. Intell. Lab. Syst.
- A comparative evaluation of outlier detection algorithms: Experiments and analyses. Pattern Recogn.
- Bioinformatic methods in NMR-based metabolic profiling. Prog. Nucl. Magn. Reson. Spectrosc.
- Integration of multi-omics data for prediction of phenotypic traits using random forest. BMC Bioinformat.
- Asthma biomarkers: do they bring precision medicine closer to the clinic? Allergy, Asthma Immunol. Res.
- Molecular Biology of the Cell 5E
- RNA and DNA Sanger sequencing versus next-generation sequencing for HIV-1 drug resistance testing in treatment-naive patients. J. Antimicrob. Chemother.
- An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat.
- A systematic comparison of supervised classifiers. PLoS One
- Amaz. Web Serv. Inc
- Statistical workflow for feature selection in human metabolomics data. Metabolites
- Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol.
- A view of cloud computing. Commun. ACM
- Proteomics: technologies and their applications. J. Chromatogr. Sci.
- A joint analysis of transcriptomic and metabolomic data uncovers enhanced enzyme-metabolite coupling in breast cancer. Sci. Rep.
- Support vector regression
- Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res.
- An introduction to machine learning. Clin. Pharmacol. Ther.
- Novel methods in pulmonary hypertension phenotyping in the age of precision medicine (2015 Grover Conference series). Pulm. Circ.
- Machine learning vs. classic statistics for the prediction of IVF outcomes. J. Assist. Reprod. Genet.
- Hierarchical classification of cancers of unknown primary using multi-omics data. Cancer Informat.
- Big Data: Opportunities and challenges
- Big data and biomedical informatics: a challenging opportunity. Yearb. Med. Inform.
- The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. Npj Digit. Med.
- GenBank. Nucleic Acids Res.
- Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformat.
- Multi-omics data and analytics integration in ovarian cancer. Artif. Intell. Appl. Innov.
- Neural Networks for Pattern Recognition
- Pattern recognition and machine learning, Information science and statistics
- Integration of transcriptomics and metabonomics: improving diagnostics, biomarker identification and phenotyping in ulcerative colitis. Metabolomics Off. J. Metabolomic Soc.
- PHG Foundation (University of Cambridge)
- Reverse phase protein arrays—quantitative assessment of multiple biomarkers in biopsies for clinical use. Microarrays
- Integrative multi-omics module network inference with lemon-tree. PLoS Comput. Biol.
- Relevance vector machine and support vector machine classifier analysis of scanning laser polarimetry retinal nerve fiber layer measurements. Invest. Ophthalmol. Vis. Sci.
- Random forests. Mach. Learn.
1 These authors contributed equally to this study (shared first authorship).