Reject and Cascade Classifier with Subgroup Discovery for Interpretable Metagenomic Signatures

Queyrel, Maxence; Templier, Alexandre; Zucker, Jean-Daniel

doi:10.1007/978-3-030-93736-2_5

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1524)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Over the past decade, technological advances have made high-speed, high-resolution sequencing of genetic material possible at ever lower cost (from millions of dollars to about one hundred). In this context, the human microbiome has demonstrated its ability to support the stratification and classification of various human diseases. The gut microbiota is thus set to play a key role in precision medicine as a “super-integrator” of patient status, and identifying metagenomic signatures is becoming increasingly important. To address the interpretability/accuracy trade-off, we propose a hybrid approach based on a cascade classifier that combines a first step of Subgroup Discovery (for interpretability) with a subsequent classifier model (for accuracy). With this approach, different interpretable signatures stratify the maximum possible number of patients, while the remaining patients are assigned a default non-interpretable signature. Several datasets from the NCBI public repository covering different diseases (colorectal cancer, cirrhosis, diabetes, obesity) were used to evaluate how well our approach builds metagenomic disease signatures that are both accurate and interpretable. The results show that the approach reaches performance comparable or superior to state-of-the-art approaches while offering better interpretability than black-box models.

Notes

1. Taxon (plural taxa): an entity grouping all living organisms that share certain well-defined characteristics. The term is used in phylogenetic classification to group living beings according to various criteria (from the most general to the most specific: domain, kingdom, phylum, class, order, family, genus, species).
2. A basic pattern is an elementary unit of a rule, characterized by a variable, a comparator and a value (e.g. age > 10).
3. www.ncbi.nlm.nih.gov.
4. Isomorphic: the mapping between the simplex and the new basis is preserved.
5. Isometric: distances in the simplex are equivalent to distances between the transformed values.
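
The basic pattern of Note 2 and the cascade described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Q-Classifier: the names `BasicPattern`, `Rule` and `cascade_predict` are hypothetical, a rule is assumed to be a conjunction of basic patterns, and the first matching rule is assumed to win. Samples covered by no rule are rejected by the interpretable stage and passed to a default classifier.

```python
import operator
from dataclasses import dataclass

# Comparators allowed in a basic pattern (illustrative choice).
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass(frozen=True)
class BasicPattern:
    """A variable, a comparator and a value, e.g. age > 10."""
    variable: str
    comparator: str
    value: float

    def matches(self, sample):
        return COMPARATORS[self.comparator](sample[self.variable], self.value)


@dataclass(frozen=True)
class Rule:
    """Assumption: a rule is a conjunction of basic patterns with a class label."""
    patterns: tuple
    label: str

    def matches(self, sample):
        return all(p.matches(sample) for p in self.patterns)


def cascade_predict(sample, rules, fallback):
    """Return (label, stage): rules are tried first, the fallback model last."""
    for rule in rules:
        if rule.matches(sample):
            return rule.label, "rule"
    # No rule covers the sample: it is rejected and sent to the default model.
    return fallback(sample), "fallback"


# Hypothetical rule and fallback, for illustration only.
rule = Rule(patterns=(BasicPattern("age", ">", 10),), label="case")
print(cascade_predict({"age": 12}, [rule], lambda s: "control"))
print(cascade_predict({"age": 5}, [rule], lambda s: "control"))
```

The second element of the returned pair records which stage produced the prediction, which is what makes it possible to report an interpretable signature for covered samples only.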

References

Pasolli, E., Truong, D.T., Malik, F.: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights (2015)
Esnault, C., Gadonna, M.-L., Queyrel, M., Templier, A., Zucker, J.-D.: Q-Finder: an algorithm for credible subgroup discovery in clinical data analysis - an application to the international diabetes management practice study. Front. Artif. Intell. 3, 559927 (2020)
Friedman, J., Alm, E.J.: Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 8(9), e1002687 (2012)
Fritz, A., Hofmann, P., Majda, S., et al.: CAMISIM: simulating metagenomes and microbial communities. Microbiome 7(1), 17 (2019)
Harris, Z.N., Dhungel, E., Mosior, M., Ahn, T.-H.: Massive metagenomic data analysis using abundance-based machine learning. Biol. Direct 14(1), 12 (2019)
Imparato, A.: Interactive Subgroup Discovery, p. 134 (2012)
Korepanova, N.: Subgroup discovery for treatment optimization. In: Workshop on Data Analysis in Medicine, WDAM 2017, pp. 48–41 (2017)
Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K., Hugenholtz, P.: A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72(4), 557–578 (2008)
Loh, W.-Y., Cao, L., Zhou, P.: Subgroup identification for precision medicine: a comparative review of 13 methods. Wiley Interdisc. Rev. Data Min. Knowl. Disc. 9(5), 604–621 (2019)
Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions, p. 10 (2017)
Menegaux, R., Vert, J.-P.: Continuous embeddings of DNA sequencing reads and application to metagenomics. J. Comput. Biol. 26(6), 509–518 (2019)
Le Chatelier, E., Nielsen, T., et al.: Richness of human gut microbiome correlates with metabolic markers. Nature 500(7464), 541–546 (2013)
Nayfach, S., Pollard, K.S.: Toward accurate and quantitative comparative metagenomics. Cell 166(5), 1103–1116 (2016)
Oh, M., Zhang, L.: DeepMicro: deep representation learning for disease prediction based on microbiome data. Sci. Rep. 10(1), 6026 (2020)
Pasolli, E., Truong, D.T., Malik, F., Waldron, L., Segata, N.: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLOS Comput. Biol. 12(7), e1004977 (2016)
Petrosino, J.F.: The microbiome in precision medicine: the way forward. Genome Med. 10(1), 12 (2018)
Prifti, E., Chevaleyre, Y., Hanczar, B., et al.: Interpretable and accurate prediction models for metagenomics data. GigaScience 9(3), giaa010 (2020)
Qin, J., et al.: A metagenome-wide association study of gut microbiota in type 2 diabetes, p. 6 (2012)
Qin, N., Yang, F., Li, A., et al.: Alterations of the human gut microbiome in liver cirrhosis. Nature 513(7516), 59–64 (2014)
Queyrel, M., Prifti, E., Templier, A., Zucker, J.-D.: Towards end-to-end disease prediction from raw metagenomic data. Int. J. Biomed. Biol. Eng. 15(6), 234–246 (2021)
Quince, C., Walker, A.W., Simpson, J.T., Loman, N.J., Segata, N.: Shotgun metagenomics, from sampling to sequencing and analysis, p. 27 (2017)
Quinn, T.P., Erb, I.: Interpretable log contrasts for the classification of health biomarkers: a new approach to balance selection. mSystems 5(2), e00230-19 (2020)
Segata, N., Izard, J., Waldron, L., et al.: Metagenomic biomarker discovery and explanation. Genome Biol. 12(6), R60 (2011)
Thomas, A.M., Manghi, P., Asnicar, F., et al.: Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. 25(4), 667–678 (2019)
Wen, C., Zheng, Z., Shao, T., et al.: Quantitative metagenomics reveals unique gut microbiome biomarkers in ankylosing spondylitis. Genome Biol. 18(1), 142 (2017)
Wu, G., Zhao, N., Zhang, C., Lam, Y.Y., Zhao, L.: Guild-based analysis for understanding gut microbiome in human health and diseases. Genome Med. 13(1), 22 (2021)
Yang, F., Zou, Q., Gao, B.: GutBalance: a server for the human gut microbiome-based disease prediction and biomarker discovery with compositionality addressed. Brief. Bioinf. 22(5), bbaa436 (2021)
Zeller, G., Tap, J., Voigt, A.Y., et al.: Potential of fecal microbiota for early stage detection of colorectal cancer. Mol. Syst. Biol. 10(11), 766 (2014)

Author information

Authors and Affiliations

Quinten France, 8 rue Vernier, 75017, Paris, France
Maxence Queyrel & Alexandre Templier
Sorbonne University, IRD, UMMISCO, 93143, Bondy, France
Maxence Queyrel & Jean-Daniel Zucker
Sorbonne University, INSERM, NUTRIOMICS, 75013, Paris, France
Jean-Daniel Zucker

Editor information

Editors and Affiliations

IKIM, Ruhr-University Bochum, Bochum, Germany
Michael Kamp
University of Sydney, Sydney, NSW, Australia
Irena Koprinska
University of Namur, Namur, Belgium
Adrien Bibal
University of Rennes 1, Rennes, France
Tassadit Bouadi
University of Namur, Namur, Belgium
Benoît Frénay
Inria, Rennes, France
Luis Galárraga
University of Antwerp, Antwerp, Belgium
José Oramas
Ruhr University Bochum, Bochum, Germany
Linara Adilova
Royal Holloway University of London, Egham, UK
Yamuna Krishnamurthy
Ghent University, Ghent, Belgium
Bo Kang
Université Jean Monnet, Saint-Etienne cedex 2, France
Christine Largeron
Ghent University, Gent, Belgium
Jefrey Lijffijt
Telecom Paris, Paris, France
Tiphaine Viard
University of Bonn, Bonn, Germany
Pascal Welke
Norwegian University of Science and Technology, Trondheim, Norway
Massimiliano Ruocco
BI Norwegian Business School, Oslo, Norway
Erlend Aune
University of Pisa, Pisa, Italy
Claudio Gallicchio
University of Duisburg-Essen, Essen, Germany
Gregor Schiele
Graz University of Technology, Graz, Austria
Franz Pernkopf
Xilinx Research, Dublin, Ireland
Michaela Blott
Heidelberg University, Heidelberg, Germany
Holger Fröning
Heidelberg University, Heidelberg, Germany
Günther Schindler
University of Pisa, Pisa, Italy
Riccardo Guidotti
University of Pisa, Pisa, Italy
Anna Monreale
ISTI-CNR, Pisa, Italy
Salvatore Rinzivillo
Warsaw University of Technology, Warsaw, Poland
Przemyslaw Biecek
Freie Universität Berlin, Berlin, Germany
Eirini Ntoutsi
Eindhoven University of Technology, Eindhoven, The Netherlands
Mykola Pechenizkiy
Leibniz University Hannover, Hannover, Germany
Bodo Rosenhahn
University of Sussex, Brighton, UK
Christopher Buckley
University of Chieti-Pescara, Chieti, Italy
Daniela Cialfi
Radboud University Nijmegen, Nijmegen, The Netherlands
Pablo Lanillos
McGill University, Montreal, Canada
Maxwell Ramstead
Ghent University, Ghent, Belgium
Tim Verbelen
University of Lisbon, Lisboa, Portugal
Pedro M. Ferreira
University of Bari Aldo Moro, Bari, Italy
Giuseppina Andresini
Universita di Bari Aldo Moro, Bari, Italy
Donato Malerba
University of Lisbon, Lisbon, Portugal
Ibéria Medeiros
Shenzhen University, Shenzhen, China
Philippe Fournier-Viger
Harbin Institute of Technology, Harbin, China
M. Saqib Nawaz
University of Córdoba, Córdoba, Spain
Sebastian Ventura
Peking University, Beijing, China
Meng Sun
Noah's Ark Lab, Huawei, Beijing, China
Min Zhou
UniCredit, Milan, Italy
Valerio Bitetta
UniCredit, Rome, Italy
Ilaria Bordino
UniCredit, Milan, Italy
Andrea Ferretti
Unicredit, Rome, Italy
Francesco Gullo
ENEA Headquarters, Portici, Italy
Giovanni Ponti
Unicredit, Rome, Italy
Lorenzo Severini
University of Porto, Porto, Portugal
Rita Ribeiro
University of Porto, Porto, Portugal
João Gama
UPC BarcelonaTech, Barcelona, Spain
Ricard Gavaldà
Northwestern University, Chicago, IL, USA
Lee Cooper
PD Personalised Healthcare, Basel, Switzerland
Naghmeh Ghazaleh
University of Lausanne, Lausanne, Switzerland
Jonas Richiardi
ETH Zurich, Basel, Switzerland
Damian Roqueiro
F. Hoffmann–La Roche Ltd, Basel, Switzerland
Diego Saldana Miranda
Novartis Pharma AG, Basel, Switzerland
Konstantinos Sechidis
University of Lisbon, Lisbon, Portugal
Guilherme Graça

Appendices

Appendix

A Example of Metagenomic Abundance Table

B Compositional Data and Log-Ratio Transformations

One major and often underestimated difficulty is the compositional nature of quantitative metagenomic data. The number of sequences generated by NGS varies from one sample or study to another, so the counts of the biological objects in a sample should not be used as absolute counts: they would not be representative of the real composition. A normalization step must therefore be applied, which consists of dividing each abundance by the total number of taxonomic units in the sample, resulting in a table of relative abundances (see Fig. 1). Such tables are compositional data, defined as vectors representing parts of a whole that carry only relative information. Relative abundance tables express the percentage of total abundance, which restricts them to a sample space in which the values of each sample always sum to 1 and each value lies in the interval [0, 1]. These constraints require specific mathematical transformations to avoid misinterpretations or irreproducible analyses (Yang et al. 2021). The transformations most often used are log-ratio transformations: the Additive Log-Ratio (ALR), the Centered Log-Ratio (CLR) and the Isometric Log-Ratio (ILR). The choice of method depends on the desired interpretation, as described below:

ALR: Isomorphic (see Note 4) but not isometric (see Note 5). Transforms the original $D$ features into a $(D-1)$-feature space. Formula:
$$alr(x) = \left[\ln\frac{x_1}{x_D}, \ln\frac{x_2}{x_D}, \ldots, \ln\frac{x_{D-1}}{x_D}\right]$$
CLR: Both isometric and isomorphic. It removes the value-range restriction but not the sum constraint (the transformed components sum to zero). Unlike ALR and ILR, it does not change the dimension of the basis, which makes it easier to train interpretable models. Formula:
$$clr(x) = \left[\ln\frac{x_1}{g(x)}, \ln\frac{x_2}{g(x)}, \ldots, \ln\frac{x_D}{g(x)}\right]$$
Where $g(x)$ is the geometric mean of $x$.
ILR: Isomorphic and isometric. It is often the most suitable transformation for handling both the sum and value-range constraints because it is associated with orthonormal bases in the simplex. Nevertheless, like ALR, it transforms the original $D$ features into a $(D-1)$-feature space. Formula:
$$ilr(x) = clr(x) \cdot \varPsi '$$

$$\varPsi \varPsi ' = I_{D-1}$$
Where $\varPsi $ is a $(D-1, D)$-matrix whose rows are $clr(e_i)$ and $\{e_1, e_2, \ldots, e_{D-1}\}$ is a generic orthonormal basis of the simplex $S^D$.
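
The three transformations can be illustrated with a short, dependency-free Python sketch. It is not taken from the paper's code: the function names are hypothetical, the counts are made up, and the orthonormal basis used for ILR is a Helmert-type basis chosen here for convenience (any matrix $\varPsi$ satisfying $\varPsi \varPsi ' = I_{D-1}$ with zero-sum rows would do). Zero counts are assumed to have been replaced beforehand, since the logarithm is undefined at zero.

```python
import math


def closure(counts):
    """Normalise raw counts into relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def alr(x):
    """Additive log-ratio: D parts -> D-1 values, last part as reference."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]


def clr(x):
    """Centered log-ratio: D parts -> D values that sum to zero."""
    log_x = [math.log(xi) for xi in x]
    log_gmean = sum(log_x) / len(log_x)  # log of the geometric mean g(x)
    return [lx - log_gmean for lx in log_x]


def helmert_basis(d):
    """(D-1) x D matrix with orthonormal, zero-sum rows (assumed basis)."""
    psi = []
    for i in range(1, d):
        scale = math.sqrt(i / (i + 1))
        psi.append([scale / i] * i + [-scale] + [0.0] * (d - i - 1))
    return psi


def ilr(x):
    """Isometric log-ratio: clr(x) multiplied by the transposed basis."""
    c = clr(x)
    return [sum(ci * pi for ci, pi in zip(c, row)) for row in helmert_basis(len(x))]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two hypothetical samples with four taxa each.
x = closure([120, 30, 40, 10])
y = closure([15, 60, 20, 5])

print(alr(x))
print(clr(x))
print(ilr(x))
# Isometry: distances agree between the CLR and ILR representations.
print(euclidean(clr(x), clr(y)), euclidean(ilr(x), ilr(y)))
```

The last line prints the same distance twice (up to rounding), which is the isometry property of Note 5; the same comparison with `alr` generally gives a different distance.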

As ILR transformations are difficult to interpret, recent studies have defined a quantity called a balance (Quinn and Erb 2020; Yang et al. 2021), which is the log-ratio of the geometric means of two non-overlapping groups of features defined by a sequential binary partition (SBP). Balances are therefore more interpretable than the common log-ratio transformations. Friedman and Alm (2012) also account for metagenomic compositionality when building a network of species interactions. They proposed a robust approximation method called SparCC, which derives the correlation matrix from an approximate estimate of the variance of the log-ratios of species.
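
A balance as described above can be computed directly from its definition. The sketch below is a minimal illustration with hypothetical taxon names and abundances; it implements the plain log-ratio of geometric means stated in the text. Some formulations additionally scale the result by $\sqrt{rs/(r+s)}$, where $r$ and $s$ are the group sizes; that factor is omitted here.

```python
import math


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def balance(sample, numerator, denominator):
    """Log-ratio of the geometric means of two non-overlapping taxon groups."""
    if set(numerator) & set(denominator):
        raise ValueError("the two groups must not overlap")
    num = geometric_mean([sample[t] for t in numerator])
    den = geometric_mean([sample[t] for t in denominator])
    return math.log(num / den)


# Hypothetical relative abundances for four taxa.
sample = {"taxon_a": 0.40, "taxon_b": 0.10, "taxon_c": 0.25, "taxon_d": 0.25}
b = balance(sample, ["taxon_a", "taxon_b"], ["taxon_c", "taxon_d"])
print(b)

# The balance depends only on ratios, so rescaling the sample leaves it unchanged.
scaled = {taxon: 1000 * value for taxon, value in sample.items()}
print(balance(scaled, ["taxon_a", "taxon_b"], ["taxon_c", "taxon_d"]))
```

A negative value means the numerator group is, on a geometric-mean basis, less abundant than the denominator group, which is what makes a balance readable as a single signed feature.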

C Q-Classifier Schemes and Overview

D Q-Classifier Data Processing

Two data preprocessing options could be included in the algorithm:

A dimensionality reduction operation that limits the number of rules generated by the Q-Finder, which reduces runtime and improves statistical power when the p-values are adjusted with the Bonferroni correction. Indeed, one of the weaknesses of the Q-Finder, and thus of the Q-Classifier, is its high complexity, which is equal to $O(G \times F \times (M \times D)^C)$, where $C$ is the rule complexity, $F$ is the complexity of rule aggregation, $G$ is the number of groups (e.g., control/case), $D$ is the number of variables and $M$ is the maximum number of modalities per variable. In practice, recursive feature elimination with an SVM model is used to perform the feature reduction.
A log-ratio transformation suited to the compositional nature of the data. As explained in Appendix B, the centered log-ratio transformation (CLR) is the only log-ratio transformation that does not alter the dimensionality, which is necessary to retain the variable names and produce an interpretable prediction.
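
To give a feel for why reducing $D$ matters, the complexity bound can be evaluated for two feature counts. This is a back-of-the-envelope sketch: all numbers are hypothetical, and treating the bound as the number of hypotheses tested in the Bonferroni correction is an assumption made here for illustration, not a statement from the paper.

```python
def rule_search_bound(n_groups, aggregation_cost, max_modalities, n_variables, rule_complexity):
    """Evaluate G * F * (M * D)^C from the complexity expression."""
    return n_groups * aggregation_cost * (max_modalities * n_variables) ** rule_complexity


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical setting: 2 groups, unit aggregation cost, 10 modalities, rules of complexity 2.
full = rule_search_bound(2, 1, 10, 500, 2)     # 500 variables before reduction
reduced = rule_search_bound(2, 1, 10, 50, 2)   # 50 variables after reduction

print(full, reduced, full // reduced)
print(bonferroni_threshold(0.05, full), bonferroni_threshold(0.05, reduced))
```

Because $D$ is raised to the power $C$, dividing the number of variables by 10 divides the bound by $10^C$, and the corrected significance threshold becomes correspondingly less strict.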

E Pseudo Code of Optimal Union

F Simulated Dataset Used to Train Taxa Classifier

To learn a good representation of the sequences and improve the classification power of the models, the simulated dataset must have similar abundances for each genome at the chosen taxonomic level. We used the CAMISIM framework (Fritz et al. 2019), which takes a file of genome abundances as configuration. CAMISIM takes the size of each genome into account in addition to its abundance, so to generate abundances that are equally proportioned between genomes we use the following formulas:

$$A_g = \frac{1}{L_g}$$

$$AG_g = \frac{A_g}{\sum _i^{|G|}A_i}, g \in G$$

Where $G$ is the set of genomes, $L_g$ is the base-pair length of genome $g$ and $AG_g$ is the equally balanced abundance of genome $g$. However, if a taxonomic level such as species is chosen, the abundance formula should be modified accordingly. This prevents a species from appearing too often or not often enough, which could happen when several genomes belong to the same species. The formulas become:

$$L_s = \sum _i^{|G_s|} L_{i} \; ; \; L\_norm_s = \frac{L_s}{\sum _i^{|S|}L_{i}}, \; s \in S$$

$$A_g = \frac{1}{L\_norm_s \times (L_g / L_s)} \; ; \; A_g = \frac{A_g}{\sum _i^{|G_s|} A_{i}}$$

$$AS_g = \frac{A_g}{\sum _i^{|G|}A_{i}}$$

Where $S$ is the set of species, $G_s$ is the set of genomes of species $s$, $L_s$ is the total base-pair length of the genomes of species $s$ and $AS_g$ is the abundance of genome $g$, equally balanced at the species level.
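
The formulas above translate step by step into the following Python sketch. The function names, genome identifiers and lengths are hypothetical, and the code only computes the abundance values; how they are written into a CAMISIM configuration file is not shown.

```python
from collections import defaultdict


def genome_balanced_abundance(lengths):
    """AG_g: A_g = 1 / L_g, then normalised over all genomes."""
    raw = {g: 1.0 / length for g, length in lengths.items()}
    total = sum(raw.values())
    return {g: a / total for g, a in raw.items()}


def species_balanced_abundance(lengths, species_of):
    """AS_g: abundances balanced so that every species gets the same share."""
    # L_s: total length of the genomes of each species.
    species_length = defaultdict(float)
    for g, length in lengths.items():
        species_length[species_of[g]] += length
    total_length = sum(species_length.values())

    # A_g = 1 / (L_norm_s * (L_g / L_s))
    raw = {}
    for g, length in lengths.items():
        l_s = species_length[species_of[g]]
        l_norm_s = l_s / total_length
        raw[g] = 1.0 / (l_norm_s * (length / l_s))

    # Normalise A_g within each species.
    species_total = defaultdict(float)
    for g, a in raw.items():
        species_total[species_of[g]] += a
    within = {g: a / species_total[species_of[g]] for g, a in raw.items()}

    # AS_g: normalise over all genomes.
    overall = sum(within.values())
    return {g: a / overall for g, a in within.items()}


# Hypothetical example: two genomes of species s1, one genome of species s2.
lengths = {"g1": 2_000_000, "g2": 4_000_000, "g3": 5_000_000}
species_of = {"g1": "s1", "g2": "s1", "g3": "s2"}

ag = genome_balanced_abundance(lengths)
as_ = species_balanced_abundance(lengths, species_of)
print(ag)
print(as_)
```

In this example each species receives half of the total abundance, and within species s1 the shorter genome g1 receives twice the abundance of g2, compensating for its smaller size.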

3.5M and 1.5M Illumina reads with an average length of 150 base pairs were simulated for the training and validation sets respectively, corresponding to a depth of coverage (mapping depth) of 27% over all the genomes. The initial abundances were calculated with the formulas above so as to be equally proportioned at the species level. The model is trained at the species level rather than the genome level because genomes of the same species are close enough to receive the same prediction, and the model is easier to train with a smaller number of classes. An almost equal number of sequences per species is not representative of real metagenomes, where abundances follow exponential distributions. Nevertheless, for read classification, this prevents the classifier from focusing on and predicting only the predominant classes, and lets it learn a more robust embedding. Simulated data have the advantage over real NGS data of providing the origin of each sequence, which allows a supervised algorithm to be trained.

G Pseudo Code of Q-Classifier’s Training and Classification Stage

H Rules Coverage Analysis on the Cirrhosis Dataset

(See Fig. 5)

The best rules for the control class and the case class are “RCO1” and “RCA1”, respectively. The subgroups associated with these rules have a large intersection with the other subgroups in the optimal union. This effect is even stronger on the validation set, where one subgroup is sometimes completely included in another. Such inclusions can occur in validation because the optimal union of the rules is computed on the training set. This visualization lets us see which samples the subgroups share and which samples are covered by only one of them.
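
The kind of overlap read from the figure can also be computed directly from the sets of samples covered by each rule. The sketch below uses made-up sample identifiers; “RCO1” is the rule name from the text, while “RCO2” and the function `overlap_report` are hypothetical.

```python
def overlap_report(subgroups):
    """Pairwise intersection, union and inclusion between rule subgroups."""
    names = sorted(subgroups)
    report = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            a, b = subgroups[first], subgroups[second]
            report[(first, second)] = {
                "intersection": len(a & b),
                "union": len(a | b),
                "first_in_second": a <= b,
                "second_in_first": b <= a,
            }
    return report


# Hypothetical sample identifiers covered by each rule.
subgroups = {
    "RCO1": {"s01", "s02", "s03", "s04", "s05"},
    "RCO2": {"s02", "s03", "s04"},
}
report = overlap_report(subgroups)
print(report[("RCO1", "RCO2")])
```

Here the second subgroup is entirely included in the first, the situation described above for the validation set.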

About this paper

Cite this paper

Queyrel, M., Templier, A., Zucker, JD. (2021). Reject and Cascade Classifier with Subgroup Discovery for Interpretable Metagenomic Signatures. In: Kamp, M., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1524. Springer, Cham. https://doi.org/10.1007/978-3-030-93736-2_5

DOI: https://doi.org/10.1007/978-3-030-93736-2_5
Published: 17 February 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93735-5
Online ISBN: 978-3-030-93736-2
eBook Packages: Computer Science, Computer Science (R0)
