Reject and Cascade Classifier with Subgroup Discovery for Interpretable Metagenomic Signatures

  • Conference paper
  • In: Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021)

Abstract

Over the past decade, technological advances have made high-speed, high-resolution sequencing of genetic material possible at ever lower cost (from millions of dollars to about one hundred). In this context, the human microbiome has demonstrated its ability to support the stratification and classification of various human diseases. The gut microbiota is thus set to play a key role in precision medicine as a “super-integrator” of patient status, and identifying metagenomic signatures is becoming increasingly important. To address the interpretability/accuracy trade-off, we propose a hybrid approach based on a cascade classifier that combines a first step of Subgroup Discovery (for interpretability) with a classifier model (for accuracy). With this approach, different interpretable signatures stratify the maximum possible number of patients, while the remaining patients are handled by a default non-interpretable signature. Several datasets from the NCBI public repository covering different diseases (colorectal cancer, cirrhosis, diabetes, obesity) were used to evaluate the ability of our approach to build metagenomic disease signatures that are both accurate and interpretable. The results show that the approach reaches performance comparable or superior to state-of-the-art approaches while offering better interpretability than black-box models.

Notes

  1. Taxon (plural: taxa): an entity grouping all living organisms that share certain well-defined characteristics. The term is used in phylogenetic classification to group living beings according to various criteria (from the most general to the most specific: domain, kingdom, phylum, class, order, family, genus, species).

  2. A basic pattern is an elementary unit of a rule, characterized by a variable, a comparator and a value (e.g. age > 10).

  3. www.ncbi.nlm.nih.gov.

  4. Isomorphic: the mapping between the simplex and the new basis is preserved.

  5. Isometric: distances in the simplex are equivalent to distances between the transformed values.

References

  • Pasolli, E., Truong, D.T., Malik, F.: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights (2015)

  • Esnault, C., Gadonna, M.-L., Queyrel, M., Templier, A., Zucker, J.-D.: Q-Finder: an algorithm for credible subgroup discovery in clinical data analysis - an application to the international diabetes management practice study. Front. Artif. Intell. 3, 559927 (2020)

  • Friedman, J., Alm, E.J.: Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 8(9), e1002687 (2012)

  • Fritz, A., Hofmann, P., Majda, S., et al.: CAMISIM: simulating metagenomes and microbial communities. Microbiome 7(1), 17 (2019)

  • Harris, Z.N., Dhungel, E., Mosior, M., Ahn, T.-H.: Massive metagenomic data analysis using abundance-based machine learning. Biol. Direct 14(1), 12 (2019)

  • Imparato, A.: Interactive Subgroup Discovery, p. 134 (2012)

  • Korepanova, N.: Subgroup discovery for treatment optimization. In: Workshop on Data Analysis in Medicine, WDAM 2017, pp. 48–41 (2017)

  • Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K., Hugenholtz, P.: A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72(4), 557–578 (2008)

  • Loh, W.-Y., Cao, L., Zhou, P.: Subgroup identification for precision medicine: a comparative review of 13 methods. Wiley Interdisc. Rev. Data Min. Knowl. Disc. 9(5), 604–621 (2019)

  • Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions, p. 10 (2017)

  • Menegaux, R., Vert, J.-P.: Continuous embeddings of DNA sequencing reads and application to metagenomics. J. Comput. Biol. 26(6), 509–518 (2019)

  • Le Chatelier, E., Nielsen, T., et al.: Richness of human gut microbiome correlates with metabolic markers. Nature 500(7464), 541–546 (2013)

  • Nayfach, S., Pollard, K.S.: Toward accurate and quantitative comparative metagenomics. Cell 166(5), 1103–1116 (2016)

  • Oh, M., Zhang, L.: DeepMicro: deep representation learning for disease prediction based on microbiome data. Sci. Rep. 10(1), 6026 (2020)

  • Pasolli, E., Truong, D.T., Malik, F., Waldron, L., Segata, N.: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLOS Comput. Biol. 12(7), e1004977 (2016)

  • Petrosino, J.F.: The microbiome in precision medicine: the way forward. Genome Med. 10(1), 12 (2018)

  • Prifti, E., Chevaleyre, Y., Hanczar, B., et al.: Interpretable and accurate prediction models for metagenomics data. GigaScience 9(3), giaa010 (2020)

  • Qin, J., et al.: A metagenome-wide association study of gut microbiota in type 2 diabetes, p. 6 (2012)

  • Qin, N., Yang, F., Li, A., et al.: Alterations of the human gut microbiome in liver cirrhosis. Nature 513(7516), 59–64 (2014)

  • Queyrel, M., Prifti, E., Templier, A., Zucker, J.-D.: Towards end-to-end disease prediction from raw metagenomic data. Int. J. Biomed. Biol. Eng. 15(6), 234–246 (2021)

  • Quince, C., Walker, A.W., Simpson, J.T., Loman, N.J., Segata, N.: Shotgun metagenomics, from sampling to sequencing and analysis, p. 27 (2017)

  • Quinn, T.P., Erb, I.: Interpretable log contrasts for the classification of health biomarkers: a new approach to balance selection. mSystems 5(2), e00230-19 (2020)

  • Segata, N., Izard, J., Waldron, L., et al.: Metagenomic biomarker discovery and explanation. Genome Biol. 12(6), R60 (2011)

  • Thomas, A.M., Manghi, P., Asnicar, F., et al.: Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. 25(4), 667–678 (2019)

  • Wen, C., Zheng, Z., Shao, T., et al.: Quantitative metagenomics reveals unique gut microbiome biomarkers in ankylosing spondylitis. Genome Biol. 18(1), 142 (2017)

  • Wu, G., Zhao, N., Zhang, C., Lam, Y.Y., Zhao, L.: Guild-based analysis for understanding gut microbiome in human health and diseases. Genome Med. 13(1), 22 (2021)

  • Yang, F., Zou, Q., Gao, B.: GutBalance: a server for the human gut microbiome-based disease prediction and biomarker discovery with compositionality addressed. Brief. Bioinf. 22(5), bbaa436 (2021)

  • Zeller, G., Tap, J., Voigt, A.Y., et al.: Potential of fecal microbiota for early stage detection of colorectal cancer. Mol. Syst. Biol. 10(11), 766 (2014)

Appendix

A Example of Metagenomic Abundance Table

Fig. 1. An example of an abundance table where two metagenomes have different numbers of taxa. For yellow DNA, both have an absolute abundance equal to four, but the relative abundances in percentage are different: 50% for the former versus 66.6% for the latter. The relative abundance thus gives the proportion of one taxon in relation to the others.
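
The normalization illustrated in Fig. 1 can be sketched in a few lines of Python. This is only an illustration: the yellow counts (4 and 4) and the resulting percentages follow the caption, while the way the remaining counts are split across the other taxa is an assumption made for the example.

```python
import numpy as np

# Rows are metagenomes, columns are taxa; column 0 is the "yellow" taxon.
# Only the yellow counts and the sample totals (8 and 6) follow the caption;
# the other counts are illustrative assumptions.
counts = np.array([
    [4, 2, 2],
    [4, 2, 0],
], dtype=float)


def relative_abundance(counts):
    """Divide each count by the total count of its sample (row)."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals


rel = relative_abundance(counts)
yellow_pct = 100 * rel[:, 0]  # percentage of the yellow taxon in each sample
```

Both samples contain four yellow sequences, yet the yellow taxon accounts for half of the first sample and two thirds of the second.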

B Compositional Data and Log-Ratio Transformations

An often underestimated difficulty is the compositional nature of quantitative metagenomic data. The number of sequences generated by NGS varies from one sample or study to another, so absolute counts are not representative of the real composition and cannot be compared across samples. A normalization step must therefore be applied: each abundance is divided by the total count of taxonomic units in its sample, resulting in a table of relative abundances (see Fig. 1). Such tables are compositional data, that is, vectors representing parts of a whole that carry only relative information. Relative abundances express percentages of the total abundance, which restricts them to a sample space where the values of each sample always sum to 1 and lie in the interval [0, 1]. These constraints require specific mathematical transformations to avoid misinterpretations or irreproducible analyses (Yang et al. 2021). The transformations most often used are log-ratio transformations: Additive Log-Ratio (ALR), Centered Log-Ratio (CLR) and Isometric Log-Ratio (ILR). The choice of method depends on the desired interpretation, as described below:

  • ALR: Isomorphic (see Note 4) but not isometric (see Note 5). It maps the original D features to a \(D-1\) feature space. Formula:

    $$alr(x) = \left[\ln\frac{x_1}{x_D}, \ln\frac{x_2}{x_D}, \ldots, \ln\frac{x_{D-1}}{x_D}\right]$$
  • CLR: Both isometric and isomorphic. It removes the value-range restriction, but it does not remove the sum constraint. Unlike ALR and ILR, it does not change the dimension of the basis, which makes it easier to train interpretable models. Formula:

    $$clr(x) = \left[\ln\frac{x_1}{g(x)}, \ln\frac{x_2}{g(x)}, \ldots, \ln\frac{x_D}{g(x)}\right]$$

    Where g(x) is the geometric mean of x.

  • ILR: Isomorphic and isometric. It is often the most suitable transformation for handling the sum and value-range constraints because it is associated with orthonormal bases in the simplex. Nevertheless, like ALR, it maps the original D features to a \(D-1\) feature space. Formula:

    $$ilr(x) = clr(x) \cdot \varPsi '$$
    $$\varPsi \varPsi ' = I_{D-1}$$

    Where \(\varPsi \) is a \((D-1, D)\)-matrix whose rows are \(clr(e_i)\) and \(\{e_1, e_2, \ldots, e_{D-1}\}\) is a generic orthonormal basis of the simplex \(S^D\).
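
The three transformations above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes strictly positive parts (zeros would need a pseudocount or another replacement strategy, which is not covered here), and the function `helmert_basis` is a hypothetical helper that builds one particular matrix \(\varPsi\) with orthonormal rows among the many valid choices.

```python
import numpy as np


def alr(x):
    """Additive log-ratio: log of each part over the last part (D-1 values)."""
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])


def clr(x):
    """Centered log-ratio: log of each part over the geometric mean (D values)."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()  # log(x_i) - log(g(x)) = log(x_i / g(x))


def helmert_basis(D):
    """One possible (D-1, D) matrix Psi with orthonormal, zero-sum rows."""
    psi = np.zeros((D - 1, D))
    for i in range(1, D):
        psi[i - 1, :i] = 1.0 / i
        psi[i - 1, i] = -1.0
        psi[i - 1] *= np.sqrt(i / (i + 1.0))
    return psi


def ilr(x, psi=None):
    """Isometric log-ratio: clr(x) projected on the rows of Psi (D-1 values)."""
    x = np.asarray(x, dtype=float)
    if psi is None:
        psi = helmert_basis(x.size)
    return clr(x) @ psi.T


x = np.array([0.5, 0.3, 0.2])  # a toy composition with D = 3 parts
```

On this toy composition, `alr(x)` and `ilr(x)` return two values while `clr(x)` returns three values that sum to zero, which matches the dimensionality remarks in the list above.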

Because ILR coordinates are difficult to interpret, recent studies have defined balances (Quinn and Erb 2020; Yang et al. 2021): a balance is the log-ratio of the geometric means of two non-overlapping groups of features defined by a sequential binary partition (SBP). Balances are therefore more interpretable than the common log-ratio transformations. Friedman and Alm (2012) also address metagenomic compositionality in order to build an interaction network of species. They proposed a robust approximation method called SparCC, which derives the correlation matrix from an approximate estimate of the variances of the log-ratios of species.
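
A single balance can be sketched as below. The text only states that a balance is the log-ratio of two geometric means; the scaling factor \(\sqrt{rs/(r+s)}\) used here, with r and s the sizes of the two groups, is the usual ILR balance convention and is an assumption of this sketch, as are the function name and the toy values.

```python
import numpy as np


def balance(x, num_idx, den_idx):
    """Log-ratio of the geometric means of two non-overlapping groups of parts.

    The sqrt(r * s / (r + s)) factor is the common ILR balance scaling
    (an assumption here; the text does not state it).
    """
    x = np.asarray(x, dtype=float)
    num_idx, den_idx = list(num_idx), list(den_idx)
    if set(num_idx) & set(den_idx):
        raise ValueError("the two groups must not overlap")
    r, s = len(num_idx), len(den_idx)
    log_gm_num = np.log(x[num_idx]).mean()  # log of the geometric mean
    log_gm_den = np.log(x[den_idx]).mean()
    return np.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den)


x = np.array([0.4, 0.1, 0.25, 0.25])  # toy composition
b = balance(x, [0, 1], [2, 3])        # parts {0, 1} against parts {2, 3}
```

A positive value means the first group dominates the second on average, and a negative value means the opposite, which is what makes a balance easier to read than a generic ILR coordinate.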

C Q-Classifier Schemes and Overview

Fig. 2. Q-Classifier overview: The algorithm takes the calculated metagenomic abundance data as input and starts by preprocessing them according to the selected parameters (such as the CLR transformation). The training phase is composed of one step of statistically credible rule generation followed by the training of a state-of-the-art classifier. In the end, the algorithm consists of a set of rules and a state-of-the-art classifier that are cascaded during the classification step.

Fig. 3. Q-Classifier training stage: An optional feature selection is performed first, then statistically credible subgroups are generated for all classes (control and case). Optimal unions of metagenomic signatures for each class are computed and gathered. Finally, a SOTA classifier is trained, giving more weight to the data that have been rejected.

Fig. 4. Q-Classifier classification stage: samples that are not rejected by the rule set receive an interpretable prediction, while the rejected ones are predicted by the fitted SOTA classifier.
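
The classification stage described in Fig. 4 can be sketched as a reject-and-cascade function. This is a simplified illustration under several assumptions: rules are written as conjunctions of basic patterns (variable, comparator, value) as in Note 2, the first matching rule decides the class (the figure does not say how conflicts between rules are resolved), and the rule names, taxon names, thresholds and the `fallback` function are all hypothetical.

```python
import operator

COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


def covers(rule, sample):
    """A rule covers a sample when all of its basic patterns hold."""
    return all(COMPARATORS[cmp](sample[var], value)
               for var, cmp, value in rule["patterns"])


def cascade_predict(sample, rules, fallback):
    """Return (label, rule name); the rule name is None for rejected samples."""
    for rule in rules:
        if covers(rule, sample):
            return rule["label"], rule["name"]  # interpretable prediction
    return fallback(sample), None               # rejected: use the classifier


# Hypothetical rules on CLR-transformed abundances.
rules = [
    {"name": "rule_case", "label": "case",
     "patterns": [("taxon_a", ">", 1.5), ("taxon_b", "<=", 0.0)]},
    {"name": "rule_control", "label": "control",
     "patterns": [("taxon_a", "<", -1.0)]},
]


def fallback(sample):
    """Stand-in for the fitted SOTA classifier (illustrative only)."""
    return "case" if sample["taxon_c"] > 0 else "control"


covered = cascade_predict(
    {"taxon_a": 2.0, "taxon_b": -0.3, "taxon_c": 0.1}, rules, fallback)
rejected = cascade_predict(
    {"taxon_a": 0.2, "taxon_b": 0.4, "taxon_c": -0.7}, rules, fallback)
```

Returning the rule name alongside the label keeps track of which predictions come with an interpretable signature and which fall back to the default classifier.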

D Q-Classifier Data Processing

Two data preprocessing options can be included in the algorithm:

  • A dimensionality reduction step that limits the number of rules generated by Q-Finder, which reduces runtime and improves statistical power when the p-values are adjusted with the Bonferroni correction. Indeed, one of the weaknesses of Q-Finder, and thus of the Q-Classifier, is its high complexity, \(O(G \times (F \times (M \times D)^C))\), where C is the rule complexity, F is the aggregation rules complexity, G is the number of groups (e.g., control/case), D is the number of variables and M is the maximum number of modalities per variable. In practice, recursive feature elimination with an SVM model is used to reduce the number of features.

  • A log-ratio transformation suited to the compositional nature of the data. As explained in Appendix B, the centered log-ratio transformation (CLR) is the only log-ratio transformation that does not alter the dimensionality, which is necessary to retain the variable names and produce an interpretable prediction.
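
The two preprocessing options can be combined as in the following sketch, which uses scikit-learn's `RFE` with a linear `SVC`. The synthetic data, the number of retained features, the elimination step and the ordering (CLR before feature elimination) are assumptions chosen for the example, not settings reported in the paper.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC


def clr_rows(rel):
    """Centered log-ratio applied to each row (sample) of a positive table."""
    logx = np.log(rel)
    return logx - logx.mean(axis=1, keepdims=True)


# Synthetic counts: 60 samples, 30 taxa, strictly positive (assumption).
rng = np.random.default_rng(0)
counts = rng.integers(1, 200, size=(60, 30)).astype(float)
y = np.repeat([0, 1], 30)       # 0 = control, 1 = case
counts[y == 1, :3] *= 5.0       # make a few taxa informative

rel = counts / counts.sum(axis=1, keepdims=True)  # relative abundances
X = clr_rows(rel)                                 # keeps all 30 columns

# Recursive feature elimination with a linear SVM; 10 features is arbitrary.
selector = RFE(SVC(kernel="linear"), n_features_to_select=10, step=1)
selector.fit(X, y)
X_reduced = selector.transform(X)
selected_columns = np.flatnonzero(selector.support_)
```

Because CLR keeps one column per taxon, `selected_columns` maps directly back to taxon names, which is what lets the rules generated afterwards remain readable.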

E Pseudo Code of Optimal Union

F Simulated Dataset Used to Train Taxa Classifier

To learn a good representation of the sequences and improve the classification power of the models, the simulated dataset must have similar abundances for each genome at the chosen taxonomic level. We used the CAMISIM framework (Fritz et al. 2019), which takes a file of genome abundances as configuration. CAMISIM takes the size of each genome into account in addition to its abundance, so to generate abundances that are equally proportioned between genomes we use the following formulas:

$$A_g = \frac{1}{L_g}$$
$$AG_g = \frac{A_g}{\sum _i^{|G|}A_i}, g \in G$$

Where G is the set of genomes, \(L_g\) is the length in base pairs of the genome g and \(AG_g\) stands for the abundance of the genome g, equally balanced between genomes. However, if a certain taxonomic level is defined, such as species, the abundance formula should be modified accordingly. This prevents a species from appearing too often or too rarely, which can happen when several genomes belong to the same species. The formulas are defined by:

$$L_s = \sum _i^{|G_s|} L_{i} \; ; \; L\_norm_s = \frac{L_s}{\sum _i^{|S|}L_{i}}, \; s \in S$$
$$A_g = \frac{1}{L\_norm_s \times (L_g / L_s)} \; ; \; A_g = \frac{A_g}{\sum _i^{|G_s|} A_{i}}$$
$$AS_g = \frac{A_g}{\sum _i^{|G|}A_{i}}$$

Where S is the set of species, \(G_s\) is the set of genomes of the species s, \(L_s\) is the length in base pairs of the species s and \(AS_g\) stands for the abundance of the genome g, equally balanced between genomes at the species level.
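
The two sets of formulas can be followed step by step in plain Python. This sketch only reproduces the arithmetic above; it does not call CAMISIM, and the genome names, lengths and species assignments are hypothetical.

```python
def genome_balanced(lengths):
    """AG_g: A_g = 1 / L_g, normalised over all genomes."""
    a = {g: 1.0 / length for g, length in lengths.items()}
    total = sum(a.values())
    return {g: v / total for g, v in a.items()}


def species_balanced(lengths, species_of):
    """AS_g: abundances balanced between species, following the formulas above."""
    # L_s: total length of the genomes of each species.
    L_s = {}
    for g, length in lengths.items():
        L_s[species_of[g]] = L_s.get(species_of[g], 0.0) + length
    total_length = sum(L_s.values())
    L_norm = {s: length / total_length for s, length in L_s.items()}

    # A_g = 1 / (L_norm_s * (L_g / L_s)), then normalised within each species.
    a = {g: 1.0 / (L_norm[species_of[g]] * (lengths[g] / L_s[species_of[g]]))
         for g in lengths}
    within = {}
    for g, v in a.items():
        within[species_of[g]] = within.get(species_of[g], 0.0) + v
    a = {g: v / within[species_of[g]] for g, v in a.items()}

    # AS_g: final normalisation over all genomes.
    total = sum(a.values())
    return {g: v / total for g, v in a.items()}


# Hypothetical genomes: g1 and g2 belong to the same species.
lengths = {"g1": 2_000_000, "g2": 4_000_000, "g3": 5_000_000}
species_of = {"g1": "s1", "g2": "s1", "g3": "s2"}

AG = genome_balanced(lengths)
AS = species_balanced(lengths, species_of)
```

With these toy values each species receives half of the total abundance in `AS`, whereas in `AG` the species with two genomes would receive more than the other.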

3.5M and 1.5M Illumina reads with an average length of 150 base pairs were simulated for the training and validation sets respectively, corresponding to a depth of coverage (mapping depth) of 27% over all the genomes. The initial abundances were calculated with the formulas above so as to be equally proportioned at the species level. The model is trained at the species level rather than the genome level because genomes of the same species are close enough to share the same prediction, and a model with fewer classes is easier to train. A nearly equal number of sequences per species is not representative of real metagenomes, where abundances follow exponential distributions. Nevertheless, for read classification, it prevents the classifier from focusing on and mostly predicting the predominant classes, and lets it learn a more robust embedding. Simulated data also have the advantage over real NGS data of providing the origin of each sequence, which allows a supervised algorithm to be trained.

G Pseudo Code of Q-Classifier’s Training and Classification Stage

figure b
figure c

H Rules Coverage Analysis on the Cirrhosis Dataset

(See Fig. 5)

Fig. 5. Venn diagram of the subgroups in the optimal union of the Cirrhosis dataset. Each circle corresponds to a subgroup characterized by a rule (with a name referring to the one in Sect. 4.2). The values inside the circles correspond to the number of samples in the subgroups. When a value lies between several circles, it represents the number of samples shared by the corresponding subgroups.

The best rules for the control class and the case class are “RCO1” and “RCA1”, respectively. The subgroups associated with these rules have a large intersection with the other subgroups in the optimal union. This is even more pronounced on the validation set, where one subgroup is sometimes completely included in another. Such inclusions can occur in validation because the optimal union of the rules is computed on the training set. This visualization allows us to determine which samples are shared by the subgroups and which are specific to each of them.
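
The quantities read from such a Venn diagram can also be computed directly from the sets of samples covered by each rule. The sketch below is illustrative only: the rule names and sample identifiers are hypothetical and do not come from the Cirrhosis dataset.

```python
from itertools import combinations


def overlap_report(subgroups):
    """Pairwise shared-sample counts and inclusion flags between subgroups."""
    report = {}
    for (a, set_a), (b, set_b) in combinations(subgroups.items(), 2):
        report[(a, b)] = {
            "shared": len(set_a & set_b),
            "a_in_b": set_a <= set_b,  # first subgroup fully inside the second
            "b_in_a": set_b <= set_a,  # second subgroup fully inside the first
        }
    return report


# Hypothetical subgroups: sets of sample identifiers covered by each rule.
subgroups = {
    "rule_1": {"s01", "s02", "s03", "s04", "s05"},
    "rule_2": {"s02", "s03", "s04"},
    "rule_3": {"s05", "s06"},
}

report = overlap_report(subgroups)
```

Here `rule_2` is entirely contained in `rule_1`, the situation described above for some subgroups on the validation set.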

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Queyrel, M., Templier, A., Zucker, JD. (2021). Reject and Cascade Classifier with Subgroup Discovery for Interpretable Metagenomic Signatures. In: Kamp, M., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1524. Springer, Cham. https://doi.org/10.1007/978-3-030-93736-2_5

  • DOI: https://doi.org/10.1007/978-3-030-93736-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93735-5

  • Online ISBN: 978-3-030-93736-2

  • eBook Packages: Computer Science (R0)