Abstract
Over the past decade, technological advances have made high-speed, high-resolution sequencing of genetic material possible at ever lower cost (from millions of dollars to one hundred dollars). In this context, the human microbiome has demonstrated its ability to support the stratification and classification of various human diseases. The gut microbiota is thus set to play a key role in precision medicine as a “super-integrator” of patient status, and identifying metagenomic signatures is becoming increasingly important. To address the interpretability/accuracy trade-off, we propose a hybrid approach based on a cascade classifier that combines a first step of Subgroup Discovery (for interpretability) with a subsequent classifier model (for accuracy). With this approach, different interpretable signatures stratify the maximum possible number of patients, while the remaining patients are handled by a default non-interpretable signature. Several datasets from the NCBI public repository covering different diseases (colorectal cancer, cirrhosis, diabetes, obesity) were used to evaluate how well our approach builds metagenomic disease signatures that are both accurate and interpretable. The results show that the approach reaches performance comparable or superior to state-of-the-art approaches while offering better interpretability than black-box models.
Notes
- 1.
Taxon (plural taxa): an entity grouping all living organisms that share certain well-defined characteristics. The term is used in phylogenetic classification to group living beings according to various criteria, from the most general rank to the most specific: domain, kingdom, phylum, class, order, family, genus, species.
- 2.
A basic pattern corresponds to an elementary unit of a rule characterized by a variable, a comparator and a value (e.g. age > 10).
- 3.
- 4.
Isomorphic: meaning that the mapping between the simplex and the new basis is preserved.
- 5.
Isometric: meaning that the distances in the simplex are equivalent to the distances of the new transformed values.
References
Pasolli, E., Truong, D.T., Malik, F.: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights (2015)
Esnault, C., Gadonna, M.-L., Queyrel, M., Templier, A., Zucker, J.-D.: Q-Finder: an algorithm for credible subgroup discovery in clinical data analysis - an application to the international diabetes management practice study. Front. Artif. Intell. 3, 559927 (2020)
Friedman, J., Alm, E.J.: Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 8(9), e1002687 (2012)
Fritz, A., Hofmann, P., Majda, S., et al.: CAMISIM: simulating metagenomes and microbial communities. Microbiome 7(1), 17 (2019)
Harris, Z.N., Dhungel, E., Mosior, M., Ahn, T.-H.: Massive metagenomic data analysis using abundance-based machine learning. Biol. Direct 14(1), 12 (2019)
Imparato, A.: Interactive Subgroup Discovery, p. 134 (2012)
Korepanova, N.: Subgroup discovery for treatment optimization. In: Workshop on Data Analysis in Medicine, WDAM 2017, pp. 48–41 (2017)
Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K., Hugenholtz, P.: A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72(4), 557–578 (2008)
Loh, W.-Y., Cao, L., Zhou, P.: Subgroup identification for precision medicine: a comparative review of 13 methods. Wiley Interdisc. Rev. Data Min. Knowl. Disc. 9(5), 604–621 (2019)
Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions, p. 10 (2017)
Menegaux, R., Vert, J.-P.: Continuous embeddings of DNA sequencing reads and application to metagenomics. J. Comput. Biol. 26(6), 509–518 (2019)
Le Chatelier, E., Nielsen, T., et al.: Richness of human gut microbiome correlates with metabolic markers. Nature 500(7464), 541–546 (2013)
Nayfach, S., Pollard, K.S.: Toward accurate and quantitative comparative metagenomics. Cell 166(5), 1103–1116 (2016)
Oh, M., Zhang, L.: DeepMicro: deep representation learning for disease prediction based on microbiome data. Sci. Rep. 10(1), 6026 (2020)
Pasolli, E., Truong, D.T., Malik, F., Waldron, L., Segata, N.: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLOS Comput. Biol. 12(7), e1004977 (2016)
Petrosino, J.F.: The microbiome in precision medicine: the way forward. Genome Med. 10(1), 12 (2018)
Prifti, E., Chevaleyre, Y., Hanczar, B., et al.: Interpretable and accurate prediction models for metagenomics data. GigaScience 9(3), giaa010 (2020)
Qin, J., et al.: A metagenome-wide association study of gut microbiota in type 2 diabetes, p. 6 (2012)
Qin, N., Yang, F., Li, A., et al.: Alterations of the human gut microbiome in liver cirrhosis. Nature 513(7516), 59–64 (2014)
Queyrel, M., Prifti, E., Templier, A., Zucker, J.-D.: Towards end-to-end disease prediction from raw metagenomic data. Int. J. Biomed. Biol. Eng. 15(6), 234–246 (2021)
Quince, C., Walker, A.W., Simpson, J.T., Loman, N.J., Segata, N.: Shotgun metagenomics, from sampling to sequencing and analysis, p. 27 (2017)
Quinn, T.P., Erb, I.: Interpretable log contrasts for the classification of health biomarkers: a new approach to balance selection. mSystems 5(2), e00230-19 (2020)
Segata, N., Izard, J., Waldron, L., et al.: Metagenomic biomarker discovery and explanation. Genome Biol. 12(6), R60 (2011)
Thomas, A.M., Manghi, P., Asnicar, F., et al.: Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. 25(4), 667–678 (2019)
Wen, C., Zheng, Z., Shao, T., et al.: Quantitative metagenomics reveals unique gut microbiome biomarkers in ankylosing spondylitis. Genome Biol. 18(1), 142 (2017)
Wu, G., Zhao, N., Zhang, C., Lam, Y.Y., Zhao, L.: Guild-based analysis for understanding gut microbiome in human health and diseases. Genome Med. 13(1), 22 (2021)
Yang, F., Zou, Q., Gao, B.: GutBalance: a server for the human gut microbiome-based disease prediction and biomarker discovery with compositionality addressed. Brief. Bioinf. 22(5), bbaa436 (2021)
Zeller, G., Tap, J., Voigt, A.Y., et al.: Potential of fecal microbiota for early stage detection of colorectal cancer. Mol. Syst. Biol. 10(11), 766 (2014)
Appendices
Appendix
A Example of Metagenomic Abundance Table
B Compositional Data and Log-Ratio Transformations
One major and often underestimated difficulty is the compositional nature of quantitative metagenomic data. The number of sequences generated by NGS varies from one sample or study to another, so absolute counts of the biological objects in a sample are not representative of its real composition. A normalization step must therefore be applied: each abundance is divided by the total number of taxonomic units counted in the sample, resulting in a table of relative abundances (see Fig. 1). Such tables are compositional data, that is, vectors representing parts of a whole that carry only relative information. Relative abundances are percentages of the total abundance, which restricts them to a sample space with two constraints: the values of each sample always sum to 1, and each value lies in the interval [0, 1]. These constraints require specific mathematical transformations to avoid misinterpretations or irreproducible analyses (Yang et al. 2021). The transformations most often used are log-ratio transformations: the Additive Log-Ratio (ALR), the Centered Log-Ratio (CLR) and the Isometric Log-Ratio (ILR). The choice of method depends on the desired interpretation, as described below:
-
ALR: isomorphic (see Note 4) but not isometric (see Note 5). It maps the original D features to a \(D-1\)-dimensional feature space. Formula:
$$alr(x) = \left[\ln\frac{x_1}{x_D}, \ln\frac{x_2}{x_D}, \ldots, \ln\frac{x_{D-1}}{x_D}\right]$$
-
CLR: both isometric and isomorphic. It removes the value-range restriction, but it does not remove the sum constraint (the transformed values sum to zero). Unlike ALR or ILR, it does not change the number of features, which makes it easier to train interpretable models. Formula:
$$clr(x) = \left[\ln\frac{x_1}{g(x)}, \ln\frac{x_2}{g(x)}, \ldots, \ln\frac{x_D}{g(x)}\right]$$
where \(g(x)\) is the geometric mean of x.
-
ILR: isomorphic and isometric. It is often the most suitable transformation for handling the sum and value-range constraints because it is associated with orthonormal bases in the simplex. Nevertheless, like ALR, it maps the original D features to a \(D-1\)-dimensional feature space. Formula:
$$ilr(x) = clr(x) \cdot \varPsi '$$
$$\varPsi \varPsi ' = I_{D-1}$$
where \(\varPsi \) is a \((D-1, D)\)-matrix whose rows are \(clr(e_i)\) and \(\{e_1, e_2, \ldots, e_{D-1}\}\) is a generic orthonormal basis of the simplex \(S^D\).
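The three transformations can be illustrated with a minimal sketch in plain Python. It is not taken from the Q-Classifier implementation: the function names are hypothetical, the Helmert-type matrix is only one possible choice of orthonormal basis \(\varPsi \), and all parts are assumed to be strictly positive (in real abundance tables, zeros would first have to be replaced, for instance with a pseudo-count).

```python
import math

def closure(counts):
    """Normalize raw counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def alr(x):
    """Additive log-ratio: D parts -> D-1 values, last part used as reference."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]

def clr(x):
    """Centered log-ratio: D parts -> D values that sum to zero."""
    log_x = [math.log(xi) for xi in x]
    mean_log = sum(log_x) / len(log_x)  # log of the geometric mean g(x)
    return [lx - mean_log for lx in log_x]

def helmert_basis(d):
    """One possible Psi (assumption): a (D-1, D) matrix whose rows are
    orthonormal and each sum to zero."""
    rows = []
    for i in range(1, d):
        scale = math.sqrt(i / (i + 1))
        rows.append([scale / i] * i + [-scale] + [0.0] * (d - i - 1))
    return rows

def ilr(x):
    """Isometric log-ratio: clr(x) multiplied by the transpose of Psi."""
    c = clr(x)
    return [sum(p * ci for p, ci in zip(row, c)) for row in helmert_basis(len(x))]

# Toy sample with three taxa (illustrative counts, not real data).
x = closure([10, 30, 60])
print(alr(x))  # 2 values
print(clr(x))  # 3 values, one per original taxon
print(ilr(x))  # 2 values
```

Only the CLR output keeps one value per original taxon, which is the property that matters when variable names must be retained. The ILR output has the same Euclidean norm as the CLR output, reflecting the isometry.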
As ILR transformations are difficult to interpret, recent studies have defined a quantity called a balance (Quinn and Erb 2020; Yang et al. 2021), which is the log-ratio of the geometric means of two non-overlapping groups of features defined by a sequential binary partition (SBP). Balances are therefore more interpretable than common log-ratio transformations. Metagenomic compositionality is also addressed by Friedman and Alm (2012) to build an interaction network of species. They proposed a robust approximation method called SparCC, which derives the correlation matrix from a rough estimate of the variance of the log-ratios of species.
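A balance as described above can be sketched as follows. This is a hedged illustration rather than the method of the cited studies: the function names and the `scaled` parameter are hypothetical, the two groups are given directly as index lists instead of being derived from an SBP, and parts are assumed to be strictly positive.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(x, numerator, denominator, scaled=False):
    """Log-ratio of the geometric means of two disjoint groups of parts.

    numerator / denominator are lists of indices into x. With scaled=True,
    the usual ILR normalization constant sqrt(r*s / (r+s)) is applied,
    where r and s are the group sizes.
    """
    if set(numerator) & set(denominator):
        raise ValueError("the two groups must not overlap")
    g_num = geometric_mean([x[i] for i in numerator])
    g_den = geometric_mean([x[i] for i in denominator])
    value = math.log(g_num / g_den)
    if scaled:
        r, s = len(numerator), len(denominator)
        value *= math.sqrt(r * s / (r + s))
    return value

# Toy composition with three taxa (illustrative values, not real data).
x = [0.1, 0.3, 0.6]
print(balance(x, numerator=[0], denominator=[1, 2]))
```

A negative balance means that the first group is, on average, less abundant than the second. Because only ratios are involved, multiplying every part by the same constant leaves the result unchanged.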
C Q-Classifier Schemes and Overview
D Q-Classifier Data Processing
Two data preprocessing options can be included in the algorithm:
-
A dimensionality reduction operation to limit the number of rules generated by the Q-Finder, which reduces runtime and improves statistical power when the p-values are adjusted with the Bonferroni correction. Indeed, one of the weaknesses of the Q-Finder, and thus of the Q-Classifier, is its high complexity, which is equal to \(O(G \times (F \times (M \times D)^C))\), where C is the rule complexity, F is the complexity of the rule aggregation, G is the number of groups (e.g., control/case), D is the number of variables and M is the maximum number of modalities per variable. In practice, recursive feature elimination with an SVM model is used to perform the feature reduction.
-
A log-ratio transformation suited to the compositional nature of the data. As explained in Appendix B, the centered log-ratio transformation (CLR) is the only log-ratio transformation that does not alter the dimensionality, which is necessary to retain the variable names and produce an interpretable prediction.
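The effect of reducing the number of variables D can be made concrete with a small back-of-the-envelope sketch. The function names and all numeric values are hypothetical, and the sketch assumes that each candidate rule counted by the complexity bound corresponds to one statistical test in the Bonferroni correction, which may overestimate the number of rules the Q-Finder actually tests.

```python
def candidate_rule_bound(n_groups, agg_complexity, max_modalities,
                         n_variables, rule_complexity):
    """Order-of-magnitude bound G * F * (M * D)^C on the rules explored."""
    return (n_groups * agg_complexity
            * (max_modalities * n_variables) ** rule_complexity)

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

# Illustrative values only: G=2 groups, F=1, M=10 modalities, C=2.
before = candidate_rule_bound(2, 1, 10, n_variables=1000, rule_complexity=2)
after = candidate_rule_bound(2, 1, 10, n_variables=50, rule_complexity=2)

print(before, after)  # bound before and after feature reduction
print(bonferroni_threshold(0.05, before))
print(bonferroni_threshold(0.05, after))
```

With rule complexity C = 2, dividing D by 20 divides the bound by 400, and the per-test significance threshold becomes 400 times less stringent, which is the gain in statistical power mentioned above.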
E Pseudo Code of Optimal Union
F Simulated Dataset Used to Train Taxa Classifier
To learn a good representation of the sequences and improve the classification power of the models, the simulated dataset must have similar abundances for each genome at the chosen taxonomy level. We used the CAMISIM framework (Fritz et al. 2019), which takes a file of genome abundances as configuration. CAMISIM takes the size of the genome into account in addition to its abundance, so to generate abundances that are equally balanced between genomes we use the following formula:
Where G is the set of genomes, \(L_g\) is the length in base pairs of the genome g and \(AG_g\) stands for the equally balanced abundance of the genome g. However, if a certain taxonomic level is defined, such as species, the abundance formula should be modified accordingly. This prevents a species from appearing too often or not often enough, which could happen when several genomes belong to the same species. The formulas are defined by:
Where S is the set of species, \(G_s\) is the set of genomes of the species s, \(L_s\) is the length in base pairs of the species s and \(AS_g\) stands for the abundance of the genome g, equally balanced at the species level.
3.5M and 1.5M Illumina reads with an average length of 150 base pairs were simulated for the training and validation sets respectively, corresponding to a depth of coverage (mapping depth) of 27% over all the genomes. The initial abundances were calculated with the formulas above so as to be equally balanced at the species level. Indeed, the model is trained at the species level rather than the genome level because genomes of the same species are close enough to receive the same prediction, and the model is easier to train with a smaller number of classes. A nearly equal number of sequences for each species is not representative of real metagenomes, where abundances follow exponential distributions. Nevertheless, for read classification, this prevents the classifier from focusing on and predicting only the predominant classes, and it leads to a more robust embedding. Compared with real NGS data, these simulated data have the advantage of providing the origin of each sequence, which allows a supervised algorithm to be trained.
G Pseudo Code of Q-Classifier’s Training and Classification Stage
H Rules Coverage Analysis on the Cirrhosis Dataset
(See Fig. 5)
The best rules for the control class and the case class are “RCO1” and “RCA1”, respectively. We notice that the subgroups associated with these rules have a high intersection with the other subgroups in the optimal union. This is even more pronounced on the validation set, where one subgroup is sometimes completely included in another. Such results are possible in validation because the optimal union of the rules is computed on the training set. This visualization allows us to determine where the subgroups’ samples are disjoint and where they overlap.
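The kind of overlap discussed here can be quantified with simple set operations. The sketch below is only an illustration under stated assumptions: each subgroup is represented as the set of sample identifiers covered by its rule, the function name is hypothetical, and the sample identifiers and the rule name “RCO2” are made up for the example rather than taken from the cirrhosis dataset.

```python
def overlap_report(subgroups):
    """For every ordered pair (a, b) of rules, return the share of the
    samples covered by a that are also covered by b (1.0 = a included in b)."""
    report = {}
    for name_a, samples_a in subgroups.items():
        for name_b, samples_b in subgroups.items():
            if name_a != name_b and samples_a:
                report[(name_a, name_b)] = (
                    len(samples_a & samples_b) / len(samples_a)
                )
    return report

# Hypothetical coverage of two control rules (made-up sample identifiers).
subgroups = {
    "RCO1": {"s1", "s2", "s3", "s4"},
    "RCO2": {"s2", "s3"},
}

report = overlap_report(subgroups)
print(report[("RCO2", "RCO1")])  # share of RCO2 samples also covered by RCO1
print(report[("RCO1", "RCO2")])  # share of RCO1 samples also covered by RCO2

# Samples covered by the union of the rules.
covered = set().union(*subgroups.values())
print(len(covered))
```

A value of 1.0 for an ordered pair corresponds to the complete inclusion of one subgroup in another, as observed on the validation set.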
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Queyrel, M., Templier, A., Zucker, JD. (2021). Reject and Cascade Classifier with Subgroup Discovery for Interpretable Metagenomic Signatures. In: Kamp, M., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1524. Springer, Cham. https://doi.org/10.1007/978-3-030-93736-2_5
DOI: https://doi.org/10.1007/978-3-030-93736-2_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93735-5
Online ISBN: 978-3-030-93736-2
eBook Packages: Computer Science, Computer Science (R0)