Abstract
Quantification learning is the task of estimating the target label distribution under label shift. In this paper, we first present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in the previous literature. We derive a general performance bound for DFM procedures, improving in several key aspects upon bounds previously derived in particular cases. We then extend this analysis to study the robustness of DFM procedures in the misspecified setting, under departure from the exact label shift hypothesis, in particular when the target is contaminated by an unknown distribution. These theoretical findings are confirmed by a detailed numerical study on simulated and real-world datasets. We also introduce an efficient, scalable and robust version of kernel-based DFM using Random Fourier Features.
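To illustrate the kernel-based DFM idea with Random Fourier Features, the following is a minimal sketch (not the authors' implementation; the data, dimensions and bandwidth below are hypothetical): the target mean embedding is matched by a mixture of class-wise source mean embeddings, and the mixture weights estimate the target label proportions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-class source data: class-conditional Gaussians in R^d.
n, d, D = 2000, 2, 200            # points per class, input dim, number of RFF
X0 = rng.normal(loc=-1.0, size=(n, d))
X1 = rng.normal(loc=+1.0, size=(n, d))

# Random Fourier features approximating a Gaussian kernel of bandwidth sigma:
# phi(x) = sqrt(2/D) * cos(W^T x + b), W ~ N(0, sigma^-2), b ~ U[0, 2*pi].
sigma = 1.0
W = rng.normal(scale=1.0 / sigma, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Class-wise mean embeddings of the source form the columns of Phi.
Phi = np.stack([rff(X0).mean(axis=0), rff(X1).mean(axis=0)], axis=1)

# Target sample generated under label shift with true proportions (0.2, 0.8).
m = 3000
is_cls1 = rng.random(m) < 0.8
Xt = np.where(is_cls1[:, None],
              rng.normal(loc=+1.0, size=(m, d)),
              rng.normal(loc=-1.0, size=(m, d)))
phi_t = rff(Xt).mean(axis=0)

# DFM estimate: least-squares match of the target embedding by a mixture of
# class embeddings, then projection back onto the probability simplex.
pi_hat, *_ = np.linalg.lstsq(Phi, phi_t, rcond=None)
pi_hat = np.clip(pi_hat, 0.0, None)
pi_hat /= pi_hat.sum()
print(pi_hat)  # close to the true proportions (0.2, 0.8)
```

With well-separated classes and enough random features, the estimated proportions concentrate around the true target distribution, in line with the performance bounds discussed in the paper.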
Notes
1. There can be large variability between samples coming from different laboratories, while there is homogeneity within each lab. The label shift hypothesis is therefore reasonable when source and target are taken from the same lab.
References
Alexandari, A., Kundaje, A., Shrikumar, A.: Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In: International Conference on Machine Learning, pp. 222–232. PMLR (2020)
Azizzadenesheli, K., Liu, A., Yang, F., Anandkumar, A.: Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734 (2019)
Barranquero, J., Díez, J., del Coz, J.J.: Quantification-oriented learning based on reliable classifiers. Pattern Recogn. 48(2), 591–604 (2015)
Barranquero, J., González, P., Díez, J., Del Coz, J.J.: On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recogn. 46(2), 472–482 (2013)
Bigot, J., Freulon, P., Hejblum, B.P., Leclaire, A.: On the potential benefits of entropic regularization for smoothing Wasserstein estimators. arXiv preprint arXiv:2210.06934 (2022)
Brusic, V., Gottardo, R., Kleinstein, S.H., Davis, M.M.: Computational resources for high-dimensional immune analysis from the human immunology project consortium. Nat. Biotechnol. 32, 146–148 (2014)
Camoriano, R., Angles, T., Rudi, A., Rosasco, L.: Nytro: when subsampling meets early stopping. In: Artificial Intelligence and Statistics, pp. 1403–1411. PMLR (2016)
Charlier, B., Feydy, J., Glaunès, J.A., Collin, F.D., Durif, G.: Kernel operations on the GPU, with autodiff, without memory overflows. J. Mach. Learn. Res. 22(74), 1–6 (2021). https://www.kernel-operations.io/keops/index.html
Tachet des Combes, R., Zhao, H., Wang, Y.X., Gordon, G.J.: Domain adaptation with conditional distribution matching and generalized label shift. In: Advances in Neural Information Processing Systems, vol. 33, pp. 19276–19289 (2020)
Du Plessis, M.C., Sugiyama, M.: Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Netw. 50, 110–119 (2014)
Dussap, B.: Distribution Feature Matching for Label Shift (2023). https://plmlab.math.cnrs.fr/dussap/Label-shift-DFM
Dussap, B., Blanchard, G., Chérief-Abdellatif, B.E.: Label shift quantification with robustness guarantees via distribution feature matching. arXiv preprint arXiv:2306.04376 (2023)
Esuli, A., Fabris, A., Moreo, A., Sebastiani, F.: Learning to Quantify. Springer, Cham (2023)
Finak, G., et al.: Standardizing flow cytometry immunophenotyping analysis from the human immunophenotyping consortium. Sci. Rep. 6(1), 1–11 (2016)
Forman, G.: Counting positives accurately despite inaccurate classification. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 564–575. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_55
Forman, G.: Quantifying trends accurately despite classifier error and class imbalance. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 157–166 (2006)
Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Disc. 17(2), 164–206 (2008)
Garg, S., Wu, Y., Balakrishnan, S., Lipton, Z.C.: A unified view of label shift estimation. arXiv preprint arXiv:2003.07554 (2020)
González, P., Castaño, A., Chawla, N.V., Coz, J.J.D.: A review on quantification learning. ACM Comput. Surv. (CSUR) 50(5), 1–40 (2017)
González-Castro, V., Alaiz-Rodríguez, R., Alegre, E.: Class distribution estimation based on the Hellinger distance. Inf. Sci. 218, 146–164 (2013)
Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13(1), 723–773 (2012)
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., Schölkopf, B.: Covariate shift by kernel mean matching. In: Dataset Shift in Machine Learning. MIT Press, Cambridge (2009)
Hopkins, D.J., King, G.: A method of automated nonparametric content analysis for social science. Am. J. Political Sci. 54(1), 229–247 (2010)
Iyer, A., Nath, S., Sarawagi, S.: Maximum mean discrepancy for class ratio estimation: convergence bounds and kernel selection. In: International Conference on Machine Learning, pp. 530–538. PMLR (2014)
Kawakubo, H., Du Plessis, M.C., Sugiyama, M.: Computationally efficient class-prior estimation under class balance change using energy distance. IEICE Trans. Inf. Syst. 99(1), 176–186 (2016)
Lipton, Z., Wang, Y.X., Smola, A.: Detecting and correcting for label shift with black box predictors. In: International Conference on Machine Learning, pp. 3122–3130. PMLR (2018)
Maletzke, A., dos Reis, D., Cherman, E., Batista, G.: DyS: a framework for mixture models in quantification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4552–4560 (2019)
Milli, L., Monreale, A., Rossetti, G., Giannotti, F., Pedreschi, D., Sebastiani, F.: Quantification trees. In: 2013 IEEE 13th International Conference on Data Mining, pp. 528–536. IEEE (2013)
Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al.: Kernel mean embedding of distributions: a review and beyond. Found. Trends Mach. Learn. 10(1–2), 1–141 (2017)
Patel, V.M., Gopalan, R., Li, R., Chellappa, R.: Visual domain adaptation: a survey of recent advances. IEEE Signal Process. Mag. 32(3), 53–69 (2015)
Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. MIT Press, Cambridge (2008)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, vol. 20 (2007)
Rudi, A., Camoriano, R., Rosasco, L.: Less is more: Nyström computational regularization. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Rudi, A., Rosasco, L.: Generalization properties of learning with random features. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Saerens, M., Latinne, P., Decaestecker, C.: Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Comput. 14(1), 21–41 (2002)
Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 41(5), 2263–2291 (2013)
Sutherland, D.J., Schneider, J.: On the error of random Fourier features. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 862–871 (2015)
Zhang, K., Schölkopf, B., Muandet, K., Wang, Z.: Domain adaptation under target and conditional shift. In: International Conference on Machine Learning, pp. 819–827. PMLR (2013)
Acknowledgements
B. Dussap was supported by the program Paris Region Ph.D. of DIM Mathinnov. G. Blanchard acknowledges support of the ANR under ANR-19-CHIA-0021-01 “BiSCottE”, IDEX REC-2019-044, and of the DFG under SFB1294 - 318763901.
Ethics declarations
Ethical Statement
Label shift quantification has uses in a number of application domains; the results in this paper are chiefly oriented towards theory and general methodology, so we do not discuss any particular application. We only note that the flow cytometry data used for the proof-of-concept experiments is publicly available from a reputable scientific consortium and has, to our knowledge, been collected following all established ethical standards.
The original Classify & Count [15] method for label shift quantification is known to inherit the potential biases of the classification method it is based on (i.e. misclassification errors can be very unevenly distributed across classes and "favor" majority classes). The Adjusted Classify & Count (ACC) approach and related methods [17, 26] aim at rectifying this bias. In the present paper, we go one step further: we analyze certain robustness properties of the proposed label shift quantification methods and introduce the contaminated label shift (\(\mathcal {CLS}\)) setting, with the goal of investigating the trustworthiness of such methods under mild violations of the standard label shift model. While robustness is certainly desirable for improved reliability in practice, it does not imply immunity against biases; moreover, the user should always be wary of stronger model violations between reference and test data, in particular class-conditional distribution shifts. We therefore recommend the established good practice of regularly checking possible biases or drifts from the model on control data, in particular for sensitive applications.
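For concreteness, the ACC bias correction mentioned above can be sketched as follows (a minimal illustration with a hypothetical confusion matrix, not the paper's experimental code): the raw Classify & Count proportions q relate to the true class proportions pi through q = C pi, where C is the classifier's column-stochastic confusion matrix, and ACC inverts this relation.

```python
import numpy as np

# Hypothetical confusion matrix of a black-box classifier:
# C[i, j] = P(predict class i | true class j), estimated on labelled source data.
C = np.array([[0.9, 0.3],
              [0.1, 0.7]])

# Raw Classify & Count: fraction of target points predicted in each class.
q = np.array([0.48, 0.52])

# The naive CC estimate is biased towards "easy"/majority classes; ACC undoes
# the misclassification mixing by solving q = C @ pi for pi.
pi_acc = np.linalg.solve(C, q)
pi_acc = np.clip(pi_acc, 0.0, 1.0)
pi_acc /= pi_acc.sum()
print(pi_acc)  # -> approximately [0.3, 0.7]
```

In this toy example, CC alone would report proportions (0.48, 0.52), while inverting the confusion matrix recovers the true target proportions (0.3, 0.7); DFM procedures such as BBSE can be seen as generalisations of this matching step.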
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dussap, B., Blanchard, G., Chérief-Abdellatif, B.E. (2023). Label Shift Quantification with Robustness Guarantees via Distribution Feature Matching. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds.) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14173. Springer, Cham. https://doi.org/10.1007/978-3-031-43424-2_5
Print ISBN: 978-3-031-43423-5
Online ISBN: 978-3-031-43424-2