Label Shift Quantification with Robustness Guarantees via Distribution Feature Matching

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Abstract

Quantification learning deals with the task of estimating the target label distribution under label shift. In this paper, we first present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in previous literature. We derive a general performance bound for DFM procedures, improving in several key aspects upon previous bounds derived in particular cases. We then extend this analysis to study robustness of DFM procedures in the misspecified setting under departure from the exact label shift hypothesis, in particular in the case of contamination of the target by an unknown distribution. These theoretical findings are confirmed by a detailed numerical study on simulated and real-world datasets. We also introduce an efficient, scalable and robust version of kernel-based DFM using Random Fourier Features.
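
For readers who want a concrete picture of the matching principle, the following is a minimal sketch of kernel-based DFM with Random Fourier Features [32]. It is illustrative only, not the authors' implementation (their code is available in [11]): the function names, the Gaussian kernel bandwidth `sigma`, and the use of non-negative least squares followed by renormalisation in place of an exactly simplex-constrained solver are all simplifying assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def rff(X, W, b):
    """Random Fourier Features approximating a Gaussian kernel:
    phi(x) = sqrt(2/D) * cos(W x + b), with W ~ N(0, I/sigma^2), b ~ U[0, 2*pi)."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

def dfm_rff(X_source, y_source, X_target, D=500, sigma=1.0, seed=0):
    """Estimate target class proportions by matching mean RFF embeddings."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(D, X_source.shape[1]))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    classes = np.unique(y_source)
    # Column c holds the mean embedding of source class c; the target
    # mean embedding is the right-hand side of the least-squares problem.
    M = np.column_stack(
        [rff(X_source[y_source == c], W, b).mean(axis=0) for c in classes]
    )
    mu = rff(X_target, W, b).mean(axis=0)
    # Non-negative least squares, then renormalisation onto the simplex
    # (a simplification of an exactly simplex-constrained solver).
    alpha, _ = nnls(M, mu)
    return alpha / alpha.sum()
```

The returned vector is the mixture weight estimate \(\hat{\alpha}\): the combination of source class embeddings that best matches the target embedding, which is the defining idea of distribution feature matching.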

Notes

  1. There can be large variability between samples coming from different laboratories, while there is homogeneity within each lab. The label shift hypothesis is therefore reasonable when source and target come from the same lab.

References

  1. Alexandari, A., Kundaje, A., Shrikumar, A.: Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In: International Conference on Machine Learning, pp. 222–232. PMLR (2020)

  2. Azizzadenesheli, K., Liu, A., Yang, F., Anandkumar, A.: Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734 (2019)

  3. Barranquero, J., Díez, J., del Coz, J.J.: Quantification-oriented learning based on reliable classifiers. Pattern Recogn. 48(2), 591–604 (2015)

  4. Barranquero, J., González, P., Díez, J., del Coz, J.J.: On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recogn. 46(2), 472–482 (2013)

  5. Bigot, J., Freulon, P., Hejblum, B.P., Leclaire, A.: On the potential benefits of entropic regularization for smoothing Wasserstein estimators. arXiv preprint arXiv:2210.06934 (2022)

  6. Brusic, V., Gottardo, R., Kleinstein, S.H., Davis, M.M.: Computational resources for high-dimensional immune analysis from the Human Immunology Project Consortium. Nat. Biotechnol. 32, 146–148 (2014)

  7. Camoriano, R., Angles, T., Rudi, A., Rosasco, L.: NYTRO: when subsampling meets early stopping. In: Artificial Intelligence and Statistics, pp. 1403–1411. PMLR (2016)

  8. Charlier, B., Feydy, J., Glaunès, J.A., Collin, F.D., Durif, G.: Kernel operations on the GPU, with autodiff, without memory overflows. J. Mach. Learn. Res. 22(74), 1–6 (2021). https://www.kernel-operations.io/keops/index.html

  9. Tachet des Combes, R., Zhao, H., Wang, Y.X., Gordon, G.J.: Domain adaptation with conditional distribution matching and generalized label shift. In: Advances in Neural Information Processing Systems, vol. 33, pp. 19276–19289 (2020)

  10. Du Plessis, M.C., Sugiyama, M.: Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Netw. 50, 110–119 (2014)

  11. Dussap, B.: Distribution Feature Matching for Label Shift (2023). https://plmlab.math.cnrs.fr/dussap/Label-shift-DFM

  12. Dussap, B., Blanchard, G., Chérief-Abdellatif, B.E.: Label shift quantification with robustness guarantees via distribution feature matching. arXiv preprint arXiv:2306.04376 (2023)

  13. Esuli, A., Fabris, A., Moreo, A., Sebastiani, F.: Learning to quantify (2023)

  14. Finak, G., et al.: Standardizing flow cytometry immunophenotyping analysis from the Human Immunophenotyping Consortium. Sci. Rep. 6(1), 1–11 (2016)

  15. Forman, G.: Counting positives accurately despite inaccurate classification. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 564–575. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_55

  16. Forman, G.: Quantifying trends accurately despite classifier error and class imbalance. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 157–166 (2006)

  17. Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Disc. 17(2), 164–206 (2008)

  18. Garg, S., Wu, Y., Balakrishnan, S., Lipton, Z.C.: A unified view of label shift estimation. arXiv preprint arXiv:2003.07554 (2020)

  19. González, P., Castaño, A., Chawla, N.V., del Coz, J.J.: A review on quantification learning. ACM Comput. Surv. (CSUR) 50(5), 1–40 (2017)

  20. González-Castro, V., Alaiz-Rodríguez, R., Alegre, E.: Class distribution estimation based on the Hellinger distance. Inf. Sci. 218, 146–164 (2013)

  21. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13(1), 723–773 (2012)

  22. Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., Schölkopf, B.: Covariate shift by kernel mean matching. Dataset Shift Mach. Learn. 3(4), 5 (2009)

  23. Hopkins, D.J., King, G.: A method of automated nonparametric content analysis for social science. Am. J. Political Sci. 54(1), 229–247 (2010)

  24. Iyer, A., Nath, S., Sarawagi, S.: Maximum mean discrepancy for class ratio estimation: convergence bounds and kernel selection. In: International Conference on Machine Learning, pp. 530–538. PMLR (2014)

  25. Kawakubo, H., Du Plessis, M.C., Sugiyama, M.: Computationally efficient class-prior estimation under class balance change using energy distance. IEICE Trans. Inf. Syst. 99(1), 176–186 (2016)

  26. Lipton, Z., Wang, Y.X., Smola, A.: Detecting and correcting for label shift with black box predictors. In: International Conference on Machine Learning, pp. 3122–3130. PMLR (2018)

  27. Maletzke, A., dos Reis, D., Cherman, E., Batista, G.: DyS: a framework for mixture models in quantification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4552–4560 (2019)

  28. Milli, L., Monreale, A., Rossetti, G., Giannotti, F., Pedreschi, D., Sebastiani, F.: Quantification trees. In: 2013 IEEE 13th International Conference on Data Mining, pp. 528–536. IEEE (2013)

  29. Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B.: Kernel mean embedding of distributions: a review and beyond. Found. Trends Mach. Learn. 10(1–2), 1–141 (2017)

  30. Patel, V.M., Gopalan, R., Li, R., Chellappa, R.: Visual domain adaptation: a survey of recent advances. IEEE Signal Process. Mag. 32(3), 53–69 (2015)

  31. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. MIT Press, Cambridge (2008)

  32. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, vol. 20 (2007)

  33. Rudi, A., Camoriano, R., Rosasco, L.: Less is more: Nyström computational regularization. In: Advances in Neural Information Processing Systems, vol. 28 (2015)

  34. Rudi, A., Rosasco, L.: Generalization properties of learning with random features. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  35. Saerens, M., Latinne, P., Decaestecker, C.: Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Comput. 14(1), 21–41 (2002)

  36. Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 2263–2291 (2013)

  37. Sutherland, D.J., Schneider, J.: On the error of random Fourier features. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 862–871 (2015)

  38. Zhang, K., Schölkopf, B., Muandet, K., Wang, Z.: Domain adaptation under target and conditional shift. In: International Conference on Machine Learning, pp. 819–827. PMLR (2013)

Acknowledgements

B. Dussap was supported by the program Paris Region Ph.D. of DIM Mathinnov. G. Blanchard acknowledges support of the ANR under ANR-19-CHIA-0021-01 “BiSCottE”, IDEX REC-2019-044, and of the DFG under SFB1294 - 318763901.

Author information

Corresponding author: Bastien Dussap.

Ethics declarations

Ethical Statement

Label shift quantification has uses in a number of application domains; since the results in this paper are chiefly oriented towards theory and general methodology, we do not discuss any particular application. We only mention that the flow cytometry data used for proof-of-concept experiments is publicly available from a reputable scientific consortium and has, to our knowledge, been collected following all established ethical standards.

The original Classify & Count method [15] for label shift quantification is known to inherit the potential biases of the classification method it is based on (i.e. the misclassification errors can be very unevenly distributed across classes and “favor” majority classes). The Adjusted Classify & Count (ACC) approach and related methods [17, 26] aim at rectifying this bias. In the present paper, we go one step further: we analyze certain robustness properties of the proposed label shift quantification methods, and introduce the contaminated label shift (\(\mathcal {CLS}\)) setting with the goal of investigating the trustworthiness of such methods under mild violations of the standard label shift model. Robustness is certainly desirable for improved reliability in practice, but it does not confer immunity against biases; additionally, the user should always be wary of stronger model violations between reference and test data, in particular class-conditional distribution shifts. We therefore recommend the established good practice of regularly checking, on control data, for possible biases or drifts from the model, in particular in sensitive applications.
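
To make the bias correction discussed above concrete, here is a minimal sketch of the binary-case ACC adjustment [17]. The function name is ours, and the rates `tpr` and `fpr` are assumed to have been estimated on held-out labelled source data; this is an illustration of the general idea, not a reference implementation.

```python
import numpy as np

def adjusted_classify_and_count(y_pred_target, tpr, fpr):
    """Binary ACC: correct the raw Classify & Count estimate using the
    classifier's true/false positive rates (estimated on held-out source data).
    Assumes an informative classifier, i.e. tpr > fpr."""
    cc = float(np.mean(y_pred_target))     # raw rate of positive predictions
    acc = (cc - fpr) / (tpr - fpr)         # invert cc = tpr*p + fpr*(1 - p)
    return float(np.clip(acc, 0.0, 1.0))   # project the estimate back onto [0, 1]
```

The division by `tpr - fpr` is what removes the classifier's class-dependent error bias from the raw count, and the final clipping projects the corrected estimate back onto [0, 1].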


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dussap, B., Blanchard, G., Chérief-Abdellatif, B.E. (2023). Label Shift Quantification with Robustness Guarantees via Distribution Feature Matching. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14173. Springer, Cham. https://doi.org/10.1007/978-3-031-43424-2_5

  • DOI: https://doi.org/10.1007/978-3-031-43424-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43423-5

  • Online ISBN: 978-3-031-43424-2

  • eBook Packages: Computer Science (R0)
