Abstract
Quantification learning is the task of estimating the target label distribution under label shift. In this paper, we first present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in the previous literature. We derive a general performance bound for DFM procedures, improving in several key aspects upon bounds previously derived in particular cases. We then extend this analysis to study the robustness of DFM procedures in the misspecified setting, under departure from the exact label shift hypothesis, in particular when the target is contaminated by an unknown distribution. These theoretical findings are confirmed by a detailed numerical study on simulated and real-world datasets. We also introduce an efficient, scalable and robust version of kernel-based DFM using Random Fourier Features.
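To illustrate the kernel-based DFM idea with Random Fourier Features, the following is a minimal sketch (not the authors' implementation; the data, dimensions and bandwidth below are hypothetical): the target mean embedding is matched by a mixture of class-wise source mean embeddings, and the mixture weights estimate the target label proportions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-class source data: class-conditional Gaussians in R^d.
n, d, D = 2000, 2, 200            # points per class, input dim, number of RFF
X0 = rng.normal(loc=-1.0, size=(n, d))
X1 = rng.normal(loc=+1.0, size=(n, d))

# Random Fourier features approximating a Gaussian kernel of bandwidth sigma:
# phi(x) = sqrt(2/D) * cos(W^T x + b), W ~ N(0, sigma^-2), b ~ U[0, 2*pi].
sigma = 1.0
W = rng.normal(scale=1.0 / sigma, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Class-wise mean embeddings of the source form the columns of Phi.
Phi = np.stack([rff(X0).mean(axis=0), rff(X1).mean(axis=0)], axis=1)

# Target sample generated under label shift with true proportions (0.2, 0.8).
m = 3000
is_cls1 = rng.random(m) < 0.8
Xt = np.where(is_cls1[:, None],
              rng.normal(loc=+1.0, size=(m, d)),
              rng.normal(loc=-1.0, size=(m, d)))
phi_t = rff(Xt).mean(axis=0)

# DFM estimate: least-squares match of the target embedding by a mixture of
# class embeddings, then projection back onto the probability simplex.
pi_hat, *_ = np.linalg.lstsq(Phi, phi_t, rcond=None)
pi_hat = np.clip(pi_hat, 0.0, None)
pi_hat /= pi_hat.sum()
print(pi_hat)  # close to the true proportions (0.2, 0.8)
```

With well-separated classes and enough random features, the estimated proportions concentrate around the true target distribution, in line with the performance bounds discussed in the paper.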
Notes
1. There can be large variability between samples coming from different laboratories, while there is homogeneity within each lab. The label shift hypothesis is therefore reasonable when source and target are taken from the same lab.
References
Alexandari, A., Kundaje, A., Shrikumar, A.: Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In: International Conference on Machine Learning, pp. 222–232. PMLR (2020)
Azizzadenesheli, K., Liu, A., Yang, F., Anandkumar, A.: Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734 (2019)
Barranquero, J., Díez, J., del Coz, J.J.: Quantification-oriented learning based on reliable classifiers. Pattern Recogn. 48(2), 591–604 (2015)
Barranquero, J., González, P., Díez, J., Del Coz, J.J.: On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recogn. 46(2), 472–482 (2013)
Bigot, J., Freulon, P., Hejblum, B.P., Leclaire, A.: On the potential benefits of entropic regularization for smoothing Wasserstein estimators. arXiv preprint arXiv:2210.06934 (2022)
Brusic, V., Gottardo, R., Kleinstein, S.H., Davis, M.M.: Computational resources for high-dimensional immune analysis from the human immunology project consortium. Nat. Biotechnol. 32, 146–148 (2014)
Camoriano, R., Angles, T., Rudi, A., Rosasco, L.: Nytro: when subsampling meets early stopping. In: Artificial Intelligence and Statistics, pp. 1403–1411. PMLR (2016)
Charlier, B., Feydy, J., Glaunès, J.A., Collin, F.D., Durif, G.: Kernel operations on the GPU, with autodiff, without memory overflows. J. Mach. Learn. Res. 22(74), 1–6 (2021). https://www.kernel-operations.io/keops/index.html
Tachet des Combes, R., Zhao, H., Wang, Y.X., Gordon, G.J.: Domain adaptation with conditional distribution matching and generalized label shift. In: Advances in Neural Information Processing Systems, vol. 33, pp. 19276–19289 (2020)
Du Plessis, M.C., Sugiyama, M.: Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Netw. 50, 110–119 (2014)
Dussap, B.: Distribution Feature Matching for Label Shift (2023). https://plmlab.math.cnrs.fr/dussap/Label-shift-DFM
Dussap, B., Blanchard, G., Chérief-Abdellatif, B.E.: Label shift quantification with robustness guarantees via distribution feature matching. arXiv preprint arXiv:2306.04376 (2023)
Esuli, A., Fabris, A., Moreo, A., Sebastiani, F.: Learning to Quantify. Springer, Cham (2023)
Finak, G., et al.: Standardizing flow cytometry immunophenotyping analysis from the human immunophenotyping consortium. Sci. Rep. 6(1), 1–11 (2016)
Forman, G.: Counting positives accurately despite inaccurate classification. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 564–575. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_55
Forman, G.: Quantifying trends accurately despite classifier error and class imbalance. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 157–166 (2006)
Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Disc. 17(2), 164–206 (2008)
Garg, S., Wu, Y., Balakrishnan, S., Lipton, Z.C.: A unified view of label shift estimation. arXiv preprint arXiv:2003.07554 (2020)
González, P., Castaño, A., Chawla, N.V., Coz, J.J.D.: A review on quantification learning. ACM Comput. Surv. (CSUR) 50(5), 1–40 (2017)
González-Castro, V., Alaiz-Rodríguez, R., Alegre, E.: Class distribution estimation based on the Hellinger distance. Inf. Sci. 218, 146–164 (2013)
Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13(1), 723–773 (2012)
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., Schölkopf, B.: Covariate shift by kernel mean matching. In: Dataset Shift in Machine Learning. MIT Press, Cambridge (2009)
Hopkins, D.J., King, G.: A method of automated nonparametric content analysis for social science. Am. J. Political Sci. 54(1), 229–247 (2010)
Iyer, A., Nath, S., Sarawagi, S.: Maximum mean discrepancy for class ratio estimation: convergence bounds and kernel selection. In: International Conference on Machine Learning, pp. 530–538. PMLR (2014)
Kawakubo, H., Du Plessis, M.C., Sugiyama, M.: Computationally efficient class-prior estimation under class balance change using energy distance. IEICE Trans. Inf. Syst. 99(1), 176–186 (2016)
Lipton, Z., Wang, Y.X., Smola, A.: Detecting and correcting for label shift with black box predictors. In: International Conference on Machine Learning, pp. 3122–3130. PMLR (2018)
Maletzke, A., dos Reis, D., Cherman, E., Batista, G.: DyS: a framework for mixture models in quantification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4552–4560 (2019)
Milli, L., Monreale, A., Rossetti, G., Giannotti, F., Pedreschi, D., Sebastiani, F.: Quantification trees. In: 2013 IEEE 13th International Conference on Data Mining, pp. 528–536. IEEE (2013)
Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al.: Kernel mean embedding of distributions: a review and beyond. Found. Trends Mach. Learn. 10(1–2), 1–141 (2017)
Patel, V.M., Gopalan, R., Li, R., Chellappa, R.: Visual domain adaptation: a survey of recent advances. IEEE Signal Process. Mag. 32(3), 53–69 (2015)
Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. MIT Press, Cambridge (2008)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, vol. 20 (2007)
Rudi, A., Camoriano, R., Rosasco, L.: Less is more: Nyström computational regularization. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Rudi, A., Rosasco, L.: Generalization properties of learning with random features. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Saerens, M., Latinne, P., Decaestecker, C.: Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Comput. 14(1), 21–41 (2002)
Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 41(5), 2263–2291 (2013)
Sutherland, D.J., Schneider, J.: On the error of random Fourier features. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 862–871 (2015)
Zhang, K., Schölkopf, B., Muandet, K., Wang, Z.: Domain adaptation under target and conditional shift. In: International Conference on Machine Learning, pp. 819–827. PMLR (2013)
Acknowledgements
B. Dussap was supported by the program Paris Region Ph.D. of DIM Mathinnov. G. Blanchard acknowledges support of the ANR under ANR-19-CHIA-0021-01 “BiSCottE”, IDEX REC-2019-044, and of the DFG under SFB1294 - 318763901.
Ethics declarations
Ethical Statement
Label shift quantification has uses in a number of application domains; the results in this paper are chiefly oriented towards theory and general methodology, so we do not discuss any particular application. We only note that the flow cytometry data used for the proof-of-concept experiments is publicly available from a reputable scientific consortium and has, to our knowledge, been collected following all established ethical standards.
The original Classify & Count [15] method for label shift quantification is known to inherit the potential biases of the classification method it is based on (i.e. misclassification errors can be very unevenly distributed across classes and "favor" majority classes). The Adjusted Classify & Count (ACC) approach and related methods [17, 26] aim at rectifying this bias. In the present paper, we go one step further: we analyze certain robustness properties of the proposed label shift quantification methods and introduce the contaminated label shift (\(\mathcal {CLS}\)) setting, with the goal of investigating the trustworthiness of such methods under mild violations of the standard label shift model. While robustness is certainly desirable for improved reliability in practice, it does not imply immunity against biases; moreover, the user should always be wary of stronger model violations between reference and test data, in particular class-conditional distribution shifts. We therefore recommend the established good practice of regularly checking possible biases or drifts from the model on control data, in particular for sensitive applications.
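For concreteness, the ACC bias correction mentioned above can be sketched as follows (a minimal illustration with a hypothetical confusion matrix, not the paper's experimental code): the raw Classify & Count proportions q relate to the true class proportions pi through q = C pi, where C is the classifier's column-stochastic confusion matrix, and ACC inverts this relation.

```python
import numpy as np

# Hypothetical confusion matrix of a black-box classifier:
# C[i, j] = P(predict class i | true class j), estimated on labelled source data.
C = np.array([[0.9, 0.3],
              [0.1, 0.7]])

# Raw Classify & Count: fraction of target points predicted in each class.
q = np.array([0.48, 0.52])

# The naive CC estimate is biased towards "easy"/majority classes; ACC undoes
# the misclassification mixing by solving q = C @ pi for pi.
pi_acc = np.linalg.solve(C, q)
pi_acc = np.clip(pi_acc, 0.0, 1.0)
pi_acc /= pi_acc.sum()
print(pi_acc)  # -> approximately [0.3, 0.7]
```

In this toy example, CC alone would report proportions (0.48, 0.52), while inverting the confusion matrix recovers the true target proportions (0.3, 0.7); DFM procedures such as BBSE can be seen as generalisations of this matching step.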
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dussap, B., Blanchard, G., Chérief-Abdellatif, B.E. (2023). Label Shift Quantification with Robustness Guarantees via Distribution Feature Matching. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds.) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14173. Springer, Cham. https://doi.org/10.1007/978-3-031-43424-2_5
Print ISBN: 978-3-031-43423-5
Online ISBN: 978-3-031-43424-2