Abstract
The ratio of two probability densities can be used for solving various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), feature selection (mutual information), and conditional probability estimation. Several methods of directly estimating the density ratio have recently been developed, e.g., moment matching estimation, maximum-likelihood density-ratio estimation, and least-squares density-ratio fitting. In this paper, we propose a kernelized variant of the least-squares method for density-ratio estimation, which is called kernel unconstrained least-squares importance fitting (KuLSIF). We investigate its fundamental statistical properties including a non-parametric convergence rate, an analytic-form solution, and a leave-one-out cross-validation score. We further study its relation to other kernel-based density-ratio estimators. In experiments, we numerically compare various kernel-based density-ratio estimation methods, and show that KuLSIF compares favorably with other approaches.
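The analytic-form solution mentioned in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical NumPy implementation of kernel-based least-squares density-ratio fitting in the spirit of (K)uLSIF, not the authors' code: the ratio r(x) = p(x)/q(x) is modeled as a Gaussian-kernel expansion centered at the numerator samples, and the regularized least-squares objective admits a closed-form solution via a single linear solve. The function names, the choice of kernel centers, and the default hyperparameters are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kulsif_fit(x_nu, x_de, sigma=1.0, lam=1e-2):
    """Least-squares density-ratio fit with an analytic-form solution (sketch).

    x_nu: samples from the numerator density p(x), shape (n_nu, d)
    x_de: samples from the denominator density q(x), shape (n_de, d)
    Returns a function estimating r(x) = p(x)/q(x), modeled as a kernel
    expansion centered at the numerator samples.
    """
    n_nu, n_de = len(x_nu), len(x_de)
    Phi_de = gaussian_kernel(x_de, x_nu, sigma)   # (n_de, n_nu) design matrix under q
    Phi_nu = gaussian_kernel(x_nu, x_nu, sigma)   # (n_nu, n_nu) design matrix under p
    H = Phi_de.T @ Phi_de / n_de                  # empirical second moment under q
    h = Phi_nu.mean(axis=0)                       # empirical first moment under p
    # Minimizer of (1/2) theta' H theta - h' theta + (lam/2) ||theta||^2:
    theta = np.linalg.solve(H + lam * np.eye(n_nu), h)
    return lambda x: gaussian_kernel(np.atleast_2d(x), x_nu, sigma) @ theta
```

As a sanity check, when the two samples come from the same distribution the fitted ratio should be close to 1 over the bulk of the data; in practice sigma and lam would be chosen by cross-validation, for which the paper derives a leave-one-out score.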
Editor: Massimiliano Pontil
Kanamori, T., Suzuki, T. & Sugiyama, M. Statistical analysis of kernel-based least-squares density-ratio estimation. Mach Learn 86, 335–367 (2012). https://doi.org/10.1007/s10994-011-5266-3