Abstract
This paper proposes a learning criterion for stochastic rules, developed by extending Valiant's PAC (Probably Approximately Correct) learning model, which is a learning criterion for deterministic rules. Stochastic rules here refer to rules that probabilistically assign one of a set of classes {Y} to each attribute vector X. The proposed criterion is based on the idea that learning stochastic rules may be regarded as probably approximately correct identification of conditional probability distributions over classes for given input attribute vectors. An algorithm based on the MDL (Minimum Description Length) principle, called the MDL algorithm, is used for learning stochastic rules. Specifically, for stochastic rules with finite partitioning (each specified by a finite number of disjoint cells of the domain together with an associated probability parameter vector), this paper derives target-dependent upper bounds and worst-case upper bounds on the sample size required by the MDL algorithm to learn stochastic rules with given accuracy and confidence. Based on these sample-size bounds, this paper proves polynomial-sample-size learnability of stochastic decision lists (newly proposed in this paper as a stochastic analogue of Rivest's decision lists) with at most k literals in each decision (for fixed k), and polynomial-sample-size learnability of stochastic decision trees (a stochastic analogue of decision trees) of depth at most k. Sufficient conditions for polynomial-sample-size learnability and polynomial-time learnability of arbitrary classes of stochastic rules with finite partitioning are also derived.
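As a rough illustration of the MDL-based selection described in the abstract, the following Python sketch chooses, among candidate stochastic rules with finite partitioning, the one minimizing a two-part description length: a code length for the rule itself plus the negative log-likelihood of the observed class labels under the rule's conditional distribution. This is a minimal sketch under assumed names (StochasticRule, model_code_length, mdl_select are illustrative, not notation from the paper), not the paper's actual algorithm or analysis.

```python
# A minimal sketch of two-part MDL selection for stochastic rules with
# finite partitioning. All names here are hypothetical illustrations.
import math
from typing import Callable, List, Sequence, Tuple


class StochasticRule:
    """A stochastic rule with finite partitioning: disjoint cells of the
    attribute space, each paired with a probability of assigning class 1."""

    def __init__(self,
                 cells: List[Callable[[Sequence[int]], bool]],
                 probs: List[float],
                 model_code_length: float):
        self.cells = cells                          # membership tests for the disjoint cells
        self.probs = probs                          # P(class = 1 | x falls in cell i)
        self.model_code_length = model_code_length  # bits needed to describe the rule itself

    def prob_of_one(self, x: Sequence[int]) -> float:
        """Conditional probability of class 1 given attribute vector x."""
        for cell, p in zip(self.cells, self.probs):
            if cell(x):
                return p
        return 0.5  # fallback for x outside every listed cell


def data_code_length(rule: StochasticRule,
                     sample: List[Tuple[Sequence[int], int]]) -> float:
    """Code length of the class labels given the rule: the negative
    log-likelihood (in bits) of the observed labels."""
    total = 0.0
    for x, y in sample:
        p = min(max(rule.prob_of_one(x), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total += -math.log2(p if y == 1 else 1.0 - p)
    return total


def mdl_select(candidates: List[StochasticRule],
               sample: List[Tuple[Sequence[int], int]]) -> StochasticRule:
    """Pick the candidate minimizing total description length:
    L(rule) + L(labels | rule)."""
    return min(candidates,
               key=lambda r: r.model_code_length + data_code_length(r, sample))
```

In this sketch, a stochastic decision list in the spirit of the abstract could be represented by ordering the cells and taking each membership test to be a conjunction of at most k literals, with the first satisfied test determining the class probability.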
References
Abe, N. & Warmuth, M. (1990). On the computational complexity of approximating distributions by probabilistic automata. Proceedings of the Third Workshop on Computational Learning Theory (pp. 52–66), Rochester, NY: Morgan Kaufmann.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Autom. Contr., AC-19, 716–723.
Angluin, D. & Laird, P. (1988). Learning from noisy examples. Machine Learning, 2, 343–370.
Barron, A.R. (1985). Logically smooth density estimation. Ph.D. dissertation, Dept. of Electrical Eng., Stanford Univ.
Barron, A.R. & Cover, T.M. (1991). Minimum complexity density estimation. IEEE Trans. on IT, IT-37, 1034–1054.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M.K. (1987). Occam's razor. Information Processing Letters, 24, 377–380.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M.K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 929–965.
Cesa-Bianchi, N. (1990). Learning the distribution in the extended PAC model. Proceedings of the First International Workshop on Algorithmic Learning Theory (pp. 236–246), Tokyo, Japan: Japanese Society for Artificial Intelligence.
Ehrenfeucht, A., Haussler, D., Kearns, M., & Valiant, L. (1989). A general lower bound on the number of examples needed for learning. Information and Computation, 82, 247–251.
Fisher, R.A. (1956). Statistical Methods and Scientific Inference. Oliver and Boyd.
Gallager, R.G. (1986). Information theory and reliable communication. New York: Wiley.
Haussler, D. (1989). Generalizing the PAC model for neural net and other learning applications. Technical Report UCSC CRL-89-30, Univ. of California at Santa Cruz.
Haussler, D. (1990). Decision theoretic generalizations of the PAC learning model. Proceedings of the First International Workshop on Algorithmic Learning Theory (pp. 21–41), Tokyo, Japan: Japanese Society for Artificial Intelligence.
Haussler, D. & Long, P. (1990). A generalization of Sauer's lemma. Technical Report UCSC CRL-90-15, Univ. of California at Santa Cruz.
Kearns, M. & Li, M. (1988). Learning in the presence of malicious errors. Proceedings of the 20th Annual ACM Symposium on Theory of Computing (pp. 267–279), Chicago, IL.
Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. Proceedings of the 31st Symposium on Foundations of Computer Science (pp. 382–391), St. Louis, Missouri.
Kraft, C. (1949). A device for quantizing, grouping, and coding amplitude modulated pulses. M.S. Thesis, Department of Electrical Engineering, MIT, Cambridge, MA.
Kraft, C. (1955). Some conditions for consistency and uniform consistency of statistical procedures. University of California Publications in Statistics, 2, 125–141.
Kullback, S. (1967). A lower bound for discrimination in terms of variation. IEEE Trans. on IT, IT-13, 126–127.
Laird, P.D. (1988). Efficient unsupervised learning. Proceedings of the First Annual Workshop on Computational Learning Theory (pp. 91–96), Cambridge, MA: Morgan Kaufmann.
Pednault, E.P.D. (1989). Some experiments in applying inductive inference principles to surface reconstruction. Proceedings of the 11th International Joint Conference on Artificial Intelligence (pp. 1603–1609), Morgan Kaufmann.
Pitman, E.J.G. (1979). Some Basic Theory for Statistical Inference. London: Chapman and Hall.
Pitt, L. & Valiant, L.G. (1988). Computational limitations on learning from examples. Journal of the ACM, 35, 965–984.
Quinlan, J.R. & Rivest, R.L. (1989). Inferring decision trees using the minimum description length principle. Information and Computation, 80, 227–248.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11, 416–431.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. on IT, IT-30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, 1080–1100.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. World Scientific, Series in Computer Science, 15.
Rivest, R.L. (1987). Learning decision lists. Machine Learning, 2, 229–246.
Schreiber, F. (1985). The Bayes Laplace statistic of the multinomial distributions. AEU, 39, 293–298.
Segen, J. (1989). From features to symbols: Learning relational shape. In J.C. Simon (Ed.), Pixels to features. Elsevier Science Publishers B.V.
Sloan, R. (1988). Types of noise in data for concept learning. Proceedings of the First Annual Workshop on Computational Learning Theory (pp. 91–96), Cambridge, MA: Morgan Kaufmann.
Solomonoff, R.J. (1964). A formal theory of inductive inference, Part 1. Information and Control, 7, 1–22.
Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
Valiant, L.G. (1985). Learning disjunctions of conjunctions. Proceedings of the Ninth International Joint Conference on Artificial Intelligence (pp. 560–566), Los Angeles, CA: Morgan Kaufmann.
Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2), 264–280.
Wallace, C.S. & Boulton, D.M. (1968). An information measure for classification. Computer Journal, 11, 185–194.
Yamanishi, K. (1989). Inductive inference and learning criterion of stochastic classification rules with hierarchical parameter structures. Proceedings of the 12th Symposium of Information Theory and Its Applications, 2 (pp. 707–712) (in Japanese), Inuyama, Japan.
Yamanishi, K. (1990a). Inferring optimal decision lists from stochastic data using the minimum description length criterion. Presented at the 1990 IEEE International Symposium on Information Theory, San Diego, CA.
Yamanishi, K. (1990b). A learning criterion for stochastic rules. Proceedings of the Third Annual Workshop on Computational Learning Theory (pp. 67–81), Rochester, NY: Morgan Kaufmann.
Additional information
An extended abstract of this paper appeared in Proceedings of the 3rd Annual Workshop on Computational Learning Theory.
About this article
Cite this article
Yamanishi, K. A learning criterion for stochastic rules. Mach Learn 9, 165–203 (1992). https://doi.org/10.1007/BF00992676