Abstract
In statistical learning theory, numerous works have established non-asymptotic bounds assessing the generalization capacity of empirical risk minimizers under a large variety of complexity assumptions on the class of decision rules over which optimization is performed. These bounds rest on sharp control of the uniform deviation of i.i.d. averages from their expectations, and generally ignore possible dependence across the training data. The purpose of this paper is to show that similar results can be obtained when statistical learning is based on a data sequence drawn from a (Harris positive) Markov chain X, through the running example of estimation of minimum volume sets (MV-sets) related to X’s stationary distribution, an unsupervised statistical learning approach to anomaly/novelty detection. Based on novel maximal deviation inequalities that we establish using the regenerative method, we derive learning rate bounds that depend not only on the complexity of the class of candidate sets but also on the ergodicity rate of the chain X, expressed in terms of tail conditions on the length of the regenerative cycles. In particular, this approach, fully tailored to Markovian data, permits interpreting the rate bounds in frequentist terms, in contrast to alternative coupling techniques based on mixing conditions: the larger the expected number of regeneration cycles over a trajectory of finite length, the more accurate the MV-set estimates. Beyond the theoretical analysis, this phenomenon is supported by illustrative numerical experiments.
References
Adamczak, R., Bednorz, W.: Exponential concentration inequalities for additive functionals of Markov chains. ESAIM: PS 19, 440–481 (2015)
Adams, T.M., Nobel, A.B.: Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. Ann. Probab. 38, 1345–1367 (2010)
Agarwal, A., Duchi, J.: The generalization ability of online algorithms for dependent data. IEEE Trans. Inf. Theory 59(1), 573–587 (2013)
Alquier, P., Wintenberger, O.: Model selection for weakly dependent time series forecasting. Bernoulli 18, 883–913 (2012)
Asmussen, S.: Applied probability and queues. Springer, New York (2003)
Bertail, P., Ciołek, G.: New Bernstein and Hoeffding type inequalities for regenerative Markov chains. ALEA Lat. Am. J. Probab. Math. Stat. 16, 259–277 (2019)
Bertail, P., Clémençon, S.: Edgeworth expansions for suitably normalized sample mean statistics of atomic Markov chains. Probab. Theory Relat. Fields 130(3), 388–414 (2004)
Bertail, P., Clémençon, S.: A renewal approach to Markovian U-statistics. Math. Methods Statist. 20(2), 79–105 (2011)
Bertail, P., Clémençon, S.: Regenerative-block bootstrap for Markov chains. Bernoulli 12(4), 689–712 (2006)
Bertail, P., Clémençon, S.: Sharp bounds for the tails of functionals of Markov chains. Theory of Probability and Its Applications 54(3), 505–515 (2010)
Ciołek, G.: Bootstrap uniform central limit theorems for Harris recurrent Markov chains. Electronic Journal of Statistics 10, 2157–2178 (2016)
Clémençon, S., Bertail, P., Papa, G.: Learning from survey training samples: rate bounds for Horvitz-Thompson risk minimizers. In: Proceedings of ACML’16 (2016)
de la Pena, V., Giné, E.: Decoupling: from dependence to independence. Springer, Berlin (1999)
Di, J., Kolaczyk, E.: Complexity-penalized estimation of minimum volume sets for dependent data. J. Multivar. Anal. 101(9), 1910–1926 (2010)
Einmahl, J.H.J., Mason, D.M.: Generalized quantile process. Ann. Stat. 20, 1062–1078 (1992)
Giné, E., Zinn, J.: Some limit theorems for empirical processes. Ann. Probab. 12(4), 929–998 (1984). With discussion
Hairer, M., Mattingly, J.C.: Yet another look at Harris’ ergodic theorem for Markov chains. Seminar on Stochastic Analysis, Random Fields and Applications VI. Progr. Probab. 63, 109–117 (2011)
Hanneke, S.: Learning whenever learning is possible: Universal learning under general stochastic processes. arXiv:1706.01418 (2017)
Jain, N., Jamison, B.: Contributions to Doeblin’s theory of Markov processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 8, 19–40 (1967)
Koltchinskii, V.: Rademacher penalties and structural risk minimization. IEEE Trans. Inf. Theory 47, 1902–1914 (2001)
Kuznetsov, V., Mohri, M.: Generalization bounds for time series prediction with non-stationary processes. In: Proceedings of ALT’14 (2014)
Massart, P.: Some applications of concentration inequalities to statistics. Annales de la faculté des sciences de Toulouse 9, 245–303 (2000)
McGoff, K., Nobel, A.B.: Empirical risk minimization and complexity of dynamical models. Submitted (2018)
Merlevède, F., Peligrad, M.: Rosenthal-type inequalities for the maximum of partial sums of stationary processes and examples. Ann. Probab. 41, 914–960 (2013)
Meyn, S.P., Tweedie, R.L.: Markov chains and stochastic stability. Springer, Berlin (1996)
Montgomery-Smith, S.J.: Comparison of sums of independent identically distributed random vectors. Probab. Math. Statist. 14, 281–285 (1993)
Nummelin, E.: A splitting technique for Harris recurrent chains. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 43, 309–318 (1978)
Peligrad, M.: The r-quick version of the strong law for stationary ϕ-mixing sequences. In: Almost Everywhere Convergence (Columbus, OH, 1988). Academic Press, Boston (1989)
Petrov, V.V.: Limit theorems of probability theory: sequences of independent random variables. Oxford studies in probability. Clarendon Press, Oxford (1995)
Polonik, W.: Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications 69(1), 1–24 (1997)
Revuz, D.: Markov chains. 2nd edition, North-Holland (1984)
Rosenthal, H.P.: On the subspaces of Lp (p > 2) spanned by sequences of independent random variables. Israel J. Math. 8, 273–303 (1970)
Scott, C., Nowak, R.: Learning minimum volume sets. J. Mach. Learn. Res. 7, 665–704 (2006)
Shao, Q.: Maximal inequalities for partial sums of ρ-mixing sequences. Ann. Probab. 23, 948–965 (1995)
Steinwart, I., Christmann, A.: Fast learning from non-i.i.d. observations. NIPS 22, 1768–1776 (2009)
Steinwart, I., Hush, D., Scovel, C.: Learning from dependent observations. J. Multivar. Anal. 100(1), 175–194 (2009)
Thorisson, H.: Coupling, stationarity and regeneration. Springer, Berlin (2000)
Tuominen, P.K., Tweedie, R.: Subgeometric rates of convergence of f-ergodic Markov chains. Adv. Appl. Probab. 26, 775–798 (1994)
Utev, S.A.: Sums of random variables with ϕ-mixing. Sib. Adv. Math. 1, 124–155 (1991)
van der Vaart, A.W., Wellner, J.A.: Weak convergence and empirical processes. Springer, Berlin (1996)
Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
Viennet, G.: Inequalities for absolutely regular sequences: application to density estimation. Probab. Theory Relat. Fields 107, 467–492 (1997)
Acknowledgements
This research was supported by a public grant as part of the Investissement d’avenir project, reference ANR-11-LABX-0056-LMH. Gabriela Ciołek was also supported by the Polish National Science Centre NCN (grant No. UMO-2016/23/N/ST1/01355) and partly by the Ministry of Science and Higher Education. This research has also been conducted as part of the project Labex MME-DII (ANR-11-LBX-0023-01). Part of this research was carried out during a stay of Gabriela Ciołek at the Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan.
Appendix: Technical proofs
1.1 Moment and probability inequalities in the i.i.d. setup
Since the main probabilistic results of the paper are established by means of the regenerative approach (see Section 2.1), their proofs partly rely on certain moment/probability inequalities for the i.i.d. case, which we recall below for clarity. Rosenthal’s inequality for i.i.d. random variables can be found in [32]. The version stated below (see Theorem 2.10 in [29]) seems more appropriate for the statistical learning applications considered in this paper.
Theorem 6.1
Let X1,⋯ ,Xn be integrable centered i.i.d. random variables and p ≥ 2. Assume that \(\mathbb {E}|X_{i}|^{p}<\infty \). Then, we have, for all 𝜖 > 0,
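The displayed bound is of Rosenthal type; in its classical form (the moment inequality of [32] combined with Markov’s inequality — the arrangement of constants in the published display may differ), it reads:

```latex
\[
  \mathbb{P}\left\{ \left| \sum_{i=1}^{n} X_i \right| \geq \epsilon \right\}
  \;\leq\; \frac{c_{p}}{\epsilon^{p}}\,
  \max\!\left( n\,\mathbb{E}|X_{1}|^{p},\;
               \left( n\,\mathbb{E}X_{1}^{2} \right)^{p/2} \right),
\]
```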
where \(c_{p} = 2\max \nolimits \left (p^{p}, p^{p/2 + 1}e^{p} {\int \nolimits }_{0}^{\infty } x^{p/2-1}(1-x)^{-p}dx \right )\).
The constant \(c_{p}\) documented above is due to [32]. The second result recalled here is Montgomery-Smith’s inequality, see [26].
Theorem 6.2 (Montgomery-Smith’s inequality)
Let X1,⋯ ,Xn be integrable centered i.i.d. random variables. Then, for \(1 \leq k \leq n < \infty \) and all t > 0, we have
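In its classical formulation (see [26]; the constants below are those usually quoted and may differ from the published display), the maximal inequality reads, with \(S_{j} = X_{1} + {\cdots} + X_{j}\):

```latex
\[
  \mathbb{P}\left\{ \max_{1 \leq j \leq k} |S_j| > t \right\}
  \;\leq\; 9\, \mathbb{P}\left\{ |S_k| > \frac{t}{30} \right\}.
\]
```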
Lemma 6.1
Suppose that Assumption 3.3 holds. Then we have
The proof is a simple generalization of Lemma 3.6 in [6] and thus omitted.
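Both proofs below manipulate trajectories through their regeneration blocks: the segments of the path comprised between successive visits to an atom, which are i.i.d. by the strong Markov property. As a minimal, self-contained sketch (the toy chain, the function names and the choice of atom are illustrative assumptions, not the paper’s construction), the block decomposition can be computed as follows:

```python
import random

def simulate_chain(n, p=0.4, seed=0):
    """Toy Harris chain: a random walk on {0, 1, 2, ...} reflected at 0.
    State 0 plays the role of the accessible atom A."""
    rng = random.Random(seed)
    x, path = 0, [0]
    for _ in range(n - 1):
        x = max(x + (1 if rng.random() < p else -1), 0)
        path.append(x)
    return path

def split_into_blocks(path, atom=0):
    """Split a trajectory into regeneration blocks delimited by successive
    visits to the atom. Returns (head, blocks, tail): the segment up to the
    first visit, the list of complete blocks, and the incomplete final
    segment after the last visit."""
    hits = [i for i, x in enumerate(path) if x == atom]
    if len(hits) < 2:  # fewer than two visits: no complete block
        return path, [], []
    head = path[: hits[0] + 1]
    blocks = [path[hits[j] + 1 : hits[j + 1] + 1] for j in range(len(hits) - 1)]
    tail = path[hits[-1] + 1 :]
    return head, blocks, tail

path = simulate_chain(500, seed=1)
head, blocks, tail = split_into_blocks(path)
# Summing any f over each block turns an additive functional of the chain
# into a sum of i.i.d. variables, up to the two boundary segments.
```

This is precisely why the proofs can invoke i.i.d. tools (Rosenthal, Montgomery-Smith, Massart) after controlling the first and last non-regenerative blocks separately.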
1.2 Proof of Theorem 3.5
We prove the more specific version of the result stated below.
Theorem 6.3 (Polynomial tail maximal inequality for regenerative Markov chains)
Assume that Assumptions 3.3, 3.4 and 3.2 are satisfied by the chain \(X= (X_{n})_{n \in \mathbb {N}}\). Then, for any x > 0, any 0 < 𝜖 < x/2, any N > 0 and all n ≥ 1, we have
where\(C_{p} = 24\max \nolimits \left (p^{p}, p^{p/2 + 1}e^{p} {\int \nolimits }_{0}^{\infty } x^{p/2-1}(1-x)^{-p}dx \right )\)and\(\tilde {H}=H+\mu (H)\).
Proof
The techniques used in the proof are similar to those used in the proof of Theorem 3.14 in [6].
Uniform covering
We choose functions g1,g2,…,gM in the class \(\mathcal {F}\) defining an 𝜖-covering of \(\mathcal {F}\), where \(M= \mathcal {N}_{1}(\epsilon , \mathcal {F})\), such that, for all \(f \in \mathcal {F}\), \(\min _{1 \leq j \leq M} \int |f-g_{j}| \, dQ \leq \epsilon \), where Q is any discrete probability measure. We also assume that g1,g2,…,gM satisfy Assumptions 3.3, 3.4 and 3.2. By f∗ we mean the \(g_{j}\) that achieves the minimum. Next, by definition of uniform covering numbers, we obtain
We introduce the notation
Hence, rather than considering any \(f \in \mathcal {F}\), we may work with the functions \(g_{j} \in \mathcal {F}\) only.
Decomposition
Consider now the following decomposition:
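With \(\tau _{A}(j)\) denoting the successive times of visit to the atom A, \(l_{n}\) the number of visits up to time n, and \(B_{j}\) the j-th regeneration block (the symbols \(B_{j}\) and \(\tau _{A}(j)\) are notational assumptions on our part), the decomposition takes the standard regenerative form:

```latex
\[
  \sum_{i=1}^{n} f(X_i)
  \;=\; \underbrace{\sum_{i=1}^{\tau_A(1)} f(X_i)}_{\text{first (non-regenerative) block}}
  \;+\; \sum_{j=1}^{l_n - 1} f(B_j)
  \;+\; \underbrace{\sum_{i=\tau_A(l_n)+1}^{n} f(X_i)}_{\text{last (non-regenerative) block}},
  \qquad f(B_j) = \sum_{i=\tau_A(j)+1}^{\tau_A(j+1)} f(X_i).
\]
```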
We control separately each term on the right-hand side of the above inequality. A bound for the first non-regenerative block is easily obtained using Markov’s inequality:
We deal in a similar fashion with the last non-regenerative block:
The control of the term in the middle is more challenging. Note that
where \(l_{n_{1}} = \min \nolimits (\lfloor {n/ \mathbb {E}_{A}[\tau _{A}]}\rfloor , l_{n})\) and \(l_{n_{2}}= \max \nolimits (\lfloor {n/\mathbb {E}_{A}[\tau _{A}]}\rfloor , l_{n})\).
Polynomial tail inequality for i.i.d. random variables
We may apply Theorem 6.1 in order to obtain
Truncation
The control of \({\sum }_{i=l_{n_{1}}}^{l_{n_{2}}} \bar {g}_{j}(B_{i})\) is slightly more challenging, due to the fact that \(l_{n}\) is random and itself correlated with the blocks. Observe first that, since we expect the number of terms in this sum to be at most of order \(\sqrt {n}\), this term should be much smaller than the leading term (1) and thus asymptotically negligible. We have
First, we bound term I in (14) using Montgomery-Smith’s inequality and the fact that if
Note that it is sufficient to consider the case where \(\lfloor {n/\mathbb {E}_{A}[\tau _{A}]}\rfloor < l_{n}\) only. In what follows we rely on the following observation:
Thus,
and by exchangeability of the blocks we have
Montgomery-Smith’s inequality
Now, we use Montgomery-Smith’s inequality to get
Finally, term II is directly controlled by means of Lemma 6.1. □
1.3 Proof of Theorem 3.6
Before detailing the proof, we recall Massart’s Finite Class Lemma (see [22], Lemma 5.2, page 300), which is involved in our argument.
Lemma 6.2
Let \(\mathcal {A}\) be some finite subset of \(\mathbb {R}^{n}\). Let N denote the cardinality of \(\mathcal {A}\) and let \(R= \sup _{a \in \mathcal {A}}\left [{\sum }_{i=1}^{n} {a_{i}^{2}} \right ]^{1/2}\). Then,
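With (𝜖1,…,𝜖n) independent Rademacher variables, the conclusion of the lemma in its standard form (see [22]) is:

```latex
\[
  \mathbb{E}\left[ \sup_{a \in \mathcal{A}} \sum_{i=1}^{n} \epsilon_i a_i \right]
  \;\leq\; R \sqrt{2 \log N}.
\]
```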
Montgomery-Smith’s inequality
In order to deal with the random character of the number of blocks \(l_{n} - 1\), apply Montgomery-Smith’s inequality:
Integrating over t > 0 then yields:
Ghost sample of regeneration blocks and randomization
In the following, we consider \(B^{\prime }=(B_{1}^{\prime }, \ldots , B_{n}^{\prime })\), an independent copy of \(B=(B_{1}, \ldots , B_{n})\) (a ‘ghost’ sample of regeneration blocks). Let (𝜖1,…,𝜖n) be independent Rademacher variables. Let
Note that, for any M > 0, we have
Uniform covering for \(\mathcal {F}\)
We consider a uniform 𝜖-covering g1,…,gW, where \(W= \mathcal {N}_{1}(x/M\mathbb {E}_{A}[\tau _{A}], \mathcal {F})\) and
and Q is any discrete probability measure. We also assume that g1,g2,…,gW belong to \(\mathcal {F}\) and satisfy Assumption 3.3. By f∗ we mean the \(g_{j}\) achieving the minimum. Then,
Massart’s finite class lemma
In what follows, we use Massart’s Finite Class Lemma (Lemma 6.2). We bound (4) by applying (3) directly:
We now derive an upper bound for II.
Since we have
one may write
by virtue of Markov’s inequality, combined with the fact that \(\mathbb {E}_{A}[l(B_{1})]^{2}<\infty \).
Cite this article
Clémençon, S., Bertail, P. & Ciołek, G. Statistical learning based on Markovian data: maximal deviation inequalities and learning rates. Ann Math Artif Intell 88, 735–757 (2020). https://doi.org/10.1007/s10472-019-09670-6
Keywords
- Concentration inequality
- Empirical process
- Generalization bound
- Harris positive Markov chain
- Minimum volume set
- Novelty detection
- Regenerative method
- Stationary probability distribution
- Unsupervised learning