Abstract
In statistical learning theory, numerous works have established non-asymptotic bounds assessing the generalization capacity of empirical risk minimizers under a large variety of complexity assumptions on the class of decision rules over which optimization is performed. These bounds rest on sharp control of the uniform deviation of i.i.d. averages from their expectations, and generally ignore possible dependence across the training data. The purpose of this paper is to show that similar results can be obtained when statistical learning is based on a data sequence drawn from a (Harris positive) Markov chain X, through the running example of estimation of minimum volume sets (MV-sets) related to X’s stationary distribution, an unsupervised statistical learning approach to anomaly/novelty detection. Based on novel maximal deviation inequalities that we establish using the regenerative method, we derive learning rate bounds that depend not only on the complexity of the class of candidate sets but also on the ergodicity rate of the chain X, expressed in terms of tail conditions on the length of the regenerative cycles. In particular, this approach, fully tailored to Markovian data, permits interpreting the rate bounds in frequentist terms, in contrast to alternative coupling techniques based on mixing conditions: the larger the expected number of regeneration cycles over a trajectory of finite length, the more accurate the MV-set estimates. Beyond the theoretical analysis, this phenomenon is supported by illustrative numerical experiments.
References
Adamczak, R., Bednorz, W.: Exponential concentration inequalities for additive functionals of Markov chains. ESAIM: PS 19, 440–481 (2015)
Adams, T.M., Nobel, A.B.: Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. Ann. Probab. 38, 1345–1367 (2010)
Agarwal, A., Duchi, J.: The generalization ability of online algorithms for dependent data. IEEE Trans. Inf. Theory 59(1), 573–587 (2013)
Alquier, P., Wintenberger, O.: Model selection for weakly dependent time series forecasting. Bernoulli 18, 883–913 (2012)
Asmussen, S.: Applied probability and queues. Springer, New York (2003)
Bertail, P., Ciołek, G.: New Bernstein and Hoeffding type inequalities for regenerative Markov chains. ALEA Lat. Am. J. Probab. Math. Stat. 16, 259–277 (2019)
Bertail, P., Clémençon, S.: Edgeworth expansions for suitably normalized sample mean statistics of atomic Markov chains. Probab. Theory Relat. Fields 130(3), 388–414 (2004)
Bertail, P., Clémençon, S.: A renewal approach to Markovian U-statistics. Math. Methods Statist. 20(2), 79–105 (2011)
Bertail, P., Clémençon, S.: Regenerative-block bootstrap for Markov chains. Bernoulli 12(4), 689–712 (2006)
Bertail, P., Clémençon, S.: Sharp bounds for the tails of functionals of Markov chains. Theory of Probability and Its Applications 54(3), 505–515 (2010)
Ciołek, G.: Bootstrap uniform central limit theorems for Harris recurrent Markov chains. Electronic Journal of Statistics 10, 2157–2178 (2016)
Clémençon, S., Bertail, P., Papa, G.: Learning from survey training samples: rate bounds for Horvitz-Thompson risk minimizers. In: Proceedings of ACML’16 (2016)
de la Pena, V., Giné, E.: Decoupling: from dependence to independence. Springer, Berlin (1999)
Di, J., Kolaczyk, E.: Complexity-penalized estimation of minimum volume sets for dependent data. J. Multivar. Anal. 101(9), 1910–1926 (2010)
Einmahl, J.H.J., Mason, D.M.: Generalized quantile process. Ann. Stat. 20, 1062–1078 (1992)
Giné, E., Zinn, J.: Some limit theorems for empirical processes. Ann. Probab. 12(4), 929–998 (1984). With discussion
Hairer, M., Mattingly, J.C.: Yet another look at Harris’ ergodic theorem for Markov chains. Seminar on Stochastic Analysis, Random Fields and Applications VI. Progr. Probab. 63, 109–117 (2011)
Hanneke, S.: Learning whenever learning is possible: Universal learning under general stochastic processes. arXiv:1706.01418 (2017)
Jain, N., Jamison, B.: Contributions to Doeblin’s theory of Markov processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 8, 19–40 (1967)
Koltchinskii, V.: Rademacher penalties and structural risk minimization. IEEE Trans. Inf. Theory 47, 1902–1914 (2001)
Kuznetsov, V., Mohri, M.: Generalization bounds for time series prediction with non-stationary processes. In: Proceedings of ALT’14 (2014)
Massart, P.: Some applications of concentration inequalities to statistics. Annales de la faculté des sciences de Toulouse 9, 245–303 (2000)
McGoff, K., Nobel, A.B.: Empirical risk minimization and complexity of dynamical models. Submitted (2018)
Merlevède, F., Peligrad, M.: Rosenthal-type inequalities for the maximum of partial sums of stationary processes and examples. Ann. Probab. 41, 914–960 (2013)
Meyn, S.P., Tweedie, R.L.: Markov chains and stochastic stability. Springer, Berlin (1996)
Montgomery-Smith, S.J.: Comparison of sums of independent identically distributed random vectors. Probab. Math. Statist. 14, 281–285 (1993)
Nummelin, E.: A splitting technique for Harris recurrent chains. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 43, 309–318 (1978)
Peligrad, M.: The r-quick version of the strong law for stationary ϕ-mixing sequences. In: Almost Everywhere Convergence (Columbus, OH, 1988). Academic Press, Boston (1989)
Petrov, V.V.: Limit theorems of probability theory: sequences of independent random variables. Oxford studies in probability. Clarendon Press, Oxford (1995)
Polonik, W.: Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications 69(1), 1–24 (1997)
Revuz, D.: Markov chains. 2nd edition, North-Holland (1984)
Rosenthal, H.P.: On the subspaces of Lp (p > 2) spanned by sequences of independent random variables. Israel J. Math. 8, 273–303 (1970)
Scott, C., Nowak, R.: Learning minimum volume sets. J. Mach. Learn. Res. 7, 665–704 (2006)
Shao, Q.: Maximal inequalities for partial sums of ρ-mixing sequences. Ann. Probab. 23, 948–965 (1995)
Steinwart, I., Christmann, A.: Fast learning from non-i.i.d. observations. NIPS 22, 1768–1776 (2009)
Steinwart, I., Hush, D., Scovel, C.: Learning from dependent observations. J. Multivar. Anal. 100(1), 175–194 (2009)
Thorisson, H.: Coupling, stationarity and regeneration. Springer, Berlin (2000)
Tuominen, P.K., Tweedie, R.: Subgeometric rates of convergence of f-ergodic Markov chains. Adv. Appl. Probab. 26, 775–798 (1994)
Utev, S.A.: Sums of random variables with ϕ-mixing. Sib. Adv. Math. 1, 124–155 (1991)
van der Vaart, A.W., Wellner, J.A.: Weak convergence and empirical processes. Springer, Berlin (1996)
Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
Viennet, G.: Inequalities for absolutely regular sequences: application to density estimation. Probab. Theory Relat. Fields 107, 467–492 (1997)
Acknowledgements
This research was supported by a public grant as part of the Investissement d’avenir project, reference ANR-11-LABX-0056-LMH. Gabriela Ciołek was also supported by the Polish National Science Centre NCN (grant No. UMO-2016/23/N/ST1/01355) and partly by the Ministry of Science and Higher Education. This research has also been conducted as part of the project Labex MME-DII (ANR-11-LBX-0023-01). Part of this research was carried out during a stay of Gabriela Ciołek at the Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan.
Appendix: Technical proofs
1.1 Moment and probability inequalities in the i.i.d. setup
Since the main probabilistic results of the paper are established by means of the regenerative approach (see Section 2.1), their proofs partly rely on certain moment/probability inequalities for the i.i.d. case, which we recall below for clarity. Rosenthal’s inequality for i.i.d. random variables can be found in [32]. The version stated below (see Theorem 2.10 in [29]) seems more appropriate for the statistical learning applications considered in this paper.
Theorem 6.1
Let X1,⋯ ,Xn be integrable centered i.i.d. random variables and p ≥ 2. Assume that \(\mathbb {E}|X_{i}|^{p}<\infty \). Then, we have, for all 𝜖 > 0,
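The displayed bound is of Rosenthal type; in its classical form (the moment inequality of [32] combined with Markov’s inequality — the arrangement of constants in the published display may differ), it reads:

```latex
\[
  \mathbb{P}\left\{ \left| \sum_{i=1}^{n} X_i \right| \geq \epsilon \right\}
  \;\leq\; \frac{c_{p}}{\epsilon^{p}}\,
  \max\!\left( n\,\mathbb{E}|X_{1}|^{p},\;
               \left( n\,\mathbb{E}X_{1}^{2} \right)^{p/2} \right),
\]
```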
where \(c_{p} = 2\max \nolimits \left (p^{p}, p^{p/2 + 1}e^{p} {\int \nolimits }_{0}^{\infty } x^{p/2-1}(1-x)^{-p}dx \right )\).
The constant \(c_{p}\) documented above is due to [32]. The second result recalled here is Montgomery-Smith’s inequality, see [26].
Theorem 6.2 (Montgomery-Smith’s inequality)
Let X1,⋯ ,Xn be integrable centered i.i.d. random variables. Then, for \(1 \leq k \leq n < \infty \) and all t > 0, we have
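In its classical formulation (see [26]; the constants below are those usually quoted and may differ from the published display), the maximal inequality reads, with \(S_{j} = X_{1} + {\cdots} + X_{j}\):

```latex
\[
  \mathbb{P}\left\{ \max_{1 \leq j \leq k} |S_j| > t \right\}
  \;\leq\; 9\, \mathbb{P}\left\{ |S_k| > \frac{t}{30} \right\}.
\]
```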
Lemma 6.1
Suppose that Assumption 3.3 holds. Then we have
The proof is a simple generalization of Lemma 3.6 in [6] and thus omitted.
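Both proofs below manipulate trajectories through their regeneration blocks: the segments of the path comprised between successive visits to an atom, which are i.i.d. by the strong Markov property. As a minimal, self-contained sketch (the toy chain, the function names and the choice of atom are illustrative assumptions, not the paper’s construction), the block decomposition can be computed as follows:

```python
import random

def simulate_chain(n, p=0.4, seed=0):
    """Toy Harris chain: a random walk on {0, 1, 2, ...} reflected at 0.
    State 0 plays the role of the accessible atom A."""
    rng = random.Random(seed)
    x, path = 0, [0]
    for _ in range(n - 1):
        x = max(x + (1 if rng.random() < p else -1), 0)
        path.append(x)
    return path

def split_into_blocks(path, atom=0):
    """Split a trajectory into regeneration blocks delimited by successive
    visits to the atom. Returns (head, blocks, tail): the segment up to the
    first visit, the list of complete blocks, and the incomplete final
    segment after the last visit."""
    hits = [i for i, x in enumerate(path) if x == atom]
    if len(hits) < 2:  # fewer than two visits: no complete block
        return path, [], []
    head = path[: hits[0] + 1]
    blocks = [path[hits[j] + 1 : hits[j + 1] + 1] for j in range(len(hits) - 1)]
    tail = path[hits[-1] + 1 :]
    return head, blocks, tail

path = simulate_chain(500, seed=1)
head, blocks, tail = split_into_blocks(path)
# Summing any f over each block turns an additive functional of the chain
# into a sum of i.i.d. variables, up to the two boundary segments.
```

This is precisely why the proofs can invoke i.i.d. tools (Rosenthal, Montgomery-Smith, Massart) after controlling the first and last non-regenerative blocks separately.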
1.2 Proof of Theorem 3.5
We prove the more specific version of the result stated below.
Theorem 6.3 (Polynomial tail maximal inequality for regenerative Markov chains)
Assume that Assumptions 3.3, 3.4 and 3.2 are satisfied by the chain \(X= (X_{n})_{n \in \mathbb {N}}\). Then, for any x > 0, any 0 < 𝜖 < x/2, any N > 0 and all n ≥ 1, we have
where\(C_{p} = 24\max \nolimits \left (p^{p}, p^{p/2 + 1}e^{p} {\int \nolimits }_{0}^{\infty } x^{p/2-1}(1-x)^{-p}dx \right )\)and\(\tilde {H}=H+\mu (H)\).
Proof
The techniques used in the proof are similar to those used in the proof of Theorem 3.14 in [6].
Uniform covering
We choose functions g1,g2,…,gM in the class \(\mathcal {F}\) defining an 𝜖-covering of \(\mathcal {F}\), where \(M= \mathcal {N}_{1}(\epsilon , \mathcal {F})\), such that, for all \(f \in \mathcal {F}\), \(\min _{1 \leq j \leq M} \int |f-g_{j}| \, dQ \leq \epsilon \), where Q is any discrete probability measure. We also assume that g1,g2,…,gM satisfy Assumptions 3.3, 3.4 and 3.2. By f∗ we mean the \(g_{j}\) that achieves the minimum. Next, by definition of uniform covering numbers, we obtain
We introduce the notation
Hence, rather than considering any \(f \in \mathcal {F}\), we may work with the functions \(g_{j} \in \mathcal {F}\) only.
Decomposition
Consider now the following decomposition:
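With \(\tau _{A}(j)\) denoting the successive times of visit to the atom A, \(l_{n}\) the number of visits up to time n, and \(B_{j}\) the j-th regeneration block (the symbols \(B_{j}\) and \(\tau _{A}(j)\) are notational assumptions on our part), the decomposition takes the standard regenerative form:

```latex
\[
  \sum_{i=1}^{n} f(X_i)
  \;=\; \underbrace{\sum_{i=1}^{\tau_A(1)} f(X_i)}_{\text{first (non-regenerative) block}}
  \;+\; \sum_{j=1}^{l_n - 1} f(B_j)
  \;+\; \underbrace{\sum_{i=\tau_A(l_n)+1}^{n} f(X_i)}_{\text{last (non-regenerative) block}},
  \qquad f(B_j) = \sum_{i=\tau_A(j)+1}^{\tau_A(j+1)} f(X_i).
\]
```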
We control separately each term on the right-hand side of the above inequality. A bound for the first non-regenerative block is easily obtained using Markov’s inequality:
We deal in a similar fashion with the last non-regenerative block:
The control of the term in the middle is more challenging. Note that
where \(l_{n_{1}} = \min \nolimits (\lfloor {n/ \mathbb {E}_{A}[\tau _{A}]}\rfloor , l_{n})\) and \(l_{n_{2}}= \max \nolimits (\lfloor {n/\mathbb {E}_{A}[\tau _{A}]}\rfloor , l_{n})\).
Polynomial tail inequality for i.i.d. random variables
We may apply Theorem 6.1 in order to obtain
Truncation
The control of \({\sum }_{i=l_{n_{1}}}^{l_{n_{2}}} \bar {g}_{j}(B_{i})\) is slightly more challenging, due to the fact that \(l_{n}\) is random and itself correlated with the blocks. Observe first that, since we expect the number of terms in this sum to be at most of order \(\sqrt {n}\), this term should be much smaller than the leading term (1) and thus asymptotically negligible. We have
First, we bound term I in (14) using Montgomery-Smith’s inequality and the fact that if
Note that it is sufficient to consider the case where \(\lfloor {n/\mathbb {E}_{A}[\tau _{A}]}\rfloor < l_{n}\) only. In what follows we rely on the following observation:
Thus,
and by exchangeability of the blocks we have
Montgomery-Smith’s inequality
Now, we use Montgomery-Smith’s inequality to get
Finally, term II is directly controlled by means of Lemma 6.1. □
1.3 Proof of Theorem 3.6
Before detailing the proof, we recall Massart’s Finite Class Lemma (see [22], Lemma 5.2, page 300), which is involved in our argument.
Lemma 6.2
Let \(\mathcal {A}\) be some finite subset of \(\mathbb {R}^{n}\). Let N denote the cardinality of \(\mathcal {A}\) and let \(R= \sup _{a \in \mathcal {A}}\left [{\sum }_{i=1}^{n} {a_{i}^{2}} \right ]^{1/2}\). Then,
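With (𝜖1,…,𝜖n) independent Rademacher variables, the conclusion of the lemma in its standard form (see [22]) is:

```latex
\[
  \mathbb{E}\left[ \sup_{a \in \mathcal{A}} \sum_{i=1}^{n} \epsilon_i a_i \right]
  \;\leq\; R \sqrt{2 \log N}.
\]
```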
Montgomery-Smith’s inequality
In order to deal with the random character of the number of blocks \(l_{n} - 1\), apply Montgomery-Smith’s inequality:
Integrating over t > 0 then yields:
Ghost sample of regeneration blocks and randomization
In the following, we consider \(B^{\prime }=(B_{1}^{\prime }, \ldots , B_{n}^{\prime })\), an independent copy of \(B=(B_{1}, \ldots , B_{n})\) (a ‘ghost’ sample of regeneration blocks). Let (𝜖1,…,𝜖n) be independent Rademacher variables. Let
Note that, for any M > 0, we have
Uniform covering for \(\mathcal {F}\)
We consider a uniform 𝜖-covering g1,…,gW, where \(W= \mathcal {N}_{1}(x/M\mathbb {E}_{A}[\tau _{A}], \mathcal {F})\) and
and Q is any discrete probability measure. We also assume that g1,g2,…,gW belong to \(\mathcal {F}\) and satisfy Assumption 3.3. By f∗ we mean the \(g_{j}\) achieving the minimum. Then,
Massart’s finite class lemma
In what follows, we use Massart’s Finite Class Lemma (Lemma 6.2). We bound (4) by applying (3) directly:
We now derive an upper bound for II.
Since we have
one may write
by virtue of Markov’s inequality, combined with the fact that \(\mathbb {E}_{A}[l(B_{1})]^{2}<\infty \).
Cite this article
Clémençon, S., Bertail, P. & Ciołek, G. Statistical learning based on Markovian data: maximal deviation inequalities and learning rates. Ann Math Artif Intell 88, 735–757 (2020). https://doi.org/10.1007/s10472-019-09670-6
Keywords
- Concentration inequality
- Empirical process
- Generalization bound
- Harris positive Markov chain
- Minimum volume set
- Novelty detection
- Regenerative method
- Stationary probability distribution
- Unsupervised learning