A hybrid stochastic optimization framework for composite nonconvex optimization

  • Full Length Paper
  • Mathematical Programming, Series A

Abstract

We introduce a new approach to developing stochastic optimization algorithms for a class of stochastic composite and possibly nonconvex optimization problems. The main idea is to combine a variance-reduced estimator and an unbiased stochastic estimator to create a new hybrid estimator that trades off variance and bias and possesses useful properties for developing new algorithms. We first introduce our hybrid estimator and investigate its fundamental properties to form a foundation for algorithmic development. Next, we apply the new estimator to develop several variants of the stochastic gradient method for solving both expectation and finite-sum composite optimization problems. Our first algorithm can be viewed as a variant of proximal stochastic gradient methods with a single loop and a single sample, yet it achieves the same best-known oracle complexity bound as state-of-the-art double-loop algorithms in the literature. We then consider two different variants of our method, adaptive step-size and restarting schemes, which have similar theoretical guarantees to our first algorithm. We also study two mini-batch variants of the proposed methods. In all cases, we achieve the best-known complexity bounds under standard assumptions. We test our algorithms on several numerical examples with real datasets and compare them with many existing methods. Our numerical experiments show that the new algorithms are comparable to and, in many cases, outperform their competitors.

References

  1. Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 99, 1–1 (2010)

  2. Agarwal, A., Bottou, L.: A lower bound for the optimization of finite sums. In: International Conference on Machine Learning, pp. 78–86 (2015)

  3. Allen-Zhu, Z.: Katyusha: The first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), Montreal, Canada, pp. 1200–1205 (2017)

  4. Allen-Zhu, Z.: Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 89–97 (2017)

  5. Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than SGD. In: Advances in neural information processing systems, pp. 2675–2686 (2018)

  6. Allen-Zhu, Z., Li, Y.: NEON2: Finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems, pp. 3720–3730 (2018)

  7. Allen-Zhu, Z., Yuan, Y.: Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In: ICML, pp. 1080–1089 (2016)

  8. Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., Woodworth, B.: Lower bounds for non-convex stochastic optimization. arXiv:1912.02365, (2019)

  9. Bertsekas, D.P.: Incremental proximal methods for large scale convex optimization. Math. Program. 129(2), 163–195 (2011)

  10. Bollapragada, R., Byrd, R., Nocedal, J.: Exact and Inexact Subsampled Newton Methods for Optimization. IMA J. Numer. Anal. 39(2), 545–578 (2019)

  11. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)

  12. Bottou, L.: Online learning and stochastic approximations. In: David, S. (ed.) Online Learning in Neural Networks, pp. 9–42. Cambridge University Press, New York (1998)

  13. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)

  14. Carmon, Y., Duchi, J., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Math. Program. 5, 1–50 (2017)

  15. Chambolle, A., Ehrhardt, M.J., Richtárik, P., Schönlieb, C.-B.: Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim. 28(4), 2783–2808 (2018)

  16. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011)

  17. Cutkosky, A., Orabona, F.: Momentum-based variance reduction in non-convex SGD. In: Advances in Neural Information Processing Systems, pp. 15210–15219 (2019)

  18. Davis, D., Grimmer, B.: Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM J. Optim. 29(3), 1908–1930 (2019)

  19. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654 (2014)

  20. Defazio, A., Caetano, T., Domke, J.: Finito: A faster, permutable incremental gradient method for big data problems. In: International Conference on Machine Learning, pp. 1125–1133 (2014)

  21. Driggs, D., Liang, J., Schönlieb, C.-B.: On the bias-variance tradeoff in stochastic gradient methods. arXiv:1906.01133 (2019)

  22. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  23. Erdogdu, M.A., Montanari, A.: Convergence rates of sub-sampled Newton methods. In: Advances in Neural Information Processing Systems, pp. 3052–3060 (2015)

  24. Fang, C., Li, C. J., Lin, Z., Zhang, T.: SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In: Advances in Neural Information Processing Systems, pp. 689–699 (2018)

  25. Fang, C., Lin, Z., Zhang, T.: Sharp Analysis for Nonconvex SGD Escaping from Saddle Points. In: Conference on Learning Theory, pp. 1192–1234 (2019)

  26. Foster, D., Sekhari, A., Shamir, O., Srebro, N., Sridharan, K., Woodworth, B.: The complexity of making the gradient small in stochastic convex optimization. In: Conference on Learning Theory, pp. 1319–1345 (2019)

  27. Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points–online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)

  28. Ge, R., Li, Z., Wang, W., Wang, X.: Stabilized SVRG: Simple variance reduction for nonconvex optimization. In: Conference on Learning Theory, pp. 1394–1448 (2019)

  29. Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  30. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)

  31. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)

  32. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, vol. 1. MIT Press, Cambridge (2016)

  33. Hanzely, F., Mishchenko, K., Richtárik, P.: SEGA: variance reduction via gradient sketching. In: Advances in Neural Information Processing Systems, pp. 2082–2093 (2018)

  34. Jofré, A., Thompson, P.: On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Program. 174(1–2), 253–292 (2019)

  35. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp. 315–323 (2013)

  36. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)

  37. Kingma, D.P., Ba, J.: ADAM: A Method for Stochastic Optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR). arXiv:1412.6980 (2014)

  38. Konečný, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10, 242–255 (2016)

  39. Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In: Proceedings of the 31st International Conference on Algorithmic Learning Theory, vol. 117, pp. 1–17 (2020)

  40. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  41. Li, Z.: SSRGD: simple stochastic recursive gradient descent for escaping saddle points. In: Advances in Neural Information Processing Systems, pp. 1521–1531 (2019)

  42. Li, Z., Li, J.: A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 5564–5574 (2018)

  43. Lei, L., Ju, C., Chen, J., Jordan, M.I.: Non-convex finite-sum optimization via SCSG methods. In: Advances in Neural Information Processing Systems, pp. 2348–2358 (2017)

  44. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems, pp. 3384–3392 (2015)

  45. Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)

  46. Metel, M., Takeda, A.: Simple stochastic gradient methods for non-smooth non-convex regularized optimization. In: International Conference on Machine Learning, pp. 4537–4545 (2019)

  47. Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Escaping saddle points in constrained optimization. In: Advances in Neural Information Processing Systems, pp. 3629–3639 (2018)

  48. Moulines, E., Bach, F.R.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, pp. 451–459 (2011)

  49. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  50. Nemirovskii, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)

  51. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, vol. 87. Kluwer Academic Publishers, London (2004)

  52. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)

  53. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: ICML (2017)

  54. Nguyen, L.M., Nguyen, N.H., Phan, D.T., Kalagnanam, J.R., Scheinberg, K.: When does stochastic gradient algorithm work well? arXiv:1801.06159 (2018)

  55. Nguyen, L.M., Scheinberg, K., Takac, M.: Inexact SARAH Algorithm for Stochastic Optimization. Optim. Methods Softw. (online first) (2020)

  56. Nguyen, L.M., van Dijk, M., Phan, D.T., Nguyen, P.H., Weng, T.-W., Kalagnanam, J.R.: Optimal finite-sum smooth non-convex optimization with SARAH. arXiv:1901.07648 (2019)

  57. Nguyen, L.M., Liu, J., Scheinberg, K., Takác, M.: Stochastic recursive gradient algorithm for nonconvex optimization. arXiv:1705.07261 (2017)

  58. Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)

  59. Pham, H.N., Nguyen, M.L., Phan, T.D., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 1–48 (2020)

  60. Pilanci, M., Wainwright, M.J.: Newton sketch: a linear-time optimization algorithm with linear-quadratic convergence. SIAM J. Optim. 27(1), 205–245 (2017)

  61. Polyak, B., Juditsky, A.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)

  62. Reddi, S.J., Sra, S., Póczos, B., Smola, A.: Stochastic Frank-Wolfe methods for nonconvex optimization. In: 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1244–1251. IEEE (2016)

  63. Reddi, S.J., Sra, S., Póczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)

  64. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  65. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods I: globally convergent algorithms. Math. Program. 174, 293–326 (2019)

  66. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  67. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  68. Tran-Dinh, Q., Liu, D., Nguyen, L.M.: Hybrid variance-reduced SGD algorithms for minimax problems with nonconvex-linear function. In: Proceedings of the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS) (2020)

  69. Unser, M.: A representer theorem for deep neural networks. J. Mach. Learn. Res. 20, 1–30 (2019)

  70. Wang, M., Fang, E., Liu, L.: Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program. 161(1–2), 419–449 (2017)

  71. Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: SpiderBoost and Momentum: Faster variance reduction algorithms. Proc. of The Thirty-third Conference on Neural Information Processing Systems (NeurIPS) (2019)

  72. Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Stochastic variance-reduced cubic regularization for nonconvex optimization. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2731–2740 (2019)

  73. Woodworth, B.E., Srebro, N.: Tight complexity bounds for optimizing composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 3639–3647 (2016)

  74. Zhang, J., Xiao, L., Zhang, S.: Adaptive stochastic variance reduction for subsampled Newton method with cubic regularization. arXiv:1811.11637 (2018)

  75. Zhao, L., Mammadov, M., Yearwood, J.: From convex to nonconvex: a loss function analysis for binary classification. In: IEEE International Conference on Data Mining Workshops (ICDMW), pp. 1281–1288. IEEE (2010)

  76. Zhao, P., Zhang, T.: Stochastic optimization with importance sampling for regularized loss minimization. In: International Conference on Machine Learning, pp. 1–9 (2015)

  77. Zhou, D., Gu, Q.: Lower bounds for smooth nonconvex finite-sum optimization (2019)

  78. Zhou, D., Gu, Q.: Stochastic recursive variance-reduced cubic regularization methods. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) (2020)

  79. Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925–3936. Curran Associates Inc. (2018)

  80. Zhou, K., Shang, F., Cheng, J.: A simple stochastic variance reduced algorithm with fast convergence rates. In: International Conference on Machine Learning, pp. 5975–5984 (2018)

  81. Zhou, Y., Wang, Z., Ji, K., Liang, Y., Tarokh, V.: Momentum schemes with stochastic variance reduction for nonconvex composite optimization. arXiv:1902.02715 (2019)

Acknowledgements

This paper is based upon work partially supported by the National Science Foundation (NSF) grant no. DMS-1619884 and the Office of Naval Research (ONR) grant no. N00014-20-1-2088.

Author information

Corresponding author

Correspondence to Quoc Tran-Dinh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix: Properties of hybrid stochastic estimators

This appendix provides the full proofs of our theoretical results in Sect. 3. We also need the following lemma in the sequel, so we prove it here first.

Lemma 7

Given \(L > 0\), \(\delta > 0\), \(\epsilon > 0\), and \(\omega \in (0, 1)\), let \(\{\gamma _t\}_{t=0}^m\) be a sequence of positive real numbers updated by

$$\begin{aligned} \gamma _m := \frac{\delta }{L}~~~~\text {and}~~~~\gamma _t := \frac{\delta }{L + \epsilon L^2\big [\omega \gamma _{t+1} + \omega ^2\gamma _{t+2} + \cdots + \omega ^{(m-t)}\gamma _m\big ]}, \end{aligned}$$
(49)

for \(t=0,\cdots , m-1\). Then

$$\begin{aligned} 0 {<} \gamma _0 {<} \gamma _1 {<} \cdots {<} \gamma _m {=} \frac{\delta }{L}~~~\text {and}~~~\varSigma _m {:=} \sum _{t=0}^m\gamma _t \ge \frac{\delta (m+1)\sqrt{1-\omega }}{L\left[ \sqrt{1-\omega } {+} \sqrt{1 - \omega + 4\delta \omega \epsilon }\right] }. \end{aligned}$$
(50)

Proof

First, from (49) it is straightforward to show that \(0< \gamma _0< \cdots< \gamma _{m-1} = \frac{\delta }{L(1+\delta \epsilon \omega )} < \gamma _m = \frac{\delta }{L}\). At the same time, since \(\omega \in (0, 1)\), we have \(1 \ge \omega \ge \omega ^2 \ge \cdots \ge \omega ^{m}\). By Chebyshev’s sum inequality, we have

$$\begin{aligned} (m-t)\left( \omega \gamma _{t+1} + \omega ^2\gamma _{t+2} + \cdots + \omega ^{m-t}\gamma _m\right)\le & {} \left( \sum _{j=t+1}^m\gamma _j\right) \left( \omega + \omega ^2 + \cdots + \omega ^{m-t}\right) \nonumber \\\le & {} \frac{\omega }{1-\omega }\left( \sum _{j=t+1}^m\gamma _j\right) . \end{aligned}$$
(51)

From the update (49), we also have

$$\begin{aligned} \left\{ \begin{array}{ll} \epsilon L^2\gamma _0(\omega \gamma _1 + \omega ^2\gamma _2 + \cdots + \omega ^{m}\gamma _m) &{}= \delta - L\gamma _0 \\ \epsilon L^2\gamma _1(\omega \gamma _2 + \omega ^2\gamma _3 + \cdots + \omega ^{m-1}\gamma _{m}) &{}= \delta - L\gamma _1 \\ \cdots &{} \cdots \\ \epsilon L^2\gamma _{m-1}\omega \gamma _m &{}= \delta - L\gamma _{m-1} \\ 0 &{}= \delta - L\gamma _{m}. \end{array}\right. \end{aligned}$$
(52)

Substituting (51) into (52), we get

$$\begin{aligned} \left\{ \begin{array}{ll} \frac{\omega \epsilon L^2}{1-\omega }\gamma _0(\gamma _0 + \gamma _1 + \cdots + \gamma _m) &{}\ge \delta m - mL\gamma _0 + \frac{\omega \epsilon L^2}{1-\omega }\gamma _0^2 \\ \frac{\omega \epsilon L^2}{1-\omega }\gamma _1(\gamma _0 + \gamma _1 + \cdots + \gamma _{m}) &{}\ge \delta (m-1) - (m-1)L\gamma _1 + \frac{\omega \epsilon L^2}{1-\omega }(\gamma _1\gamma _0 + \gamma _1^2) \\ \cdots &{} \cdots \\ \frac{\omega \epsilon L^2}{1-\omega }\gamma _{m-1}(\gamma _0 + \gamma _1 + \cdots + \gamma _m) &{}\ge \delta - L\gamma _{m-1} + \frac{\omega \epsilon L^2}{1-\omega }(\gamma _{m-1}\gamma _0 + \cdots + \gamma _{m-1}^2) \\ \frac{\omega \epsilon L^2}{1-\omega }\gamma _{m}(\gamma _0 + \gamma _1 + \cdots + \gamma _m) &{}\ge \delta - L\gamma _{m} + \frac{\omega \epsilon L^2}{1-\omega }(\gamma _{m}\gamma _0 + \cdots + \gamma _{m}^2). \end{array}\right. \end{aligned}$$

Let us define \(\varSigma _m := \sum _{t=0}^m\gamma _t\) and \(S_m := \sum _{t=0}^m\gamma _t^2\). Summing up both sides of the above inequalities, we get

$$\begin{aligned} \frac{\omega \epsilon L^2}{1-\omega }\varSigma _m^2\ge & {} \frac{\delta (m^2 + m + 2)}{2} - L(m\gamma _0 + (m-1)\gamma _1 + \cdots + \gamma _{m-1} + \gamma _m) \\&+ \frac{\omega \epsilon L^2}{2(1-\omega )}\big (S_m + \varSigma _m^2\big ). \end{aligned}$$

Using again Chebyshev’s sum inequality, we have

$$\begin{aligned} m\gamma _0 + (m-1)\gamma _1 + \cdots + \gamma _{m-1} + \gamma _m \le \frac{m^2 + m + 2}{2(m+1)}\left( \sum _{t=0}^m\gamma _t\right) = \frac{(m^2 + m + 2)\varSigma _m}{2(m+1)}. \end{aligned}$$

Note that \((m+1)S_m \ge \varSigma _m^2\) by the Cauchy–Schwarz inequality, which shows that \(S_m + \varSigma _m^2 \ge \big (\frac{m+2}{m+1}\big )\varSigma _m^2\). Combining the last three inequalities, we obtain the following quadratic inequality in \(\varSigma _m > 0\):

$$\begin{aligned} \frac{m\omega \epsilon L^2}{(1-\omega )}\varSigma _m^2 + L(m^2 + m + 2)\varSigma _m - \delta (m+1)(m^2 + m + 2) \ge 0. \end{aligned}$$

Solving this quadratic inequality with respect to \(\varSigma _m > 0\), we obtain

$$\begin{aligned} \varSigma _m\ge & {} \frac{(1-\omega )\big [\sqrt{(m^2 + m + 2)^2 + \frac{4m(m+1)(m^2 + m + 2)\omega \epsilon \delta }{1-\omega }} - (m^2 + m + 2)\big ]}{2\epsilon \omega m L} \\= & {} \frac{2\delta (m+1)}{L\left[ 1 + \sqrt{1 + \frac{4m(m+1)\omega \delta \epsilon }{(1-\omega )(m^2 + m+2)}}\right] } \\\ge & {} \frac{2\delta (m+1)\sqrt{1-\omega }}{L\left[ \sqrt{1-\omega } + \sqrt{1 - \omega + 4\delta \omega \epsilon }\right] } ~~~~~\text {since}~~\frac{m(m+1)}{m^2+m+2} < 1. \end{aligned}$$

This proves (50). \(\square \)
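
The recursion (49) and the guarantee (50) are easy to check numerically. The following Python sketch (all parameter values are illustrative and not taken from the paper) computes \(\{\gamma _t\}_{t=0}^m\) by the backward recursion and checks the monotonicity and the lower bound stated in (50), using the instantiation \(\delta = 2/\eta - 2L\), \(\epsilon = (1+L^2\eta ^2)/L\), and \(\omega = \beta ^2\) that appears later in the proof of Theorem 2.

```python
import math

def gamma_sequence(L, delta, eps, omega, m):
    """Backward recursion (49): gamma_m = delta/L and, for t < m,
    gamma_t = delta / (L + eps*L^2 * sum_{j=1}^{m-t} omega^j * gamma_{t+j})."""
    gamma = [0.0] * (m + 1)
    gamma[m] = delta / L
    for t in range(m - 1, -1, -1):
        tail = sum(omega ** (j - t) * gamma[j] for j in range(t + 1, m + 1))
        gamma[t] = delta / (L + eps * L ** 2 * tail)
    return gamma

def sigma_lower_bound(L, delta, eps, omega, m):
    """Right-hand side of the lower bound in (50)."""
    root = math.sqrt(1.0 - omega)
    return delta * (m + 1) * root / (L * (root + math.sqrt(1.0 - omega + 4.0 * delta * omega * eps)))

# Illustrative values (not from the paper), using the Theorem 2 instantiation
# delta = 2/eta - 2L, eps = (1 + L^2*eta^2)/L, omega = beta^2 with eta in (0, 1/L).
L, eta, m = 1.0, 0.5, 100
beta = 1.0 - 1.0 / math.sqrt(m + 1)
delta, eps, omega = 2.0 / eta - 2.0 * L, (1.0 + (L * eta) ** 2) / L, beta ** 2
gam = gamma_sequence(L, delta, eps, omega, m)
assert all(gam[t] < gam[t + 1] for t in range(m))              # monotonicity in (50)
assert sum(gam) >= sigma_lower_bound(L, delta, eps, omega, m)  # lower bound in (50)
```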

1.1 The proof of Lemma 3: Error estimate with mini-batch

The proof of the first expression of (19) is the same as in Lemma 1. We only prove the second one. Let \(\varDelta _{\mathcal {B}_t} := \frac{1}{b_t}\sum _{\xi _i\in \mathcal {B}_t}\left[ G_{\xi _i}(x_t) - G_{\xi _i}(x_{t-1})\right] \), \(\varDelta _t := G(x_t) - G(x_{t-1})\), \(\hat{\delta }_t := \hat{v}_t - G(x_t)\), and \(\delta {u}_t := u_t - G(x_t)\). Clearly, we have

$$\begin{aligned} \mathbb {E}_{\mathcal {B}_t}\left[ \varDelta _{\mathcal {B}_t}\right] = \varDelta _t~~~\text {and}~~~\mathbb {E}_{\hat{\mathcal {B}}_t}\left[ \delta {u}_t\right] = 0. \end{aligned}$$

Moreover, using the definition of \(\hat{v}_t\), we can rewrite \(\hat{\delta }_t\) as

$$\begin{aligned} \hat{\delta }_t = \beta _{t-1}\hat{\delta }_{t-1} + \beta _{t-1}\varDelta _{\mathcal {B}_t} + (1-\beta _{t-1})\delta {u}_t - \beta _{t-1}\varDelta _t. \end{aligned}$$

Therefore, using these two expressions, we can derive

$$\begin{aligned}&\mathbb {E}_{(\mathcal {B}_t,\hat{\mathcal {B}}_t)}\left[ \Vert \hat{\delta }_t\Vert ^2\right] \\&\quad = \beta _{t-1}^2\Vert \hat{\delta }_{t-1}\Vert ^2 + \beta _{t-1}^2\mathbb {E}_{\mathcal {B}_t}\left[ \Vert \varDelta _{\mathcal {B}_t}\Vert ^2\right] + (1-\beta _{t-1})^2\mathbb {E}_{\hat{\mathcal {B}}_t}\left[ \Vert \delta {u}_t\Vert ^2\right] + \beta _{t-1}^2\Vert \varDelta _t\Vert ^2\\&\qquad + {~} 2\beta _{t-1}^2\langle \hat{\delta }_{t-1},\mathbb {E}_{\mathcal {B}_t}\left[ \varDelta _{\mathcal {B}_t}\right] \rangle + 2\beta _{t-1}(1-\beta _{t-1})\langle \hat{\delta }_{t-1}, \mathbb {E}_{\hat{\mathcal {B}}_t}\left[ \delta {u}_t\right] \rangle - 2\beta _{t-1}^2\langle \hat{\delta }_{t-1},\varDelta _t\rangle \\&\qquad + {~} 2\beta _{t-1}(1-\beta _{t-1})\mathbb {E}_{(\mathcal {B}_t,\hat{\mathcal {B}}_t)}\left[ \langle \varDelta _{\mathcal {B}_t}, \delta {u}_t\rangle \right] - 2\beta _{t-1}^2\langle \mathbb {E}_{\mathcal {B}_t}\left[ \varDelta _{\mathcal {B}_t}\right] , \varDelta _t\rangle \\&\qquad - {~} 2\beta _{t-1}(1-\beta _{t-1})\langle \mathbb {E}_{\hat{\mathcal {B}}_t}\left[ \delta {u}_t\right] , \varDelta _t\rangle \\&\quad = \beta _{t-1}^2\Vert \hat{\delta }_{t-1}\Vert ^2 + \beta _{t-1}^2\mathbb {E}_{\mathcal {B}_t}\left[ \Vert \varDelta _{\mathcal {B}_t}\Vert ^2\right] + (1-\beta _{t-1})^2\mathbb {E}_{\hat{\mathcal {B}}_t}\left[ \Vert \delta {u}_t\Vert ^2\right] - \beta _{t-1}^2\Vert \varDelta _t\Vert ^2. \end{aligned}$$

Similar to the proof of [59, Lemma 2], for the finite-sum case (i.e., \(\vert \varOmega \vert = n\)), we can show that

$$\begin{aligned} \mathbb {E}_{\mathcal {B}_t}\left[ \Vert \varDelta _{\mathcal {B}_t}\Vert ^2\right] = \frac{n(b_t-1)}{(n-1)b_t}\Vert \varDelta _t\Vert ^2 + \frac{(n-b_t)}{(n-1)b_t}\mathbb {E}_{\xi }\left[ \Vert G_{\xi }(x_t) - G_{\xi }(x_{t-1})\Vert ^2\right] . \end{aligned}$$

For the expectation case, we have

$$\begin{aligned} \mathbb {E}_{\mathcal {B}_t}\left[ \Vert \varDelta _{\mathcal {B}_t}\Vert ^2\right] = \left( 1- \frac{1}{b_t}\right) \Vert \varDelta _t\Vert ^2 + \frac{1}{b_t}\mathbb {E}_{\xi }\left[ \Vert G_{\xi }(x_t) - G_{\xi }(x_{t-1})\Vert ^2\right] . \end{aligned}$$

Using the definition of \(\rho \) in Lemma 4, we can unify these two expressions as

$$\begin{aligned} \mathbb {E}_{\mathcal {B}_t}\left[ \Vert \varDelta _{\mathcal {B}_t}\Vert ^2\right] = \left( 1- \rho \right) \Vert \varDelta _t\Vert ^2 + \rho \mathbb {E}_{\xi }\left[ \Vert G_{\xi }(x_t) - G_{\xi }(x_{t-1})\Vert ^2\right] . \end{aligned}$$

Substituting the last expression into the previous one, we obtain the second expression of (19). \(\square \)
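
For reference, the recursion used above corresponds to the mini-batch hybrid update \(\hat{v}_t = \beta _{t-1}\big (\hat{v}_{t-1} + \varDelta _{\mathcal {B}_t}\big ) + (1-\beta _{t-1})u_t\), where \(\mathcal {B}_t\) and \(\hat{\mathcal {B}}_t\) are independent mini-batches. The Python sketch below illustrates one such update on a synthetic least-squares finite sum; the problem, the sampling scheme, and all numerical values are illustrative assumptions rather than part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A, y = rng.normal(size=(n, d)), rng.normal(size=n)

def G(x, idx):
    """Average of the component gradients a_i*(a_i^T x - y_i) over the index set idx."""
    r = A[idx] @ x - y[idx]
    return A[idx].T @ r / len(idx)

def hybrid_update(v_prev, x_prev, x_cur, beta, b, b_hat):
    """hat_v_t = beta*(hat_v_{t-1} + Delta_{B_t}) + (1 - beta)*u_t with independent mini-batches."""
    B = rng.choice(n, size=b, replace=False)          # mini-batch B_t for the SARAH-type difference
    B_hat = rng.choice(n, size=b_hat, replace=False)  # independent mini-batch hat_B_t for the unbiased term
    delta_B = G(x_cur, B) - G(x_prev, B)
    u = G(x_cur, B_hat)
    return beta * (v_prev + delta_B) + (1.0 - beta) * u

# One update, starting from a full-batch snapshot v_0 = G(x_0).
x_prev = rng.normal(size=d)
x_cur = x_prev - 0.1 * rng.normal(size=d)
v_prev = G(x_prev, np.arange(n))
v_cur = hybrid_update(v_prev, x_prev, x_cur, beta=0.9, b=32, b_hat=32)
```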

1.2 The proof of Lemma 4: Upper bound of mini-batch “variance”

From Lemma 3, taking the expectation with respect to \(\mathcal {F}_{t+1} := \sigma (x_0,\mathcal {B}_0, \hat{\mathcal {B}}_0, \cdots , \mathcal {B}_t, \hat{\mathcal {B}}_t)\), we have

$$\begin{aligned} \mathbb {E}\left[ \Vert \hat{v}_t - G(x_t)\Vert ^2\right]\le & {} \beta _{t-1}^2\mathbb {E}\left[ \Vert \hat{v}_{t-1} - G(x_{t-1})\Vert ^2\right] \\&+ {~} \rho L^2\beta _{t-1}^2 \mathbb {E}\left[ \Vert x_t - x_{t-1}\Vert ^2\right] + (1-\beta _{t-1})^2\mathbb {E}_{\hat{\mathcal {B}}_t}\left[ \Vert u_t - G(x_t)\Vert ^2\right] . \end{aligned}$$

In addition, from [59, Lemma 2], we have \(\mathbb {E}_{\hat{\mathcal {B}}_t}\left[ \Vert u_t - G(x_t)\Vert ^2\right] \le \hat{\rho }\mathbb {E}_{\xi }\left[ \Vert G_{\xi }(x_t) - G(x_t)\Vert ^2\right] = \hat{\rho }\sigma _t^2\), where \(\sigma _t^2 := \mathbb {E}_{\xi }\left[ \Vert G_{\xi }(x_t) - G(x_t)\Vert ^2\right] \).

Let \(A_t^2 := \mathbb {E}\left[ \Vert \hat{v}_t - G(x_t)\Vert ^2\right] \) and \(B_t^2 := \mathbb {E}\left[ \Vert x_{t +1}- x_{t}\Vert ^2\right] \). Then, the above estimate can be upper bounded as

$$\begin{aligned} A_t^2 \le \beta _{t-1}^2A_{t-1}^2 + \rho L^2\beta _{t-1}^2B_{t-1}^2 + \hat{\rho }(1-\beta _{t-1})^2\sigma _t^2. \end{aligned}$$

By following the same inductive argument as in the proof of Lemma 2, we obtain from the last inequality that

$$\begin{aligned} A_t^2\le & {} \left( \beta _{t-1}^2\cdots \beta _0^2 \right) A_0^2 + \rho L^2 \left( \beta _{t-1}^2\cdots \beta _0^2\right) B_0^2 + \cdots + \rho L^2 \beta _{t-1}^2B_{t-1}^2 \\&+ {~} \hat{\rho }\left[ \left( \beta _{t-1}^2\cdots \beta _1^2\right) (1-\beta _0)^2\sigma _1^2 + \cdots + (1-\beta _{t-1})^2\sigma _{t}^2 \right] . \end{aligned}$$

Using the definition of \(\omega _{t}\), \(\omega _{i, t}\), and \(S_{t}\) from (16), the previous inequality becomes

$$\begin{aligned} A_t^2 \le \omega _{t}A_0^2 + \rho L^2 \sum _{i=0}^{t-1}\omega _{i,t}B_i^2 + \hat{\rho } S_t, \end{aligned}$$

which is the same as (20) by substituting the definition of \(A_t\) and \(B_t\) above into it. \(\square \)
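
The inductive unrolling in the last two displays can be sanity-checked numerically: iterating the one-step recursion (with equality and a constant weight \(\beta _t = \beta \) for simplicity) reproduces the closed-form sum with weights \(\omega _t = \beta ^{2t}\) and \(\omega _{i,t} = \beta ^{2(t-i)}\). The short Python check below uses illustrative numbers only.

```python
import numpy as np

rng = np.random.default_rng(1)
m, beta, rho, rho_hat, L = 20, 0.9, 0.1, 0.05, 2.0
B2 = rng.random(m)        # stands for E[||x_{i+1} - x_i||^2], i = 0, ..., m-1
sig2 = rng.random(m + 1)  # stands for sigma_t^2, t = 0, ..., m

# Iterate the one-step bound with equality:
# A_t^2 = beta^2*A_{t-1}^2 + rho*L^2*beta^2*B_{t-1}^2 + rho_hat*(1 - beta)^2*sigma_t^2.
A2 = [1.0]
for t in range(1, m + 1):
    A2.append(beta ** 2 * A2[-1] + rho * L ** 2 * beta ** 2 * B2[t - 1]
              + rho_hat * (1.0 - beta) ** 2 * sig2[t])

# Closed form with omega_t = beta^{2t} and omega_{i,t} = beta^{2(t-i)}, as in (20).
t = m
closed = (beta ** (2 * t) * A2[0]
          + rho * L ** 2 * sum(beta ** (2 * (t - i)) * B2[i] for i in range(t))
          + rho_hat * (1.0 - beta) ** 2 * sum(beta ** (2 * (t - j)) * sig2[j] for j in range(1, t + 1)))
assert abs(A2[t] - closed) < 1e-10
```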

The proof of technical results in Sect. 4: The single sample case

We provide the full proof of the technical results in Sect. 4.

1.1 The proof of Lemma 5: Key estimate

From the update \(x_{t+1} := (1-\gamma _t)x_t + \gamma _t\widehat{x}_{t+1}\) at Step 8 of Algorithm 1, we have \(x_{t+1} - x_t = \gamma _t(\widehat{x}_{t+1} - x_t)\). From the L-average smoothness condition in Assumption 2, one can write

$$\begin{aligned} f(x_{t+1})\le & {} f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t\rangle + \frac{L}{2}\Vert x_{t+1} - x_t\Vert ^2\nonumber \\= & {} f(x_{t}) + \gamma _t \langle \nabla f(x_t),\widehat{x}_{t+1} - x_t\rangle + \frac{L\gamma _t^2}{2}\Vert \widehat{x}_{t+1} - x_t\Vert ^2. \end{aligned}$$
(53)

Using convexity of \(\psi \), we can show that

$$\begin{aligned} \psi (x_{t+1}) \le (1-\gamma _t)\psi (x_t) + \gamma _t \psi (\widehat{x}_{t+1}) \le \psi (x_t) + \gamma _t \langle \nabla \psi (\widehat{x}_{t+1}), \widehat{x}_{t+1} - x_t\rangle ,\qquad \end{aligned}$$
(54)

where \(\nabla \psi (\widehat{x}_{t+1}) \in \partial \psi (\widehat{x}_{t+1})\) is any subgradient of \(\psi \) at \(\widehat{x}_{t+1}\).

Utilizing the optimality condition of \(\widehat{x}_{t+1} = \mathrm {prox}_{\eta _t \psi }(x_{t} - \eta _t v_t)\), we can show that \(\nabla \psi (\widehat{x}_{t+1}) = -v_t - \frac{1}{\eta _t}(\widehat{x}_{t+1} - x_t)\) for some \(\nabla \psi (\widehat{x}_{t+1}) \in \partial \psi (\widehat{x}_{t+1})\). Substituting this relation into (54), we get

$$\begin{aligned} \psi (x_{t+1}) \le \psi (x_t) - \gamma _t \left\langle v_t, \widehat{x}_{t+1} - x_t\right\rangle - \frac{\gamma _t}{\eta _t}\Vert \widehat{x}_{t+1} - x_t\Vert ^2. \end{aligned}$$
(55)

Combining (53) and (55), and using \(F(x) := f(x) + \psi (x)\) from (1), we obtain

$$\begin{aligned} F(x_{t+1}) \le F (x_t) {+} \gamma _t \left\langle \nabla f(x_t) {-}v_t, \widehat{x}_{t+1} {-} x_t\right\rangle {-}\left( \frac{\gamma _t}{\eta _t} {-} \frac{L\gamma _t^2}{2}\right) \Vert \widehat{x}_{t+1} {-} x_t\Vert ^2. \end{aligned}$$
(56)

For any \(c_t > 0\), we can always write

$$\begin{aligned} \langle \nabla f(x_t) -v_t, \widehat{x}_{t+1} - x_t\rangle= & {} \frac{1}{2c_t}\Vert \nabla f(x_t) -v_t \Vert ^2 + \frac{c_t}{2} \Vert \widehat{x}_{t+1} - x_t \Vert ^2 \\&- \frac{1}{2c_t}\Vert \nabla f(x_t) -v_t - c_t( \widehat{x}_{t+1} - x_t) \Vert ^2. \end{aligned}$$

Utilizing this expression, we can rewrite (56) as

$$\begin{aligned} F(x_{t+1}) \le F (x_t) + \frac{\gamma _t}{2c_t}\Vert \nabla {f}(x_t) - v_t \Vert ^2 -\left( \frac{\gamma _t}{\eta _t} - \frac{L\gamma _t^2}{2} - \frac{\gamma _tc_t}{2}\right) \Vert \widehat{x}_{t+1} - x_t\Vert ^2 - \frac{\tilde{\sigma }_t^2}{2}, \end{aligned}$$

where \(\tilde{\sigma }_t^2 := \frac{\gamma _t}{c_t}\Vert \nabla {f}(x_t) - v_t - c_t( \widehat{x}_{t+1} - x_t) \Vert ^2 \ge 0\).

Taking the expectation of both sides of this inequality over the entire history \(\mathcal {F}_{t+1}\), we obtain

$$\begin{aligned} \mathbb {E}\left[ F(x_{t+1})\right]\le & {} \mathbb {E}\left[ F(x_t)\right] + \frac{\gamma _t}{2c_t}\mathbb {E}\left[ \Vert \nabla {f}(x_t) - v_t\Vert ^2\right] \nonumber \\&- {~} \Big (\frac{\gamma _t}{\eta _t} - \frac{L\gamma _t^2}{2}- \frac{\gamma _tc_t}{2}\Big )\mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] - \frac{1}{2}\mathbb {E}\left[ \tilde{\sigma }_t^2\right] . \end{aligned}$$
(57)

Next, from the definition of the gradient mapping \(\mathcal {G}_{\eta }(x):= \frac{1}{\eta }\left( x - \mathrm {prox}_{\eta \psi }(x - \eta \nabla f(x))\right) \) in (9), we can see that

$$\begin{aligned} \eta _{t}\Vert \mathcal {G}_{\eta _t}(x_t)\Vert = \Vert x_t - \mathrm {prox}_{\eta _t \psi }\left( x_t - \eta _t \nabla f(x_t)\right) \Vert . \end{aligned}$$

Using this expression, the triangle inequality, and the nonexpansive property \(\Vert \mathrm {prox}_{\eta \psi }(z) - \mathrm {prox}_{\eta \psi }(w)\Vert \le \Vert z - w\Vert \) of \(\mathrm {prox}_{\eta \psi }\), we can derive that

$$\begin{aligned} \eta _t\Vert \mathcal {G}_{\eta _t}(x_t)\Vert\le & {} \Vert \widehat{x}_{t+1} - x_t\Vert + \Vert \mathrm {prox}_{\eta _t\psi }(x_t - \eta _t\nabla {f}(x_t)) - \widehat{x}_{t+1}\Vert \\= & {} \Vert \widehat{x}_{t+1} - x_t\Vert + \Vert \mathrm {prox}_{\eta _t\psi }(x_t - \eta _t\nabla {f}(x_t)) - \mathrm {prox}_{\eta _t\psi }(x_t - \eta _tv_t)\Vert \\\le & {} \Vert \widehat{x}_{t+1} - x_t\Vert + \eta _t\Vert \nabla {f}(x_t) - v_t\Vert . \end{aligned}$$

Now, for any \(r_t > 0\), the last estimate leads to

$$\begin{aligned} \eta _t^2\mathbb {E}\left[ \Vert \mathcal {G}_{\eta _t}(x_t)\Vert ^2\right] \le \left( 1+\tfrac{1}{r_t}\right) \mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] + (1+r_t)\eta _t^2\mathbb {E}\left[ \Vert \nabla {f}(x_t) - v_t\Vert ^2\right] . \end{aligned}$$

Multiplying this inequality by \(\frac{q_t}{2} > 0\) and adding the result to (57), we finally get

$$\begin{aligned} \mathbb {E}\left[ F(x_{t+1})\right]\le & {} \mathbb {E}\left[ F(x_t)\right] - \frac{q_t\eta _t^2}{2}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta _t}(x_t)\Vert ^2\right] \\&+ {~} \frac{1}{2}\Big [\frac{\gamma _t}{c_t} + (1+r_t)q_t\eta _t^2\Big ] \mathbb {E}\left[ \Vert \nabla {f}(x_t) - v_t\Vert ^2\right] \\&- {~} \frac{1}{2}\Big [\frac{2\gamma _t}{\eta _t} - L\gamma _t^2 - \gamma _tc_t - q_t\left( 1+\frac{1}{r_t}\right) \Big ] \mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t \Vert ^2\right] - \frac{1}{2}\mathbb {E}\left[ \tilde{\sigma }_t^2\right] . \end{aligned}$$

Using the definition of \(\theta _t\) and \(\kappa _t\) from (22), i.e.:

$$\begin{aligned} \theta _t:= \frac{\gamma _t}{c_t} + (1+r_t)q_t\eta _t^2~~~~\text {and}~~~~\kappa _t := \frac{2\gamma _t}{\eta _t} - L\gamma _t^2 - \gamma _tc_t - q_t\left( 1 + \frac{1}{r_t}\right) , \end{aligned}$$

we can simplify the last estimate as follows:

$$\begin{aligned} \mathbb {E}\left[ F(x_{t+1})\right]\le & {} \mathbb {E}\left[ F(x_t)\right] - \frac{q_t\eta _t^2}{2}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta _t}(x_t)\Vert ^2\right] + \frac{\theta _t}{2} \mathbb {E}\left[ \Vert \nabla {f}(x_t) - v_t\Vert ^2\right] \\&- {~} \frac{\kappa _t}{2}\mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t \Vert ^2\right] - \frac{1}{2}\mathbb {E}\left[ \tilde{\sigma }_t^2\right] , \end{aligned}$$

which is exactly (21). \(\square \)
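
For concreteness, the two objects manipulated in this proof, the averaged proximal step \(\widehat{x}_{t+1} = \mathrm {prox}_{\eta _t\psi }(x_t - \eta _t v_t)\), \(x_{t+1} = (1-\gamma _t)x_t + \gamma _t\widehat{x}_{t+1}\) of Algorithm 1 and the gradient mapping \(\mathcal {G}_{\eta }\) from (9), can be sketched in a few lines of Python. The choice \(\psi (x) = \lambda \Vert x\Vert _1\), whose proximal operator is soft-thresholding, is an illustrative assumption and not prescribed by the paper.

```python
import numpy as np

def prox_l1(z, tau):
    """Proximal operator of tau*||.||_1 (soft-thresholding); psi = lam*||.||_1 is an assumed example."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def algorithm1_step(x, v, eta, gamma, lam):
    """Step 8: x_hat = prox_{eta*psi}(x - eta*v), then x_next = (1 - gamma)*x + gamma*x_hat."""
    x_hat = prox_l1(x - eta * v, eta * lam)
    return (1.0 - gamma) * x + gamma * x_hat

def gradient_mapping_norm(x, grad_fx, eta, lam):
    """||G_eta(x)|| with G_eta(x) = (x - prox_{eta*psi}(x - eta*grad_f(x))) / eta, as in (9)."""
    return np.linalg.norm(x - prox_l1(x - eta * grad_fx, eta * lam)) / eta

# Tiny usage with arbitrary numbers: v stands for the hybrid estimator v_t,
# while g stands for the true gradient nabla f(x_t).
x = np.array([1.0, -0.5, 0.2])
v = np.array([0.3, -0.1, 0.4])
g = np.array([0.25, -0.15, 0.35])
x_next = algorithm1_step(x, v, eta=0.4, gamma=0.5, lam=0.1)
print(gradient_mapping_norm(x, g, eta=0.4, lam=0.1))
```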

1.2 The proof of Lemma 6: Key estimate of Lyapunov function

From (14), by taking the full expectation on the history \(\mathcal {F}_{t+1}\) and using the L-average smoothness of f, we can show that

$$\begin{aligned}&\mathbb {E}\left[ \Vert v_{t+1} - \nabla {f}(x_{t+1})\Vert ^2\right] \\&\quad \le \beta _t^2\mathbb {E}\left[ \Vert v_t - \nabla {f}(x_t)\Vert ^2\right] + \beta _t^2L^2\mathbb {E}\left[ \Vert x_{t+1} - x_t\Vert ^2\right] + (1-\beta _t)^2\sigma _{t+1}^2 \\&\quad = \beta _t^2\mathbb {E}\left[ \Vert v_t - \nabla {f}(x_t)\Vert ^2\right] + \beta _t^2\gamma _t^2L^2\mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] + (1-\beta _t)^2\sigma _{t+1}^2, \end{aligned}$$

where \(\sigma _t^2 := \mathbb {E}\left[ \Vert \nabla {f}_{\zeta _t}(x_t) - \nabla {f}(x_t)\Vert ^2\right] \).

Let V be the Lyapunov function defined by (23). Then, multiplying the last inequality by \(\frac{\alpha _{t+1}}{2} > 0\), adding the result to (21), and using this Lyapunov function, we can show that

$$\begin{aligned} V(x_{t+1})\le & {} V(x_t) - \frac{q_t\eta _t^2}{2}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta _t}(x_t)\Vert ^2\right] - \frac{1}{2}(\alpha _t - \beta _t^2\alpha _{t+1} - \theta _t)\mathbb {E}\left[ \Vert v_t - \nabla {f}(x_t)\Vert ^2\right] \nonumber \\&- {~} \frac{1}{2}(\kappa _t - \alpha _{t+1}\beta _t^2\gamma _t^2L^2)\mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] + \frac{1}{2}(1-\beta _t)^2\alpha _{t+1}\sigma _{t+1}^2 - \frac{1}{2}\mathbb {E}\left[ \tilde{\sigma }_t^2\right] .\qquad \quad \end{aligned}$$
(58)

Let us choose \(\gamma _t\), \(\eta _t\), and other parameters such that the conditions (24) hold, i.e.:

$$\begin{aligned} \alpha _t - \beta _t^2\alpha _{t+1} - \theta _t \ge 0 ~~~\text {and}~~\kappa _t - \alpha _{t+1}\beta _t^2\gamma _t^2L^2 \ge 0. \end{aligned}$$

In this case, (58) can be simplified as follows:

$$\begin{aligned} V(x_{t+1}) \le V(x_t) - \frac{q_t\eta _t^2}{2}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta _t}(x_t)\Vert ^2\right] + \frac{1}{2}\alpha _{t+1}(1-\beta _t)^2\sigma _{t+1}^2, \end{aligned}$$

which proves (25).

Finally, summing up this inequality from \(t := 0\) to \(t := m\), we obtain

$$\begin{aligned} \sum _{t=0}^m\frac{q_t\eta _t^2}{2}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta _t}(x_t)\Vert ^2\right] \le V(x_0) - V(x_{m+1}) + \frac{1}{2}\sum _{t=0}^m\alpha _{t+1}(1-\beta _t)^2\sigma _{t+1}^2. \end{aligned}$$

Note that \(V(x_{m+1}) := \mathbb {E}\left[ F(x_{m+1})\right] + \frac{\alpha _{m+1}}{2}\mathbb {E}\left[ \Vert v_{m+1} - \nabla {f}(x_{m+1})\Vert ^2\right] \ge \mathbb {E}\left[ F(x_{m+1})\right] \ge F^{\star }\) by Assumption 1 and \(V(x_0) = F(x_0) + \frac{\alpha _0}{2}\mathbb {E}\left[ \Vert v_0 - \nabla {f}(x_0)\Vert ^2\right] \). Substituting these estimates into the last inequality, we obtain the key estimate (26). \(\square \)

1.3 The proof of Theorem 2: The adaptive step-size case

Let \(\{(x_t, \hat{x}_{t})\}\) be generated by Algorithm 1. Let us again choose \(c_t := L\), \(r_t := 1\) and \(q_t := \frac{L\gamma _t}{2}\) and fix \(\eta _t := \eta \in (0, \frac{1}{L})\) in Lemma 2 as done in Theorem 1. Then, from (22), we have

$$\begin{aligned} \theta _t := \frac{(1 + L^2\eta ^2)\gamma _t}{L}~~~~~\text {and}~~~~~\kappa _t := \left( \frac{2}{\eta } - L\gamma _t - 2L\right) \gamma _t. \end{aligned}$$

Substituting these parameters into (21), summing up the result from \(t := 0\) to \(t := m\), and then using (15) from Lemma 2, we obtain

$$\begin{aligned} \mathbb {E}\left[ F(x_{m+1})\right]\le & {} \mathbb {E}\left[ F(x_0)\right] +\dfrac{L^2}{2}\displaystyle \sum _{t=0}^m \theta _t\sum _{i=0}^{t-1}\gamma _i^2\omega _{i,t}\mathbb {E}\left[ \Vert \widehat{x}_{i+1} - x_{i}\Vert ^2\right] \\&- \dfrac{1}{2}\displaystyle \sum _{t=0}^m\kappa _t \mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] - {~} \dfrac{\eta ^2L}{4}\displaystyle \sum _{t=0}^m \gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] - \dfrac{1}{2}\displaystyle \sum _{t=0}^m\mathbb {E}\left[ \tilde{\sigma }_t^2\right] \\&+ \dfrac{1}{2}\displaystyle \sum _{t=0}^m\theta _t \omega _t \bar{\sigma }^2 + \dfrac{1}{2}\displaystyle \sum _{t=0}^m \theta _t S_t, \end{aligned}$$

where \(\bar{\sigma }^2 := \mathbb {E}\left[ \Vert v_0 - \nabla f(x_0)\Vert ^2\right] \ge 0\), \(\tilde{\sigma }_t^2 := \dfrac{\gamma _t}{L}\Vert \nabla f(x_t) - v_t - L(\hat{x}_{t+1} - x_t)\Vert ^2 \ge 0\), and \(\omega _{i,t}\), \(\omega _t\), and \(S_t\) are defined in Lemma 2.

By ignoring the nonnegative term \(\mathbb {E}\left[ \tilde{\sigma }_t^2\right] \), and using the expression of \(\theta _t\) and \(\kappa _t\) above, we can further estimate the last inequality as follows:

$$\begin{aligned} \mathbb {E}\left[ F(x_{m+1})\right]\le & {} \mathbb {E}\left[ F(x_0)\right] - \displaystyle \frac{\eta ^2L}{4}\sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] + \displaystyle \frac{(1+L^2\eta ^2)\bar{\sigma }^2}{2L}{\displaystyle \sum _{t=0}^m\omega _t\gamma _t} \nonumber \\&+ {~} \dfrac{(1+L^2\eta ^2)}{2L}{\displaystyle \sum _{t=0}^m\gamma _tS_t} + \dfrac{\mathcal {T}_m}{2}, \end{aligned}$$
(59)

where \(\mathcal {T}_m\) is defined as follows:

$$\begin{aligned} \mathcal {T}_m&:= L(1+L^2\eta ^2)\sum _{t=0}^m\gamma _t\sum _{i=0}^{t-1}\omega _{i,t}\gamma _i^2\mathbb {E}\left[ \Vert \widehat{x}_{i+1} - x_i\Vert ^2\right] \nonumber \\&\quad -\, \sum _{t=0}^m\gamma _t\left( \frac{2}{\eta } - 2L - L\gamma _t\right) \mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] . \end{aligned}$$
(60)

Now, with the choice of \(\beta _t = \beta := 1- \frac{1}{\sqrt{\tilde{b}(m+1)}} \in (0, 1)\), we can easily show that \(\omega _t = \beta ^{2t}\), \(\omega _{i,t} = \beta ^{2(t-i)}\), and \(s_t := \sum _{i=0}^{t-1}\big (\prod _{j=i+2}^{t}\beta _{j-1}^2\big )(1-\beta _i)^2 = (1-\beta )^2\Big [\frac{1-\beta ^{2t}}{1-\beta ^2}\Big ] < \frac{1-\beta }{1+\beta }\) due to Lemma 2.

Let \(w_i^2 := \mathbb {E}\left[ \Vert \widehat{x}_{i+1} - x_i\Vert ^2\right] \). To bound the quantity \(\mathcal {T}_m\) defined by (60), we note that

$$\begin{aligned} \displaystyle \sum _{t=1}^m\gamma _t\sum _{i=0}^{t-1}\beta ^{2(t-i)}\gamma _i^2w_i^2= & {} \beta ^2\gamma _0^2\big [\gamma _1 + \beta ^2\gamma _2 + \cdots + \beta ^{2(m-1)}\gamma _m\big ]w_0^2 \\&+ {~} \beta ^2\gamma _1^2\big [\gamma _2 + \beta ^2\gamma _3 + \cdots + \beta ^{2(m-2)}\gamma _{m}\big ]w_1^2 + \cdots \\&+ {~} \beta ^2\gamma _{m-2}^2\big [\gamma _{m-1} + \beta ^2\gamma _{m}\big ]w_{m-2}^2 + \beta ^2\gamma _{m-1}^2\gamma _m w_{m-1}^2. \end{aligned}$$

Using \(\delta := \frac{2}{\eta } - 2L\), we can write \(\mathcal {T}_m\) from (60) as

$$\begin{aligned} \mathcal {T}_m= & {} \gamma _0\Big [ L(1+L^2\eta ^2) \beta ^2\gamma _0\big [\gamma _1 + \beta ^2\gamma _2 + \cdots + \beta ^{2(m-1)}\gamma _m\big ] - (\delta - L\gamma _0)\Big ]w_0^2 \\&+ {~} \gamma _1\Big [ L(1+L^2\eta ^2) \beta ^2\gamma _1\big [\gamma _2 + \beta ^2\gamma _3 + \cdots + \beta ^{2(m-2)}\gamma _{m}\big ] - (\delta - L\gamma _1)\Big ]w_1^2 + \cdots \\&+ {~} \gamma _{m-1}\Big [ L(1+L^2\eta ^2) \beta ^2\gamma _{m-1}\gamma _m - (\delta - L\gamma _{m-1})\Big ]w_{m-1}^2 - \gamma _m(\delta - L\gamma _m)w_{m}^2. \end{aligned}$$

To guarantee \(\mathcal {T}_m \le 0\), from the last expression of \(\mathcal {T}_m\), we can impose the following condition:

$$\begin{aligned} \left\{ \begin{array}{lcl} L(1+L^2\eta ^2) \beta ^2\gamma _0\big [\gamma _1 + \beta ^2\gamma _2 + \cdots + \beta ^{2(m-1)}\gamma _m\big ] - (\delta - L\gamma _0) &{}=&{} 0\\ \cdots \cdots &{}\cdots &{}\\ L(1+L^2\eta ^2) \beta ^2\gamma _{m-1}\gamma _m - (\delta - L\gamma _{m-1}) &{}=&{} 0\\ - (\delta - L\gamma _{m}) &{}= &{} 0. \end{array}\right. \end{aligned}$$
(61)

It is straightforward to verify that the condition (61) leads to the following update of \(\gamma _t\):

$$\begin{aligned} \gamma _m :=&\frac{\delta }{L}~~~\text {and}~~~\gamma _t := \frac{\delta }{L + L(1+L^2\eta ^2)\big [\beta ^2\gamma _{t+1} + \beta ^4\gamma _{t+2} + \cdots + \beta ^{2(m-t)}\gamma _m\big ]},\\&\quad ~~~t=0,\cdots , m-1, \end{aligned}$$

which is exactly (33).

(a) Since \(\beta = 1 - \frac{1}{[\tilde{b}(m+1)]^{1/2}}\), we have

$$\begin{aligned} \frac{1}{[\tilde{b}(m+1)]^{1/2}}= & {} 1 - \beta \le 1 - \beta ^2 \\\le & {} \frac{2}{[\tilde{b}(m+1)]^{1/2}}. \end{aligned}$$

Moreover, since \(\eta \in (0, \frac{1}{L})\), with \(\epsilon := \frac{1+L^2\eta ^2}{L}\), \(\delta := \frac{2}{\eta }-2L\), and \(\omega := \beta ^2 \in (0,1)\), using the last inequalities, we can easily show that

$$\begin{aligned} \sqrt{1-\omega } + \sqrt{1 - \omega + 4\delta \omega \epsilon }= & {} \sqrt{1-\beta ^2} + \sqrt{1 - \beta ^2 + \frac{4\delta \beta ^2(1+L^2\eta ^2)}{L} } \nonumber \\\le & {} 2\sqrt{2}\left( \frac{1}{[\tilde{b}(m+1)]^{1/4}} + \sqrt{\frac{\delta }{L}}\right) . \end{aligned}$$
(62)

Using (62), \(\sqrt{1-\omega } = \sqrt{1-\beta ^2} \ge \frac{1}{(\tilde{b}(m+1))^{1/4}}\), and \(\epsilon = \frac{1+L^2\eta ^2}{L}\) into (50) of Lemma 7, we can derive

$$\begin{aligned} \varSigma _m := \sum _{t=0}^m\gamma _t \ge \frac{\delta (m+1)}{2\sqrt{2}\left( L + \sqrt{L\delta }[\tilde{b}(m+1)]^{1/4}\right) }. \end{aligned}$$
(63)

Next, since \(\omega _t = \beta ^{2t}\), by Chebyshev’s sum inequality, we have

$$\begin{aligned} \sum _{t=0}^m\omega _t\gamma _t = \sum _{t=0}^m\beta ^{2t}\gamma _t \le \frac{\varSigma _m}{(m+1)}(1 + \beta ^2 + \cdots + \beta ^{2m}) \le \frac{\varSigma _m}{(m+1)(1-\beta ^2)}. \end{aligned}$$

Utilizing this estimate, \(\bar{\sigma }^2 := \mathbb {E}\left[ \Vert v_0 - \nabla {f}(x_0)\Vert ^2\right] \le \frac{\sigma ^2}{\tilde{b}}\), and \(S_t \le \sigma ^2 s_t \le \frac{(1-\beta )\sigma ^2}{1+\beta }\) into (59), and noting that \(\mathcal {T}_m \le 0\), we can further upper bound it as

$$\begin{aligned} \displaystyle \frac{\eta ^2L}{4}\displaystyle \sum _{t=0}^m\gamma _t \mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right]\le & {} F(x_0) - \mathbb {E}\left[ F(x_{m+1})\right] + \frac{(1+L^2\eta ^2)\sigma ^2\varSigma _m}{2L(1-\beta ^2)(m+1)\tilde{b}}\\&+ \frac{(1+L^2\eta ^2)(1-\beta )\sigma ^2\varSigma _m}{2L(1+\beta )}. \end{aligned}$$

By Assumption 1, we have \(\mathbb {E}\left[ F(x_{m+1})\right] \ge F^{\star }\). Substituting this bound into the last estimate and then multiplying the result by \(\frac{4}{L\eta ^2\varSigma _m}\) we obtain

$$\begin{aligned} \displaystyle \frac{1}{\varSigma _m}\displaystyle \sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right]\le & {} \frac{4}{L\eta ^2\varSigma _m}[F(x_0) - F^{\star }] \\&+ \frac{2\sigma ^2(1+L^2\eta ^2)}{L^2\eta ^2(1+\beta )}\left[ \frac{1}{\tilde{b}(m+1)(1-\beta )} + (1-\beta )\right] . \end{aligned}$$

Since \(\beta = 1- \frac{1}{\tilde{b}^{1/2}(m+1)^{1/2}}\), we have \(\frac{1}{\tilde{b}(m+1)(1-\beta )} + (1-\beta ) = \frac{2}{\tilde{b}^{1/2}(m+1)^{1/2}}\). Utilizing this expression, (63), \(1+\eta ^2L^2 \le 2\), and \(\beta \in [0, 1]\), we can further upper bound the last estimate as

$$\begin{aligned} \displaystyle \frac{1}{\varSigma _m}\displaystyle \sum _{t=0}^m\gamma _t \mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right]\le & {} \frac{8\sqrt{2}\big (L + \sqrt{\delta L}[\tilde{b}(m+1)]^{1/4}\big )}{L\eta ^2\delta (m+1)}\left[ F(x_0) - F^{\star }\right] \nonumber \\&+ \frac{8\sigma ^2}{L^2\eta ^2[\tilde{b}(m+1)]^{1/2}}. \end{aligned}$$
(64)

In addition, due to the choice of \(\overline{x}_m \sim \mathbb {U}_{\mathbf {p}}\left( \{x_t\}_{t=0}^m\right) \), we have \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] = \displaystyle \frac{1}{\varSigma _m}\displaystyle \sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] \). Combining this expression and (64), we obtain (34).

(b)  Let us choose \(\tilde{b} := \lceil c_1^2(m+1)^{1/3} \rceil \) for some constant \(c_1 > 0\). Since \(\beta = 1 - \frac{1}{[\tilde{b}(m+1)]^{1/2}}\), to guarantee \(\beta \ge 0\), we need to impose \(c_1 \ge \frac{1}{(m+1)^{2/3}}\). With this choice of \(\tilde{b}\), (34) reduces to

$$\begin{aligned} \mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] \le \frac{8}{L^2\eta ^2(m+1)^{2/3}}\left[ \frac{\sqrt{2}L\big ( L + \sqrt{c_1L\delta }\big )}{\delta }\big [F(x_0) - F^{\star }\big ] + \frac{\sigma ^2}{c_1}\right] . \end{aligned}$$

Let us denote by \(\varDelta _0 := \frac{8}{L^2\eta ^2}\left[ \frac{\sqrt{2}L( L + \sqrt{c_1L\delta })}{\delta }\big [F(x_0) - F^{\star }\big ] + \frac{\sigma ^2}{c_1}\right] \). Then, similar to the proof of Theorem 1, we can show that the number of iterations m is at most \(m := \left\lceil \frac{\varDelta _0^{3/2}}{\varepsilon ^3} \right\rceil \), and the total number \(\mathcal {T}_m\) of stochastic gradient evaluations \(\nabla {f}_{\xi }(x_t)\) is at most \(\mathcal {T}_m := \left\lceil \frac{c_1^2\varDelta _0^{1/2}}{\varepsilon } + \frac{3\varDelta _0^{3/2}}{\varepsilon ^3}\right\rceil \). \(\square \)

1.4 The proof of Theorem 3: The restarting variant

(a)  Since we use the adaptive variant of Algorithm 1 as stated in Theorem 2 for the inner loop of Algorithm 2, from (64), we can see that at each stage s, the following estimate holds

$$\begin{aligned} \displaystyle \frac{1}{\varSigma _m}\displaystyle \sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t^{(s)})\Vert ^2\right]\le & {} \frac{8\sqrt{2}\tilde{b}^{1/4}\left( L + \sqrt{L\delta }\right) }{L\eta ^2\delta (m+1)^{3/4}}\mathbb {E}\left[ F(x_0^{(s)}) - F(x_{m+1}^{(s)})\right] \nonumber \\&+ \frac{8\sigma ^2}{L^2\eta ^2[\tilde{b}(m+1)]^{1/2}}. \end{aligned}$$
(65)

Here, the superscript “\(^{(s)}\)” indicates the stage s in Algorithm 2. Summing up this inequality from \(s := 1\) to \(s := S\), multiplying the result by \(\frac{1}{S}\), and using \(\mathbb {E}\left[ F(x_{m+1}^{(S)})\right] \ge F^{\star } > -\infty \) and \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right] = \dfrac{1}{S\varSigma _m}\displaystyle \sum _{s=1}^S\sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t^{(s)})\Vert ^2\right] \), we get (35), i.e.:

$$\begin{aligned} \mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right]= & {} \dfrac{1}{S\varSigma _m}\displaystyle \sum _{s=1}^S\sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t^{(s)})\Vert ^2\right] \nonumber \\\le & {} \dfrac{8\sqrt{2}\tilde{b}^{1/4}\big (L + \sqrt{L\delta }\big )}{L\delta \eta ^2S (m+1)^{3/4}} \big [F(\overline{x}^{(0)}) - F^{\star }\big ] + \dfrac{8\sigma ^2}{L^2\eta ^2[\tilde{b}(m+1)]^{1/2}}.\qquad \end{aligned}$$
(66)

(b) Let \(\varDelta _F := F(\overline{x}^{(0)}) - F^{\star } > 0\) and choose \(\tilde{b} := c_1^2(m + 1)\) for some constant \(c_1 > 0\). Since \(\beta = 1 - \frac{1}{[\tilde{b}(m+1)]^{1/2}} \in (0, 1)\), we need to choose \(c_1\) such that \(c_1 \ge \frac{1}{m+1}\).

Now, for any tolerance \(\varepsilon > 0\), to guarantee \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right] \le \varepsilon ^2\), from (66), we require

$$\begin{aligned}&\frac{8\sqrt{2}\tilde{b}^{1/4}\big (L + \sqrt{L\delta }\big )\varDelta _F}{L\delta \eta ^2S (m+1)^{3/4}} + \frac{8\sigma ^2}{L^2\eta ^2[\tilde{b}(m+1)]^{1/2}}\\&\quad = \frac{8\sqrt{2c_1} \big (L + \sqrt{L\delta }\big )\varDelta _F}{L\delta \eta ^2S (m+1)^{1/2}} + \frac{8\sigma ^2}{L^2\eta ^2c_1(m+1)} \le \varepsilon ^2. \end{aligned}$$

Let us break this inequality into two equal parts as

$$\begin{aligned} \frac{8\sqrt{2c_1}\big (L + \sqrt{L\delta }\big )\varDelta _F}{L\delta \eta ^2S (m+1)^{1/2}} = \frac{\varepsilon ^2}{2} ~~~\text {and}~~~ \frac{8\sigma ^2}{L^2\eta ^2c_1(m+1)} \le \frac{\varepsilon ^2}{2}. \end{aligned}$$

Then, we have

$$\begin{aligned} S = \frac{16\sqrt{2c_1}\big (L + \sqrt{L\delta }\big )\varDelta _F}{L\delta \eta ^2(m+1)^{1/2}\varepsilon ^2} ~~~\text {and}~~~ m+1 \ge \frac{16\sigma ^2}{L^2\eta ^2c_1\varepsilon ^2}. \end{aligned}$$

Let us choose \(m+1 = \left\lceil \frac{16}{L^2\eta ^2c_1}\cdot \frac{\max \left\{ 1,\sigma ^2\right\} }{\varepsilon ^2} \right\rceil \). Then, \(m + 1 \ge \frac{16}{L^2\eta ^2c_1\varepsilon ^2}\), and we can set

$$\begin{aligned} S := \left\lceil \frac{16\sqrt{2c_1}\big (L + \sqrt{L\delta }\big )\varDelta _F}{L\delta \eta ^2\varepsilon ^2} \cdot \frac{L\eta \sqrt{c_1}\varepsilon }{4} \right\rceil = \left\lceil \frac{4\sqrt{2}c_1\big (L + \sqrt{L\delta }\big )\varDelta _F}{\delta \eta \varepsilon } \right\rceil . \end{aligned}$$

This leads to (36). Moreover, we can also show that

$$\begin{aligned} (m+1)S = \left\lceil \frac{16\sqrt{2c_1}(L + \sqrt{L\delta })\varDelta _F}{L\delta \eta ^2\varepsilon ^2}\sqrt{m+1} \right\rceil = \left\lceil \frac{64\sqrt{2}(L + \sqrt{L\delta })\varDelta _F}{L^2\eta ^3\delta } \cdot \frac{\max \left\{ 1,\sigma \right\} }{\varepsilon ^3} \right\rceil . \end{aligned}$$

Consequently, the total number of stochastic gradient evaluations \(\nabla {f}_{\xi }(x_t)\) is at most

$$\begin{aligned} \mathcal {T}_{\nabla {f}} :=&\big [ \tilde{b} + 3(m+1) \big ]S = \left\lceil (c_1^2 + 3)(m+1)S \right\rceil = \left\lceil 64\sqrt{2}(c_1^2 + 3)\frac{(L + \sqrt{L\delta })\varDelta _F}{L^2\eta ^3\delta } \cdot \frac{\max \left\{ 1,\sigma \right\} }{\varepsilon ^3} \right\rceil \\ =&\mathcal {O}\left( \max \left\{ \sigma , 1\right\} \cdot \frac{\varDelta _F}{\varepsilon ^3}\right) . \end{aligned}$$

Since we choose \(\tilde{b} := \left\lceil \frac{16c_1}{L^2\eta ^2}\cdot \frac{\max \left\{ 1,\sigma ^2\right\} }{\varepsilon ^2} \right\rceil \), the final complexity is \(\mathcal {O}\left( \frac{\max \left\{ 1,\sigma ^2\right\} }{\varepsilon ^2} + \frac{\max \left\{ 1,\sigma \right\} }{\varepsilon ^3}\right) \), where other constants independent of \(\sigma \) and \(\varepsilon \) are hidden. The total number of proximal operators \(\mathrm {prox}_{\eta \psi }\) is at most

$$\begin{aligned} \mathcal {T}_{\mathrm {prox}} := S(m+1) = \left\lceil \frac{64\sqrt{2}(L + \sqrt{L\delta })\varDelta _F}{L^2\eta ^3\delta } \cdot \frac{\max \left\{ 1,\sigma \right\} }{\varepsilon ^3} \right\rceil = \mathcal {O}\left( \max \left\{ \sigma , 1\right\} \cdot \frac{\varDelta _F}{\varepsilon ^3}\right) . \end{aligned}$$

The estimate (37) follows from the bound of \(\mathcal {T}_{\nabla {f}}\) above and the choice of \(\tilde{b}\). \(\square \)
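
The parameter choices derived in this proof translate directly into a small routine. The Python sketch below evaluates the epoch length \(m+1\), the snapshot batch size \(\tilde{b} = c_1^2(m+1)\), the number of stages \(S\), and the resulting stochastic gradient count \([\tilde{b} + 3(m+1)]S\) from a target accuracy \(\varepsilon \); the numerical inputs are illustrative only.

```python
import math

def restart_parameters(L, eta, c1, sigma, Delta_F, eps):
    """Parameter choices from the proof of Theorem 3 (restarting variant)."""
    delta = 2.0 / eta - 2.0 * L                      # delta = 2/eta - 2L, requires eta in (0, 1/L)
    m_plus_1 = math.ceil(16.0 * max(1.0, sigma ** 2) / (L ** 2 * eta ** 2 * c1 * eps ** 2))
    tilde_b = math.ceil(c1 ** 2 * m_plus_1)          # tilde_b = c1^2*(m+1)
    S = math.ceil(4.0 * math.sqrt(2.0) * c1 * (L + math.sqrt(L * delta)) * Delta_F / (delta * eta * eps))
    total_grads = (tilde_b + 3 * m_plus_1) * S       # [tilde_b + 3(m+1)] * S stochastic gradient evaluations
    return m_plus_1, tilde_b, S, total_grads

# Illustrative inputs (not from the paper): L, eta, c1, sigma, Delta_F, and eps.
print(restart_parameters(L=1.0, eta=0.5, c1=1.0, sigma=1.0, Delta_F=10.0, eps=1e-2))
```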

The proof of technical results in Sect. 5: The mini-batch case

This appendix presents the full proof of the results in Section 5 for the mini-batch case.

1.1 The proof of Theorem 4: The single-loop variant

Using (19) from Lemma 3 with \(G := \nabla {f}\), taking full expectation, and using a constant weight \(\beta _t := \beta \in (0, 1)\) and \(b_t := b \in \mathbb {N}_{+}\), we have

$$\begin{aligned} \mathbb {E}\left[ \Vert \hat{v}_{t+1} - \nabla {f}(x_{t+1})\Vert ^2\right]\le & {} \beta ^2\mathbb {E}\left[ \Vert \hat{v}_{t} - \nabla {f}(x_{t})\Vert ^2\right] + \rho \beta ^2\mathbb {E}\left[ \Vert \nabla {f}_{\xi }(x_{t+1}) - \nabla {f}_{\xi }(x_{t})\Vert ^2\right] \\&+ {~} (1-\beta )^2\mathbb {E}\left[ \Vert u_{t+1} - \nabla {f}(x_{t+1})\Vert ^2\right] , \end{aligned}$$

where \(\rho := \frac{1}{b}\) since we solve (1).

Since \(\mathbb {E}\left[ \Vert \nabla {f}_{\xi }(x_{t+1}) - \nabla {f}_{\xi }(x_t)\Vert ^2\right] \le L^2\mathbb {E}\left[ \Vert x_{t+1} - x_t\Vert ^2\right] \le L^2\gamma _{t}^2\mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] \) by Assumption 2 and \(\mathbb {E}\left[ \Vert u_{t+1} - \nabla {f}(x_{t+1})\Vert ^2\right] \le \frac{\sigma ^2}{\hat{b}}\) by Assumption 3 and [59, Lemma 2], the last estimate leads to

$$\begin{aligned} \mathbb {E}\left[ \Vert \hat{v}_{t+1} - \nabla {f}(x_{t+1})\Vert ^2\right]\le & {} \beta ^2\mathbb {E}\left[ \Vert \hat{v}_t - \nabla {f}(x_{t})\Vert ^2\right] + \frac{\beta ^2\gamma _t^2L^2}{b}\mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] \nonumber \\&+ \frac{(1-\beta )^2\sigma ^2}{\hat{b}}. \end{aligned}$$
(67)

Next, let us choose \(\eta _t := \eta > 0\), \(\gamma _t := \gamma > 0\), \(c_t := L\), \(r_t := 1\), and \(q_t := \frac{L\gamma }{2} > 0\) in Lemma 2. Then, we have \(\theta _t = \theta = \frac{(1 + L^2\eta ^2)\gamma }{L} > 0\) and \(\kappa _t = \kappa = \left( \frac{2}{\eta } - L\gamma - 2L\right) \gamma > 0\). Using these values into (21), we obtain

$$\begin{aligned} \mathbb {E}\left[ F(x_{t+1})\right] \le \mathbb {E}\left[ F(x_t)\right] - \dfrac{\gamma \eta ^2L}{4}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] + \dfrac{\theta }{2} \mathbb {E}\left[ \Vert \nabla {f}(x_t) - \hat{v}_t\Vert ^2\right] - \dfrac{\kappa }{2}\mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t \Vert ^2\right] . \end{aligned}$$

Multiplying (67) by \(\frac{\alpha }{2}\) for some \(\alpha > 0\), and adding the result to the above estimate, we obtain

$$\begin{aligned}&\mathbb {E}\left[ F(x_{t+1})\right] + \dfrac{\alpha }{2}\mathbb {E}\left[ \Vert \hat{v}_{t+1} - \nabla {f}(x_{t+1})\Vert ^2\right] \\&\quad \le \mathbb {E}\left[ F(x_t)\right] + \dfrac{(\alpha \beta ^2 + \theta )}{2}\mathbb {E}\left[ \Vert \hat{v}_{t} - \nabla {f}(x_{t})\Vert ^2\right] \\&\qquad - {~} \dfrac{\gamma \eta ^2L}{4}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] + \dfrac{\alpha (1-\beta )^2\sigma ^2}{2\hat{b}}\\&\qquad - {~} \dfrac{1}{2}\left( \kappa - \frac{\alpha \beta ^2\gamma ^2L^2}{b}\right) \mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] . \end{aligned}$$

Using the Lyapunov function V defined by (23), the last estimate leads to

$$\begin{aligned} V(x_{t+1})\le & {} V(x_t) - \dfrac{\gamma \eta ^2L}{4}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] + \dfrac{\alpha (1-\beta )^2\sigma ^2}{2\hat{b}}\\&- {~} \dfrac{1}{2}\left( \kappa - \dfrac{\alpha \beta ^2\gamma ^2L^2}{b}\right) \mathbb {E}\left[ \Vert \widehat{x}_{t+1} - x_t\Vert ^2\right] \\&- \dfrac{1}{2}\left[ \alpha (1-\beta ^2) - \theta \right] \mathbb {E}\left[ \Vert \hat{v}_{t} - \nabla {f}(x_{t})\Vert ^2\right] . \end{aligned}$$

If we impose the following conditions

$$\begin{aligned} \kappa {=} \left( \frac{2}{\eta } {-} L\gamma {-} 2L\right) \gamma {\ge } \frac{\alpha \beta ^2\gamma ^2L^2}{b} ~~\text {and}~~\theta {=} \frac{(1 + L^2\eta ^2)}{L}\gamma \le \alpha (1-\beta ^2), \end{aligned}$$
(68)

then we get from the last inequality that

$$\begin{aligned} V(x_{t+1}) \le V(x_t) - \frac{\gamma \eta ^2L}{4}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] + \frac{\alpha (1-\beta )^2\sigma ^2}{2\hat{b}}. \end{aligned}$$
(69)

The conditions (68) can be simplified as

$$\begin{aligned} \frac{2}{\eta } - 2L - L\gamma \ge \frac{\alpha \gamma \beta ^2L^2}{b}~~~~~~\text {and}~~~~~~\frac{(1 + L^2\eta ^2)}{L}\gamma \le \alpha (1 - \beta ^2). \end{aligned}$$
(70)

Moreover, by induction (i.e., summing (69) over \(t = 0, \dots , m\)), and using \(V(x_{m+1}) \ge F^{\star }\) together with \(V(x_0) := F(x_0) + \frac{\alpha }{2}\mathbb {E}\left[ \Vert \hat{v}_0 - \nabla {f}(x_0)\Vert ^2\right] \le F(x_0) + \frac{\alpha \sigma ^2}{2\tilde{b}}\), we can further derive from (69) that

$$\begin{aligned} \frac{1}{m+1}\sum _{t=0}^m\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right]\le & {} \frac{4}{L\eta ^2\gamma (m+1)}\left[ F(x_0) - F^{\star }\right] \nonumber \\&+ \frac{2\alpha \sigma ^2}{L\eta ^2\gamma }\left[ \frac{1}{\tilde{b}(m+1)} + \frac{(1-\beta )^2}{\hat{b}}\right] . \end{aligned}$$
(71)
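In more detail, summing (69) from \(t := 0\) to \(t := m\) and telescoping the left-hand side yields

$$\begin{aligned} \frac{\gamma \eta ^2 L}{4}\sum _{t=0}^m\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] \le V(x_0) - V(x_{m+1}) + \frac{(m+1)\alpha (1-\beta )^2\sigma ^2}{2\hat{b}}, \end{aligned}$$

and (71) follows by bounding \(V(x_0) - V(x_{m+1}) \le F(x_0) - F^{\star } + \frac{\alpha \sigma ^2}{2\tilde{b}}\) and dividing both sides by \(\frac{\gamma \eta ^2L(m+1)}{4}\).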

Next, we choose \(\beta \) to minimize the last term on the right-hand side of (71) by balancing the two terms inside the bracket, i.e., setting \(\frac{(1-\beta )^2}{\hat{b}} = \frac{1}{\tilde{b}(m+1)}\), which gives \(\beta := 1 - \frac{\hat{b}^{1/2}}{[\tilde{b}(m+1)]^{1/2}}\). Clearly, with this choice of \(\beta \), if \(1 \le \hat{b} \le \tilde{b}(m+1)\), then \(\beta \in [0, 1)\).
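With this choice of \(\beta \), the bracket in (71) equals \(\frac{2}{\tilde{b}(m+1)}\), so that (71) simplifies to

$$\begin{aligned} \frac{1}{m+1}\sum _{t=0}^m\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] \le \frac{4}{L\eta ^2\gamma (m+1)}\left[ F(x_0) - F^{\star }\right] + \frac{4\alpha \sigma ^2}{L\eta ^2\gamma \tilde{b}(m+1)}. \end{aligned}$$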

(a) Next, we update \(\eta := \frac{2}{L(3 + \gamma )}\). Then, since \(\gamma \in [0, 1]\) we have \(\frac{1}{2L} \le \eta \le \frac{2}{3L}\). Moreover, we have \(\frac{2}{\eta } - 2L - L\gamma = L\) and \(\frac{1 + L^2\eta ^2}{L} \le \frac{13}{9L}\). In addition, since \(\beta \in [0, 1)\) we have \(1-\beta ^2 \ge 1 - \beta = \frac{\hat{b}^{1/2}}{[\tilde{b}(m+1)]^{1/2}}\). Consequently, the second condition of (70) holds if we choose \(\gamma \) as

$$\begin{aligned} 0 < \gamma \le \bar{\gamma } := \frac{9L\alpha \hat{b}^{1/2}}{13\tilde{b}^{1/2}(m+1)^{1/2}}. \end{aligned}$$

Since \(\beta \in [0, 1]\), the first condition of (70) holds if we choose \(0 < \gamma \le \frac{b}{L\alpha }\). Equating the two upper bounds on \(\gamma \), i.e., \(\frac{b}{L\alpha } = \frac{9L\alpha \hat{b}^{1/2}}{13\tilde{b}^{1/2}(m+1)^{1/2}}\), leads to \(\alpha := \frac{\sqrt{13}\tilde{b}^{1/4}b^{1/2}(m+1)^{1/4}}{3L\hat{b}^{1/4}}\). Therefore, we can update \(\gamma \) as

$$\begin{aligned} \gamma := \frac{3c_0\hat{b}^{1/4}b^{1/2}}{\sqrt{13}[\tilde{b}(m+1)]^{1/4}}, \end{aligned}$$

for some \(c_0 > 0\). Since \(1 \le \hat{b} \le \tilde{b}(m+1)\), we have \(\gamma \le \frac{3c_0b^{1/2}}{\sqrt{13}}\). If we choose \(0 < c_0 \le \frac{\sqrt{13}}{3b^{1/2}}\), then \(\gamma \in (0, 1]\). Consequently, we obtain (41).
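For completeness, the value of \(\alpha \) above is obtained by solving the balancing equation \(\frac{b}{L\alpha } = \frac{9L\alpha \hat{b}^{1/2}}{13\tilde{b}^{1/2}(m+1)^{1/2}}\), which gives

$$\begin{aligned} \alpha ^2 = \frac{13\, b\, \tilde{b}^{1/2}(m+1)^{1/2}}{9L^2\hat{b}^{1/2}} \quad \Longrightarrow \quad \alpha = \frac{\sqrt{13}\,\tilde{b}^{1/4}b^{1/2}(m+1)^{1/4}}{3L\hat{b}^{1/4}}. \end{aligned}$$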

(b) Now, we note that the choice of \(\alpha \) and \(\gamma \) also implies that

$$\begin{aligned} \frac{\alpha }{\gamma } = \frac{13\tilde{b}^{1/2}(m+1)^{1/2}}{9L\hat{b}^{1/2}}~~~~~\text {and}~~~~~\frac{1}{\gamma } = \frac{\sqrt{13}\tilde{b}^{1/4}(m+1)^{1/4}}{3c_0\hat{b}^{1/4}b^{1/2}}. \end{aligned}$$

In addition, since \(\overline{x}_m\sim \mathbb {U}\left( \left\{ x_t\right\} _{t=0}^m\right) \), we have \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] = \frac{1}{m+1}\sum _{t=0}^m\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t)\Vert ^2\right] \). Substituting these expressions, together with \(L^2\eta ^2 \ge \frac{1}{4}\), into (71), we finally get

$$\begin{aligned} \mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] \le \frac{16\sqrt{13}L \tilde{b}^{1/4}}{3c_0\hat{b}^{1/4}b^{1/2}(m+1)^{3/4}} \left[ F(x_0) - F^{\star }\right] + \frac{208\sigma ^2}{9\hat{b}^{1/2}\tilde{b}^{1/2}(m+1)^{1/2}}, \end{aligned}$$

which proves (42).

Let us choose \(b = \hat{b} \in \mathbb {N}_{+}\) and \(\tilde{b} := \lceil c_1^2[b(m+1)]^{1/3} \rceil \) for some \(c_1 > 0\). Then (42) reduces to

$$\begin{aligned} \mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] \le \frac{16}{3[b(m+1)]^{2/3}}\left[ \frac{\sqrt{13 c_1}L}{c_0} \left[ F(x_0) - F^{\star }\right] + \frac{13\sigma ^2}{3c_1}\right] . \end{aligned}$$

Denote \(\varDelta _0 := \frac{16}{3}\left[ \frac{\sqrt{13 c_1}L}{c_0} \left[ F(x_0) - F^{\star }\right] + \frac{13\sigma ^2}{3c_1}\right] \). For any tolerance \(\varepsilon > 0\), to guarantee \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_m)\Vert ^2\right] \le \varepsilon ^2\), we need to impose \(\frac{\varDelta _0}{[b(m+1)]^{2/3}} \le \varepsilon ^2\). This implies \(b(m+1) \ge \frac{\varDelta _0^{3/2}}{\varepsilon ^3}\), which also leads to \(m+1 \ge \frac{\varDelta _0^{3/2}}{b\varepsilon ^3}\). Therefore, it suffices to take \(m := \left\lceil \frac{\varDelta _0^{3/2}}{b\varepsilon ^3}\right\rceil \) iterations, which is also the number of proximal operations \(\mathrm {prox}_{\eta \psi }\).

The number of stochastic gradient evaluations \(\nabla {f_{\xi }}(x_t)\) is at most \(\mathcal {T}_m := \tilde{b} + 3(m+1)b = \left\lceil \frac{c_1^2\varDelta _0^{1/2}}{\varepsilon } + \frac{3\varDelta _0^{3/2}}{\varepsilon ^3} \right\rceil \). Finally, since \(1 \le b = \hat{b} \le \tilde{b}(m+1) = c_1^2b^{1/3}(m+1)^{4/3}\), we have \(b \le c_1^3(m+1)^2\), which is equivalent to \(c_1\ge \frac{b^{1/3}}{(m+1)^{2/3}}\). In addition, since \(\tilde{b} := \lceil c_1^2[b(m+1)]^{1/3} \rceil \) and \(b=\hat{b}\), we have \(\gamma := \frac{3c_0b^{2/3}}{\sqrt{13c_1}(m+1)^{1/3}}\). \(\square \)
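To make the parameter choices above concrete, the following Python snippet (our own illustrative helper; the function and variable names are ours and not part of the paper or its code) computes \(m\), \(\tilde{b}\), \(\beta \), \(\gamma \), \(\eta \), and the gradient-evaluation count derived in this proof for given \(L\), \(\sigma \), \(\varepsilon \), \(b\), and an upper bound \(\varDelta _F\) on \(F(x_0) - F^{\star }\).

```python
import math

def theorem4_parameters(L, sigma, eps, b, Delta_F, c0=None, c1=1.0):
    """Illustrative helper (our own code and names): compute the parameter
    choices derived in the proof above for the single-loop mini-batch
    variant, given the smoothness constant L, the variance bound sigma,
    a target accuracy eps, the batch size b (= b_hat), and an upper bound
    Delta_F on F(x_0) - F^star."""
    if c0 is None:
        # Largest c0 allowed by the proof; it guarantees gamma <= 1.
        c0 = math.sqrt(13.0) / (3.0 * math.sqrt(b))
    # Delta_0 as defined right after the bound on E[||G_eta(x_bar_m)||^2].
    Delta0 = (16.0 / 3.0) * (math.sqrt(13.0 * c1) * L * Delta_F / c0
                             + 13.0 * sigma**2 / (3.0 * c1))
    # Number of iterations m := ceil(Delta_0^{3/2} / (b * eps^3)).
    m = math.ceil(Delta0**1.5 / (b * eps**3))
    # Batch sizes: b_hat = b and b_tilde = ceil(c1^2 [b(m+1)]^{1/3}).
    b_tilde = math.ceil(c1**2 * (b * (m + 1))**(1.0 / 3.0))
    # Momentum weight and step sizes.
    beta = 1.0 - math.sqrt(b / (b_tilde * (m + 1)))
    gamma = 3.0 * c0 * b**(2.0 / 3.0) / (math.sqrt(13.0 * c1) * (m + 1)**(1.0 / 3.0))
    eta = 2.0 / (L * (3.0 + gamma))
    # Stochastic gradient evaluations: b_tilde + 3(m+1)b.
    n_grad = b_tilde + 3 * (m + 1) * b
    return {"m": m, "b_tilde": b_tilde, "beta": beta,
            "gamma": gamma, "eta": eta, "n_grad": n_grad}

# Example: L = 1, sigma = 1, eps = 0.1, b = 10, Delta_F = 1.
print(theorem4_parameters(L=1.0, sigma=1.0, eps=0.1, b=10, Delta_F=1.0))
```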

1.2 The proof of Theorem 5: The restarting mini-batch variant

(a) Similar to the proof of Theorem 2, summing up (21) from \(t := 0\) to \(t := m\) and using (20) with \(\rho := \frac{1}{b}\) and \(\hat{\rho } := \frac{1}{\hat{b}}\) from Lemma 4, we obtain

$$\begin{aligned} \mathbb {E}\left[ F(x_{m+1}^{(s)})\right]\le & {} \mathbb {E}\left[ F(x_0^{(s)})\right] +\dfrac{L^2}{2b}\displaystyle \sum _{t=0}^m \theta _t\sum _{i=0}^{t-1}\gamma _i^2\omega _{i,t}\mathbb {E}\left[ \Vert \hat{x}_{i+1}^{(s)} - x_{i}^{(s)}\Vert ^2\right] \nonumber \\&- {~} \dfrac{1}{2}\displaystyle \sum _{t=0}^m\kappa _t \mathbb {E}\left[ \Vert \widehat{x}_{t+1}^{(s)} - x_t^{(s)}\Vert ^2\right] - \displaystyle \sum _{t=0}^m\dfrac{\gamma _t\eta ^2}{2}\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t^{(s)})\Vert ^2\right] \nonumber \\&+{~} \dfrac{1}{2}\displaystyle \sum _{t=0}^m\theta _t \omega _t\mathbb {E}\left[ \Vert \hat{v}_0^{(s)} - \nabla f(x_0^{(s)})\Vert ^2\right] + \dfrac{1}{2\hat{b}}\displaystyle \sum _{t=0}^m \theta _t S_t, \end{aligned}$$
(72)

where \(\gamma _t\), \(\eta \), \(\kappa _t\), \(\theta _t\), \(\omega _{i,t}\), \(\omega _t\), and \(S_t\) are defined in Lemma 2.

Let us fix \(c_t := L\), \(r_t := 1\), \(q_t := \frac{L\gamma _t}{2}\), and \(\beta _t := \beta \in [0, 1]\). Then \(\theta _t = \frac{(1 + L^2\eta ^2)}{L}\gamma _t\) and \(\kappa _t = \gamma _t\left( \frac{2}{\eta } - 2L - L\gamma _t\right) \) as before. Moreover, \(\omega _t = \beta ^{2t}\), \(\omega _{i,t} = \beta ^{2(t-i)}\), and \(s_t = (1-\beta )^2\Big [\frac{1-\beta ^{2t}}{1-\beta ^2}\Big ] < \frac{1-\beta }{1+\beta }\) due to Lemma 2, and \(\mathbb {E}\left[ \Vert \hat{v}_0^{(s)} - \nabla f(x_0^{(s)})\Vert ^2\right] \le \frac{\sigma ^2}{\tilde{b}}\).
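The bound on \(s_t\) stated above follows directly from \(1 - \beta ^2 = (1-\beta )(1+\beta )\) and \(1 - \beta ^{2t} \le 1\):

$$\begin{aligned} (1-\beta )^2\Big [\frac{1-\beta ^{2t}}{1-\beta ^2}\Big ] = \frac{(1-\beta )(1-\beta ^{2t})}{1+\beta } \le \frac{1-\beta }{1+\beta }. \end{aligned}$$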

Using this configuration, noting that \(\overline{x}^{(s)} = x_{m+1}^{(s)}\) and \(\overline{x}^{(s-1)} = x^{(s)}_0\), and following the same argument as in (64), we can reduce (72) to

$$\begin{aligned} \mathbb {E}\left[ F(\overline{x}^{(s)})\right]\le & {} \mathbb {E}\left[ F(\overline{x}^{(s-1)})\right] - \frac{L\eta ^2}{4}\displaystyle \sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t^{(s)})\Vert ^2\right] \nonumber \\&+ ~\frac{(1 + L^2\eta ^2)\sigma ^2}{2L(1+\beta )} \left[ \frac{1}{\tilde{b}(m+1)(1-\beta )} + \frac{(1-\beta )}{\hat{b}}\right] \varSigma _m + \frac{\widehat{\mathcal {T}}_m}{2},\qquad \end{aligned}$$
(73)

where \(\widehat{\mathcal {T}}_m\) is defined as follows:

$$\begin{aligned} \widehat{\mathcal {T}}_m:= & {} \frac{L(1 + L^2\eta ^2)}{b}\sum _{t=0}^m\gamma _t\sum _{i=0}^{t-1}\beta ^{2(t-i)}\gamma _i^2\mathbb {E}\left[ \Vert \widehat{x}_{i+1}^{(s)} - x_i^{(s)}\Vert ^2\right] \nonumber \\&- {~} \sum _{t=0}^m\gamma _t\left( \frac{2}{\eta } - 2L - L\gamma _t\right) \mathbb {E}\left[ \Vert \widehat{x}_{t+1}^{(s)} - x_t^{(s)}\Vert ^2\right] . \end{aligned}$$
(74)

Similar to the proof of (33), if we choose \(\eta \in (0, \frac{1}{L})\), set \(\delta := \frac{2}{\eta } - 2L > 0\), and update \(\gamma \) as in (43):

$$\begin{aligned} \gamma _m := \frac{\delta }{L}~~~~\text {and}~~~~\gamma _t := \frac{\delta b}{Lb + L(1 + L^2\eta ^2)\big [\beta ^2\gamma _{t+1} + \beta ^4\gamma _{t+2} + \cdots + \beta ^{2(m-t)}\gamma _m\big ]}, \end{aligned}$$

then \(\widehat{\mathcal {T}}_m \le 0\). Moreover, since \(\beta \in [0, 1]\) and \(1+L^2\eta ^2 \le 2\), (73) can be simplified as

$$\begin{aligned} \mathbb {E}\left[ F(\overline{x}^{(s)})\right]\le & {} \mathbb {E}\left[ F(\overline{x}^{(s-1)})\right] - \frac{L\eta ^2}{4}\displaystyle \sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t^{(s)})\Vert ^2\right] \\&+ \frac{\sigma ^2}{L}\left[ \frac{1}{\tilde{b}(m+1)(1-\beta )} + \frac{(1-\beta )}{\hat{b}}\right] \varSigma _m. \end{aligned}$$

Summing up this inequality from \(s := 1\) to \(s := S\) and noting that \(F(\overline{x}^{(S)}) \ge F^{\star }\), we obtain

$$\begin{aligned} \frac{1}{S\varSigma _m}\displaystyle \sum _{s=1}^S\sum _{t=0}^m\gamma _t\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(x_t^{(s)})\Vert ^2\right]\le & {} \frac{4\big [F(\overline{x}^{(0)}) - F^{\star }\big ] }{L\eta ^2S\varSigma _m}\nonumber \\&+ \frac{4\sigma ^2}{L^2\eta ^2}\left[ \frac{1}{\tilde{b}(m+1)(1-\beta )} + \frac{(1-\beta )}{\hat{b}}\right] .\qquad \end{aligned}$$
(75)

Let us first choose \(\beta := 1 - \frac{\hat{b}^{1/2}}{\tilde{b}^{1/2}(m+1)^{1/2}}\). Then, \(1 - \beta ^2 \le \frac{2\hat{b}^{1/2}}{\tilde{b}^{1/2}(m+1)^{1/2}}\) and \(\frac{(1+L^2\eta ^2)\beta ^2}{L} \le \frac{2}{L}\). Using these inequalities, similar to the proof of (62), we can upper bound

$$\begin{aligned}&L\left[ \sqrt{1-\beta ^2} + \sqrt{1 - \beta ^2 + \frac{4(1+L^2\eta ^2)\beta ^2\delta }{Lb}}\right] \\&\quad \le 2\sqrt{2}\left[ \frac{L\hat{b}^{1/4}b^{1/2} + [\tilde{b}(m+1)]^{1/4} \sqrt{L\delta }}{b^{1/2}[\tilde{b}(m+1)]^{1/4}}\right] . \end{aligned}$$

Using this bound, the update rule (43) of \(\gamma _t\), and \(\sqrt{1-\beta ^2} \ge \frac{\hat{b}^{1/4}}{[\tilde{b}(m+1)]^{1/4}}\), we apply Lemma 7 with \(\omega := \beta ^2\) and \(\epsilon := \frac{(1 + L^2\eta ^2)}{Lb}\) to obtain

$$\begin{aligned} \varSigma _m :=&\sum _{t=0}^m\gamma _t \ge \frac{\delta (m+1)\sqrt{1-\beta ^2}}{L\Big [\sqrt{1-\beta ^2} + \sqrt{1 - \beta ^2 + \frac{4(1+L^2\eta ^2)\beta ^2\delta }{Lb}}\Big ]}\\ \ge&\frac{\delta (m+1)\hat{b}^{1/4}b^{1/2}}{2\sqrt{2}\left[ L\hat{b}^{1/4}b^{1/2} + [\tilde{b}(m+1)]^{1/4} \sqrt{L\delta }\right] }. \end{aligned}$$

Substituting this bound into (75) and noting that \(\overline{x}_T \sim \mathbb {U}_{\mathbf {p}}\left( \{x^{(s)}_t\}_{t=0\rightarrow m}^{s=1\rightarrow S}\right) \), we can upper bound \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right] \) as

$$\begin{aligned} \mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right]\le & {} \frac{8\sqrt{2}\left[ L\hat{b}^{1/4}b^{1/2} + [\tilde{b}(m+1)]^{1/4} \sqrt{L\delta }\right] }{L\eta ^2\delta S(m+1)\hat{b}^{1/4}b^{1/2}}\big [F(\overline{x}^{(0)}) - F^{\star }\big ] \\&+ \frac{8\sigma ^2}{L^2\eta ^2[\hat{b}\tilde{b}(m+1)]^{1/2}}, \end{aligned}$$

which is exactly (44).

(b) Now, let us choose \(\hat{b} = b \in \mathbb {N}_{+}\) and assume that \(\tilde{b} := \lceil c_1^2b(m+1) \rceil \) for some \(c_1 > 0\). In this case, the right-hand side of (44) can be upper bounded as

$$\begin{aligned} \mathcal {R}_T:= & {} \frac{8\sqrt{2}\left[ Lb^{3/4} {~}+{~} [\tilde{b}(m+1)]^{1/4} \sqrt{L\delta }\right] }{L\eta ^2\delta S(m+1)b^{3/4}}\big [F(\overline{x}^{(0)}) - F^{\star }\big ] + \frac{8\sigma ^2}{L^2\eta ^2 c_1b(m+1)} \\= & {} \frac{8\sqrt{2}\varDelta _F}{\eta ^2\delta S(m+1)} + \frac{8\sqrt{2 c_1}\varDelta _F}{\sqrt{L\delta }\eta ^2 S(m+1)^{1/2}b^{1/2}} + \frac{8\sigma ^2}{L^2\eta ^2 c_1b(m+1)}, \end{aligned}$$

where \(\varDelta _F := F(\overline{x}^{(0)}) - F^{\star } > 0\).

For any \(\varepsilon > 0\), to guarantee \(\mathbb {E}\left[ \Vert \mathcal {G}_{\eta }(\overline{x}_T)\Vert ^2\right] \le \varepsilon ^2\), we impose \(\mathcal {R}_T \le \varepsilon ^2\). Using the upper bound on \(\mathcal {R}_T\), we can split this requirement into the following three sufficient conditions:

$$\begin{aligned}&\frac{8\sqrt{2}\varDelta _F}{\eta ^2\delta S(m+1)} \le \frac{\varepsilon ^2}{3},~~~~~\frac{8\sqrt{2 c_1}\varDelta _F}{\sqrt{L\delta }\eta ^2S(m+1)^{1/2}b^{1/2}} = \frac{\varepsilon ^2}{3},\nonumber \\&\text {and}~~~~ \frac{8\sigma ^2}{L^2\eta ^2 c_1b(m+1)} \le \frac{\varepsilon ^2}{3}. \end{aligned}$$
(76)

Let us choose \(m + 1 := \big \lceil \frac{24}{c_1L^2\eta ^2 b\varepsilon ^2}\max \left\{ \sigma ^2,1\right\} \big \rceil \). Then, \(\tilde{b} = \big \lceil \frac{24c_1}{L^2\eta ^2 \varepsilon ^2}\max \left\{ \sigma ^2,1\right\} \big \rceil \). Moreover, the last condition of (76) holds and \(m+1 \ge \frac{24}{c_1L^2\eta ^2 b\varepsilon ^2}\). Hence, the second condition of (76) leads to

$$\begin{aligned} S = \frac{24\sqrt{2 c_1}\varDelta _F}{\sqrt{L\delta }\eta ^2 \varepsilon ^2}\frac{1}{\sqrt{b(m+1)}} \le \frac{4\sqrt{3L}c_1 \varDelta _F}{\sqrt{\delta }\eta \varepsilon }. \end{aligned}$$
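The last inequality in this bound on \(S\) uses \(b(m+1) \ge \frac{24}{c_1L^2\eta ^2\varepsilon ^2}\), so that

$$\begin{aligned} \frac{1}{\sqrt{b(m+1)}} \le \frac{\sqrt{c_1}L\eta \varepsilon }{2\sqrt{6}} \quad \Longrightarrow \quad S \le \frac{24\sqrt{2 c_1}\varDelta _F}{\sqrt{L\delta }\eta ^2 \varepsilon ^2}\cdot \frac{\sqrt{c_1}L\eta \varepsilon }{2\sqrt{6}} = \frac{4\sqrt{3L}c_1 \varDelta _F}{\sqrt{\delta }\eta \varepsilon }. \end{aligned}$$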

From the second condition of (76), we also have

$$\begin{aligned} (m+1)bS = \frac{24\sqrt{2c_1}\varDelta _F}{\eta ^2\sqrt{L\delta } \varepsilon ^2}(m+1)^{1/2}b^{1/2} = \frac{96\sqrt{3}\varDelta _F}{\eta ^3L\sqrt{L\delta } \varepsilon ^3}\max \left\{ 1,\sigma \right\} . \end{aligned}$$

From this expression, to guarantee the first condition of (76), we need to impose

$$\begin{aligned} (m+1)S = \frac{96\sqrt{3}\varDelta _F}{\eta ^3L\sqrt{L\delta } b\varepsilon ^3}\max \left\{ 1,\sigma \right\} \ge \frac{24\sqrt{2}\varDelta _F}{\eta ^2\delta \varepsilon ^2}, \end{aligned}$$

which holds, in particular, whenever \(1\le b \le \frac{2\sqrt{6\delta }}{L\sqrt{L}\eta \varepsilon }\).
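Indeed, rearranging the last inequality gives

$$\begin{aligned} b \le \frac{96\sqrt{3}}{24\sqrt{2}}\cdot \frac{\delta \max \left\{ 1,\sigma \right\} }{\eta L\sqrt{L\delta }\,\varepsilon } = \frac{2\sqrt{6\delta }\max \left\{ 1,\sigma \right\} }{L\sqrt{L}\eta \varepsilon }, \end{aligned}$$

and since \(\max \left\{ 1,\sigma \right\} \ge 1\), the stated range for \(b\) is sufficient.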

Finally, the total number of stochastic gradient evaluations \(\nabla {f}_{\xi }(x_t)\) is at most

$$\begin{aligned} \mathcal {T}_{\nabla {f}}&:= \big [\tilde{b} + 3b(m+1) \big ]S = \left\lceil (c_1^2 + 3)b(m+1)S \right\rceil \\&= \Bigg \lceil (c_1^2 + 3)\frac{96\sqrt{3}\varDelta _F}{\eta ^3L\sqrt{L\delta } \varepsilon ^3}\max \left\{ 1,\sigma \right\} \Bigg \rceil . \end{aligned}$$

The total number of proximal operations \(\mathrm {prox}_{\eta \psi }\) is at most \(\mathcal {T}_{\mathrm {prox}} = (m+1)S = \left\lceil \frac{96\sqrt{3}\varDelta _F}{\eta ^3L\sqrt{L\delta } b\varepsilon ^3}\max \left\{ 1,\sigma \right\} \right\rceil \). \(\square \)
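As a complement to the analysis above, the following short Python sketch (our own illustrative code; the function name and inputs are ours) evaluates the backward step-size recursion (43) of the restarting variant and checks numerically that \(\varSigma _m = \sum _{t=0}^m\gamma _t\) dominates the lower bound used in the proof.

```python
import math

def gamma_schedule(L, eta, beta, b, m):
    """Illustrative sketch (our own code and names): evaluate the backward
    step-size recursion (43) of the restarting variant and return the step
    sizes gamma_t, their sum Sigma_m, and the lower bound on Sigma_m used
    in the proof above."""
    delta = 2.0 / eta - 2.0 * L        # requires eta in (0, 1/L), so delta > 0
    c = 1.0 + (L * eta)**2             # the factor 1 + L^2 * eta^2
    gammas = [0.0] * (m + 1)
    gammas[m] = delta / L
    acc = 0.0                          # beta^2*gamma_{t+1} + beta^4*gamma_{t+2} + ...
    for t in range(m - 1, -1, -1):
        acc = beta**2 * (gammas[t + 1] + acc)
        gammas[t] = delta * b / (L * b + L * c * acc)
    sigma_m = sum(gammas)
    # Lower bound on Sigma_m obtained from Lemma 7 in the proof.
    s = 1.0 - beta**2
    lower = (delta * (m + 1) * math.sqrt(s)
             / (L * (math.sqrt(s) + math.sqrt(s + 4.0 * c * beta**2 * delta / (L * b)))))
    return gammas, sigma_m, lower

# Example with L = 1, eta = 0.5 < 1/L, beta = 0.95, b = 5, and m = 50.
gammas, sigma_m, lower = gamma_schedule(L=1.0, eta=0.5, beta=0.95, b=5, m=50)
print(sigma_m >= lower)   # the analysis guarantees this prints True
```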
