Abstract
We show that Hölder continuity of the gradient is not only a sufficient condition, but also a necessary condition for the existence of a global upper bound on the error of the first-order Taylor approximation. We also relate this global upper bound to the Hölder constant of the gradient. This relation is expressed as an interval, depending on the Hölder constant, in which the error of the first-order Taylor approximation is guaranteed to lie. We show that, for the Lipschitz continuous case, the interval cannot be reduced. An application to the norms of quadratic forms is proposed, which allows us to derive a novel characterization of Euclidean norms.
Notes
For example, \(x\in \mathbb {R}\mapsto x^3\), or \(x\in \mathbb {R}\mapsto x^2\sin (1/x^2)\) (with continuous extension at 0).
Note that the method requires \(\nu >0\). In fact, finding a descent direction for a non-smooth non-convex function is NP-hard [1], and thus, it is reasonable to ask that \(\nu >0\).
With the convention that \(0^0=1\), hence, \(\left( \frac{1+\nu }{\nu }\right) ^\nu =\left( \frac{1+\nu }{\nu }\right) ^{\nu /2}=1\), when \(\nu =0\).
With the convention that \(0^0=1\), as in Theorem 4.1.
That is, \(\langle Bx,x\rangle \ge 0\) for every \(x\in E\).
The result presented in Theorem 6.1 seems to be a novel (to the best of the authors’ knowledge) characterization of Euclidean norms in the finite-dimensional case. (For a detailed survey of results on equivalent characterizations of Euclidean norms, we refer the reader to the celebrated book by Amir [11].)
Indeed, a norm \(\Vert \cdot \Vert \) is completely determined by its unit ball K, via the identity \(\Vert x\Vert = \inf \,\{\alpha >0:x/\alpha \in K\}\). Since \(\Vert \cdot \Vert '\) is not Euclidean, its unit ball cannot be equal to \(\mathbb {B}^n\) (since \(\mathbb {B}^n\) is the unit ball of the canonical Euclidean norm).
References
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Yashtini, M.: On the global convergence rate of the gradient descent method for functions with Hölder continuous gradients. Optim. Lett. 10(6), 1361–1370 (2016)
Cartis, C., Gould, N.I., Toint, P.L.: Worst-case evaluation complexity of regularization methods for smooth unconstrained optimization using Hölder continuous gradients. Optim. Methods Softw. 32(6), 1273–1298 (2017)
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152(1–2), 381–404 (2015)
Boumal, N., Absil, P.A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal. 39(1), 1–33 (2018)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)
Jordan, P., von Neumann, J.: On inner products in linear, metric spaces. Ann. Math. 36(3), 719–723 (1935)
Friedman, A.: Foundations of Modern Analysis. Courier Corporation, North Chelmsford (1982)
Motzkin, T.S., Straus, E.G.: Maxima for graphs and a new proof of a theorem of Turán. Can. J. Math. 17, 533–540 (1965)
Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, vol. 2. SIAM, Philadelphia (2001)
Amir, D.: Characterizations of Inner Product Spaces. Birkhäuser Verlag, Basel (1986)
John, F.: Extremum problems with inequalities as subsidiary conditions. In: Giorgi, G., Kjeldsen, T. (eds.) Traces and Emergence of Nonlinear Programming. Birkhäuser, Basel (2014)
Ball, K.: Ellipsoids of maximal volume in convex bodies. Geom. Dedicata 41(2), 241–250 (1992)
Acknowledgements
This work was supported by (i) the Fonds de la Recherche Scientifique—FNRS and the Fonds Wetenschappelijk Onderzoek—Vlaanderen under EOS Project No. 30468160, (ii) “Communauté française de Belgique—Actions de Recherche Concertées” (contract ARC 14/19-060). The research of the first author was supported by a FNRS/FRIA grant. The research of the third author was supported by the FNRS, the Walloon Region and the Innoviris Foundation. The research of the fourth author was supported by ERC Advanced Grant 788368.
Appendix: Proof of Theorem 6.1
The proof relies on the fact that, if \(\Vert \cdot \Vert \) is not Euclidean, then the unit ball defined by \(\Vert \cdot \Vert \), i.e., \(\{x\in E:\Vert x\Vert \le 1\}\), is not equal to the ellipsoid with smallest volume containing this ball. Based on this ellipsoid, we will build a self-adjoint operator \(B:E\rightarrow E^*\), such that \(\Vert Q_B\Vert <\Vert B\Vert \). The notions of ellipsoid and (Lebesgue) volume are defined on \(\mathbb {R}^n\) only. The following lemma implies, among other things, that there is no loss of generality in restricting to the case \(E=\mathbb {R}^n\):
Lemma A.1
Let E be a real vector space with norm \(\Vert \cdot \Vert \), and let \(A:E \rightarrow E'\) be a bijective linear map. Then, \(\Vert \cdot \Vert \) is Euclidean if and only if the norm \(\Vert \cdot \Vert '\) on \(E'\), defined by \(\Vert x\Vert ' = \Vert A^{-1}x\Vert \), is Euclidean.
Proof
This follows directly from the definition: \(\Vert \cdot \Vert \) is Euclidean if and only if it is induced by a scalar product, i.e., if and only if there exists a self-adjoint operator \(H:E\rightarrow E^*\) satisfying \(\Vert x\Vert ^2=\langle Hx,x\rangle \) for all \(x\in E\). \(\square \)
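As a numerical companion to Lemma A.1 (an illustration of ours, not part of the paper's proof), one can exploit the Jordan–von Neumann criterion [7]: a norm is Euclidean if and only if it satisfies the parallelogram law, and the law is preserved under the substitution \(x\mapsto A^{-1}x\). The map \(A\) below is an arbitrary choice; a sketch assuming NumPy:

```python
import numpy as np

def parallelogram_defect(norm, x, y):
    """Zero for all x, y exactly when the norm satisfies the parallelogram law."""
    return norm(x + y) ** 2 + norm(x - y) ** 2 - 2 * norm(x) ** 2 - 2 * norm(y) ** 2

A = np.array([[2.0, 1.0], [0.0, 1.0]])   # an arbitrary bijective linear map
Ainv = np.linalg.inv(A)

transported = lambda x: np.linalg.norm(Ainv @ x)          # ||x||' = ||A^{-1}x||_2
linf = lambda x: np.linalg.norm(Ainv @ x, ord=np.inf)     # transported l_inf norm

# Transporting a Euclidean norm yields a Euclidean norm: the defect vanishes.
rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert abs(parallelogram_defect(transported, x, y)) < 1e-9

# Transporting the non-Euclidean l_inf norm cannot repair it: the pair
# (A e_1, A e_2) witnesses a violation of the law.
assert np.isclose(
    parallelogram_defect(linf, A @ np.array([1.0, 0.0]), A @ np.array([0.0, 1.0])),
    -2.0,
)
```

The check on random pairs is only a sample, of course; for the Euclidean case the defect vanishes identically, which is what the lemma's "if and only if" rests on.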
Proof of Theorem 6.1
The “only if” part follows from Proposition 6.1. For the proof of the “if” part, let E be an n-dimensional vector space, and let \(\Vert \cdot \Vert \) be a non-Euclidean norm on E. We will build a self-adjoint operator B on E, such that \(\Vert Q_B\Vert <\Vert B\Vert \).
By Lemma A.1, we may assume that \(E=\mathbb {R}^n\) and that \(\Vert \cdot \Vert \) is a non-Euclidean norm on \(\mathbb {R}^n\). We use superscripts to denote the components of vectors in \(\mathbb {R}^n\): \(x=(x^{(1)},\ldots ,x^{(n)})^\top \).
Let \(K=\{ x\in \mathbb {R}^n : \Vert x\Vert \le 1 \}\). Because K is compact, convex, with non-empty interior, and symmetric with respect to the origin, the Löwner–John ellipsoid theorem [12, 13] asserts that there exists a unique ellipsoid \(\mathcal {E}\) of minimal volume such that \(K\subseteq \mathcal {E}\). Moreover, \(\mathcal {E}\) is centered at the origin, and the boundary of \(\mathcal {E}\) contains n linearly independent vectors of K.
Let \(L:\mathbb {R}^n\rightarrow \mathbb {R}^n\) be a linear isomorphism, such that \(L\mathcal {E}\) is the Euclidean ball \(\mathbb {B}^n=\{x\in \mathbb {R}^n : \Vert x\Vert _2\le 1\}\), where \(\Vert x\Vert _2=\sqrt{x^\top x}\) is the canonical Euclidean norm on \(\mathbb {R}^n\). Let \(\Vert x\Vert '= \Vert L^{-1}x\Vert \), and let \(K'=\{ x\in \mathbb {R}^n : \Vert x\Vert ' \le 1 \}\). By Lemma A.1, \(\Vert \cdot \Vert '\) is not Euclidean. Since \(K'=LK\), it is clear that \(K'\) is compact, convex, with non-empty interior, and symmetric with respect to the origin. Moreover, \(K'\) is included in \(\mathbb {B}^n\), and it has n linearly independent vectors on the boundary \(\mathbb {S}^{n-1}\) of \(\mathbb {B}^n\).
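As a concrete illustration (our running example, not taken from the paper), this construction can be carried out explicitly for the \(\ell _\infty \)-norm on \(\mathbb {R}^2\): the minimal-volume ellipsoid containing \(K=[-1,1]^2\) is the disk of radius \(\sqrt{2}\), so one may take \(L=I/\sqrt{2}\), and \(K'=LK\) is then a square inscribed in \(\mathbb {B}^2\) with its four corners on \(\mathbb {S}^1\). A short numerical sketch assuming NumPy:

```python
import numpy as np

# Illustrative example (our choice): ||x|| = ||x||_inf on R^2, whose unit ball
# K = [-1,1]^2 has the disk of radius sqrt(2) as its Löwner-John ellipsoid.
L = np.eye(2) / np.sqrt(2)      # L maps the ellipsoid E onto the unit disk B^2
Linv = np.linalg.inv(L)

def norm_prime(x):
    """||x||' = ||L^{-1} x||_inf, the transported norm of Lemma A.1."""
    return np.linalg.norm(Linv @ x, ord=np.inf)

# The corners of K' = LK lie on the unit circle S^1 ...
corners = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]]) / np.sqrt(2)
assert np.allclose([np.linalg.norm(c) for c in corners], 1.0)
assert np.allclose([norm_prime(c) for c in corners], 1.0)

# ... and K' is contained in B^2: every sampled x with ||x||' <= 1 has ||x||_2 <= 1.
rng = np.random.default_rng(0)
for x in rng.uniform(-1, 1, size=(1000, 2)):
    if norm_prime(x) <= 1:
        assert np.linalg.norm(x) <= 1 + 1e-12
```

Here the two (in fact four) linearly independent contact vectors on \(\mathbb {S}^{1}\) are visible as the corners of the inscribed square.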
We will need the following lemma to conclude the proof of Theorem 6.1:
Lemma A.2
There exist \(u,v\in \mathbb {S}^{n-1}\cap K'\), not collinear, such that \(\frac{u+v}{\Vert u+v\Vert _2}\notin K'\).
We proceed with the proof of Theorem 6.1 (a proof of Lemma A.2 is provided at the end of this Appendix). Let u, v be as in Lemma A.2, and define \(e_1 = \frac{u+v}{\Vert u+v\Vert _2}\) and \(e_2=\frac{u-v}{\Vert u-v\Vert _2}\). Note that these vectors are orthonormal (w.r.t. the inner product \(x^\top y\)).
Let \(\kappa = \max \, \{ |e_1^\top x|: x\in K' \}\). Since \(|e_1^\top x|<1\) for every \(x\in \mathbb {B}^n\setminus \{\pm e_1\}\), and \(\pm e_1\notin K'\), we have that \(\kappa <1\). Moreover, \(\kappa >0\), since \(\mathrm {int}(K')\ne \varnothing \). Let \(\tilde{B}\) be the self-adjoint operator on \(\mathbb {R}^n\), defined by
\[ \langle \tilde{B}x,y\rangle = \frac{(e_1^\top x)(e_1^\top y)}{\kappa ^2} - (e_2^\top x)(e_2^\top y), \]
for every \(x,y\in \mathbb {R}^n\). Let \(x\in K'\). Then,
\[ |\langle \tilde{B}x,x\rangle | = \left| \frac{(e_1^\top x)^2}{\kappa ^2} - (e_2^\top x)^2 \right| \le \max \left\{ \frac{(e_1^\top x)^2}{\kappa ^2},\, (e_2^\top x)^2 \right\} \le 1, \]
since \(|e_1^\top x|\le \kappa \) and \(|e_2^\top x|\le \Vert x\Vert _2\le 1\).
It follows that, for every \(x\in \mathbb {R}^n\) with \(x\ne 0\), \(|\langle \tilde{B}x,x\rangle |= \Vert x\Vert '^2\,|\langle \tilde{B}\frac{x}{\Vert x\Vert '},\frac{x}{\Vert x\Vert '}\rangle |\le \Vert x\Vert '^2\). Hence, \(\Vert Q_{\tilde{B}}\Vert \le 1\). Now, we will show that \(|\langle \tilde{B}u,v\rangle |>\Vert u\Vert '\Vert v\Vert '\) (where u, v are as above). To this end, let \(\alpha =\Vert u+v\Vert _2\) and \(\beta =\Vert u-v\Vert _2\). Observe that \(u=\frac{\alpha e_1+\beta e_2}{2}\) and \(v=\frac{\alpha e_1-\beta e_2}{2}\). Thus,
\[ \langle \tilde{B}u,v\rangle = \frac{1}{\kappa ^2}\,\frac{\alpha }{2}\,\frac{\alpha }{2} - \frac{\beta }{2}\left( -\frac{\beta }{2}\right) = \frac{\alpha ^2}{4\kappa ^2} + \frac{\beta ^2}{4}. \]
This shows that \(\langle \tilde{B}u,v\rangle >1\), since (by the parallelogram identity)
\[ \alpha ^2+\beta ^2 = 2\Vert u\Vert _2^2 + 2\Vert v\Vert _2^2 = 4, \]
\(0<\kappa <1\), and \(\alpha >0\). Since \(u,v\in K'\) (i.e., \(\Vert u\Vert ',\Vert v\Vert '\le 1\)), we have that \(\Vert u\Vert '\Vert v\Vert '\le 1<|\langle \tilde{B}u,v\rangle |\). Thus, \(\Vert \tilde{B}\Vert >1\).
Finally, define the self-adjoint operator B on E by \(\langle Bx,y\rangle = \langle \tilde{B}Lx,Ly\rangle \). It is clear, from the definition of \(\Vert \cdot \Vert '\), that \(|\langle Bx,x\rangle |\le \Vert x\Vert ^2\) for every \(x\in E\) and \(|\langle Bx,y\rangle |>\Vert x\Vert \Vert y\Vert \) for \(x=L^{-1}u\) and \(y=L^{-1}v\) (where u, v are as above). Hence, one gets \(\Vert Q_B\Vert \le 1<\Vert B\Vert \). This concludes the proof of Theorem 6.1. \(\square \)
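To make the construction tangible, here is a hedged numerical check on a concrete instance of ours (the \(\ell _\infty \)-norm on \(\mathbb {R}^2\), so \(K'=[-1/\sqrt{2},1/\sqrt{2}]^2\) and \(\kappa =1/\sqrt{2}\)), taking \(\tilde B\) to be the rank-two form \(\langle \tilde Bx,y\rangle = (e_1^\top x)(e_1^\top y)/\kappa ^2 - (e_2^\top x)(e_2^\top y)\), our reading of the construction above. A sketch assuming NumPy:

```python
import numpy as np

# Our l_inf example on R^2: Lemma A.2 holds with the two corners u, v below,
# since (u+v)/||u+v||_2 = (1,0) lies outside K' = [-1/sqrt(2), 1/sqrt(2)]^2.
u = np.array([1.0, 1.0]) / np.sqrt(2)
v = np.array([1.0, -1.0]) / np.sqrt(2)
e1 = (u + v) / np.linalg.norm(u + v)      # = (1, 0)
e2 = (u - v) / np.linalg.norm(u - v)      # = (0, 1)

kappa = 1 / np.sqrt(2)                    # max of |e1^T x| over x in K'

def B_tilde(x, y):
    """<B~x, y> = (e1^T x)(e1^T y)/kappa^2 - (e2^T x)(e2^T y)."""
    return (e1 @ x) * (e1 @ y) / kappa**2 - (e2 @ x) * (e2 @ y)

def norm_prime(x):
    """||x||' = ||sqrt(2) x||_inf for this example."""
    return np.sqrt(2) * np.linalg.norm(x, ord=np.inf)

# |<B~x, x>| <= 1 on K', so ||Q_B|| <= 1 ...
rng = np.random.default_rng(0)
for x in rng.uniform(-kappa, kappa, size=(1000, 2)):   # samples cover K'
    assert abs(B_tilde(x, x)) <= 1 + 1e-12

# ... while <B~u, v> = alpha^2/(4 kappa^2) + beta^2/4 = 3/2 > 1 = ||u||'||v||',
# so the bilinear norm ||B~|| strictly exceeds the quadratic norm ||Q_B~||.
assert np.isclose(B_tilde(u, v), 1.5)
assert np.isclose(norm_prime(u) * norm_prime(v), 1.0)
```

On this instance the gap of Theorem 6.1 is explicit: \(\Vert Q_{\tilde B}\Vert \le 1 < 3/2 \le \Vert \tilde B\Vert \).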
It remains to prove Lemma A.2. The following proposition, known as the Fritz John necessary conditions for optimality, will be useful in the proof of Lemma A.2:
Proposition A.1
(Fritz John necessary conditions [12]) Let S be a compact metric space. Let F(x) be a real-valued function on \(\mathbb {R}^n\), and let G(x, y) be a real-valued function defined for all \((x,y)\in \mathbb {R}^n\times S\). Assume that F(x) and G(x, y) are both differentiable with respect to x and that F(x), G(x, y), \(\frac{\partial F}{\partial x}(x)\), and \(\frac{\partial G}{\partial x}(x,y)\) are continuous on \(\mathbb {R}^n\times S\). Let \(R=\{x\in \mathbb {R}^n:G(x,y)\le 0,\,\forall y\in S\}\), and suppose that R is non-empty.
Let \(x^*\in R\) be such that \(F(x^*)=\max _{x\in R} F(x)\). Then, there is \(m\in \{0,\ldots ,n\}\), and points \(y_1,\ldots ,y_m\in S\), and nonnegative multipliers \(\lambda _0,\lambda _1,\ldots ,\lambda _m\ge 0\), such that (i) \(G(x^*,y_i)=0\) for every \(1\le i\le m\), (ii) \(\sum _{i=0}^m \lambda _i >0\), and (iii)
\[ \lambda _0\,\frac{\partial F}{\partial x}(x^*) = \sum _{i=1}^m \lambda _i\,\frac{\partial G}{\partial x}(x^*,y_i). \]
We refer the reader to [12] for a proof of Proposition A.1.
Proof of Lemma A.2
Consider the following optimization problem:
\[ \max \; F(x) = \Vert x\Vert _2^2 \quad \text {s.t.} \quad x^\top y \le 1, \;\; \forall \, y\in \mathbb {S}^{n-1}\cap K', \tag {11} \]
with variable \(x\in \mathbb {R}^n\).
First, we show that the feasible set of (11) is bounded. Suppose the contrary, and, for every \(k\ge 1\), let \(x_k\) be a feasible solution with \(\Vert x_k\Vert _2\ge k\). Let \(\hat{x}_k=x_k/\Vert x_k\Vert _2\). Taking a subsequence if necessary, we may assume that \(\hat{x}_k\) converges to some \(\hat{x}_*\), with \(\Vert \hat{x}_*\Vert _2=1\). Since \(\hat{x}_k^\top y\le 1/\Vert x_k\Vert _2\) for every \(y\in \mathbb {S}^{n-1}\cap K'\), we have that \(\hat{x}_*^\top y \le 0\) for every \(y\in \mathbb {S}^{n-1}\cap K'\). By symmetry of \(\mathbb {S}^{n-1}\cap K'\), it follows that \(\hat{x}_*^\top y =0\) for every \(y\in \mathbb {S}^{n-1}\cap K'\), contradicting the fact that \(\mathbb {S}^{n-1}\cap K'\) contains n linearly independent vectors. Hence, the feasible set of (11) is bounded, and closed (as an intersection of closed sets), so that (11) has an optimal solution, say \(\bar{x}\).
We will show that \(\Vert \bar{x}\Vert _2>1\). To this end, we use the fact that \(K'\ne \mathbb {B}^n\) (see footnote 7). Fix some \(z\in \mathbb {S}^{n-1}\setminus K'\), and let \(\eta =\max \,\{ z^\top y:y\in K' \}\). Since \(z^\top y<1\) for every \(y\in \mathbb {B}^n\setminus \{z\}\), and \(z\notin K'\), we have that \(\eta <1\). Let \(x=z/\eta \). From the definition of \(\eta \), it is clear that x is a feasible solution of (11). Moreover, \(\Vert x\Vert _2=\eta ^{-1}>1\), so that \(\Vert \bar{x}\Vert _2\ge \Vert x\Vert _2>1\).
The gradient of F at \(\bar{x}\) is equal to \(2\bar{x}\). Then, Proposition A.1 asserts that there exist vectors \(y_1,\ldots ,y_m\in \mathbb {S}^{n-1}\cap K'\), and nonnegative multipliers \(\lambda _0,\lambda _1,\ldots ,\lambda _m\ge 0\), such that \(\bar{x}^\top y_i=1\) for every \(1\le i\le m\), \(\sum _{i=0}^m \lambda _i>0\), and \(\lambda _0\bar{x}=\sum _{i=1}^m \lambda _iy_i\) (after rescaling \(\lambda _0\)). If \(\lambda _0=0\), then \(0=\sum _{i=1}^m \lambda _i\bar{x}^\top y_i = \sum _{i=1}^m \lambda _i >0\), a contradiction. Hence, \(\lambda _0>0\). Suppose that \(y_1,\ldots ,y_m\) are collinear. Then all \(y_i\)'s must be parallel to \(\bar{x}\) (because \(\lambda _0\bar{x}=\sum _{i=1}^m \lambda _iy_i\) and \(\lambda _0\bar{x}\ne 0\)), and since they are in \(\mathbb {S}^{n-1}\), we have that \(y_i=\pm \bar{x}/\Vert \bar{x}\Vert _2\), so that \(\bar{x}^\top y_i=\Vert \bar{x}\Vert _2>1\) or \(\bar{x}^\top y_i=-\Vert \bar{x}\Vert _2<0\); either way, this contradicts \(\bar{x}^\top y_i=1\). Thus, there exist at least two non-collinear vectors \(u,v\in \mathbb {S}^{n-1}\cap K'\) satisfying \(\bar{x}^\top u=1\) and \(\bar{x}^\top v=1\).
Let \(e_1 = \frac{u+v}{\Vert u+v\Vert _2}\). Since u and v are not collinear, \(\Vert u+v\Vert _2<2\), and thus, \(\bar{x}^\top e_1=2/\Vert u+v\Vert _2>1\). This shows that \(e_1\notin \mathbb {S}^{n-1}\cap K'\). By definition, \(e_1\) is in \(\mathbb {S}^{n-1}\), so that \(e_1\notin K'\), concluding the proof of the lemma. \(\square \)
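The optimization problem of Lemma A.2 can also be solved numerically on a concrete instance (our \(\ell _\infty \) example on \(\mathbb {R}^2\), not taken from the paper): there \(\mathbb {S}^{1}\cap K'\) consists of the four corners of \(K'=[-1/\sqrt{2},1/\sqrt{2}]^2\), and a crude grid search stands in for an exact solver. A sketch assuming NumPy:

```python
import numpy as np

# Maximize F(x) = ||x||_2^2 subject to x^T y <= 1 for all y in S^1 ∩ K',
# where S^1 ∩ K' is the set of corners of K' = [-1/sqrt(2), 1/sqrt(2)]^2.
corners = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]]) / np.sqrt(2)

def feasible(x):
    return bool(np.all(corners @ np.asarray(x) <= 1 + 1e-9))

grid = np.linspace(-2.0, 2.0, 401)
best = max((np.array([a, b]) for a in grid for b in grid if feasible((a, b))),
           key=lambda x: x @ x)

# The maximizer has ||x_bar||_2 = sqrt(2) > 1 (a vertex of the feasible diamond),
assert np.isclose(np.linalg.norm(best), np.sqrt(2), atol=1e-2)

# with two active, non-collinear corners u, v (x_bar^T u = x_bar^T v = 1) ...
active = [y for y in corners if np.isclose(best @ y, 1.0, atol=1e-2)]
u, v = active[0], active[1]

# ... whose normalized sum e_1 leaves K', exactly as Lemma A.2 asserts:
e1 = (u + v) / np.linalg.norm(u + v)
assert np.sqrt(2) * np.linalg.norm(e1, ord=np.inf) > 1   # ||e1||' > 1, so e1 not in K'
```

The grid search is only an illustration of the mechanism (optimal value above 1, two active non-collinear contact vectors); the proof itself needs no numerics.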
Cite this article
Berger, G.O., Absil, P.-A., Jungers, R.M. et al. On the Quality of First-Order Approximation of Functions with Hölder Continuous Gradient. J Optim Theory Appl 185, 17–33 (2020). https://doi.org/10.1007/s10957-020-01632-x
Keywords
- Hölder continuous gradient
- First-order Taylor approximation
- Lipschitz continuous gradient
- Lipschitz constant
- Euclidean norms