
On the Quality of First-Order Approximation of Functions with Hölder Continuous Gradient

Abstract

We show that Hölder continuity of the gradient is not only a sufficient condition, but also a necessary condition for the existence of a global upper bound on the error of the first-order Taylor approximation. We also relate this global upper bound to the Hölder constant of the gradient. This relation is expressed as an interval, depending on the Hölder constant, in which the error of the first-order Taylor approximation is guaranteed to be. We show that, for the Lipschitz continuous case, the interval cannot be reduced. An application to the norms of quadratic forms is proposed, which allows us to derive a novel characterization of Euclidean norms.
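
As a rough numerical illustration of the type of bound discussed above (a minimal sketch for illustration only, not part of the paper): for \(f(x)=|x|^{3/2}\) on \(\mathbb {R}\), the derivative \(f'(x)=\tfrac{3}{2}\,\mathrm{sign}(x)|x|^{1/2}\) is Hölder continuous with exponent \(\nu =1/2\), and the error of the first-order Taylor approximation, scaled by \(|y-x|^{1+\nu }\), remains bounded. The snippet below estimates both ratios empirically; quantifying this relation, and its converse, is precisely the subject of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nu = 0.5                                          # Hoelder exponent of f'
f  = lambda x: np.abs(x) ** 1.5
df = lambda x: 1.5 * np.sign(x) * np.abs(x) ** 0.5

x = rng.uniform(-2, 2, 100_000)
y = rng.uniform(-2, 2, 100_000)

# First-order Taylor error, scaled by |y - x|^(1 + nu).
taylor_ratio = np.abs(f(y) - f(x) - df(x) * (y - x)) / np.abs(y - x) ** (1 + nu)
# Empirical Hoelder ratio of the derivative.
hoelder_ratio = np.abs(df(x) - df(y)) / np.abs(x - y) ** nu

print("max scaled Taylor error :", taylor_ratio.max())   # stays bounded
print("empirical Hoelder ratio :", hoelder_ratio.max())
```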

Notes

  1. For example, \(x\in \mathbb {R}\mapsto x^3\), or \(x\in \mathbb {R}\mapsto x^2\sin (1/x^2)\) (with continuous extension at 0).

  2. Note that the method requires \(\nu >0\). In fact, finding a descent direction for a non-smooth non-convex function is NP-hard [1], and thus, it is reasonable to ask that \(\nu >0\).

  3. With the convention that \(0^0=1\), hence, \(\left( \frac{1+\nu }{\nu }\right) ^\nu =\left( \frac{1+\nu }{\nu }\right) ^{\nu /2}=1\), when \(\nu =0\).

  4. With the convention that \(0^0=1\), as in Theorem 4.1.

  5. That is, \(\langle Bx,x\rangle \ge 0\) for every \(x\in E\).

  6. The result presented in Theorem 6.1 seems to be a novel (to the best of the authors’ knowledge) characterization of Euclidean norms in the finite-dimensional case. (For a detailed survey of results on equivalent characterizations of Euclidean norms, we refer the reader to the celebrated book by Amir [11].)

  7. Indeed, a norm \(\Vert \cdot \Vert \) is completely determined by its unit ball K, via the identity \(\Vert x\Vert = \inf \,\{\alpha >0:x/\alpha \in K\}\). Since \(\Vert \cdot \Vert '\) is not Euclidean, its unit ball cannot be equal to \(\mathbb {B}^n\) (since \(\mathbb {B}^n\) is the unit ball of the canonical Euclidean norm).

References

  1. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  2. Yashtini, M.: On the global convergence rate of the gradient descent method for functions with Hölder continuous gradients. Optim. Lett. 10(6), 1361–1370 (2016)

  3. Cartis, C., Gould, N.I., Toint, P.L.: Worst-case evaluation complexity of regularization methods for smooth unconstrained optimization using Hölder continuous gradients. Optim. Methods Softw. 32(6), 1273–1298 (2017)

  4. Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152(1–2), 381–404 (2015)

  5. Boumal, N., Absil, P.A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal. 39(1), 1–33 (2018)

  6. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)

  7. Jordan, P., von Neumann, J.: On inner products in linear, metric spaces. Ann. Math. 36(3), 719–723 (1935)

  8. Friedman, A.: Foundations of Modern Analysis. Courier Corporation, North Chelmsford (1982)

  9. Motzkin, T.S., Straus, E.G.: Maxima for graphs and a new proof of a theorem of Turán. Can. J. Math. 17, 533–540 (1965)

  10. Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, vol. 2. SIAM, Philadelphia (2001)

  11. Amir, D.: Characterizations of Inner Product Spaces. Birkhäuser Verlag, Basel (1986)

  12. John, F.: Extremum problems with inequalities as subsidiary conditions. In: Giorgi, G., Kjeldsen, T. (eds.) Traces and Emergence of Nonlinear Programming. Birkhäuser, Basel (2014)

  13. Ball, K.: Ellipsoids of maximal volume in convex bodies. Geom. Dedicata 41(2), 241–250 (1992)

Acknowledgements

This work was supported by (i) the Fonds de la Recherche Scientifique—FNRS and the Fonds Wetenschappelijk Onderzoek—Vlaanderen under EOS Project No. 30468160, (ii) “Communauté française de Belgique—Actions de Recherche Concertées” (contract ARC 14/19-060). The research of the first author was supported by a FNRS/FRIA grant. The research of the third author was supported by the FNRS, the Walloon Region and the Innoviris Foundation. The research of the fourth author was supported by ERC Advanced Grant 788368.

Author information

Correspondence to Guillaume O. Berger.

Appendix: Proof of Theorem 6.1

The proof relies on the fact that, if \(\Vert \cdot \Vert \) is not Euclidean, then the unit ball defined by \(\Vert \cdot \Vert \), i.e., \(\{x\in E:\Vert x\Vert \le 1\}\), is not equal to the ellipsoid with smallest volume containing this ball. Based on this ellipsoid, we will build a self-adjoint operator \(B:E\rightarrow E^*\), such that \(\Vert Q_B\Vert <\Vert B\Vert \). The notions of ellipsoid and (Lebesgue) volume are defined on \(\mathbb {R}^n\) only. The following lemma implies, among other things, that there is no loss of generality in restricting to the case \(E=\mathbb {R}^n\):

Lemma A.1

Let E be a real vector space with norm \(\Vert \cdot \Vert \), and let \(A:E \rightarrow E'\) be a bijective linear map. Then, \(\Vert \cdot \Vert \) is Euclidean, if and only if the norm \(\Vert \cdot \Vert '\) on \(E'\), defined by \(\Vert x\Vert ' = \Vert A^{-1}x\Vert \), is Euclidean.

Proof

Straightforward from the definition: \(\Vert \cdot \Vert \) is Euclidean, if and only if it is induced by a scalar product, i.e., if and only if there exists a self-adjoint operator \(H:E\rightarrow E^*\), satisfying \(\Vert x\Vert ^2=\langle Hx,x\rangle \) for all \(x\in E\). \(\square \)
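
As a side remark (an illustrative sketch, not part of the paper): whether a norm is Euclidean can be tested numerically via the parallelogram identity, which characterizes norms induced by a scalar product [7], and the test is invariant under the pull-back of Lemma A.1. The snippet below, assuming NumPy is available, illustrates this on \(\mathbb {R}^2\):

```python
import numpy as np

rng = np.random.default_rng(1)

def parallelogram_defect(norm, trials=10_000, dim=2):
    """Largest observed violation of ||u+v||^2 + ||u-v||^2 = 2||u||^2 + 2||v||^2."""
    worst = 0.0
    for _ in range(trials):
        u, v = rng.standard_normal(dim), rng.standard_normal(dim)
        lhs = norm(u + v) ** 2 + norm(u - v) ** 2
        rhs = 2 * norm(u) ** 2 + 2 * norm(v) ** 2
        worst = max(worst, abs(lhs - rhs))
    return worst

A = np.array([[2.0, 1.0], [0.0, 1.0]])            # an arbitrary invertible map A
A_inv = np.linalg.inv(A)

print(parallelogram_defect(np.linalg.norm))                       # ~0: the l2 norm is Euclidean
print(parallelogram_defect(lambda x: np.linalg.norm(x, 1)))       # >0: the l1 norm is not
print(parallelogram_defect(lambda x: np.linalg.norm(A_inv @ x)))  # ~0: pulled-back norm stays Euclidean
```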

Proof of Theorem 6.1

The “only if” part follows from Proposition 6.1. For the proof of the “if” part, let E be an n-dimensional vector space, and let \(\Vert \cdot \Vert \) be a non-Euclidean norm on E. We will build a self-adjoint operator B on E, such that \(\Vert Q_B\Vert <\Vert B\Vert \).

By Lemma A.1, we may assume that \(E=\mathbb {R}^n\) and that \(\Vert \cdot \Vert \) is a non-Euclidean norm on \(\mathbb {R}^n\). We use superscripts to denote the components of vectors in \(\mathbb {R}^n\): \(x=(x^{(1)},\ldots ,x^{(n)})^\top \).

Let \(K=\{ x\in \mathbb {R}^n : \Vert x\Vert \le 1 \}\). Because K is compact, convex, with non-empty interior, and symmetric with respect to the origin, the Löwner–John ellipsoid theorem [12, 13] asserts that there exists a unique ellipsoid \(\mathcal {E}\) of minimal volume such that \(K\subseteq \mathcal {E}\). Moreover, \(\mathcal {E}\) is centered at the origin, and K has n linearly independent vectors on the boundary of \(\mathcal {E}\).

Let \(L:\mathbb {R}^n\rightarrow \mathbb {R}^n\) be a linear isomorphism, such that \(L\mathcal {E}\) is the Euclidean ball \(\mathbb {B}^n=\{x\in \mathbb {R}^n : \Vert x\Vert _2\le 1\}\), where \(\Vert x\Vert _2=\sqrt{x^\top x}\) is the canonical Euclidean norm on \(\mathbb {R}^n\). Let \(\Vert x\Vert '= \Vert L^{-1}x\Vert \), and let \(K'=\{ x\in \mathbb {R}^n : \Vert x\Vert ' \le 1 \}\). By Lemma A.1, \(\Vert \cdot \Vert '\) is not Euclidean. Since \(K'=LK\), it is clear that \(K'\) is compact, convex, with non-empty interior, and symmetric with respect to the origin. Moreover, \(K'\) is included in \(\mathbb {B}^n\), and it has n linearly independent vectors on the boundary \(\mathbb {S}^{n-1}\) of \(\mathbb {B}^n\).
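
For a concrete low-dimensional instance (an illustrative computation, assuming cvxpy and the SCS solver are available; not part of the paper): when \(\Vert \cdot \Vert \) is the \(\ell _1\) norm on \(\mathbb {R}^2\), the minimal-volume origin-centered ellipsoid \(\{x: x^\top Mx\le 1\}\) containing K can be found by maximizing \(\log \det M\) over \(M\succeq 0\); the solution is the identity, so the Löwner–John ellipsoid is the unit disk, L can be taken to be the identity, and \(K'=K\) touches \(\mathbb {S}^{1}\) at \(\pm (1,0)\) and \(\pm (0,1)\).

```python
import cvxpy as cp
import numpy as np

# Vertices of K = l1 unit ball in R^2 (a compact, symmetric convex body).
V = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

# Minimal-volume origin-centered ellipsoid {x : x^T M x <= 1} containing K:
# maximize log det M subject to v^T M v <= 1 at every vertex v (convexity does the rest).
M = cp.Variable((2, 2), PSD=True)
problem = cp.Problem(cp.Maximize(cp.log_det(M)),
                     [cp.quad_form(v, M) <= 1 for v in V])
problem.solve(solver=cp.SCS)

print(np.round(M.value, 3))   # ~ identity: the Loewner-John ellipsoid of K is the unit disk
```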

We will need the following lemma to conclude the proof of Theorem 6.1:

Lemma A.2

There exist \(u,v\in \mathbb {S}^{n-1}\cap K'\), not colinear, such that \(\frac{u+v}{\Vert u+v\Vert _2}\notin K'\).
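
In the \(\ell _1\)-ball example above (illustration only), Lemma A.2 can be verified directly: \(u=(1,0)\) and \(v=(0,1)\) lie in \(\mathbb {S}^{1}\cap K'\) and are not colinear, while their normalized midpoint direction leaves \(K'\):

```python
import numpy as np

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # on the sphere and in the l1 ball
w = (u + v) / np.linalg.norm(u + v)                 # = (1/sqrt(2), 1/sqrt(2))
print(np.linalg.norm(w, 1))                         # sqrt(2) > 1, so w lies outside K'
```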

We proceed with the proof of Theorem 6.1 (a proof of Lemma A.2 is provided at the end of this “Appendix”). Let \(u,v\) be as in Lemma A.2, and define \(e_1 = \frac{u+v}{\Vert u+v\Vert _2}\) and \(e_2=\frac{u-v}{\Vert u-v\Vert _2}\). Note that these vectors are orthonormal (w.r.t. the inner product \(x^\top y\)).

Let \(\kappa = \max \, \{ |e_1^\top x|: x\in K' \}\). Since \(|e_1^\top x|<1\) for every \(x\in \mathbb {B}^n\setminus \{\pm e_1\}\), and \(\pm e_1\notin K'\), we have that \(\kappa <1\). Moreover, \(\kappa >0\), since \(\mathrm {int}(K')\ne \varnothing \). Let \(\tilde{B}\) be the self-adjoint operator on \(\mathbb {R}^n\), defined by

$$\begin{aligned} \langle \tilde{B}x,y\rangle = \frac{1}{\kappa ^2}\, \left( e_1^\top x\right) \left( e_1^\top y\right) - \left( e_2^\top x\right) \left( e_2^\top y\right) \end{aligned}$$

for every \(x,y\in \mathbb {R}^n\). Let \(x\in K'\). Then,

$$\begin{aligned} -1\le - \left( e_2^\top x\right) ^2 \le \langle \tilde{B}x,x\rangle \le \frac{1}{\kappa ^2}\,\left( e_1^\top x\right) ^2 \le \frac{1}{\kappa ^2}\,\kappa ^2 = 1 . \end{aligned}$$

It follows that, for every \(x\in \mathbb {R}^n\) with \(x\ne 0\), \(|\langle \tilde{B}x,x\rangle |= \Vert x\Vert '^2\,|\langle \tilde{B}\frac{x}{\Vert x\Vert '},\frac{x}{\Vert x\Vert '}\rangle |\le \Vert x\Vert '^2\). Hence, \(\Vert Q_{\tilde{B}}\Vert \le 1\). Now, we will show that \(|\langle \tilde{B}u,v\rangle |>\Vert u\Vert '\Vert v\Vert '\) (where \(u,v\) are as above). To this end, let \(\alpha =\Vert u+v\Vert _2\) and \(\beta =\Vert u-v\Vert _2\). Observe that \(u=\frac{\alpha e_1+\beta e_2}{2}\) and \(v=\frac{\alpha e_1-\beta e_2}{2}\). Thus,

$$\begin{aligned} \langle \tilde{B}u,v\rangle = \frac{1}{\kappa ^2}\,\frac{\alpha ^2}{4} + \frac{\beta ^2}{4} = \frac{1-\kappa ^2}{\kappa ^2}\,\frac{\alpha ^2}{4} + \frac{\alpha ^2}{4} + \frac{\beta ^2}{4} . \end{aligned}$$

This shows that \(\langle \tilde{B}u,v\rangle >1\), since (by the parallelogram identity)

$$\begin{aligned} \frac{\alpha ^2}{4} + \frac{\beta ^2}{4} = \frac{1}{4}\left( \Vert u+v\Vert _2^2 + \Vert u-v\Vert _2^2 \right) = 1 , \end{aligned}$$

\(0<\kappa <1\), and \(\alpha >0\). Since \(u,v\in K'\) (i.e., \(\Vert u\Vert ',\Vert v\Vert '\le 1\)), we have that \(\Vert u\Vert '\Vert v\Vert '\le 1<|\langle \tilde{B}u,v\rangle |\). Thus, \(\Vert \tilde{B}\Vert >1\).
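
Continuing the \(\ell _1\)-ball illustration (a numerical check of our own, not the authors' code): there \(\kappa =1/\sqrt{2}\), and the operator \(\tilde{B}\) built above satisfies \(|\langle \tilde{B}x,x\rangle |\le 1\) on \(K'\) while \(\langle \tilde{B}u,v\rangle =3/2>1\), matching the two estimates just derived.

```python
import numpy as np

rng = np.random.default_rng(2)

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
e1 = (u + v) / np.linalg.norm(u + v)
e2 = (u - v) / np.linalg.norm(u - v)

# kappa = max |e1^T x| over K' (the l1 ball); attained at a vertex.
vertices = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
kappa = np.max(np.abs(vertices @ e1))               # = 1/sqrt(2)

B = np.outer(e1, e1) / kappa**2 - np.outer(e2, e2)  # the operator B-tilde

# Check |<Bx, x>| <= 1 on the l1 sphere (enough, by homogeneity of the quadratic form).
x = rng.standard_normal((100_000, 2))
x /= np.abs(x).sum(axis=1, keepdims=True)
print(np.max(np.abs(np.einsum('ij,jk,ik->i', x, B, x))))   # <= 1, so ||Q_B|| <= 1
print(u @ B @ v)                                           # = 1.5 > 1 >= ||u||'||v||', so ||B|| > 1
```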

Finally, define the self-adjoint operator B on E by \(\langle Bx,y\rangle = \langle \tilde{B}Lx,Ly\rangle \). It is clear, from the definition of \(\Vert \cdot \Vert '\), that \(|\langle Bx,x\rangle |\le \Vert x\Vert ^2\) for every \(x\in E\), and that \(|\langle Bx,y\rangle |>\Vert x\Vert \Vert y\Vert \) for \(x=L^{-1}u\) and \(y=L^{-1}v\) (where \(u,v\) are as above). Hence, one gets \(\Vert Q_B\Vert \le 1<\Vert B\Vert \). This concludes the proof of Theorem 6.1. \(\square \)

It remains to prove Lemma A.2. The following proposition, known as the Fritz John necessary conditions for optimality, will be useful in the proof of Lemma A.2:

Proposition A.1

(Fritz John necessary conditions [12]) Let S be a compact metric space. Let F(x) be a real-valued function on \(\mathbb {R}^n\), and let G(x, y) be a real-valued function defined for all \((x,y)\in \mathbb {R}^n\times S\). Assume that F(x) and G(x, y) are both differentiable with respect to x, and that F(x), G(x, y), \(\frac{\partial F}{\partial x}(x)\), and \(\frac{\partial G}{\partial x}(x,y)\) are continuous on \(\mathbb {R}^n\times S\). Let \(R=\{x\in \mathbb {R}^n:G(x,y)\le 0,\,\forall y\in S\}\), and suppose that R is non-empty.

Let \(x^*\in R\) be such that \(F(x^*)=\max _{x\in R} F(x)\). Then, there is \(m\in \{0,\ldots ,n\}\), and points \(y_1,\ldots ,y_m\in S\), and nonnegative multipliers \(\lambda _0,\lambda _1,\ldots ,\lambda _m\ge 0\), such that (i) \(G(x^*,y_i)=0\) for every \(1\le i\le m\), (ii) \(\sum _{i=0}^m \lambda _i >0\), and (iii)

$$\begin{aligned} \lambda _0\frac{\partial F}{\partial x}(x^*)=\sum _{i=1}^m \lambda _i \frac{\partial G}{\partial x}(x^*,y_i). \end{aligned}$$

We refer the reader to [12] for a proof of Proposition A.1.
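
For orientation, here is a minimal instance of Proposition A.1 (an illustrative example, not taken from [12]): take \(n=1\), \(S=\{0\}\), \(F(x)=x\) and \(G(x,0)=x^2-1\), so that \(R=[-1,1]\) and \(x^*=1\). Conditions (i)–(iii) then read

$$\begin{aligned} G(x^*,0)=0, \qquad \lambda _0+\lambda _1>0, \qquad \lambda _0\cdot 1=\lambda _1\cdot 2x^*=2\lambda _1 , \end{aligned}$$

which hold, for instance, with \(\lambda _0=2\) and \(\lambda _1=1\).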

Proof of Lemma A.2

Consider the following optimization problem:

$$\begin{aligned} \begin{array}{ll} \text {maximize} &\quad F(x):=\Vert x\Vert _2^2 \\ \text {subject to} &\quad G(x,y):= x^\top y-1\le 0 \quad \text {for every } y\in \mathbb {S}^{n-1}\cap K', \end{array} \end{aligned}$$
(11)

with variable \(x\in \mathbb {R}^n\).

First, we show that (11) is bounded. Suppose the contrary and, for every \(k\ge 1\), let \(x_k\) be a feasible solution with \(\Vert x_k\Vert _2\ge k\). Let \(\hat{x}_k=x_k/\Vert x_k\Vert _2\). Taking a subsequence if necessary, we may assume that \(\hat{x}_k\) converges to some \(\hat{x}_*\), with \(\Vert \hat{x}_*\Vert _2=1\). Since \(\hat{x}_k^\top y\le 1/\Vert x_k\Vert _2\) for every \(y\in \mathbb {S}^{n-1}\cap K'\), we have that \(\hat{x}_*^\top y \le 0\) for every \(y\in \mathbb {S}^{n-1}\cap K'\). By symmetry of \(\mathbb {S}^{n-1}\cap K'\), it follows that \(\hat{x}_*^\top y =0\) for every \(y\in \mathbb {S}^{n-1}\cap K'\), a contradiction with the fact that \(\mathbb {S}^{n-1}\cap K'\) contains n linearly independent vectors. Hence, the set of feasible solutions of (11) is bounded, and closed (as the intersection of closed sets), so that (11) has an optimal solution, say \(\bar{x}\).

We will show that \(\Vert \bar{x}\Vert _2>1\). To this end, we use the fact that \(K'\ne \mathbb {B}^n\) (see footnote 7). Fix some \(z\in \mathbb {S}^{n-1}\setminus K'\), and let \(\eta =\max \,\{ z^\top y:y\in K' \}\). Since \(z^\top y<1\) for every \(y\in \mathbb {B}^n\setminus \{z\}\), and \(z\notin K'\), we have that \(\eta <1\). Let \(x=z/\eta \). From the definition of \(\eta \), it is clear that x is a feasible solution of (11). Moreover, \(\Vert x\Vert _2=\eta ^{-1}>1\), so that \(\Vert \bar{x}\Vert _2\ge \Vert x\Vert _2>1\).

The gradient of F at \(\bar{x}\) is equal to \(2\bar{x}\). Then, Proposition A.1 asserts that there exist vectors \(y_1,\ldots ,y_m\in \mathbb {S}^{n-1}\cap K'\), and nonnegative multipliers \(\lambda _0,\lambda _1,\ldots ,\lambda _m\ge 0\), such that \(\bar{x}^\top y_i=1\) for every \(1\le i\le m\), \(\sum _{i=0}^m \lambda _i>0\), and \(2\lambda _0\bar{x}=\sum _{i=1}^m \lambda _iy_i\); absorbing the factor 2 into \(\lambda _0\), we may write \(\lambda _0\bar{x}=\sum _{i=1}^m \lambda _iy_i\). If \(\lambda _0=0\), then \(0=\sum _{i=1}^m \lambda _i\bar{x}^\top y_i = \sum _{i=1}^m \lambda _i >0\), a contradiction. Hence, \(\lambda _0>0\). Suppose that \(y_1,\ldots ,y_m\) are colinear. This implies that all \(y_i\)’s must be parallel to \(\bar{x}\) (because \(\lambda _0\bar{x}=\sum _{i=1}^m \lambda _iy_i\) and \(\lambda _0\bar{x}\ne 0\)), and since they are in \(\mathbb {S}^{n-1}\), we have that \(y_i=\pm \bar{x}/\Vert \bar{x}\Vert _2\), so that \(\bar{x}^\top y_i=\Vert \bar{x}\Vert _2>1\) or \(-\bar{x}^\top y_i=\Vert \bar{x}\Vert _2>1\). This gives a contradiction, since \(-y_i\) and \(y_i\) are both in \(\mathbb {S}^{n-1}\cap K'\) (by the symmetry of \(\mathbb {S}^{n-1}\cap K'\)), so that feasibility of \(\bar{x}\) would require \(|\bar{x}^\top y_i|\le 1\). Thus, there exist at least two non-colinear vectors \(u,v\in \mathbb {S}^{n-1}\cap K'\) satisfying \(\bar{x}^\top u=1\) and \(\bar{x}^\top v=1\).

Let \(e_1 = \frac{u+v}{\Vert u+v\Vert _2}\). Since u and v are not colinear, \(\Vert u+v\Vert _2<2\), and thus, \(\bar{x}^\top e_1=2/\Vert u+v\Vert _2>1\). This shows that \(e_1\notin \mathbb {S}^{n-1}\cap K'\). By definition, \(e_1\) is in \(\mathbb {S}^{n-1}\), so that \(e_1\notin K'\), concluding the proof of the lemma. \(\square \)
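
To see the argument in action on the \(\ell _1\)-ball illustration (again a sketch for illustration only): there \(\mathbb {S}^{1}\cap K'=\{\pm (1,0),\pm (0,1)\}\), so the feasible set of (11) is the box \(|x^{(1)}|,|x^{(2)}|\le 1\), an optimizer is a corner such as \(\bar{x}=(1,1)\) with \(\Vert \bar{x}\Vert _2=\sqrt{2}>1\), and the active constraints \(u=(1,0)\), \(v=(0,1)\) are not colinear, with \(2\bar{x}=2u+2v\) giving the multipliers of Proposition A.1. A crude grid search confirms this:

```python
import numpy as np

# S^1 intersected with K' for the l1-ball example: the four contact points.
Y = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)

# Grid search for problem (11): maximize ||x||_2 subject to x^T y <= 1 for all y above.
g = np.linspace(-3.0, 3.0, 601)
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
feasible = X[np.all(X @ Y.T <= 1 + 1e-9, axis=1)]    # the box |x1|, |x2| <= 1
xbar = feasible[np.argmax(np.linalg.norm(feasible, axis=1))]

print(xbar, np.linalg.norm(xbar))   # a corner (+-1, +-1), with norm ~ sqrt(2) > 1
# At xbar = (1, 1): active contacts u = (1, 0), v = (0, 1) (not colinear), and
# 2*xbar = 2*u + 2*v, i.e. the Fritz John condition with lambda_0 = 1, lambda_1 = lambda_2 = 2.
```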

Cite this article

Berger, G.O., Absil, P.-A., Jungers, R.M. et al. On the Quality of First-Order Approximation of Functions with Hölder Continuous Gradient. J Optim Theory Appl 185, 17–33 (2020). https://doi.org/10.1007/s10957-020-01632-x
