Abstract
Consider the problem of minimizing a convex differentiable function on the probability simplex, spectrahedron, or set of quantum density matrices. We prove that the exponentiated gradient method with Armijo line search always converges to the optimum, if the sequence of the iterates possesses a strictly positive limit point (element-wise for the vector case, and with respect to the Löwner partial ordering for the matrix case). To the best of our knowledge, this is the first convergence result for a mirror descent-type method that only requires differentiability. The proof exploits self-concordant likeness of the log-partition function, which is of independent interest.
Notes
Here, we exclude the very standard projected gradient method.
For any element-wise strictly positive vector \(v := ( v_i )_{1 \le i \le d}\), the Burg entropy is defined as \(b ( v ) := - \sum _{i = 1}^d \log v_i\).
References
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Hohage, T., Werner, F.: Inverse problems with Poisson data: statistical regularization theory, applications and algorithms. Inverse Probl. 32, 093001 (2016)
Koltchinskii, V.: von Neumann entropy penalization and low-rank matrix estimation. Ann. Stat. 39(6), 2936–2973 (2011)
Paris, M., Řeháček, J. (eds.): Quantum State Estimation. Springer, Berlin (2004)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, Chichester (1983)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003)
Auslender, A., Teboulle, M.: Interior gradient and epsilon-subgradient descent methods for constrained convex minimization. Math. Oper. Res. 29(1), 1–26 (2004)
Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)
Arora, S., Hazan, E., Kale, S.: The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 8, 121–164 (2012)
Kivinen, J., Warmuth, M.K.: Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132, 1–63 (1997)
Helmbold, D.P., Schapire, R.E., Singer, Y., Warmuth, M.K.: On-line portfolio selection using multiplicative updates. Math. Finance 8(4), 325–347 (1998)
Tsuda, K., Rätsch, G., Warmuth, M.K.: Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6, 995–1018 (2005)
Lu, H., Freund, R.M., Nesterov, Y.: Relatively-smooth convex optimization by first-order methods, and applications. arXiv:1610.05708v1 (2016)
Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.L.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822 (2008)
Doljansky, M., Teboulle, M.: An interior proximal algorithm and the exponential multiplier method for semidefinite programming. SIAM J. Optim. 9(1), 1–13 (1998)
Bertsekas, D.P.: On the Goldstein–Levitin–Polyak gradient projection method. IEEE Trans. Autom. Control AC–21(2), 174–184 (1976)
Gafni, E.M., Bertsekas, D.P.: Convergence of a Gradient Projection Method. LIDS-P-1201, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge (1982)
Salzo, S.: The variable metric forward-backward splitting algorithms under mild differentiability assumptions. SIAM J. Optim. 27(4), 2153–2181 (2017)
Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)
Blume-Kohout, R.: Hedged maximum likelihood quantum state estimation. Phys. Rev. Lett. 105, 200504 (2010)
Decarreau, A., Hilhorst, D., Lemaréchal, C., Navaza, J.: Dual methods in entropy maximization. Application to some problems in crystallography. SIAM J. Optim. 2(2), 173–197 (1992)
Hiai, F., Ohya, M., Tsukada, M.: Sufficiency, KMS condition and relative entropy in von Neumann algebras. Pac. J. Math. 96(1), 99–109 (1981)
Bertsekas, D.P.: Nonlinear Programming, 3rd edn. Athena Scientific, Belmont (2016)
Bach, F.: Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)
Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014)
Tran-Dinh, Q., Li, Y.H., Cevher, V.: Composite convex minimization involving self-concordant-like cost functions. In: Modelling, Computation and Optimization in Information Systems and Management Sciences, pp. 155–168. Springer, Cham (2015)
Ohya, M., Petz, D.: Quantum Entropy and Its Use. Springer, Berlin (1993)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Hradil, Z.: Quantum-state estimation. Phys. Rev. A 55(3), R1561 (1997)
Byrne, C., Censor, Y.: Proximity function minimization using multiple Bregman projections, with application to split feasibility and Kullback–Leibler distance minimization. Ann. Oper. Res. 105, 77–98 (2001)
MacLean, L.C., Thorp, E.O., Ziemba, W.T. (eds.): The Kelly Capital Growth Investment Criterion. World Scientific, Singapore (2012)
Odor, G., Li, Y.H., Yurtsever, A., Hsieh, Y.P., El Halabi, M., Tran-Dinh, Q., Cevher, V.: Frank-Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6230–6234 (2016)
Vardi, Y., Shepp, L.A., Kaufman, L.: A statistical model for positron emission tomography. J. Am. Stat. Assoc. 80(389), 8–20 (1985)
Acknowledgements
We thank Ya-Ping Hsieh for his comments. This work was supported by SNF 200021-146750 and ERC project time-data 725594.
Appendices
Appendix A Inapplicability of Existing Convergence Guarantees to Quantum State Tomography
Quantum state tomography is the task of estimating the state of a quantum system, which is essential to calibrating quantum computation devices [4, 30]. Numerically, it corresponds to solving (P) with the objective function
\[ f_{\text {QST}} ( \rho ) := - \sum _{i = 1}^n \log {{\mathrm{\mathrm {Tr}}}} ( M_i \rho ), \]
where \(M_i\) are positive semi-definite matrices given by the experimental data.
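The behavior discussed in this appendix can be seen numerically. The sketch below is illustrative only: it assumes the standard negative log-likelihood form \(f_{\text{QST}} ( \rho ) = - \sum_i \log {\mathrm{Tr}} ( M_i \rho )\) and mirrors the two-summand diagonal example used in the proof of Proposition A.1; the gradient norm blows up as \(\rho \) approaches the boundary of the set of density matrices, so no Lipschitz constant can hold globally.

```python
import numpy as np

def f_qst(rho, Ms):
    # Negative log-likelihood objective: f(rho) = -sum_i log Tr(M_i rho).
    return -sum(np.log(np.trace(M @ rho).real) for M in Ms)

def grad_f_qst(rho, Ms):
    # Gradient of -log Tr(M rho) is -M / Tr(M rho); sum over the summands.
    return -sum(M / np.trace(M @ rho).real for M in Ms)

# Two-summand example from the proof of Proposition A.1:
# M1 = e1 (x) e1 and M2 = e2 (x) e2, restricted to diagonal density matrices.
M1 = np.diag([1.0, 0.0])
M2 = np.diag([0.0, 1.0])

# As rho approaches the boundary of the simplex (x -> 0), the gradient
# norm diverges like 1/x, so neither f nor its gradient is Lipschitz.
for x in [0.5, 1e-2, 1e-6]:
    rho = np.diag([x, 1.0 - x])
    print(x, np.linalg.norm(grad_f_qst(rho, [M1, M2])))
```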
The following proposition shows that existing convergence guarantees for the EG method do not apply to quantum state tomography.
Proposition A.1
The function \(f_{\text {QST}}\) is not Lipschitz, its gradient is not Lipschitz, and it is not smooth relative to the negative von Neumann entropy.
Proof
Consider the two-dimensional case, where \(\rho = ( \rho _{i, j} )_{1 \le i, j \le 2} \in {\mathbb {C}}^{2 \times 2}\). Define \(e_1 := ( 1, 0 )\) and \(e_2 := ( 0, 1 )\). Suppose that there are only two summands, with \(M_1 = e_1 \otimes e_1\) and \(M_2 = e_2 \otimes e_2\). Then, we have \(f ( \rho ) = - \log \rho _{1,1} - \log \rho _{2,2}\). It suffices to disprove all properties for this specific f on the set of diagonal density matrices. Hence, we will focus on the function \(g ( x, y ) := - \log x - \log y\), defined for any \(x, y > 0\) such that \(x + y = 1\).
As either x or y can be arbitrarily close to zero, neither g nor its gradient can be Lipschitz continuous, due to the logarithmic terms. Define the entropy function
\[ h ( x, y ) := - x \log x - y \log y, \]
with the convention \(0 \log 0 = 0\). Then, g is L-smooth relative to the negative entropy \(- h\), if and only if \(- L h - g\) is convex. It suffices to check the positive semi-definiteness of the Hessian of \(- L h - g\). A necessary condition for the Hessian to be positive semi-definite is that
\[ \frac{L}{x} - \frac{1}{x^2} \ge 0 \]
for all \(x \in ] 0, 1 [\), which cannot hold for \(x < 1 / L\), for any fixed \(L > 0\). \(\square \)
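A quick numerical illustration of this failure of relative smoothness, under the reconstruction \(h ( x ) = - x \log x - ( 1 - x ) \log ( 1 - x )\) and \(g ( x ) = - \log x - \log ( 1 - x )\) along the simplex line \(y = 1 - x\): the second derivative of \(\psi := - L h - g\) is positive in the interior but turns negative near the boundary, because the \(- 1 / x^2\) term eventually dominates \(L / x\) for any fixed L.

```python
import numpy as np

def psi_second(x, L):
    # Second derivative of psi(x) = -L*h(x) - g(x) along the simplex
    # {(x, 1 - x)}, with h(x) = -x log x - (1-x) log(1-x) (entropy) and
    # g(x) = -log x - log(1-x):
    #   psi''(x) = L*(1/x + 1/(1-x)) - 1/x**2 - 1/(1-x)**2.
    return L * (1 / x + 1 / (1 - x)) - 1 / x**2 - 1 / (1 - x)**2

L = 10.0
# Positive near the center, negative near the boundary (x < 1/L),
# so -L*h - g cannot be convex for any fixed L.
print(psi_second(0.5, L))   # > 0
print(psi_second(0.01, L))  # < 0
```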
We note that similar objective functions can be found in positive linear inverse problems, positron emission tomography, portfolio selection, and Poisson phase retrieval [31,32,33,34].
Appendix B Technical Lemmas Necessary for Sect. 3
Define
\[ \rho ( \alpha ) := C_\rho \exp \left( \log \rho - \alpha \nabla f ( \rho ) \right) \]
for every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha \ge 0\), where \(C_\rho \) is the positive real number normalizing the trace of \(\rho ( \alpha )\).
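For concreteness, a minimal numpy sketch of one step of this mapping, assuming the standard form of the matrix EG iterate \(\rho ( \alpha ) \propto \exp ( \log \rho - \alpha \nabla f ( \rho ) )\), with the matrix logarithm and exponential computed by eigendecomposition of Hermitian matrices:

```python
import numpy as np

def eg_update(rho, grad, alpha):
    # One matrix exponentiated gradient step:
    #   rho(alpha) = exp(log rho - alpha*grad) / Tr exp(log rho - alpha*grad).
    w, V = np.linalg.eigh(rho)            # rho must be non-singular
    log_rho = V @ np.diag(np.log(w)) @ V.conj().T
    A = log_rho - alpha * grad
    w2, V2 = np.linalg.eigh(A)
    # Shift by the top eigenvalue before exponentiating, for stability;
    # the shift cancels in the trace normalization.
    e = np.exp(w2 - w2.max())
    X = V2 @ np.diag(e) @ V2.conj().T
    return X / np.trace(X).real

rho = np.eye(2) / 2           # maximally mixed state
grad = np.diag([1.0, -1.0])   # an arbitrary Hermitian "gradient"
rho_new = eg_update(rho, grad, alpha=0.5)
# The iterate stays in the set of density matrices: unit trace, PSD;
# mass moves toward the eigendirection with the smaller gradient eigenvalue.
```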
Lemma B.1
For every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha > 0\), it holds that
Proof
The equivalent formulation of the EG method, (2), implies that
\(\square \)
Lemma B.2
Let \(\rho \in {\mathcal {D}}\) be non-singular. If \(\rho \) is a minimizer of f on \({\mathcal {D}}\), then \(\rho ( \alpha ) = \rho \) for all \(\alpha \ge 0\). If \(\rho ( \alpha ) = \rho \) for some \(\alpha > 0\), then \(\rho \) is a minimizer of f on \({\mathcal {D}}\).
Proof
The optimality condition says that \(\rho \in {{\mathrm{\mathrm {int}}}}{\mathcal {D}}\) is a minimizer of f on \({\mathcal {D}}\), if and only if
For any \(\alpha > 0\), we can equivalently write
where h denotes the negative von Neumann entropy function, i.e.,
Note that the quantum relative entropy H is the Bregman divergence induced by the negative von Neumann entropy. It is easily checked, again by the optimality condition, that (9) is equivalent to
\(\square \)
For every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha \ge 0\), define
Let \(G = \sum _j \lambda _j P_j\) be the spectral decomposition of G. Define \(\eta _\alpha \) as a random variable satisfying
it is easily checked that \({\mathsf {P}} \left( \eta _\alpha = \lambda _j \right) > 0\) for all j, and the probabilities sum to one.
Lemma B.3
For any \(\alpha \in {\mathbb {R}}\), it holds that
Proof
Note that
for any \(n \in {\mathbb {N}}\). Define \(\sigma _\alpha := \exp ( H_\alpha ) / {{\mathrm{\mathrm {Tr}}}}\exp ( H_\alpha )\). A direct calculation gives
The lemma follows. \(\square \)
Since \(\eta _\alpha \) is a bounded random variable, it follows that \(\varphi ''\) is bounded from above.
Corollary B.1
It holds that \(\varphi '' ( \alpha ) \le ( 1 / 4 ) \varDelta ^2\), where
Proof
Recall that the variance of a random variable taking values in [a, b] is bounded from above by \(( b - a )^2 / 4\).
\(\square \)
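The variance bound invoked here (Popoviciu's inequality) can be sanity-checked numerically; this sketch is purely illustrative, sampling random discrete distributions supported on [a, b] and verifying the bound, with equality attained by the two-point distribution placing mass 1/2 on each endpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Popoviciu's inequality: a random variable taking values in [a, b]
# has variance at most (b - a)**2 / 4.
a, b = -1.0, 3.0
for _ in range(100):
    vals = rng.uniform(a, b, size=10)
    probs = rng.dirichlet(np.ones(10))
    mean = probs @ vals
    var = probs @ (vals - mean) ** 2
    assert var <= (b - a) ** 2 / 4 + 1e-12

# Equality case: mass 1/2 at each of a and b.
eq_var = 0.5 * (a - (a + b) / 2) ** 2 + 0.5 * (b - (a + b) / 2) ** 2
print(eq_var, (b - a) ** 2 / 4)
```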
Appendix C Proof of Lemma 3.3
Recall the random variable \(\eta _\alpha \) defined in (10). Suppose that \(\varphi '' ( \alpha ) = 0\) for some \(\alpha \in [ 0, + \infty [\). Then, \(\eta _\alpha \) is almost surely constant; since each \(\lambda _j\) occurs with strictly positive probability, this implies that \(\varDelta = 0\), a contradiction. Therefore, we have \(\varphi '' ( \alpha ) > 0\) for all \(\alpha \in [ 0, + \infty [\).
We prove a general result. Let \(\psi : {\mathbb {R}} \rightarrow {\mathbb {R}}\) be a \(\mu \)-self-concordant-like function. Suppose that \(\psi '' ( t ) > 0\) for all t. Consider the function
We write, by the self-concordant likeness of \(\psi \), that
Then, for any \(t_1, t_2 \in {\mathbb {R}}\), we have
that is,
Applying the Newton–Leibniz formula, we obtain
similarly, we obtain
Applying the Newton–Leibniz formula again, we obtain
similarly, we obtain
Lemma 3.3 follows from setting \(\psi = \varphi \), \(\mu = \varDelta \), \(t_2 = 0\), and \(t_1 = \alpha \).
Appendix D Proof of Proposition 3.3
Suppose that . We write
for large enough \(k \in {\mathcal {K}}\), where the first inequality follows from the Armijo line search rule, the second follows from Lemma B.1, and the third follows from Corollary 3.1. Taking limits, we obtain that \(H ( \rho _k ( {\bar{\alpha }} ), \rho _k ) \rightarrow 0\) as \(k \rightarrow \infty \) in \({\mathcal {K}}\).
Suppose that . Let \(( \alpha _k )_{k \in {\mathcal {K}}'}\), \({\mathcal {K}}' \subseteq {\mathcal {K}}\), be a subsequence converging to zero. According to the Armijo rule, we have
for large enough \(k \in {\mathcal {K}}\). The mean value theorem says that the left-hand side equals \({\langle { \nabla f ( \sigma ), \rho _k ( r^{-1} \alpha _k ) - \rho _k }\rangle }\) for some \(\sigma \) in the line segment joining \(\rho _k ( r^{-1} \alpha _k )\) and \(\rho _k\). Then, (11) can be equivalently written as
By Pinsker’s inequality and Hölder’s inequality, we obtain
for large enough \(k \in {\mathcal {K}}\). Note that \(r^{-1} \alpha _k \le {\bar{\alpha }}\) for large enough \(k \in {\mathcal {K}}\). By Lemma B.1 and Corollary 3.2, we obtain
for large enough \(k \in {\mathcal {K}}\). Since \(H ( \rho _k ( r^{-1} \alpha _k ), \rho _k )\) is strictly positive for all \(k \in {\mathcal {K}}'\) by assumption, (12), (13), and (14) imply
Taking limits, we obtain that \(H ( \rho _k ( {\bar{\alpha }} ), \rho _k ) \rightarrow 0\) as \(k \rightarrow \infty \) in \({\mathcal {K}}'\).
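Pinsker's inequality, used in the proof above, can also be checked numerically. The sketch below is illustrative only: for probability vectors p and q, it verifies \(\Vert p - q \Vert _1^2 \le 2 \, \mathrm{KL} ( p \Vert q )\) on random samples.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    # Kullback-Leibler divergence (relative entropy) of p from q.
    return float(np.sum(p * np.log(p / q)))

# Pinsker's inequality: ||p - q||_1^2 <= 2 * KL(p || q).
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    assert np.sum(np.abs(p - q)) ** 2 <= 2 * kl(p, q) + 1e-12
print("Pinsker's inequality holds on all samples")
```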
Cite this article
Li, YH., Cevher, V. Convergence of the Exponentiated Gradient Method with Armijo Line Search. J Optim Theory Appl 181, 588–607 (2019). https://doi.org/10.1007/s10957-018-1428-9
Keywords
- Exponentiated gradient method
- Armijo line search
- Self-concordant likeness
- Peierls–Bogoliubov inequality