Convergence of the Exponentiated Gradient Method with Armijo Line Search

Journal of Optimization Theory and Applications

Abstract

Consider the problem of minimizing a convex differentiable function on the probability simplex, spectrahedron, or set of quantum density matrices. We prove that the exponentiated gradient method with Armijo line search always converges to the optimum, if the sequence of the iterates possesses a strictly positive limit point (element-wise for the vector case, and with respect to the Löwner partial ordering for the matrix case). To the best of our knowledge, this is the first convergence result for a mirror descent-type method that only requires differentiability. The proof exploits self-concordant likeness of the log-partition function, which is of independent interest.

Notes

  1. Here, we exclude the very standard projected gradient method.

  2. For any element-wise strictly positive vector \(v := ( v_i )_{1 \le i \le d}\), the Burg entropy is defined as \(b ( v ) := - \sum _{i = 1}^d \log v_i\).

References

  1. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  2. Hohage, T., Werner, F.: Inverse problems with Poisson data: statistical regularization theory, applications and algorithms. Inverse Probl. 32, 093001 (2016)

  3. Koltchinskii, V.: von Neumann entropy penalization and low-rank matrix estimation. Ann. Stat. 39(6), 2936–2973 (2011)

  4. Paris, M., Řeháček, J. (eds.): Quantum State Estimation. Springer, Berlin (2004)

  5. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, Chichester (1983)

  6. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003)

  7. Auslender, A., Teboulle, M.: Interior gradient and epsilon-subgradient descent methods for constrained convex minimization. Math. Oper. Res. 29(1), 1–26 (2004)

  8. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)

  9. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)

  10. Arora, S., Hazan, E., Kale, S.: The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 8, 121–164 (2012)

  11. Kivinen, J., Warmuth, M.K.: Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132, 1–63 (1997)

  12. Helmbold, D.P., Schapire, R.E., Singer, Y., Warmuth, M.K.: On-line portfolio selection using multiplicative updates. Math. Finance 8(4), 325–347 (1998)

  13. Tsuda, K., Rätsch, G., Warmuth, M.K.: Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6, 995–1018 (2005)

  14. Lu, H., Freund, R.M., Nesterov, Y.: Relatively-smooth convex optimization by first-order methods, and applications. arXiv:1610.05708v1 (2016)

  15. Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.L.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822 (2008)

  16. Doljansky, M., Teboulle, M.: An interior proximal algorithm and the exponential multiplier method for semidefinite programming. SIAM J. Optim. 9(1), 1–13 (1998)

  17. Bertsekas, D.P.: On the Goldstein–Levitin–Polyak gradient projection method. IEEE Trans. Autom. Control AC–21(2), 174–184 (1976)

  18. Gafni, E.M., Bertsekas, D.P.: Convergence of a Gradient Projection Method. LIDS-P-1201, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge (1982)

  19. Salzo, S.: The variable metric forward-backward splitting algorithms under mild differentiability assumptions. SIAM J. Optim. 27(4), 2153–2181 (2017)

  20. Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)

  21. Blume-Kohout, R.: Hedged maximum likelihood quantum state estimation. Phys. Rev. Lett. 105, 200504 (2010)

  22. Decarreau, A., Hilhorst, D., Lemaréchal, C., Navaza, J.: Dual methods in entropy maximization. Application to some problems in crystallography. SIAM J. Optim. 2(2), 173–197 (1992)

  23. Hiai, F., Ohya, M., Tsukada, M.: Sufficiency, KMS condition and relative entropy in von Neumann algebras. Pac. J. Math. 96(1), 99–109 (1981)

  24. Bertsekas, D.P.: Nonlinear Programming, vol. 3. Athena Sci, Belmont (2016)

  25. Bach, F.: Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)

  26. Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014)

  27. Tran-Dinh, Q., Li, Y.H., Cevher, V.: Composite convex minimization involving self-concordant-like cost functions. In: Modelling, Computation and Optimization in Information Systems and Management Sciences, pp. 155–168. Springer, Cham (2015)

  28. Ohya, M., Petz, D.: Quantum Entropy and Its Use. Springer, Berlin (1993)

  29. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

  30. Hradil, Z.: Quantum-state estimation. Phys. Rev. A 55(3), R1561 (1997)

  31. Byrne, C., Censor, Y.: Proximity function minimization using multiple Bregman projections, with application to split feasibility and Kullback–Leibler distance minimization. Ann. Oper. Res. 105, 77–98 (2001)

  32. MacLean, L.C., Thorp, E.O., Ziemba, W.T. (eds.): The Kelly Capital Growth Investment Criterion. World Scientific, Singapore (2012)

  33. Odor, G., Li, Y.H., Yurtsever, A., Hsieh, Y.P., El Halabi, M., Tran-Dinh, Q., Cevher, V.: Frank-Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6230–6234 (2016)

  34. Vardi, Y., Shepp, L.A., Kaufman, L.: A statistical model for positron emission tomography. J. Am. Stat. Assoc. 80(389), 8–20 (1985)

Acknowledgements

We thank Ya-Ping Hsieh for his comments. This work was supported by SNF 200021-146750 and ERC project time-data 725594.

Author information

Corresponding author

Correspondence to Yen-Huan Li.

Appendices

Appendix A Inapplicability of Existing Convergence Guarantees to Quantum State Tomography

Quantum state tomography is the task of estimating the state of a quantum system, which is essential to calibrating quantum computation devices [4, 30]. Numerically, it corresponds to solving (P) with the objective function

$$\begin{aligned} f_{\text {QST}} ( \rho ) := - \sum _{i = 1}^n \log {{\mathrm{\mathrm {Tr}}}}( M_i \rho ) , \end{aligned}$$

where \(M_i\) are positive semi-definite matrices given by the experimental data.
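The following minimal numerical sketch (not part of the paper) evaluates \(f_{\text {QST}}\) and its gradient for randomly generated measurement matrices; the dimension, the number of summands, and the random \(M_i\) are assumptions made purely for illustration.

```python
import numpy as np

def random_psd(d, rng):
    """A random positive semi-definite matrix, standing in for a measurement operator M_i."""
    B = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return B @ B.conj().T

def f_qst(rho, Ms):
    """f_QST(rho) = -sum_i log Tr(M_i rho)."""
    return -sum(np.log(np.trace(M @ rho).real) for M in Ms)

def grad_f_qst(rho, Ms):
    """Gradient of f_QST: -sum_i M_i / Tr(M_i rho) (a Hermitian matrix)."""
    return -sum(M / np.trace(M @ rho).real for M in Ms)

rng = np.random.default_rng(0)
d, n = 4, 10                                   # illustrative sizes
Ms = [random_psd(d, rng) for _ in range(n)]
rho = np.eye(d) / d                            # the maximally mixed state, strictly positive
print(f_qst(rho, Ms), np.linalg.norm(grad_f_qst(rho, Ms)))
```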

The following proposition shows that existing convergence guarantees for the EG method do not apply to quantum state tomography.

Proposition A.1

The function \(f_{\text {QST}}\) is not Lipschitz, its gradient is not Lipschitz, and it is not smooth relative to the negative von Neumann entropy.

Proof

Consider the two-dimensional case, where \(\rho = ( \rho _{i, j} )_{1 \le i, j \le 2} \in {\mathbb {C}}^{2 \times 2}\). Define \(e_1 := ( 1, 0 )\) and \(e_2 := ( 0, 1 )\). Suppose that there are only two summands, with \(M_1 = e_1 \otimes e_1\) and \(M_2 = e_2 \otimes e_2\). Then, we have \(f ( \rho ) = - \log \rho _{1,1} - \log \rho _{2,2}\). It suffices to disprove all properties for this specific f on the set of diagonal density matrices. Hence, we will focus on the function \(g ( x, y ) := - \log x - \log y\), defined for any \(x, y > 0\) such that \(x + y = 1\).

As either x or y can be arbitrarily close to zero, neither g nor its gradient can be Lipschitz continuous, because of the logarithmic terms. Define the entropy function

$$\begin{aligned} h(x, y) := - x \log x - y \log y + x + y, \end{aligned}$$

with the convention \(0 \log 0 = 0\). Then, g is L-smooth relative to the negative entropy \(- h\), if and only if \(- L h - g\) is convex. It suffices to check the positive semi-definiteness of the Hessian of \(- L h - g\). A necessary condition for the Hessian to be positive semi-definite is that

$$\begin{aligned} - L \frac{\partial ^2 h}{\partial x^2} ( x, y ) - \frac{\partial ^2 g}{\partial x^2} ( x, y ) = \frac{L}{x} - \frac{1}{x^2} \ge 0 , \end{aligned}$$

for all \(x \in ] 0, 1 [\), which cannot hold for \(x < ( 1 / L )\), for any fixed \(L > 0\). \(\square \)
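The failure of relative smoothness can also be checked numerically: for any fixed L, the necessary curvature condition \(L / x - 1 / x^2 \ge 0\) is violated as soon as \(x < 1 / L\). The snippet below is an illustration only.

```python
def curvature_gap(x, L):
    """Diagonal second derivative of -L*h - g at (x, 1 - x): L/x - 1/x**2.
    Nonnegativity of this quantity is necessary for g to be L-smooth relative to -h."""
    return L / x - 1.0 / x ** 2

for L in (1.0, 10.0, 100.0):
    x = 0.5 / L                          # any x < 1/L violates the condition
    print(L, x, curvature_gap(x, L))     # negative for every L, as claimed in the proof
```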

We note that similar objective functions can be found in positive linear inverse problems, positron emission tomography, portfolio selection, and Poisson phase retrieval [31,32,33,34].

Appendix B Technical Lemmas Necessary for Sect. 3

Define

$$\begin{aligned} \rho ( \alpha ) := C_\rho ^{-1} \exp \left[ \log ( \rho ) - \alpha \nabla f ( \rho ) \right] , \end{aligned}$$

for every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha \ge 0\), where \(C_\rho \) is the positive real number normalizing the trace of \(\rho ( \alpha )\).
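For concreteness, here is a small Python sketch of the update \(\rho ( \alpha )\), computed through an eigendecomposition of \(\rho \) and a matrix exponential; it assumes a non-singular Hermitian density matrix and a Hermitian gradient, and it is an illustration rather than the paper's own implementation.

```python
import numpy as np
from scipy.linalg import expm

def eg_update(rho, grad, alpha):
    """Compute rho(alpha) = C^{-1} exp(log(rho) - alpha * grad), normalized to unit trace.
    Assumes rho is a non-singular density matrix and grad is Hermitian."""
    w, V = np.linalg.eigh(rho)
    log_rho = V @ np.diag(np.log(w)) @ V.conj().T
    H = log_rho - alpha * grad
    # Shift by the largest eigenvalue before exponentiating, for numerical stability;
    # the shift cancels after normalization.
    shift = np.max(np.linalg.eigvalsh((H + H.conj().T) / 2))
    E = expm(H - shift * np.eye(rho.shape[0]))
    return E / np.trace(E).real
```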

Lemma B.1

For every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha > 0\), it holds that

$$\begin{aligned} {\langle { \nabla f ( \rho ), \rho ( \alpha ) - \rho }\rangle } \le - \frac{H ( \rho ( \alpha ), \rho )}{ \alpha } . \end{aligned}$$

Proof

The equivalent formulation of the EG method, (2), implies that

$$\begin{aligned} \alpha {\langle { \nabla f ( \rho ), \rho ( \alpha ) - \rho }\rangle } + H ( \rho ( \alpha ), \rho ) \le \alpha {\langle { \nabla f ( \rho ), \rho - \rho }\rangle } + H ( \rho , \rho ) = 0 . \end{aligned}$$

\(\square \)

Lemma B.2

Let \(\rho \in {\mathcal {D}}\) be non-singular. If \(\rho \) is a minimizer of f on \({\mathcal {D}}\), then \(\rho ( \alpha ) = \rho \) for all \(\alpha \ge 0\). If \(\rho ( \alpha ) = \rho \) for some \(\alpha > 0\), then \(\rho \) is a minimizer of f on \({\mathcal {D}}\).

Proof

The optimality condition says that \(\rho \in {{\mathrm{\mathrm {int}}}}{\mathcal {D}}\) is a minimizer of f on \({\mathcal {D}}\), if and only if

$$\begin{aligned} {\langle { \nabla f ( \rho ), \sigma - \rho }\rangle } \ge 0 , \quad \forall \sigma \in {\mathcal {D}} . \end{aligned}$$

For any \(\alpha > 0\), we can equivalently write

$$\begin{aligned} {\langle { \alpha \nabla f ( \rho ) + \left[ \nabla h ( \rho ) - \nabla h ( \rho ) \right] , \sigma - \rho }\rangle } \ge 0 , \quad \forall \sigma \in {\mathcal {D}} , \end{aligned}$$
(9)

where h denotes the negative von Neumann entropy function, i.e.,

$$\begin{aligned} h ( \rho ) := {{\mathrm{\mathrm {Tr}}}}( \rho \log \rho ) - {{\mathrm{\mathrm {Tr}}}}\rho . \end{aligned}$$

Note that the quantum relative entropy H is the Bregman divergence induced by the negative von Neumann entropy. It is easily checked, again by the optimality condition, that (9) holds if and only if \(\rho \) minimizes the function \(\sigma \mapsto \alpha {\langle { \nabla f ( \rho ), \sigma - \rho }\rangle } + H ( \sigma , \rho )\) on \({\mathcal {D}}\); by the equivalent formulation (2) of the EG method, the unique minimizer of this function is \(\rho ( \alpha )\). Therefore, (9) is equivalent to \(\rho ( \alpha ) = \rho \), and both claims follow.

\(\square \)
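As a quick illustration of Lemma B.2 (with an objective chosen only for this sketch, not taken from the paper), the convex function \(f ( \rho ) = - \log \det \rho \) is minimized over the density matrices at the non-singular point \(\rho ^\star = I / d\), and \(\rho ^\star \) is indeed a fixed point of the update:

```python
import numpy as np
from scipy.linalg import expm

d, alpha = 3, 0.7
rho_star = np.eye(d) / d                       # minimizer of -log det(rho) over density matrices
grad = -np.linalg.inv(rho_star)                # gradient of -log det at rho_star (equals -d * I)
log_rho = np.log(1.0 / d) * np.eye(d)          # log(rho_star)
E = expm(log_rho - alpha * grad)               # unnormalized EG update
print(np.allclose(E / np.trace(E).real, rho_star))   # True: rho_star(alpha) = rho_star
```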

For every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha \ge 0\), define

$$\begin{aligned} G := - \nabla f ( \rho ), \quad H_\alpha := \log \rho + \alpha G. \end{aligned}$$

Let \(G = \sum _j \lambda _j P_j\) be the spectral decomposition of G. Define \(\eta _\alpha \) as a random variable satisfying

$$\begin{aligned} {\mathsf {P}} \left( \eta _\alpha = \lambda _j \right) = \frac{ {{\mathrm{\mathrm {Tr}}}}\left( P_j \exp ( H_\alpha ) \right) }{ {{\mathrm{\mathrm {Tr}}}}\exp ( H_\alpha ) }; \end{aligned}$$
(10)

it is easily checked that \({\mathsf {P}} \left( \eta _\alpha = \lambda _j \right) > 0\) for all j, and the probabilities sum to one.
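The distribution (10) is easy to build numerically from the spectral decomposition of G. The sketch below uses a randomly generated instance (an assumption of the example) and rank-one spectral projectors, which agree with the \(P_j\) above when the eigenvalues of G are distinct.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 4
# Hypothetical instance: a random density matrix rho and a Hermitian gradient.
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
rho = expm(A) / np.trace(expm(A))
grad = rng.standard_normal((d, d)); grad = (grad + grad.T) / 2

def eta_distribution(rho, grad, alpha):
    """Eigenvalues lambda_j of G = -grad and the probabilities P(eta_alpha = lambda_j) in (10)."""
    G = -grad
    w, V = np.linalg.eigh(rho)
    H_alpha = V @ np.diag(np.log(w)) @ V.T + alpha * G       # H_alpha = log(rho) + alpha * G
    lam, U = np.linalg.eigh(G)                               # G = sum_j lam_j u_j u_j^T
    E = expm(H_alpha)
    Z = np.trace(E).real                                     # Tr exp(H_alpha)
    probs = np.array([U[:, j] @ E @ U[:, j] for j in range(d)]) / Z
    return lam, probs

lam, probs = eta_distribution(rho, grad, alpha=0.3)
print(np.all(probs > 0), np.isclose(probs.sum(), 1.0))       # strictly positive and summing to one
```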

Lemma B.3

For any \(\alpha \in {\mathbb {R}}\), it holds that

$$\begin{aligned} \varphi ' ( \alpha ) = {\mathsf {E}}\, \eta _\alpha , \quad \varphi '' ( \alpha ) = {\mathsf {E}} \left( \eta _\alpha - {\mathsf {E}}\, \eta _\alpha \right) ^2, \quad \varphi ''' ( \alpha ) = {\mathsf {E}} \left( \eta _\alpha - {\mathsf {E}}\, \eta _\alpha \right) ^3 . \end{aligned}$$

Proof

Note that

$$\begin{aligned} {\mathsf {E}}\, \eta _\alpha ^n = \frac{{{\mathrm{\mathrm {Tr}}}}( G^n \exp ( H_\alpha ) )}{{{\mathrm{\mathrm {Tr}}}}\exp ( H_\alpha )} , \end{aligned}$$

for any \(n \in {\mathbb {N}}\). Define \(\sigma _\alpha := \exp ( H_\alpha ) / {{\mathrm{\mathrm {Tr}}}}\exp ( H_\alpha )\). A direct calculation gives

$$\begin{aligned} \varphi ' ( \alpha )&= {{\mathrm{\mathrm {Tr}}}}( G \sigma _\alpha ) , \quad \varphi '' ( \alpha ) = {{\mathrm{\mathrm {Tr}}}}( G^2 \sigma _\alpha ) - \left( {{\mathrm{\mathrm {Tr}}}}( G \sigma _\alpha ) \right) ^2 , \\ \varphi ''' ( \alpha )&= {{\mathrm{\mathrm {Tr}}}}( G^3 \sigma _\alpha ) - 3 {{\mathrm{\mathrm {Tr}}}}( G^2 \sigma _\alpha ) {{\mathrm{\mathrm {Tr}}}}( G \sigma _\alpha ) + 2 \left( {{\mathrm{\mathrm {Tr}}}}( G \sigma _\alpha ) \right) ^3 . \end{aligned}$$

The lemma follows. \(\square \)

Since \(\eta _\alpha \) is a bounded random variable, it follows that \(\varphi ''\) is bounded from above.

Corollary B.1

It holds that \(\varphi '' ( \alpha ) \le ( 1 / 4 ) \varDelta ^2\), where

$$\begin{aligned} \varDelta := \lambda _{\max } ( \nabla f ( \rho ) ) - \lambda _{\min } ( \nabla f ( \rho ) ) . \end{aligned}$$

Proof

Recall that the variance of a random variable taking values in \([ a, b ]\) is bounded from above by \(( b - a )^2 / 4\); here, \(\eta _\alpha \) takes values in \([ \lambda _{\min } ( G ), \lambda _{\max } ( G ) ]\), an interval of length \(\varDelta \).

\(\square \)
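These formulas and the bound of Corollary B.1 can be sanity-checked numerically. The sketch below compares the moment expressions with finite differences of the log-partition function \(\alpha \mapsto \log {{\mathrm{\mathrm {Tr}}}}\exp ( H_\alpha )\), whose derivatives are exactly the quantities in Lemma B.3; the random instance is an assumption of this example, not part of the proofs.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d = 5
# Hypothetical instance: a random density matrix rho and a Hermitian matrix G.
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
rho = expm(A) / np.trace(expm(A))
G = rng.standard_normal((d, d)); G = (G + G.T) / 2
w, V = np.linalg.eigh(rho)
L = V @ np.diag(np.log(w)) @ V.T                  # log(rho)

def phi(alpha):
    """Log-partition function: log Tr exp(log(rho) + alpha * G)."""
    return np.log(np.trace(expm(L + alpha * G)).real)

def moments(alpha):
    """First and second moments of eta_alpha, via Tr(G^n exp(H_alpha)) / Tr exp(H_alpha)."""
    E = expm(L + alpha * G)
    Z = np.trace(E).real
    return np.trace(G @ E).real / Z, np.trace(G @ G @ E).real / Z

alpha, h = 0.7, 1e-4
m1, m2 = moments(alpha)
fd1 = (phi(alpha + h) - phi(alpha - h)) / (2 * h)                 # ~ phi'(alpha)
fd2 = (phi(alpha + h) - 2 * phi(alpha) + phi(alpha - h)) / h**2   # ~ phi''(alpha)
delta = np.linalg.eigvalsh(G).max() - np.linalg.eigvalsh(G).min()
print(abs(fd1 - m1), abs(fd2 - (m2 - m1**2)))    # small, up to finite-difference error
print(m2 - m1**2 <= delta**2 / 4)                # the bound of Corollary B.1
```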

Appendix C Proof of Lemma 3.3

Recall the random variable \(\eta _\alpha \) defined in (10). Suppose that \(\varphi '' ( \alpha ) = 0\) for some \(\alpha \in [ 0, + \infty [\). Then, \(\eta _\alpha \) equals its expectation almost surely; since \({\mathsf {P}} \left( \eta _\alpha = \lambda _j \right) > 0\) for all j, all eigenvalues of G coincide, and hence \(\varDelta = 0\), a contradiction. Therefore, we have \(\varphi '' ( \alpha ) > 0\) for all \(\alpha \in [ 0, + \infty [\).

We prove a general result. Let \(\psi : {\mathbb {R}} \rightarrow {\mathbb {R}}\) be a \(\mu \)-self-concordant-like function. Suppose that \(\psi '' ( t ) > 0\) for all t. Consider the function

$$\begin{aligned} \chi ( t ) := \log \left( \psi '' ( t ) \right) . \end{aligned}$$

We write, by the self-concordant likeness of \(\psi \), that

$$\begin{aligned} \vert \chi ' ( t ) \vert = \frac{\vert \psi ''' ( t ) \vert }{ \psi '' ( t ) } \le \mu \, , \quad \forall t \in {\mathbb {R}} \, . \end{aligned}$$

Then, for any \(t_1, t_2 \in {\mathbb {R}}\), we have

$$\begin{aligned} \vert \chi ( t_1 ) - \chi ( t_2 ) \vert = \left| \log \left( \psi '' ( t_1 ) \right) - \log \left( \psi '' ( t_2 ) \right) \right| \le \mu \vert t_2 - t_1 \vert \, ; \end{aligned}$$

that is,

$$\begin{aligned} \mathrm {e}^{- \mu \vert t_2 - t_1 \vert } \psi '' ( t_2 ) \le \psi '' ( t_1 ) \le \mathrm {e}^{ \mu \vert t_2 - t_1 \vert } \psi '' ( t_2 ) \, . \end{aligned}$$

Applying the Newton–Leibniz formula, we obtain

$$\begin{aligned} \psi ' ( t_2 ) - \psi ' ( t_1 )&= \int _{0}^1 \psi '' ( t_1 + \tau ( t_2 - t_1 ) ) ( t_2 - t_1 ) \, \mathrm {d}\tau \\&\le \int _0^1 \mathrm {e}^{\mu \tau \vert t_2 - t_1 \vert } \psi '' ( t_1 ) ( t_2 - t_1 ) \, \mathrm {d}\tau \\&= \left( \frac{\mathrm {e}^{\mu \vert t_2 - t_1 \vert } - 1}{ \mu \vert t_2 - t_1 \vert } \right) \psi '' ( t_1 ) ( t_2 - t_1 ) \, ; \end{aligned}$$

similarly, we obtain

$$\begin{aligned} \psi ' ( t_2 ) - \psi ' ( t_1 ) \ge - \left( \frac{\mathrm {e}^{- \mu \vert t_2 - t_1 \vert } - 1}{ \mu \vert t_2 - t_1 \vert } \right) \psi '' ( t_1 ) ( t_2 - t_1 ) \, . \end{aligned}$$

Applying the Newton–Leibniz formula again, we obtain

$$\begin{aligned} \psi ( t_2 ) - \psi ( t_1 )&= \int _0^1 \psi ' ( t_1 + \tau ( t_2 - t_1 ) ) ( t_2 - t_1 ) \, \mathrm {d}\tau \\&= \psi ' ( t_1 ) ( t_2 - t_1 ) + \int _0^1 \left( \psi ' ( t_1 + \tau ( t_2 - t_1 ) ) - \psi ' ( t_1 ) \right) ( t_2 - t_1 ) \, \mathrm {d}\tau \\&\le \psi ' ( t_1 ) ( t_2 - t_1 ) + \int _0^1 \left( \frac{\mathrm {e}^{\mu \tau \vert t_2 - t_1 \vert } - 1 }{ \mu \tau \vert t_2 - t_1 \vert } \right) \psi '' ( t_1 ) \tau ( t_2 - t_1 )^2 \, \mathrm {d}\tau \\&= \psi ' ( t_1 ) ( t_2 - t_1 ) + \frac{ \left( \mathrm {e}^{\mu \vert t_2 - t_1 \vert } - \mu \vert t_2 - t_1 \vert - 1 \right) }{\mu ^2} \psi '' ( t_1 ) \, ; \end{aligned}$$

similarly, we obtain

$$\begin{aligned} \psi ( t_2 ) - \psi ( t_1 ) \ge \psi ' ( t_1 ) ( t_2 - t_1 ) + \frac{ \left( \mathrm {e}^{- \mu \vert t_2 - t_1 \vert } + \mu \vert t_2 - t_1 \vert - 1 \right) }{\mu ^2} \psi '' ( t_1 ) \, . \end{aligned}$$

Lemma 3.3 follows from setting \(\psi = \varphi \), \(\mu = \varDelta \), \(t_2 = 0\), and \(t_1 = \alpha \).
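These bounds can be checked numerically on a concrete \(\mu \)-self-concordant-like function. The log-sum-exp \(\psi ( t ) = \log ( \mathrm {e}^{a t} + \mathrm {e}^{b t} )\), which satisfies \(\vert \psi ''' \vert \le \vert b - a \vert \, \psi ''\), is used below purely as an example; it is not the \(\varphi \) of the paper.

```python
import numpy as np

a, b = -1.0, 2.0
mu = abs(b - a)                      # self-concordant-like parameter of psi below

def psi(t):                          # psi(t) = log(exp(a*t) + exp(b*t))
    return np.logaddexp(a * t, b * t)

def dpsi(t):                         # psi'(t) = b + (a - b) * p, with p = sigmoid((a - b) * t)
    p = 1.0 / (1.0 + np.exp((b - a) * t))
    return b + (a - b) * p

def ddpsi(t):                        # psi''(t) = (a - b)^2 * p * (1 - p) > 0
    p = 1.0 / (1.0 + np.exp((b - a) * t))
    return (a - b) ** 2 * p * (1 - p)

rng = np.random.default_rng(2)
for _ in range(5):
    t1, t2 = rng.uniform(-3, 3, size=2)
    s = mu * abs(t2 - t1)
    upper = dpsi(t1) * (t2 - t1) + (np.exp(s) - s - 1) / mu ** 2 * ddpsi(t1)
    lower = dpsi(t1) * (t2 - t1) + (np.exp(-s) + s - 1) / mu ** 2 * ddpsi(t1)
    gap = psi(t2) - psi(t1)
    assert lower - 1e-12 <= gap <= upper + 1e-12   # the two function-value bounds above
```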

Appendix D Proof of Proposition 3.3

Suppose first that \(\inf _{k \in {\mathcal {K}}} \alpha _k =: {\underline{\alpha }} > 0\). We write

$$\begin{aligned} f ( \rho _k ) - f ( \rho _{k + 1} )&\ge - \tau {\langle { \nabla f ( \rho _{k} ), \rho _{k + 1} - \rho _{k} }\rangle } \\&\ge \tau \alpha _k^{-1} H ( \rho _{k + 1}, \rho _k ) \\&= \tau \alpha _k \alpha _k^{-2} H ( \rho _k ( \alpha _k ), \rho _k ) \\&\ge \tau {\underline{\alpha }} \kappa H ( \rho _k ( {\bar{\alpha }} ), \rho _k ) \\&\ge 0 , \end{aligned}$$

for large enough \(k \in {\mathcal {K}}\), where the first inequality follows from the Armijo line search rule, the second follows from Lemma B.1, and the third follows from Corollary 3.1. Since \(( f ( \rho _k ) )_{k \in {\mathbb {N}}}\) is non-increasing and bounded from below, \(f ( \rho _k ) - f ( \rho _{k + 1} ) \rightarrow 0\); taking limits, we obtain that \(H ( \rho _k ( {\bar{\alpha }} ), \rho _k ) \rightarrow 0\) as \(k \rightarrow \infty \) in \({\mathcal {K}}\).

Suppose now that \(\inf _{k \in {\mathcal {K}}} \alpha _k = 0\). Let \(( \alpha _k )_{k \in {\mathcal {K}}'}\), \({\mathcal {K}}' \subseteq {\mathcal {K}}\), be a subsequence converging to zero. According to the Armijo rule, we have

$$\begin{aligned} f ( \rho _k ( r^{-1} \alpha _k ) ) - f ( \rho _k ) > \tau {\langle { \nabla f ( \rho _k ), \rho _k ( r^{-1} \alpha _k ) - \rho _k }\rangle } , \end{aligned}$$
(11)

for large enough \(k \in {\mathcal {K}}'\). The mean value theorem says that the left-hand side equals \({\langle { \nabla f ( \sigma ), \rho _k ( r^{-1} \alpha _k ) - \rho _k }\rangle }\) for some \(\sigma \) in the line segment joining \(\rho _k ( r^{-1} \alpha _k )\) and \(\rho _k\). Then, (11) can be equivalently written as

$$\begin{aligned}&{\langle { \nabla f ( \sigma ) - \nabla f ( \rho _k ), \rho _k ( r^{-1} \alpha _k ) - \rho _k }\rangle } \nonumber \\&\quad > - ( 1 - \tau ) {\langle { \nabla f ( \rho _k ), \rho _k ( r^{-1} \alpha _k ) - \rho _k }\rangle } . \end{aligned}$$
(12)

By Pinsker’s inequality and Hölder’s inequality, we obtain

$$\begin{aligned}&\Vert \nabla f ( \sigma ) - \nabla f ( \rho _k ) \Vert _\infty \sqrt{2 H ( \rho _k ( r^{-1} \alpha _k ), \rho _k )} \nonumber \\&\quad \ge \Vert \nabla f ( \sigma ) - \nabla f ( \rho _k ) \Vert _\infty \Vert \rho _k ( r^{-1} \alpha _k ) - \rho _k \Vert _1 \nonumber \\&\quad \ge {\langle { \nabla f ( \sigma ) - \nabla f ( \rho _k ), \rho _k ( r^{-1} \alpha _k ) - \rho _k }\rangle } , \end{aligned}$$
(13)

for large enough \(k \in {\mathcal {K}}'\). Note that \(r^{-1} \alpha _k \le {\bar{\alpha }}\) for large enough \(k \in {\mathcal {K}}'\). By Lemma B.1 and Corollary 3.2, we obtain

$$\begin{aligned}&- {\langle { \nabla f ( \rho _k ), \rho _k ( r^{-1} \alpha _k ) - \rho _k }\rangle } \ge \frac{H ( \rho _k ( r^{-1} \alpha _k ), \rho _k )}{r^{-1} \alpha _k } \nonumber \\&\ge \sqrt{ \kappa H ( \rho _k ( {\bar{\alpha }} ), \rho _k ) } \sqrt{ H ( \rho _k ( r^{-1} \alpha _k ), \rho _k ) } , \end{aligned}$$
(14)

for large enough \(k \in {\mathcal {K}}'\). Since \(H ( \rho _k ( r^{-1} \alpha _k ), \rho _k )\) is strictly positive for all \(k \in {\mathcal {K}}'\) by assumption, (12), (13), and (14) imply

$$\begin{aligned} \Vert \nabla f ( \sigma ) - \nabla f ( \rho _k ) \Vert _\infty > ( 1 - \tau ) \sqrt{ \frac{\kappa H ( \rho _k ( {\bar{\alpha }} ), \rho _k )}{2} } \ge 0 . \end{aligned}$$

Taking limits, we obtain that \(H ( \rho _k ( {\bar{\alpha }} ), \rho _k ) \rightarrow 0\) as \(k \rightarrow \infty \) in \({\mathcal {K}}'\).
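To tie the pieces together, the following sketch runs the exponentiated gradient method with Armijo line search on a small random quantum state tomography instance. The data, the starting point, and the parameter values \({\bar{\alpha }} = 1\), \(r = 1/2\), \(\tau = 1/2\) are assumptions of this illustration, not prescriptions of the paper.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
d, n = 4, 20
Ms = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    Ms.append(B @ B.T)                           # random positive definite "measurements"

def f(rho):
    return -sum(np.log(np.trace(M @ rho).real) for M in Ms)

def grad(rho):
    return -sum(M / np.trace(M @ rho).real for M in Ms)

def eg_step(rho, g, alpha):
    """rho(alpha) = exp(log(rho) - alpha * g), normalized to unit trace (as in Appendix B)."""
    w, V = np.linalg.eigh(rho)
    H = V @ np.diag(np.log(w)) @ V.T - alpha * g
    E = expm(H - np.max(np.linalg.eigvalsh(H)) * np.eye(d))     # shift for numerical stability
    return E / np.trace(E).real

alpha_bar, r, tau = 1.0, 0.5, 0.5                # line-search parameters (example values)
rho = np.eye(d) / d                              # strictly positive starting point
for k in range(50):
    g = grad(rho)
    alpha, cand = alpha_bar, eg_step(rho, g, alpha_bar)
    # Armijo backtracking: shrink the step size until sufficient decrease holds.
    while f(cand) - f(rho) > tau * np.trace(g @ (cand - rho)).real:
        alpha *= r
        cand = eg_step(rho, g, alpha)
    rho = cand
print(f(rho))                                    # objective value after 50 iterations
```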

Cite this article

Li, YH., Cevher, V. Convergence of the Exponentiated Gradient Method with Armijo Line Search. J Optim Theory Appl 181, 588–607 (2019). https://doi.org/10.1007/s10957-018-1428-9
