Abstract
Consider the problem of minimizing a convex differentiable function on the probability simplex, spectrahedron, or set of quantum density matrices. We prove that the exponentiated gradient method with Armijo line search always converges to the optimum, if the sequence of the iterates possesses a strictly positive limit point (element-wise for the vector case, and with respect to the Löwner partial ordering for the matrix case). To the best of our knowledge, this is the first convergence result for a mirror descent-type method that only requires differentiability. The proof exploits self-concordant likeness of the log-partition function, which is of independent interest.
Notes
Here, we exclude the very standard projected gradient method.
For any element-wise strictly positive vector \(v := ( v_i )_{1 \le i \le d}\), the Burg entropy is defined as \(b ( v ) := - \sum _{i = 1}^d \log v_i\).
References
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Hohage, T., Werner, F.: Inverse problems with Poisson data: statistical regularization theory, applications and algorithms. Inverse Probl. 32, 093001 (2016)
Koltchinskii, V.: von Neumann entropy penalization and low-rank matrix estimation. Ann. Stat. 39(6), 2936–2973 (2011)
Paris, M., Řeháček, J. (eds.): Quantum State Estimation. Springer, Berlin (2004)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, Chichester (1983)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003)
Auslender, A., Teboulle, M.: Interior gradient and epsilon-subgradient descent methods for constrained convex minimization. Math. Oper. Res. 29(1), 1–26 (2004)
Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)
Arora, S., Hazan, E., Kale, S.: The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 8, 121–164 (2012)
Kivinen, J., Warmuth, M.K.: Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132, 1–63 (1997)
Helmbold, D.P., Schapire, R.E., Singer, Y., Warmuth, M.K.: On-line portfolio selection using multiplicative updates. Math. Finance 8(4), 325–347 (1998)
Tsuda, K., Rätsch, G., Warmuth, M.K.: Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6, 995–1018 (2005)
Lu, H., Freund, R.M., Nesterov, Y.: Relatively-smooth convex optimization by first-order methods, and applications. arXiv:1610.05708v1 (2016)
Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.L.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822 (2008)
Doljansky, M., Teboulle, M.: An interior proximal algorithm and the exponential multiplier method for semidefinite programming. SIAM J. Optim. 9(1), 1–13 (1998)
Bertsekas, D.P.: On the Goldstein–Levitin–Polyak gradient projection method. IEEE Trans. Autom. Control AC–21(2), 174–184 (1976)
Gafni, E.M., Bertsekas, D.P.: Convergence of a Gradient Projection Method. LIDS-P-1201, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge (1982)
Salzo, S.: The variable metric forward-backward splitting algorithms under mild differentiability assumptions. SIAM J. Optim. 27(4), 2153–2181 (2017)
Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia (1994)
Blume-Kohout, R.: Hedged maximum likelihood quantum state estimation. Phys. Rev. Lett. 105, 200504 (2010)
Decarreau, A., Hilhorst, D., Lemaréchal, C., Navaza, J.: Dual methods in entropy maximization. Application to some problems in crystallography. SIAM J. Optim. 2(2), 173–197 (1992)
Hiai, F., Ohya, M., Tsukada, M.: Sufficiency, KMS condition and relative entropy in von Neumann algebras. Pac. J. Math. 96(1), 99–109 (1981)
Bertsekas, D.P.: Nonlinear Programming, 3rd edn. Athena Scientific, Belmont (2016)
Bach, F.: Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)
Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014)
Tran-Dinh, Q., Li, Y.H., Cevher, V.: Composite convex minimization involving self-concordant-like cost functions. In: Modelling, Computation and Optimization in Information Systems and Management Sciences, pp. 155–168. Springer, Cham (2015)
Ohya, M., Petz, D.: Quantum Entropy and Its Use. Springer, Berlin (1993)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Hradil, Z.: Quantum-state estimation. Phys. Rev. A 55(3), R1561 (1997)
Byrne, C., Censor, Y.: Proximity function minimization using multiple Bregman projections, with application to split feasibility and Kullback–Leibler distance minimization. Ann. Oper. Res. 105, 77–98 (2001)
MacLean, L.C., Thorp, E.O., Ziemba, W.T. (eds.): The Kelly Capital Growth Investment Criterion. World Scientific, Singapore (2012)
Odor, G., Li, Y.H., Yurtsever, A., Hsieh, Y.P., El Halabi, M., Tran-Dinh, Q., Cevher, V.: Frank-Wolfe works for non-Lipschitz continuous gradient objectives: scalable Poisson phase retrieval. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6230–6234 (2016)
Vardi, Y., Shepp, L.A., Kaufman, L.: A statistical model for positron emission tomography. J. Am. Stat. Assoc. 80(389), 8–20 (1985)
Acknowledgements
We thank Ya-Ping Hsieh for his comments. This work was supported by SNF 200021-146750 and ERC project time-data 725594.
Appendices
Appendix A Inapplicability of Existing Convergence Guarantees to Quantum State Tomography
Quantum state tomography is the task of estimating the state of a quantum system, which is essential to calibrating quantum computation devices [4, 30]. Numerically, it corresponds to solving (P) with the objective function
\[ f_{\text {QST}} ( \rho ) := - \sum _{i = 1}^n \log {{\mathrm{\mathrm {Tr}}}} ( M_i \rho ), \]
where \(M_i\) are positive semi-definite matrices given by the experimental data.
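The behavior discussed in this appendix can be seen numerically. The sketch below is illustrative only: it assumes the standard negative log-likelihood form \(f_{\text{QST}} ( \rho ) = - \sum_i \log {\mathrm{Tr}} ( M_i \rho )\) and mirrors the two-summand diagonal example used in the proof of Proposition A.1; the gradient norm blows up as \(\rho \) approaches the boundary of the set of density matrices, so no Lipschitz constant can hold globally.

```python
import numpy as np

def f_qst(rho, Ms):
    # Negative log-likelihood objective: f(rho) = -sum_i log Tr(M_i rho).
    return -sum(np.log(np.trace(M @ rho).real) for M in Ms)

def grad_f_qst(rho, Ms):
    # Gradient of -log Tr(M rho) is -M / Tr(M rho); sum over the summands.
    return -sum(M / np.trace(M @ rho).real for M in Ms)

# Two-summand example from the proof of Proposition A.1:
# M1 = e1 (x) e1 and M2 = e2 (x) e2, restricted to diagonal density matrices.
M1 = np.diag([1.0, 0.0])
M2 = np.diag([0.0, 1.0])

# As rho approaches the boundary of the simplex (x -> 0), the gradient
# norm diverges like 1/x, so neither f nor its gradient is Lipschitz.
for x in [0.5, 1e-2, 1e-6]:
    rho = np.diag([x, 1.0 - x])
    print(x, np.linalg.norm(grad_f_qst(rho, [M1, M2])))
```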
The following proposition shows that existing convergence guarantees for the EG method do not apply to quantum state tomography.
Proposition A.1
The function \(f_{\text {QST}}\) is not Lipschitz, its gradient is not Lipschitz, and it is not smooth relative to the negative von Neumann entropy.
Proof
Consider the two-dimensional case, where \(\rho = ( \rho _{i, j} )_{1 \le i, j \le 2} \in {\mathbb {C}}^{2 \times 2}\). Define \(e_1 := ( 1, 0 )\) and \(e_2 := ( 0, 1 )\). Suppose that there are only two summands, with \(M_1 = e_1 \otimes e_1\) and \(M_2 = e_2 \otimes e_2\). Then, we have \(f ( \rho ) = - \log \rho _{1,1} - \log \rho _{2,2}\). It suffices to disprove all properties for this specific f on the set of diagonal density matrices. Hence, we will focus on the function \(g ( x, y ) := - \log x - \log y\), defined for any \(x, y > 0\) such that \(x + y = 1\).
As either x or y can be arbitrarily close to zero, neither g nor its gradient can be Lipschitz continuous, due to the logarithmic terms. Define the entropy function
\[ h ( x, y ) := - x \log x - y \log y, \]
with the convention \(0 \log 0 = 0\). Then, g is L-smooth relative to the negative entropy \(- h\), if and only if \(- L h - g\) is convex. It suffices to check the positive semi-definiteness of the Hessian of \(- L h - g\). A necessary condition for the Hessian to be positive semi-definite is that
\[ \frac{L}{x} - \frac{1}{x^2} \ge 0 \]
for all \(x \in ] 0, 1 [\), which cannot hold for \(x < 1 / L\), for any fixed \(L > 0\). \(\square \)
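A quick numerical illustration of this failure of relative smoothness, under the reconstruction \(h ( x ) = - x \log x - ( 1 - x ) \log ( 1 - x )\) and \(g ( x ) = - \log x - \log ( 1 - x )\) along the simplex line \(y = 1 - x\): the second derivative of \(\psi := - L h - g\) is positive in the interior but turns negative near the boundary, because the \(- 1 / x^2\) term eventually dominates \(L / x\) for any fixed L.

```python
import numpy as np

def psi_second(x, L):
    # Second derivative of psi(x) = -L*h(x) - g(x) along the simplex
    # {(x, 1 - x)}, with h(x) = -x log x - (1-x) log(1-x) (entropy) and
    # g(x) = -log x - log(1-x):
    #   psi''(x) = L*(1/x + 1/(1-x)) - 1/x**2 - 1/(1-x)**2.
    return L * (1 / x + 1 / (1 - x)) - 1 / x**2 - 1 / (1 - x)**2

L = 10.0
# Positive near the center, negative near the boundary (x < 1/L),
# so -L*h - g cannot be convex for any fixed L.
print(psi_second(0.5, L))   # > 0
print(psi_second(0.01, L))  # < 0
```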
We note that similar objective functions can be found in positive linear inverse problems, positron emission tomography, portfolio selection, and Poisson phase retrieval [31,32,33,34].
Appendix B Technical Lemmas Necessary for Sect. 3
Define
\[ \rho ( \alpha ) := C_\rho \exp \left( \log \rho - \alpha \nabla f ( \rho ) \right) \]
for every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha \ge 0\), where \(C_\rho \) is the positive real number normalizing the trace of \(\rho ( \alpha )\).
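For concreteness, a minimal numpy sketch of one step of this mapping, assuming the standard form of the matrix EG iterate \(\rho ( \alpha ) \propto \exp ( \log \rho - \alpha \nabla f ( \rho ) )\), with the matrix logarithm and exponential computed by eigendecomposition of Hermitian matrices:

```python
import numpy as np

def eg_update(rho, grad, alpha):
    # One matrix exponentiated gradient step:
    #   rho(alpha) = exp(log rho - alpha*grad) / Tr exp(log rho - alpha*grad).
    w, V = np.linalg.eigh(rho)            # rho must be non-singular
    log_rho = V @ np.diag(np.log(w)) @ V.conj().T
    A = log_rho - alpha * grad
    w2, V2 = np.linalg.eigh(A)
    # Shift by the top eigenvalue before exponentiating, for stability;
    # the shift cancels in the trace normalization.
    e = np.exp(w2 - w2.max())
    X = V2 @ np.diag(e) @ V2.conj().T
    return X / np.trace(X).real

rho = np.eye(2) / 2           # maximally mixed state
grad = np.diag([1.0, -1.0])   # an arbitrary Hermitian "gradient"
rho_new = eg_update(rho, grad, alpha=0.5)
# The iterate stays in the set of density matrices: unit trace, PSD;
# mass moves toward the eigendirection with the smaller gradient eigenvalue.
```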
Lemma B.1
For every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha > 0\), it holds that
Proof
The equivalent formulation of the EG method, (2), implies that
\(\square \)
Lemma B.2
Let \(\rho \in {\mathcal {D}}\) be non-singular. If \(\rho \) is a minimizer of f on \({\mathcal {D}}\), then \(\rho ( \alpha ) = \rho \) for all \(\alpha \ge 0\). If \(\rho ( \alpha ) = \rho \) for some \(\alpha > 0\), then \(\rho \) is a minimizer of f on \({\mathcal {D}}\).
Proof
The optimality condition says that \(\rho \in {{\mathrm{\mathrm {int}}}}{\mathcal {D}}\) is a minimizer of f on \({\mathcal {D}}\), if and only if
For any \(\alpha > 0\), we can equivalently write
where h denotes the negative von Neumann entropy function, i.e.,
Note that the quantum relative entropy H is the Bregman divergence induced by the negative von Neumann entropy. It is easily checked, again by the optimality condition, that (9) is equivalent to
\(\square \)
For every non-singular \(\rho \in {\mathcal {D}}\) and \(\alpha \ge 0\), define
Let \(G = \sum _j \lambda _j P_j\) be the spectral decomposition of G. Define \(\eta _\alpha \) as a random variable satisfying
it is easily checked that \({\mathsf {P}} \left( \eta _\alpha = \lambda _j \right) > 0\) for all j, and the probabilities sum to one.
Lemma B.3
For any \(\alpha \in {\mathbb {R}}\), it holds that
Proof
Note that
for any \(n \in {\mathbb {N}}\). Define \(\sigma _\alpha := \exp ( H_\alpha ) / {{\mathrm{\mathrm {Tr}}}}\exp ( H_\alpha )\). A direct calculation gives
The lemma follows. \(\square \)
Since \(\eta _\alpha \) is a bounded random variable, it follows that \(\varphi ''\) is bounded from above.
Corollary B.1
It holds that \(\varphi '' ( \alpha ) \le ( 1 / 4 ) \varDelta ^2\), where
Proof
Recall that the variance of a random variable taking values in [a, b] is bounded from above by \(( b - a )^2 / 4\).
\(\square \)
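The variance bound invoked here (Popoviciu's inequality) can be sanity-checked numerically; this sketch is purely illustrative, sampling random discrete distributions supported on [a, b] and verifying the bound, with equality attained by the two-point distribution placing mass 1/2 on each endpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Popoviciu's inequality: a random variable taking values in [a, b]
# has variance at most (b - a)**2 / 4.
a, b = -1.0, 3.0
for _ in range(100):
    vals = rng.uniform(a, b, size=10)
    probs = rng.dirichlet(np.ones(10))
    mean = probs @ vals
    var = probs @ (vals - mean) ** 2
    assert var <= (b - a) ** 2 / 4 + 1e-12

# Equality case: mass 1/2 at each of a and b.
eq_var = 0.5 * (a - (a + b) / 2) ** 2 + 0.5 * (b - (a + b) / 2) ** 2
print(eq_var, (b - a) ** 2 / 4)
```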
Appendix C Proof of Lemma 3.3
Recall the random variable \(\eta _\alpha \) defined in (10). Suppose that \(\varphi '' ( \alpha ) = 0\) for some \(\alpha \in [ 0, + \infty [\). Then, \(\eta _\alpha \) is almost surely constant; since each \(\lambda _j\) occurs with strictly positive probability, this implies that \(\varDelta = 0\), a contradiction. Therefore, we have \(\varphi '' ( \alpha ) > 0\) for all \(\alpha \in [ 0, + \infty [\).
We prove a general result. Let \(\psi : {\mathbb {R}} \rightarrow {\mathbb {R}}\) be a \(\mu \)-self-concordant-like function. Suppose that \(\psi '' ( t ) > 0\) for all t. Consider the function
We write, by the self-concordant likeness of \(\psi \), that
Then, for any \(t_1, t_2 \in {\mathbb {R}}\), we have
that is,
Applying the Newton–Leibniz formula, we obtain
similarly, we obtain
Applying the Newton–Leibniz formula again, we obtain
similarly, we obtain
Lemma 3.3 follows from setting \(\psi = \varphi \), \(\mu = \varDelta \), \(t_2 = 0\), and \(t_1 = \alpha \).
Appendix D Proof of Proposition 3.3
Suppose that . We write
for large enough \(k \in {\mathcal {K}}\), where the first inequality follows from the Armijo line search rule, the second follows from Lemma B.1, and the third follows from Corollary 3.1. Taking limits, we obtain that \(H ( \rho _k ( {\bar{\alpha }} ), \rho _k ) \rightarrow 0\) as \(k \rightarrow \infty \) in \({\mathcal {K}}\).
Suppose that . Let \(( \alpha _k )_{k \in {\mathcal {K}}'}\), \({\mathcal {K}}' \subseteq {\mathcal {K}}\), be a subsequence converging to zero. According to the Armijo rule, we have
for large enough \(k \in {\mathcal {K}}\). The mean value theorem says that the left-hand side equals \({\langle { \nabla f ( \sigma ), \rho _k ( r^{-1} \alpha _k ) - \rho _k }\rangle }\) for some \(\sigma \) in the line segment joining \(\rho _k ( r^{-1} \alpha _k )\) and \(\rho _k\). Then, (11) can be equivalently written as
By Pinsker’s inequality and Hölder’s inequality, we obtain
for large enough \(k \in {\mathcal {K}}\). Note that \(r^{-1} \alpha _k \le {\bar{\alpha }}\) for large enough \(k \in {\mathcal {K}}\). By Lemma B.1 and Corollary 3.2, we obtain
for large enough \(k \in {\mathcal {K}}\). Since \(H ( \rho _k ( r^{-1} \alpha _k ), \rho _k )\) is strictly positive for all \(k \in {\mathcal {K}}'\) by assumption, (12), (13), and (14) imply
Taking limits, we obtain that \(H ( \rho _k ( {\bar{\alpha }} ), \rho _k ) \rightarrow 0\) as \(k \rightarrow \infty \) in \({\mathcal {K}}'\).
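Pinsker's inequality, used in the proof above, can also be checked numerically. The sketch below is illustrative only: for probability vectors p and q, it verifies \(\Vert p - q \Vert _1^2 \le 2 \, \mathrm{KL} ( p \Vert q )\) on random samples.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    # Kullback-Leibler divergence (relative entropy) of p from q.
    return float(np.sum(p * np.log(p / q)))

# Pinsker's inequality: ||p - q||_1^2 <= 2 * KL(p || q).
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    assert np.sum(np.abs(p - q)) ** 2 <= 2 * kl(p, q) + 1e-12
print("Pinsker's inequality holds on all samples")
```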
Cite this article
Li, YH., Cevher, V. Convergence of the Exponentiated Gradient Method with Armijo Line Search. J Optim Theory Appl 181, 588–607 (2019). https://doi.org/10.1007/s10957-018-1428-9
Keywords
- Exponentiated gradient method
- Armijo line search
- Self-concordant likeness
- Peierls–Bogoliubov inequality