Abstract

In this work, we introduce AdaCN, a novel adaptive cubic Newton method for nonconvex stochastic optimization. AdaCN dynamically captures the curvature of the loss landscape through a diagonally approximated Hessian plus the norm of the difference between the previous two estimates. It requires at most first order gradients and updates with linear complexity in both time and memory. To reduce the variance introduced by the stochastic nature of the problem, AdaCN employs first and second moments to implement an exponential moving average over the iteratively updated stochastic gradients and approximated stochastic Hessians, respectively. We validate AdaCN in extensive experiments, showing that it outperforms other stochastic first order methods (including SGD, Adam, and AdaBound) and a stochastic quasi-Newton method (i.e., Apollo) in terms of both convergence speed and generalization performance.

1. Introduction

Stochastic gradient descent (SGD) [1] is the workhorse method for nonconvex stochastic optimization in machine learning, particularly for training deep neural networks (DNNs). Over the last decades, many accelerated first order variants of SGD have become widely used due to their simplicity and versatility, including accelerated SGD (ASGD) methods based on the Nesterov scheme [2], momentum [3], and the heavy-ball method [4], and adaptive methods such as AdaGrad [5], AdaDelta [6], RMSProp [7], and Adam [8]. Recently, Adam has become the default optimization method for many deep learning applications because of its rapid convergence and relative insensitivity to hyperparameter choices [9], and it has engendered an ever-growing list of modifications, such as AdamW [10], NosAdam [11], AMSGrad [12], AdaBound [13], RAdam [14], Sadam [15], and AdaX [16], to name a few. The main difference between the ASGD methods and the adaptive methods is that the former scale the gradients uniformly in all directions while the latter use adaptively, element-wise scaled learning rates, which usually allows the latter to converge faster and to be less sensitive to the learning rate. However, it has been observed that adaptive methods may converge to bad/suspicious local optima, leading to worse generalization than the ASGD methods [17], or may fail to converge because of unstable and extreme learning rates [13].

The abovementioned methods belong to the stochastic first order family: they use only gradient information and do not consider the curvature of the loss landscape, which leads to suboptimal behavior over the iterations. In contrast, stochastic second order methods can capture and exploit the curvature of the loss landscape by incorporating both gradient and Hessian information. For example, stochastic Newton methods adopt exact stochastic Hessians. However, computing the full Hessian for training large-scale DNNs is prohibitively expensive, so the Hessian must be approximated or avoided entirely in the iterations. According to how the stochastic Hessian is approximated, stochastic second order methods for training large-scale DNNs can be broadly categorized into two branches: stochastic quasi-Newton (SQN) methods, such as AdaQN [18], SdLBFGS [19], and Apollo [20], approximate the Hessian from first order information accumulated over prior iterations; stochastic second order Hessian-free methods compute the Hessian-vector product exactly through the efficient procedure proposed in [21], such as AdaHessian [22], which approximates the Hessian diagonal using Hutchinson’s method based on the Hessian-vector product and is therefore significantly more costly than the SQN methods.

Recently, building on a classic method from the nonstochastic setting, the cubic regularized Newton method [23], stochastic adaptive regularization methods using cubics (SARC) [24–26] have been proposed for relatively small-scale nonconvex stochastic optimization problems; at each iteration they find the minimizer of a local second order Taylor approximation with a cubic regularization term. The SARC methods are able to escape saddle points more efficiently, leading to better generalization performance than most of the abovementioned stochastic first order and second order methods [23, 27] on relatively small-scale machine learning problems. More recently, a variant of SARC combined with negative curvature (SANC) was proposed in [28] with even better generalizability, since a direction of negative curvature also helps escape strict saddle points [29, 30]. Unlike previous SARC methods, SANC uses independent sets of data points of consistent size over all iterations to obtain stochastic gradient and Hessian estimators, making it more practical than SARC. However, SARC and SANC use a Krylov subspace method to iteratively solve a cubic regularized Newton subproblem and a trust region-like scheme to decide whether an iteration is successful, updating only when it is, which hinders their application to large-scale nonconvex stochastic optimization (in the sense of large datasets and/or high-dimensional parameter spaces). Therefore, these existing cubic regularized Newton methods are not suitable for training large-scale DNNs.

In this work, we develop a novel adaptive cubic Newton method, AdaCN, for large-scale nonconvex stochastic optimization. It inherits the main strength of the SARC and SANC methods (i.e., strong generalizability) while avoiding their aforementioned limitation (i.e., unsuitability for training large-scale DNNs). AdaCN dynamically captures the curvature of the loss function through a diagonally approximated Hessian and the norm of the difference between the previous two estimates. It requires at most first order gradients and updates with linear complexity in both time and memory, making it well suited to large-scale nonconvex stochastic optimization problems. To reduce the variance introduced by the stochastic nature of the problem, AdaCN employs first and second moments to implement an exponential moving average over the iteratively calculated gradients and approximated Hessians, respectively. These moments not only accelerate convergence but also smooth the noisy curvature information and yield an approximation of the global curvature, avoiding misleading local gradient and Hessian information, which can be catastrophic. The superiority of AdaCN is illustrated with two simple 2D functions (one convex, one nonconvex; both give hints about the local behavior of optimizers in deep learning) in Figure 1, where we show the trajectories of different optimizers. As can be seen, AdaCN converges much faster than stochastic first order methods such as SGD and Adam and than the stochastic quasi-Newton method Apollo. Furthermore, we experimentally demonstrate the superiority of AdaCN on image classification tasks (LeNet [31] on Mnist; VGG11 [32], ResNet34 [33], and DenseNet121 [34] on the CIFAR10 and CIFAR100 [35] datasets) and on a language modeling task (LSTM [36] on the Penn Treebank [37] dataset).

Notation: we use italic letters $a, b$ to denote scalars, bold lowercase letters $\mathbf{a}, \mathbf{b}$ to denote vectors, and bold uppercase letters $\mathbf{A}, \mathbf{B}$ to denote matrices. For vectors, we use $\|\cdot\|$ to denote the $\ell_2$-norm.

2. Formulation of Adaptive Cubic Newton Method

2.1. Problem Statement

In this paper, we consider the following stochastic optimization problem:

$$\min_{\mathbf{w}\in\mathbb{R}^{d}} f(\mathbf{w}) = \mathbb{E}_{\xi}\left[F(\mathbf{w};\xi)\right], \quad (1)$$

where $f$ is a continuously differentiable and possibly nonconvex function, $\mathbf{w}\in\mathbb{R}^{d}$ is the parameter to be optimized, and $\mathbb{E}_{\xi}[\cdot]$ denotes the expectation with respect to $\xi$, a random variable with distribution $P$. A special case of (1) that arises frequently in supervised machine learning is the empirical risk minimization (ERM) problem [38]:

$$\min_{\mathbf{w}\in\mathbb{R}^{d}} f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f_{i}(\mathbf{w}), \quad (2)$$

where $f_{i}$ is the loss function corresponding to the $i$-th data instance, and $n$ is the number of data samples, which is assumed to be extremely large.

2.2. Newton Update from Cubically Regularized Model

We begin with a stochastic Newton (SN) method. At a high level, at each iteration $k$ we sample two independent mini-batches $\mathcal{S}_{g}$ and $\mathcal{S}_{B}$, and the stochastic gradient and Hessian estimators, say, $\mathbf{g}_{k}$ and $\mathbf{B}_{k}$, can be defined as

$$\mathbf{g}_{k} = \frac{1}{|\mathcal{S}_{g}|}\sum_{i\in\mathcal{S}_{g}} \nabla f_{i}(\mathbf{w}_{k}), \qquad \mathbf{B}_{k} = \frac{1}{|\mathcal{S}_{B}|}\sum_{i\in\mathcal{S}_{B}} \nabla^{2} f_{i}(\mathbf{w}_{k}). \quad (3)$$
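To make the two-mini-batch estimators concrete, the following is a minimal NumPy sketch for a toy least-squares ERM problem; the quadratic loss, batch sizes, and function names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ERM instance: f_i(w) = 0.5 * (x_i^T w - y_i)^2 (an illustrative choice).
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def minibatch_estimators(w, batch_g, batch_B):
    """Stochastic gradient g_k and diagonal of the stochastic Hessian B_k,
    computed from two independent mini-batches as in Section 2.2."""
    Xg, yg = X[batch_g], y[batch_g]
    XB = X[batch_B]
    g = Xg.T @ (Xg @ w - yg) / len(batch_g)   # mini-batch gradient estimator
    B_diag = np.mean(XB ** 2, axis=0)          # diagonal of the mini-batch Hessian
    return g, B_diag

w_k = np.zeros(d)
batch_g = rng.choice(n, size=32, replace=False)
batch_B = rng.choice(n, size=32, replace=False)
g_k, B_k = minibatch_estimators(w_k, batch_g, batch_B)
```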

In each iteration of the SN method, one finds the minimizer of a local second order Taylor expansion

$$m_{k}(\mathbf{z}) = f(\mathbf{w}_{k}) + \mathbf{g}_{k}^{\top}(\mathbf{z}-\mathbf{w}_{k}) + \frac{1}{2}(\mathbf{z}-\mathbf{w}_{k})^{\top}\mathbf{B}_{k}(\mathbf{z}-\mathbf{w}_{k}), \quad (4)$$

and its corresponding Newton update is

$$\mathbf{w}_{k+1} = \mathbf{w}_{k} - \eta\,\mathbf{B}_{k}^{-1}\mathbf{g}_{k}, \quad (5)$$

where $\eta$ is the stepsize.

For large-scale stochastic optimization problems, the stochastic Hessian $\mathbf{B}_{k}$ must be properly approximated. For example, AdaQN [18], SdLBFGS [19], and Apollo [20] (stochastic quasi-Newton methods) approximate $\mathbf{B}_{k}$ from first order information accumulated over prior iterations, while AdaHessian [22] (a stochastic second order Hessian-free method) approximates the Hessian diagonal using Hutchinson’s method [39] based on the Hessian-vector product.

A new principled variant of the SN method with global convergence guarantees was proposed in [23]; it finds the minimizer of the following cubically regularized second order approximation of $f$, with the estimators $\mathbf{g}_{k}$ and $\mathbf{B}_{k}$ and a sufficiently large $\alpha > 0$:

$$m_{k}(\mathbf{z}) = f(\mathbf{w}_{k}) + \mathbf{g}_{k}^{\top}(\mathbf{z}-\mathbf{w}_{k}) + \frac{1}{2}(\mathbf{z}-\mathbf{w}_{k})^{\top}\mathbf{B}_{k}(\mathbf{z}-\mathbf{w}_{k}) + \frac{\alpha}{3}\|\mathbf{z}-\mathbf{w}_{k}\|^{3}, \quad (6)$$

where $\mathbf{w}_{k}$ is the iterate at the $k$-th iteration. By the first order optimality condition, setting the derivative of (6) to zero immediately yields

$$\mathbf{g}_{k} + \mathbf{B}_{k}(\mathbf{z}-\mathbf{w}_{k}) + \alpha\|\mathbf{z}-\mathbf{w}_{k}\|(\mathbf{z}-\mathbf{w}_{k}) = \mathbf{0}, \quad (7)$$

which is a nonlinear system and can be approximated by a linear one by replacing $\|\mathbf{z}-\mathbf{w}_{k}\|$ with the norm of the difference between the previous two estimates:

$$\mathbf{g}_{k} + \left(\mathbf{B}_{k} + \alpha\|\mathbf{w}_{k}-\mathbf{w}_{k-1}\|\,\mathbf{I}\right)(\mathbf{z}-\mathbf{w}_{k}) = \mathbf{0}, \quad (8)$$

yielding a novel Newton update:

$$\mathbf{w}_{k+1} = \mathbf{w}_{k} - \eta\left(\mathbf{B}_{k} + \alpha\|\mathbf{w}_{k}-\mathbf{w}_{k-1}\|\,\mathbf{I}\right)^{-1}\mathbf{g}_{k}, \quad (9)$$

where $\mathbf{I}$ represents the identity matrix with the same size as $\mathbf{B}_{k}$. Compared with (5), (9) additionally makes use of the norm of the difference between the previous two estimates, leading to better performance since it captures more curvature information.
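Because AdaCN keeps $\mathbf{B}_{k}$ diagonal, the matrix inverse in (9) reduces to an elementwise division. The following short sketch illustrates this; the function name and the default values of $\eta$ and $\alpha$ are our own illustrative choices (positivity of the denominator is handled by the rectify operation introduced in Section 2.3).

```python
import numpy as np

def cubic_newton_step(w_k, w_prev, g_k, B_diag, eta=0.1, alpha=1.0):
    """Linearized cubic-regularized Newton update in the spirit of eq. (9):
    w_{k+1} = w_k - eta * (B_k + alpha * ||w_k - w_{k-1}|| * I)^{-1} g_k,
    where a diagonal B_k turns the inverse into an elementwise division."""
    cubic = alpha * np.linalg.norm(w_k - w_prev)  # norm of the difference between the previous two estimates
    return w_k - eta * g_k / (B_diag + cubic)
```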

2.3. Updating $\mathbf{B}_{k}$

The stochastic Hessian $\mathbf{B}_{k}$ can be updated based on the weak secant equation [40, 41]:

$$\mathbf{s}_{k-1}^{\top}\mathbf{B}_{k}\mathbf{s}_{k-1} = \mathbf{s}_{k-1}^{\top}\mathbf{y}_{k-1}, \quad (10)$$

where $\mathbf{s}_{k-1} = \mathbf{w}_{k} - \mathbf{w}_{k-1}$ and $\mathbf{y}_{k-1} = \mathbf{g}_{k} - \mathbf{g}_{k-1}$. The solution of the above problem in the Frobenius norm, restricted to diagonal matrices and based on the variational technique in [42], is given by

$$\mathbf{B}_{k} = \mathbf{B}_{k-1} + \frac{\mathbf{s}_{k-1}^{\top}\mathbf{y}_{k-1} - \mathbf{s}_{k-1}^{\top}\mathbf{B}_{k-1}\mathbf{s}_{k-1}}{\sum_{i}\left[\mathbf{s}_{k-1}\right]_{i}^{4}}\,\mathrm{Diag}\left(\mathbf{s}_{k-1}\odot\mathbf{s}_{k-1}\right), \quad (11)$$

where $\mathrm{Diag}(\mathbf{v})$ is the diagonal matrix with diagonal elements from vector $\mathbf{v}$ and $\odot$ denotes the elementwise multiplication. Further, with a rectifying operation that guarantees positive definiteness, the updated cubic Newton step becomes

$$\mathbf{w}_{k+1} = \mathbf{w}_{k} - \eta\left(\mathrm{rectify}(\mathbf{B}_{k},\epsilon) + \alpha\|\mathbf{w}_{k}-\mathbf{w}_{k-1}\|\,\mathbf{I}\right)^{-1}\mathbf{g}_{k}, \quad (12)$$

where

$$\mathrm{rectify}(\mathbf{B}_{k},\epsilon) = \max\left(|\mathbf{B}_{k}|,\,\epsilon\right), \quad (13)$$

$\epsilon$ is a positive parameter, the cost of computing $\mathrm{rectify}(\mathbf{B}_{k},\epsilon)$ is marginal since $\mathbf{B}_{k}$ is diagonal, and $|\cdot|$ takes absolute values of all the elements of the matrix $\mathbf{B}_{k}$. It is worth mentioning that, comparing (5) and (12), the $\mathrm{rectify}$ operation prevents the step size from becoming arbitrarily large when zero values exist in $\mathbf{B}_{k}$.
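A minimal sketch of the diagonal Hessian update and the rectify operation is given below; the closed form used here is our reconstruction of (11) (a diagonal solution of the weak secant equation under a Frobenius-norm criterion) and may differ from the paper's exact expression in constants or notation.

```python
import numpy as np

def update_diag_hessian(B_diag, s, y, tiny=1e-12):
    """Diagonal update satisfying the weak secant equation s^T B s = s^T y
    while staying close to the previous estimate (eq. (11)-style reconstruction)."""
    coeff = (s @ y - s @ (B_diag * s)) / (np.sum(s ** 4) + tiny)
    return B_diag + coeff * s * s              # adds coeff * Diag(s ⊙ s)

def rectify(B_diag, eps):
    """rectify(B, eps) = max(|B|, eps): elementwise absolute value floored at eps > 0,
    which guarantees positive definiteness of the diagonal approximation."""
    return np.maximum(np.abs(B_diag), eps)
```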

2.4. Moments for $\mathbf{g}_{k}$ and $\mathbf{B}_{k}$

In this paper, we adopt the first and second moments for $\mathbf{g}_{k}$ and $\mathbf{B}_{k}$, given by

$$\mathbf{m}_{k} = \beta_{1}\mathbf{m}_{k-1} + (1-\beta_{1})\,\mathbf{g}_{k}, \qquad \mathbf{v}_{k} = \beta_{2}\mathbf{v}_{k-1} + (1-\beta_{2})\,\mathbf{B}_{k}, \quad (14)$$

where $\beta_{1}, \beta_{2}\in[0,1)$ are the first and second moment hyperparameters that are also used in Adam [8] and its many variants. The moments are further bias corrected as

$$\hat{\mathbf{m}}_{k} = \frac{\mathbf{m}_{k}}{1-\beta_{1}^{k}}, \qquad \hat{\mathbf{v}}_{k} = \frac{\mathbf{v}_{k}}{1-\beta_{2}^{k}}. \quad (15)$$

Using the first and second moments amounts to carrying out an exponential moving average over the iteratively updated stochastic gradients and over the approximated stochastic Hessians (combined with the norm of the difference between the previous two estimates), which smooths the noisy curvature information and yields an approximation of the global curvature, avoiding misleading local gradient and Hessian information, which can be catastrophic.
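The moment updates of (14) and (15) can be sketched in a few lines; the default $\beta_{1}$, $\beta_{2}$ values below are the usual Adam choices and are assumed here only for illustration.

```python
import numpy as np

def update_moments(m, v, g, B_diag, k, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the stochastic gradient and the diagonal
    Hessian approximation, with Adam-style bias correction (eqs. (14)-(15))."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * B_diag
    m_hat = m / (1.0 - beta1 ** k)            # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** k)            # bias-corrected second moment
    return m, v, m_hat, v_hat
```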

To summarize, the complete algorithm of AdaCN is given in Algorithm 1, in which at most first order gradients are required, and $\mathbf{B}_{k}$, $\mathbf{v}_{k}$, $\hat{\mathbf{v}}_{k}$, and $\mathrm{rectify}(\hat{\mathbf{v}}_{k},\epsilon)$ are all diagonal. Therefore, AdaCN updates with linear complexity in both time and memory (a Python sketch of one possible implementation is given after Algorithm 1).

Algorithm 1: AdaCN.
Require: $|\mathcal{S}|$ //Mini-batch size
Require: $\eta$ //Stepsize
Require: $\beta_{1}, \beta_{2}$ //Parameters of exponential moving average
Require: $\alpha, \epsilon$ //Positive parameters
Require: $\mathbf{m}_{0} = \mathbf{0}$, $\mathbf{v}_{0} = \mathbf{0}$, $\mathbf{B}_{0} = \mathbf{0}$ //Initialize variables as zero vectors or zero matrices
Require: $k = 0$ //Initialize timestep
(1)While not converged do
(2)$k = k + 1$
(3)sample a mini-batch $\mathcal{S}_{k}$
(4)$\mathbf{g}_{k} = \frac{1}{|\mathcal{S}_{k}|}\sum_{i\in\mathcal{S}_{k}}\nabla f_{i}(\mathbf{w}_{k})$ //Stochastic gradient at timestep k
(5)$\mathbf{s}_{k-1} = \mathbf{w}_{k} - \mathbf{w}_{k-1}$
(6)$\mathbf{y}_{k-1} = \mathbf{g}_{k} - \mathbf{g}_{k-1}$
(7)update $\mathbf{B}_{k}$ by (11) //Update diagonal Hessian
(8)$\mathbf{m}_{k} = \beta_{1}\mathbf{m}_{k-1} + (1-\beta_{1})\mathbf{g}_{k}$
(9)$\mathbf{v}_{k} = \beta_{2}\mathbf{v}_{k-1} + (1-\beta_{2})\mathbf{B}_{k}$
(10)$\hat{\mathbf{m}}_{k} = \mathbf{m}_{k}/(1-\beta_{1}^{k})$
(11)$\hat{\mathbf{v}}_{k} = \mathbf{v}_{k}/(1-\beta_{2}^{k})$
(12)$\mathbf{D}_{k} = \mathrm{rectify}(\hat{\mathbf{v}}_{k}, \epsilon)$
(13)$\mathbf{w}_{k+1} = \mathbf{w}_{k} - \eta\left(\mathbf{D}_{k} + \alpha\|\mathbf{s}_{k-1}\|\,\mathbf{I}\right)^{-1}\hat{\mathbf{m}}_{k}$
(14)end while
(15)return $\mathbf{w}_{k}$
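For concreteness, the following is a PyTorch-style sketch of one possible implementation of Algorithm 1, assuming the reconstructed equations above. The class name AdaCNSketch, the default hyperparameter values, and the per-tensor (rather than whole-parameter-vector) treatment of $\|\mathbf{w}_{k}-\mathbf{w}_{k-1}\|$ are our own simplifying assumptions, not the authors' reference implementation.

```python
import torch
from torch.optim import Optimizer

class AdaCNSketch(Optimizer):
    """Sketch of AdaCN (Algorithm 1): diagonal weak-secant Hessian update,
    moment smoothing with bias correction, rectification, and a cubic term
    given by the norm of the difference between the previous two iterates."""

    def __init__(self, params, lr=1e-2, betas=(0.9, 0.999), eps=1e-2, alpha=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, alpha=alpha)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)   # first moment of gradients
                    state["v"] = torch.zeros_like(p)   # second moment of diag. Hessian
                    state["B"] = torch.zeros_like(p)   # diagonal Hessian approximation
                    state["prev_p"] = p.detach().clone()
                    state["prev_g"] = torch.zeros_like(p)
                state["step"] += 1
                k = state["step"]

                # s_{k-1} = w_k - w_{k-1}, y_{k-1} = g_k - g_{k-1}
                s = p - state["prev_p"]
                y_ = grad - state["prev_g"]
                state["prev_p"] = p.detach().clone()
                state["prev_g"] = grad.clone()

                # Diagonal weak-secant update, eq. (11)-style reconstruction.
                denom = (s ** 4).sum().clamp_min(1e-12)
                coeff = (torch.dot(s.flatten(), y_.flatten())
                         - torch.dot(s.flatten(), (state["B"] * s).flatten())) / denom
                state["B"].add_(coeff * s * s)

                # Moments and bias correction, eqs. (14)-(15).
                state["m"].mul_(beta1).add_(grad, alpha=1 - beta1)
                state["v"].mul_(beta2).add_(state["B"], alpha=1 - beta2)
                m_hat = state["m"] / (1 - beta1 ** k)
                v_hat = state["v"] / (1 - beta2 ** k)

                # Rectified curvature plus cubic term alpha * ||w_k - w_{k-1}||.
                D = v_hat.abs().clamp_min(group["eps"])
                cubic = group["alpha"] * s.norm()
                p.add_(-group["lr"] * m_hat / (D + cubic))
        return loss
```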

3. Experiments

In this section, we assess the performance of AdaCN on two learning tasks, image classification and language modeling, comparing it with stochastic first order optimizers (Adam [8], SGD [1], and AdaBound [13]) and the stochastic second order optimizer Apollo [20], listed in Table 1. For image classification, we evaluate LeNet [31] on Mnist and VGG11 [32], ResNet34 [33], and DenseNet121 [34] on CIFAR10 and CIFAR100 [35]; for language modeling, we test LSTM [36] on Penn Treebank [37]. Moreover, we test robustness to hyperparameters by comparing AdaCN and Apollo under different values of the positive parameter and the learning rate. The results are shown in the following subsections.

3.1. Experiments Setup

We perform careful hyperparameter tuning in the experiments as follows. AdaCN: we fix the moment parameters $\beta_{1}$ and $\beta_{2}$ and the positive parameters $\epsilon$ and $\alpha$; the learning rate is set per dataset, e.g., 0.2 for the CIFAR datasets and 30 for the Penn Treebank dataset. SGD: the momentum is set to 0.9, while the learning rate is searched over a grid of candidate values. Adam, AdaBound, and Apollo: the learning rate is searched in the same way as for SGD, and the other parameters are set to their default values from the literature.
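As a purely illustrative usage example (not the paper's actual training setup), the AdaCNSketch class from Section 2 can be dropped into a standard training loop; the synthetic data, model size, and hyperparameter values below are placeholders chosen only so this toy run is stable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
# Placeholder hyperparameters; the paper's tuned values are dataset-specific.
optimizer = AdaCNSketch(model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=0.1, alpha=1.0)

x = torch.randn(512, 20)
t = torch.randint(0, 10, (512,))
for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), t)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```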

3.2. Image Classification

First, we evaluate the convergence and generalization of AdaCN on image classification. We use LeNet on Mnist and VGG11, ResNet34, and DenseNet121 on the CIFAR10 and CIFAR100 datasets.

Mnist. Results on Mnist are shown in Figure 2 (train and test accuracy curves) and Table 2, from which we can see that AdaCN achieves the best convergence speed and classification accuracy.

CIFAR10. We report the results on CIFAR10 in Figure 3 and Table 3. For all three network architectures, AdaCN clearly outperforms the other optimizers, with comparable convergence speed and the best classification accuracy.

CIFAR100. Results on CIFAR100 are shown in Figure 4 and Table 4. For VGG11, AdaCN is better than Adam and Apollo, but worse than SGD and AdaBound in terms of convergence speed and classification accuracy. For ResNet34 and DenseNet121, AdaCN achieves the best classification accuracy.

3.3. Language Modeling

For language modeling, we experiment with 1-, 2-, and 3-layer LSTM models on the Penn Treebank dataset. As Figure 5 and Table 5 show, AdaCN also maintains the fastest convergence speed and achieves the lowest perplexity among the optimizers for the 1-, 2-, and 3-layer LSTMs.

Finally, we explore the effects of the hyperparameters, namely the positive parameter and the learning rate, on the performance of AdaCN. The results on the CIFAR10 dataset are reported in Figures 6 and 7, where the positive parameter is varied over a log-scale grid on ResNet34 and the learning rate is varied on VGG11. As can be seen, the test accuracies of AdaCN stay above 95% for all tested values, while Apollo remains consistently below 95%. These results validate the robustness of AdaCN to the positive parameter and the learning rate.

4. Conclusion

We have proposed AdaCN, a novel, efficient, and effective adaptive cubic Newton method for nonconvex stochastic optimization. The method is designed for large-scale nonconvex stochastic optimization problems, which lie at the core of state-of-the-art deep learning. Experimental results on image classification tasks and a language modeling task demonstrate the superiority of AdaCN in terms of convergence speed and generalization performance.

Data Availability

The data used to support the findings of this study are open datasets that can be found on public websites and are freely available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported in part by the Natural Science Foundation of Hunan Province (Grant No. 2019JJ50746) and the National Natural Science Foundation of China (Grant No. 61602494).