
1 Introduction

The type of representation used for presenting data to a learning system plays a key role in artificial intelligence and machine learning. Typically, the performance of the system, such as its speed of learning or its error rate, depends directly on how the data is represented internally. Hand-engineering these representations using special domain knowledge was long the norm. More recently, representations are learned hierarchically and directly from the data through stochastic gradient descent. Learning such representations significantly improves learning performance and reduces the human effort involved in designing the system. Importantly, it also allows systems to scale up to bigger and harder problems.

Learning hierarchical representations directly from the data has recently gained a lot of popularity. Designing deep neural networks has allowed learning systems to tackle incredibly hard problems: classifying and recognizing objects in natural scene images (Deng et al. 2009; Szegedy et al. 2017), automatically translating text and speech (Cho et al. 2014; Bahdanau et al. 2014; Wu et al. 2016), achieving and surpassing human-level baselines in Atari (Mnih et al. 2015), achieving super-human performance in Poker (Moravčík et al. 2017), and improving robot control from learning experiences (Levine et al. 2016). It is important to note that in many of these problems it is difficult to hand-engineer a data representation, and an inadequate representation generally limits the performance or scalability of the system.

The algorithm behind the training of such deep neural networks is called backprop (or backpropagation), introduced by Rumelhart et al. (1988). It extended stochastic gradient descent, via the chain rule, to learning the weights in the hidden layers of a neural network.

Though backprop has produced many successful results, it suffers from some fundamental issues which make it slow in learning a useful representation that solves many tasks. Specifically, backprop tends to interfere with previously learned representations, because the units that have so far been found to be useful are the ones most likely to be changed (Sutton 1986). One reason for this is that the weights of each hidden layer are assumed to be independent of each other, so the parameters of the neural network race against each other to minimize the error for a given example. To overcome this issue, the neural network needs to be trained over multiple sweeps (epochs) of the data so that the algorithm can settle on one representation that encompasses all the data it has seen so far.

In this paper, we introduce a meta-gradient descent approach for learning the weights connecting the hidden units of a neural network. The meta-gradient descent approach was previously introduced by Sutton (1992) and Schraudolph (1999) for learning parameter-specific step-sizes; here it is adapted for learning the incoming weights of hidden units. We call our proposed method crossprop.

This specifically addresses the racing problem observed in backprop. Furthermore, in our continual learning experiments, where a learning system experiences a sequence of related tasks, we observed that crossprop tends to find the features that best generalize across these tasks. Backprop, on the other hand, tends to unlearn and relearn its features with each task that it experiences. From a continual learning perspective, it is desirable to have a learning system that can leverage learning from its past experiences when solving unseen and more difficult tasks in the future.

2 Related Methods

There are three fundamental approaches for learning representations directly from the data via a neural network.

The first and most popular approach for learning such representations is through stochastic gradient descent on a supervised learning error function, like the mean squared or the cross-entropy error (Rumelhart et al. 1988). This approach has proved successful in many applications, ranging from difficult problems in computer vision to patient diagnosis. Although this method has a strong track record, it is not perfect: learning representations by backpropagating the supervised error signal is often slow and poor in many problems (Sutton 1986; Jacobs 1988). To address this, many modifications to backprop have been introduced, like adding momentum (Jacobs 1988), RMSProp (Tieleman and Hinton 2012), ADADELTA (Zeiler 2012), and ADAM (Kingma and Ba 2014), and it is not quite clear which variation of backprop will work well for a given task. Moreover, all these variations of backprop still tend to interfere with previously learned representations, causing the network to unlearn and relearn its representation even when the task can be solved by leveraging the learning from previous experiences.

Another promising approach for learning representations is the generate-and-test process (Klopf and Gose 1969; Mahmood and Sutton 2013). The underlying principle behind these approaches is to generate many features in a random manner and then test the usefulness of each of them. Based on certain heuristics, features are either preserved or discarded. Furthermore, generate-and-test can be combined with backprop to achieve a better rate of learning in supervised learning tasks. The primary motivation behind these approaches is to design a distributed and computationally inexpensive representation learning method.

Some researchers have also looked at learning representations that fulfil certain unsupervised learning objectives, like clustering, sparsity, statistical independence, or reproducibility of the data, which takes us to the third fundamental approach to learning representations (Olshausen and Field 1997; Comon 1994; Vincent et al. 2010; Coates and Ng 2012). Recently, such unsupervised representations have enabled the design of an effective clinical decision-making system (Miotto et al. 2016). However, it is not clear how to design a learning system for a continual and online learning setting using representations obtained through unsupervised learning, because we do not have access to the data before a learning task begins.

3 Algorithm

We consider a single-hidden layer neural network with a single output unit for presenting our algorithm. The parameters \( U \in \mathbb {R}^{m \times n} \) and \( W \in \mathbb {R}^{n} \) are the incoming and outgoing weights of the neural network where m is the number of input units and n is the number of hidden units. Each element of U is denoted as \( u_{ij} \) where i refers to the corresponding input unit and j refers to the hidden unit. Likewise, each element of W is denoted as \( w_{j} \).

[Algorithm 1: Crossprop (pseudo-code)]

Our proposed method is summarized as pseudo-code in Algorithm 1 (and the code is available on GitHub). A learning system (for simplicity, consider a single-hidden-layer network), at time step t, receives an example \( X_{t} \in \mathbb {R}^{m} \), where each element of this vector is denoted as \( x_{i, t} \). The example is mapped onto the hidden units through the incoming weight matrix U, and a nonlinearity, like tanh, sigmoid or relu, is applied to this summed product. With a \( \tanh \) activation function, the activation of each hidden unit for the example at time step t is \( \phi _{j, t} = \tanh \Big ( \sum _{i = 1}^{m} x_{i, t} u_{ij, t} \Big ) \). These hidden units are in turn mapped onto a scalar output \( y_{t} \in \mathbb {R} \) using the weights W: \( y_{t} = \sum _{j = 1}^{n} \phi _{j, t} w_{j, t} \).
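For concreteness, the forward computation just described can be written as a short numpy sketch; the function and variable names are ours and are not taken from the released code.

```python
import numpy as np

def forward(x, U, W):
    """Forward pass of the single-hidden-layer network described above.

    x: (m,) input vector, U: (m, n) incoming weights, W: (n,) outgoing weights.
    Returns the hidden activations phi and the scalar estimate y.
    """
    phi = np.tanh(x @ U)   # phi_j = tanh(sum_i x_i * u_ij)
    y = float(phi @ W)     # y_t = sum_j phi_j * w_j
    return phi, y
```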

Let \( \delta _{t}^{2} = \big ( y_{t}^{*} - y_{t} \big )^{2} \) be a noisy objective function, where \( y_{t}^{*} \) is the scalar target and \( y_{t} \) is the estimate made by the algorithm for the example at time step t. The incoming and outgoing weights (U and W) are learned incrementally, processing one example after another.

The outgoing weights W are updated using the least mean squares (LMS) learning rule after processing an example at time step t as follows:

$$\begin{aligned} w_{j, t + 1}&= w_{j, t} - \frac{1}{2} \alpha \frac{\partial \delta ^{2}_{t}}{\partial w_{j, t}} \nonumber \\&= w_{j, t} - \alpha \delta _{t} \frac{\partial \delta _{t}}{\partial w_{j, t}}\nonumber \\&= w_{j, t} - \alpha \delta _{t} \frac{\partial [ y^{*}_{t} - y_{t} ]}{\partial w_{j, t}}\nonumber \\&= w_{j, t} + \alpha \delta _{t} \frac{\partial y_{t}}{\partial w_{j, t}} \nonumber \\&= w_{j, t} + \alpha \delta _{t} \frac{\partial }{\partial w_{j, t}} \Big [ \sum _{k=1}^{n} \phi _{k, t}w_{k, t} \Big ] \nonumber \\ w_{j, t + 1}&= w_{j, t} + \alpha \delta _{t} \phi _{j, t} \end{aligned}$$
(1)
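In code, this LMS update of the outgoing weights is a single line; a minimal sketch with our own naming:

```python
import numpy as np

def lms_update(W, phi, delta, alpha=0.0005):
    """Eq. (1): w_j <- w_j + alpha * delta_t * phi_j for all j."""
    return W + alpha * delta * phi
```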

We diverge from the conventional way (i.e., backprop) of learning the incoming weights U. Specifically, for learning the weights U, we consider the influence of all the past values \( U_{1}, U_{2}, \cdots , U_{t} \) on the current error \( \delta _{t}^{2} \). We would like to learn the value of \( u_{ij, t + 1} \) by making an update using the partial derivative term \( \frac{\partial \delta _{t}^{2}}{\partial u_{ij}} \), where \( u_{ij} \) refers to all of its past values.

This is interesting because most current research on representation learning considers only the influence of the weight at the current time step, \( u_{ij, t} \), on the squared error \( \delta _{t}^{2} \), i.e., \( \frac{\partial \delta _{t}^{2}}{\partial u_{ij, t}} \). This ignores the effects of the previous values of these weights on the squared error at the current time step.

We now derive the update rule for the incoming weights as follows:

$$\begin{aligned} u_{ij, t + 1}&= u_{ij, t} - \frac{1}{2} \alpha \frac{\partial \delta _{t}^{2}}{\partial u_{ij}} \nonumber \\&= u_{ij, t} - \alpha \delta _{t} \frac{\partial [ y_{t}^{*} - y_{t} ]}{\partial u_{ij}} \nonumber \\ u_{ij, t+1}&= u_{ij, t} + \alpha \delta _{t} \frac{\partial y_{t}}{\partial u_{ij}} \end{aligned}$$
(2)

Adapting the meta-gradient descent approach introduced by Sutton (1992) and Schraudolph (1999), we expand the gradient term \( \frac{\partial y_{t}}{\partial u_{ij}} \) as follows:

$$\begin{aligned} \frac{\partial y_{t}}{\partial u_{ij}}&= \sum _{k} \frac{\partial y_{t}}{\partial w_{k, t}}\frac{\partial w_{k, t}}{\partial u_{ij}} + \sum _{k} \frac{\partial y_{t}}{\partial \phi _{k, t}}\frac{\partial \phi _{k, t}}{\partial u_{ij}} \nonumber \\&= \sum _{k} \frac{\partial y_{t}}{\partial w_{k, t}}\frac{\partial w_{k, t}}{\partial u_{ij}} + \sum _{k} \frac{\partial y_{t}}{\partial \phi _{k, t}}\frac{\partial \phi _{k, t}}{\partial u_{ij, t}} \nonumber \\ \frac{\partial y_{t}}{\partial u_{ij}}&\approx \frac{\partial y_{t}}{\partial w_{j, t}}\frac{\partial w_{j, t}}{\partial u_{ij}} + \frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \end{aligned}$$
(3)

Any error made during the estimation of \( y_{t} \) by the learning system is attributed both to the outgoing weights of the features and to the activations of the hidden units. The approximations \( \sum _{k} \frac{\partial y_{t}}{\partial w_{k, t}}\frac{\partial w_{k, t}}{\partial u_{ij}} \approx \frac{\partial y_{t}}{\partial w_{j, t}}\frac{\partial w_{j, t}}{\partial u_{ij}} \) and \( \sum _{k} \frac{\partial y_{t}}{\partial \phi _{k, t}}\frac{\partial \phi _{k, t}}{\partial u_{ij, t}} \approx \frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \) are reasonable because the primary effect of the input weight \( u_{ij} \) is through the corresponding outgoing weight \( w_{j, t} \) and feature \( \phi _{j, t} \).

By defining \( h_{ij, t} = \frac{\partial w_{j, t}}{\partial u_{ij}} \), we can obtain a simple form for Eq. (3):

$$\begin{aligned} \frac{\partial y_{t}}{\partial u_{ij}}&\approx \frac{\partial y_{t}}{\partial w_{j, t}}\frac{\partial w_{j, t}}{\partial u_{ij}} + \frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \nonumber \\&= \Big ( \frac{\partial }{\partial w_{j, t}} \sum _{k} \phi _{k, t} w_{k, t} \Big )h_{ij, t} + \frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \nonumber \\ \frac{\partial y_{t}}{\partial u_{ij}}&\approx \phi _{j, t} h_{ij, t} + \frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \end{aligned}$$
(4)

The partial derivative \( \frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \) is the conventional backprop update. However, in our proposed algorithm, we have an additional term \( \phi _{j, t} h_{ij, t} \) that captures the influence of all the previous values of \( u_{ij} \) on the current estimate \( y_{t} \) and on the current squared error \( \delta _{t}^{2} \).

The quantity \( h_{ij, t} \) is an additional memory parameter corresponding to the input weight \( u_{ij, t} \), and it can be updated recursively as follows:

$$\begin{aligned} h_{ij, t + 1}&\approx \frac{\partial w_{j, t + 1}}{\partial u_{ij}} \nonumber \\&= \frac{\partial }{\partial u_{ij}} \Big [ w_{j, t} + \alpha \delta _{t} \phi _{j, t} \Big ] \nonumber \\&= \frac{\partial w_{j, t}}{\partial u_{ij}} + \alpha \frac{\partial }{\partial u_{ij}} \Big [ \delta _{t}\phi _{j, t} \Big ] \nonumber \\&= h_{ij, t} + \alpha \delta _{t} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} + \alpha \frac{\partial \delta _{t}}{\partial u_{ij}} \phi _{j, t} \nonumber \\&\approx h_{ij, t} + \alpha \delta _{t} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} + \alpha \frac{\partial \delta _{t}}{\partial y_{t}}\frac{\partial y_{t}}{\partial u_{ij}}\phi _{j, t} \nonumber \\&\approx h_{ij, t} + \alpha \delta _{t} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} + \alpha \frac{\partial \delta _{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial w_{j, t}}\frac{\partial w_{j, t}}{\partial u_{ij}}\phi _{j, t} + \alpha \frac{\partial \delta _{t}}{\partial y_{t}}\frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}}\phi _{j, t} \nonumber \\&= h_{ij, t} + \alpha \delta _{t} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} - \alpha \frac{\partial y_{t}}{\partial w_{j, t}}\frac{\partial w_{j, t}}{\partial u_{ij}}\phi _{j, t} - \alpha \frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}}\phi _{j, t} \nonumber \\&= h_{ij, t} + \alpha \delta _{t} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} - \alpha \phi _{j, t}^{2} h_{ij, t} - \alpha w_{j, t}\phi _{j, t}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \nonumber \\ h_{ij, t + 1}&\approx h_{ij, t} \Big ( 1 - \alpha \phi _{j, t}^{2} \Big ) + \alpha \Big ( \delta _{t} - w_{j, t}\phi _{j, t} \Big ) \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \end{aligned}$$
(5)

By substituting Eq. (4) in Eq. (2), we define a recursive update equation for the weights \( u_{ij, t} \) and thereby summarize the complete algorithm as follows:

$$\begin{aligned} \begin{aligned} u_{ij, t + 1}&= u_{ij, t} + \alpha \delta _{t} \Big [ \phi _{j, t} h_{ij, t} + w_{j, t} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \Big ] \\ h_{ij, t + 1}&= h_{ij, t} \Big ( 1 - \alpha \phi _{j, t}^{2} \Big ) + \alpha \Big ( \delta _{t} - w_{j, t}\phi _{j, t} \Big ) \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \\ w_{j, t + 1}&= w_{j, t} + \alpha \delta _{t} \phi _{j, t} \end{aligned} \end{aligned}$$
(6)

Depending on the nonlinearity used for the hidden units, \( \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \) can be reduced to a closed-form equation.

For instance, if a logistic function is used, then \( \phi _{j, t} = \sigma \Big ( \sum _{i=1}^{m}x_{i, t}u_{ij, t} \Big ) \),

$$\begin{aligned} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}}&= \frac{\partial }{\partial u_{ij, t}} \sigma \Big ( \sum _{i=1}^{m} x_{i, t}u_{ij, t} \Big ) \\&= \phi _{j, t} \Big ( 1 - \phi _{j, t} \Big ) x_{i, t} \end{aligned}$$

Another frequently used activation function is \( \tanh \), which implies that \( \phi _{j, t} = \tanh \Big ( \sum _{i=1}^{m} x_{i, t}u_{ij, t} \Big ) \),

$$\begin{aligned} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}}&= \frac{\partial }{\partial u_{ij, t}} \tanh \Big ( \sum _{i=1}^{m} x_{i, t}u_{ij, t} \Big ) \\&= \Big ( 1 - \phi _{j, t}^2 \Big ) x_{i, t} \end{aligned}$$
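Putting Eq. (1), the \( h_{ij, t} \) recursion of Eq. (5), and the update of Eq. (6) together with the tanh derivative above, one online crossprop step can be sketched in numpy as follows. This is a minimal sketch assuming a tanh hidden layer and a single scalar output; the function and variable names are ours and are not taken from the released code.

```python
import numpy as np

def crossprop_step(x, y_target, U, W, H, alpha=0.0005):
    """One online crossprop update (Eq. 6) for a tanh hidden layer.

    x: (m,) input, y_target: scalar target,
    U: (m, n) incoming weights, W: (n,) outgoing weights,
    H: (m, n) memory trace h_ij. Updates U, W, H in place.
    """
    phi = np.tanh(x @ U)                       # hidden activations
    y = float(phi @ W)                         # scalar estimate
    delta = y_target - y                       # error delta_t

    dphi_du = np.outer(x, 1.0 - phi ** 2)      # d phi_j / d u_ij for tanh, shape (m, n)

    # Incoming weights: meta-gradient term phi_j * h_ij plus backprop term w_j * dphi/du
    U += alpha * delta * (phi * H + W * dphi_du)

    # Memory trace h_ij, using the pre-update outgoing weights
    H *= (1.0 - alpha * phi ** 2)
    H += alpha * (delta - W * phi) * dphi_du

    # Outgoing weights: LMS update of Eq. (1)
    W += alpha * delta * phi
    return delta ** 2
```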

We could also introduce a weighting factor \( \eta \in [0, 1] \) in Eq. (4), which allows smoothly mixing the backprop and meta-gradient updates,

$$\begin{aligned} \frac{\partial y_{t}}{\partial u_{ij}} \approx (1 - \eta )\phi _{j, t} h_{ij, t} + \eta \frac{\partial y_{t}}{\partial \phi _{j, t}}\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \end{aligned}$$

which results in the following update equations for learning the weights U and W of the neural network:

$$\begin{aligned} \begin{aligned} u_{ij, t + 1}&= u_{ij, t} + \alpha \delta _{t} \Big [ (1 - \eta ) \phi _{j, t} h_{ij, t} + \eta w_{j, t} \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \Big ] \\ h_{ij, t + 1}&= h_{ij, t} \Big ( 1 - \alpha (1 - \eta ) \phi _{j, t}^{2} \Big ) + \alpha \Big ( \delta _{t} - \eta w_{j, t}\phi _{j, t} \Big ) \frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \\ w_{j, t + 1}&= w_{j, t} + \alpha \delta _{t} \phi _{j, t} \end{aligned} \end{aligned}$$
(7)

The algorithms derived in Eqs. (6) and (7) become computationally expensive when there are many outgoing weights per hidden unit. Specifically, when there are k output units, \( \delta _{t} \) becomes a k-dimensional vector. This leads to a large computational cost in computing \( h_{ij, t} \), which can be avoided by approximating the \( h_{ij, t} \) parameter. The approximation involves accumulating the error assigned to each hidden unit through its outgoing weights and using this accumulated error to compute the update term. This approximated algorithm is referred to as crossprop-approx. in our experiments and has the following update equations:

$$\begin{aligned} \begin{aligned} u_{ij, t + 1}&= u_{ij, t} + \alpha \sum _{k} \delta _{k, t} \Big [ (1 - \eta ) \phi _{j, t} h_{jk, t} + \eta w_{jk, t} \Big ]\frac{\partial \phi _{j, t}}{\partial u_{ij, t}} \\ h_{jk, t + 1}&= h_{jk, t} \Big ( 1 - \alpha (1 - \eta ) \phi _{j, t}^{2} \Big ) + \alpha \Big ( \delta _{k, t} - \eta w_{jk, t} \phi _{j, t} \Big ) \\ w_{jk, t + 1}&= w_{jk, t} + \alpha \delta _{k, t} \phi _{j, t} \end{aligned} \end{aligned}$$
(8)
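A vectorized sketch of these crossprop-approx. updates follows, assuming a tanh hidden layer, k output units, and per-output errors \( \delta _{k, t} \) supplied by the output layer (e.g., from a softmax with cross-entropy). The names and shapes are our own choices, not the released implementation.

```python
import numpy as np

def crossprop_approx_step(x, phi, delta, U, W, H, alpha=0.0005, eta=0.0):
    """One crossprop-approx. update (Eq. 8) for k output units and tanh hidden units.

    x: (m,) input, phi: (n,) hidden activations, delta: (k,) per-output errors,
    U: (m, n) incoming weights, W: (n, k) outgoing weights, H: (n, k) memory traces.
    Updates U, W, H in place.
    """
    dphi = 1.0 - phi ** 2                          # tanh derivative factor (n,)

    # Error accumulated per hidden unit through its outgoing weights
    g = (1.0 - eta) * phi * (H @ delta) + eta * (W @ delta)

    U += alpha * np.outer(x, g * dphi)             # incoming-weight update

    # Memory-trace update, using the pre-update outgoing weights
    H *= (1.0 - alpha * (1.0 - eta) * phi ** 2)[:, None]
    H += alpha * (delta[None, :] - eta * W * phi[:, None])

    W += alpha * np.outer(phi, delta)              # outgoing-weight update
```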

4 Experiments and Results

Here we empirically investigate whether crossprop is effective in finding useful representations for continual learning tasks, and we compare it with backprop and its many variations (such as adding momentum, RMSProp and ADAM). By continual learning tasks, we refer to an experiment setting where supervised training examples are generated and presented to a learning system from a sequence of related tasks. The learning system does not know when the task is switched.

4.1 GEOFF Tasks

The GEneric Online Feature Finding (GEOFF) problem was first introduced by Sutton (2014) as a generic, synthetic, feature-finding test bed for evaluating different representation learning algorithms. The primary advantage of this test bed is that infinitely many supervised-learning tasks can be generated without any experimenter bias.

The test bed consists of a single-hidden-layer neural network, called the target network, with a real-valued scalar output. Each input example, \( X_{t} \in \{ 0, 1 \}^{m} \), is an m-dimensional binary vector where each element can take a value of 0 or 1. The hidden layer consists of n Linear Threshold Units (LTUs), \( \varvec{\phi }_{t}^{*} \in \{ 0, 1 \}^{n} \), with a threshold parameter \( \beta \), which controls the sparsity of the hidden layer. The weights \( U^{*} \in \{ -1, +1 \}^{m \times n} \) map the input vector to the hidden units, and the weights \( W^{*} \in \{ -1, 0, +1 \}^{n} \) linearly combine the LTUs (features) to produce a scalar target output \( y^{*} \). The weights \( U^{*} \) and \( W^{*} \) are generated using a uniform probability distribution and remain fixed throughout a task, representing a stationary function mapping a given input vector \( X_{t} \) to a scalar target output \( y_{t}^{*} \). The input vector \( X_{t} \) is generated randomly using a uniform probability distribution. For each input vector, the target network produces a scalar target output \( y_{t}^{*} = \sum _{i=1}^{n} \phi _{i, t}^{*} w^{*}_{i} + \mathcal {N}(0, 1) \). For our experiments, we fix \( m = 20 \), \( n = 1000 \) and \( \beta = 0.6 \) (the parameters of the target network).
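A sketch of such a target network and its example generator is given below. The exact LTU threshold rule is not specified above, so the rule used here (threshold \( m\beta \) minus the number of \( -1 \) incoming weights of the unit) is our assumption and may differ in detail from the original test bed.

```python
import numpy as np

def make_geoff_task(m=20, n=1000, beta=0.6, rng=np.random.default_rng(0)):
    """Create one GEOFF target network (a sketch; the LTU threshold rule is assumed)."""
    U_star = rng.choice([-1, 1], size=(m, n))            # input -> LTU weights
    W_star = rng.choice([-1, 0, 1], size=n)               # LTU -> output weights
    # Assumed threshold: m*beta minus the number of -1 weights into each unit,
    # so that beta controls the sparsity of active LTUs.
    theta = m * beta - (U_star == -1).sum(axis=0)
    return U_star, W_star, theta

def geoff_example(U_star, W_star, theta, rng=np.random.default_rng(1)):
    """Sample one (X_t, y*_t) pair from the target network."""
    m = U_star.shape[0]
    x = rng.integers(0, 2, size=m)                        # binary input vector
    phi_star = (x @ U_star > theta).astype(float)         # LTU activations
    y_star = float(phi_star @ W_star) + rng.normal(0.0, 1.0)  # noisy scalar target
    return x, y_star
```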

Experiment setup. For our experiments, we create an instance of the GEOFF task, call it Task A, and use it to generate a set of 5000 examples. These examples are used for training the learning systems in an online manner, where each example is processed once and then discarded. After processing the examples from Task A, we create Task B by randomly choosing and regenerating 50% of the outgoing target weights \( W^{*} \), and generate another 5000 training examples from this modified task. Similarly, after processing the examples from Task B, Task C is produced and used to generate another 5000 training examples. The learning systems thus learn online from training examples produced by a sequence of related tasks (Tasks A, B & C), where the representations learned on one task can be used for solving the others. It is important to point out that all these tasks share the same feature representation (i.e., the weights \( U^{*} \) remain fixed throughout), so the system can leverage its previous learning experiences.
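The task-switching step (regenerating 50% of the outgoing target weights) can be sketched as follows; naming is ours.

```python
import numpy as np

def next_task(W_star, frac=0.5, rng=np.random.default_rng(2)):
    """Form the next GEOFF task by regenerating a fraction of the target weights W*."""
    W_new = W_star.copy()
    idx = rng.choice(len(W_star), size=int(frac * len(W_star)), replace=False)
    W_new[idx] = rng.choice([-1, 0, 1], size=len(idx))
    return W_new
```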

This experiment was designed from a continual learning perspective, where a learning system experiences examples generated from a sequence of related tasks and learning from one task can help in learning other, similar tasks. The step-size for all the algorithms was fixed at a constant value (\( \alpha = 0.0005 \)), as the objective here is to show how the features are learned by different algorithms over a sequence of related learning tasks. The learning network was a single-hidden-layer neural network with a single output unit, 20 input units, and 500 hidden units using a \( \tanh \) activation function. The squared error function was used for learning the parameters of this network.

Results. We compare the behavior of crossprop with backprop and its variations on a sequence of related tasks generated using the GEOFF testbed. Fig. 1 Left shows the learning curve for different algorithms. After every 5000 examples, the task switches to a new and related task as previously described. It is important to note here that the learning system does not know that the task has changed.

The learning curves show that crossprop reaches an asymptotic value similar to that of backprop, implying that the introduced algorithm arrives at a comparable solution, although backprop attains a significantly better asymptotic value than crossprop and the other variations of backprop. It is interesting to note, however, that these learning algorithms approach their solutions differently.

Fig. 1.

The learning curves of crossprop (i.e., crossprop and crossprop-approx. with \( \eta = 0, 0.5 \)) are colored blue, and those of backprop (backprop, momentum, RMSProp, ADAM) are colored red. The learning systems do not know that the task has changed. Left: the learning curves on a series of GEOFF tasks, averaged over 30 independent runs where each run used a different target network. Right: the learning curves on a series of MNIST tasks, obtained from a single run using the MNIST training set. In the MNIST experiments, only crossprop-approx. was evaluated. (Color figure online)

Fig. 2.

Left: generated from GEOFF tasks. Right: generated from MNIST tasks. The plots show the change in the incoming weights U after processing each example. Specifically, they show the \(l^{2}\) norm between the incoming weights U after processing the n\(^\mathrm{th}\) example and their initial values, for different learning algorithms. It can be observed that crossprop tends to change the incoming weights the least, even when the task changes significantly, implying that crossprop tends to find a reusable feature representation that can sufficiently solve the sequence of tasks it experiences. Backprop, on the other hand, tends to significantly relearn the feature representation throughout the experiment, even when the task can be solved by leveraging previous learning experiences.

Fig. 3.

Left: generated from GEOFF tasks. Right: generated from MNIST tasks. The plots show the change in the outgoing weights W after processing each example, as the \(l^{2}\) norm between the outgoing weights W after processing the n\(^\mathrm{th}\) example and their initial values, for different learning algorithms. It can be observed that crossprop tends to change the outgoing weights the most, as they are needed to map the feature representation to an estimate \( y_{t} \). In backprop, each parameter independently minimizes the error, and because of this the parameters race against each other in reducing the error without coordinating their efforts, so backprop tends to change its feature representation to accommodate each new example.

Figure 2 Left shows the Euclidean (\( l^{2} \)) norm between the weights U after processing the n\(^\mathrm{th}\) training example and the initial values of the same weights. Though all the algorithms reach similar asymptotic values, the way backprop achieves this is clearly different from that of crossprop. Backprop tends to frequently modify the features even though it has seen examples generated using a previously learned function; in other words, backprop fails to leverage its previous learning experiences in solving new tasks even when it is possible. Because of this, backprop takes a long time to find a feature representation that can sufficiently solve this continual problem. This is clearly not the case with crossprop. Our proposed algorithm finds, much more quickly than backprop, a feature representation that can sufficiently solve the sequence of continual problems, and it reuses this representation for solving new tasks that it encounters in the future.

Figure 3 Left shows the Euclidean norm between the weights W after processing the n\(^\mathrm{th}\) example and the initial values of the same weights. Because crossprop finds its set of features much more quickly than backprop and reuses these features while solving a new task, it reduces the error by moving the outgoing weights rather than by modifying its feature representation. Furthermore, all the tasks presented to the learning system can be solved using a single feature representation, and from our plots it can be clearly seen that crossprop recognizes this.
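The drift measure plotted in Figs. 2 and 3 is simply a norm of the weight change; a sketch with our own naming:

```python
import numpy as np

def weight_drift(W_current, W_initial):
    """l2 norm between the current weights and their initial values,
    the quantity plotted in Figs. 2 and 3."""
    return float(np.linalg.norm(W_current - W_initial))
```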

4.2 MNIST Tasks

The MNIST dataset of handwritten digits was introduced by LeCun et al. (1998). Though the dataset is old, it is still viewed as a standard supervised learning benchmark for testing new learning algorithms (Sironi et al. 2015; Papernot et al. 2016).

The dataset consists of grayscale images of dimension \( 28 \times 28 \). The images are of handwritten digits, and the corresponding labels denote the supervised learning target for each image. The objective of a learning system in an MNIST task is to learn a function that maps each of these images to its label.

Experiment setup. We adapt the MNIST dataset to a continual learning setting in which, for each new task, the labels of the training images are shifted by one. For example, Task A uses the standard MNIST training images and their labels; Task B uses the same training examples as Task A, but with the labels shifted by one; for Task C the labels are shifted by one more. As in our previous experiment, we fix the step-size (\( \alpha = 0.0005 \)) for the different algorithms, as our objective is to study how the representations are learned by these algorithms in a continual learning setting, where the learning system experiences examples from a sequence of related tasks. The learning system was a single-hidden-layer neural network with 784 input units, 1024 hidden units and 10 output units. The hidden units used a \( \tanh \) activation function and the output units used a softmax activation function. The cross-entropy error function was used for training the network.
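A sketch of this label-shifting construction is below; we assume the shifted labels wrap around modulo 10, which the description above does not state explicitly.

```python
import numpy as np

def make_shifted_task(images, labels, shift):
    """Return the training set for one task in the continual MNIST setting.

    Task A corresponds to shift=0, Task B to shift=1, Task C to shift=2.
    The wrap-around modulo 10 is our assumption.
    """
    return images, (labels + shift) % 10
```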

Results. Figure 1 Right shows the learning curves for all the methods evaluated on the MNIST tasks. As in the GEOFF tasks, the learning curves of the different algorithms converge to similar points, which means that all the methods reach similar solutions. However, ADAM and RMSProp achieve a significantly better asymptotic error value than the other learning algorithms.

Figures 2 Right and 3 Right show the Euclidean norm of the change in the weights U and W respectively. As in our previous experiments, crossprop tends to find its features much more quickly than backprop and its variations, and it tends to reuse these features in solving the new tasks that it faces. It is interesting to observe that backprop does not seem to settle on a good feature representation for solving a sequence of continual learning problems. It tends to naïvely unlearn and relearn its feature representation even when the tasks are similar to each other and can be solved using the feature representation learned from the first task. In particular, backprop does not seem to leverage its previous learning experiences when encountering a new task.

Fig. 4.

Backprop (Left) and crossprop (Right) seem to learn similar feature representations, clustering together similar examples. The plot shows visualizations of the features (i.e., activations of the hidden units) learned by backprop and crossprop on the standard MNIST task. Both learning algorithms were trained online on the MNIST dataset, and the parameters learned by these algorithms were used to generate the visualizations. \( \eta \) was set to 0 for crossprop in order to draw out differences between the conventional and the meta-gradient descent approach to learning the features. The plot was generated using 2500 training examples, uniformly sampled from the MNIST dataset.

5 Visualizing the Learned Features

We visualize the features obtained while training the learning systems with crossprop and backprop. These visualizations are obtained using the t-SNE approach, which was developed by Maaten and Hinton (2008) for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. Here, we show only the two-dimensional map generated from the features learned by the different learning algorithms.
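A minimal sketch of this visualization step, using scikit-learn's TSNE on the matrix of hidden-unit activations (our own naming; the original visualization code is not shown here):

```python
from sklearn.manifold import TSNE

def embed_features_2d(phi, seed=0):
    """Map hidden-unit activations (examples x hidden units) to a 2-D layout
    with t-SNE for visualization, as in Fig. 4."""
    return TSNE(n_components=2, random_state=seed).fit_transform(phi)
```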

The features learned by backprop and crossprop (with \( \eta \) set to 0) on a standard MNIST task are plotted in Fig. 4. From the visualizations, it can be observed that both algorithms produce similar feature representations, clustering examples according to their labels; judging by their features alone, there does not seem to be much difference between them.

6 Discussions

Neural networks and backprop form a powerful, hierarchical feature learning paradigm. Designing deep neural network architectures has allowed many learning systems to achieve levels of performance comparable to that of humans in many domains. Much recent research, however, fails to notice, or simply ignores, the fundamental issues present in backprop, even though it is important to address them.

Some works even provide ad-hoc solutions to these fundamental problems posed by backprop, but these solutions usually do not scale to general domains. Over time, many modifications to backprop have been introduced, but they still fail to address its fundamental issue: backprop tends to interfere with its previously learned representations in order to accommodate a new example. This prevents backprop from being applied directly to continual learning domains, which is critical for achieving Artificial Intelligence (Ring 1997; Kirkpatrick et al. 2017).

In a continual learning setting, a learning system needs to progressively learn and hierarchically accumulate knowledge from its experiences and use this knowledge to solve many difficult, unseen tasks. In such a setting, it is not desirable to have a learning system that naïvely unlearns and relearns even when it sees a task that can be solved by reusing learning from its past experiences. Rather, it is necessary to have a learning system that can hierarchically build knowledge from its previous experiences and use it to solve completely new and unseen tasks.

In this paper, we presented two continual learning settings adapted from standard supervised learning domains: the GEOFF test bed and the MNIST dataset. On these tasks, we evaluated backprop and its variations (momentum, RMSProp and ADAM), along with our proposed meta-gradient descent approach for learning the features of a neural network, called crossprop. We showed that backprop (and its variations) tends to relearn its feature representations for every task, even when these tasks can be solved by reusing the feature representation learned from previous experiences. Crossprop, on the other hand, tends to reuse its previously learned representations when tackling new and unseen tasks. Consistently failing to leverage previous learning experiences is undesirable in a continual learning setting and prevents backprop from being applied directly to such settings. Addressing this particular issue is the primary motivation for our work.

As immediate future work, we would like to study the performance of this meta-gradient descent approach in deep neural networks and comprehensively evaluate it on more difficult benchmarks, like IMAGENET (Deng et al. 2009) and the Arcade Learning Environment (Bellemare et al. 2013).

7 Conclusions

In this paper, we introduced a meta-gradient descent approach, called crossprop, for learning the incoming weights of hidden units in a neural network, and showed that such approaches are complementary to backprop, the popular algorithm for training neural networks. We also showed that, by using crossprop, a learning system can learn to reuse its learned features for solving new and unseen tasks. We see this as a first general step towards comprehensively addressing and overcoming the fundamental issues posed by backprop, particularly for continual learning domains.