On the relationship between predictive coding and backpropagation

Abstract

Artificial neural networks are often interpreted as abstract models of biological neuronal networks, but they are typically trained using the biologically unrealistic backpropagation algorithm and its variants. Predictive coding has been proposed as a potentially more biologically realistic alternative to backpropagation for training neural networks. This manuscript reviews and extends recent work on the mathematical relationship between predictive coding and backpropagation for training feedforward artificial neural networks on supervised learning tasks. Implications of these results for the interpretation of predictive coding and deep neural networks as models of biological learning are discussed along with a repository of functions, Torch2PC, for performing predictive coding with PyTorch neural network models.

Introduction

The backpropagation algorithm and its variants are widely used to train artificial neural networks. While artificial and biological neural networks share some common features, a direct implementation of backpropagation in the brain is often considered biologically implausible in part because of the nonlocal nature of parameter updates: The update to a parameter in one layer depends on activity in all deeper layers. In contrast, biological neural networks are believed to learn largely through local synaptic plasticity rules for which changes to a synaptic weight depend on neural activity local to that synapse. While neuromodulators can have non-local impact on synaptic plasticity, they are not believed to be sufficiently specific to implement the precise, high-dimensional credit assignment required by backpropagation. However, some work has shown that global errors and neuromodulators can work with local plasticity to implement effective learning algorithms [1, 2]. Backpropagation can be performed using local updates if gradients of neurons’ activations are passed upstream through feedback connections, but this interpretation implies other biologically implausible properties of the network, like symmetric feedforward and feedback weights. See previous work [3, 4] for a more complete review of the biological plausibility of backpropagation.

Several approaches have been proposed for achieving or approximating backpropagation with ostensibly more biologically realistic learning rules [2–14]. One such approach [11–14] is derived from the theory of “predictive coding” or “predictive processing” [15–23]. A relationship between predictive coding and backpropagation was first discovered by Whittington and Bogacz [11] who showed that, when predictive coding is used to train a feedforward neural network on a supervised learning task, it can produce parameter updates that approximate those computed by backpropagation. These original results have since been extended to more general network architectures and to show that modifying predictive coding by a “fixed prediction assumption” leads to an algorithm that produces the exact same parameter updates as backpropagation [12–14].

This manuscript reviews and extends previous work [11–14] on the relationship between predictive coding and backpropagation, as well as some implications of these results on the interpretation of predictive coding and artificial neural networks as models of biological learning. The main results in this manuscript are as follows:

  1. Accounting for covariance or precision matrices in hidden layers does not affect parameter updates (learning) for predictive coding under the “fixed prediction assumption” used in previous work.
  2. Predictive coding under the fixed prediction assumption is algorithmically equivalent to a direct implementation of backpropagation, which raises the question of whether it should be interpreted as more biologically plausible than backpropagation.
  3. Empirical results show that the magnitudes of prediction errors do not necessarily correspond to surprising features of inputs.

In addition, a public repository of Python functions, Torch2PC, is introduced. These functions can be used to perform predictive coding on any PyTorch Sequential model (see Materials and methods).

Results

A review of the relationship between backpropagation and predictive coding from previous work

For completeness, let us first review the backpropagation algorithm. Consider a feedforward deep neural network (DNN) defined by
(1) v̂_ℓ = f_ℓ(v̂_{ℓ−1}; θ_ℓ), ℓ = 1, …, L,
with v̂_0 = x, where each v̂_ℓ is a vector or tensor of activations, each θ_ℓ is a set of parameters for layer ℓ, and L is the network’s depth. In supervised learning, one seeks to minimize a loss function ℒ(ŷ, y) where y is a label associated with the input, x, and ŷ = v̂_L is the network’s output, which depends on the parameters θ = {θ_1, …, θ_L}. The loss is typically minimized using gradient-based optimization methods with gradients computed using automatic differentiation tools based on the backpropagation algorithm. For completeness, backpropagation is reviewed in the pseudocode below.

Algorithm 1 A standard implementation of backpropagation.

Given: Input (x) and label (y)

# forward pass
v̂_0 = x
for ℓ = 1, …, L
  v̂_ℓ = f_ℓ(v̂_{ℓ−1}; θ_ℓ)

# backward pass
δ_L = ∂ℒ(v̂_L, y)/∂v̂_L
for ℓ = L − 1, …, 1
  δ_ℓ = δ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ

# parameter gradient computation
for ℓ = 1, …, L
  dθ_ℓ = δ_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ

A direct application of the chain rule and mathematical induction shows that backpropagation computes the gradients of the loss with respect to the parameters, dθ_ℓ = ∂ℒ(ŷ, y)/∂θ_ℓ. The negative gradients, −dθ_ℓ, are then used to update parameters, either directly for stochastic gradient descent or indirectly for other gradient-based learning methods [24]. For the sake of comparison, I used backpropagation to train a 5-layer convolutional neural network on the MNIST data set (Fig 1A and 1B; blue curves). I next review algorithms derived from the theory of predictive coding and their relationship to backpropagation, as originally derived in previous work [11–14].
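To make the recursion in Algorithm 1 concrete, the following minimal sketch (illustrative only, not taken from the paper's repository) implements the forward and backward passes for a small fully connected toy network in PyTorch, using vector-Jacobian products for the δ_ℓ terms, and checks the resulting parameter gradients against a single autograd call on the loss. The layer function f, the layer sizes, and all variable names are arbitrary choices made for this example.

import torch

torch.manual_seed(0)

# a small toy network: vhat_l = f_l(vhat_{l-1}; theta_l) with tanh layers
L = 3
dims = [4, 5, 5, 3]
weights = [torch.randn(dims[l], dims[l + 1], requires_grad=True) for l in range(L)]

def f(l, v):
    # layer function f_l: a linear map followed by tanh (an arbitrary choice)
    return torch.tanh(v @ weights[l])

x = torch.randn(1, dims[0])    # input
y = torch.randn(1, dims[-1])   # target, used with a squared-error loss

# forward pass
vhat = [x]
for l in range(L):
    vhat.append(f(l, vhat[l]))
loss = 0.5 * ((vhat[-1] - y) ** 2).sum()

# backward pass: delta_L = dLoss/dvhat_L, then delta_l = delta_{l+1} times the Jacobian
delta = [None] * (L + 1)
delta[L] = torch.autograd.grad(loss, vhat[L], retain_graph=True)[0]
for l in range(L - 1, 0, -1):
    delta[l] = torch.autograd.grad(vhat[l + 1], vhat[l], grad_outputs=delta[l + 1],
                                   retain_graph=True)[0]

# parameter gradients dtheta_l = delta_l * d f_l(vhat_{l-1}) / d theta_l
dtheta = [torch.autograd.grad(vhat[l + 1], weights[l], grad_outputs=delta[l + 1],
                              retain_graph=True)[0] for l in range(L)]

# the layerwise recursion matches a single autograd call on the loss
auto = torch.autograd.grad(loss, weights, retain_graph=True)
for l in range(L):
    print(f"layer {l + 1}: matches autograd:", torch.allclose(dtheta[l], auto[l], atol=1e-6))

The same vector-Jacobian pattern is reused in the predictive coding sketches below.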

Fig 1. Comparing backpropagation and predictive coding in a convolutional neural network trained on MNIST.

A,B) The loss (A) and accuracy (B) on the training set (pastel) and test set (dark) when a 5-layer network was trained using a strict implementation of predictive coding (Algorithm 2 with η = 0.1 and n = 20; red) and backpropagation (blue). C,D) The relative error (C) and angle (D) between the parameter update, dθ_ℓ, computed by Algorithm 2 and the negative gradient of the loss at each layer. Predictive coding and backpropagation give similar accuracies, but the parameter updates are less similar.

https://doi.org/10.1371/journal.pone.0266102.g001

A strict interpretation of predictive coding does not accurately compute gradients.

I begin by reviewing supervised learning under a strict interpretation of predictive coding. The formulation in this section is equivalent to the one first studied by Whittington and Bogacz [11], except that their results are restricted to the case in which f_ℓ(v_{ℓ−1}; θ_ℓ) = θ_ℓ g(v_{ℓ−1}) for some point-wise-applied activation function, g, and connectivity matrix, θ_ℓ. Our formulation extends this to arbitrary vector-valued differentiable functions, f_ℓ. For the sake of continuity with later sections, I also use the notational conventions from [12], which differ from those in [11].

Predictive coding can be derived from a hierarchical, Gaussian probabilistic model in which each layer, ℓ, is associated with a Gaussian random variable, V_ℓ, satisfying
(2) p(v_ℓ | v_{ℓ−1}) = N(v_ℓ; f_ℓ(v_{ℓ−1}; θ_ℓ), Σ_ℓ)
where N(v; μ, Σ) is the multivariate Gaussian distribution with mean, μ, and covariance matrix, Σ, evaluated at v. Following previous work [11–14], I take Σ = I to be the identity matrix, but later relax this assumption [21].

If we condition on an observed input, V_0 = x, then a forward pass through the network described by Eq (1) corresponds to setting v̂_0 = x and then sequentially computing the conditional expectations, v̂_ℓ = E[V_ℓ | V_{ℓ−1} = v̂_{ℓ−1}] = f_ℓ(v̂_{ℓ−1}; θ_ℓ), or, equivalently, maximizing the conditional probabilities, until reaching an inferred output, ŷ = v̂_L. Note that this forward pass does not necessarily maximize the global conditional probability, and it does not account for a prior distribution on V_L, which arises in related work on predictive coding for unsupervised learning [15, 21]. One interpretation of a forward pass is that each v̂_ℓ is the network’s “belief” about the state of V_ℓ when only V_0 = x has been observed.

Now suppose that we condition on both an observed input, V_0 = x, and its label, V_L = y. In this case, generating beliefs about the hidden states, V_ℓ, is more difficult because we need to account for potentially conflicting information at each end of the network. We can proceed by initializing a set of beliefs, v_ℓ, about the state of each V_ℓ, and then updating our initial beliefs to be more consistent with the observations, x and y, and parameters, θ_ℓ.

The error made by a set of beliefs, v_ℓ, under parameters, θ_ℓ, can be quantified by the prediction errors ϵ_ℓ = v_ℓ − f_ℓ(v_{ℓ−1}; θ_ℓ) for ℓ = 1, …, L − 1, where v_0 = V_0 = x is observed. It is not so simple to quantify the error, ϵ_L, made at the last layer in a way that accounts for arbitrary loss functions. Consider the special case of a squared-Euclidean loss function, ℒ(ŷ, y) = ‖y − ŷ‖²/2, where ‖u‖² = uᵀu. Standard formulations of predictive coding [20, 21] effectively clamp the output-layer belief to the observed label and use
(3) ϵ_L = y − f_L(v_{L−1}; θ_L),
where recall that y is the label. In this case, ϵ_L satisfies
(4) ϵ_L = −∂ℒ(ỹ, y)/∂ỹ
where ỹ = f_L(v_{L−1}; θ_L). We use the tilde to emphasize that ỹ is different from ŷ (which is defined by a forward pass starting at v̂_0 = x) and is defined in a fundamentally different way from the v_ℓ terms (which do not necessarily satisfy v_ℓ = f_ℓ(v_{ℓ−1}; θ_ℓ)). We can then define the total summed magnitude of errors as F = (1/2) Σ_ℓ ‖ϵ_ℓ‖². More details on the derivation of F in terms of variational Bayesian inference can be found in previous work [12, 16, 20, 21] where F is known as the variational free energy of the model. Essentially, minimizing F produces a model that is more consistent with the observed data. Minimizing F by gradient descent on v_ℓ and θ_ℓ produces the inference and learning steps of predictive coding, respectively.

Under a more heuristic interpretation, v_ℓ represents the network’s “belief” about V_ℓ, and f_ℓ(v_{ℓ−1}; θ_ℓ) is the “prediction” of v_ℓ made by the previous layer. Under this interpretation, ϵ_ℓ is the error made by the previous layer’s prediction, so ϵ_ℓ is called a “prediction error.” Then F quantifies the total magnitude of prediction errors given a set of beliefs, v_ℓ, parameters, θ_ℓ, and observations, V_0 = x and V_L = y.

In predictive coding, beliefs, v_ℓ, are updated to minimize the error, F. This can be achieved by gradient descent, i.e., by making updates of the form v_ℓ ← v_ℓ + η dv_ℓ where η is a step size and
(5) dv_ℓ = −∂F/∂v_ℓ = ϵ_{ℓ+1} ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ − ϵ_ℓ.
In this expression, ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ is a Jacobian matrix and ϵ_{ℓ+1} is a row-vector to simplify notation, but a column-vector interpretation is similar. If x is a mini-batch instead of one data point, then v_ℓ is an m × n matrix and derivatives are tensors. These conventions are used throughout the manuscript. The updates in Eq (5) can be iterated until convergence or approximate convergence. Note that the prediction errors, ϵ_ℓ = v_ℓ − f_ℓ(v_{ℓ−1}; θ_ℓ), should also be updated on each iteration.

Learning can also be phrased as minimizing F with gradient descent on parameters. Specifically, θ_ℓ ← θ_ℓ + η_θ dθ_ℓ where
(6) dθ_ℓ = −∂F/∂θ_ℓ = ϵ_ℓ ∂f_ℓ(v_{ℓ−1}; θ_ℓ)/∂θ_ℓ.
Note that some previous work defines the prediction errors with the opposite sign. While this choice changes the signs in some of the expressions above, the value of F and its dependence on θ is not changed because F is defined by the norms of the ϵ_ℓ terms. The complete algorithm is defined more precisely by the pseudocode below:

Algorithm 2 A direct interpretation of predictive coding.

Given: Input (x), label (y), and initial beliefs (v_ℓ)

# error and belief computation
for i = 1, …, n
  ϵ_L = y − f_L(v_{L−1}; θ_L)
  for ℓ = L − 1, …, 1
    ϵ_ℓ = v_ℓ − f_ℓ(v_{ℓ−1}; θ_ℓ)
    dv_ℓ = ϵ_{ℓ+1} ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ − ϵ_ℓ
    v_ℓ = v_ℓ + η dv_ℓ

# parameter update computation
for ℓ = 1, …, L
  dθ_ℓ = ϵ_ℓ ∂f_ℓ(v_{ℓ−1}; θ_ℓ)/∂θ_ℓ

Here and elsewhere, n denotes the number of iterations for the inference step. The choice of initial beliefs is not specified in the algorithm above, but previous work [11–14] uses the results from a forward pass, v_ℓ = v̂_ℓ, as initial conditions, and I do the same in all numerical examples.
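To illustrate the inference and learning steps concretely, the following sketch continues the toy example above (reusing f, weights, L, x, y, and vhat) and runs Algorithm 2 by performing gradient descent on F with autograd, with the output-layer belief clamped to the label as in Eq (3); it then reports how far the resulting parameter updates are from the negative gradient of the loss, mirroring the comparison in Fig 1C. This is a minimal sketch of the algorithm as reconstructed here, not the paper's implementation, and the sign conventions follow the ones assumed above.

eta, n = 0.1, 20

def free_energy(v):
    # F = (1/2) * sum_l ||eps_l||^2 with eps_l = v_l - f_l(v_{l-1}; theta_l)
    # and the output-layer belief clamped to the label, v_L = y
    F = 0.0
    for l in range(1, L + 1):
        eps_l = v[l] - f(l - 1, v[l - 1])
        F = F + 0.5 * (eps_l ** 2).sum()
    return F

# beliefs: v_0 = x observed, hidden beliefs initialized to forward-pass values,
# output belief clamped to the label y
v = [x] + [vh.detach().clone().requires_grad_(True) for vh in vhat[1:L]] + [y]

# inference: gradient descent on F with respect to the hidden beliefs
for _ in range(n):
    F = free_energy(v)
    grads = torch.autograd.grad(F, v[1:L])
    with torch.no_grad():
        for l in range(1, L):
            v[l] -= eta * grads[l - 1]        # v_l <- v_l + eta * dv_l

# learning: dtheta_l = -dF/dtheta_l evaluated at the (approximately) converged beliefs
dtheta_pc = [-g for g in torch.autograd.grad(free_energy(v), weights)]

# compare with the negative gradient of the loss from a fresh forward pass
vh = [x]
for l in range(L):
    vh.append(f(l, vh[l]))
loss = 0.5 * ((vh[-1] - y) ** 2).sum()
neg_grad = [-g for g in torch.autograd.grad(loss, weights)]
for l in range(L):
    rel = (dtheta_pc[l] - neg_grad[l]).norm() / neg_grad[l].norm()
    print(f"layer {l + 1}: relative error vs. negative gradient = {rel:.3f}")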

I tested Algorithm 2 on MNIST using a 5-layer convolutional neural network. To be consistent with the definitions above, I used a mean-squared error (squared Euclidean) loss function, which required one-hot encoded labels [24]. Algorithm 2 performed similarly to backpropagation (Fig 1A and 1B) even though the parameter updates did not match the true gradients (Fig 1C and 1D). Algorithm 2 was slower than backpropagation (31s for Algorithm 2 versus 8s for backpropagation when training metrics were not computed on every iteration) in part because Algorithm 2 requires several inner iterations to compute the prediction errors (n = 20 iterations used in this example). Algorithm 2 failed to converge on a larger model. Specifically, the loss grew consistently with iterations when trying to use Algorithm 2 to train the 6-layer CIFAR-10 model described in the next section. S1 Fig shows the same results from Fig 1 repeated across 30 trials with different random seeds to quantify the mean and standard deviation across trials.

Fig 1C and 1D show that predictive coding does not update parameters according to the true gradients, but it is not immediately clear whether this would be resolved by using more iterations (larger n) or different values of the step size, η. I next compared the parameter updates, dθ_ℓ, to the true gradients for different values of n and η (Fig 2). For the smaller values of η tested (η = 0.1 and η = 0.2) and larger values of n (n > 100), parameter updates were similar to the true gradients in the last two layers, but they differed substantially in the first two layers. The largest values of η tested (η = 0.5 and η = 1) caused the iterations in Algorithm 2 to diverge.

Fig 2. Comparing parameter updates from predictive coding to true gradients in a network trained on MNIST.

Relative error and angle between the parameter updates, dθ_ℓ, produced by predictive coding (Algorithm 2) and the exact gradients computed by backpropagation (relative error defined by ‖dθ_pc − dθ_bp‖/‖dθ_bp‖). Updates were computed as a function of the number of iterations, n, used in Algorithm 2 for various values of the step size, η, using the model from Fig 1 applied to one mini-batch of data. Both models were initialized identically to the pre-trained parameter values from the trained model in Fig 1. Parameter updates converge near the gradients after many iterations for smaller values of η, but diverge for larger values.

https://doi.org/10.1371/journal.pone.0266102.g002

Some choices in designing Algorithm 2 were made arbitrarily. For example, the three updates inside the inner for-loop over ℓ could be performed in a different order, or the outer for-loop over i could be changed to a while-loop with a convergence criterion. For any initial conditions and any of these design choices, if the iterations over i are repeated until convergence or approximate convergence of each v_ℓ to a fixed point, then the increments must satisfy dv_ℓ = 0 at the fixed point, and therefore the fixed point values of the prediction errors must satisfy
(7) ϵ_ℓ = ϵ_{ℓ+1} ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ
for ℓ = 1, …, L − 1. By the definition of ϵ_L, we have
(8) ϵ_L = −∂ℒ(ỹ, y)/∂ỹ
where ỹ = f_L(v_{L−1}; θ_L). Combining Eqs (7) and (8) gives the fixed point prediction errors of the penultimate layer,
(9) ϵ_{L−1} = ϵ_L ∂f_L(v_{L−1}; θ_L)/∂v_{L−1} = −∂ℒ(ỹ, y)/∂v_{L−1},
where we used the fact that ỹ = f_L(v_{L−1}; θ_L) and the chain rule. The error in layer L − 2 is given by ϵ_{L−2} = ϵ_{L−1} ∂f_{L−1}(v_{L−2}; θ_{L−1})/∂v_{L−2}. Note that we cannot apply the chain rule to reduce this product (like we did for Eq (9)) because it is not necessarily true that v_{L−1} = f_{L−1}(v_{L−2}; θ_{L−1}). I revisit this point below. We can continue this process to derive ϵ_{L−3} and continue for ℓ = L − 4, …, 1. In doing so, we see (by induction) that the fixed point prediction errors can be written as
(10) ϵ_ℓ = ϵ_L ∂f_L(v_{L−1}; θ_L)/∂v_{L−1} ∂f_{L−1}(v_{L−2}; θ_{L−1})/∂v_{L−2} ⋯ ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ
for ℓ = 1, …, L − 2. Therefore, if the inference loop converges to a fixed point, then the subsequent parameter update obeys
(11) dθ_ℓ = ϵ_ℓ ∂f_ℓ(v_{ℓ−1}; θ_ℓ)/∂θ_ℓ
with ϵ_ℓ given by Eq (10), by Eq (6). It is not clear whether there is a simple mathematical relationship between these parameter updates and the negative gradients, −∂ℒ/∂θ_ℓ, computed by backpropagation.

It is tempting to assume that v_ℓ = f_ℓ(v_{ℓ−1}; θ_ℓ) at the fixed point, in which case the product terms in Eq (10) would be reduced by the chain rule. Indeed, this assumption would imply that the fixed point prediction errors reflect the gradients of the loss with respect to the layers’ activations and, finally, that the parameter updates in Eq (11) are identical to the values computed by backpropagation. However, we cannot generally expect to have v_ℓ = f_ℓ(v_{ℓ−1}; θ_ℓ) because this would imply that ϵ_ℓ = 0 and therefore dθ_ℓ = 0. In other words, Algorithm 2 is only equivalent to backpropagation in the case where parameters are at a critical point of the loss function, so all updates are zero. Nevertheless, this thought experiment suggests a modification to Algorithm 2 for which the fixed points do represent the true gradients [11, 12]. I review that modification in the next section.

Note also that the calculations above rely on the assumption of a squared-Euclidean loss function. If we want to generalize the algorithm to different loss functions, then Eqs (3) and (4) could not both be true, and therefore Eqs (7) and (8) could not both be true. This leaves open the question of how to define ϵ_L when using loss functions that are not proportional to the squared Euclidean norm. If we were to define ϵ_L by Eq (3), at the expense of losing Eq (4), then the algorithm would not account for the loss function at all, so it would effectively assume a Euclidean loss, i.e., it would compute the same values that are computed by Algorithm 2 with a Euclidean loss. If we instead were to define ϵ_L by Eq (4) at the expense of Eq (3), then Eqs (5) and (7) would no longer be true for ℓ = L − 1 and Eq (6) would no longer be true for ℓ = L. Instead, all three of these equations would involve second-order derivatives of the loss function, and therefore the fixed point Eqs (10) and (11) would also involve second-order derivatives. The interpretation of the parameter updates is not clear in this case. One might instead try to define ϵ_L by the result of a forward pass, but then ϵ_L would be a constant with respect to v_{L−1}, so we would have ∂ϵ_L/∂v_{L−1} = 0, and therefore Eq (5) at ℓ = L − 1 would become dv_{L−1} = −ϵ_{L−1}, which has a fixed point at ϵ_{L−1} = 0. This would finally imply that all the errors converge to zero and therefore dθ_ℓ = 0 at the fixed point.

I next discuss a modification of Algorithm 2 that converges to the same gradients computed by backpropagation, and is applicable to general loss functions [11, 12].

Predictive coding modified by the fixed prediction assumption converges to the gradients computed by backpropagation.

Previous work [11, 12] proposed a modification of the predictive coding algorithm described above, called the “fixed prediction assumption,” which I now review. Motivated by the considerations in the last few paragraphs of the previous section, we can selectively substitute some terms of the form v_ℓ and f_ℓ(v_{ℓ−1}; θ_ℓ) in Algorithm 2 with v̂_ℓ (or, equivalently, f_ℓ(v̂_{ℓ−1}; θ_ℓ)) where v̂_ℓ are the results of the original forward pass starting from v̂_0 = x. Specifically, the following modifications are made to the quantities computed by Algorithm 2,
(12) ϵ_ℓ = v_ℓ − v̂_ℓ, dv_ℓ = ϵ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ − ϵ_ℓ, dθ_ℓ = ϵ_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ
for ℓ = 1, …, L − 1, while the output-layer error is fixed to the gradient of the loss with respect to the network’s output, ϵ_L = ∂ℒ(ŷ, y)/∂ŷ, so that arbitrary loss functions can be used. This modification can be interpreted as “fixing” the predictions at the values computed by a forward pass and is therefore called the “fixed prediction assumption” [11, 12]. Additionally, the initial conditions of the beliefs are set to the results from a forward pass, v_ℓ = v̂_ℓ, for ℓ = 1, …, L − 1. The complete modified algorithm is defined by the pseudocode below:

Algorithm 3 Supervised learning with predictive coding modified by the fixed prediction assumption. Adapted from the algorithm in [12] and similar to the algorithm from [11].

Given: Input (x) and label (y)

# forward pass
v̂_0 = x
for ℓ = 1, …, L
  v̂_ℓ = f_ℓ(v̂_{ℓ−1}; θ_ℓ)

# error and belief computation
ϵ_L = ∂ℒ(v̂_L, y)/∂v̂_L
v_ℓ = v̂_ℓ for ℓ = 1, …, L − 1
for i = 1, …, n
  for ℓ = L − 1, …, 1
    ϵ_ℓ = v_ℓ − v̂_ℓ
    dv_ℓ = ϵ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ − ϵ_ℓ
    v_ℓ = v_ℓ + η dv_ℓ

# parameter update computation
for ℓ = 1, …, L
  dθ_ℓ = ϵ_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ

Note, again, that some choices in Algorithm 3 were made arbitrarily. The three updates inside the inner for-loop over ℓ could be performed in a different order or the outer for-loop over i could be changed to a while-loop with a convergence criterion. Regardless of these choices, the fixed points can again be computed by setting dv_ℓ = 0 to obtain ϵ_ℓ = ϵ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ. Now note that ϵ_L is fixed at ϵ_L = ∂ℒ(v̂_L, y)/∂v̂_L, and we can combine these two equations to compute ϵ_{L−1} = ∂ℒ(ŷ, y)/∂v̂_{L−1}, where we used the chain rule and the fact that v̂_L = f_L(v̂_{L−1}; θ_L). Continuing this approach, we have ϵ_ℓ = ∂ℒ(ŷ, y)/∂v̂_ℓ for all ℓ = 1, …, L (where recall that ŷ = v̂_L is the output from the feedforward pass). Combining this with the modified definition of dθ_ℓ, we have dθ_ℓ = ϵ_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ = ∂ℒ(ŷ, y)/∂θ_ℓ, where we use the chain rule and the fact that v̂_ℓ = f_ℓ(v̂_{ℓ−1}; θ_ℓ). We may conclude that, if the inference step converges to a fixed point (dv_ℓ = 0), then Algorithm 3 computes the same values of dθ_ℓ as backpropagation, and also that the prediction errors, ϵ_ℓ, converge to the gradients, δ_ℓ = ∂ℒ/∂v̂_ℓ, computed by backpropagation. As long as the inference step approximately converges to a fixed point (dv_ℓ ≈ 0), we should expect the parameter updates from Algorithm 3 to approximate those computed by backpropagation. In the next section, I extend this result to show that a special case of the algorithm computes the true gradients in a fixed number of steps.
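As a concrete illustration, the following sketch continues the toy example above (reusing f, weights, dims, L, x, and y) and implements Algorithm 3 with vector-Jacobian products, packaged as a reusable helper so the effect of η and n can be explored; it then reports the relative error between the resulting updates and autograd gradients, mirroring Fig 3C. The helper name, the squared-error loss, and the sign convention ϵ_L = ∂ℒ/∂ŷ are choices made for this sketch, not taken from the paper's code.

def fpa_predictive_coding(eta, n):
    # one inference + learning step of Algorithm 3 (as reconstructed here)
    vhat = [x]
    for l in range(L):
        vhat.append(f(l, vhat[l]))
    loss = 0.5 * ((vhat[-1] - y) ** 2).sum()

    # output-layer error fixed to the gradient of the loss w.r.t. the output
    eps = [torch.zeros_like(vh) for vh in vhat]
    eps[L] = torch.autograd.grad(loss, vhat[L], retain_graph=True)[0]

    # beliefs initialized to the forward-pass activations
    v = [vh.detach().clone() for vh in vhat]

    for _ in range(n):
        for l in range(L - 1, 0, -1):
            eps[l] = v[l] - vhat[l].detach()
            # dv_l = eps_{l+1} * d f_{l+1}(vhat_l)/d vhat_l - eps_l (VJP via autograd)
            vjp = torch.autograd.grad(vhat[l + 1], vhat[l], grad_outputs=eps[l + 1],
                                      retain_graph=True)[0]
            v[l] = v[l] + eta * (vjp - eps[l])

    # parameter updates dtheta_l = eps_l * d f_l(vhat_{l-1})/d theta_l
    dtheta = [torch.autograd.grad(vhat[l + 1], weights[l], grad_outputs=eps[l + 1],
                                  retain_graph=True)[0] for l in range(L)]
    return eps, dtheta, loss

eps, dtheta_fpa, loss = fpa_predictive_coding(eta=0.1, n=20)
bp_grads = torch.autograd.grad(loss, weights)
for l in range(L):
    rel = (dtheta_fpa[l] - bp_grads[l]).norm() / bp_grads[l].norm()
    print(f"layer {l + 1}: relative error vs. backprop = {rel:.3f}")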

I next tested Algorithm 3 on MNIST using the same 5-layer convolutional neural network considered above. I used a cross-entropy loss function, but otherwise used all of the same parameters used to test Algorithm 2 in Fig 1. The modified predictive coding algorithm (Algorithm 3) performed similarly to backpropagation in terms of the loss and accuracy (Fig 3A and 3B). Parameter updates computed by Algorithm 3 did not match the true gradients, but pointed in a similar direction and provided a closer match than Algorithm 2 (compare Fig 3C and 3D to Fig 1C and 1D). Algorithm 3 was similar to Algorithm 2 in terms of training time (29s for Algorithm 3 versus 31s for Algorithm 2 and 8s for backpropagation). S2 Fig shows the same results from Fig 3 repeated across 30 trials with different random seeds to quantify the mean and standard deviation across trials.

Fig 3. Predictive coding modified by the fixed prediction assumption compared to backpropagation in a convolutional neural network trained on MNIST.

Same as Fig 1 except Algorithm 3 was used (with η = 0.1 and n = 20) in place of Algorithm 2. The accuracy of predictive coding with the fixed prediction assumption is similar to backpropagation, but the parameter updates are less similar for these hyperparameters.

https://doi.org/10.1371/journal.pone.0266102.g003

I next compared the parameter updates computed by Algorithm 3 to the true gradients for different values of n and η (Fig 4). When η < 1, the parameter updates, , appeared to converge, but did not converge exactly to the true gradients. This is likely due to numerical floating point errors accumulated over iterations. When η = 1, the parameter updates at each layer remained constant for the first few iterations, then immediately jumped to become very near the updates from backpropagation. In the next section, I provide a mathematical analysis of this behavior and show that when η = 1, Algorithm 3 computes the true gradients in a fixed number of steps.

Fig 4. Comparing parameter updates from predictive coding modified by the fixed prediction assumption to true gradients in a network trained on MNIST.

Relative error and angle between the parameter updates, dθ_ℓ, produced by predictive coding modified by the fixed prediction assumption (Algorithm 3) and the exact gradients computed by backpropagation (relative error defined by ‖dθ_pc − dθ_bp‖/‖dθ_bp‖). Updates were computed as a function of the number of iterations, n, used in Algorithm 3 for various values of the step size, η, using the model from Fig 3 applied to one mini-batch of data. Both models were initialized identically to the pre-trained parameter values from the backpropagation-trained model in Fig 3. In the rightmost panels, some lines are not visible where they overlap at zero. Parameter updates quickly converge to the true gradients when η is larger.

https://doi.org/10.1371/journal.pone.0266102.g004

To see how well these results extend to a larger model and more difficult benchmark, I next tested Algorithm 3 on CIFAR-10 [25] using a six-layer convolutional network. While the network only had one more layer than the MNIST network used above, it had 141 times more parameters (32,695 trainable parameters in the MNIST model versus 4,633,738 in the CIFAR-10 model). Algorithm 3 performed similarly to backpropagation in terms of loss and accuracy during learning (Fig 5A and 5B) and produced parameter updates that pointed in a similar direction, but still did not match the true gradients (Fig 5C and 5D). Algorithm 3 was substantially slower than backpropagation (848s for Algorithm 3 versus 58s for backpropagation when training metrics were not computed on every iteration).

Fig 5. Predictive coding modified by the fixed prediction assumption compared to backpropagation in convolutional neural networks trained on CIFAR-10.

Same as Fig 3 except a larger network was trained on the CIFAR-10 data set. The accuracy of predictive coding with the fixed prediction assumption is similar to backpropagation and parameter updates are similar to the true gradients.

https://doi.org/10.1371/journal.pone.0266102.g005

Predictive coding modified by the fixed prediction assumption using a step size of η = 1 computes exact gradients in a fixed number of steps.

A major disadvantage of the approach outlined above—when compared to standard backpropagation—is that it requires iterative updates to v and ϵ. Indeed, previous work [12] used n = 100–200 iterations, leading to substantially slower performance compared to standard backpropagation. Other work [11] used n = 20 iterations as above. In general, there is a tradeoff between accuracy and performance when choosing n, as demonstrated in Fig 4. However, more recent work [13, 14] showed that, under the fixed prediction assumption, predictive coding can compute the exact same gradients computed by backpropagation in a fixed number of steps. That work used a more specific formulation of the neural network which can implement fully connected layers, convolutional layers, and recurrent layers. They also used an unconventional interpretation of neural networks in which weights are multiplied outside the activation function, i.e., f(x; θ) = θg(x), and inputs are fed into the last layer instead of the first. Next, I show that their result holds for arbitrary feedforward neural networks as formulated in Eq (1) (with arbitrary functions, f) and this result has a simple interpretation in terms of Algorithm 3. Specifically, the following theorem shows that taking a step size of η = 1 yields an exact computation of gradients using just n = L iterations (where L is the depth of the network).

Theorem 1. If Algorithm 3 is run with step size η = 1 and at least n = L iterations, then the algorithm computes ϵ_ℓ = δ_ℓ = ∂ℒ(ŷ, y)/∂v̂_ℓ and dθ_ℓ = ∂ℒ(ŷ, y)/∂θ_ℓ for all ℓ = 1, …, L, where v̂_ℓ are the results from a forward pass with v̂_0 = x and ŷ = v̂_L is the output.

Proof. For the sake of notational simplicity within this proof, define δ_ℓ = ∂ℒ(ŷ, y)/∂v̂_ℓ. Therefore, we first need to prove that ϵ_ℓ = δ_ℓ. First, rewrite the inside of the error and belief loop from Algorithm 3 while explicitly keeping track of the iteration number in which each variable was updated,
ϵ_ℓ^i = v_ℓ^{i−1} − v̂_ℓ
dv_ℓ^i = ϵ_{ℓ+1}^i ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ − ϵ_ℓ^i
v_ℓ^i = v_ℓ^{i−1} + η dv_ℓ^i.
Here, ϵ_ℓ^i, dv_ℓ^i, and v_ℓ^i denote the values of ϵ_ℓ, dv_ℓ, and v_ℓ respectively at the end of the ith iteration, v_ℓ^0 = v̂_ℓ corresponds to the initial value, and all terms without superscripts are constant inside the inference loop. There are some subtleties here. For example, we have v_ℓ^{i−1} in the first line because v_ℓ is updated after ϵ_ℓ in the loop. More subtly, we have ϵ_{ℓ+1}^i in the second equation instead of ϵ_{ℓ+1}^{i−1} because the for loop goes backwards from ℓ = L − 1 to ℓ = 1, so ϵ_{ℓ+1} is updated before ϵ_ℓ. First note that ϵ_ℓ^1 = 0 for ℓ = 1, …, L − 1 because v_ℓ^0 = v̂_ℓ. Now compute the change in ϵ_ℓ across one step,
ϵ_ℓ^{i+1} − ϵ_ℓ^i = v_ℓ^i − v_ℓ^{i−1} = η dv_ℓ^i = η [ϵ_{ℓ+1}^i ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ − ϵ_ℓ^i].
Note that this equation is only valid for i ≥ 1 due to the i − 1 term (v_ℓ^{−1} is not defined). Setting η = 1 and adding ϵ_ℓ^i to both sides of the resulting equation gives
ϵ_ℓ^{i+1} = ϵ_{ℓ+1}^i ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ.
We now use induction to prove that ϵ_ℓ = δ_ℓ after n = L iterations. Indeed, we prove a stronger claim that ϵ_ℓ^i = δ_ℓ at i = L − ℓ + 1. First note that ϵ_L^i = δ_L for all i because ϵ_L is initialized to δ_L and then never changed. Therefore, our claim is true for the base case ℓ = L.

Now suppose that ϵ_{ℓ+1}^i = δ_{ℓ+1} for i = L − (ℓ + 1) + 1 = L − ℓ. We need to show that ϵ_ℓ^{i+1} = δ_ℓ. From above, we have
ϵ_ℓ^{i+1} = ϵ_{ℓ+1}^i ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ = δ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ = δ_ℓ.
This completes our induction argument. It follows that ϵ_ℓ = δ_ℓ at iteration i = L − ℓ + 1 at all layers ℓ = 1, …, L. The last layer to be updated to the correct value is ℓ = 1, which is updated on iteration number i = L − 1 + 1 = L. Hence, ϵ_ℓ = δ_ℓ for all ℓ = 1, …, L after n = L iterations. This proves the first statement in our theorem. The second statement then follows from the definition of dθ_ℓ,
dθ_ℓ = ϵ_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ = δ_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ = ∂ℒ(ŷ, y)/∂θ_ℓ.
This completes the proof.

This theorem ties together the implementation and formulation of predictive coding from [12] (i.e., Algorithm 3) with the results in [13, 14]. As noted in [13, 14], this result depends critically on the assumption that the values of v_ℓ are initialized to the activations from a forward pass, v_ℓ = v̂_ℓ. The theoretical predictions from Theorem 1 are confirmed by the fact that all of the errors in the rightmost panels of Fig 4 converge to zero after n = L = 5 iterations.

To further test the result empirically, I repeated Figs 3 and 5 using η = 1 and n = L (in contrast to Figs 3 and 5, which used η = 0.1 and n = 20). The loss and accuracy closely matched those computed by backpropagation (Figs 6A and 6B and 7A and 7B). More importantly, the parameter updates closely matched the true gradients (Figs 6C and 6D and 7C and 7D), as predicted by Theorem 1. The differences between predictive coding and backpropagation in Fig 6 were due to floating point errors and the non-determinism of computations performed on GPUs. For example, differences similar to those seen in Fig 6A and 6B were present when the same training algorithm was run twice with the same random seed. The smaller number of iterations (n = L in Figs 6 and 7 versus n = 20 in Figs 3 and 5) resulted in a shorter training time (13s for MNIST and 300s for CIFAR-10 for Figs 6 and 7, compared to 29s and 848s in Figs 3 and 5, and to 8s and 58s for backpropagation).
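The claim in Theorem 1 is also easy to check numerically in the toy example above: rerunning the sketch of Algorithm 3 with η = 1 and n = L should reproduce the autograd gradients up to floating point precision (again under the sign convention assumed in that sketch; the helper fpa_predictive_coding is the hypothetical function defined there).

eps, dtheta_fpa, loss = fpa_predictive_coding(eta=1.0, n=L)
bp_grads = torch.autograd.grad(loss, weights)
for l in range(L):
    print(f"layer {l + 1}: matches autograd gradient:",
          torch.allclose(dtheta_fpa[l], bp_grads[l], atol=1e-6))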

Fig 6. Predictive coding modified by the fixed prediction assumption with η = 1 compared to backpropagation in convolutional neural networks trained on MNIST.

Same as Fig 3 except η = 1 and n = L. Predictive coding with the fixed prediction assumption approximates true gradients accurately when η = 1.

https://doi.org/10.1371/journal.pone.0266102.g006

Fig 7. Predictive coding modified by the fixed prediction assumption with η = 1 compared to backpropagation in convolutional neural networks trained on CIFAR-10.

Same as Fig 5 except η = 1 and n = L. Predictive coding with the fixed prediction assumption approximates true gradients accurately when η = 1.

https://doi.org/10.1371/journal.pone.0266102.g007

In summary, a review of the literature shows that a strict interpretation of predictive coding (Algorithm 2) does not converge to the true gradients computed by backpropagation. To compute the true gradients, predictive coding must be modified by the fixed prediction assumption (Algorithm 3). Further, I proved that Algorithm 3 computes the exact gradients when η = 1 and n ≥ L, which ties together results from previous work [12–14].

Predictive coding with the fixed prediction assumption and η = 1 is functionally equivalent to a direct implementation of backpropagation

The proof of Theorem 1 and the last panel of Fig 4 give some insight into how Algorithm 3 works. First note that the values of v_ℓ in Algorithm 3 are only used to compute the values of ϵ_ℓ and are not otherwise used in the computation of dθ_ℓ or any other quantities. Therefore, if we only care about understanding parameter updates, dθ_ℓ, we can ignore the values of v_ℓ and only focus on how ϵ_ℓ is updated on each iteration, i. Secondly, note that when η = 1, each ϵ_ℓ is updated only once: ϵ_ℓ^i = 0 for i < L − ℓ + 1 and ϵ_ℓ^i = δ_ℓ for i ≥ L − ℓ + 1, so ϵ_ℓ is only changed on iteration number i = L − ℓ + 1. In other words, the error computation in Algorithm 3 when η = 1 and n = L is equivalent to

# error computation

for i = 1, …, L

  for ℓ = L − 1, …, 1

    if ℓ == L − i + 1

      ϵ_ℓ = ϵ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ

The two computations are equivalent in the sense that they compute the same values of the errors, ϵ_ℓ, on every iteration. The formulation above makes it clear that the nested loops are unnecessary because, for each value of i, ϵ_ℓ is only updated at one value of ℓ. Therefore, the nested loops and if-statement can be replaced by a single for-loop. Specifically, the error computation in Algorithm 3 when η = 1 is equivalent to

# error computation

for ℓ = L − 1, …, 1

  ϵ_ℓ = ϵ_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ

This is exactly the error computation from the standard backpropagation algorithm, i.e., Algorithm 1. Hence, if we use η = 1, then Algorithm 3 is just backpropagation with extra steps, and these extra steps do not compute any non-zero values. If we additionally want to compute the fixed point beliefs, then they can still be computed using the relationship v_ℓ = v̂_ℓ + ϵ_ℓ. We may conclude that, when η = 1, Algorithm 3 can be replaced by an exact implementation of backpropagation without any effect on the results or effective implementation of the algorithm. This raises the question of whether predictive coding with the fixed prediction assumption should be considered any more biologically plausible than a direct implementation of backpropagation.

Accounting for covariance or precision matrices in hidden layers does not affect learning under the fixed prediction assumption

Above, I showed that predictive coding with the fixed prediction assumption is functionally equivalent to backpropagation. However, the predictive coding algorithm was derived under an assumption that covariance matrices in the probabilistic model are identity matrices, Σ = I. This raises the question of whether relaxing this assumption could generalize backpropagation to account for the covariances, as suggested in previous work [11, 12, 26].

We can account for covariances by returning to the calculations starting from the probabilistic model in Eq (2) and omitting the assumption that Σ = I. To this end, it is helpful to define the precision-weighted prediction errors [20, 21, 26], ϵ̃_ℓ = ϵ_ℓ P_ℓ for ℓ = 1, …, L − 1, where P_ℓ = Σ_ℓ^{−1} is the inverse of the covariance matrix of V_ℓ, which is called the “precision matrix.” Recall that we treat ϵ_ℓ as a row-vector, which explains the right-multiplication in this definition.

Modifying the definition of ϵ_L to account for covariances is not so simple because the Gaussian model for V_L is not justified for non-Euclidean loss functions such as categorical loss functions. Moreover, it is not clear how to define the covariance or precision matrix of the output layer when labels are observed. As such, I restrict to accounting for precision matrices in hidden layers only, and leave the question of accounting for covariances in the output layer for future work, with some comments on the issue provided at the end of this section. To this end, let us not modify the last layer’s precision and instead define ϵ̃_L = ϵ_L. The free energy is then defined as [20, 21] F = (1/2) Σ_ℓ ϵ_ℓ P_ℓ ϵ_ℓᵀ = (1/2) Σ_ℓ ϵ̃_ℓ ϵ_ℓᵀ, up to terms that do not depend on the beliefs or parameters. Performing gradient descent on F with respect to v_ℓ therefore gives dv_ℓ = ϵ̃_{ℓ+1} ∂f_{ℓ+1}(v_ℓ; θ_{ℓ+1})/∂v_ℓ − ϵ̃_ℓ, and performing gradient descent on F with respect to θ_ℓ gives dθ_ℓ = ϵ̃_ℓ ∂f_ℓ(v_{ℓ−1}; θ_ℓ)/∂θ_ℓ. These expressions are identical to Eqs (5) and (6) derived above except that ϵ̃_ℓ takes the place of ϵ_ℓ.

The precision matrices themselves can be learned by performing gradient descent on F with respect to P_ℓ or, as suggested in other work [21], by parameterizing the model in terms of Σ_ℓ = P_ℓ^{−1} and performing gradient descent with respect to Σ_ℓ. Alternatively, one could use techniques from the literature on Gaussian graphical models to learn a sparse or low-rank representation of P_ℓ. I circumvent the question of estimating P_ℓ by instead just asking how an estimate of P_ℓ (however it is obtained) would affect learning. I do assume that P_ℓ is symmetric. I also simplify the calculations by restricting the analysis to predictive coding with the fixed prediction assumption, leaving the analysis of fixed point prediction errors and parameter updates under strict predictive coding with precision matrices for future work. Some analysis has been performed in this direction [21], but not for the supervised learning scenario considered here.

Putting this together, predictive coding under the fixed prediction assumption while accounting for precision matrices in hidden layers is defined by the equations ϵ̃_ℓ = (v_ℓ − v̂_ℓ) P_ℓ, dv_ℓ = ϵ̃_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ − ϵ̃_ℓ, and dθ_ℓ = ϵ̃_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ. The only difference between these equations and Eq (12) is that they use ϵ̃_ℓ in place of ϵ_ℓ. Following the same line of reasoning, therefore, if the updates to v_ℓ are repeated until convergence, then the fixed point precision-weighted prediction errors satisfy ϵ̃_ℓ = ϵ̃_{ℓ+1} ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ. Notably, this is the same equation derived for ϵ_ℓ under the fixed prediction assumption with Σ = I, so the fixed point precision-weighted prediction errors are also the same, and, therefore, the parameter updates are the same as well, dθ_ℓ = ϵ̃_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ = ∂ℒ(ŷ, y)/∂θ_ℓ. In conclusion, accounting for precision matrices in hidden layers does not affect learning under the fixed prediction assumption. Fixed point parameter updates are still the same as those computed by backpropagation. This conclusion is independent of how the precision matrices are estimated, but it does rely on the assumption that fixed points for v_ℓ exist and are unique.
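The argument above can also be checked numerically in the toy example by weighting the hidden-layer errors with arbitrary symmetric positive definite precision matrices; at (approximate) convergence, the resulting parameter updates should still match the backpropagation gradients. The construction of P_ℓ, the step size, and the number of iterations below are arbitrary choices for this sketch, which again continues the toy setup from the sketches above.

# random symmetric positive definite precision matrices for the hidden layers
torch.manual_seed(1)
P = {}
for l in range(1, L):
    B = 0.3 * torch.randn(dims[l], dims[l])
    P[l] = torch.eye(dims[l]) + B @ B.T

def fpa_with_precisions(eta, n):
    vhat = [x]
    for l in range(L):
        vhat.append(f(l, vhat[l]))
    loss = 0.5 * ((vhat[-1] - y) ** 2).sum()

    eps_t = [torch.zeros_like(vh) for vh in vhat]   # precision-weighted errors
    eps_t[L] = torch.autograd.grad(loss, vhat[L], retain_graph=True)[0]
    v = [vh.detach().clone() for vh in vhat]

    for _ in range(n):
        for l in range(L - 1, 0, -1):
            eps_t[l] = (v[l] - vhat[l].detach()) @ P[l]
            vjp = torch.autograd.grad(vhat[l + 1], vhat[l], grad_outputs=eps_t[l + 1],
                                      retain_graph=True)[0]
            v[l] = v[l] + eta * (vjp - eps_t[l])

    dtheta = [torch.autograd.grad(vhat[l + 1], weights[l], grad_outputs=eps_t[l + 1],
                                  retain_graph=True)[0] for l in range(L)]
    return dtheta, loss

dtheta_P, loss = fpa_with_precisions(eta=0.2, n=500)
bp_grads = torch.autograd.grad(loss, weights)
for l in range(L):
    rel = (dtheta_P[l] - bp_grads[l]).norm() / bp_grads[l].norm()
    print(f"layer {l + 1}: relative difference from backprop = {rel:.2e}")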

Above, we only considered precision matrices in the hidden layers because accounting for precision matrices in the output layer is problematic for general loss functions. The use of a precision matrix in the output implies the use of a Gaussian model for the output layer and labels, which is inconsistent with some types of labels and loss functions. If we focus on the case of a squared-Euclidean loss function, then the use of precision matrices in the output layer is more parsimonious and we can define ϵ̃_L = ϵ_L P_L in place of the definition above (recalling that, for a squared-Euclidean loss, ϵ_L = ∂ℒ(ŷ, y)/∂ŷ under the fixed prediction assumption). Following the same calculations as above gives fixed points of the form ϵ̃_ℓ = ϵ_L P_L ∂f_L(v̂_{L−1}; θ_L)/∂v̂_{L−1} ⋯ ∂f_{ℓ+1}(v̂_ℓ; θ_{ℓ+1})/∂v̂_ℓ and, therefore, weight updates of the form dθ_ℓ = ϵ̃_ℓ ∂f_ℓ(v̂_{ℓ−1}; θ_ℓ)/∂θ_ℓ at the fixed point. Hence, accounting for precision matrices at the output layer can affect learning by re-weighting the gradient of the loss function according to the precision matrix of the output layer. Note that the precision matrices of the hidden layers still have no effect on learning in this case. Previous work relates the inclusion of the precision matrix in output layers to the use of natural gradients [26, 27].

Prediction errors do not necessarily represent surprising or unexpected features of inputs

Deep neural networks are often interpreted as abstract models of cortical neuronal networks. To this end, the activations of units in deep neural networks are compared to the activity (typically firing rates) of cortical neurons [3, 28, 29]. This approach ignores the representation of errors within the network. More generally, the activations in one particular layer of a feedforward deep neural network contain no information about the activations of deeper layers, the label, or the loss. On the other hand, the activity of cortical neurons can be modulated by downstream activity and information believed to be passed upstream by feedback projections. Predictive coding provides a precise model for the information that deeper layers send to shallower layers, specifically prediction errors.

Under the fixed prediction assumption (Algorithm 3), prediction errors in a particular layer are approximated by the gradient of the loss function with respect to that layer’s activations, δ_ℓ = ∂ℒ/∂v̂_ℓ, but under a strict interpretation of predictive coding (Algorithm 2), prediction errors do not necessarily reflect gradients. We next empirically explored how the representations of images differ between the activations from a feedforward pass, v̂_ℓ, the prediction errors under the fixed prediction assumption, ϵ_ℓ = δ_ℓ, as well as the beliefs, v_ℓ, and prediction errors, ϵ_ℓ, under a strict interpretation of predictive coding (Algorithm 2). To do so, we computed each quantity in VGG-19 [30], which is a large, feedforward convolutional neural network (19 layers and 143,667,240 trainable parameters) pre-trained on ImageNet [31].

The use of convolutional layers allowed us to visualize the activations and prediction errors in each layer. Specifically, we took the Euclidean norm of each quantity across all channels and plotted the result as a two-dimensional image for layers ℓ = 1 and ℓ = 10 and for two different input images (Fig 8). For each image and each layer (each row in Fig 8), we computed the Euclidean norm of four quantities. First, we computed the activations from a forward pass through the network (v̂_ℓ, second column). Under predictive coding with the fixed prediction assumption (Algorithm 3), we can interpret the activations, v̂_ℓ, as “beliefs” and the gradients, δ_ℓ, as “prediction errors.” Strictly speaking, there is a distinction between the beliefs, v̂_ℓ, from a feedforward pass and the beliefs, v_ℓ, when labels are provided. Either could be interpreted as a “belief.” However, we found that the difference between them was negligible for the examples considered here.

Fig 8. Magnitude of activations, beliefs, and prediction errors in a convolutional neural network pre-trained on ImageNet.

The Euclidean norm of feedforward activations (v̂_ℓ, interpreted as beliefs under the fixed prediction assumption), gradients of the loss with respect to activations (δ_ℓ, interpreted as prediction errors under the fixed prediction assumption), beliefs (v_ℓ) under strict predictive coding, and prediction errors (ϵ_ℓ) under strict predictive coding, computed from the VGG-19 network [30] pre-trained on ImageNet [31] with two different photographs as inputs at two different layers. The vertical labels on the left (“triceratops” and “Irish wolfhound”) correspond to the guessed label, which was also used as the “true” label (y) to compute the gradients.

https://doi.org/10.1371/journal.pone.0266102.g008

Next, we computed the gradients of the loss with respect to the activations (δ_ℓ, third column in Fig 8). The theory and simulations above and from previous work confirm that these gradients approximate the prediction errors from predictive coding with the fixed prediction assumption (Algorithm 3). Indeed, for the examples considered here, the differences between the two quantities were negligible. Next, we computed the beliefs (v_ℓ, fourth column in Fig 8) under strict predictive coding (Algorithm 2). Finally, we computed the prediction errors (ϵ_ℓ, last column in Fig 8) under strict predictive coding (Algorithm 2).
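For readers who want to reproduce this kind of visualization, the following sketch (illustrative, not the paper's notebook) computes the feedforward activations and the gradients δ_ℓ for the torchvision VGG-19 model, treating each module of model.features as a layer (a finer-grained grouping than the one used for the figures), and collapses each quantity to a two-dimensional map by taking the Euclidean norm across channels. A random tensor stands in for the input; to reproduce maps like those in Fig 8, load a photograph and apply weights_vgg.transforms() to it instead. The torchvision weights API and the module-level indexing here are assumptions about the software environment, not details taken from the paper.

import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

weights_vgg = VGG19_Weights.IMAGENET1K_V1
vgg = vgg19(weights=weights_vgg).eval()

# stand-in input; replace with a preprocessed photograph for real visualizations
x_img = torch.rand(1, 3, 224, 224)

# forward pass through the convolutional part, keeping every activation
acts = [x_img]
for layer in vgg.features:
    acts.append(layer(acts[-1]))
logits = vgg.classifier(torch.flatten(vgg.avgpool(acts[-1]), 1))

# use the network's own guess as the label, as in the matched examples
guess = logits.argmax(dim=1)
loss = F.cross_entropy(logits, guess)

# gradients of the loss with respect to each activation (the delta terms that
# approximate prediction errors under the fixed prediction assumption)
grads = torch.autograd.grad(loss, acts[1:], retain_graph=True)

# Euclidean norm across channels gives a two-dimensional map per layer
for idx in (1, 10):
    act_map = acts[idx].squeeze(0).norm(dim=0)
    err_map = grads[idx - 1].squeeze(0).norm(dim=0)
    print(f"module {idx}: activation map {tuple(act_map.shape)}, "
          f"error map {tuple(err_map.shape)}")

The resulting maps can be plotted directly (for example with matplotlib's imshow) to produce panels analogous to the second and third columns of Fig 8.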

Note that we used a VGG-19 model that was pre-trained using backpropagation. Hence, the weights were not necessarily the same as the weights that would be obtained if the model were trained using predictive coding, particularly strict predictive coding (Algorithm 2) which does not necessarily converge to the true gradients. Training a large ImageNet model like VGG-19 with predictive coding is extremely computationally expensive. Regardless, future work should address the question of whether using pre-trained weights (versus weights trained by predictive coding) affects the conclusions reached here.

Overall, the activations, v̂_ℓ, from a feedforward pass were qualitatively very similar to the beliefs, v_ℓ, computed under a strict interpretation of predictive coding (Algorithm 2). To a slightly lesser degree, the gradients, δ_ℓ, were qualitatively similar to the prediction errors, ϵ_ℓ, computed under a strict interpretation of predictive coding (Algorithm 2). Since v̂_ℓ and δ_ℓ approximate the beliefs and prediction errors under the fixed prediction assumption, these observations confirmed that the fixed prediction assumption does not make large qualitative changes to the representation of beliefs and errors in these examples. Therefore, in the discussion below, we used “beliefs” and “prediction errors” to refer to the quantities from both models.

Interestingly, prediction errors were non-zero even when the image and the network’s “guess” were consistent with the label (no “mismatch”). Indeed, the prediction errors were largest in magnitude at pixels corresponding to the object predicted by the label, i.e., at the most predictable regions. While this observation is an obvious consequence of the fact that prediction errors are approximated by the gradients of the loss, it is contradictory to the heuristic or intuitive interpretation of prediction errors as measurements of “surprise” in the colloquial sense of the word [16].

As an illustrative example from Fig 8, it is not surprising that an image labeled “triceratops” contains a triceratops, but this does not imply a lack of prediction errors because the space of images containing a triceratops is large and any one image of a triceratops is not wholly representative of the label. Moreover, the pixels to which the loss is most sensitive are those pixels containing the triceratops, so those pixels give rise to larger values of δ_ℓ. Hence, in high-dimensional sensory spaces, predictive coding models do not necessarily predict that prediction error units encode “surprise” in the colloquial sense of the word.

In both examples in Fig 8, we used a label, y, that matched the network’s “guessed” label, i.e., the label to which the network assigned the highest probability. Prediction errors are often discussed in the context of mismatched stimuli in which top-down input is inconsistent with bottom-up predictions [32–37]. Mismatches can be modeled by taking a label that is different from the network’s guess. In Fig 9, we visualized the prediction errors in response to matched and mismatched labels. The network assigned a probability of p = 0.9991 to the label “carousel” and a probability of p = 3.63 × 10⁻⁸ to the label “bald eagle”. The low probability assigned to “bald eagle” is, at least in part, a consequence of the network being trained with a softmax loss function, which implicitly assumes one label per input. When we applied the mismatched label “bald eagle,” prediction errors were larger in pixels that are salient for that label (e.g., the bird’s white head, which is a defining feature of a bald eagle). Moreover, the prediction errors as a whole were much larger in magnitude in response to the mismatched label (see the scales of the color bars in Fig 9).
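Continuing the VGG-19 sketch above, a mismatched label can be probed by simply replacing the network's guess with another class index before recomputing the gradients; the index 22 used here is assumed to correspond to "bald eagle" in the standard ImageNet class ordering used by torchvision.

mismatched = torch.tensor([22])
loss_mismatch = F.cross_entropy(logits, mismatched)
grads_mismatch = torch.autograd.grad(loss_mismatch, acts[1:], retain_graph=True)
# compare overall error magnitudes for the matched and mismatched labels
print("ratio of first-layer error norms (mismatched / matched):",
      (grads_mismatch[0].norm() / grads[0].norm()).item())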

Fig 9. Magnitude of activations, beliefs, and prediction errors in response to matched and mismatched inputs and labels.

Same as Fig 8, but for the bottom row the label did not match the network’s guess.

https://doi.org/10.1371/journal.pone.0266102.g009

In summary, the relationship between prediction errors and gradients helped demonstrate that prediction errors sometimes, but not always, conform to their common interpretation as unexpected features of a bottom-up input in the context of a top-down input. Also, beliefs and prediction errors were qualitatively similar with and without the fixed prediction assumption for the examples considered here.

Discussion

We reviewed and extended previous work [11–14] on the relationship between predictive coding and backpropagation for learning in neural networks. Our results demonstrated that a strict interpretation of predictive coding does not accurately approximate backpropagation, but is still capable of learning (Figs 1 and 2). Previous work proposed a modification to predictive coding called the “fixed prediction assumption,” which causes predictive coding to converge to the same parameter updates produced by backpropagation, under the assumption that the predictive coding iterations converge to fixed points. Hence, the relationship between predictive coding and backpropagation identified in previous work relies critically on the fixed prediction assumption. Formal derivations of predictive coding in terms of variational inference [20] do not produce the fixed prediction assumption. It is possible that an alternative probabilistic model or alternative approaches to the variational formulation could help formalize a model of predictive coding under the fixed prediction assumption.

We proved analytically and verified empirically that taking a step size of η = 1 in the modified predictive coding algorithm computes the exact gradients computed by backpropagation in a fixed number of steps (modulo floating point numerical errors). This result is consistent with similar, but slightly less general, results in previous work [13, 14].

A closer inspection of the fixed prediction assumption with η = 1 showed that it is algorithmically equivalent to a direct implementation of backpropagation. As such, any potential neural architecture and machinery that could be used to implement predictive coding with the fixed prediction assumption could also implement backpropagation directly. This result calls into question whether predictive coding with the fixed prediction assumption is any more biologically plausible than a direct implementation of backpropagation.

Visualizing the beliefs and prediction errors produced by predictive coding models applied to a large convolutional neural network pre-trained on ImageNet showed that beliefs and prediction errors were activated by distinct parts of input images, and the parts of the images that produced larger prediction errors were not always consistent with an intuitive interpretation of prediction errors as representing surprising or unexpected features of inputs. These observations are consistent with the fact that prediction errors approximate gradients of the loss function in backpropagation [11–14]. Gradients are large for input features that have a larger impact on the loss. While surprising features can have a large impact on the loss, unsurprising features can as well. We only verified this finding empirically on a few examples. The reader can try additional examples by inserting the URL of any image into the file PredErrsFromURLimage.ipynb contained in the directories linked in Materials and methods, which can also be accessed directly at https://bit.ly/3JwGUM9. Future work should attempt to quantify the relationship between prediction errors and surprising features more systematically across many inputs. In addition, prediction errors could be computed for learning tasks associated with common experimental paradigms so they can be used to make experimentally testable predictions.

When interpreting artificial deep neural networks as models of biological neuronal networks, it is common to compare activations in the artificial network to biological neurons’ firing rates [28, 29]. However, under predictive coding models and other models in which errors are propagated upstream by feedback connections, many biological interpretations posit the existence of “error neurons” that encode the errors sent upstream. In most such models (including predictive coding), error neurons reflect or approximate the gradient of the loss function with respect to artificial neurons’ activations, δ. Any model that hypothesizes the neural representation of backpropagated errors would predict that some recorded neural activity should reflect these errors. Therefore, if we want to draw analogues between artificial and biological neural networks, the activity of biological neurons should be compared to both the activations and the gradients of artificial neurons.

Following previous work [11, 12], we took the covariance matrices underlying the probabilistic model to be identity matrices, Σ = I, when deriving the predictive coding model. We also showed that relaxing this assumption by allowing for arbitrary precision matrices in hidden layers does not affect learning under the fixed prediction assumption. Future work should consider the utility of accounting for covariance (or precision) matrices in models without the fixed prediction assumption (i.e., under the “strict” model) and accounting for precisions or covariances in the output layer. Moreover, precision matrices could still have benefits in other settings such as recurrent network models, unsupervised learning, or active inference.

Predictive coding and deep neural networks (trained by backpropagation) are often viewed as competing models of brain function. Better understanding their relationship can help in the interpretation and implementation of each algorithm as well as their mutual relationships to biological neuronal networks.

Materials and methods

All numerical examples were performed on GPUs using Google Colaboratory with custom-written PyTorch code. The networks trained on MNIST used two convolutional and three fully connected layers with rectified linear activation functions, trained for 2 epochs with a learning rate of 0.002 and a batch size of 300. The networks trained on CIFAR-10 used three convolutional and three fully connected layers with rectified linear activation functions, trained for 5 epochs with a learning rate of 0.01 and a batch size of 256. All networks were trained using the Adam optimizer with gradients replaced by the output of the respective algorithm. All of the code to produce the figures in the manuscript can be found at https://doi.org/10.6084/m9.figshare.19387409.v2. A Google Drive folder with Colab notebooks that produce all figures in this text can be found at https://drive.google.com/drive/folders/1m_y0G_sTF-pV9pd2_sysWt1nvRvHYzX0. An additional copy of the same code is also stored at https://github.com/RobertRosenbaum/PredictiveCodingVsBackProp. Full details of the neural network architectures and metaparameters can be found in this code.
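The exact architectures and hyperparameters are specified in the linked code. As a rough, hypothetical illustration only, a 5-layer MNIST model of the kind described above could be written as a PyTorch Sequential model whose five top-level items each act as one predictive coding layer (see the Torch2PC section below); the channel counts and hidden sizes here are arbitrary and do not reproduce the paper's parameter counts.

import torch.nn as nn

# a hypothetical 5-layer MNIST model consistent with the description above
# (two convolutional and three fully connected layers with ReLU); sizes are
# illustrative only and do not match the paper's model
mnist_model = nn.Sequential(
    nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Flatten(), nn.Linear(16 * 7 * 7, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 32), nn.ReLU()),
    nn.Linear(32, 10))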

Torch2PC software for predictive coding with PyTorch models

The figures above were all produced using PyTorch [38] models combined with custom-written functions for predictive coding. Functions for predictive coding with PyTorch models are collected in the GitHub repository Torch2PC. Currently, the only available functions are intended for models built using the Sequential class, but more general functions will be added to Torch2PC in the future. The functions can be imported using the following commands

!git clone https://github.com/RobertRosenbaum/Torch2PC.git

from Torch2PC import TorchSeq2PC as T2PC

The primary function in TorchSeq2PC is PCInfer, which performs one predictive coding step (computing one value of dθ_ℓ for each layer) on a batch of inputs and labels. The function takes an input ErrType, which is a string that determines whether to use a strict interpretation of predictive coding (Algorithm 2; ErrType=“Strict”), predictive coding with the fixed prediction assumption (Algorithm 3; “FixedPred”), or to compute the gradients exactly using backpropagation (Algorithm 1; “Exact”). Algorithm 2 can be called as follows,

vhat,Loss,dLdy,v,epsilon=

 T2PC.PCInfer(model,LossFun,X,Y,“Strict”,eta,n,vinit)

where model is a Sequential PyTorch model, LossFun is a loss function, X is a mini-batch of inputs, Y is a mini-batch of labels, eta is the step size, n is the number of iterations to use, and vinit is the initial value for the beliefs. If vinit is not passed, it is set to the result from a forward pass, vinit = vhat. The function returns a list of activations from a forward pass at each layer as vhat, the loss as Loss, the gradient of the loss with respect to the output as dLdy, a list of beliefs, v_ℓ, at each layer as v, and a list of prediction errors, ϵ_ℓ, at each layer as epsilon. The values of the parameter updates, dθ_ℓ, are stored in the grad attributes of each parameter, model.param.grad. Hence, after a call to PCInfer, gradient descent could be implemented by calling

with torch.no_grad():

 for p in modelPC.parameters():

  p-=eta*p.grad

Alternatively, an arbitrary optimizer could be used by calling

optimizer.step()

where optimizer is an optimizer created using the PyTorch optim class, e.g., by calling

optimizer = optim.Adam(model.parameters())

before the call to T2PC.PCInfer.

The input model should be a PyTorch Sequential model. Each layer is treated as a single predictive coding layer. Multiple functions can be included within the same layer by wrapping them in a separate call to Sequential. For example, the following code:

model = nn.Sequential(

  nn.Conv2d(1,10,3),

  nn.ReLU(),

  nn.MaxPool2d(2),

  nn.Conv2d(10,10,3),

  nn.ReLU())

will treat each item as its own layer (5 layers in all). To treat each “convolutional block” as a separate layer, instead do

model = nn.Sequential(

  nn.Sequential(

   nn.Conv2d(1,10,3),

   nn.ReLU(),

   nn.MaxPool2d(2)),

  nn.Sequential(

   nn.Conv2d(10,10,3),

   nn.ReLU()))

which has just 2 layers.

Algorithm 3 can be called as follows,

vhat,Loss,dLdy,v,epsilon=

 T2PC.PCInfer(model,LossFun,X,Y,“FixedPred”,eta,n)

The input vinit is not used for Algorithm 3, so it does not need to be passed in. The exact values computed by backpropagation can be obtained by calling

vhat,Loss,dLdy,v,epsilon=

 T2PC.PCInfer(model,LossFun,X,Y,“Exact”)

The inputs vinit, eta, and n are not used for computing exact gradients, so they do not need to be passed in. Theorem 1 says that

T2PC.PCInfer(model,LossFun,X,Y,“FixedPred”,eta = 1,n = len(model))

computes the same values as

T2PC.PCInfer(model,LossFun,X,Y,“Exact”)

up to numerical floating point errors.

The inputs eta, n, and vinit are optional. If they are omitted by calling

T2PC.PCInfer(model,LossFun,X,Y,ErrType)

then they default to eta=0.1, n=20, and vinit=None, which produces vinit = vhat when ErrType="Strict". More complete documentation and a complete example are provided in SimpleExample.ipynb in the GitHub repository and in the code accompanying this paper. Additional examples are provided by the code accompanying each figure above.
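
As an illustrative check of Theorem 1, the sketch below runs the "FixedPred" and "Exact" options on identical copies of the same model and compares the stored parameter updates. The small fully connected model, the mean squared error loss, and the random batch are placeholders chosen only for this comparison, not part of the Torch2PC interface.

import copy
import torch
import torch.nn as nn
from Torch2PC import TorchSeq2PC as T2PC

# Placeholder model and synthetic batch, used only for the comparison
model = nn.Sequential(
    nn.Sequential(nn.Linear(20, 30), nn.ReLU()),
    nn.Sequential(nn.Linear(30, 10)))
LossFun = nn.MSELoss()
X = torch.randn(8, 20)
Y = torch.randn(8, 10)

# Use identical copies so each call writes into its own .grad attributes
modelFP = copy.deepcopy(model)
modelBP = copy.deepcopy(model)

T2PC.PCInfer(modelFP, LossFun, X, Y, "FixedPred", eta=1, n=len(modelFP))
T2PC.PCInfer(modelBP, LossFun, X, Y, "Exact")

# Theorem 1: the two sets of parameter updates should agree up to floating point error
for pFP, pBP in zip(modelFP.parameters(), modelBP.parameters()):
    print(torch.allclose(pFP.grad, pBP.grad, atol=1e-6))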

Supporting information

S1 Fig. Comparing backpropagation and predictive coding in a convolutional neural network trained on MNIST across multiple trials.

Same as Fig 1 except the model was trained 30 times with different random seeds. Dark curves show the mean values and shaded regions show ± one standard deviation across trials.

https://doi.org/10.1371/journal.pone.0266102.s001

(EPS)

S2 Fig. Comparing backpropagation and predictive coding modified by the fixed prediction assumption in a convolutional neural network trained on MNIST across multiple trials.

Same as Fig 3 except the model was trained 30 times with different random seeds. Dark curves show the mean values and shaded regions show ± one standard deviation across trials.

https://doi.org/10.1371/journal.pone.0266102.s002

(EPS)

References

  1. Izhikevich EM. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex. 2007;17(10):2443–2452. pmid:17220510
  2. Clark DG, Abbott L, Chung S. Credit Assignment Through Broadcasting a Global Error Vector. arXiv preprint arXiv:2106.04089. 2021.
  3. Lillicrap TP, Santoro A, Marris L, Akerman CJ, Hinton G. Backpropagation and the brain. Nature Reviews Neuroscience. 2020;21(6):335–346. pmid:32303713
  4. Whittington JC, Bogacz R. Theories of error back-propagation in the brain. Trends in Cognitive Sciences. 2019;23(3):235–250. pmid:30704969
  5. Urbanczik R, Senn W. Learning by the dendritic prediction of somatic spiking. Neuron. 2014;81(3):521–528. pmid:24507189
  6. Lillicrap TP, Cownden D, Tweed DB, Akerman CJ. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications. 2016;7(1):1–10. pmid:27824044
  7. Scellier B, Bengio Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience. 2017;11:24. pmid:28522969
  8. Aljadeff J, D’amour J, Field RE, Froemke RC, Clopath C. Cortical credit assignment by Hebbian, neuromodulatory and inhibitory plasticity. arXiv preprint arXiv:1911.00307. 2019.
  9. Kunin D, Nayebi A, Sagastuy-Brena J, Ganguli S, Bloom J, Yamins D. Two routes to scalable credit assignment without weight symmetry. In: International Conference on Machine Learning. PMLR; 2020. p. 5511–5521.
  10. Payeur A, Guerguiev J, Zenke F, Richards BA, Naud R. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature Neuroscience. 2021; p. 1–10.
  11. Whittington JC, Bogacz R. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation. 2017;29(5):1229–1262. pmid:28333583
  12. Millidge B, Tschantz A, Buckley CL. Predictive coding approximates backprop along arbitrary computation graphs. arXiv preprint arXiv:2006.04182. 2020.
  13. Song Y, Lukasiewicz T, Xu Z, Bogacz R. Can the brain do backpropagation?—exact implementation of backpropagation in predictive coding networks. Advances in Neural Information Processing Systems. 2020;33:22566. pmid:33840988
  14. Salvatori T, Song Y, Lukasiewicz T, Bogacz R, Xu Z. Predictive Coding Can Do Exact Backpropagation on Convolutional and Recurrent Neural Networks. arXiv preprint arXiv:2103.03725. 2021.
  15. Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience. 1999;2(1):79–87. pmid:10195184
  16. Friston K. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. 2010;11(2):127–138. pmid:20068583
  17. Huang Y, Rao RP. Predictive Coding. Wiley Interdisciplinary Reviews: Cognitive Science. 2011;2(5):580–593. pmid:26302308
  18. Bastos AM, Usrey WM, Adams RA, Mangun GR, Fries P, Friston KJ. Canonical microcircuits for predictive coding. Neuron. 2012;76(4):695–711. pmid:23177956
  19. Clark A. Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press; 2015.
  20. Buckley CL, Kim CS, McGregor S, Seth AK. The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology. 2017;81:55–79.
  21. Bogacz R. A tutorial on the free-energy framework for modelling perception and learning. Journal of Mathematical Psychology. 2017;76:198–211. pmid:28298703
  22. Spratling MW. A review of predictive coding algorithms. Brain and Cognition. 2017;112:92–97. pmid:26809759
  23. Keller GB, Mrsic-Flogel TD. Predictive processing: a canonical cortical computation. Neuron. 2018;100(2):424–435. pmid:30359606
  24. Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press; 2016.
  25. Krizhevsky A, Hinton G, et al. Learning multiple layers of features from tiny images. Citeseer. 2009.
  26. Millidge B, Seth A, Buckley CL. Predictive Coding: a Theoretical and Experimental Review. arXiv preprint arXiv:2107.12979. 2021.
  27. Amari S-i. Information geometry of the EM and em algorithms for neural networks. Neural Networks. 1995;8(9):1379–1408.
  28. Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, Issa EB, et al. Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? bioRxiv preprint. 2018.
  29. Schrimpf M, Kubilius J, Lee MJ, Murty NAR, Ajemian R, DiCarlo JJ. Integrative Benchmarking to Advance Neurally Mechanistic Models of Human Intelligence. Neuron. 2020. pmid:32918861
  30. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
  31. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision. 2015;115(3):211–252.
  32. Hertäg L, Sprekeler H. Learning prediction error neurons in a canonical interneuron circuit. eLife. 2020;9:e57541. pmid:32820723
  33. Gillon CJ, Pina JE, Lecoq JA, Ahmed R, Billeh Y, Caldejon S, et al. Learning from unexpected events in the neocortical microcircuit. bioRxiv. 2021.
  34. Keller GB, Bonhoeffer T, Hübener M. Sensorimotor mismatch signals in primary visual cortex of the behaving mouse. Neuron. 2012;74(5):809–815. pmid:22681686
  35. Zmarz P, Keller GB. Mismatch receptive fields in mouse visual cortex. Neuron. 2016;92(4):766–772. pmid:27974161
  36. Attinger A, Wang B, Keller GB. Visuomotor coupling shapes the functional development of mouse visual cortex. Cell. 2017;169(7):1291–1302. pmid:28602353
  37. Homann J, Koay SA, Glidden AM, Tank DW, Berry MJ. Predictive coding of novel versus familiar stimuli in the primary visual cortex. bioRxiv. 2017; p. 197608.
  38. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems. 2019;32:8026–8037.