Neural Networks

Volume 131, November 2020, Pages 103-114

Improved object recognition using neural networks trained to mimic the brain’s statistical properties

https://doi.org/10.1016/j.neunet.2020.07.013

Abstract

The current state-of-the-art object recognition algorithms, deep convolutional neural networks (DCNNs), are inspired by the architecture of the mammalian visual system, and are capable of human-level performance on many tasks. As they are trained for object recognition tasks, DCNNs have been shown to develop hidden representations that resemble those observed in the mammalian visual system (Razavi and Kriegeskorte, 2014; Yamins and DiCarlo, 2016; Güçlü and van Gerven, 2015; McClure and Kriegeskorte, 2016). Moreover, DCNNs trained on object recognition tasks are currently among the best models we have of the mammalian visual system. This led us to hypothesize that teaching DCNNs to achieve even more brain-like representations could improve their performance. To test this, we trained DCNNs on a composite task, wherein networks were trained to: (a) classify images of objects; while (b) having intermediate representations that resemble those observed in neural recordings from monkey visual cortex. Compared with DCNNs trained purely for object categorization, DCNNs trained on the composite task had better object recognition performance and were more robust to label corruption. Interestingly, we found that real neural data were not required for this benefit: randomized data with the same statistical properties as the neural data also boosted performance. While the performance gains we observed when training on the composite task vs. the “pure” object recognition task were modest, they were remarkably robust. Notably, we observed these performance gains across all network variations we studied, including: smaller (CORnet-Z) vs. larger (VGG-16) architectures; variations in optimizers (Adam vs. gradient descent); variations in activation function (ReLU vs. ELU); and variations in network initialization. Our results demonstrate the potential utility of a new approach to training object recognition networks, using strategies in which the brain – or at least the statistical properties of its activation patterns – serves as a teacher signal for training DCNNs.

Introduction

Deep convolutional neural networks (DCNNs) have recently led to rapid advances in state-of-the-art object recognition systems (LeCun, Bengio, & Hinton, 2015). At the same time, there remain critical shortcomings in these systems (Rajalingham et al., 2018). We asked whether training DCNNs to respond to images in a more brain-like manner could lead to better performance. DCNN architectures are directly inspired by that of the mammalian visual system (MVS) (Hubel & Wiesel, 1968), and as DCNNs improve at object recognition tasks, they learn representations that are increasingly similar to those found in the MVS (Güçlü and van Gerven, 2015, McClure and Kriegeskorte, 2016, Razavi and Kriegeskorte, 2014, Yamins and DiCarlo, 2016). Consequently, we expected that forcing DCNNs to have image representations that were even more similar to those found in the MVS could lead to better performance.

Previous work showed that the performance of smaller “student” DCNNs could be improved by training them to match the image representations of larger “teacher” DCNNs (Hinton et al., 2015, McClure and Kriegeskorte, 2016, Romero et al., 2015), and that DCNNs could be directly trained to reproduce image representations formed by the V1 area of monkey visual cortex (Kindel, Christensen, & Zylberberg, 2019). These studies provide a foundation for the current work, in which we used monkey V1 as a teacher network for training DCNNs to categorize images. We then tested the hypothesis that DCNNs trained with monkey V1 as a teacher would outperform those trained without this teacher signal. By several relevant metrics (including accuracy), we found that performance increased when monkey V1 was used as a teacher. Importantly, the monkey V1 data were collected in response to different images than the ones in the object recognition task. As a result, our approach of using the brain as a teacher signal can leverage pre-existing, publicly available neural data, without necessarily requiring new neuroscience experiments for each new machine learning task. Moreover, we also trained DCNNs with random teacher signals that matched the statistics of monkey V1 neural activations, and found that those networks outperformed ones trained without a teacher signal. This emphasizes a potential role for the statistical properties of neural activations as a form of regularizer for training DCNNs.
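To make the “teacher” idea concrete, the following is a minimal PyTorch sketch, in the spirit of the feature-matching distillation of Romero et al. (2015); the function name and the mean-squared-error form are illustrative assumptions, not the exact cost used in the studies cited above.

import torch
import torch.nn.functional as F

def teacher_matching_loss(student_feats: torch.Tensor,
                          teacher_feats: torch.Tensor) -> torch.Tensor:
    # Penalize the distance between the student's intermediate-layer
    # features and a fixed (detached) teacher representation of the
    # same images; adding this term to the task loss pushes the student
    # toward the teacher's representations.
    return F.mse_loss(student_feats, teacher_feats.detach())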

Related recent work has demonstrated success in using neural data to train machine learning models. One study found that fMRI measurements of human brain activations from subjects viewing images could guide the decision boundaries of support vector machines (SVMs) (Fong, Scheirer, & Cox, 2018). In that study, the authors weighted the training data based on how easy it was for the human brain to recognize each example as a member of its class (Fong et al., 2018). This differs from our work, in which we train deep convolutional neural networks with a two-part cost function that explicitly rewards matching the neural representations, rather than reweighting the cost of specific training examples. Moreover, in our work, the images shown to the animal during the collection of neural data were not category-labeled images from a machine learning benchmark task; by contrast, in the Fong et al. (2018) study, the neural data had to be collected for the same image set used in the categorization task. Previous work from Peterson, Battleday, Griffiths, and Russakovsky (2020) demonstrated that training deep convolutional neural networks with human perceptual uncertainty makes classification more robust to variations in the test set and to adversarial examples. In that study, human guidance was used to change the training labels, incorporating these uncertainties. We instead focus on changing the cost function – incorporating neural data into the network evaluation – without considering behavioral reports of uncertainty. Finally, previous work from Linsley, Shiebler, Eberhardt, and Serre (2019) demonstrated that using human behavioral data to add supervisory attention guidance improved object recognition performance. This is again quite a different approach from ours: while they focused on behavioral data, we instead used signals recorded from visual cortical neurons.

Notably, we did not aim to achieve state-of-the-art classification performance in this work: we instead sought to test whether the use of neural data as a “teacher” signal could robustly improve DCNN performance. For that reason, we studied a wide variety of network properties: different architectures and network sizes; different activation functions; and different optimizers. Our results indicate that, over all of these variations, (1) DCNNs trained to mimic monkey V1 (or surrogate data matching the statistics of monkey V1) have better object recognition performance; (2) DCNNs trained to mimic monkey V1 make fewer errors, and the errors they do make are more often within the correct superclass; and (3) DCNNs trained to mimic monkey V1 are more robust to label corruption. While the performance gains we observed were somewhat modest, given the robustness of those gains, we anticipate that future work could productively apply our new training method to other networks, potentially improving on the current state-of-the-art object recognition systems.

Section snippets

Monkey visual cortex data

Our monkey V1 teacher signal comes from publicly available recordings in which anesthetized monkeys were presented with a series of images while experimenters recorded the spiking activity of neurons in primary visual cortex (V1) with a multielectrode array (Coen-Cagli, Kohn, & Schwartz, 2015). These recordings were conducted in 10 experimental sessions with 3 different animals, resulting in recordings from 392 neurons. The monkeys were shown 270 static natural images as well as various…
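One natural way to summarize such recordings for use as a teacher signal, following the representational similarity analysis of Kriegeskorte et al. (2008) cited in the references, is an image-by-image similarity matrix over the population responses. The sketch below is illustrative: the array shapes mirror the dataset described above, but the function is an assumption, not the authors' exact pipeline.

import numpy as np

def v1_similarity_matrix(responses: np.ndarray) -> np.ndarray:
    # responses: (n_images, n_neurons) trial-averaged spike counts.
    # Returns an (n_images, n_images) matrix of Pearson correlations
    # between the population response vectors for each pair of images.
    return np.corrcoef(responses)

# e.g. 270 images x 392 neurons, matching the recordings described above
responses = np.random.rand(270, 392)  # placeholder for real spike counts
target_rsm = v1_similarity_matrix(responses)  # shape (270, 270)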

Results

We trained neural networks on the composite cost (Eq. (1)), with varying ratios r describing the trade-off between the representational similarity cost and the categorization cost. We evaluated the trained networks based on categorization accuracy achieved on held-out data (not used in training) from the CIFAR100 dataset. We ran each experiment 10 times with different random initializations, to demonstrate that differences in accuracy are not due to initial conditions (Mehrer et al., 2020). In…
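Eq. (1) itself is not reproduced in this excerpt, so the PyTorch sketch below shows one plausible form of such a composite cost; the trade-off ratio r, the correlation-based similarity term, and the function signature are our own illustrative assumptions.

import torch
import torch.nn.functional as F

def composite_loss(logits, labels, layer_acts, target_rsm, r=0.1):
    # Categorization term: standard cross-entropy on the CIFAR100 labels.
    cls = F.cross_entropy(logits, labels)
    # Representational similarity term: correlate an intermediate layer's
    # activations across the probe images, then match the resulting
    # similarity matrix to the monkey V1 target matrix.
    z = layer_acts - layer_acts.mean(dim=1, keepdim=True)
    z = z / (z.norm(dim=1, keepdim=True) + 1e-8)
    net_rsm = z @ z.t()
    sim = F.mse_loss(net_rsm, target_rsm)
    # Trade off the two terms with the ratio r described above.
    return r * sim + (1.0 - r) * cls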

Discussion

Training the early layers of convolutional neural networks to mimic the image representations from monkey V1 improves those networks’ ability to categorize previously-unseen images. Moreover, networks trained using monkey V1 as a representation “teacher” made errors that were more often within the correct superclass than did networks without the “teacher” signal. While the performance gains were modest, they were remarkably robust: we observed similar performance gains on large and small…

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

JZ is an Associate Fellow of CIFAR, in the Learning in Machines and Brains Program. JZ further acknowledges the following funding sources: Sloan Fellowship, Canada Research Chairs Program, and Natural Sciences and Engineering Research Council of Canada (NSERC). CF was supported by an NSF Graduate Research Fellowship, Award # 1553798. AF is a Fellow of the CIFAR program for Learning in Machines and Brains, and holds a Canada CIFAR AI Chair. AF and HX are funded through CIFAR and an NSERC Discovery…

References (32)

  • Barron, J. T. (2019). A general and adaptive robust loss function. In Proc IEEE comput soc conf comput vis...
  • Cadena, S. A., et al. (2017). Deep convolutional models improve predictions of macaque V1 responses to natural images.
  • Clevert, D., et al. Fast and accurate deep network learning by exponential linear units (ELUs).
  • Coen-Cagli, R., et al. (2015). Flexible gating of contextual influences in natural vision. Nature Neuroscience.
  • Fong, R. C., et al. (2018). Using human brain activity to guide machine learning. Scientific Reports.
  • Glorot, X., et al. (2010). Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research.
  • Güçlü, U., et al. (2015). Deep neural networks reveal a gradient in the complexity. The Journal of Neuroscience.
  • Hinton, G., et al. (2015). Distilling the knowledge in a neural network.
  • Hubel, D. H., et al. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology.
  • Kietzmann, T. C., et al. (2019). Recurrence is required to capture the representational dynamics of the human visual system. Proceedings of the National Academy of Sciences of the United States of America.
  • Kindel, W., et al. (2019). Using deep learning to probe the neural code for images in primary visual cortex. Journal of Vision.
  • Kingma, D. P., et al. Adam: A method for stochastic optimization.
  • Kriegeskorte, N., et al. (2008). Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience.
  • Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical Report TR-2009.
  • Krizhevsky, A., et al. ImageNet classification with deep convolutional neural networks.
  • Kubilius, J., et al. (2018). CORnet: Modeling the neural mechanisms of core object recognition.