Elsevier

Neurocomputing

Volume 205, 12 September 2016, Pages 382-392
Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition

https://doi.org/10.1016/j.neucom.2016.04.029

Abstract

The retinal image of surrounding objects varies tremendously due to changes in position, size, pose, illumination condition, background context, occlusion, noise, and non-rigid deformations. Despite these huge variations, our visual system is able to invariantly recognize any object in just a fraction of a second. To date, various computational models have been proposed to mimic the hierarchical processing of the ventral visual pathway, with limited success. Here, we show that the association of both biologically inspired network architecture and learning rule significantly improves the models' performance when facing challenging invariant object recognition problems. Our model is an asynchronous feedforward spiking neural network. When the network is presented with natural images, the neurons in the entry layers detect edges, and the most activated ones fire first, while neurons in higher layers are equipped with spike timing-dependent plasticity. These neurons progressively become selective to intermediate-complexity visual features appropriate for object categorization. The model is evaluated on the 3D-Object and ETH-80 datasets, two benchmarks for invariant object recognition, and is shown to outperform state-of-the-art models, including DeepConvNet and HMAX. This demonstrates its ability to accurately recognize different instances of multiple object classes even under various appearance conditions (different views, scales, tilts, and backgrounds). Several statistical analysis techniques are used to show that our model extracts class-specific and highly informative features.

Introduction

Humans can effortlessly and rapidly recognize surrounding objects [1], despite the tremendous variations in the projection of each object on the retina [2] caused by various transformations such as changes in object position, size, pose, illumination condition and background context [3]. This invariant recognition is presumably handled through hierarchical processing in the so-called ventral pathway. Such hierarchical processing starts in V1 layers, which extract simple features such as bars and edges in different orientations [4], continues in intermediate layers such as V2 and V4, which are responsive to more complex features [5], and culminates in the inferior temporal cortex (IT), where the neurons are selective to object parts or whole objects [6]. By moving from the lower layers to the higher layers, the feature complexity, receptive field size and transformation invariance increase, in such a way that the IT neurons can invariantly represent the objects in a linearly separable manner [7], [8].

Another remarkable feature of the primates' visual system is its high processing speed. The first wave of image-driven neuronal responses in IT appears around 100 ms after stimulus onset [1], [3]. Recordings from monkey IT cortex have demonstrated that the first spikes (over a short time window of 12.5 ms), about 100 ms after image presentation, carry accurate information about the nature of the visual stimulus [7]. Hence, ultra-rapid object recognition is presumably performed in a feedforward manner [3]. Moreover, although various intra- and inter-area feedback connections exist in the visual cortex, some neurophysiological [9], [10], [3] and theoretical [11] studies have suggested that feedforward information is usually sufficient for invariant object categorization.

Impressed by the speed and performance of the primates' visual system, computer vision scientists have long tried to "copy" it. So far, it is mostly the architecture of the visual system that has been mimicked. For instance, using hierarchical feedforward networks with restricted receptive fields, as in the brain, has proven useful [12], [13], [14], [15], [16], [17]. In comparison, the way that biological visual systems learn the appropriate features has attracted much less attention. All the above-mentioned approaches use learning rules that are not biologically plausible. Yet the ability of the visual cortex to wire itself, mostly in an unsupervised manner, is remarkable [18], [19].

Here, we propose that adding bio-inspired learning to bio-inspired architectures could improve the models' behavior. To this end, we focused on a particular form of synaptic plasticity known as spike timing-dependent plasticity (STDP), which has been observed in the mammalian visual cortex [20], [21]. Briefly, STDP reinforces the connections with afferents that significantly contributed to making a neuron fire, while it depresses the others [22]. A recent psychophysical study provided some indirect evidence for this form of plasticity in the human visual cortex [23].
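The STDP rule described above can be sketched as a simple weight update. The following is a minimal illustration, not the paper's exact implementation: the multiplicative form and the learning-rate values (`a_plus`, `a_minus`) are assumptions, chosen to resemble common simplified STDP models in which afferents that spiked before the postsynaptic spike are potentiated and the rest are depressed.

```python
import numpy as np

def stdp_update(weights, pre_spike_times, post_spike_time,
                a_plus=0.004, a_minus=0.003):
    """Simplified additive/multiplicative STDP sketch (illustrative only).

    Afferents whose spikes arrived at or before the postsynaptic spike
    (i.e., that contributed to making the neuron fire) are potentiated;
    all other afferents are depressed. The w*(1-w) factor keeps weights
    softly bounded in [0, 1].
    """
    w = weights.astype(float).copy()
    contributed = np.asarray(pre_spike_times) <= post_spike_time
    w[contributed] += a_plus * w[contributed] * (1.0 - w[contributed])
    w[~contributed] -= a_minus * w[~contributed] * (1.0 - w[~contributed])
    return np.clip(w, 0.0, 1.0)
```

Note that this update depends only on spike order, not on exact time differences, which is one common simplification of STDP in feedforward spiking networks.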

In an earlier study [24], it was shown that combining a temporal coding scheme – in which, in the entry layer of a spiking neural network, the most strongly activated neurons fire first – with STDP leads neurons in higher visual areas to gradually become selective to complex visual features in an unsupervised manner. These features are both salient and consistently present in the inputs. Furthermore, as learning progresses, the neurons' responses accelerate. These responses can then be fed to a classifier to perform a categorization task.
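The temporal coding scheme above, in which the most strongly activated neurons fire first, can be sketched as an intensity-to-latency conversion. This is a hedged illustration under the assumption of a simple inverse relation between activation and latency; the function name and the `t_max` scaling are not from the paper.

```python
import numpy as np

def intensity_to_latency(activations, t_max=1.0):
    """Intensity-to-latency code: stronger activation -> earlier spike.

    Each unit's firing latency is inversely proportional to its
    activation; units with zero activation never fire (latency = inf).
    Downstream neurons can thus read out stimulus information from
    the *order* of the first spikes alone.
    """
    a = np.asarray(activations, dtype=float)
    latencies = np.full(a.shape, np.inf)
    active = a > 0
    latencies[active] = t_max / a[active]
    return latencies
```

Under this code, rank order is preserved: sorting units by latency recovers the descending order of their activations, which is what makes first-spike readout feasible.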

In this study, we show that such an approach strongly outperforms state-of-the-art computer vision algorithms on view-invariant object recognition benchmark tasks including 3D-Object [25], [26] and ETH-80 [27] datasets. These datasets contain natural and unsegmented images, where objects have large variations in scale, viewpoint, and tilt, which makes their recognition hard [28], and probably out of reach for most of the other bio-inspired models [29], [30]. Yet our algorithm generalizes surprisingly well, even when “simple classifiers” are used, because STDP naturally extracts features that are class specific. This point was further confirmed using mutual information [31] and representational dissimilarity matrix (RDM) [32]. Moreover, the distribution of objects in the obtained feature space was analyzed using hierarchical clustering [33], and objects of the same category tended to cluster together.

Section snippets

Materials and methods

The algorithm used here is a scaled-up version of the one presented in [24]; essentially, many more C2 features and iterations were used. Our code is available upon request. We used a five-layer hierarchical network (S1 → C1 → S2 → C2 → classifier), largely inspired by the HMAX model [14] (see Fig. 1). Specifically, we alternated simple cells that gain selectivity through a sum operation with complex cells that gain shift and scale invariance through a max operation. However, our network uses spiking
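The complex-cell (C-layer) max operation mentioned above can be sketched as local max pooling over a simple-cell (S-layer) response map. This is a minimal sketch, assuming non-overlapping square pooling windows; the pooling size and function name are illustrative, not the paper's exact parameters.

```python
import numpy as np

def complex_cell_max_pool(s_map, pool=2):
    """Complex-cell sketch: local max over an S-layer response map.

    Each C-cell takes the maximum response of the S-cells inside its
    pool x pool receptive field, which grants tolerance to small shifts
    of the preferred feature within that window.
    """
    s_map = np.asarray(s_map, dtype=float)
    h, w = s_map.shape
    h2, w2 = h // pool, w // pool
    # Crop to a multiple of the pool size, then reduce each block by max.
    blocks = s_map[:h2 * pool, :w2 * pool].reshape(h2, pool, w2, pool)
    return blocks.max(axis=(1, 3))
```

In the spiking formulation, the same max operation corresponds to propagating only the earliest spike within each pooling window, since under intensity-to-latency coding the most activated S-cell fires first.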

Dataset and experimental setup

To study the robustness of our model with respect to different transformations such as scale and viewpoint, we evaluated it on the 3D-Object and ETH-80 datasets. The 3D-Object is provided by Savarese et al. at CVGLab, Stanford University [25]. This dataset contains 10 different object classes: bicycle, car, cellphone, head, iron, monitor, mouse, shoe, stapler, and toaster. There are about 10 different instances for each object class. The object instances are photographed in about 72 different

Discussion

Position and scale invariance in our model are built-in, thanks to weight sharing and the scaling process. Conversely, view invariance must be obtained through the learning process. Here, we used all images of five object instances from each category (varied in all dimensions) to learn the S2 visual features, while images of all other object instances of each category were used to test the network. Hence, the model was exposed to all possible variations during learning in order to gain view invariance.

Conclusions

To date, various bio-inspired network architectures for object recognition have been introduced, but the learning mechanism of biological visual systems has been neglected. In this paper, we demonstrate that the association of both a bio-inspired network architecture and a bio-inspired learning rule results in a robust object recognition system. The STDP-based feature learning used in our model extracts frequent, diagnostic, and class-specific features that are robust to deformations in stimulus appearance. It

Acknowledgements

We would like to thank Mr. Majid Changi Ashtiani at the Math Computing Center of IPM (http://math.ipm.ac.ir/mcc) for letting us perform some parts of the calculations on their computing cluster. We also thank Dr. Reza Ebrahimpour for his helpful discussions and suggestions.

Saeed Reza Kheradpisheh is currently doing his Ph.D. in Computer Science at University of Tehran. He has received B.Sc. and M.Sc. degrees in Computer Science. His research interests in the area of Computational Visual Neuroscience include invariant object recognition, spiking neural networks, and deep learning.

References (65)

  • D.D. Cox et al., Neural networks and neuroscience-inspired computer vision, Current Biol. (2014)
  • O. Bichler et al., Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity, Neural Networks: Official J. Int. Neural Netw. Soc. (2012)
  • T. Dorta et al., AER-SRT: scalable spike distribution by means of synchronous serial ring topology address event representation, Neurocomputing (2016)
  • S. Thorpe et al., Speed of processing in the human visual system, Nature (1996)
  • I. Biederman, Recognition-by-components: a theory of human image understanding, Psychol. Rev. (1987)
  • P. Lennie et al., Coding of color and form in the geniculostriate visual pathway (invited review), J. Opt. Soc. Am. A (2005)
  • K. Tanaka et al., Coding visual images of objects in the inferotemporal cortex of the macaque monkey, J. Neurophysiol. (1991)
  • C.P. Hung et al., Fast readout of object identity from macaque inferior temporal cortex, Science (2005)
  • N.C. Rust et al., Selectivity and tolerance (invariance) both increase as visual information propagates from cortical area V4 to IT, J. Neurosci. (2010)
  • W.A. Freiwald et al., Functional compartmentalization and viewpoint generalization within the macaque face-processing system, Science (2010)
  • F. Anselmi, J.Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, T. Poggio, Unsupervised learning of invariant...
  • K. Fukushima, Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybern. (1980)
  • Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, in: M.A. Arbib (Ed.), The Handbook of...
  • T. Serre et al., Robust object recognition with cortex-like mechanisms, IEEE Trans. Pattern Anal. Mach. Intell. (2007)
  • H. Lee, R. Grosse, R. Ranganath, A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of...
  • A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: Neural...
  • Q.V. Le, Building high-level features using large scale unsupervised learning, in: Proceedings of the IEEE...
  • S. Huang et al., Associative Hebbian synaptic plasticity in primate visual cortex, J. Neurosci. (2014)
  • T. Masquelier et al., Unsupervised learning of visual features through spike timing dependent plasticity, PLoS Comput. Biol. (2007)
  • S. Savarese, L. Fei-Fei, 3D generic object categorization, localization and pose estimation, in: Proceedings of the...
  • B. Pepik, M. Stark, P. Gehler, B. Schiele, Multi-view priors for learning detectors from sparse viewpoint data, in:...
  • B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: Proceedings of the...
    Mohammad Ganjtabesh received his B.Sc. degree in Pure Mathematics from the University of Tabriz in 2001, and the M.Sc. and Ph.D. degrees in computer science from the University of Tehran in 2003 and 2008, respectively. He has also performed another Ph.D. program in Bioinformatics at Ecole Polytechnique (France). Then he became an assistant professor at the University of Tehran since 2008. His current research interests include Computational Neuroscience (mainly visual cortex modeling based on spiking neural networks) and Computational Biology (all the problems associated to the RNA structures).

    Timothée Masquelier is a researcher in Computational Neuroscience. His research is highly interdisciplinary - at the interface between Biology, Computer Science, and Physics. He uses numerical simulations and analytical calculations to gain understanding on how the brain works, and more specifically on how neurons process, encode and transmit information through action potentials (a.k.a spikes), in particular in the visual modality. He is also interested in bio-inspired computer vision and neuromorphic engineering. Timothée Masquelier was trained at Ecole Centrale Paris (Ingénieur 1999), MIT (M. Sc. 2001), and Université Toulouse III (Ph.D. 2008). He was recruited by the C.N.R.S. in 2012.
