Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition
Introduction
Humans can effortlessly and rapidly recognize surrounding objects [1], despite the tremendous variations in the projection of each object on the retina [2] caused by various transformations such as changes in object position, size, pose, illumination conditions and background context [3]. This invariant recognition is presumably handled through hierarchical processing in the so-called ventral pathway. Such hierarchical processing starts in V1 layers, which extract simple features such as bars and edges in different orientations [4], continues in intermediate layers such as V2 and V4, which are responsive to more complex features [5], and culminates in the inferior temporal cortex (IT), where the neurons are selective to object parts or whole objects [6]. Moving from the lower layers to the higher layers, feature complexity, receptive field size and transformation invariance all increase, in such a way that IT neurons can invariantly represent the objects in a linearly separable manner [7], [8].
Another amazing feature of the primates' visual system is its high processing speed. The first wave of image-driven neuronal responses in IT appears around 100 ms after the stimulus onset [1], [3]. Recordings from monkey IT cortex have demonstrated that the first spikes (over a short time window of 12.5 ms), about 100 ms after the image presentation, carry accurate information about the nature of the visual stimulus [7]. Hence, ultra-rapid object recognition is presumably performed in a feedforward manner [3]. Moreover, although there exist various intra- and inter-area feedback connections in the visual cortex, some neurophysiological [9], [10], [3] and theoretical [11] studies have also suggested that the feedforward information is usually sufficient for invariant object categorization.
Impressed by the speed and performance of the primates' visual system, computer vision scientists have long tried to "copy" it. So far, it is mostly the architecture of the visual system that has been mimicked. For instance, using hierarchical feedforward networks with restricted receptive fields, as in the brain, has proven useful [12], [13], [14], [15], [16], [17]. In comparison, the way biological visual systems learn appropriate features has attracted much less attention. All the above-mentioned approaches rely on learning rules that are not biologically plausible. Yet the ability of the visual cortex to wire itself, mostly in an unsupervised manner, is remarkable [18], [19].
Here, we propose that adding bio-inspired learning to bio-inspired architectures could improve the models' behavior. To this end, we focused on a particular form of synaptic plasticity known as spike timing-dependent plasticity (STDP), which has been observed in the mammalian visual cortex [20], [21]. Briefly, STDP reinforces the connections with afferents that significantly contributed to making a neuron fire, while it depresses the others [22]. A recent psychophysical study provided some indirect evidence for this form of plasticity in the human visual cortex [23].
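As an illustration, this potentiate-the-causes, depress-the-rest principle can be sketched in a few lines. This is a simplified multiplicative STDP step in the spirit of the rule used in [24], not the exact implementation; the parameter values and function name are illustrative:

```python
import numpy as np

def stdp_update(weights, pre_spike_times, post_spike_time,
                a_plus=0.005, a_minus=0.004):
    """Simplified multiplicative STDP step: synapses whose afferent
    fired at or before the postsynaptic spike are potentiated,
    all others are depressed. The w*(1-w) factor keeps the weights
    softly bounded in [0, 1]."""
    w = np.asarray(weights, dtype=float).copy()
    fired_before = np.asarray(pre_spike_times) <= post_spike_time
    w[fired_before] += a_plus * w[fired_before] * (1.0 - w[fired_before])
    w[~fired_before] -= a_minus * w[~fired_before] * (1.0 - w[~fired_before])
    return np.clip(w, 0.0, 1.0)
```

Applied repeatedly, such a rule concentrates weight on the afferents that consistently fire early, which is what makes the learned features both salient and reliable.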
In an earlier study [24], it was shown that combining a temporal coding scheme – in which the most strongly activated neurons of the entry layer of a spiking neural network fire first – with STDP leads neurons in higher visual areas to gradually become selective to complex visual features in an unsupervised manner. These features are both salient and consistently present in the inputs. Furthermore, as learning progresses, the neurons' responses rapidly accelerate. These responses can then be fed to a classifier for a categorization task.
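The intensity-to-latency conversion underlying this temporal code can be sketched as follows. This is an illustrative simplification (in [24] the encoded values are Gabor filter responses, and the function name and time constant here are our own):

```python
import numpy as np

def intensity_to_latency(activations, t_max=10.0):
    """Temporal coding: map each unit's activation to a spike time,
    so that the most strongly activated units fire first."""
    a = np.asarray(activations, dtype=float)
    a_norm = (a - a.min()) / (a.max() - a.min() + 1e-12)  # scale to [0, 1]
    return t_max * (1.0 - a_norm)  # strong activation -> short latency
```

Under this scheme the rank order of first spikes, rather than precise firing rates, carries the stimulus information that downstream STDP neurons learn from.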
In this study, we show that such an approach strongly outperforms state-of-the-art computer vision algorithms on view-invariant object recognition benchmark tasks including the 3D-Object [25], [26] and ETH-80 [27] datasets. These datasets contain natural and unsegmented images, where objects have large variations in scale, viewpoint, and tilt, which makes their recognition hard [28], and probably out of reach for most of the other bio-inspired models [29], [30]. Yet our algorithm generalizes surprisingly well, even when "simple classifiers" are used, because STDP naturally extracts features that are class specific. This point was further confirmed using mutual information [31] and representational dissimilarity matrices (RDM) [32]. Moreover, the distribution of objects in the obtained feature space was analyzed using hierarchical clustering [33], and objects of the same category tended to cluster together.
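For reference, an RDM of the kind used in such analyses can be computed from the network's feature vectors as pairwise correlation distances. This is a generic sketch (the exact dissimilarity measure and feature dimensions used in the paper may differ):

```python
import numpy as np

def rdm(features):
    """RDM entry (i, j) = 1 - Pearson correlation between the feature
    vectors of stimuli i and j (rows of `features`)."""
    return 1.0 - np.corrcoef(features)

# e.g. 6 stimuli, each represented by 20 C2 feature activations
X = np.random.rand(6, 20)
D = rdm(X)
```

A block-diagonal structure in such a matrix, with low dissimilarity among stimuli of the same category, is the signature of class-specific representations.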
Materials and methods
The algorithm we used here is a scaled-up version of the one presented in [24]. Essentially, many more C2 features and iterations were used. Our code is available upon request. We used a five-layer hierarchical network, largely inspired by the HMAX model [14] (see Fig. 1). Specifically, we alternated simple cells, which gain selectivity through a sum operation, and complex cells, which gain shift and scale invariance through a max operation. However, our network uses spiking neurons.
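The alternation of sum-based simple cells and max-based complex cells can be sketched in a rate-based simplification that ignores the spiking dynamics, the 2D layout of the layers, and the actual pooling window sizes (function names and shapes are illustrative):

```python
import numpy as np

def simple_cell_layer(afferents, weights):
    """S layer: each simple cell gains selectivity through a
    weighted sum over its afferents."""
    return afferents @ weights

def complex_cell_layer(responses, pool_size=2):
    """C layer: each complex cell gains shift invariance by taking
    the max over a local pool of simple cells tuned to the same
    preferred feature at neighboring positions."""
    r = np.asarray(responses, dtype=float)
    n = (len(r) // pool_size) * pool_size
    return r[:n].reshape(-1, pool_size).max(axis=1)
```

Stacking these two operations over increasingly large receptive fields is what yields the growing feature complexity and invariance described in the Introduction.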
Dataset and experimental setup
To study the robustness of our model with respect to different transformations such as scale and viewpoint, we evaluated it on the 3D-Object and ETH-80 datasets. The 3D-Object dataset is provided by Savarese et al. at CVGLab, Stanford University [25]. It contains 10 different object classes: bicycle, car, cellphone, head, iron, monitor, mouse, shoe, stapler, and toaster. There are about 10 different instances for each object class. The object instances are photographed in about 72 different viewing conditions.
Discussion
Position and scale invariance in our model are built-in, thanks to weight sharing and the scaling process. Conversely, view invariance must be obtained through the learning process. Here, we used all images of five object instances from each category (varied in all dimensions) to learn the S2 visual features, while the images of all other object instances of each category were used to test the network. Hence, the model was exposed to all possible variations during learning to gain view invariance.
Conclusions
To date, various bio-inspired network architectures for object recognition have been introduced, but the learning mechanism of biological visual systems has been neglected. In this paper, we demonstrate that the association of a bio-inspired network architecture with a bio-inspired learning rule results in a robust object recognition system. The STDP-based feature learning used in our model extracts frequent, diagnostic, and class-specific features that are robust to deformations in stimulus appearance.
Acknowledgements
We would like to thank Mr. Majid Changi Ashtiani at the Math Computing Center of IPM (http://math.ipm.ac.ir/mcc) for letting us perform some parts of the calculations on their computing cluster. We also thank Dr. Reza Ebrahimpour for his helpful discussions and suggestions.
Saeed Reza Kheradpisheh is currently doing his Ph.D. in Computer Science at the University of Tehran. He received B.Sc. and M.Sc. degrees in Computer Science. His research interests in the area of Computational Visual Neuroscience include invariant object recognition, spiking neural networks, and deep learning.
References (65)
- et al., How does the brain solve visual object recognition? Neuron (2012)
- et al., The fine structure of shape tuning in area V4, Neuron (2013)
- et al., Timing, timing, timing: fast decoding of object information from intracranial field potentials in human visual cortex, Neuron (2009)
- Learning in mammalian sensory cortex, Current Opinion Neurobiol. (2004)
- et al., Learning and neural plasticity in visual object recognition, Current Opinion Neurobiol. (2006)
- et al., Receptive-field modification in rat visual cortex induced by paired visual stimulation and single-cell spiking, Neuron (2006)
- The spike-timing dependence of plasticity, Neuron (2012)
- et al., Stimulus timing-dependent plasticity in high-level vision, Current Biol. (2012)
- et al., Feature selection methods and their combinations in high-dimensional classification of speaker likability, intelligibility and personality traits, Computer Speech & Language (2015)
- et al., Taking the MAX from neuronal responses, Trends Cogn. Sci. (2003)
- Neural networks and neuroscience-inspired computer vision, Current Biol.
- Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity, Neural Networks: Official J. Int. Neural Netw. Soc.
- AER-SRT: scalable spike distribution by means of synchronous serial ring topology address event representation, Neurocomputing
- Speed of processing in the human visual system, Nature
- Recognition-by-components: a theory of human image understanding, Psychol. Rev.
- Coding of color and form in the geniculostriate visual pathway (invited review), J. Opt. Soc. Am. A
- Coding visual images of objects in the inferotemporal cortex of the macaque monkey, J. Neurophysiol.
- Fast readout of object identity from macaque inferior temporal cortex, Science
- Selectivity and tolerance (invariance) both increase as visual information propagates from cortical area V4 to IT, J. Neurosci.
- Functional compartmentalization and viewpoint generalization within the macaque face-processing system, Science
- Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybern.
- Robust object recognition with cortex-like mechanisms, IEEE Trans. Pattern Anal. Machine Intell.
- Associative Hebbian synaptic plasticity in primate visual cortex, J. Neurosci.
- Unsupervised learning of visual features through spike timing dependent plasticity, PLoS Comput. Biol.
Mohammad Ganjtabesh received his B.Sc. degree in Pure Mathematics from the University of Tabriz in 2001, and the M.Sc. and Ph.D. degrees in Computer Science from the University of Tehran in 2003 and 2008, respectively. He also completed a second Ph.D. program in Bioinformatics at Ecole Polytechnique (France). He has been an assistant professor at the University of Tehran since 2008. His current research interests include Computational Neuroscience (mainly visual cortex modeling based on spiking neural networks) and Computational Biology (problems associated with RNA structures).
Timothée Masquelier is a researcher in Computational Neuroscience. His research is highly interdisciplinary, at the interface between Biology, Computer Science, and Physics. He uses numerical simulations and analytical calculations to gain understanding of how the brain works, and more specifically of how neurons process, encode and transmit information through action potentials (a.k.a. spikes), in particular in the visual modality. He is also interested in bio-inspired computer vision and neuromorphic engineering. Timothée Masquelier was trained at Ecole Centrale Paris (Ingénieur 1999), MIT (M.Sc. 2001), and Université Toulouse III (Ph.D. 2008). He was recruited by the C.N.R.S. in 2012.