Elsevier

Neurocomputing

Volume 205, 12 September 2016, Pages 382-392
Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition

https://doi.org/10.1016/j.neucom.2016.04.029

Abstract

The retinal image of surrounding objects varies tremendously due to changes in position, size, pose, illumination condition, background context, occlusion, noise, and non-rigid deformations. Despite these huge variations, our visual system is able to invariantly recognize any object in just a fraction of a second. To date, various computational models have been proposed to mimic the hierarchical processing of the ventral visual pathway, with limited success. Here, we show that the association of both biologically inspired network architecture and learning rule significantly improves the models' performance when facing challenging invariant object recognition problems. Our model is an asynchronous feedforward spiking neural network. When the network is presented with natural images, the neurons in the entry layers detect edges, and the most activated ones fire first, while neurons in higher layers are equipped with spike timing-dependent plasticity. These neurons progressively become selective to intermediate-complexity visual features appropriate for object categorization. The model is evaluated on the 3D-Object and ETH-80 datasets, two benchmarks for invariant object recognition, and is shown to outperform state-of-the-art models, including DeepConvNet and HMAX. This demonstrates its ability to accurately recognize different instances of multiple object classes even under various appearance conditions (different views, scales, tilts, and backgrounds). Several statistical analysis techniques are used to show that our model extracts class-specific and highly informative features.

Introduction

Humans can effortlessly and rapidly recognize surrounding objects [1], despite the tremendous variations in the projection of each object on the retina [2] caused by various transformations such as changes in object position, size, pose, illumination condition and background context [3]. This invariant recognition is presumably handled through hierarchical processing in the so-called ventral pathway. Such hierarchical processing starts in V1 layers, which extract simple features such as bars and edges in different orientations [4], continues in intermediate layers such as V2 and V4, which are responsive to more complex features [5], and culminates in the inferior temporal cortex (IT), where the neurons are selective to object parts or whole objects [6]. By moving from the lower layers to the higher layers, the feature complexity, receptive field size and transformation invariance increase, in such a way that the IT neurons can invariantly represent the objects in a linearly separable manner [7], [8].

Another remarkable feature of the primates' visual system is its high processing speed. The first wave of image-driven neuronal responses in IT appears around 100 ms after stimulus onset [1], [3]. Recordings from monkey IT cortex have demonstrated that the first spikes (over a short time window of 12.5 ms), about 100 ms after image presentation, carry accurate information about the nature of the visual stimulus [7]. Hence, ultra-rapid object recognition is presumably performed in a feedforward manner [3]. Moreover, although various intra- and inter-area feedback connections exist in the visual cortex, some neurophysiological [9], [10], [3] and theoretical [11] studies have suggested that feedforward information is usually sufficient for invariant object categorization.

Impressed by the speed and performance of the primates' visual system, computer vision scientists have long tried to "copy" it. So far, it is mostly the architecture of the visual system that has been mimicked. For instance, using hierarchical feedforward networks with restricted receptive fields, as in the brain, has proven useful [12], [13], [14], [15], [16], [17]. In comparison, the way that biological visual systems learn the appropriate features has attracted much less attention. All the above-mentioned approaches use learning rules that are not biologically plausible. Yet the ability of the visual cortex to wire itself, mostly in an unsupervised manner, is remarkable [18], [19].

Here, we propose that adding bio-inspired learning to bio-inspired architectures could improve the models' behavior. To this end, we focused on a particular form of synaptic plasticity known as spike timing-dependent plasticity (STDP), which has been observed in the mammalian visual cortex [20], [21]. Briefly, STDP reinforces the connections with afferents that significantly contributed to making a neuron fire, while it depresses the others [22]. A recent psychophysical study provided some indirect evidence for this form of plasticity in the human visual cortex [23].
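The STDP rule described above can be sketched as a simple weight update. The following is a minimal illustration, not the paper's exact implementation: the multiplicative form and the learning-rate values (`a_plus`, `a_minus`) are assumptions, chosen to resemble common simplified STDP models in which afferents that spiked before the postsynaptic spike are potentiated and the rest are depressed.

```python
import numpy as np

def stdp_update(weights, pre_spike_times, post_spike_time,
                a_plus=0.004, a_minus=0.003):
    """Simplified additive/multiplicative STDP sketch (illustrative only).

    Afferents whose spikes arrived at or before the postsynaptic spike
    (i.e., that contributed to making the neuron fire) are potentiated;
    all other afferents are depressed. The w*(1-w) factor keeps weights
    softly bounded in [0, 1].
    """
    w = weights.astype(float).copy()
    contributed = np.asarray(pre_spike_times) <= post_spike_time
    w[contributed] += a_plus * w[contributed] * (1.0 - w[contributed])
    w[~contributed] -= a_minus * w[~contributed] * (1.0 - w[~contributed])
    return np.clip(w, 0.0, 1.0)
```

Note that this update depends only on spike order, not on exact time differences, which is one common simplification of STDP in feedforward spiking networks.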

In an earlier study [24], it was shown that combining a temporal coding scheme – in which, in the entry layer of a spiking neural network, the most strongly activated neurons fire first – with STDP leads neurons in higher visual areas to gradually become selective to complex visual features in an unsupervised manner. These features are both salient and consistently present in the inputs. Furthermore, as learning progresses, the neurons' responses accelerate. These responses can then be fed to a classifier to perform a categorization task.
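The temporal coding scheme above, in which the most strongly activated neurons fire first, can be sketched as an intensity-to-latency conversion. This is a hedged illustration under the assumption of a simple inverse relation between activation and latency; the function name and the `t_max` scaling are not from the paper.

```python
import numpy as np

def intensity_to_latency(activations, t_max=1.0):
    """Intensity-to-latency code: stronger activation -> earlier spike.

    Each unit's firing latency is inversely proportional to its
    activation; units with zero activation never fire (latency = inf).
    Downstream neurons can thus read out stimulus information from
    the *order* of the first spikes alone.
    """
    a = np.asarray(activations, dtype=float)
    latencies = np.full(a.shape, np.inf)
    active = a > 0
    latencies[active] = t_max / a[active]
    return latencies
```

Under this code, rank order is preserved: sorting units by latency recovers the descending order of their activations, which is what makes first-spike readout feasible.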

In this study, we show that such an approach strongly outperforms state-of-the-art computer vision algorithms on view-invariant object recognition benchmark tasks including 3D-Object [25], [26] and ETH-80 [27] datasets. These datasets contain natural and unsegmented images, where objects have large variations in scale, viewpoint, and tilt, which makes their recognition hard [28], and probably out of reach for most of the other bio-inspired models [29], [30]. Yet our algorithm generalizes surprisingly well, even when “simple classifiers” are used, because STDP naturally extracts features that are class specific. This point was further confirmed using mutual information [31] and representational dissimilarity matrix (RDM) [32]. Moreover, the distribution of objects in the obtained feature space was analyzed using hierarchical clustering [33], and objects of the same category tended to cluster together.

Section snippets

Materials and methods

The algorithm used here is a scaled-up version of the one presented in [24]; essentially, many more C2 features and iterations were used. Our code is available upon request. We used a five-layer hierarchical network (S1 → C1 → S2 → C2 → classifier), largely inspired by the HMAX model [14] (see Fig. 1). Specifically, we alternated simple cells that gain selectivity through a sum operation with complex cells that gain shift and scale invariance through a max operation. However, our network uses spiking
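The complex-cell (C-layer) max operation mentioned above can be sketched as local max pooling over a simple-cell (S-layer) response map. This is a minimal sketch, assuming non-overlapping square pooling windows; the pooling size and function name are illustrative, not the paper's exact parameters.

```python
import numpy as np

def complex_cell_max_pool(s_map, pool=2):
    """Complex-cell sketch: local max over an S-layer response map.

    Each C-cell takes the maximum response of the S-cells inside its
    pool x pool receptive field, which grants tolerance to small shifts
    of the preferred feature within that window.
    """
    s_map = np.asarray(s_map, dtype=float)
    h, w = s_map.shape
    h2, w2 = h // pool, w // pool
    # Crop to a multiple of the pool size, then reduce each block by max.
    blocks = s_map[:h2 * pool, :w2 * pool].reshape(h2, pool, w2, pool)
    return blocks.max(axis=(1, 3))
```

In the spiking formulation, the same max operation corresponds to propagating only the earliest spike within each pooling window, since under intensity-to-latency coding the most activated S-cell fires first.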

Dataset and experimental setup

To study the robustness of our model with respect to different transformations such as scale and viewpoint, we evaluated it on the 3D-Object and ETH-80 datasets. The 3D-Object is provided by Savarese et al. at CVGLab, Stanford University [25]. This dataset contains 10 different object classes: bicycle, car, cellphone, head, iron, monitor, mouse, shoe, stapler, and toaster. There are about 10 different instances for each object class. The object instances are photographed in about 72 different

Discussion

Position and scale invariance in our model are built-in, thanks to weight sharing and the scaling process. Conversely, view invariance must be obtained through the learning process. Here, we used all images of five object instances from each category (varied in all dimensions) to learn the S2 visual features, while images of all other object instances of each category were used to test the network. Hence, the model was exposed to all possible variations during learning in order to gain view invariance.

Conclusions

To date, various bio-inspired network architectures for object recognition have been introduced, but the learning mechanism of biological visual systems has been neglected. In this paper, we demonstrate that the association of both a bio-inspired network architecture and a bio-inspired learning rule results in a robust object recognition system. The STDP-based feature learning used in our model extracts frequent, diagnostic, and class-specific features that are robust to deformations in stimulus appearance. It

Acknowledgements

We would like to thank Mr. Majid Changi Ashtiani at the Math Computing Center of IPM (http://math.ipm.ac.ir/mcc) for letting us perform some parts of the calculations on their computing cluster. We also thank Dr. Reza Ebrahimpour for his helpful discussions and suggestions.

Saeed Reza Kheradpisheh is currently doing his Ph.D. in Computer Science at University of Tehran. He has received B.Sc. and M.Sc. degrees in Computer Science. His research interests in the area of Computational Visual Neuroscience include invariant object recognition, spiking neural networks, and deep learning.

References (65)

  • D.D. Cox et al., Neural networks and neuroscience-inspired computer vision, Current Biol. (2014)
  • O. Bichler et al., Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity, Neural Networks: Official J. Int. Neural Netw. Soc. (2012)
  • T. Dorta et al., AER-SRT: scalable spike distribution by means of synchronous serial ring topology address event representation, Neurocomputing (2016)
  • S. Thorpe et al., Speed of processing in the human visual system, Nature (1996)
  • I. Biederman, Recognition-by-components: a theory of human image understanding, Psychol. Rev. (1987)
  • P. Lennie et al., Coding of color and form in the geniculostriate visual pathway (invited review), J. Opt. Soc. Am. A (2005)
  • K. Tanaka et al., Coding visual images of objects in the inferotemporal cortex of the macaque monkey, J. Neurophysiol. (1991)
  • C.P. Hung et al., Fast readout of object identity from macaque inferior temporal cortex, Science (2005)
  • N.C. Rust et al., Selectivity and tolerance (invariance) both increase as visual information propagates from cortical area V4 to IT, J. Neurosci. (2010)
  • W.A. Freiwald et al., Functional compartmentalization and viewpoint generalization within the macaque face-processing system, Science (2010)
  • F. Anselmi, J.Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, T. Poggio, Unsupervised learning of invariant...
  • K. Fukushima, Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybern. (1980)
  • Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, in: M.A. Arbib (Ed.), The Handbook of...
  • T. Serre et al., Robust object recognition with cortex-like mechanisms, IEEE Trans. Pattern Anal. Mach. Intell. (2007)
  • H. Lee, R. Grosse, R. Ranganath, A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of...
  • A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: Neural...
  • Q.V. Le, Building high-level features using large scale unsupervised learning, in: Proceedings of the IEEE...
  • S. Huang et al., Associative Hebbian synaptic plasticity in primate visual cortex, J. Neurosci. (2014)
  • T. Masquelier et al., Unsupervised learning of visual features through spike timing dependent plasticity, PLoS Comput. Biol. (2007)
  • S. Savarese, L. Fei-Fei, 3D generic object categorization, localization and pose estimation, in: Proceedings of the...
  • B. Pepik, M. Stark, P. Gehler, B. Schiele, Multi-view priors for learning detectors from sparse viewpoint data, in:...
  • B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: Proceedings of the...
    Mohammad Ganjtabesh received his B.Sc. degree in Pure Mathematics from the University of Tabriz in 2001, and the M.Sc. and Ph.D. degrees in computer science from the University of Tehran in 2003 and 2008, respectively. He has also performed another Ph.D. program in Bioinformatics at Ecole Polytechnique (France). Then he became an assistant professor at the University of Tehran since 2008. His current research interests include Computational Neuroscience (mainly visual cortex modeling based on spiking neural networks) and Computational Biology (all the problems associated to the RNA structures).

    Timothée Masquelier is a researcher in Computational Neuroscience. His research is highly interdisciplinary - at the interface between Biology, Computer Science, and Physics. He uses numerical simulations and analytical calculations to gain understanding on how the brain works, and more specifically on how neurons process, encode and transmit information through action potentials (a.k.a spikes), in particular in the visual modality. He is also interested in bio-inspired computer vision and neuromorphic engineering. Timothée Masquelier was trained at Ecole Centrale Paris (Ingénieur 1999), MIT (M. Sc. 2001), and Université Toulouse III (Ph.D. 2008). He was recruited by the C.N.R.S. in 2012.
