Abstract

Autonomous object recognition in images is one of the most critical topics in security and commercial applications. Owing to recent advances in visual neuroscience, researchers tend to adopt biologically plausible schemes to improve the accuracy of object recognition. Preprocessing, however, is a part of the visual recognition pipeline that has received much less attention. In this paper, we propose a new, simple, and biologically inspired preprocessing technique based on the data-driven mechanism of visual attention. First, the responses of Retinal Ganglion Cells (RGCs) are simulated. From these responses, an efficient threshold is selected, and the points of the raw image that carry the most information are extracted accordingly. New images containing only these points are then created and, by combining them with entropy coefficients, the most salient object is located. After extracting appropriate features, a classifier assigns the initial image to one of the predefined object categories. Our system was evaluated on the Caltech-101 dataset. Experimental results demonstrate the efficacy and effectiveness of this preprocessing method.

1. Introduction

One of the challenges in the field of artificial intelligence is object recognition. The objective of this process is to classify an object into one of the predefined categories. The task faces various difficulties, such as cluttered and noisy backgrounds or objects under different illumination and contrast conditions. Human beings can detect and classify objects effortlessly in a short time. Researchers believe that a recognition system that is closer to the human visual system will perform better. In other words, numerous studies [1–3] have shown that, by taking inspiration from the human visual system, recognition systems can be designed with relatively high accuracy. Following recent advances in visual neuroscience, researchers tend to develop biologically plausible algorithms to improve the accuracy of object recognition systems. Object recognition relies considerably on image representation, and in this paper a novel biologically inspired model is presented for this stage. Among image representation models, the bag-of-words (BoW) representation [4] has been widely employed because it is robust to changes in object scale and translation. The three modules of BoW models are feature extraction, feature coding, and feature pooling. K-means clustering, which is applied for feature coding, causes severe information loss because of the hard assignment of each feature to the nearest cluster center. Soft k-means [5] and sparse coding [6] were therefore introduced to overcome this problem.

Sparse coding-based methods are broadly used since they have fewer parameters and more reliable performance than soft k-means. Several sparse coding-related feature coding techniques [6–8] have been proposed and achieve the best performance for image representation. In sparse coding-based strategies, within the feature pooling module, the image is represented by a vector of sparse codes matching the features of the individual image.

In the BoW model, the whole image is the pooling area, and therefore spatial information may be lost. Such information can significantly affect recognition accuracy.

The spatial pyramid matching (SPM) scheme [9] was introduced to address this problem: the image is partitioned into increasingly finer regions so that spatial information is preserved.

Current approaches for object recognition mainly use machine learning methods. There are several ways to improve recognition accuracy, such as collecting larger datasets, using more powerful learning algorithms, and using better techniques to prevent overfitting. In recent years, significant steps have been taken to make recognition systems more effective.

Deep Neural Networks (DNNs) are among the best algorithms and have shown excellent results on benchmark datasets [10, 11]. One of the most widely applied of these networks is the Convolutional Neural Network (CNN) [12–14]. These networks are variations of the multilayer perceptron and are inspired by biological processes. Such models have a massive learning capacity with which thousands of object categories can be learned from millions of images.

Another set of methods used for object recognition classifies salient objects. In object recognition systems, the objects of interest are usually more conspicuous than the background. Recognizing objects in the human visual system is closely related to saliency detection. Biological systems that use this process are able to remove unneeded information and focus on the essential regions of an image. In this procedure, two factors determine the pertinent information: Top-Down (TD) and Bottom-Up (BU).

There are several ways to extract the highlighted area of an image, including [15–22]. In [15], a new classification scheme is presented that combines a CNN with a visual attention mechanism; Shariatmadar and Faez [16] proposed a model that combines bottom-up and top-down features to extract the prominent part of the image; He et al. [17] extended Itti’s model by using the structure tensor; Luo et al. [18] identified the salient object based on a backbone-enhanced network; Yang et al. [19] detected the salient part of the image by introducing double random walks; in [20], the BU and TD features of a single image are used to detect the salient region; Wang et al. [21] proposed a saliency detection model based on a multilevel deep pyramid (MLDP); and finally, in [22], diver target detection is performed based on a saliency detection method.

Image quality assessment [23], video coding [24, 25], image contrast estimation [26], and image watermarking [27] are other applications of salient area extraction.

Researchers have shown that saliency detection has inherent applications in object recognition [28–31]. In [28], images are categorized by directly classifying the obtained saliency maps. Shokoufandeh et al. [29] represented 3D objects by building a hierarchical graph structure based on the saliency map. Moosmann et al. [30] used salient features to boost classifiers for recognizing objects. In [32], a saliency network was used in the first layer of the HMAX [33] architecture. Frintrop et al. [34] trained the classifier on the conspicuous areas instead of the whole image to speed up classification.

In this paper, by mimicking the human visual system and using machine learning algorithms, we design a system for object recognition. After simulating the RGC responses to an RGB image (a new representation of the image), a spike map is obtained by selecting an appropriate threshold. The image pixels corresponding to spikes highlight the salient objects (BU saliency detection). The saliency submaps are then linearly combined using entropy coefficients. In the final stage, the features of the salient object are extracted and the classifier assigns it to one of the predefined classes.

Briefly, the central contributions of this research are as follows:
(i) Using the RGC responses to obtain a good representation of an image (inspired by the human visual system)
(ii) Retaining only those pixels that generate an action potential (inspired by the spiking neural networks of human beings)
(iii) Reducing the computational cost of object recognition by extracting the most salient object in the image

The rest of this paper is organized as follows. The description of our structure is given in Section 2. In Section 3, experiments are carried out to assess the proposed system. The discussion and conclusion are given in Sections 4 and 5, respectively.

2. System Overview

Figure 1 shows a general overview of the proposed method. At first, the raw image is converted to the CIE LAB color space, and each channel is preprocessed by simulating the RGCs of the human retina. The resulting images are then fed into a spike generator, in which the pixels whose values exceed a predefined threshold are selected. The new images formed from these pixels are saliency submaps; they are combined linearly using entropy coefficients to obtain the final saliency map, i.e., the Region of Interest (ROI). Finally, after extracting appropriate features from the ROI, the classification step is performed and a label is assigned to each initial image. All of these stages are described in the following sections (since the focus of this paper is on representing the image and extracting the ROI, the feature extraction and classification procedures are described only briefly).

2.1. Image Preprocessing

Image preprocessing is the front end of every recognition system. In this stage, a good and meaningful representation of the raw image is obtained for further processing in the next steps (in this paper, obtaining this new representation is equivalent to extracting the ROI). The phases of the proposed method for obtaining the new representation of the raw image are as follows.

2.1.1. Color Space Transformation

In this stage, the RGB image is encoded into the three color channels of the CIE LAB color space. CIE LAB is a three-dimensional color space that covers the entire spectrum of human color perception. It represents color by three components: L (lightness), a (green to red), and b (blue to yellow). Therefore, the four unique colors of human vision (yellow, blue, green, and red) are covered. In other words, the LAB color space models the opponent characteristics of color in the human visual system. Other color spaces (such as CMYK and RGB, which are device-dependent) are not designed to imitate human visual perception; rather, they model the output of physical devices.

Briefly, the aim of this paper is to offer a preprocessing method based on the human visual system. Accordingly, the LAB color space was selected for the subsequent image processing for three reasons: (1) it accounts for the perceptually uniform distribution of color in human vision, (2) it models the opponent properties of color in the human visual system, and (3) many digital cameras imitate the tristimulus model of the human color vision system.

For this transformation, two conversions are performed: from the RGB space to the XYZ space and from the XYZ space to the CIELAB space. Assuming the standard sRGB (D65) conversion, the formulas are

X = 0.4124 R + 0.3576 G + 0.1805 B,
Y = 0.2126 R + 0.7152 G + 0.0722 B,        (1)
Z = 0.0193 R + 0.1192 G + 0.9505 B,

and

L = 116 f(Y/Yn) − 16,
a = 500 [f(X/Xn) − f(Y/Yn)],               (2)
b = 200 [f(Y/Yn) − f(Z/Zn)],

where f(t) = t^(1/3) if t > (6/29)^3 and f(t) = t/(3 (6/29)^2) + 4/29 otherwise, and (Xn, Yn, Zn) is the reference white point.
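For illustration, the following MATLAB sketch implements this transformation under the assumption that the input RGB values are already linear and scaled to [0, 1] and that the D65 white point is used; the function name rgb2cielab is ours, and the sRGB gamma decoding is omitted for brevity.

function lab = rgb2cielab(rgb)                 % rgb: HxWx3 array in [0, 1]
    % RGB -> XYZ, equation (1) (sRGB primaries, D65)
    M = [0.4124 0.3576 0.1805;
         0.2126 0.7152 0.0722;
         0.0193 0.1192 0.9505];
    [h, w, ~] = size(rgb);
    xyz = reshape(reshape(rgb, [], 3) * M', h, w, 3);
    % XYZ -> CIELAB, equation (2), with the D65 white point
    wp = [0.9505 1.0000 1.0890];
    f  = @(t) (t >  (6/29)^3) .* t.^(1/3) + ...
              (t <= (6/29)^3) .* (t / (3*(6/29)^2) + 4/29);
    fx = f(xyz(:, :, 1) / wp(1));
    fy = f(xyz(:, :, 2) / wp(2));
    fz = f(xyz(:, :, 3) / wp(3));
    lab = cat(3, 116*fy - 16, 500*(fx - fy), 200*(fy - fz));
end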

2.1.2. Simulating RGCs’ Responses

In this stage, we model the functionality of the RGCs in the human retina.

In the human eye, the neurons near the retina’s inner surface are referred to as retinal ganglion cells. Visual information is passed to these cells via bipolar and retinal amacrine cells. RGCs differ remarkably in their size, connections, and responses to optical stimulation. There are at least five main categories of retinal ganglion cells, defined by their function. The class considered in this paper is the midget cells (the parvocellular, or P, pathway; P cells). Midget retinal ganglion cells project to the parvocellular layers of the lateral geniculate nucleus; they constitute approximately 80% of the total retinal ganglion cells and receive input from a small number of cones (parasol cells, by contrast, have a fast transfer rate and can react to low-contrast stimuli). Midget cells have uncomplicated center-surround receptive fields, where the center can be either OFF or ON while the surround is in the opposite mode (the receptive field of a single sensory neuron is the specific area of the sensory space in which a stimulus will activate the firing of that neuron). The three main steps in modelling these cells are as follows (for modelling the midget cells, we developed receptive fields for both peripheral and foveal cells, each with ON and OFF parts):

Step 1: initializing the basic parameters (the standard deviations of the central Gaussians) according to Table 1 [35].

Step 2: building Difference-of-Gaussian (DOG) filters to model the RGC receptive fields. In this stage, the center and surround Gaussians of the ON and OFF DOG filters for the simulated RGC cells are obtained. Building these filters involves (i) making the central Gaussian, (ii) converting it to a 2D filter, and (iii) normalizing it to sum to 1. It is emphasized that the goal of this article is not to use the concepts of spikes and excitatory neurons in different areas of the human brain; the only biological mechanism used in this paper is the simulated response of retinal ganglion cells, without spike encoding. Figure 2 shows the center and surround receptive fields for foveal midget cells.

Step 3: applying the RGC models to the image matrix for the foveal pathway.

In this step, the image is convolved with the RGC filters. The result of this step is six responses: the foveal and peripheral OFF midget responses for each of the three channels L, A, and B. Figure 3 shows the responses of the foveal/peripheral OFF midget cells for the three channels.
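A minimal MATLAB sketch of this convolution step is given below. It builds a single ON-center DOG filter from two normalized Gaussians and applies it to one LAB channel; the function name rgc_response is illustrative, and in the actual model the standard deviations sigma_c < sigma_s are taken from Table 1.

function response = rgc_response(channel, sigma_c, sigma_s)
    % Normalized 2D Gaussian of a given standard deviation (sums to 1)
    g = @(sigma) fspecial('gaussian', 2*ceil(3*sigma) + 1, sigma);
    center   = g(sigma_c);
    surround = g(sigma_s);
    % Pad the smaller (center) kernel so both kernels share the same support
    pad = (size(surround, 1) - size(center, 1)) / 2;
    dog = padarray(center, [pad pad]) - surround;   % ON-center, OFF-surround
    % Convolve the LAB channel with the receptive-field model
    response = imfilter(double(channel), dog, 'replicate', 'conv');
end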

Finally, six maps of the raw image are obtained, which are fed to the next step. We believe that, by computing the RGC responses, a better representation of the raw image is achieved.

2.1.3. Binary_Map Generation

In this stage, the front end of the human visual system is imitated. This part, which consists of several early layers of neurons (from the retinal photoreceptors to the primary visual cortex), is shown in Figure 4.

There are two classes of photoreceptors, referred to as rods and cones. Rods are highly sensitive at low levels of illumination, whereas cones require high light intensities. These cells, which are sensitive to a specific interval of the electromagnetic spectrum, convert visual information into neural signals. The outputs of this biological pathway are action potentials.

In physiology, an action potential is a short-lasting event in which the electrical membrane potential of a cell rapidly rises and falls. Action potentials occur in neurons, which are excitable cells, and cell-to-cell communication among neurons is carried out with action potentials. When neurons fire, they emit action potentials, or spikes.

If the image pixels are considered as photoreceptors and the action potential is represented by a binary string, we can emulate the above biological pathway with a linear-nonlinear cascade: the linear function is the weighted combination of bipolar cell inputs, and the nonlinear function is a rectifier (a comparison with a threshold).

This stage aims to generate the binary strings, which are obtained by selecting an appropriate threshold. To this end, each pixel of the RGC responses is compared with the threshold as follows:

Binary_Map_c(x, y) = 1 if Response_c(x, y) > T_c, and Binary_Map_c(x, y) = 0 otherwise,

in which T_c is the threshold of the corresponding response map and c ∈ {L, A, B} indexes the three channels of the LAB space.

After various experiments, we found that the mean value of each image is the best threshold value for that image; in other words, the choice of the image mean as the threshold was determined experimentally.

A threshold is suitable when it can detect prominent objects across many images. Since the objects in many images are distinguished from the background by the average value of the image pixels, this threshold value seems appropriate (see the experimental results on salient object detection in Section 3.3.1).

If the threshold value is not selected correctly, the preprocessing step will not work successfully, and consequently the object class will not be assigned correctly in the recognition phase. In other words, the classification accuracy depends on correct detection of the object in the preprocessing stage, and this detection in turn depends on the threshold selection. Therefore, with an accurate choice of this value, the object in an image can be detected with high accuracy, which supports assigning the correct class in the classification stage.
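As a minimal sketch of this step (in MATLAB, matching the paper's implementation environment), the thresholding of one RGC response map can be written as follows; the variable names are illustrative.

threshold  = mean(response(:));        % mean value of the response map
binary_map = response > threshold;     % "spiking" pixels: values above the mean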

2.1.4. New Representation of the Image

So far, we have six binary maps of the RGCs, which correspond to the action potentials of the ganglion cells. Using these maps, the pixels of the raw image in each of the L, A, and B channels form a new image. Thus, there are two new images for each LAB channel, one for the foveal response and one for the peripheral response of the OFF pathway of the ganglion cells:

New_Image_c^p(x, y) = I_c(x, y) · Binary_Map_c^p(x, y),

in which c ∈ {L, A, B} and p ∈ {foveal, peripheral}. At the end of this stage, we have six new images whose pixels correspond to those neurons that emit an action potential.
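A one-line MATLAB sketch of this masking operation is shown below; channel denotes the raw-image plane (L, A, or B) that corresponds to the binary map of one pathway.

new_image = double(channel) .* double(binary_map);   % keep only the "spiking" pixels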

2.1.5. ROI Extraction

This stage is the last step of the preprocessing and proceeds as follows:
(i) Computing the entropy coefficient of each New_Image obtained in the previous step
(ii) Linearly combining the New_Images with the entropy coefficients
(iii) Cropping the final result from the image as the ROI

The proposed fusion rule for the obtained New_Images is

Final_Image = Σ_{i=1..3} α_i NF_i + Σ_{j=1..3} β_j NP_j.   (9)

In this equation, the final image is formed by combining the foveal new images NF_i (i = 1, 2, 3) with the peripheral new images NP_j (j = 1, 2, 3), where i and j correspond to the L, A, and B channels. The α and β coefficients (the entropy coefficients) are determined by the entropy measure. The texture of an image is specified by a statistical measure of randomness called entropy. For each image, this criterion is determined as

E = −Σ_{i=1..m} p_i log2(p_i).

In the above equation, m is the maximum image value and p_i is the probability of value i. The coefficients are defined as

α_k = E_k / Σ_{n=1..6} E_n,   (10)

where k represents the index of the new images obtained from the two pathways of the L, A, and B channels, E_k is the entropy of the kth New_Image, and α_k (equivalently β_k) is its weight.

At the end of this stage, Final_Image is cropped and then fed to the feature extraction stage.
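The following MATLAB sketch summarizes the fusion and cropping under our reading of equations (9) and (10), i.e., with weights proportional to the entropy of each masked image; the variable names and the choice of the largest connected component for cropping are illustrative assumptions.

% new_images: 1x6 cell array holding the six masked images (foveal and
% peripheral pathways of the L, A, and B channels)
e = cellfun(@(x) entropy(mat2gray(x)), new_images);   % entropy of each image
w = e / sum(e);                                       % entropy coefficients, equation (10)
final = zeros(size(new_images{1}));
for k = 1:numel(new_images)
    final = final + w(k) * new_images{k};             % linear fusion, equation (9)
end
% Crop the ROI as the bounding box of the largest salient region
stats    = regionprops(final > 0, 'Area', 'BoundingBox');
[~, idx] = max([stats.Area]);
roi      = imcrop(final, stats(idx).BoundingBox);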

The steps of image preprocessing are summarized in Algorithm 1.

Input: raw image (I)
Output: saliency region (S)
Step 1: RGB to CIE LAB color space conversion
  (1) The RGB space is transformed to the XYZ space according to formula (1)
  (2) The XYZ space is transformed to the LAB space according to formula (2)
Step 2: simulation of the RGC responses
  (1) The L, A, and B channels are convolved with the difference-of-Gaussian filters modelling the midget cells of the human retina (the standard deviations of the Gaussians are set according to Table 1)
Step 3: binary map generation
  (1) By selecting an appropriate threshold, the responses are converted to binary images (the threshold is the mean value of each gray image)
Step 4: new image representation
  (1) The mask operation is performed using the binary maps from the previous stage
Step 5: ROI extraction
  (1) All images are combined using the entropy coefficients according to formula (9); the coefficients are calculated according to formula (10)
2.2. Feature Extraction and Classification

Feature extraction is one of the most crucial stages of visual recognition. In this stage, discriminative features are extracted for better classification in the next step. In this paper, we used the Log-Gabor function [36] to obtain localized frequency information; Log-Gabor filters approximate the receptive fields of simple cells located in V1 of the human brain. To exploit the diversity of shape characteristics of an image, a bank of Log-Gabor filters was used (at three scales and eight orientations). We then used Principal Component Analysis (PCA) for dimensionality reduction; the resulting vectors are 128-dimensional. After extracting the PCA vectors of the different images, a Support Vector Machine (SVM) is trained and used as the classifier. In this paper, a simple, linear, multiclass SVM is applied: binary classifiers are created, each distinguishing one label from all the others (one-versus-all), and when a new instance arrives for classification, a winner-takes-all procedure is applied.

We used an SVM with a linear kernel because its structure can be implemented efficiently in the cortex [37]. In addition, the user needs to select only a few parameters and does not have to tune kernel-specific parameters.

Experiments with nonlinear kernels were also performed; these kernels did not significantly improve the proposed model, so a linear kernel appears to be the right choice. In this paper, the linear LibSVM classifier [38] is used.
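To make this stage concrete, the following MATLAB sketch builds a single-scale log-Gabor filter in the frequency domain and outlines the PCA and linear-SVM steps; the parameter values (f0, sigmaOnf) and the helper name loggabor_features are illustrative and are not the paper's exact settings.

function feats = loggabor_features(roi)
    [h, w]  = size(roi);
    [u, v]  = meshgrid((-w/2:w/2-1)/w, (-h/2:h/2-1)/h);   % normalized frequencies
    radius  = sqrt(u.^2 + v.^2);
    radius(radius == 0) = 1;                              % avoid log(0) at the DC component
    f0 = 1/3;  sigmaOnf = 0.65;                           % one scale, radial term only
    logGabor = exp(-(log(radius/f0)).^2 / (2*log(sigmaOnf)^2));
    resp  = ifft2(ifftshift(logGabor) .* fft2(double(roi)));  % filter the ROI
    feats = abs(resp(:))';                                % magnitude responses as features
end

% Usage (assuming equally sized ROIs stacked row-wise in X and LIBSVM on the path):
%   [~, score] = pca(X);  X128 = score(:, 1:128);   % reduce to 128 dimensions
%   model = svmtrain(labels, X128, '-t 0');         % LIBSVM with a linear kernel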

3. Experimental Results

In this section, our proposed method is compared with other state-of-the-art algorithms in both saliency detection and object recognition. In other words, the output of the preprocessing stage of the described scheme is taken as the salient object, and the extracted salient object is then classified into one of the predefined classes (Table 2).

3.1. Experimental Setup

(1) Datasets. In the salient object detection stage (the output of the preprocessing unit), the proposed algorithm is evaluated on two public datasets, MSRA-B [39] and ECSSD [40]; in the classification phase, the Caltech-101 [41] database is used. MSRA-B contains many natural images (so the comparison is made on a large scale), and ECSSD contains structurally complex images (Table 2). The ground truth in these two datasets has been segmented manually.
(2) Implementation details. In our experiments, the scheme is implemented in MATLAB 2013b on a Dell Vostro 3300 with an Intel i5 M520 2.4 GHz CPU and 8 GB RAM.

3.2. Evaluation Metrics

In our experiments, a widely used metric called the F-measure is adopted. This measure comprehensively assesses the performance of the schemes and is determined as

F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall),

in which β² is fixed to 0.3, as proposed in [42], to emphasize precision. The precision and recall quantities are defined as

Precision = |B ∩ G| / |B|,   Recall = |B ∩ G| / |G|,

where G is the ground truth and B is the binary map of the salient object.
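For completeness, a short MATLAB sketch of these metrics, computed from a binary saliency map B and the binary ground truth G, is given below.

tp        = nnz(B & G);                       % correctly detected salient pixels
precision = tp / max(nnz(B), 1);
recall    = tp / max(nnz(G), 1);
beta2     = 0.3;                              % weight that emphasizes precision
fmeasure  = (1 + beta2) * precision * recall / max(beta2 * precision + recall, eps);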

3.3. Performance Evaluation
3.3.1. Salient Object Detection

In this section, the proposed method is compared with various saliency detection schemes, which include GC [43], MC [44], HC [45], RC [45], ST [46], SF [47], HS [40], RBD [48], HDCT [49], DRFI [50], GMR [51], DSR [52], and MFS [53].

In Figure 5, the maximum F-measure under an adjusted threshold is used to compare the various methods on the two benchmark datasets. According to this figure, our proposed method performs on par with the most reliable schemes on both databases. Notably, for the ECSSD and MSRA datasets, our method's F-measure is only 3.39% and 3.79% lower than that of the best model [50], respectively; compared with all the other schemes, our approach is only slightly worse than the best systems on these datasets.

3.3.2. Object Recognition

As mentioned in Section 3.1, we used the Caltech-101 database [41] to evaluate the proposed method's performance. This database includes 101 object classes, each of which comprises between 40 and 800 images. After trying different numbers of training images, we found experimentally that 15 training images per class work best. We used 15 images for training and 50 other images for testing in each class (images were selected randomly). All the algorithms compared with the proposed method in the next section selected their training and testing images in the same way. If fewer than 65 images were available for a category, we trained on 15 random images per class and then tested on all unused images. To report the results on this database, we estimate the average accuracy for every category. Table 3 shows our per-class results; it exhibits the ten best-classified categories of the Caltech-101 database.

Typical images from these classes are shown in Figure 6.

From the images in Figure 6, the following observations can be made:
(i) The images of these classes have a simple background, so the proposed ROI extraction can obtain a good representation of the object.
(ii) In these classes, shape is an excellent discriminative feature, so our saliency detection scheme works quite well; on average, our results are therefore quite good.
(iii) In categories such as wild cat, hedgehog, and butterfly, our algorithm does not achieve acceptable accuracy. The reasons are the cluttered background and the lack of appropriate features. For images of natural objects, additional features such as color and texture should be considered; it seems that, by combining different features efficiently, the objects of these categories could be recognized appropriately.

Generally, objects can be divided into two categories, natural and human-made. Natural objects usually have a cluttered background and varied colors, so our proposed method is weak at recognizing them (of course, if the background is simple, our algorithm performs relatively well).

In the following, a comparison between our method and other algorithms is given in Table 4. This comparison is carried out on eight categories of the Caltech-101 dataset. It should be noted that, for a fair comparison, the training and testing images must be the same for all methods. Therefore, the results listed in the original papers are not used (the different algorithms were reimplemented by the authors), and the accuracy values are reported using the same training and testing images for all algorithms.

The scheme developed by Berg [54] is one of the most reliable non-biologically inspired methods; in that work, shapes are represented by sampling several pixel positions obtained from the output of an edge detection process. Lazebnik et al. [9] adopted spatial pyramid matching kernels over visual words built from Scale Invariant Feature Transform (SIFT) [55] descriptors. HMAX [33], a famous architecture for extracting biologically inspired visual features, achieves 51.2% accuracy when used with an SVM classifier (in that scheme, 15 images are used for training); this paradigm, based on the visual receptive fields found in monkey and cat visual cortex, is usually too slow for real-time purposes. The researchers in [56] suggest a neuromorphic visual object recognition structure motivated by neuroscience principles of recognition and visual attention in the human brain. Lu et al. [57] enhanced the HMAX model by modifying the patch selection in its S2 layer; for this purpose, they used the concept of the salient region of an image. Finally, Norizadeh et al. [58] developed an enhanced HMAX model based on SIFT features; these features are used to select the regions with the most information.

Machine learning schemes build a model from example data, known as "training data", in order to make decisions. According to this definition, the methods described above all use machine learning algorithms. As Table 4 shows, different methods achieve the best accuracy in different classes. Some specifications of these methods are summarized in Table 5.

As reported in Table 4, our proposed method appears to work better than the other studies and is only 2% below the Khosla model on the faces category. It can be said that the new representation of images obtained from the RGC responses yields an accurate ROI by highlighting the salient objects and removing the redundant information.

The MATLAB code of some methods, such as [33], is publicly available and was used in our comparisons; other schemes, such as [55–57], were reimplemented by the authors.

It should be noted that, in this paper, our focus is on the preprocessing stage. The goal is to show that if image preprocessing is done with high accuracy, the classification precision will ultimately increase. In other words, the goal is to find an efficient shallow model that can achieve high classification accuracy through precise preprocessing. Moreover, the amount of training data in this setting is much smaller than the data available for training deep networks; for example, Caltech-101 is much smaller than ImageNet. We therefore used the Caltech-101 dataset to compare shallow models, and for this reason the findings were not compared with CNN-based networks.

4. Discussion

Since the proposed method uses the RGC responses of the human retina for image representation, it achieved the best performance among the compared methods on the Caltech-101 dataset. In fact, if we have an appropriate representation of an image in the initial stages of the recognition system, we send salient information to the classifier in the last step. In the human retina, the essential information is separated from less critical data before being transferred to the visual cortex. Using this idea, we obtain a new representation of the raw image from the RGC responses.

Some of the capabilities of our proposed approach include the following:
(i) If the image has more than one object, our proposed method can discover and eliminate unrelated data from the image (if the background is not too cluttered)
(ii) For human-made objects and images with a simple background, our algorithm can extract the most salient information
(iii) Simplicity and robustness to illumination changes and noise are other advantages of our scheme

Some of the limitations of our proposed approach include the following:
(i) Only shape features are used in our algorithm. Feature integration could yield better recognition accuracy; for example, color features could be used in the preprocessing step (different color channels (red-OFF, red-ON, blue-OFF, and so on) according to visual perception in the human retina) or in the feature extraction stage (for example, color naming).
(ii) The proposed method does not work particularly well on images with crowded backgrounds (see the salient object detection results for the ECSSD database in Section 3.3.1). A rough idea for solving this problem is to combine different channels of various color spaces to exploit the unique color information of the objects; the viability of this idea depends on further experiments.
(iii) Another limitation of our method is that the occlusion challenge is not considered: the proposed method works well when objects are not occluded. The suggested algorithm may still work on some partially occluded images, but if there is heavy occlusion in the image, our scheme will not be usable.
(iv) MATLAB software is used to implement our method. The proposed algorithm's speed could be improved by a C++-based implementation or by employing parallelization techniques.

5. Conclusion

In this paper, a visual recognition scheme for natural scenes is presented. We focus on the preprocessing stage and show that the more similar the image representation is to the processed image of the human retina, the higher the final classification accuracy. At first, each raw image is represented by six activation maps, which are the simulated responses of the retinal ganglion cells of the human retina (two responses of midget cells in the three channels of the LAB color space). After modelling the RGC responses, an acceptable threshold is selected and a Binary_Map is created by comparing the RGC responses with that threshold. This binary mask yields a new representation of the raw image. Finally, by computing the entropy coefficients of the new images and combining them, the final image is obtained. This image highlights the most salient information and removes the redundant background.

The results of the various experiments presented in Section 3 illustrate the suitability of our proposed method for recognizing objects. Some future directions for obtaining better recognition accuracy are as follows:
(i) Using different color channels inspired by color perception in the human retina
(ii) Integrating shape features with texture and color
(iii) Implementing coarse and fine classification in the last step of the visual recognition system; this type of classification is speculated to arise in the inferior temporal (IT) cortex of the human brain [59]
(iv) Using the PASCAL-S and Judd datasets to consider images with complex scenes
(v) Investigating the proposed preprocessing approach for deep-structured networks (testing the proposed method on the ImageNet database) and achieving reasonable results for fine-grained classification

Data Availability

The data used to support the findings of this study are included within the supplementary information file and are publicly available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors acknowledge the Amirkabir University of Technology (Tehran Polytechnic) for financially supporting this work and providing the processing equipment.

Supplementary Materials

The three public datasets used to support this study are available at https://mmcheng.net/msra10k/, http://www.cse.cuhk.edu.hk/~leojia/projects/hsaliency/dataset.html, and http://www.vision.caltech.edu/Image_Datasets/Caltech101/. (Supplementary Materials)