Abstract

Event cameras, which transmit per-pixel intensity changes, have emerged as a promising candidate in applications such as consumer electronics, industrial automation, and autonomous vehicles, owing to their efficiency and robustness. To maintain these inherent advantages, the trade-off between efficiency and accuracy stands as a priority in event-based algorithms. Thanks to the preponderance of deep learning techniques and the compatibility between bio-inspired spiking neural networks and event-based sensors, data-driven approaches have become a research hot spot; together with dedicated hardware and datasets, they constitute an emerging field named event-based data-driven technology. Focusing on data-driven technology in event-based vision, this paper first explicates the operating principle, advantages, and intrinsic nature of event cameras, as well as background knowledge in event-based vision, presenting an overview of this research field. Then, we explain why event-based data-driven technology has become a research focus, covering both the reasons for the rise of event-based vision and the superiority of data-driven approaches over other event-based algorithms. The current status and future trends of event-based data-driven technology are presented successively in terms of hardware, datasets, and algorithms, providing guidance for future research. Overall, this paper reveals the great prospects of event-based data-driven technology and presents a comprehensive overview of this field, aiming toward a more efficient and bio-inspired visual system that extracts visual features from the external environment.

1. Introduction

In event-based visual systems comprising perception and processing, algorithms in the processing part are designed to maintain the intrinsic advantages of event cameras, among which data-driven approaches are the most prevalent and promising. Since the development of algorithms is inseparable from the relevant hardware and datasets, data-driven approaches along with the required hardware and datasets collectively constitute an emerging research field named data-driven technology. Focusing on data-driven technology in event-based vision, this paper successively presents the background knowledge, the reasons why data-driven technology has become a research focus, the current status of data-driven technology, and future trends, as illustrated in Figure 1. Both the current development and the future trends of data-driven technology in event-based vision are discussed in terms of algorithms, hardware, and datasets.

For numerous smart devices, a visual system that can perceive the external environment and extract visual features of interest from it acts as a crucial prerequisite for performing specific tasks. New requirements such as low power consumption, stability in challenging environments, and real-time response have emerged as well. Although frame-based visual sensors have dominated the market for more than a century, recording instant and rich information about the whole viewed scene, their operating principle leads to natural failures in the case of fast motion or difficult lighting conditions. Moreover, the transmission of massive redundant information also increases power consumption and latency, limiting their application in scenarios that demand immediate reaction or lack computation power. Inspired by traditional frame-based vision and biological mechanisms, a new type of visual sensor named the event camera [1, 2] is on the rise, aiming at applications where traditional cameras fail. Equipped with merits such as high dynamic range, high temporal resolution, low latency, and low power consumption, event cameras have a bright future in various practical applications owing to their efficiency and robustness.

Outline: the comprehensive knowledge embodied in this paper is illustrated in Figure 2, and the rest of the paper is organized as follows. Section 2 gives a brief introduction to the relevant background and introduces the concept of data-driven technology. In Section 3, two levels of reasons for the predominance of data-driven technology are elaborated according to practical application requirements and existing technologies. Section 4 reviews the current status of data-driven technology. Concretely, data-driven algorithms are introduced by category, as well as the relevant hardware and datasets. Section 5 proposes several promising orientations spanning algorithms, datasets, and hardware, providing guidance for future research. This paper ends with a conclusion in Section 6.

2. Background Knowledge and Data-Driven Technology

Compared with traditional cameras that capture whole frames at a fixed rate, event cameras generate event data in a fundamentally different form. This novel working principle naturally leads to unique advantages and inherent properties of event data. Underlying the superiority of event cameras is a subfield named event-based vision [3], which drives a paradigm shift in the acquisition of visual features and comprises multiple mutually promoting modules such as startups, applications, algorithms, hardware, and datasets. Event-based visual systems are designed to extract visual features of interest from the external environment. According to practical application requirements, maintaining the robustness and efficiency advantages of event cameras is of utmost urgency in event-based visual systems. Therefore, event-based algorithms should balance the trade-off between accuracy and efficiency, among which data-driven approaches stand out on the basis of existing theories and technologies. The relationship among the various parts of the relevant background is shown in Figure 3.

2.1. Mechanisms of Event Cameras

A growing understanding of biological vision is inspiring multiple efficient means of perceiving the environment. Inside an eyeball, the network of various linked cells provides a spatio-temporal filtering mechanism that emphasizes edges and temporal changes in the viewed scene and discards redundant information. Event-based visual systems are designed by mimicking the biological retina and the subsequent processing in the brain. Event cameras are bio-inspired visual sensors that respond to scene dynamics and record log brightness changes rather than full images as frame-based visual sensors do. With pixel-wise logic, the response is defined by the viewed scene, quite like the human eye. Event cameras output a sequence of events, each represented with four components: two coordinates for the location, a timestamp, and a polarity for the change (i.e., brightness increase +1 or decrease −1). An event $e_k = (x_k, y_k, t_k, p_k)$ is triggered at pixel $(x_k, y_k)$ and time $t_k$ as soon as the log intensity change at that pixel reaches the contrast threshold, denoted as

$$p_k\left[\log I(x_k, y_k, t_k) - \log I(x_k, y_k, t_k - \Delta t_k)\right] \geq C,$$

where $C$ is the predefined threshold, $\Delta t_k$ is the time interval since the last event at that pixel, and $I$ is the intensity. Events are generated at varying data rates depending on the magnitude of brightness changes in the scene. Since illumination in the scene is usually constant, events are mainly triggered by object movements. Specifically, event cameras offer microsecond temporal resolution and submillisecond transmission latency, enabling fast reaction times and outstanding efficiency.
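
As a concrete illustration of the generation model above, the following Python sketch (a toy approximation under simplifying assumptions, not any vendor's pixel circuit; the threshold value and frame-based sampling are illustrative) emits an event at a pixel whenever the log intensity drifts from the value memorized at that pixel's last event by more than the contrast threshold C.

```python
import numpy as np

def generate_events(log_frames, timestamps, C=0.2):
    """Toy per-pixel event generator: emits (x, y, t, p) whenever the log
    intensity at a pixel differs from the reference stored at that pixel's
    last event by at least the contrast threshold C."""
    ref = log_frames[0].copy()              # per-pixel reference log intensity
    events = []
    for log_img, t in zip(log_frames[1:], timestamps[1:]):
        diff = log_img - ref
        ys, xs = np.nonzero(np.abs(diff) >= C)
        for x, y in zip(xs, ys):
            p = 1 if diff[y, x] > 0 else -1
            events.append((x, y, t, p))
            ref[y, x] = log_img[y, x]       # reset reference at fired pixels
    return events

# Two log-intensity "snapshots" 1 ms apart: every pixel brightens, so each
# pixel fires one positive event.
frames = [np.zeros((2, 2)), np.full((2, 2), 0.3)]
print(generate_events(frames, [0.000, 0.001]))
```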

The novel working principle that each pixel reports log-scale intensity changes independently and asynchronously brings vision to the very edge and focuses on scene dynamics, thus dramatically mitigating latency and data redundancy. In summary, event cameras deliver numerous advantages over frame-based visual sensors, including:

(1) High temporal resolution, which enables event cameras to capture fast motion without suffering from motion blur.
(2) Low latency, which benefits from a working principle that exempts pixels from a global exposure time.
(3) High dynamic range and low-light sensitivity, suitable for a wider range of lighting conditions, including challenging lighting environments.
(4) Low power consumption and high data efficiency, since recording only pixel-level brightness changes filters out redundant data.

In a nutshell, their suitability for a wide range of operating requirements demonstrates great potential in both machine vision applications and research.

Event cameras output a stream of events represented as tuples of location, timestamp, and polarity of the intensity change. Owing to the unique working principle, event data are endowed with spatial and temporal sparsity [4], corresponding to their high efficiency. On the other hand, information extraction from event data plays a crucial role for further analysis in event-based vision. Considering the two points mentioned above, an ideal event-based algorithm is supposed to exploit the spatio-temporal sparsity of event data and, at the same time, extract sufficient information from it. In other words, the balance between high accuracy and high efficiency remains a core challenge.

2.2. A Glance over Event-Based Vision

With the emergence and advancement of event cameras, a new field named event-based vision is on the rise, spanning algorithms, hardware, datasets, manufacturers, and applications. During the development of event-based algorithms, existing theories and techniques in biological visual systems and frame-based vision provide considerable support, as illustrated in Figure 4. With efficiency and accuracy as the main concerns, several rules can be summed up for event-based vision. First, event-based vision targets practical application requirements and takes lessons from existing theories and techniques in biological systems and frame-based computer vision. Moreover, since event-based vision consists of two collaborating parts, namely, perception and processing, comprehensive information should be extracted from event data and passed down to the processing part with the spatio-temporal sparsity well maintained. Last but not least, since the improvement of algorithms is inseparable from the related hardware and datasets, all three parts should be analyzed collectively in event-based vision.

Event-based vision is still at an early stage compared with frame-based vision. Lessons from frame-based vision include efficient algorithms, dedicated hardware, and large datasets, which also guide the development of event-based vision, as illustrated in Figure 5. Diverse algorithms have been developed to unlock event cameras' potential in tasks ranging from low-level to high-level vision. By adjusting parameters, efficient algorithms extract task-oriented information from event data for optimal accuracy. From model-based methods to data-driven approaches, improvements in event-based algorithms lie primarily in the trade-off between high accuracy and efficiency. Classic architectures include HOTS [5], HATS [13], SNN [14], EST [6], and EV-FlowNet [7].

In terms of hardware, development is divided into the perception and processing parts. As for the perception part, event cameras were first commercialized in 2008 by T. Delbruck under the name of Dynamic Vision Sensor (DVS), followed by several developments such as the Asynchronous Time-Based Image Sensor (ATIS) [1], the Dynamic and Active Pixel Vision Sensor (DAVIS) [2], and the color-DAVIS346 [15]. With event cameras acting as a "silicon retina," a corresponding "silicon visual cortex" is required for further processing. Pairing a neuromorphic processor with an event camera plays a vital role in event-based end-to-end systems, facilitating the development of dedicated hardware for spiking neural networks. As for the processing part, several mature event-based computing platforms with different neuron model types have been developed, such as SCAMP, SpiNNaker [8], TrueNorth [9], and Loihi [10], with some of them supporting on-chip learning. Furthermore, recent research also aims to design vision systems holistically, that is, to co-optimize hardware with algorithms as a whole vision pipeline [16, 17].

Manufacturers of event cameras mainly include Paris-based Prophesee, Zurich-based iniVation, Shanghai-based CelePixel, and Zurich-based Insightness (recently acquired by Sony), cultivating an ecosystem with real products. Recently, with Samsung and Sony bringing their image sensor process technologies to market in collaboration with Swiss-based iniVation and Paris-based Prophesee, respectively, the whole event-based industry has been gaining momentum. Applying a data-efficient, low-latency approach to sensing and acquisition by mimicking the human retina, event cameras find utility on resource-constrained platforms, in highly reactive systems, and under limited illumination conditions. The emergence of event cameras also promotes machine vision in applications such as automotive, robotics, AR/VR, space, inspection, surveillance, and star tracking. Equipped with outstanding properties, event cameras have unlocked scenarios previously inaccessible, leaving considerable room for improvement in various aspects.

Sufficient labeled data are of vital importance for performance evaluation. By bringing down expenses and providing reliable benchmarks, datasets as well as data simulators are elementary tools for further improvement in event-based vision. Sorted by task, they are generally divided into datasets for regression tasks [18–21] and those for classification tasks [22, 23]. Simulators [24, 25] and emulators [26] imitate the sampling mechanism of event-based sensors, generating data in a low-budget manner, which meets the demand for cheap, high-quality labeled events in algorithm prototyping and benchmarking. Broadly, efficient algorithms, combined with proper hardware and sufficient data, promise to accelerate the progress of the event-based research community, promoting its application in various fields.

2.3. Data-Driven Technology in Event-Based Vision

With accuracy and efficiency as the main targets, algorithms in event-based vision are mainly classified into model-based methods and data-driven approaches, and the latter category is gradually becoming mainstream on the basis of existing theories and technologies. Since the improvement of algorithms is closely related to the development of hardware and datasets, data-driven approaches along with the required hardware and datasets collectively constitute an emerging research field named data-driven technology. A comprehensive analysis of data-driven technology unfolds in the rest of this paper. The reasons why data-driven technology has grown popular are illustrated first, considering the exact needs of practical applications and existing technical conditions; then comes the current status of data-driven technology, in which data-driven approaches as well as the development of relevant hardware and datasets are explained in detail. Finally, future trends of data-driven technology in event-based vision are pointed out with respect to algorithms, hardware, and datasets. The holistic structure of data-driven technology is illustrated in Figure 6.

3. Reasons for the Prevalence of Data-Driven Technology

For any technology, one criterion for assessing its prospects is whether there exist corresponding requirements in practical applications and a relevant technical foundation to support its development. In terms of data-driven technology in event-based vision, the reasons for its prosperity can be divided into two levels according to this criterion, as illustrated in Figure 7. Concretely, factors contributing to the rise of event-based vision are presented first, followed by the superiority of data-driven approaches over other kinds of event-based algorithms.

Firstly, as for event-based vision, new areas in commercial markets and industrial fields [27, 28], such as robotics and consumer electronics, give rise to an urgent demand for efficiency and robustness [29] in visual systems, which event cameras can readily offer. Despite the lack of experience in event-based vision, mature techniques in frame-based vision and biological visual systems provide valuable guidance for this emerging field and promote its rapid progress. From another perspective, though frame-based sensors have been adopted by the preponderance of applications, their operating principle of recording whole frames at a fixed rate leads to natural failures in fast-motion scenarios and difficult lighting conditions, which weakens their robustness. Transmission of redundant information also adds to power consumption and latency, resulting in incompetence in highly reactive systems and on resource-constrained platforms and weakening their efficiency. Considering the above factors, there are application scenarios that existing techniques cannot satisfy, where event cameras are a natural fit. As a result, event-based vision is on a slow but steady rise and demonstrates impressive advantages in machine vision applications and consumer electronics.

Secondly, as event-based visual systems comprise perception and processing parts, algorithms along with hardware and datasets play a vital role in the latter part, aiming at inheriting the intrinsic advantages of event cameras. Therefore, the trade-off between accuracy and efficiency remains a core challenge for event-based algorithms. An ideal algorithm is supposed to extract sufficient information from events, namely, exploit the fine temporal information of individual events, to ensure accuracy and, at the same time, exploit their sparse and asynchronous nature for low latency and computation, as illustrated in Figure 8.

In terms of model-based methods, common types and their corresponding deficiencies are listed below. A common characteristic of model-based methods is that the algorithms are artificial designs based on researchers' knowledge rather than preset models with parameters learned from data. Depending on their relationship with traditional frame-based vision, model-based methods can be further subdivided into two categories: methods that reutilize traditional algorithms [30–33] and methods separate from traditional algorithms [34, 35]. Orthogonally, depending on the degree of information extraction, we can distinguish between methods with information compression [36, 37] and methods that exploit the fine temporal information of each event [31, 33]. A more widely known criterion is to categorize model-based methods by the manner of information aggregation. One class of studies [5] uses filtering-based models updated asynchronously with each incoming event. Since an event alone contains little information, each incoming event is usually coupled with extra information from past events for further estimation. In these methods, feature descriptors and measurement update functions [4] need to be handcrafted and task-oriented, which requires expert knowledge and slows down their widespread adoption in high-level vision tasks such as recognition and segmentation. Despite minimal latency, these methods perform redundant computation owing to frequent system state updates, with accuracy sensitive to algorithm parameters. Other works process groups of events simultaneously for a high signal-to-noise ratio, integrating events with a fixed number [38] or in a fixed time window [36, 37]. These methods achieve remarkable performance at the cost of losing the inherent low-latency property of event data.
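
As a minimal illustration of the asynchronous, per-event update style used by the filtering-based methods above (in the spirit of HOTS-style time surfaces [5], though the decay constant and layout here are illustrative assumptions, not the published formulation), the sketch below refreshes a single pixel per incoming event and applies an exponential temporal decay only when the state is read.

```python
import numpy as np

class TimeSurface:
    """Asynchronous, per-event state update: each incoming event refreshes
    one pixel, and reading the surface applies an exponential temporal decay
    so recently active pixels respond most strongly."""
    def __init__(self, height, width, tau=50e-3):
        self.t_last = np.full((2, height, width), -np.inf)  # one plane per polarity
        self.tau = tau

    def update(self, x, y, t, p):
        self.t_last[0 if p > 0 else 1, y, x] = t  # O(1) work per event

    def read(self, t_now):
        return np.exp((self.t_last - t_now) / self.tau)

ts = TimeSurface(height=4, width=4)
ts.update(x=1, y=2, t=0.010, p=+1)
print(ts.read(t_now=0.020)[0, 2, 1])  # decayed response of that pixel
```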

Besides, data-driven approaches possess great potential on the basis of existing theories and technologies, namely, deep learning techniques and biological mechanisms. Recently, a growing number of event-based studies have adopted data-driven approaches [7, 39–41] rather than model-based methods, spanning diverse vision tasks. Inspired by the great success of deep learning methods in traditional frame-based vision, some event-based algorithms aim to reutilize standard learning architectures adopted in frame-based vision after converting groups of event data into frame-like representations. From another perspective, apart from the perception part, biological mechanisms also drive the design of several postprocessing algorithms, such as spiking neural networks (SNNs). To some degree, the combination of bio-inspired perception and processing provides an efficient and long-term solution in event-based vision, well suited to scenarios where standard cameras fail.

In conclusion, the emergence and advancement of event-based vision meets the need for efficiency and robustness in practical applications where traditional frame-based vision falls short. Among event-based algorithms, two main reasons contribute to the prevalence of data-driven approaches: the limitation of model-based methods in complex, high-level vision tasks and the prevalence of deep learning techniques in frame-based vision. With data-driven approaches, dedicated hardware, and large datasets as its core components, data-driven technology is triggering a paradigm shift in event-based vision.

4. Current Status of Data-Driven Technology

Since data-driven approaches, dedicated hardware, and relevant datasets collectively constitute the research field of data-driven technology, the current status of the three parts is explained concretely in this section. In terms of algorithms, data-driven approaches are mainly divided into three categories: spiking neural networks, standard learning architectures, and novel architectures. With large quantities of event data urgently needed to implement these algorithms, large-scale datasets and event camera simulators act as promising alternatives for cost reduction. To improve the holistic performance of event-based visual systems, hardware optimization ranges from the sensor level to the processor level. All three parts influence and promote each other during their development.

4.1. Data-Driven Approaches

With each pixel detecting local luminance changes independently, event cameras pose a paradigm shift in the acquisition of scene information, outputting discrete and asynchronous event streams. Due to this unique working principle, deep learning architectures adopted in frame-based vision cannot be applied directly to event data, provoking the development of several event-based data-driven approaches, including spiking neural networks, standard learning architectures, and novel architectures. Spiking neural networks (SNNs) [42] have emerged as a promising candidate since they perform asynchronous inference on specialized neuromorphic hardware [10] with low power consumption, exploiting the inherent spatio-temporal sparsity of event data. However, with the number of spikes drastically vanishing at deeper layers, deep SNNs are notoriously difficult to train, weakening their performance in high-level tasks. Additionally, hardware for SNNs remains expensive and scarce, further limiting their spread. Other research [6, 7, 40] aims to reutilize standard learning architectures designed for image data by converting groups of events into tensors [5, 6, 13]. These methods have achieved state-of-the-art results in several tasks thanks to high-capacity deep neural networks and the sufficient signal-to-noise ratio of tensor-like event representations; the price, however, is that the discrete and asynchronous properties of events are sacrificed. The methods above provide solutions with both advantages and disadvantages, and recent research optimizes by combining the complementary advantages each category has to offer. Novel architectures tailored to high-rate, variable-length, and nonuniform event streams have been proposed to balance the trade-off between accuracy and efficiency while protecting the spatio-temporal sparsity [4, 43, 44]. Generally speaking, with high-capacity neural networks, features in data-driven approaches are automatically learned from data by optimizing the corresponding objective function, without handcrafted feature descriptors. In other words, provided with sufficient high-quality data and a suitable network architecture, parameters are well adjusted to labeled data through training, regardless of the task type.

Inspired by the efficiency and adaptability of biological systems, neuromorphic computing is a natural fit for processing event data. Though conventional artificial neural networks (ANNs) are the predominant tools, their energy requirements hinder deployment on resource-constrained platforms such as embedded devices, precisely where event cameras show advantages. Offering asynchronous and sparse computation compatible with event data, spiking neural networks (SNNs) are promising computational partners for event cameras to create end-to-end event-based systems.

Biological neurons differ from neurons in ANNs mainly in the way information propagates between units. Rather than static nonlinearities, neurons in biological systems are dynamic devices that encode and process information in the form of discrete spikes, requiring a minute fraction of the power. These efficient biological perception and computation principles lead to the realm of SNNs. An SNN is a hierarchical network of dynamic spiking neurons defined by the model parameters, with each unit receiving and outputting spiking signals. Each neuron receives input from its corresponding receptive field, modifying its membrane potential and emitting an output spike when the state reaches a certain threshold. Neurons connected together form an SNN that processes information with spiking signals rather than numeric values. Typically, input spiking signals are collected from neuromorphic sensors or converted from natural signals. There exist diverse rules for conversion, such as rate encoding, time encoding, and population encoding. Generally, the spiking rate and the temporal pattern contain valuable information about stimulation and computation. Analogously, the output spiking signals are fed to a neuromorphic actuator or converted to natural signals following decoding principles, as illustrated in Figure 9.
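
To make the neuron dynamics described above concrete, the sketch below implements a discrete-time leaky integrate-and-fire (LIF) neuron, one commonly used spiking neuron model; the time constants, threshold, and input values are illustrative assumptions.

```python
def lif_neuron(input_spikes, dt=1e-3, tau=20e-3, v_thresh=1.0, v_reset=0.0):
    """Discrete-time leaky integrate-and-fire neuron: the membrane potential
    leaks toward rest, integrates the weighted input of each step, and fires
    (then resets) whenever it crosses the threshold."""
    v, out = v_reset, []
    for s in input_spikes:               # s = summed synaptic input this step
        v += dt / tau * (-v) + s         # leak + integrate
        if v >= v_thresh:
            out.append(1)
            v = v_reset                  # hard reset after the spike
        else:
            out.append(0)
    return out

print(lif_neuron([0.0, 0.6, 0.6, 0.0, 0.6]))  # -> [0, 0, 1, 0, 0]
```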

Coupling event streams with spiking neural networks exploits the signal's spatio-temporal sparsity and enables event-based end-to-end systems. Without a prior conversion from events to image-like tensors, SNNs are a more efficient and long-term solution than conventional frame-based approaches in event-based vision. Despite their asynchronous inference and low power consumption compatible with event cameras, SNNs demand expensive specialized hardware (such as TrueNorth [9] from IBM and Loihi [10] from Intel) and face the vanishing-spike problem at deeper layers. Moreover, since the spike-generation mechanism within a neuron is nondifferentiable, feedforward SNNs are difficult to train without effective backpropagation algorithms. These shortcomings together restrict their adoption in complex real-world scenarios. Several event-based studies have been working to resolve SNNs' shortcomings based on prior achievements and to exploit their strengths to the fullest.

Spiking neural networks (SNNs) [45] have been applied to various event-based fields, including low-level tasks such as optical flow estimation [46–48], high-level tasks such as object recognition [49, 50] and classification [51], and tasks concerning the 3D structure of the scene [52, 53] and robotic visual perception [54]. Benosman et al. [16] used a spiking neural network that is theoretically similar to the classical Lucas–Kanade algorithm to estimate visual motion, exploiting the sparse, high-temporal-resolution event data. Based on [16], the same team [17] demonstrated a fully spiking neural network for optical flow prediction on TrueNorth hardware [9]. Inspired by the visual cortex, a hierarchical feature extraction mechanism has been adopted in SNNs to extract information from the precise timing of the spikes [55]. The authors of [48] estimated event-based optical flow by means of a hierarchical spiking architecture based on Spike-Timing-Dependent Plasticity learning [56]. Researchers in [49] presented a spiking hierarchical model for object recognition utilizing the precise timing information contained in the output data. Tang et al. [50] proposed a hierarchical feedforward spiking neural network for the classification of digital characters recorded by a DVS. In addition, Acharya et al. [51] presented a three-layer SNN-based region proposal network operating on event data and applied it to real recordings. Although SNNs have mainly been applied to classification problems [50, 51, 57], a recent study [58] unlocked the potential of SNNs to tackle numeric regression problems in the continuous-time domain for event-based data. Benosman et al. [52, 53] solved the stereo-correspondence problem and estimated depth in a 3D scene with a spiking neural network running on neural processing devices. Moreover, Bing et al. [54] designed an end-to-end SNN based on STDP learning rules for a robotic visual navigation system.

Since spiking neurons' transfer function is naturally nondifferentiable, backpropagation cannot be directly used to train SNNs, limiting their computational potential. To break through this bottleneck, some research [59, 60] tailors backpropagation for SNNs and backpropagates errors at spike times, achieving limited success. Other works [61, 62] apply a continuous function as a proxy for the spike function's derivative, which has proven effective in deep feedforward SNNs. Since the methods above only compute approximate gradients, algorithms for spiking deep convolutional networks [39] remain to be improved in future studies.
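
As a generic illustration of the surrogate-gradient idea (not the exact formulation of [61, 62]), the PyTorch sketch below keeps the hard Heaviside spike in the forward pass and substitutes a smooth fast-sigmoid derivative in the backward pass; the slope constant is an arbitrary choice.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass; a smooth fast-sigmoid derivative
    in the backward pass so gradients can flow through otherwise
    nondifferentiable spiking neurons."""
    @staticmethod
    def forward(ctx, v_minus_thresh, slope=10.0):
        ctx.save_for_backward(v_minus_thresh)
        ctx.slope = slope
        return (v_minus_thresh > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        surrogate = 1.0 / (ctx.slope * v.abs() + 1.0) ** 2
        return grad_output * surrogate, None

v = torch.tensor([-0.2, 0.1, 0.4], requires_grad=True)   # membrane - threshold
spikes = SurrogateSpike.apply(v)
spikes.sum().backward()
print(spikes, v.grad)   # binary spikes, nonzero surrogate gradients
```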

Due to the difficult training procedure, limited accuracy in complex tasks remains a challenge for SNN-based methods. Recently, much research has aimed at reutilizing conventional machine learning techniques after converting groups of events into intermediate image-like representations. Mature frame-based learning architectures help accelerate the development of event-based algorithms [5–7, 13].

Additionally, the similarities between tensor-like representations and natural images enable, to some degree, transfer learning with networks pretrained in frame-based vision [6, 63, 64]. Owing to high-capacity neural networks and sufficient signal-to-noise ratio, methods that resort to standard learning architectures have achieved satisfactory performance in diverse fields of event-based vision [40, 41, 65]. Broadly, the differences between these approaches mainly lie in three aspects: the event representation method, the network architecture, and the loss function used for optimization during training.

In order to handle an event stream with frame-based deep convolutional neural networks (CNNs) or recurrent neural networks (RNNs), events first need to be transformed into image-like representations compatible with natural images, as illustrated in Figure 10. The intermediate representations are usually synthesized from batches of events with a fixed number [66] or within a constant temporal window [37]. From another perspective, these representations can be distinguished as updated synchronously [7] or asynchronously [13]. Additionally, according to the utilization of temporal information, event data can be represented as a basic dense encoding of event locations [40] or a dense encoding that includes temporal information [13, 65]. Several widely adopted representations have emerged during this exploration, including event count images [40], maps of most recent timestamps [7, 67], the interpolated voxel grid [64, 65], and the latest end-to-end event representation framework (also referred to as the Event Spike Tensor) [6]. Among all these representations, the Event Spike Tensor preserves the maximum amount of information, without compression along any dimension.
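
For concreteness, the sketch below builds two of the representations mentioned above, a two-channel event count image and a map of most recent timestamps, from a plain list of events; the array shapes and the absence of any normalization are illustrative assumptions.

```python
import numpy as np

def to_count_and_time_images(events, height, width):
    """Convert a list of (x, y, t, p) events into a 2-channel count image
    (one channel per polarity) and a map of the most recent timestamp."""
    counts = np.zeros((2, height, width), dtype=np.float32)
    t_recent = np.zeros((height, width), dtype=np.float32)
    for x, y, t, p in events:
        counts[0 if p > 0 else 1, y, x] += 1.0
        t_recent[y, x] = max(t_recent[y, x], t)
    return counts, t_recent

events = [(0, 0, 0.001, +1), (0, 0, 0.004, -1), (2, 1, 0.002, +1)]
counts, t_recent = to_count_and_time_images(events, height=3, width=3)
print(counts[:, 0, 0], t_recent[0, 0])   # [1. 1.] 0.004
```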

Maqueda et al. [40] accumulated events of different polarities over a fixed temporal window to predict steering angle. For optical flow estimation, Zhu et al. [7] proposed a four-dimensional grid encoding the last timestamp and event count of each pixel. Later, the same team made improvements by representing events as a spatio-temporal voxel grid [65], which accumulates events in a linearly interpolated manner using time information as the weight, preserving the full spatio-temporal distribution of events. In high-level tasks, Sironi et al. [13] paired event data converted into histograms of averaged time surfaces (HATS) with a support vector machine for object recognition, reutilizing the standard learning pipeline and enabling asynchronous updates given sufficient computing power. Alonso et al. [68] contributed a 6-channel image representation for the semantic segmentation task, recording the event count as well as the mean and standard deviation of the timestamps of events. The latest research [6] proposed an end-to-end framework with a learnable kernel, which represents events in a data-driven manner, and evaluated its performance on object recognition and optical flow estimation tasks. Generally, the interpolated voxel grid and the Event Spike Tensor are the most popular representations in recent research [41, 64, 65] because they compress little of the information in the raw event data.
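
A rough sketch of the linearly interpolated voxel grid idea [65]: each event's polarity is split between its two nearest temporal bins with weights proportional to temporal proximity. The bin count and the absence of any normalization below are assumptions, not the reference implementation.

```python
import numpy as np

def to_voxel_grid(events, height, width, n_bins=5):
    """Accumulate event polarities into an (n_bins, H, W) grid, linearly
    interpolating each event between its two neighboring time bins."""
    grid = np.zeros((n_bins, height, width), dtype=np.float32)
    ts = np.array([e[2] for e in events], dtype=np.float64)
    t0, t1 = ts.min(), ts.max()
    scale = (n_bins - 1) / max(t1 - t0, 1e-9)        # map time to [0, n_bins-1]
    for (x, y, t, p), tn in zip(events, (ts - t0) * scale):
        lo = int(np.floor(tn))
        hi = min(lo + 1, n_bins - 1)
        w_hi = tn - lo                               # linear interpolation weight
        grid[lo, y, x] += p * (1.0 - w_hi)
        grid[hi, y, x] += p * w_hi
    return grid

events = [(1, 1, 0.000, +1), (1, 1, 0.010, -1), (0, 2, 0.005, +1)]
print(to_voxel_grid(events, height=3, width=3).shape)  # (5, 3, 3)
```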

In terms of network architecture, most research [7, 41, 64, 65, 68, 69] based on deep neural networks adopts an encoder-decoder structure inspired by the stacked hourglass [70] and UNet [71] architectures. The encoder-decoder structure helps ensure the same resolution between the output and the input tensor, satisfying the needs of several tasks, as illustrated in Figure 11. For instance, semantic segmentation is often regarded as per-pixel classification, with an output at the same resolution as the input; Alonso et al. [68] took an Xception model [72] as the encoder and built a light decoder for the event-based semantic segmentation task. Furthermore, fully convolutional networks require fewer weight parameters, consequently decreasing the risk of overfitting. Last but not least, the loss can be applied to each intermediate state of the decoder [7], dramatically improving the accuracy of algorithms. Concretely, for optical flow estimation, EV-FlowNet [7] closely resembles the UNet architecture and applies a loss to every intermediate flow of the decoder using downsampled grayscale images. A growing number of successful CNN-based and RNN-based approaches have emerged in event-based vision, with the encoder-decoder architecture as a frequent choice.
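
As a structural illustration of this encoder-decoder pattern (a bare-bones sketch, not the EV-FlowNet or UNet architecture itself; channel counts and depth are arbitrary), the PyTorch module below downsamples a tensor-like event representation and upsamples it back to the input resolution with a skip connection from the input.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Minimal UNet-flavored encoder-decoder: the decoder restores the input
    resolution so per-pixel outputs (flow, segmentation) line up with the
    event representation fed in."""
    def __init__(self, in_ch=2, out_ch=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16 + in_ch, out_ch, 1)   # skip connection from the input

    def forward(self, x):
        z = self.dec(self.enc(x))
        return self.head(torch.cat([z, x], dim=1))

x = torch.randn(1, 2, 64, 64)          # e.g. a 2-channel event count image
print(TinyEncoderDecoder()(x).shape)   # torch.Size([1, 2, 64, 64])
```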

As for training methods, supervised learning supported by sufficient labeled data plays a dominant role in classification tasks. For instance, researchers in [6] used the N-Cars dataset [13] and the N-Caltech101 dataset [12] for object recognition evaluation. The authors of [41] compared each reconstruction with the corresponding ground truth grayscale frame, pairing a temporal consistency loss with an image reconstruction loss. It is worth mentioning that, owing to the heavy burden of manually labeling ground truth and CNNs' robustness to training with approximate labels [73, 74], researchers have successfully used labels generated from grayscale frames for supervised training in several event-based tasks, such as object detection [75] and semantic segmentation [68].

Since the ground truth for some event-based tasks is difficult to generate and datasets with manually annotated labels remain rare and expensive, self-supervised and unsupervised learning offer an opportunity to learn the relevant parameters from event data without corresponding labels, adding to the availability of deep networks in event-based vision. When labeled data are scarce [5], a feature extractor trained with unsupervised learning is coupled with a classifier, which demands labeled data for training, in the recognition task. To predict a vehicle's steering angle [40], the pose serves as a third-party ground truth; that is, the frame 1/3 s after the current one is selected as its ground truth. To estimate optical flow [7], a self-supervised loss is applied over the predicted flow using grayscale images generated by the DAVIS camera. Supported by a motion-blur-based loss function [65], motion information and the structure of the scene are learned in an unsupervised manner using only event data. Generally, unsupervised learning methods make the most of events' coordinate and polarity information and hence do not rely on ground truth or additional information. Gallego et al. [34] provided a unifying framework to handle several estimation problems in event-based vision, obtaining the motion parameters that best fit the input data by contrast maximization. The additionally generated motion-corrected edge-like images can also serve several downstream studies such as feature tracking [38] and object recognition. The researchers carried on this work in later research [35] and analyzed 22 objective functions for unsupervised learning.
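
The contrast maximization principle of [34] can be sketched in a few lines: warp events along a candidate motion, accumulate them into an image of warped events, and score the candidate by the image variance, one of the objectives later catalogued in [35]. The purely translational warp and nearest-pixel accumulation below are simplifying assumptions.

```python
import numpy as np

def contrast_of_warp(events, velocity, height, width, t_ref=0.0):
    """Score a candidate velocity by warping events to t_ref, building an
    image of warped events (IWE), and returning its variance; sharper
    (better motion-compensated) images give higher variance."""
    iwe = np.zeros((height, width), dtype=np.float64)
    vx, vy = velocity
    for x, y, t, p in events:
        xw = int(round(x - vx * (t - t_ref)))        # translational warp only
        yw = int(round(y - vy * (t - t_ref)))
        if 0 <= xw < width and 0 <= yw < height:
            iwe[yw, xw] += 1.0
    return iwe.var()

# Events of an edge moving at 100 px/s: the true velocity scores highest.
events = [(10 + int(100 * t), 5, t, +1) for t in np.linspace(0, 0.05, 20)]
print(contrast_of_warp(events, (100.0, 0.0), 20, 40),
      contrast_of_warp(events, (0.0, 0.0), 20, 40))
```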

An ideal event-based algorithm should extract events' full coordinate and polarity information while exploiting the signal's spatio-temporal sparsity to ensure low latency and low power consumption. Data association, namely, establishing inherent correspondences between events, remains a central challenge in event-based vision. With increasing preponderance among event-based algorithms, data-driven approaches are mainly categorized into two classes: asynchronous spiking neural networks [46, 49, 52, 54] and standard learning architectures [7, 40, 65, 68]. As bio-inspired methods, SNNs offer asynchronous inference at a fraction of the power consumption but suffer from the vanishing spike phenomenon [76] in deeper layers. In contrast, the latter methods commonly trade efficiency for accuracy and generalize well in complex vision tasks, sacrificing the inherent spatio-temporal sparsity of events. The methods above contribute solutions with both advantages and disadvantages, and recent research aims at integrating the complementary strengths each category has to offer, as illustrated in Table 1. Novel architectures tailored to high-rate, variable-length event streams have emerged as promising candidates to balance the trade-off between accuracy and efficiency while protecting the spatio-temporal sparsity [4, 44]. The authors of [44] presented a deep hybrid neural network named Spike-FlowNet for optical flow estimation; the combination of SNNs and ANNs improves computational efficiency while maintaining performance. Researchers in [4] brought the concept of event-based asynchronous sparse convolutional networks into public sight. Concretely, a framework that transforms models trained with image-like event representations into asynchronous models is proposed to better leverage the asynchronous and sparse nature of events, surmounting the original limitation of deep neural networks in event-based vision.

4.2. Datasets and Simulators

Compared with mature frame-based computer vision, event-based research is still in its infancy in every aspect, including algorithms, hardware, and datasets. Since event cameras are expensive sensors, only a fraction of research teams can afford the devices, which severely slows down research. In parallel, with the rise of data-driven approaches in the event-based field, a huge amount of data is required for deep neural network training. To circumvent this contradiction, a growing number of datasets and simulators have been proposed to facilitate the development of event-based algorithms, dramatically reducing research cost and offering quantitative benchmarks for performance evaluation. Available event-based datasets target various applications, including optical flow estimation [20, 77], intensity-image reconstruction from events [78, 79], visual odometry and SLAM [19, 21, 80], segmentation [67, 81], and recognition [22, 23, 82].

For optical flow estimation evaluation, Rueckauer and Delbruck [77] created a dataset recorded with a DAVIS camera and creatively substituted rate gyro data for ground truth optical flow, since the true optical flow can be computed from gyro data when camera motion is restricted to rotation. Barranco et al. [20] provided both real-world and simulated datasets to evaluate the performance of visual navigation tasks. The real data were recorded with an RGB-D sensor and a DAVIS sensor on a mobile platform, containing events, images, optical flow, 3D camera motion, and the depth of the scene.

As for the image reconstruction task, Binas et al. [11] paired DVS events with APS streams in driving applications and published the DDD17 dataset, which contains annotated DAVIS driving recordings for various event-based applications. Researchers in [83] provided the DVS-Intensity dataset recorded with a DAVIS240C camera for event-based image reconstruction. Later, the same team released the CED dataset [78] recorded with a Color-DAVIS346 camera. For the evaluation of event-based video reconstruction, Stoffregen et al. [79] published the High Quality Frames (HQF) dataset, containing events and ground truth frames recorded with a DAVIS240C camera.

Among diverse datasets for SLAM, the most widespread dataset [19] stands out as a benchmark for event-based visual odometry [37, 38, 84] and has also found great utility in feature detection [31] and tracking tasks [85]. Moreover, for 3D perception tasks, Delmerico et al. [21] released the large MVSEC dataset, recorded under different lighting conditions and environments with a DAVIS346 camera. The authors of [86] presented a robotic dataset that contains information about an indoor environment recorded by a mobile robot equipped with two DAVIS240C cameras and an Astra depth camera.

By far, datasets for recognition and classification account for the largest proportion of event-based datasets, including N-MNIST, N-Caltech101 [12], N-CARS [13], and DVS-Gesture [22]. For gesture recognition, the IBM research group collected the DVS-Gesture dataset comprising 11 kinds of hand gestures under 3 illumination levels. Benosman et al. released the first real-world event-based dataset for object classification, namely, the N-CARS dataset, which contains cars at different poses and speeds against various background scenes. The Dynamic Vision Sensor Human Pose dataset (DHP19) [87] serves as a benchmark for human body movements and includes a set of 33 movements performed by 17 subjects, recorded with a DAVIS346 camera.

With the recent preponderance of data-driven approaches in event-based computer vision, a large amount of event data is required to design efficient end-to-end algorithms without intermediate tensor-like representations. To address this issue, apart from large-scale datasets, event camera simulators [19, 24] act as a promising alternative, since data and ground truth are more easily procurable than with real recordings. The simulator in [19] emulates the operating principle of event cameras and generates the corresponding event stream, intensity frames, and depth maps, given a virtual scene and a camera trajectory. The simulator in [88] adopts a custom rendering engine to render images from a 3D scene at a very high frame rate, generating asynchronous output. Different from the fixed-rate sampling approaches mentioned above [24], the event simulator and the rendering engine were tightly coupled for a more accurate simulation. Based on ESIM, the latest research [25, 89] aims at converting existing video datasets into event datasets, facilitating a variety of event-based applications: low-frame-rate videos are first transformed into high-frame-rate ones using an interpolation method [89] and then used for event generation [25]. The v2e toolbox [90] generates events from real or synthetic videos and serves as the first candidate to synthesize realistic low-light DVS data. Since random noise in neuromorphic sensors remains a challenge for simulation [91], an event probability mask (EPM) that labels real-world events with a likelihood was proposed, and the DVSNOISE20 dataset was released for benchmarking denoising.
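
The video-to-events pipelines above follow the same contrast-threshold logic: between two (ideally temporally upsampled) frames, each pixel emits as many events as its log-intensity change contains threshold crossings. The sketch below is a heavily simplified, uniformly timestamped approximation of that idea, not the v2e [90] or [25] implementation.

```python
import numpy as np

def frames_to_events(frames, timestamps, C=0.2, eps=1e-3):
    """Approximate event generation from a video: for each consecutive frame
    pair, emit floor(|d log I| / C) events per pixel, spreading their
    timestamps uniformly across the inter-frame interval."""
    events = []
    log_prev = np.log(frames[0].astype(np.float64) + eps)
    for frame, (t0, t1) in zip(frames[1:], zip(timestamps[:-1], timestamps[1:])):
        log_cur = np.log(frame.astype(np.float64) + eps)
        diff = log_cur - log_prev
        n = np.floor(np.abs(diff) / C).astype(int)          # threshold crossings
        for y, x in zip(*np.nonzero(n)):
            p = 1 if diff[y, x] > 0 else -1
            for k in range(n[y, x]):
                events.append((x, y, t0 + (k + 1) * (t1 - t0) / (n[y, x] + 1), p))
        log_prev = log_cur
    return events

frames = [np.full((2, 2), 10.0), np.full((2, 2), 30.0)]
print(len(frames_to_events(frames, [0.0, 0.033])))   # several events per pixel
```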

4.3. Sensors and Processors

To improve the holistic performance of event-based visual systems, hardware optimization mainly consists of improvements at the sensor level and the processor level, as illustrated in Figure 12. After the emergence of the first silicon retina, a series of developments have been proposed [1, 2, 92]. Among existing event-based sensors, the DVS and its derivatives, namely, DVS128, DAVIS240 [2], and color-DAVIS346 [15], have dominated academic research, as illustrated in Table 2. Beyond sensing, how to process event data with high efficiency on brain-inspired chips has become another focus. Mainstream neuromorphic processors include TrueNorth [9] from IBM, Loihi [10] from Intel, and SpiNNaker [8] from the University of Manchester. As an example of combining the two, IBM adopted the DVS128 camera for visual perception before performing gesture recognition on the TrueNorth chip.

The first commercial event camera, the 128 × 128 pixel DVS128 [93], was released by iniVation and Delbruck et al., offering microsecond-level temporal resolution and a dynamic range of approximately 120 dB. It was widely adopted in various tasks including recognition, detection, and tracking. Recently, Samsung has introduced a new generation of DVS [94] with higher spatial resolution and smaller pixel size.

Recording only brightness changes without absolute intensity information, the DVS reduces redundant data and performs well in fast-motion scenarios that otherwise produce massive amounts of data. However, owing to the absence of the absolute intensity of the whole scene and the response to scene dynamics only, image reconstruction remained a challenge, especially in static scenarios, which stimulated the development of the ATIS [95], DAVIS [2], and CeleX [96]. The ATIS [95] reconstructs images using an intensity measurement circuit based on time intervals. Cooperating with Posch et al., Prophesee created the 304 × 240 pixel ATIS [1], offering a dynamic range of approximately 143 dB. Financed by Intel, Prophesee also applied the ATIS to the visual system of an autonomous vehicle. However, the mismatch between event data and grayscale reconstruction still exists under fast motion. The DAVIS emerged as a combination of a DVS and a standard camera through the addition of an extra Active Pixel Sensor, represented by the DAVIS240 proposed by iniVation and Delbruck et al.; the color-DAVIS346 was later published with a spatial resolution of 346 × 260. CeleX records intensity information by greatly expanding the event bandwidth. The recent CeleX-V [97] possesses a competitive spatial resolution of 1280 × 800, showing great potential in applications such as industrial automation and autonomous vehicles.

Inspired by biological systems, event-based visual systems are designed to pursue high efficiency. Neuromorphic processors unlock the unparalleled computational power of hardware for spiking neural networks, acting as natural computational partners for event-based sensors. According to neuron type, neuromorphic processors are mainly divided into processors with analog neurons, digital neurons, and software neurons. Several mature architectures such as TrueNorth [9] and Loihi [10] are composed of digital neurons. On-chip learning is also feasible for some processors. Originating from the University of Manchester, SpiNNaker (Spiking Neural Network Architecture) [8] mimics the biological mechanisms of the human brain with its constituent neurons implemented as software on ARM cores, achieving better model flexibility. In contrast to the software neurons adopted in SpiNNaker, IBM's TrueNorth employs digital neurons for real-time computation with networks trained offline and is incapable of on-chip learning. IBM has adopted TrueNorth for postprocessing after visual perception with a DVS128 in a gesture recognition task [22]. The Loihi chip created by Intel also uses digital neurons for inference and supports on-chip learning with programmable rules. Additionally, pretrained nonspiking networks can be transformed into approximate spiking networks with asynchronous inference on Loihi, offering extra possibilities.

5. Future Trends of Data-Driven Technology

Inspired by the efficient working principles of biological systems, neuromorphic sensors first stepped into public sight in 1991 in the form of a "silicon retina." The subsequent DVS and DAVIS have significantly promoted event-based computer vision by offering numerous advantages over standard cameras, unlocking scenarios previously unreachable. However, compared with mature frame-based computer vision, event-based research is still in its infancy in terms of algorithms, hardware, and datasets. With opportunities and challenges coexisting, considerable room is left for future improvement in event-based vision. Several possible directions are pointed out below, spanning the hardware, datasets, and algorithms perspectives.

5.1. Hardware

Recently, companies such as Samsung and Sony have built DVS devices with competitive pixel sizes, overcoming an initial limit in size and resolution and thereby improving their viability in practical use. Since the emergence of the first silicon retina by Mahowald and Mead, a series of improvements [1, 2, 92] have been proposed over the past decade, with some of them outputting static information along with dynamic information. To accelerate their deployment in automotive and consumer applications, further improvement in sensing hardware remains an open challenge. Additionally, since neuromorphic sensors are motivated by the biological retina and the subsequent processing principles of the brain, neuromorphic processors act as a natural fit to build an end-to-end event-based system. Several mature processors have emerged, including TrueNorth [9] from IBM and Loihi [10] from Intel. In order to support a branch of event-based algorithms (e.g., SNNs), processor-related research is worth pursuing in future studies. Furthermore, inspired by the concept of co-optimizing hardware with postprocessing algorithms holistically [16] and the observation that the contrast threshold of event cameras affects the performance of CNN-based algorithms [79], learnable parameters in event-based hardware might be another point of interest.

5.2. Datasets

Since event-based sensors remain scarce and expensive to acquire, large-scale, high-quality datasets with ground truth are urgently needed in the event-based research community. With the prevalence of data-driven approaches, well-labeled datasets play an increasingly significant role in both algorithm parameter tuning and algorithm benchmarking. Prevailing event-based datasets mainly include real-world datasets [19, 80] recorded with event cameras, simulated datasets [24, 88], and datasets [25, 89] converted from existing video datasets. Despite these existing datasets, several problems remain unsolved in this field, leaving considerable room for improvement. For instance, most event-based datasets focus on scenarios where event cameras excel and traditional ones fail, namely, scenarios with fast motion or challenging illumination levels; only recently was an alternative [79] recorded in well-lit conditions with little motion blur provided. Furthermore, owing to the burden of manually labeling per-pixel ground truth for this novel data form, few large-scale datasets have been published, especially for complex visual tasks such as object detection and segmentation. As a consequence, a large amount of reliable event data is urgently needed to promote event-based research.

5.3. Algorithms

As mentioned in Section 4, event-based data-driven approaches are mainly divided into three categories: spiking neural networks, standard learning architectures, and novel architectures. Inspired by this, improvements of algorithms mainly focus on the three aspects below, namely, event representations, novel architectures, and general frameworks.

Looking across algorithms for diverse vision tasks, whether they use deep learning techniques or not, most event-based algorithms comprise two modules: an event representation method and a postprocessing inference stage. That is, because of the intrinsic incompatibility between asynchronous event data and traditional frame-based algorithms, event streams are first summarized using an intermediate representation and then fed to subsequent algorithms such as a deep neural network or a classic filtering-based structure. Generally, an ideal event representation is supposed to extract comprehensive information from raw event data to ensure accuracy, as well as match the input size requirements. In recent research, widely adopted event representations include the Surface of Active Events (SAE) [98] and its derivatives, the event frame [66], the event count image [40], the voxel grid [65], and the Event Spike Tensor (EST) [6]. However, most existing transformations suffer from information compression compared with the raw data. Event representations without information loss are well worth further research.

As a novel type of sensor, event cameras transmit per-pixel intensity changes in the form of event streams, with each event encoding space-time coordinates and sign information. Broadly, existing event-based algorithms either draw on frame-based computer vision techniques or take inspiration from biological systems. However, purely bio-inspired SNN architectures suffer from the vanishing spike phenomenon [76] in deeper layers and the scarcity of dedicated hardware. As for standard learning architectures designed for frame-based vision tasks, the reliance on intermediate image-like representations usually leads to the loss of the asynchronous property and to redundant computation. Efficient long-term solutions specially tailored to event streams are of utmost urgency. To tackle this challenge, recent studies with novel architectures include a deep hybrid neural network [44] that combines SNNs with ANNs and an efficient event-based asynchronous sparse convolutional framework [4], offering valuable references for future work.

Apart from local techniques that enhance task-specific performance, event-based processing frameworks applicable to diverse visual tasks also attract researchers' interest owing to their strong generalization and reusability across studies. Determined by the unique working principle, event data possess several inherent properties: spatio-temporal sparsity and the fact that events usually encode the moving edges of the viewed scene. To some degree, efficient algorithms commonly make the best of events' intrinsic nature based on a concise and essential design concept. For instance, Gehrig et al. [6] presented a general framework that transforms event data into task-specific tensor-like representations in an end-to-end paradigm, suitable for various tasks such as optical flow prediction and object recognition. Gallego et al. [34] also introduced a unifying framework aimed at several estimation problems in event-based vision, obtaining the motion parameters that best fit the input data by contrast maximization. Furthermore, the generated motion-corrected edge-like images can serve several downstream tasks such as feature tracking and object recognition. In later research [35], 22 objective functions for unsupervised learning were analyzed to refine this framework. An efficient processing framework [4] insensitive to event representation, neural network architecture, and task was also proposed recently, achieving both low latency and high accuracy. Judging from the contributions above, research focusing on general frameworks has a more profound effect than research tied to a specific task, pointing to new opportunities.

6. Conclusion

In challenging scenarios such as resource-constrained platforms, highly reactive systems, or limited illumination conditions, a more efficient and robust way of visual perception is urgently needed. Event-based visual systems, including sensing and processing, have emerged as promising candidates in applications such as consumer electronics, the Internet of Things, industrial automation, and autonomous vehicles. Background knowledge in event-based vision was concisely presented in Section 2, paving the way for further analysis. During the development of event-based vision, the limitation of model-based methods in complex, high-level vision tasks and the great success of deep learning techniques in frame-based vision jointly led to the predominance of event-based data-driven technology, which comprises data-driven approaches in event-based algorithms, dedicated hardware, and relevant datasets. Reasons for the rise of event-based data-driven technology were concretely analyzed in Section 3. Lessons from frame-based vision include efficient algorithms, dedicated hardware, and large datasets, which also guide the development of event-based vision. In Section 4, we focused on the current status of event-based data-driven technology, spanning algorithms (data-driven approaches), datasets, and hardware. During the development of event-based algorithms, existing theories and techniques from biological systems and frame-based vision provide considerable support. Maintaining the inherent advantages of event-based perception, namely, low power consumption and low latency, stands as a priority on a par with high accuracy in event-based algorithms. Among various event-based algorithms, data-driven approaches are gaining prevalence, considering the preponderance of deep learning techniques and the compatibility of bio-inspired spiking neural networks with event-based sensors. With the two main categories of existing algorithms both suffering from intrinsic limitations, efficient long-term solutions specially tailored to event streams, balancing the trade-off between efficiency and accuracy, are of utmost urgency, and several recent studies with novel architectures have been proposed. To circumvent the contradiction between the scarcity of event-based sensors and the huge demand for event data driven by learning algorithms, a growing number of datasets and simulators have been proposed to facilitate the development of event-based algorithms, offering quantitative benchmarks for performance evaluation. To advance the holistic performance of an event-based visual system, optimization of hardware in terms of sensors and processors also plays a vital role. Last but not least, event-based research is still in its infancy compared with frame-based computer vision, leaving considerable room for future improvement. Several opportunities were pointed out in Section 5, spanning the hardware, datasets, and algorithms perspectives.

Conflicts of Interest

The authors declare that they have no conflicts of interest.