1 Introduction

Over the past few decades, biometric recognition systems have gained huge popularity for secure human authentication in several computing applications. Consequently, the contemporary era of digitization has led to the emergence of biometric technologies as a secure authentication tool and an active field of research [1, 2]. In comparison to traditional mechanisms, biometric-based human recognition has shown promising performance, specifically in terms of security, accuracy, and ease of use [3, 4]. These systems primarily rely on unique biological, physiological, or chemical human characteristics such as the face, fingerprint, iris, voice, hand geometry, signature, ear, and DNA [5, 6]. Among these, face-based recognition systems are a comparatively more popular authentication infrastructure, employed in a variety of applications such as forensics, law enforcement (e.g., issuing identity documents, border checks, police checks, drones, and facial recognition CCTV systems), health (e.g., patient tracking, genetic digital detection), banking systems, home security, and smartphone access. The Government of India has successfully developed a unique identification system (UIDAI Aadhaar) covering a population of more than 1.3 billion individuals. The Aadhaar system follows a multi-modal approach comprising three traits, namely face, fingerprint, and iris [7]. However, facial recognition systems are also exposed to various security threats, such as breaching the authentication system by presenting forged biometric data, which poses a challenging concern for these systems. A recent study reported that face-based recognition systems may be spoofed by presenting fake artifacts with an overall success rate of ~ 70% [8].
As per ISO/IEC 30107, a presentation attack (PA) is an attempt by an imposter to impersonate a genuine user by means of a presentation attack instrument (PAI) [9]. Typical examples of PAIs include printed photographs, masks, or video clips used to gain illegal access to the authentication system.

Figure 1 shows a few samples of genuine users and their analogous artifacts reproduced or fabricated using an inkjet printer, a mobile display, and a laser printer, which may be employed to circumvent an FR system. It is therefore a challenging task to discriminate between real and fake face images. The literature exemplifies that imposters adopt plentiful PAIs to spoof face-based recognition systems, including synthetic faces, sketches, printed photographs, 3D masks, video clips, reverse-engineered face images, and plastic surgery [10, 11]. There exist ample real-life cases that give an insight into the severity of the vulnerabilities in face-based recognition systems. The black-hat test in [12] demonstrates the spoofing of laptops of various manufacturers using face artifacts. In 2010, a man in a cap boarded a flight in Hong Kong and alighted in Canada as a young Asian man [13]. Likewise, in New York in 2014, robbers were caught looting a check-cashing store while disguising themselves as white cops using face masks [14]. Accordingly, it becomes momentous to comprehend the security issues related to face PAs and their countermeasures. Moreover, these security concerns need to be analyzed to avert possible assaults or to offer novel attack detection mechanisms that could augment the benefits of typical biometric-based authentication systems for end users [15, 16]. To mitigate face PAs, the development of PAD methods has become imperative for the successful operation of these systems [17, 18]. Commonly, face PAD methods rely on extracting distinctive features from face images to categorize them as either real or fake. In the current scenario, widespread research is being undertaken to design innovative face PAD techniques.
This, however, poses a new challenge: as countermeasures to existing face spoofing attacks are published, an adversary gains more refined cues and becomes well aware of the weaknesses available for further exploitation [19].

Fig. 1
figure 1

Instances of original face images (upper row) with their artifacts (lower row)

To the best of our knowledge, earlier detailed reviews of face PAD techniques were presented in [20,21,22,23,24]. In [17, 18] and [20], the authors highlighted various aspects of face presentation attacks with face artifacts, face PAD techniques, and face anti-spoofing databases. Similarly, in [19], an overview of advances in face anti-spoofing techniques, benchmarking databases, the performance achieved in openly organized competitions with various evaluation metrics, and open research issues is presented. Certain previous reviews [20,21,22], published up to 2017, focused on traditional PAD approaches. Also, although DL-based approaches have been rapidly adopted in anti-spoofing solutions since 2014, the earlier studies in [20], [21], [22], and [23] did not focus on DL-based PAD approaches. However, in [21], an exhaustive review of existing PAD techniques with an explicit focus on deep learning-based approaches is presented, along with a brief overview of recent datasets and performance evaluation protocols. A comparison of our survey with earlier similar studies on face PAD is illustrated in Table 1. This article offers an in-depth review and analysis of progress in face anti-spoofing mechanisms up to 2023, covering traditional approaches and, with more emphasis, modern deep learning-based approaches. The main motivation behind this study is to supplement the existing review articles with a description of the more contemporary developments and further research directions in this rapidly emerging field. The aim is to present a thorough investigation of face PAD methods and their paradigm shifts, along with the evaluation methodologies adopted by various state-of-the-art (SOTA) spoof detectors. Besides, we also expound a performance analysis of the most recent face PAD approaches under a common assessment setup.

Table 1 A comparison among related studies on face PAD techniques

In summary, the contributions of this article are as follows:

  1. i.

    We present an illustration of various face presentation attacks that are employed by assailants to spoof biometric systems.

  2. ii.

    We analyze and summarize face spoof detection mechanisms along with their underlying key concepts, performance, and scope.

  3. iii.

    We present an analysis of standard benchmark face anti-spoofing datasets and performance protocols that are widely employed for evaluating PAD algorithms.

  4. iv.

    A performance analysis of selected SOTA face PAD techniques on a common evaluation criterion is presented.

  5. v.

    Our study identifies and lists open research issues for face PAD mechanisms and suggests potential viewpoints that may set directions for future research.

The remainder of the article is structured in different sections, and a graphical overview is shown in Fig. 2. Section 2 presents the scope and coverage of our survey along with the general trend of face anti-spoofing. Section 3 briefly describes the security issues in a typical FR-based system with special emphasis on face presentation attacks. In Sect. 4, a taxonomy and a thorough study with analysis of various face PAD approaches are presented. Section 5 includes the performance evaluation methodologies used for evaluating face PAD algorithms, with benchmark datasets and evaluation protocols. The performance analysis of some popular SOTA techniques and an overall analysis of our survey are discussed in Sect. 6. Section 7 puts forward an insight into various research challenges and opportunities. Finally, conclusions are drawn in Sect. 8 with the future scope of the study.

Fig. 2
figure 2

The schematic organization of the article

2 Scope and coverage

In this study, we present a meticulous review and analysis of face PAD techniques, comprising pioneering research contributions of prominent authors over the past decades. The distribution of articles across various face PAD approaches is depicted in Fig. 3a, b, which reveals that most of the research before 2014 focused on handcrafted feature engineering and that, later on, the paradigm shifted towards deep feature-based models. It may be inferred from Fig. 3c that most of the studies related to face PAD were published during the period 2016–2021, which shows the criticality of the current research topic. As shown in Fig. 3d, this survey covers transactions, journals, conference proceedings, and workshops on face PAD methods from diverse repositories including ScienceDirect, IEEE Xplore, Springer, Elsevier, ACM, and Google Scholar.

Fig. 3
figure 3

Overall distribution of face PAD literature in the present survey a approach-wise coverage of articles b number of articles under various face PAD techniques c year wise face PAD publications to date. d Sources of published face PAD articles

The wide-ranging trend of research and growth in face PAD mechanisms over the last couple of decades is illustrated in Fig. 4. The entire study is divided broadly on the basis of the variety of face PAD mechanisms, depending on the underlying concept used to classify faces as fake or live. The research was dominated by hardware-based face PAD solutions in the era of 2000–2014 and by software-based methods using handcrafted features during 2005–2015, and since 2014 it has been oriented towards modern deep feature-based anti-spoofing models.

Fig. 4
figure 4

A general trend of research and developments in face PAD mechanisms

It can be observed that hardware-based face PAD methods exploiting camera characteristics, such as variable focusing properties, degree of depth, or the effect of defocus, are relatively efficient, as they do not involve any additional device besides the original camera. The static software-based methods that explore low-level textural features have an edge over their dynamic counterparts, mainly because the latter incur the extra overhead of processing multiple image frames. The foremost notion in handcrafted feature-based methods relies on exploring various image feature descriptors (e.g., LBP, LPQ, HoG, SIFT, and SURF) with high discrimination power together with a robust classifier (e.g., SVM, decision tree, LDA). The literature shows that notable research has been reported that exploits the merits of both static and dynamic approaches to design more robust and efficient PAD models.

With the emergence of deep neural networks (DNNs) such as CNNs, RNNs, and auto-encoders, research has shifted to a new paradigm since 2014. In this era, deep-level features are extracted automatically using models such as basic CNNs, ResNet-18, ResNet-50, Inception-v4, and lightweight CNNs. These automated feature extraction methods built a novel pathway for solving face PAD challenges. Existing handcrafted methods rarely explored cross-dataset testing, whereas the majority of recent deep learning-based approaches validate their algorithms for cross-dataset generalization capability. However, the limited performance of deep feature-based techniques in inter-dataset validation on unseen attacks remains a major challenge in the current scenario. The current trend in this active area of research motivates the research community to develop optimized face PAD models that utilize the pros of both handcrafted and DL-based feature engineering to overcome several issues such as generalization capability, vulnerabilities in DCNNs, availability of sufficient datasets, and training overhead.

3 Security issues in face recognition

Biometrics has gained momentous consideration for its wide-ranging applications in government, commercial, and forensic areas. In spite of their numerous advantages, these systems are susceptible to threats, as they include several points where attackers can carry out malicious activities to breach security [25, 26]. Ratha et al. [27] identified eight such points in a generic biometric system, and these can be broadly categorized as either direct or indirect attacks, as shown in Fig. 5.

Fig. 5
figure 5

Ratha’s framework (eight probable attack points in face biometric authentication system)

In direct attacks, explicit knowledge concerning the operation of the system is not essential [28]. They include only the type 1 attack point, where an intruder presents a face artifact to the sensor module of a biometric system [29]. In contrast, indirect attacks require inner information about the system to be successful and thus include the other seven points (i.e., type 2 to type 8) [30]. Among all, the most frequent attack point is type 1, also termed spoof or presentation attacks (PAs). In PAs, an intruder presents a forged version of the original biometric data to the sensor to circumvent the security of the biometric system [31, 32]. The present review focuses on type 1 attacks and their countermeasures in face recognition systems.

3.1 Face presentation attacks

Presentation attacks (PAs) in face recognition systems occur when an adversary tries to impersonate a genuine user by presenting forged face biometric data (generally a video, a photograph, or a mask), thereby gaining illegitimate access to the biometric authentication system [33, 34]. Thus, PAs in face recognition systems are broadly categorized into various types of counterfeits, as shown in Fig. 6.

Fig. 6
figure 6

A broad classification of face PAs

  1. i.

    Photo attacks: In these cases, a photograph (as shown in Fig. 7b) of the attacked identity is fed to the sensor module of the FR-based system. These are the most common attacks, as printing the image of a genuine user is an easy task; they are also known as print attacks [35]. Moreover, genuine face images are easily accessible through social media sites such as Facebook, Twitter, and Instagram. In recent years, the availability of low-cost, high-definition digital cameras has made these attacks even more trouble free [19].

  2. ii.

    Video attacks: These are popularly known as replay attacks; here the attacker presents a video of a genuine user as the PAI (as shown in Fig. 7c) to the sensor of the recognition system. This attack represents a more advanced version of the photo attack and is more difficult to detect than the former. In video attacks, not only the shape and texture of the face are imitated, but dynamics (i.e., eye blinking and movements of the face or eyes) are also included [36]. During a video-based attack, the continuous signal is digitized and recaptured by the recognition sensor.

  3. iii.

    Mask attacks: In these attacks, adversaries bring into play a three-dimensional mask (as shown in Fig. 7d) of a genuine user as the PAI presented to the sensing device. Creating a realistic-looking mask requires additional skills and effort. The simplest technique for constructing a mask is to print a two-dimensional photograph of a genuine user's face and then stick it onto some deformable structure. Owing to these complexities, mask attacks are far less common than their photo- and video-based counterparts [37, 38].

4 State-of-the-art face PAD mechanisms

In the previous section, we discussed the security issues of face biometric systems with a central focus on type 1 attacks, i.e., presentation attacks. The severity and likelihood of these attacks have motivated researchers to develop PAD modules that can counter the problem of face spoofing and hence discriminate bonafide from artificially created face samples, as shown in Fig. 8. In Case 1, a genuine individual is accepted by the system. In Case 2, when an imposter tries to breach security by presenting a variety of PAIs, the system rejects the attempt by implementing PAD mechanisms. This mechanism is known as face liveness detection or face presentation attack detection (PAD) [17]. To enhance the security of these FR-based authentication systems, numerous techniques have been developed, and this has become an active field for the research community.

Fig. 7
figure 7

An illustration of face spoofing attacks with few samples a original image b spoofed photo c video frame through electronic display d 3D mask

Fig. 8
figure 8

A depiction of face-based system under spoof attack and PAD countermeasure

From a general perspective in the literature, face PAD techniques are broadly classified into two categories, as shown in Fig. 9. In other words, these countermeasures are either hardware- or software-based face PA detectors. In the succeeding paragraphs, we discuss and analyze these methods in more detail.

Fig. 9
figure 9

Our novel taxonomy of face PAD approaches

4.1 Hardware-based approaches

These techniques explore key characteristics of the human face using an additional hardware device integrated with the sensor of the FR-based system. The extra hardware device detects some of the properties possessed by living beings, such as blood pressure, conductivity, sweat, and the facial thermogram [39].

The signals captured by the hardware device are used to distinguish live face images from fake ones with a satisfactory accuracy rate [40]. These techniques are expensive, as they require an additional hardware device [19, 34]. Hardware-based techniques further explore vitality characteristics, sensor properties, eye-blink detection, and challenge-response tasks. The face PAD literature related to these features is reviewed in detail in the following paragraphs.

4.1.1 Sensor characteristics-based approaches

These methods are based on sensor characteristics and exploit features of the sensor (camera) module. The observed features depend on the design of the image capturing device used for acquiring the data, for instance measuring the reflectance with a near infrared/multispectral sensor or the focus variation with a light field camera. Similarly, reflectance measurement in a 3D scan is a good example where sensor characteristics are examined. Early on, Kim et al. [41] used the notion of variable focusing (one of the characteristics of a camera/sensor) for discriminating artificial and bonafide face samples. The approach relies on the depth of field (DoF), which is the range between the farthest and nearest objects within a given focus. Every image has both focused and blurred areas, and the former is under the control of the user. Besides, each lens has a unique focal length; it was discovered that when two sequential images are collected from each sample, there is a difference between the focus values of bonafide and fake images. Real face images are solid, as the focused regions are clear and the remaining ones are blurred owing to depth information, whereas spoofed images are flat. These features help to segregate genuine face images from fake ones. The proposed technique exhibits better results when the DoF is smaller. Yi et al. [42] introduced a multi-spectral system for recognizing faces based on the visible (VIS) and near infrared (NIR) spectra. The results show that the VIS system is robust to NIR photo attacks, whereas the NIR system is robust to VIS photo attacks. A countermeasure is then proposed based on color and texture information. NIR photo attacks are resisted by color analysis, as NIR photos captured by a VIS camera show no color; by setting a certain threshold on the color information, NIR photos can be rejected easily.
On the other hand, VIS photos usually pass the color analysis but are rejected when the texture information is analyzed. Yang [43] presented a revised model of Kim et al. [41] and considered the blurriness degree of the face and background for increasing the difference between two sequential photos, instead of the nose and ear used in [41]. Two sequential images of a given object, focused on the nose and on the background, are taken and analyzed: if the degree of blurriness of the two is dissimilar, the object is classified as real; if both images are similar, the object is categorized as spoofed. Kim et al. [44] further explored the effect of defocus, where three different features, namely focus, power histogram, and gradient location and orientation histogram, are extracted for spoof detection. In another approach, a vitality characteristic of humans is explored: Li et al. [45] detect the pulse signal, which is absent in artificially created face masks or printed photos, and use it as a basis for liveness detection. Raghavendra et al. [46] designed a novel technique which can sense and counter spoof attacks on the recognition system by employing a light field camera (LFC) as the capturing device. An LFC is capable of recording the direction of all incoming rays in addition to their intensity, and it exhibits the unique property of rendering multiple depths in a single capture. This unique property of the LFC, which explores the variation of depth between the multiple depth face images, is used to detect the attacks. In the proposed approach, three methods are introduced for computing the change of image depth (focus) across several depth images, which is then used for detecting presentation attacks. In the first, the absolute variation in depth (focus) is captured by computing the difference between the minimum and maximum of an estimated focus measure FMC, as given by Eq. 1.

$$\text{VF}_{\text{A}} = \max\left(\text{FM}_{\text{C}}\right) - \min\left(\text{FM}_{\text{C}}\right).$$
(1)

The second approach uses the relative value, measuring the ratio of the maximum to the minimum of the estimated focus measure FMC, as specified by Eq. 2.

$$\text{VF}_{\text{R}} = \frac{\max\left(\text{FM}_{\text{C}}\right)}{\min\left(\text{FM}_{\text{C}}\right)}.$$
(2)

The third approach concatenates both the absolute and relative variation values calculated from Eqs. 1 and 2, as per Eq. 3.

$$\text{VF}_{\text{FU}} = \text{VF}_{\text{A}} \,\|\, \text{VF}_{\text{R}}.$$
(3)

Once the variation difference is computed, an SVM classifier is trained to discriminate live and fake face images. The proposed technique is tested on a self-created database consisting of 80 subjects of various genders, ages, and ethnicities, with face artifacts created using three different types of PAIs, namely laser-jet photo prints, inkjet photo prints, and an electronic screen (iPad). In 2017, Raghavendra et al. [47] extended their work and evaluated the susceptibility of multispectral FR-based systems to PAs. Moghaddam et al. [48] proposed an approach where light field images are explored to extract features by making use of the Light Field LBP (LFLBP) descriptor. The LFLBP captures not only spatial information but also the light field angular data linked to the set of images, which is explored for face PAD. Tang et al. [49] designed a technique where randomly flashed images are analyzed for reflected light, and human liveness characteristics are explored, such as textural features, 3D shape, and the processing of reflections at various light speeds. With ordinary digital cameras, the proposed technique is able to detect the traces left by an attacking procedure.
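The three focus-variation measures of Eqs. 1–3 are straightforward to compute once a focus estimate is available for each rendered depth image. The following is a minimal numpy sketch, assuming `focus_measures` holds the estimated focus values FMC of the depth images; the function name and the sample values are illustrative, not taken from [46]:

```python
import numpy as np

def focus_variation_features(focus_measures):
    """Focus-variation features from a vector of per-depth-image focus
    estimates FM_C, following Eqs. 1-3."""
    fm = np.asarray(focus_measures, dtype=float)
    vf_a = fm.max() - fm.min()        # Eq. 1: absolute variation VF_A
    vf_r = fm.max() / fm.min()        # Eq. 2: relative variation VF_R
    vf_fu = np.array([vf_a, vf_r])    # Eq. 3: concatenation VF_A || VF_R
    return vf_a, vf_r, vf_fu

# A live face rendered at several depths typically yields larger focus
# variation than a flat printed or displayed artifact.
vf_a, vf_r, vf_fu = focus_variation_features([0.2, 0.55, 0.9])
```

The resulting feature (absolute, relative, or the fused pair) would then be fed to the SVM classifier described above.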

4.1.2 Blink detection-based approaches

In these techniques, the spontaneous action of eye blinking, performed unconsciously by a person, is tracked continuously. An eye blink is the physiological task of closing and opening the eyelids, which is an essential function of the eyes. This blinking feature is used for mitigating PAs in face recognition systems. A method for eye-blink detection using frame differencing coupled with optical flow computation is employed by Bhaskar et al. [50]. For discriminating eye blinks from other face motions, the optical flow is computed using both the magnitude and the direction of the flow vectors. This eye-blink detection algorithm for locating the eyes is useful for face anti-spoofing using simple web camera-based hardware. However, the operational feasibility of this method for face liveness detection proved to be costly in terms of algorithmic complexity. Ali M. Al-Qayedi [51] used a scheme for eye detection based on experimental interpretations of the gestures of both eyes in a set of benchmark and pre-recorded head-and-shoulders patterns. The approach first captures an image of the eye and extracts facial features from a video frame. Then, a codebook is built for the extracted features by tracking their motion. For this purpose, a normalized cross-correlation (NCC) block-matching technique is deployed, where the current and previous frames are compared on the basis of the correlation coefficients of the shapes of the extracted features. The obtained results are compared with an empirically set threshold to decide whether to add the new shape to the codebook or reject it. Pan et al. [52] presented an approach for detecting photographic faces using a generic web camera, which avoids the use of an extra hardware device. A conditional random field (CRF) framework is used for blink detection with inference, which captures long-range dependencies between the observations and the states.
The distance between the centroids of the two irises is used as a discriminative feature to classify a face as real or fake. A sequence of features in the eye images may be used to build CRFs that observe the behavior of the eyes in a real or fake face. The approach is tested on a publicly available blinking video database collected with a generic web camera, i.e., a Logitech Pro5000. For testing the proposed approach against photo imposters, they collected a photo-imposter video database of 20 users. The method is claimed to perform better than the cascaded approach of AdaBoost and HMM. Recently, in 2020, Arpita Nema [53] scrutinized the eye-blink count and the HoG feature descriptor for face liveness detection and achieved 96% classification accuracy on two publicly available datasets, namely ORL and CASIA-FASD.
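The NCC block-matching step in [51] compares corresponding blocks of consecutive frames by their correlation coefficient and accepts a tracked shape only when the correlation exceeds an empirically set threshold. A minimal numpy sketch of that decision, where the threshold value and sample blocks are hypothetical:

```python
import numpy as np

def ncc(block_a, block_b):
    """Normalized cross-correlation between two equally sized image blocks:
    mean-centre both blocks and divide their inner product by the product
    of their norms, yielding a score in [-1, 1]."""
    a = block_a.astype(float) - block_a.mean()
    b = block_b.astype(float) - block_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0:                    # flat blocks carry no texture to match
        return 0.0
    return float((a * b).sum() / denom)

THRESHOLD = 0.8                       # empirically set; hypothetical value
prev_block = np.arange(64).reshape(8, 8)
curr_block = prev_block + 10          # same pattern, shifted brightness
match = ncc(prev_block, curr_block) > THRESHOLD
```

Because both blocks are mean-centred, a uniform brightness shift between frames does not lower the correlation, which is what makes NCC attractive for frame-to-frame matching.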

4.1.3 Challenge response-based approaches

In these techniques, an interface is provided where the response to a challenge is processed and recorded to identify bonafide presentations. Here, the cooperation of the user is required, because these methods detect involuntary or voluntary responses to external stimuli, for example, contraction of the pupil after a lighting event or tracking of the gaze towards external predefined stimuli. In the following paragraphs, we present an insight into existing challenge-response-based face PAD methods proposed by various authors.

Kollreider et al. [54] proposed a challenge-response-based face liveness detection technique which is useful for detecting video attacks. Ali et al. [55] introduced another technique based on visual stimuli: it measures the gaze of an individual to establish the presence of photo spoof attacks, and collinearity features are then used to differentiate live and fake attempts. Lagorio et al. [56] presented a 3D face structure-based approach for face biometric systems, in which the 3D curvature of the captured data is processed to distinguish live and fake faces. The proposed technique is based on first-order statistics of the estimated surface curvature and does not require any user cooperation. Smith et al. [57] designed an approach where images on the screen are used to create the challenge, and the reflections captured dynamically from the face form the response. The sequence formed by the images and their corresponding reflections watermarks the video. The reflection region features are used to determine whether the reflection matches the image sequence displayed on the screen. Table 2 summarizes various hardware-based approaches contributed by researchers for detecting the liveness of the face biometric trait.

Table 2 A comparison among hardware-based face PAD techniques

Table 2 shows that hardware-based methods were popular from 2000 to 2015, until the emergence of data-driven approaches. Besides, the majority of these methods counter photo attacks, and only a few are employed to tackle video or replay attacks. Some claim a high accuracy of approximately 100% when evaluated on self-created datasets. However, the initial non-availability of standard or benchmark face anti-spoofing databases limits a truly representative validation of these solutions. The analysis reveals that these approaches were evaluated on small, private datasets, and standard metrics were rarely used by the researchers, which makes comparing these methods against a common benchmark a challenging task.

4.2 Software-based approaches

In software-based techniques, features are extracted from face images captured through a high-resolution camera, and the acquired features are used to discriminate fake face images from live ones.

Figure 10 demonstrates the basic flow of activities in software-based face PAD mechanisms. These techniques differ from sensor-level techniques: in sensor-level techniques, features extracted from the live person are used to differentiate real and artificially created faces, whereas in feature extraction methods, the features obtained from the face image are processed to decide whether the image is live or fake [40]. They are further classified into two categories, namely static and dynamic approaches.

Fig. 10
figure 10

A generic flow of activities in handcrafted features-based face PAD mechanisms

4.2.1 Static software-based approaches

These approaches are designed to detect presentation attacks and work with a single instance of a given face image, without needing temporal information. They can also be applied to video sequences, where each frame is analyzed independently and the final decision is taken on the basis of a majority voting strategy. They are faster than dynamic approaches and offer good performance at low computational cost [22]. These techniques are further classified into three categories, namely texture, frequency, and image quality-based approaches, which are discussed in the following sections.
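The per-frame analysis with majority voting mentioned above can be sketched in a few lines; the score convention (higher = more likely live) and the threshold are illustrative assumptions:

```python
def video_decision(frame_scores, threshold=0.5):
    """Majority vote over independent per-frame PAD scores: each frame is
    classified on its own, and the video label follows the majority."""
    votes = [score >= threshold for score in frame_scores]  # True = live
    return "live" if sum(votes) > len(votes) / 2 else "fake"
```

Since frames are judged independently, a static method needs no temporal model here; a few mis-scored frames are simply outvoted by the rest of the sequence.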

4.2.1.1 Texture features-based approaches

The texture features of a given face image are extracted and then used for discriminating bonafide and fake presentations. These techniques examine the micro-textural patterns in a sample face image, which play a significant role in detecting display, mask, and photo artifacts. These methods can easily discriminate artifact characteristics such as pigments present due to printing defects or shading present due to display attacks [22]. In the face PAD literature, the majority of techniques have applied the LBP operator for feature extraction, as LBP is a powerful and computationally efficient image descriptor. The original LBP descriptor, introduced by Ojala et al. [60], is a gray-scale texture measure derived from the relationship of a pixel with its local neighborhood. It is computationally efficient and possesses great tolerance against gray-level changes. The operator computes image pixel labels by thresholding the neighborhood of each pixel against the central value and taking the resulting binary number. The computation of an LBP descriptor is shown in Fig. 11. A histogram created from these labels is then used for texture description. The LBP code of a pixel is calculated using Eqs. 4 and 5. The feature vector is constructed by consolidating all the patterns for a given image.

$${\text{LBP}} = \mathop \sum \limits_{j = 1}^{8} f\left( {i_{j} - c} \right) \times 2^{j - 1},$$
(4)
$$f\left( d \right) = \left\{ {\begin{array}{ll} 1, & {\text{if}}\; d \ge 0 \\ 0, & {\text{if}}\; d < 0, \end{array} } \right.$$
(5)

where ‘c’ denotes the intensity of the central pixel and ‘i_j’ denotes the intensity of the jth neighboring pixel. The thresholding function ‘f’ maps the difference between the intensities of the two pixels to a binary value.
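As a brief sketch, the computation described by Eqs. 4 and 5 can be implemented as follows; a fixed 3 × 3 neighborhood and a clockwise scan order are assumed here, since the operator itself does not mandate one:

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the central pixel of a 3x3 patch (Eqs. 4 and 5)."""
    c = patch[1, 1]
    # The 8 neighbours, taken clockwise starting from the top-left corner.
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for j, i in enumerate(neighbours):
        code += (1 if i - c >= 0 else 0) << j  # f(i - c) weighted by a power of two
    return code

def lbp_histogram(image):
    """Normalized histogram of the 256 LBP labels over the whole image."""
    h, w = image.shape
    codes = [lbp_code(image[r - 1:r + 2, c - 1:c + 2])
             for r in range(1, h - 1) for c in range(1, w - 1)]
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()
```

On a perfectly flat patch every neighbour satisfies i − c ≥ 0, so all eight bits are set and the code is 255; the normalized histogram of all codes serves as the feature vector described above.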

Fig. 11
figure 11

An illustration for computation of LBP code from face image

The micro-textural features extracted from sample face images using the LBP descriptor are shown in Fig. 12. Row 1 portrays two original face images with their corresponding feature histograms, whereas row 2 shows the LBP representations computed from the input images together with histograms that count how often each LBP pattern occurs; the obtained histograms are treated as feature vectors.

Fig. 12
figure 12

An example of LBP micro-textural features of sample face images

Maatta et al. [61] applied LBP to face PAD for the first time. They proposed a novel approach in which the texture of a face image is analyzed [60] using multi-scale LBP. The obtained micro-texture patterns are encoded into an enhanced histogram, and an SVM classifier is then trained for image discrimination. The LBP method was later extended to address video playback attacks [36]. Maatta et al. [62] extended their work by combining LBP features with two more low-level descriptors, namely Gabor wavelets and the Histogram of Oriented Gradients (HoG). LBP encodes the micro-texture features, Gabor filters capture macroscopic texture information, and HoG analyzes the shape features of an image. SVM classifiers are trained and the final decision is based on score-level fusion. Waris et al. [63] considered different textural features in facial images and proposed a new approach to counter photograph and video replay attacks. They extracted rotation-invariant LBP, Gabor and GLCM features from the face images and trained an SVM model for the classification task. However, their claims of superiority over other state-of-the-art methods lack extensive experimentation on standard publicly available datasets. Yang et al. [64] employed component-based face encoding; their approach segments the face into multiple components such as the eyes, nose, facial region, canonical face region, and mouth. Fisher analysis is then performed to explore the differences among these face regions. The features from all components are extracted, and a coding scheme based on vector quantization (VQ) is applied to derive high-level features from the lower-level ones. Raghavendra and Busch [65] explored global and local features of an image and proposed an algorithm that extracts both BSIF [66] and LBP features from the eye and face regions with weighted score-level fusion.
The BSIF descriptor is also widely deployed in the face PAD literature. BSIF was originally introduced by Kannala et al. [66], where a binary string is generated for every image pixel by computing its response to a bank of kernels trained on the statistical properties of natural images. The computed code of a pixel serves as a local descriptor of the intensity pattern in the pixel’s surroundings. For an image segment B (u, v) of size m × n pixels and a linear filter Xi of the same size, the filter response ri is computed via Eq. 6.

$$r_{i} = \mathop \sum \limits_{u,v} X_{i} \left( {u,v} \right)B\left( {u,v} \right) = {\mathbf{x}}_{i}^{T} {\mathbf{b}},$$
(6)

where the vectorized pixels of the image segment B and the filter Xi are denoted by b and xi, respectively. The binarized feature bi is obtained using Eq. 7.

$$b_{i} = \left\{ {\begin{array}{ll} 1, & r_{i} > 0 \\ 0, & {\text{otherwise}}. \end{array} } \right.$$
(7)

The generated feature vectors can be used to build histograms that represent the image feature descriptors. Let Ii(x, y), with i = 1, 2, …, n, represent n different natural images, and let Xi be the 11 × 11 filter pre-learned from the ith image. The ith kernel is convolved with the input image B(u, v) to compute the response value ri(u, v), and bi(u, v) is the binary response of the ith kernel at pixel (u, v). Hence, the binary responses of all the kernels are computed for every pixel of the image. These responses are combined into a BSIF code for each pixel, which constructs the corresponding BSIF image. Finally, the normalized histogram of the BSIF image provides the feature descriptor. The BSIF features extracted from a face image are shown in Fig. 13.
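The BSIF pipeline of Eqs. 6 and 7 can be sketched as below; note that random kernels stand in for the filters learned from natural images, so the codes are illustrative rather than the published descriptor:

```python
import numpy as np

def bsif_features(image, n_filters=8, size=11, seed=0):
    """Filter responses (Eq. 6), binarization (Eq. 7), and the normalized
    histogram of the resulting per-pixel BSIF codes."""
    rng = np.random.default_rng(seed)
    filters = rng.standard_normal((n_filters, size, size))  # stand-in kernels
    h, w = image.shape
    half = size // 2
    padded = np.pad(image, half, mode='reflect')
    codes = np.zeros((h, w), dtype=int)
    for i, X in enumerate(filters):
        for r in range(h):
            for c in range(w):
                # r_i = sum over (u, v) of X_i(u, v) * B(u, v)   (Eq. 6)
                ri = np.sum(X * padded[r:r + size, c:c + size])
                codes[r, c] |= int(ri > 0) << i                # b_i (Eq. 7)
    hist, _ = np.histogram(codes, bins=2 ** n_filters,
                           range=(0, 2 ** n_filters))
    return hist / hist.sum()
```

Each pixel receives an n-bit code (one bit per kernel), and the normalized histogram over all 2^n possible codes forms the image descriptor.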

Fig. 13
figure 13

Micro-textural features of a sample image along with its descriptor using BSIF

To combat 3D mask attacks, Erdogmus et al. [67] used variants of the LBP textural feature such as tLBP, mLBP and dLBP. They trained classifiers such as LDA, SVM with an RBF kernel, and linear SVM for real-versus-mask classification. One of the main findings of this work is that block-based and multi-scale LBP features yield better performance for 2D images, except for dLBP. As future work, more generalizable algorithms could be explored to detect mask attacks.

Boulkenafet et al. [68] examined the color face space for liveness detection by extracting textural features from the HSV and YCbCr color spaces. The authors extracted holistic face representations from luminance and chrominance images in various color spaces. In this work, five gray-scale texture features are used: LBP, BSIF, Co-occurrence of Adjacent Local Binary Patterns (CoALBP), Scale-Invariant Descriptor (SID) and Local Phase Quantization (LPQ). A linear SVM classifier labels an image as real or fake. One limitation of this work is that face normalization and the limits of the face bounding box are not optimized, which may significantly affect performance in cross-database testing. Furthermore, the study suggests that size, descriptors and other acquisition conditions could be investigated in future work. Peng et al. [69] exploited guided-scale texture features of face photos or videos to counter the effect of redundant noise contamination. Two guided-scale texture descriptors are analyzed, namely the “guided scale-based local binary pattern” (GS-LBP) and the “local guided binary pattern” (LGBP). The GS-LBP features guarantee the edge-preserving property of an image feature, while double quantization over neighboring pixels is used to generate better LGBP features. This work advocates focusing on optimal guidance-image selection for scale conversion; moreover, the sampling and quantization techniques could be improved for better results. Zhang et al. [70] employed adjacent-pixel discrepancy, termed the CTMF feature descriptor. The authors investigated the intrinsic properties of color texture for discriminating between real and fake traits by considering neighboring pixels. Initially, the face region is detected and normalized.
Thereafter, two important features, the “Color Channel Markov feature” (CCMF) and the “Color Channel Difference Markov feature” (CCDMF), are extracted from the color space of the image. The dimensionality of the extracted features is reduced using a feature selection technique known as SVM-RFE. Finally, an SVM classifier is trained on the selected feature vector to discriminate live and fake faces. This technique could be extended to work in hybrid color spaces and combined with other types of feature descriptors to enhance its efficiency. Peng et al. [71] measured chromatic facial texture differences and designed a novel chromatic co-occurrence LBP (CCoLBP) descriptor to examine inter-channel facial texture. Hasan et al. [72] used Difference of Gaussian (DoG) filtering as a pre-processing step with local binary pattern variance (LBPV) as the feature extractor for face liveness detection. DoG filtering removes noise from the face image, contrast features are used to extract LBPV, and the concept of uniform LBP is then applied. Although the authors claim the proposed method is effective, extensive experimental results on more publicly available datasets would have better supported the outcome. Schardosim et al. [73] designed another PAD technique using facial deformation energy together with other attributes, i.e., background and face textures with steganalysis features. Recently, Du et al. [74] deployed a residual color texture representation and proposed a novel CM texture-based descriptor that exhibits better performance than LPQ or LBP. Texture features are extracted from RGB and other spaces such as YCbCr, CIE and HSV, and an ensemble with probabilistic voting is finally used for the classification task. The majority of texture-based face PAD approaches cannot directly distinguish between live and fake samples in the way that depth-supervised approaches built on stacked convolutions can.
To address this limitation, the Sobel operator has been shown to be successful in acquiring the gradient magnitude, thanks to its fast computation of high-frequency information. However, since the Sobel operator is handcrafted, it is unable to handle intricate textures. As an alternative, Wang et al. [75] created the learnable gradient operator (LGO), a generalization of existing gradient operators, to efficiently extract fine-grained discriminative cues from raw pixels. To improve optimization, the authors also present an adaptive gradient loss. Extensive experimental comparisons with SOTA techniques on the widely used Replay-Attack, CASIA-FASD, OULU-NPU, and SiW datasets demonstrate the effectiveness of the proposed technique.
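For reference, the handcrafted Sobel baseline that LGO generalizes amounts to the following sketch (using SciPy's `sobel` filter; this is the fixed-kernel baseline, not the learnable operator of [75]):

```python
import numpy as np
from scipy.ndimage import sobel

def gradient_magnitude(image):
    """Gradient magnitude via the fixed Sobel kernels: a fast, handcrafted
    high-frequency cue, in contrast to a learnable gradient operator."""
    img = image.astype(float)
    gx = sobel(img, axis=1)  # horizontal derivative
    gy = sobel(img, axis=0)  # vertical derivative
    return np.hypot(gx, gy)
```

A learnable operator replaces the fixed ±1/±2 kernel weights with parameters optimized during training, which is what allows it to adapt to intricate textures.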

The comparative analysis of static texture-based techniques in terms of method, type of attack countered, classifier, database and performance is illustrated in Table 3.

Table 3 A comparative summary of various texture-based software face PAD techniques

It is observed from Table 3 that the majority of face PAD techniques explored either LBP or a combination of descriptors to overcome the limited performance of a standalone LBP feature set. Our analysis finds fewer techniques that deal with face mask attacks, while comparatively many research contributions target photo and video-based presentation attacks. An SVM is the obvious choice of classifier in the majority of the presented studies. In comparison to other learners, the SVM constructs a hyper-plane after mapping training samples to a higher-dimensional space, which provides better classification accuracy even for non-linearly separable problems. For mapping samples from the input space to a higher-dimensional one, the SVM utilizes a mapping function (e.g., linear, RBF or perceptron) termed a kernel.
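The kernel idea can be illustrated with a toy example using scikit-learn's `SVC`; the feature distributions below are synthetic stand-ins for texture histograms, not drawn from any face dataset:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Synthetic "texture histograms": attack samples are shifted to mimic the
# altered statistics of recaptured images. Labels: 0 = bonafide, 1 = attack.
real = rng.normal(loc=0.0, scale=1.0, size=(100, 16))
fake = rng.normal(loc=1.5, scale=1.0, size=(100, 16))
X = np.vstack([real, fake])
y = np.array([0] * 100 + [1] * 100)

# The RBF kernel implicitly maps samples to a higher-dimensional space,
# allowing a separating hyper-plane even when the classes are not
# linearly separable in the original feature space.
clf = SVC(kernel='rbf', gamma='scale').fit(X, y)
accuracy = clf.score(X, y)
```

With a linear kernel the decision boundary stays a hyper-plane in the input space; the RBF kernel is what lets the same machinery handle non-linearly separable texture features.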

4.2.1.2 Frequency-based approaches

Certain image characteristics, such as high- and low-frequency signals and their variations, are widely employed for extracting frequency-level features from face images. In the subsequent paragraphs, we present a review of frequency-based face PAD techniques.

Earlier, Li et al. [76] introduced a 2D Fourier spectra-based face PAD technique. The proposed method is based on two observations: first, the size of a live face differs from that of a printed photograph; second, poses and expressions vary between live and fake faces. Teja et al. [77] designed a technique to counter photo attacks in which the DCT energy of an image, aided by pupil and eye-blink detection, is explored. Zhang et al. [78] used multiple Difference of Gaussian (DoG) filters, which extract high-frequency information from a face image; an EER of 0.17% is achieved in experiments on a self-created private database. Pithadia et al. [79] explored corner, curve and edge features of face images, whose local geometry at lower resolution is the same as that of their higher-resolution counterparts; LBP is then used to model the geometric information with super-resolution. Further, Peng and Chan [80] proposed a Dynamic High Frequency Descriptor (DHFD), which calculates the difference in high-frequency energy components between face images captured with and without extra illumination. The extra illumination raises the energy of a live face by revealing fine details of skin and hair. Experiments performed on a self-created database show that the proposed DHFD is more robust and powerful than the original HFD. Similarly, Liu et al. [81] proposed a technique for video replay attacks that also considers the difference in hair texture with and without extra illumination; experiments on a self-created database claim ~ 100% accuracy. The details of the frequency-based approaches and their comparative analysis are summarized in Table 4.
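The DoG filtering recurring in these works can be sketched as follows (using SciPy's `gaussian_filter`; the band limits chosen here are illustrative, not the values used in the cited papers):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_highpass(image, sigma_inner=0.5, sigma_outer=1.0):
    """Difference of Gaussian: subtracting a strongly blurred copy from a
    lightly blurred one keeps a high-frequency band, where recapture
    artifacts of printed or replayed faces tend to concentrate."""
    img = image.astype(float)
    return gaussian_filter(img, sigma_inner) - gaussian_filter(img, sigma_outer)
```

Because both Gaussians pass the zero-frequency component unchanged, a constant image yields a (near-)zero response, which is why DoG acts as a band-pass filter that suppresses low-frequency illumination content.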

Table 4 A comparative summary of various frequency-based software face PAD techniques

Table 4 illustrates that none of the frequency-based face PAD techniques learned a model that counteracts mask attacks. These methods were used during 2004–2014; thereafter, a few techniques explored frequency-based features in combination with texture features, and these are therefore reviewed among the dynamic texture-based approaches. All of these techniques are evaluated on self-created datasets rather than publicly available standard datasets.

4.2.1.3 Image quality-based approaches

In these methods, image quality assessment (IQA) is used for detecting face liveness. Images captured while attempting an attack usually differ in quality from real ones; the differences may include structural distortion, degree of entropy, sharpness, luminance level, etc. [82]. In the following paragraphs, we briefly discuss existing image quality-based face PAD techniques.

Kose and Dugelay [83] analyzed reflectance characteristics for combating face mask attacks. A variational retinex algorithm is used for decomposing a gray-level image into illumination and reflectance components. In another work, Galbally and Marcel [40] used full-reference image quality measures (IQMs); among the 14 full-reference measures, the total corner difference and the total edge difference are considered. Pal et al. [82] also used six different IQMs for distinguishing artificial and bonafide presentations. Wen et al. [84] proposed a face PAD technique based on Image Distortion Analysis (IDA). Four features, namely chromatic moment, specular reflection, color diversity and blurriness, are extracted to create a feature vector for training an SVM classifier, which is used to combat photo and video attacks. Agarwal et al. [85] used Haralick features extracted from redundant wavelet-transformed video frames; PCA is then used to reduce the dimensionality of the extracted features. Boulkenafet et al. [86] proposed a novel solution for face PAD based on facial appearance description, applying Fisher Vector Encoding (FVE) on SURF features. The SURF descriptor is obtained from the wavelet responses in the vertical and horizontal directions. The region around each point is divided into 4 × 4 sub-regions indexed by j, and for each j the vertical and horizontal responses form a feature vector Vj represented by [Xdx, Xdy, X|dx|, X|dy|], where dx and dy denote the Haar wavelet responses in the horizontal and vertical directions, respectively. The 64-dimensional SURF descriptor is formed by combining the feature vectors extracted from all sub-regions, represented by [V1, V2, …, V16]. In the proposed work, SURF features are extracted from two color spaces, HSV and YCbCr.
First, the descriptor is applied to each color band separately; the resultant features are then combined into a single feature vector referred to as CSURF. Finally, PCA is applied for dimensionality reduction and FVE is applied to the combined features. The proposed approach yields good generalization performance in inter-database experiments even with limited training data, but other strategies could also be investigated to create more robust feature spaces for spoofing detection. Wang et al. [87] proposed two novel features that offer a shield against printed photo and replay attacks. The first feature captures the difference between the green and red channels of an image, and the other approximates the color distribution in local regions; both features are then consolidated into a multi-scale binary pattern (MSBP). Nikisins and Mohammadi [88] proposed a system that generates a feature space from different IQMs. A GMM is then trained to represent the probability distribution of real face samples; notably, their system can also handle unknown presentation attacks. Yeh and Chang [89] inspected image quality features via multi-scale analysis and proposed a novel blind image quality evaluator (BIQE). The BIQE is integrated with an effective pixel similarity deviation (EPSD) model, which is used to attain standard-deviation similarity maps of the gradient magnitude. The twelve quality features obtained from the combination of BIQE and EPSD constitute a multi-scale descriptor for image classification. Nguyen et al. [90] probed the differences between bonafide and fake images on the basis of noise statistics present in the facial skin; moreover, a new face anti-spoofing dataset containing high-quality attack and bonafide image samples is also introduced.
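Two of the simple no-reference quality cues recurring in these works, sharpness and entropy, can be sketched as follows; these are hypothetical feature choices for illustration, as each cited paper uses its own measure set:

```python
import numpy as np
from scipy.ndimage import laplace

def quality_features(image):
    """Sharpness as the variance of the Laplacian (blurry recaptures score
    low) and Shannon entropy of the intensity histogram."""
    img = image.astype(float)                 # intensities assumed in [0, 1]
    sharpness = laplace(img).var()
    hist, _ = np.histogram(img, bins=256, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]                              # drop empty bins before log
    entropy = -np.sum(p * np.log2(p))
    return np.array([sharpness, entropy])
```

A replayed or printed face typically passes through an extra capture stage, so its Laplacian variance drops and its intensity statistics shift, which is exactly the kind of discrepancy IQM-based detectors exploit.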
Table 5 summarizes analysis of image quality-based face PA detection techniques that include article information, key concept, attack details, trained classifier with database information and performance evaluation.

Table 5 A comparative summary of various image quality-based face PAD techniques

It is evident from Table 5 that the static image quality-based face PAD methods mainly focus on countering both photo and video face attacks. These techniques prominently use image quality parameters to extract discriminatory information from datasets of real and fake face images [91]. However, the extracted features may not be identical under various conditions owing to scaling, illumination, head movement and noise. A few samples of these variations in face images are shown in Fig. 14, which motivates the extraction of invariant features for accurate classification. One key issue is whether a single texture feature carries sufficient discriminatory information to classify face images as real or fake. This study also makes clear that the linear SVM has been the dominant choice of classifier. The summary of this review indicates that up to 2011 these face PAD techniques were tested on private datasets, and from 2011 onwards they are evaluated on publicly available face anti-spoofing datasets. Although most of the research focused on efficient face PAD methods and attained promising classification accuracy (e.g., 99.2% accuracy by Hassan et al. 2019), challenges remain manifold, such as the number of features extracted, the robustness of features or parameters, the number and types of classifiers, the size of benchmark datasets, and the generalization capability of these methods.

Fig. 14
figure 14

Few sample face images with variations due to rotational, scaling and illumination effects

4.2.2 Dynamic software-based approaches

In the dynamic approach, multiple instances of the given facial biometric trait are used for extracting features from a set of images. Some dynamic PAD techniques are specifically proposed for detecting video-based spoof attacks. These techniques usually attain very competitive performance, as they exploit both the temporal and static information of face videos. However, this approach is not successful in scenarios where only a single face image is available [92]. Dynamic approaches are further classified into two classes: texture-based and motion-based.

4.2.2.1 Texture features-based approaches

These approaches explore the dynamic texture information across multiple captured video frames. In the following paragraphs, we study the existing dynamic software-based techniques that utilize textural information.

The dynamic texture study began with the contributions of Pereira et al. [93], Pereira et al. [94] and Komulainen et al. [95], where an LBP descriptor computed from three orthogonal planes (LBP-TOP) is used. LBP has also been extended to the Volume Local Binary Pattern (VLBP), a spatiotemporal extension of the original LBP descriptor [60]. This helps combine time- and space-related information into a single operator with a multi-resolution strategy; the VLBP operator merges appearance and motion into a dynamic description of texture. A more efficient, higher-order local binary feature descriptor for face PAD is given by Phan et al. [96], known as the Local Derivative Pattern (LDP). As opposed to LBP, which encodes the relation between central and neighboring pixels, the LDP descriptor extracts higher-order local information by encoding different spatial relationships within a given region. Hence, they extended LDP to a dynamic feature descriptor which exploits higher-order LDP from three orthogonal planes to discriminate live and fake facial traits. Bharadwaj et al. [97] introduced a technique with two feature extraction algorithms: LBP, to provide better performance than other texture-based techniques, and the HOOF descriptor for motion estimation. Arashloo et al. [98] put forward a fused approach for face liveness detection, where the outputs of two feature descriptors, a multi-scale dynamic texture descriptor based on binarized statistical image features (MBSIF) and multi-scale local phase quantization (MLPQ), are fused for better face PAD results. MBSIF computed on three orthogonal planes (MBSIF-TOP) offers state-of-the-art performance for face spoof detection; this descriptor is then combined with MLPQ-TOP. The fusion of the two descriptors is realized by kernel discriminant analysis (KDA), and the results outperform state-of-the-art methods.
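A simplified sketch of the three-orthogonal-planes idea is shown below; for brevity it histograms only the three central planes, whereas the published LBP-TOP descriptor aggregates histograms over all planes of the volume:

```python
import numpy as np

def lbp_hist(img):
    """Vectorized 8-neighbour LBP histogram of a 2-D array."""
    c = img[1:-1, 1:-1]
    shifts = [img[:-2, :-2], img[:-2, 1:-1], img[:-2, 2:], img[1:-1, 2:],
              img[2:, 2:], img[2:, 1:-1], img[2:, :-2], img[1:-1, :-2]]
    codes = sum((s >= c).astype(int) << j for j, s in enumerate(shifts))
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)

def lbp_top(volume):
    """Concatenated LBP histograms from three orthogonal planes of a
    T x H x W video volume: XY (appearance), XT and YT (motion)."""
    t, h, w = volume.shape
    return np.concatenate([lbp_hist(volume[t // 2]),         # XY plane
                           lbp_hist(volume[:, h // 2, :]),   # XT plane
                           lbp_hist(volume[:, :, w // 2])])  # YT plane
```

The XY plane captures spatial appearance while the XT and YT planes encode how texture evolves over time, which is what lets one descriptor reflect both appearance and motion.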
Afterwards, Tirunagari et al. [99] applied a dynamic mode decomposition (DMD) algorithm that captures vitality cues such as eye-blinking, face dynamics and lip movement. A classification pipeline consisting of DMD, LBP and an SVM with an intersection kernel is proposed; the pipeline is efficient and convenient to use, as no tuning is needed, and DMD proved to be the best-performing technique compared to the state of the art. In another approach, Pinto et al. [100] proposed a two-tier characterization technique known as “time spectral visual words”, consisting of low- and mid-level features for face anti-spoofing. The concept of a visual codebook is used for deriving mid-level features from low-level ones, as the former are more robust and efficient for detecting several spoof attacks; the proposed technique captures the patterns present in noise signatures for liveness detection. Zhao et al. [101] proposed a novel spatio-temporal descriptor named Volume Local Binary Count (VLBC) for dynamic texture representation. VLBC extracts local spatio-temporal volumes by exploiting motion and appearance textural features. Contrary to VLBP, the proposed descriptor does not use local structural information, and only the ones (1s) in the thresholded codes are counted; thus, VLBC includes more neighboring pixels without increasing the feature dimension. Additionally, a completed version of VLBC (CVLBC) is investigated, which enhances texture recognition performance with additional information about central pixel intensities and local contrast. Chan et al. [102] used flash to combat face presentation attacks, as it reduces the influence of environmental factors: two images are captured (one with flash and one without), and texture and structural features are extracted to distinguish bonafide and fake samples.
Pan and Deravi [103] introduced a new descriptor named Temporal Co-occurrence Adjacent LBP, which is based on temporal changes in textural information to discriminate images into two classes. The comparative analysis of dynamic texture-based approaches is listed in Table 6 with the main concept, type of attack, classifier used, and evaluation dataset with performance.

Table 6 A comparative description of various dynamic texture-based face PAD techniques

It may be interpreted from the comparative analysis that the LBP descriptor is deployed most widely in dynamic texture exploration. Besides, BSIF features, whose filters are learned from natural images, are chosen by various researchers; in future studies, it would be interesting to assess the benefit of training BSIF filters on presentation attack data. Moreover, these techniques report overall error rates, in terms of HTER, ranging between 0 and 15%.

4.2.2.2 Motion-based approaches

In these methods, the motion characteristics exhibited by the muscles of a face due to head movement are captured and analyzed. The acquired motion characteristics arise from movements of the mouth, head, eyes, etc. [105]. We present a brief review of motion-based face PAD methods in the following paragraphs.

Kim et al. [106] presented a similarity and motion-based algorithm for fake face detection. An input video is first divided into foreground and background regions, and a similarity measure is computed to differentiate the background from the face region of each frame. A motion index is then computed to measure the amount of motion in the foreground relative to the background. The combination of similarity and motion is used to decide whether the face is fake or real. In a similar approach, Junjie et al. [107] used three scenic clues, namely non-rigid motion, the imaging banding effect, and face-background consistency, for liveness detection. The non-rigid motion clue captures the motion exhibited by a real face, such as eye blinks, while face-background consistency reflects that the motion of face and background has low consistency for a real face and high consistency for a fake one. The image quality defect induced in the reproduction of a fake face is exhibited by the image banding effect. The fusion of these three scenic clues yielded around 100% face liveness detection accuracy. Complementary countermeasures based on motion and textural features to detect liveness are explored by Komulainen et al. [108]. The motion-based countermeasure uses the correlation between the background scene and the client's head movements, and a score-level fusion method provides the final spoof detection result. The total error rate of the individual methods is approximately 12%, while the percentage of mutual errors is under 2%, which reveals that the countermeasures are truly complementary. Anjos et al. [37] deployed optical flow, where the foreground-background motion correlation is used for differentiating fake and real faces. The approach is evaluated on the publicly available Photo-Attack dataset, where it outperformed other similar methods with an EER of 1.52%. Cai et al.
[109] used a gaze estimation approach based on the assumption that the gaze trajectory of a real face has a higher uncertainty level than that of a fake face. The idea is to extract gaze features and motion from a video to obtain a gaze histogram; the information entropy of the gaze histogram determines the uncertainty level of the user's gaze movement, which helps predict liveness. Kiilioglu et al. [110] applied pupil tracking for face PAD, in which a Haar-Cascade classifier detects the eye area of the face. The feature points of the stable eyes are computed under minimal head movement using the Karhunen-Loève transform (KLT). The pupil-area signals are sent to eight different LEDs through an Arduino device to detect the motion of the pupil in a real face. The proposed method is tested only on volunteer faces and is not evaluated on standard datasets. Singh and Arora [111] used eye-blink and mouth movements for face liveness detection. Morphological opening operations extract the abstract false regions in consecutively captured frames of a face video, and liveness is detected by checking the motion of the eyes and/or mouth; the approach is evaluated on both in-house and publicly available face datasets, and the results reveal its effectiveness compared to other similar techniques. Similarly, Edmunds and Caplier [112] proposed a technique based on the Conditional Local Neural Fields (CLNF) face-tracking algorithm, where the rigid and non-rigid motions exhibited by a human face are extracted. A motion sequence vocabulary is then constructed to derive discriminant mid-level motion features using the Fisher vector framework. Li et al. [113] combated replay video attacks by investigating motion blur, which is reflected mainly in width and intensity; features extracted through a 1D CNN model (explored in the DL-based section) are used to distinguish bonafide and fake samples.
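The foreground-background motion cue shared by several of these works can be sketched with plain frame differencing; the face bounding box is assumed to be given (e.g., by a detector), and this is an illustrative measure rather than any single paper's exact formulation:

```python
import numpy as np

def motion_correlation(frames, face_box):
    """Correlation between per-frame motion energy inside the face region
    and in the background. A planar replay medium tends to move as a
    whole, so the two series correlate strongly; a live face does not."""
    x0, y0, x1, y1 = face_box
    fg, bg = [], []
    for prev, cur in zip(frames[:-1], frames[1:]):
        diff = np.abs(cur.astype(float) - prev.astype(float))
        mask = np.zeros(diff.shape, dtype=bool)
        mask[y0:y1, x0:x1] = True
        fg.append(diff[mask].mean())   # motion energy in the face region
        bg.append(diff[~mask].mean())  # motion energy in the background
    return float(np.corrcoef(fg, bg)[0, 1])
```

A correlation near 1 suggests the face and background move together (a handheld photo or screen), while a live subject in front of a static background yields a much weaker correlation.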
Zhang and Xiang [114] employed DWT-LBP-DCT for the face PAD task. DWT decomposes face images into various frequency components, LBP extracts spatial information, and DCT applies its energy concentration characteristic to obtain temporal information by efficiently merging multiple video frames. Finally, DWT-LBP-DCT features are generated to describe the frequency-spatial-temporal content of the video. Yukun et al. [115] designed a scheme that extracts two complementary dynamic features: the former involves the temporal motion properties of the face video and the latter uses the visual beats of the noise pattern. These features are fused at the decision level, and an SVM classifier is fitted to assign face images one of two labels. Table 7 presents an overview of various motion-based approaches in the literature for detecting the liveness of a face.

Table 7 An analysis of existing dynamic motion-based face PAD techniques

The overall analysis of dynamic software-based PAD techniques reveals that texture-based dynamic methods are dominant over motion-based face anti-spoofing mechanisms. LBP has been the most frequently used image descriptor in the majority of these detection mechanisms for countering photo, mask and video attacks. The prevailing methods are evaluated on publicly available benchmark face anti-spoofing databases such as CASIA, REPLAY-Attack and MSU, using standard performance metrics such as EER and HTER. A few dynamic face PAD mechanisms achieve an accuracy of 100%, as reported by Yan et al., while the HTERs reported in this survey lie between 0 and 18.20%. The challenges for existing methods are still manifold, including the appropriateness of single descriptors, cross-dataset validation, variation in the images, selection of appropriate classifiers, and training hyperparameters. One limitation of motion-based face PAD methods is that the whole video sequence is used for liveness detection, which increases training time, whereas texture-based dynamic techniques suffer from the problem of selecting discriminative features using either a single or multiple descriptors.

The presented assessment of static as well as dynamic approaches has exploited several handcrafted features along with popular classifiers like SVM, MLP and LDA. Table 8 summarizes the feature descriptors widely employed by pioneering face anti-spoofing mechanisms. It can be observed that the majority of the existing state-of-the-art handcrafted texture-feature-based methods utilize LBP or its variants. The popularity of LBP over other descriptors may be due to its ability to compute a powerful feature set with comparatively low complexity. Other descriptors such as LPQ, HoG and BSIF are also significantly utilized for the liveness detection task.

Table 8 A comparative analysis of handcrafted texture feature-based face PAD methods

Moreover, among all the classifiers used for the PAD task in the literature, SVM surpasses other classifiers like LDA, DT, MLP, Random Forest, etc., and most of the handcrafted feature-based face PAD techniques use the HTER and EER protocols for evaluation. Similarly, very few PAD models have performed cross-database testing to assess the generalization capability against unknown attacks.
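Given how central LBP is to the descriptors in Table 8, the basic operator can be sketched in a few lines of NumPy. This is an illustrative simplification only: published PAD pipelines typically use circular sampling, uniform patterns and multi-scale variants, and feed the resulting histogram to a classifier such as an SVM.

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbour LBP: compare each pixel's 3x3 neighbourhood with its
    centre and pack the eight comparison bits into one 8-bit code."""
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # neighbour offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neigh >= centre).astype(np.uint8) << bit
    return codes

def lbp_histogram(img):
    """256-bin normalised code histogram -- the kind of feature vector that
    the surveyed handcrafted methods pass to a classifier such as an SVM."""
    hist = np.bincount(lbp_codes(img).ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```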

4.2.3 Deep learning-based approaches

With the emergence of DCNN models and their widespread application in pattern recognition, face spoof attack detection methods based on deep learning have achieved a significant breakthrough in recent years and have surpassed traditional handcrafted feature-based detectors. Compared to manually crafted feature-based PAD approaches, DL-based techniques generate hierarchical feature representations that are learned automatically and demonstrate superior discriminative power. Prior to discussing DL-based face PAD mechanisms, we provide an outline of the emergence of DL-based models. The history dates back to the 1940s; since the aim is to prepare machines to perform tasks that humans carry out with utmost ease, it seems fitting to imitate the working of the human brain. This led to the advent of deep learning, where early research began with the MLP [116]. Deep learning became prominent in the 1980s and 90s with the introduction of the Back-Propagation (BP) learning algorithm. In 1998, LeCun and Bengio introduced the CNN known as LeNet [117], which was originally used for recognizing hand-written digits.

Another DCNN model known as AlexNet [118] was proposed for image classification, followed by numerous others like VGGNet [119], GoogLeNet [120], and ResNet [121]. A timeline of the most influential DCNN models designed for the classification task, with their evolution over the past few years, is shown in Fig. 15.

Fig. 15
figure 15

A timeline showing development of CNN models

The recent advancements of DL-based approaches towards face PAD demonstrate higher classification accuracy in complex scenarios, leading to robustness and effectiveness on this problem. Several DL-based face PAD techniques have been reported so far to counter extremely diverse presentation or spoof attacks. Figure 16 illustrates the topology of recent trends in automatic feature-extraction-based face PAD approaches.

Fig. 16
figure 16

A depiction of progression in deep learning-based face PAD techniques

4.2.3.1 Pure CNN-based approaches

CNN models are widely deployed in many computer vision tasks, particularly pattern classification. Employing a CNN for the face PAD task helps to classify face images into spoof or live classes effectively. In 2014, Yang et al. [122] first introduced CNN models for extracting deep-level features, which marked a new direction for DL in the facial PAD field. Xu et al. [123] proposed a model that combines an LSTM unit with a CNN; the model learns temporal structures from videos that are helpful for the face anti-spoofing task. Tu and Fang [124] also applied an LSTM unit, where transfer learning (TL) is used and a pre-trained ResNet-50 DCNN model extracts spatial features from image frames. These features are fed to the LSTM to obtain temporal features, which are then used for the classification task. The basic CNN structure was then modified by Souza et al. [125], who proposed a new CNN model known as LBPnet. The first layer of LBPnet incorporates LBP feature information: instead of convolving the kernel values with the gray-scale image values directly, the layer first computes the LBP codes of the pixels and then applies the convolution function. Consequently, the convolutions are performed on the newly obtained LBP values, which improves the results of the proposed LBPnet framework. A variant of LBPnet, known as n-LBPnet, was also proposed by Souza et al. Its architecture is similar to the original; only a normalization step, called Local Response Normalization (LRN), is included between the convolution and pooling layers. Wang et al. [126] developed a robust method using a joint representation of 2D depth and textural information. The texture features are learned from a CNN, whereas the depth representation is acquired using a Kinect sensor.
The method consists of three prime components: first, generalized texture features are explored using 2D images; second, a feature representation is obtained from the depth image; and lastly, a fusion technique is applied. To discriminate live and spoof face images, the popular LBP features are chosen, and for video clips a voting strategy is used. A face clip is categorized as genuine if both methods (i.e., texture-based and depth-based) classify it as live. Subsequently, an LBP-based end-to-end learnable network was designed by Li et al. [127], which can substantially reduce the number of network parameters by combining learnable convolutional layers with fixed-parameter LBP layers. The network comprises sparse binary filters and differentiable simulated gate functions. Compared with existing DL detection approaches, the proposed CNN structure can reduce the number of parameters by up to 64 times. The parameters in convolutional layers are saved by adding complete LBP layers with fixed sparse binary filters, while in the fully connected layer a novel statistical histogram function is applied to save parameters. The network has four layers: the first two are convolutional layers, followed by an LBP layer for extracting the virtual LBP features, and finally the classification layer for class prediction.

The method is unable to achieve good detection results for some specific attacks. Furthermore, the impact of the size of the fixed convolution kernels is not shown, and the effect of extracting the simulated LBP features from different convolutional layers is also missing. In addition, multi-level LBP features are combined with a CNN by Nguyen et al. [128] to extract the details of face skin. Similarly, Peng et al. [70] introduced a scheme where the color space of normalized faces is converted to enhance the chromatic dispersion between real and fake face images. After obtaining the chromatic color space, LBP features are extracted to train an ensemble classifier. Rehman et al. [129] designed the LiveNet architecture to address the cross-database testing scenario. The approach is based on continuous data randomization (like bootstrapping) with small mini-batches while training CNN classifiers on small-scale face anti-spoofing databases. A limitation of the approach is that its training time rises when the training dataset is large and the batches are small. To explore textural as well as deep-level features, Chen et al. [130] fused texture features extracted using a rotation-invariant LBP (RI-LBP) with deep-level features mined through a CNN. The dimensionality of the obtained deep features is reduced by employing the PCA algorithm, and the reduced feature sets are used to train an SVM with an RBF kernel. Similarly, Grover and Mehra [131] use LBP features and a CNN model for discriminating live and fake face images. Rehman et al. [132] proposed a stereo camera-based technique, as demonstrated in Fig. 17, that uses dynamic disparity maps computed from RGB input. The disparity maps are learned from the convolutional layers of a CNN model, and a custom disparity layer is designed which supervises the rest of the network.
The learned maps are then evaluated using various operations, namely Absolute Disparity (AD), Approximate Disparity (APD), Square Disparity (SD) and Feature Multiplication (FM), which may be computed using Eqs. (8)–(11).

Fig. 17
figure 17

A stereo camera-based face liveness detection framework ([132])

$${AD}_{K}\left(x,y,k\right)=\sigma \left(\left|{F}_{r}\left(x,y,k\right)-{F}_{l}\left(x,y,k\right)\right|\right),$$
(8)
$${SD}_{K}\left(x,y,k\right)=\sigma \left({\left({F}_{r}\left(x,y,k\right)-{F}_{l}\left(x,y,k\right)\right)}^{2}\right),$$
(9)
$${FM}_{K}\left(x,y,k\right)=\sigma \left({F}_{r}\left(x,y,k\right)\cdot {F}_{l}\left(x,y,k\right)\right),$$
(10)
$${APD}_{K}\left(x,y,k\right)=\sigma \left(\frac{{F}_{r}\left(x,y,k\right)-{F}_{l}\left(x,y,k\right)}{\partial {F}_{l}\left(x,y,k\right)/\partial x}\right).$$
(11)
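These four maps reduce to elementwise operations on a pair of feature maps and can be sketched directly in NumPy (here σ denotes the sigmoid, and Fr, Fl are the right/left feature maps). The `eps` guard on the APD denominator and the use of a finite-difference gradient are our own implementation assumptions, not part of [132].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def disparity_maps(Fr, Fl, eps=1e-6):
    """Disparity maps of Eqs. (8)-(11) for the k-th pair of convolutional
    feature maps from the right (Fr) and left (Fl) camera images."""
    AD = sigmoid(np.abs(Fr - Fl))              # Eq. (8): absolute disparity
    SD = sigmoid((Fr - Fl) ** 2)               # Eq. (9): square disparity
    FM = sigmoid(Fr * Fl)                      # Eq. (10): feature multiplication
    dFl_dx = np.gradient(Fl, axis=1)           # horizontal derivative of Fl
    APD = sigmoid((Fr - Fl) / (dFl_dx + eps))  # Eq. (11): approximate disparity
    return AD, SD, FM, APD
```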

In Eqs. (8)–(11), the symbol ‘σ’ represents the sigmoid function defined as σ(x′) = 1/(1 + e^(−x′)); Fr(x, y, k) and Fl(x, y, k) denote the kth convolutional feature maps learned by the convolutional layer from the right Ir(x, y) and left Il(x, y) camera face images, respectively; ADk, APDk, SDk and FMk are the disparity maps trained for the kth convolutional feature map; and (x, y) denotes the spatial location within the kth feature map. The authors created a novel stereo camera-based face PAD database and achieved an average APCER of 0.47%. Rehman et al. [133] extended their earlier work and deployed a DCNN model with an adaptive feature-fusion layer. The fusion layer performs weighted fusion of convolutional features learned from DCNN-based "auto-encoder generated" (DNG) face images and real-world face images. The adaptive blending of, and disparity between, these two types of fused images is applicable to face liveness detection. Although this work utilizes real-world face images and their DNG counterparts, other robust combinations could also be explored in other feature spaces such as HoG, LBP and BSIF. Rehman et al. [134] continued this line of work and augmented the network with an additional perturbation layer, a learnable pre-processing layer for low-level deep features, to improve the overall performance of DCNNs. The perturbation layer injects LBP features into a candidate layer of the base CNN model; the handcrafted LBP features are extracted from both colored and gray-scale face images. Adaptive weights are obtained from the joint information of the deep-level and handcrafted features of the candidate layer. These convolutional weights are multiplied with the low-level features to amplify the corresponding pixel intensities in the convolutional layer, and the updated features are then fed to the rest of the network.
A limitation of this method is the inherent uncertainty in choosing an appropriate feature set to feed to the perturbation layer. Accordingly, future work may explore other handcrafted features for the perturbation layer of the CNN model.

Li et al. [135] proposed the CompactNet structure to overcome the issue of overlapping samples in prevailing color spaces. The model consists of three phases, namely a space generator, a feature extractor and a triplet loss function. First, the RGB image is given to the compact space generator, which has fewer parameters and maps the existing color spaces into a new space. Then, the generated face image is given as input to the feature extractor module for computing deep-level features. Lastly, a points-to-center integration mechanism is applied to choose training samples, with the triplet loss function employed to maximize the inter-class and minimize the intra-class distance. Yukun et al. [136] used a multi-region CNN-based approach and introduced the new concept of a local classification loss for scattering the gradient distribution. The scattering helps to boost face PAD results against both adversarial and traditional attacks. Pinto et al. [137] presented a novel work where three intrinsic characteristics of the scene, i.e., the depth, albedo and reflectance of facial images, are recovered through a Shape-from-Shading (SFS) algorithm. A shallow CNN architecture is designed for extracting meaningful patterns from the maps acquired through the SFS algorithm. The technique is experimentally evaluated under different cross-sensor and cross-database scenarios and achieves state-of-the-art performance. From the aforementioned pure CNN-based techniques, it can be inferred that feeding an entire face image as input to a CNN model adds overhead due to the processing of non-ROI image segments. Since the pixels lying outside the face region are not desirable, providing small patches of the face images to the classifiers improves performance. Hence, in the subsequent paragraphs, we review the locally supervised Fully Convolutional Network (FCN)-based face spoof detection approaches. Recently, Pei et al. [138] presented a person-specific approach where a spoof is detected after the face recognition process. In this approach, to detect PAs, a deep Siamese network is trained on pair-wise photos of the clients. Table 9 summarizes the CNN-based techniques along with information related to methodology, attacks, databases and performance evaluation in intra- and inter-dataset scenarios.
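The triplet objective mentioned above for CompactNet [135] can be illustrated with a minimal NumPy sketch; the squared-Euclidean distance and the margin value are our assumptions, not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors: the anchor must be
    closer (by `margin`) to a same-class positive than to a different-class
    negative, which shrinks intra-class and enlarges inter-class distances."""
    d_pos = np.sum((anchor - positive) ** 2)   # intra-class distance
    d_neg = np.sum((anchor - negative) ** 2)   # inter-class distance
    return float(max(0.0, d_pos - d_neg + margin))
```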

Table 9 A comparison of pure CNN-based face PAD techniques
4.2.3.2 Fully convolution neural network-based approaches

According to Jourabloo et al. [139], spoof distortion (a weak high-frequency signal added to a clean face image) has two key properties, namely it is ubiquitous and repetitive. The former means the distortion is present throughout the spatial domain, whereas the latter means the distortion appears as a repetition of a few legitimate patterns. Accordingly, it is rational to employ FCN models in face PAD, as they fully exploit the aforementioned distortion properties. An FCN maps local patches to local labels, which are arranged according to their positions in the face image to generate a map consisting of zeros and ones. For example, Atoum et al. [140] and Liu et al. [141] enforced a depth map, analogous to the local label map, as an auxiliary signal for training their FCN models. In [140], two CNN streams are combined for attaining promising results: the first model relies on patch-based face images, while the other adopts full face images for extracting depth information. Likewise, in [141] CNN and RNN architectures are integrated; the CNN uses depth supervision to discover texture-based features leading to distinct depths for spoof and live face images. The feature vector and the estimated depth information are provided as input to a registration layer for generating aligned feature maps. The acquired maps and rPPG (Remote Photoplethysmography) supervision are applied for training the RNN model, which examines the temporal variability across the given video frames. The rPPG signal and the depth map are then used for spoof detection. Li et al. [142] developed a 3D CNN structure where both the temporal and spatial nature of videos is processed. The model is trained, after data augmentation, with a cross-entropy loss function, after which a specifically developed generalization loss is applied as a regularization factor. The proposed technique enhances the generalization ability by minimizing the Maximum Mean Discrepancy (MMD) among various domains. George et al. [143] and Sun et al. [144] revealed that global labels are less effective than local labels for face liveness detection. In [143], a DenseFCN model is trained with binary supervision. The supervision on the output maps forces the network to learn shared representations by using data from different patches. DenseFCN handles frame-level information, which makes it appropriate for quick decisions as there is no need to process multiple frames. In [144], the "Spatial Aggregation of Pixel-level Local Classifiers" (SAPLC), consisting of an FCN and an aggregation part, is designed. The FCN predicts the local labels for all patches; the predicted labels are then combined to reach an image-level decision. Recently, Deb and Jain [145] proposed an approach named SSR-FCN, which is trained on discriminative cues in face images. Initially, the FCN model is trained on global images to learn the discriminative cues present globally for identifying attack-prone regions. Then, the model is trained to learn locally present cues by presenting specific regions of the face image to locate presentation attack areas only. Finally, testing is carried out and the final classification score is computed for discriminating live and fake face images. Arora et al. [146] used convolutional auto-encoders to reduce the dimensionality of images. The encoder weights are loaded into another network consisting of a Flatten layer and an FC layer, and the dimensionality-reduced images are given to this model for classification into live and fake class labels. In another recent work, Muhammad et al. [147] proposed a video pre-processing technique termed Temporal Sequence Sampling (TSS) for 2D face PAs.
Additionally, they exploit the characteristics of a CNN model by introducing a self-supervised representation learning scheme, where the stabilized frames accumulated over video clips of various temporal lengths serve as the supervision and the labels are automatically generated by the TSS method. Using labeled face PAD data, the learned feature representations are then fine-tuned for the downstream task. Table 10 presents an overview of FCNN-based techniques along with information linked to the key concept, attacks, databases and performance evaluation in intra- and cross-dataset scenarios.
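The local-to-global decision step shared by [143, 144] can be caricatured in a few lines: an FCN emits a pixel-level spoof-probability map, and the image-level decision is obtained by spatial aggregation. Mean pooling with a fixed threshold is a simplifying assumption of ours; SAPLC, for instance, uses a learned aggregation part.

```python
import numpy as np

def aggregate_decision(score_map, threshold=0.5):
    """Collapse a pixel-level spoof-probability map (FCN output under binary
    supervision) into one image-level label by spatial mean pooling."""
    image_score = float(score_map.mean())
    return image_score, ("spoof" if image_score > threshold else "live")
```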

Table 10 An illustration of FCNN-based face PAD techniques
4.2.3.3 Transfer learning-based approaches

Although FCN-based face spoof detectors are superior to plain CNN approaches, certain limitations are associated with FCN frameworks. First, these networks use a fixed input size, so face images need to be cropped to an arbitrary fixed size; second, these models cannot remove domain shift, which gave researchers a new direction. This marks the deployment of domain adaptation and domain generalization in face PAD mechanisms, which use the concept of transfer learning. Domain adaptation (DA) is a type of TL that deals with multi-dimensional data. It aims to transfer knowledge from a source to a target domain and is useful when there is limited training data in a different application scenario [148, 149]. The idea of domain adaptation was originally employed for linear and kernel models, but may be used in deep learning with some modifications [150, 151]. Let us assume a source domain SD and a target domain TD. To adapt between SD and TD, the technique augments the input y \(\in\) SD \(\cup\) TD \(\in\) RC by computing F(y) using Eq. 12.

$$F\left( y \right) = \left\{ {\begin{array}{*{20}c} {\left[ {y,y,0} \right],\quad y \in S^{D} } \\ {\left[ {y,0,y} \right],\quad y \in T^{D} ,} \\ \end{array} } \right.$$
(12)

where [.,.,.] is a concatenation operator for vectors, 0 = (0,0,0,…) \(\in\) RC and F(y) \(\in\) R3C. A kernelized version of this augmentation can also be defined. Assume ‘\(\Psi\)’ is a mapping function that maps the input space to a Reproducing Kernel Hilbert Space (RKHS), and let \(\omega \left( {y,y\prime } \right) = \left\langle {\Psi \left( y \right),\Psi \left( {y\prime } \right)} \right\rangle\) be a kernel function. The augmentation in the (possibly infinite-dimensional) RKHS may be achieved by Eq. 13

$$F\left( y \right) = \left\{ {\begin{array}{*{20}c} {\left[ {\Psi \left( y \right),\Psi \left( y \right),0} \right],\quad y \in S^{D} } \\ {\left[ {\Psi \left( y \right),0,\Psi \left( y \right)} \right],\quad y \in T^{D} .} \\ \end{array} } \right.$$
(13)

Since the RKHS has been expanded, the new kernel function after augmentation may be computed by Eq. 14

$$\omega \left( {y, y\prime } \right) = \left\{ {\begin{array}{*{20}c} {2\omega \left( {y, y\prime } \right),\quad {\text{like domain}}} \\ {\omega \left( {y,y\prime } \right),\quad {\text{unlike domain}}{.}} \\ \end{array} } \right.$$
(14)
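A minimal NumPy rendering of the linear augmentation of Eq. (12) makes the kernel identity of Eq. (14) easy to verify: same-domain inner products double, while cross-domain ones are unchanged.

```python
import numpy as np

def augment(y, domain):
    """Eq. (12): lift y in R^C to R^{3C} -- a shared copy of y plus a copy in
    the block reserved for y's own domain ('source' or 'target')."""
    zero = np.zeros_like(y)
    if domain == "source":
        return np.concatenate([y, y, zero])
    return np.concatenate([y, zero, y])
```

For y = (1, 2) and y′ = (3, 4), the original inner product is 11; after augmentation it becomes 22 for a same-domain pair and stays 11 across domains, exactly as Eq. (14) states for the linear kernel.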

Face PAD mechanisms suffer from the inter-domain inconsistency problem: the training and testing domains have different data distributions w.r.t. face pose, capturing device, facial appearance, illumination, etc. Consequently, recent face PAD studies utilize the notions of domain adaptation and domain generalization to address the poor generalization capability to unseen data. The generalized features learned by a domain generalization technique must be discriminative and shared among multiple source domains. Thus, the extracted features can capture the common discriminative cues for the PAD task across multi-source samples, which are less likely to be biased and are more generalized. As shown in Fig. 18, models that learn a generalized feature space during the training and testing phases can extract more global cues shared among all source domains and hence generalize better to unknown face attacks.

Fig. 18
figure 18

A depiction of domain transfer learning to learn discriminative and shared feature space for more generalized face presentation attack detection

Yang et al. [152] first employed the notion of subject domain adaptation to produce virtual features that make it possible to train anti-spoofing classifiers with improved performance. Later, an unsupervised domain adaptation method was successfully employed by Li et al. [153]. The framework is based on transforming the facial feature space from a labeled source to an unlabeled target domain. The experiments are performed on CASIA, Replay-Attack, MSU and the newly created SiW database. The obtained results revealed a 20% improvement using domain adaptation compared to learning without it. Li et al. [154] utilized another technique to tackle the issue of model training with a limited dataset in a specific application domain: neural network distillation is employed to leverage data from a related domain for learning significant features. Wang et al. [155] introduced the concept of adversarial domain adaptation for enhancing the generalization ability of face PAD mechanisms. First, the source model, optimized using a triplet loss function, is trained on the source domain. After source training, adversarial adaptation is utilized for training a target model, so that a shared embedding space is learned by the source and target domain models. Lastly, target images are mapped into the embedding space and classified using the KNN algorithm. Shao et al. [156] then proposed a framework where multi-adversarial domain generalization is carried out under a dual-force triplet-mining constraint. It ensures that the feature space is discriminative and can be shared by several source domains, and hence is more generalized to new presentation attacks. Sun et al. [157] designed a "Fully Convolutional Network with Domain Adaptation and Lossless Size Adaptation (FCN-DA-LSA)". The model comprises a lossless size adaptation pre-processor followed by an FCN for pixel-level classification, which is embedded within a domain adaptation layer.
The lossless size adaptation layer helps to preserve the high-frequency cues introduced during the face recapture process. The DA layer improves the generalization capability across different face domains arising from variations in lighting conditions, face attacks, face datasets and cameras. FCN-DA-LSA results in HTERs of 11.22% and 21.92% under two hybrid protocols using DA and LSA. The approach is suitable when the target domain is already known, and the requirement of external data is a major limitation; to overcome this problem, future work may explore few-shot unsupervised domain adaptation methods. Mohammadi et al. [158] proposed a new one-class DA approach where domain-guided pruning is utilized to adapt a pre-trained PAD model to the target database. Pruning is applied because, in the initial layers of a CNN model, certain learned filters are robust and generalize well to the target dataset, whereas others are specific to the source data and need to be pruned to improve network performance on the target domain. In another work, Wang et al. [159] proposed an approach consisting of Disentangled Representation learning (DR-Net) and Multi-Domain feature learning (MD-Net). The disentangled features from diverse domains, extracted through generative models, are given as input to MD-Net, which learns domain-independent features for the final cross-domain face PAD mechanism. Wang et al. [160] extended their work [159] and proposed an unsupervised adversarial domain adaptation (DR-UDA) technique that addresses the cross-domain issue by leveraging unlabeled target and labeled source data for building a robust PAD model. The DR-UDA model consists of three subnets, namely UDA-Net, ML-Net and DR-Net. The ML-Net integrates center and triplet losses for feature representation to classify images in the source domain via metric learning.
The UDA-Net is used for optimizing the target and source domain representations, resulting in an effective transfer from the source domain to the unlabeled target PAD domain. The DR-Net then separates the irrelevant features by reconstructing the source and target domain images, which aids live/fake discrimination. When evaluated on public databases, the approach offers improved generalization capability. Safaa El-Din [161] proposed an approach where source domain data are used to train the classifier with a cross-entropy loss, whereas unsupervised target domain samples are used in an adversarial DA approach. The authors performed deep clustering of the target data to preserve the intrinsic target domain properties and to enhance the generalization capability. Recently, Kotwal et al. [162] proposed a PAD technique for passenger vehicles by designing a 9-layered CNN framework. The main aim is to alleviate the issue of dataset scarcity by adapting domain-specific layers and task-specific tuning of the base networks. Peng et al. [163] designed a two-stream vision transformer-based framework (TSViT) relying on transfer learning in two complementary spaces to address the performance loss of current PAD methods caused by illumination change. To train TSViT, face images in the RGB color space and the multi-scale retinex with color restoration (MSRCR) space are given as input. An effective feature fusion method based on self-attention is developed, which successfully captures the complementarity of the two feature sources (RGB color space images and MSRCR images). Experiments on the Oulu-NPU, CASIA-MFSD and Replay-Attack databases reveal that it performs better in intra-database testing than the majority of current approaches and generalizes well in cross-database testing. The majority of the existing methods use domain adaptation to minimize domain variation. Recently, Kim et al. [164] proposed a new face PAD method that uses meta style-selective normalization with domain adaptation, which detects domain-centric styles of specific domains. The normalization parameters are optimally selected by reducing the discrepancies between source and target domains.
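The final classification step of [155], KNN over the shared embedding space, can be sketched as follows; the Euclidean distance, k = 3 and simple majority voting are our assumptions rather than details from the paper.

```python
import numpy as np

def knn_predict(query, embeddings, labels, k=3):
    """Majority vote among the k labelled embeddings nearest to `query`."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```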

Table 11 summarizes the transfer learning-based techniques along with information related to methodology, attacks, databases and performance evaluation in intra and inter-dataset scenario.

Table 11 A comparative summary of transfer learning-based face PAD techniques
4.2.3.4 Multi-channel-based approaches

An additional issue with the visible-spectrum-based SOTA PAD techniques presented in the previous paragraphs is that the improved quality of image-capturing devices (such as cameras, printers, mobile phones, etc.) minimizes the subtle differences between low-quality fake and high-quality real face images. Moreover, advanced technologies for generating 3D masks are now easily available, making it difficult to categorize real and fake face artifacts. Hence, to address these problems, several researchers have proposed multi-channel or extended-range imaging systems. Bhattacharjee and Marcel [166] considered visible-spectrum, thermal, infrared and depth channels, and demonstrated that detecting 2D and 3D mask attacks is simple in the depth and thermal channels, respectively. The majority of attacks are detected using a similar technique with an integration of different channels, where the combinations of channels and features are found using a learning-based methodology. Similarly, in another work, Bhattacharjee et al. [167] explored the temperature of the facial region to detect PAs and also illustrated the possibility of spoofing commercial facial recognition systems with customized silicone masks. Later, George et al. [168] designed a Multi-Channel Convolutional Neural Network (MC-CNN) for learning a joint representation from multiple channels, using TL from a pre-trained facial recognition network. The major advantage of the MC-CNN architecture is that only the domain-specific units (lower-layer features) and the high-level FC layers are adjusted in the training phase. Hence, adapting only the last FC layers and the DSUs reduces the possibility of over-fitting. A LightCNN sub-network is used, which makes the proposed model reusable for both face recognition and PAD. The framework is segmented into two stages, namely pre-processing and the network architecture.
During pre-processing, face detection is performed using the MTCNN algorithm on the color channel, while for non-RGB channels the face images need to be aligned both temporally and spatially. The authors created their own database, named WMCA, which consists of subjects captured using multiple channels/devices: depth, color, thermal and infrared. The results show that the performance of the proposed algorithm is not satisfactory when a single color channel is used; by adding multiple channels, the performance improves significantly. Recently, George et al. [169] used their existing MC-CNN [168] as the base model. In the proposed technique, the output of the penultimate FC layer is used as an embedding, and a novel loss function is formulated that forces the CNN model to learn a compact and distinctive representation of facial images, resulting in compact clusters in the feature space. Finally, a one-class classifier is used to distinguish both known and unknown attack samples from the bonafide clusters. Li et al. [170] investigated facial movement and texture cues to detect PAs. Optical flows are first extracted from a continuous video sequence to characterize the precise movement amplitude and direction. The extracted optical flows are then concatenated with the video frames as the network's input. Next, combined region and channel attention mechanisms are introduced for adaptively allocating the classification weights. Finally, the fused motion and texture cues are fed into a convolutional network to extract features and determine whether the input video sequence contains a live face. In a recent face PAD mechanism, Silva et al. [171] present a hybrid model where a residual spatial–temporal CNN is combined with a channel-separated CNN to yield better performance in both known and unknown attack scenarios.
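At its core, the joint multi-channel input these methods rely on amounts to stacking spatially aligned captures into one tensor. A hedged sketch follows, with per-channel min–max normalisation as our own simplification (real pipelines such as MC-CNN [168] perform channel-specific alignment and network-specific pre-processing):

```python
import numpy as np

def stack_channels(channels):
    """Stack aligned single-channel images (e.g. gray, depth, infrared,
    thermal) into an (H, W, N) tensor with per-channel min-max scaling."""
    normed = []
    for ch in channels:
        ch = np.asarray(ch, dtype=np.float64)
        span = ch.max() - ch.min()
        normed.append((ch - ch.min()) / span if span > 0 else np.zeros_like(ch))
    return np.stack(normed, axis=-1)
```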
To conclude the review of multi-channel-based face PAD techniques, we provide an overview in Table 12, including article details, key concept, type of attack countered, and training and testing databases with performance accuracy.

Table 12 A comparison of multi-channel-based face PAD techniques for known and unknown attack scenario

From Tables 9, 10, 11 and 12, certain remarks can be drawn. The trend shows that the majority of face PAD methods use CNNs for liveness detection with variations in hyper-parameters. As face liveness detection is a binary classification problem, some authors, such as George et al. [168], Souza et al. [125] and George et al. [169], used lightweight versions of CNN models. To enhance the discriminative power of CNNs, a few researchers such as Chen et al. [130] integrated the CNN model with handcrafted features like LBP for face PAD mechanisms. Some studies (Rehman et al. [129], Rehman et al. [108], George et al. [168], Souza et al. [125], George et al. [169], etc.) addressed the problem of cross-dataset validation in face liveness detection; however, the results show limited performance, with HTERs of 8.28–41.9%. Li et al. [127] reduced the number of network parameters by up to 64 times and achieved a promisingly low EER of 1.5%. Our survey clearly reveals that DCNN-based approaches counter all three basic attacks (photo, video and mask), but generalization to unknown attacks remains a constraint. All the face PAD methods included in this study have been validated on benchmark datasets, but only a few have addressed the problem of inter-dataset validation. Moreover, the size of these standard datasets is still insufficient for training face PAD models. Table 10 shows that the majority of face PAD techniques yield high HTER values in cross-dataset testing: Li et al. [153] reported 39.2% HTER when their model was trained on CASIA-FASD and tested on Replay-Attack; Peng et al. [70] obtained HTER = 39.56% when training on the Mobile database and testing on CASIA-FASD; and, recently, Deb and Jain [145] reported HTER = 41% when their model was trained on CASIA-FASD and tested on Replay-Attack.
The overall performance of deep feature-based face PAD techniques shows that approaches with lower EER and HTER in both intra- and inter-dataset testing scenarios are required to improve generalization capabilities. Apart from this, our analysis indicates that, since 2014, the literature on DL-based face PAD techniques has marked the evolution of various CNN models specifically designed as facial PA detectors. We provide a comparative analysis of these prominent CNN frameworks in Table 13, including platform, learning/training method, input image dimensions, loss function and number of training parameters.

Table 13 A comparative analysis of CNN architectures employed for face PAD mechanisms

The comparative analysis in Table 13 supports certain conclusive observations. With the introduction of FCNs, depth maps have been imposed as auxiliary labels for model learning. Besides, FCNs also benefit from distortion properties, which makes their use in face anti-spoofing mechanisms rational. Given the massive number of training parameters (1.3–43 million), data augmentation is likely to have played a significant role in obtaining robust features for DL-based methods. A proper incorporation of the CNN network may drastically increase PA detection performance. Other factors may also remarkably influence the performance of detection models, such as multi-channel or multi-scale feature extraction (e.g., MC-CNN), enhanced classification networks and mining of negative samples.

5 Performance evaluation methodologies

The effectiveness of face PAD algorithms is evaluated on the basis of appropriate datasets and widely accepted performance indices. The selection of a suitable dataset plays a significant role in modern data-driven approaches, where the PAD models learn from samples covering both classes. The performance of the trained models is then evaluated or compared using quantitatively measured performance indicators. In this section, we present the performance evaluation methodologies employed for comparative analysis of PAD algorithms.

5.1 Benchmark datasets

The face anti-spoofing datasets comprise live and fake images of the face modality. The captured images are used for evaluating the performance of PAD algorithms. The availability of these databases plays an important role in the validation of emerging face anti-spoofing techniques [177].

Diverse datasets have been collected using various imaging devices (such as smartphones or cameras) under different operational scenarios. The most prominently used benchmark databases are illustrated in Fig. 19, which depicts the total number of genuine and fake face samples contained in each of these standard databases, acquired from subjects of different ages, with different PAIs and under different lighting scenarios.

Fig. 19
figure 19

A comparison among various face anti-spoofing datasets

Table 14 presents a brief overview of publicly available face anti-spoofing databases, including sensor details, resolution of captured images, total number of subjects, and PAI information. These benchmark datasets have been widely employed by researchers for evaluating face PAD techniques during the past decade (2010–2021).

Table 14 A comparative analysis of benchmark face anti-spoofing datasets

From this summary, it may be noticed that the existing datasets are adequate for evaluating handcrafted feature-based methods, whereas DCNN-based face PAD models may not offer the desired performance given the limited size of these datasets.

In addition, most of the anti-spoofing datasets cover photo, display and replay video attacks, whereas more datasets need to be developed to counter mask attacks. Figure 20 shows some samples of real and fake face images from the CASIA, SiW, Replay-Attack, OULU-NPU, and MSU-MFSD face anti-spoofing datasets.

Fig. 20
figure 20

Real (first three images of each row) and fake face samples (last three images of each row) a CASIA-FASD b SiW c Replay-Attack d OULU-NPU e MSU-MFSD

5.2 Evaluation metrics

To evaluate the performance of face PAD algorithms, several evaluation indicators are utilized, such as accuracy (ACC), average classification accuracy (ACA), Attack Presentation Classification Error Rate (APCER), Bonafide Presentation Classification Error Rate (BPCER), Average classification error rate (ACER), Half total error rate (HTER), Equal error rate (EER), False acceptance rate (FAR), False rejection rate (FRR), True-positive rate (TPR), True-negative rate (TNR), False-positive rate (FPR), False-negative rate (FNR), Receiver operating characteristic (ROC), and Detection error trade-off (DET). A brief summary of these metrics, with their descriptions and formulas, is listed in Table 15, together with references to the face PAD research studies that follow each performance index. The majority of these techniques are evaluated for their effectiveness with error rates such as HTER or EER.

Table 15 A comparative analysis of evaluation protocols
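To make the definitions in Table 15 concrete, the following sketch computes the core ISO/IEC 30107-3 error rates from classifier scores. The function names and the score convention (a higher score means more likely bonafide, accepted above a threshold) are our own illustrative assumptions; HTER is computed analogously as the average of FAR and FRR at a threshold fixed on a development set.

```python
import numpy as np

def pad_metrics(attack_scores, bonafide_scores, threshold=0.5):
    """ISO/IEC 30107-3 style error rates for a face PAD classifier."""
    attack_scores = np.asarray(attack_scores)
    bonafide_scores = np.asarray(bonafide_scores)
    # APCER: fraction of attack presentations wrongly accepted as bonafide.
    apcer = np.mean(attack_scores > threshold)
    # BPCER: fraction of bonafide presentations wrongly rejected as attacks.
    bpcer = np.mean(bonafide_scores <= threshold)
    # ACER: average of the two classification error rates.
    acer = (apcer + bpcer) / 2.0
    return apcer, bpcer, acer

def eer(attack_scores, bonafide_scores):
    """Equal error rate: operating point where FAR equals FRR."""
    attack_scores = np.asarray(attack_scores)
    bonafide_scores = np.asarray(bonafide_scores)
    thresholds = np.unique(np.concatenate([attack_scores, bonafide_scores]))
    # Pick the threshold where the two error rates are closest.
    best = min(thresholds,
               key=lambda t: abs(np.mean(attack_scores > t)
                                 - np.mean(bonafide_scores <= t)))
    far = np.mean(attack_scores > best)
    frr = np.mean(bonafide_scores <= best)
    return (far + frr) / 2.0
```

For example, `pad_metrics([0.1, 0.2, 0.7, 0.4], [0.8, 0.9, 0.3, 0.6])` yields APCER = BPCER = ACER = 0.25, since one of the four attacks is accepted and one of the four bonafide samples is rejected.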

6 Analysis and discussion

In the previous sections, a detailed investigation of several face anti-spoofing approaches from conventional to up-to date DL-based models is presented, that establish a foundation to comprehend critical parameters for designing the effective mechanisms. The core objective of this section is twofold: (i) First, we perform an analysis of some recent state-of-the-art (SOTA) approaches using a popular face anti-spoofing dataset. (ii) A broad-spectrum scrutiny of the face PAD methods covered in this study is deliberated.

6.1 Performance analysis

In this section, we follow a common evaluation framework for measuring the performance of widely deployed face PAD algorithms on a benchmark face anti-spoofing dataset with an identical performance protocol. Since most face PAD methods have been implemented and evaluated on diverse platforms, datasets and metrics, it becomes an interesting task to assess the effectiveness of these approaches on a common interface. Moreover, it is challenging to evaluate all the hardware-based algorithms, as they require devices with specific characteristics. Our study reports that handcrafted feature-based approaches are less effective and robust compared with their recent deep feature-based counterparts. As the trend has gradually shifted from traditional hardware-based to contemporary data-driven approaches, we limit our evaluation to recent deep learning-based face PAD techniques. All the selected SOTA algorithms are re-implemented and evaluated in a common environment with the REPLAY-ATTACK dataset. We selected this dataset because it includes a variety of face artifacts, such as printed photos, displayed photos and replayed videos, acquired through different sensing techniques with varied resolutions in both the training and testing sets. We follow ISO/IEC metrics for measuring the performance of these algorithms using accuracy and error rate protocols. The performance of the SOTA face PAD techniques under the HTER and ACA protocols is highlighted in Table 16.

Table 16 An evaluation of selected face PAD methods on REPLAY-ATTACK dataset

From Table 16, it may be inferred that the lowest HTER of 0.30% is reported by the technique in [133], where a DCNN model is used and an adaptive feature fusion layer is augmented. Overall, the best performance for displayed photo and replayed video attacks is shown by CompactNet [135], which overcomes the issue of overlapped samples in dominant color spaces. The best outcome in terms of ACA is demonstrated by the CNN + SfS algorithm [137], with 97.99% average classification accuracy.

Figure 21 illustrates the HTER of the selected SOTA face PAD approaches. It clearly reveals that the CompactNet model yields outstanding performance, with the lowest HTER values of 1.02%, 0.81% and 0.41% for combating printed photo, displayed photo, and replayed video attacks, respectively. The technique in [133] offers surprising results for printed attacks, with HTER = 0.30%, the lowest among the SOTA approaches, but the same technique yields the highest HTER of 11.54% for replayed video attacks. Another technique, based on the MC-CNN model [169], shows promising performance in counter-measuring all three types of attacks. The highest HTER for photo attacks, 7.89%, is obtained by [146].

Fig. 21
figure 21

Error rates of selected methods on REPLAY-ATTACK dataset for different attacks

The performance of the SOTA face PAD approaches in terms of ACA is portrayed in Fig. 22, which reveals that all of these techniques achieve classification accuracies in the range of 76.60–97.99%. The lowest accuracy of 76.60% is shown by [146] for detecting printed photo attacks, where pre-trained AEs are utilized for attack detection. The majority of the SOTA techniques ([129, 136, 145, 168] and [186]) result in almost identical classification accuracies for counter-measuring printed photo, displayed photo, and replayed video attacks.

Fig. 22
figure 22

Attack-wise accuracy for selected SOA methods on REPLAY-ATTACK dataset

6.2 Overall analysis

Traditional hardware-based face anti-spoofing mechanisms rely on certain aspects of an image that are imperative for good visual perception. In contrast, recent data-driven approaches mainly rely on the quality as well as quantity of the face images and offer enhanced performance. As shown in the literature, such methods not only help in counter-measuring spoof attacks but also provide additional security at the initial stage of biometric systems. The existing approaches may be evaluated on the basis of certain characteristics, such as availability of datasets, feature level, type of attack, and ability to deter known or unknown attacks. These parameters, manifested in face anti-spoofing mechanisms, can be assessed by qualitative factors that are computationally inexpensive to evaluate. A face anti-spoofing sub-module should be sufficiently capable of defying different types of attacks with varied artifacts, such as photo, video, print display, and mask attacks. Our broad comparison of existing face anti-spoofing methods based on their ability to counter four typical attacks is depicted in Fig. 23a. The literature shows that the majority of face anti-spoofing methods are mainly capable of preventing photo attacks. The performance of a face PAD sub-system is also determined by its ability to perform well in attack scenarios involving unseen samples captured under different environmental conditions. An overall picture from our survey is summarized in Fig. 23b, where it is seen that DL-based approaches prove considerably effective in dealing with unknown attacks. It is worth mentioning that only data-driven methods have proved effective against unfamiliar attacks; a noticeable number of reported DL-based methods automatically extract features from face images and work efficiently against unknown spoof attacks.
An important parameter central to any PAD method is its feature set, whose selection significantly decides the robustness of the method.

Fig. 23
figure 23

An inclusive comparative analysis among different data-driven face PAD techniques a Face methods that countermeasure various spoof attacks b comparison of studies that addresses the known and unknown attacks c PAD approaches based on feature engineering d face anti-spoofing methods which are evaluated on various datasets

Our method-wise comparison based on feature level is shown in Fig. 23c. It exhibits an overall dominance of DL-based approaches over other static or dynamic feature-based counterparts. An advancement is also observed in hybrid approaches based on handcrafted and deep-level features. Most face anti-spoofing approaches are evaluated on benchmark anti-spoofing datasets, and the comparative share of methods evaluated on each dataset is exhibited in Fig. 23d. It is clear that the REPLAY-ATTACK and CASIA-FASD databases have been used most frequently for building the face PAD models available in the literature.

Our analysis of handcrafted and DL-based anti-spoofing mechanisms infers that both exhibit their own pros and cons. Contemporary methods have explored the notion of integrated approaches that employ the merits of both in a complementary manner. Additionally, deep feature-based models suffer from certain limitations, i.e., the need for large training datasets, a huge number of learnable parameters and architectural complexity. This has given researchers a novel direction in which feature descriptors are added to the CNN model as a perturbation layer; with this, spoof detection research has turned to a new direction where handcrafted and DL-based features are integrated to reduce model complexity or improve classification accuracy. On the basis of certain parameters, Table 17 shows a comparative study of both the original approaches along with hybrid methods. It is clearly observed that the integrated techniques may offer noteworthy performance improvements at the cost of added training overhead.

Table 17 A comparative analysis among face PAD techniques based on various characteristics

7 Open research challenges and future directions

In the previous sections, we thoroughly explored a variety of face PAD mechanisms. Despite the impressive improvements in the performance of existing SOTA methods, several open problems remain due to trade-offs among various design parameters. From our exhaustive review, we have identified a number of open research issues that may be addressed in further research. These broad issues, with possible future directions, are discussed in the following paragraphs.

Research Challenge 1. Performance in unknown attack scenarios: A face PAD method trained on a specific anti-spoofing dataset should perform uniformly across diverse unseen samples acquired in different capturing environments. Our analysis reports that traditional hardware-based techniques fail to perform well against unknown face attacks, while data-driven approaches based on handcrafted feature engineering exhibit limited performance. The literature outlines only a few techniques for the unknown attack scenario, of which the recent DL-based approaches offer satisfactory results at the cost of additional training overhead. Thus, it is clear that the presented SOTA methods are limited in their ability to generalize across different datasets, sensors and spoof fabrication materials.

Research opportunities: One viable solution is to explore transfer learning with domain adaptation to counter the challenge of generalization to unknown attacks. Only a few authors ([65,66,67,68, 71, 106, 108, 109, 125,126,127, 129, 130, 133, 138,139,140, 147, 151, 154, 158, 159]) have used the cross-dataset scenario to address this problem. The pioneering contributions of Chugh et al. [187] and Gajwada et al. [188] introduced the universal material translator (UMT), which creates new spoof images from a few existing ones by transferring the styles of fake finger materials onto bonafide images. In future work, these methods may be explored to develop face anti-spoofing mechanisms that transfer the style of any spoof image onto genuine images by augmenting a wrapper that dynamically improves the robustness of any spoof detector and consequently deals with the problem of unknown attacks.

Research Challenge 2. Lack of sufficiently large face anti-spoofing databases: The existing deep feature-based face PAD mechanisms ([32, 43, 80, 136, 147, 158, 167,168,169,170,171,172,173,174]) use end-to-end deep neural networks that require larger amounts of training data to achieve better performance. The currently available face anti-spoofing databases, such as NUAA, CASIA, OULU, REPLAY-ATTACK, PRINT-ATTACK, and 3DMAD, contain an adequate number of images for evaluating machine learning-based face PAD methods, whereas these datasets are inadequate for learning DL-based models. Additionally, the majority of datasets include photo and video artifacts with a limited number of samples acquired from a diverse population. Moreover, datasets covering mask attacks are rarely available, which remains a scalable dataset problem for the research community.

Research opportunities: To overcome the scarcity of datasets, one possibility is to develop large standard datasets for liveness detection models. Alternatively, data augmentation can be used to generate a large number of images by applying variations to the face images, such as rotation, flipping, scaling, and zooming. However, the literature reports a considerable number of face anti-spoofing techniques that utilize data augmentation yet achieve limited performance in different scenarios. Therefore, future research must orient towards efficient data augmentation techniques for designing robust anti-spoofing mechanisms.
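The rotation/flipping/scaling/zooming variations mentioned above can be sketched with a few generic, label-preserving transforms. This is an illustrative minimal example, not a prescription from any of the cited works.

```python
import numpy as np

def augment(face, rng):
    """Return the original face crop plus four simple augmented variants:
    horizontal flip, 90-degree rotation, central zoom crop and brightness
    jitter. Expects a float image with values in [0, 1]."""
    out = [face]
    out.append(np.flip(face, axis=1))             # horizontal flip
    out.append(np.rot90(face, k=1, axes=(0, 1)))  # 90-degree rotation
    h, w = face.shape[:2]
    out.append(face[h // 4: h - h // 4, w // 4: w - w // 4])  # central zoom
    # Small photometric variation: random brightness scaling.
    out.append(np.clip(face * rng.uniform(0.8, 1.2), 0.0, 1.0))
    return out

rng = np.random.default_rng(42)
face = rng.random((32, 32))
samples = augment(face, rng)   # 5 samples: one original + four variants
```

In practice the geometric variants would be resized back to the network input resolution, and the transforms applied with random parameters per epoch rather than once.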

Research challenge 3. Variations in face images: Although the majority of the texture feature-based face PAD methods in this study exhibit low computational complexity, they suffer from variations across different acquisition environments. The inherent inconsistencies and variations during face image capture are induced mainly by illumination effects, lighting conditions, head movement, and rotational and scale inconsistencies. To overcome these issues, a variety of techniques have been proposed using image descriptors such as LBP ([32, 58, 64, 68, 73,74,75, 83, 99]), LPQ + LBP + HOG ([59, 61, 78]), LBP + BSIF ([62]), and BSIF + LPQ + CoALBP + SID with LBP ([65]), but these solutions perform well only in constrained environments. While some of these descriptors offer robust feature sets for better classification accuracy, their results are sometimes not up to the mark due to image variations, and only a few of them can deal with specific effects of image variations.

Research opportunities: One opportunity is to design robust texture descriptor-based face PAD mechanisms. Alternatively, a combination of multiple image descriptors, such as SURF, SIFT, HOG and LBP, may be used; these are invariant to different face acquisition conditions and provide more discriminative features from face images.
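As one concrete instance of such a descriptor, a minimal 3x3 LBP histogram can be sketched as follows (a basic variant for illustration only, without the rotation-invariant or uniform-pattern refinements used in the cited works):

```python
import numpy as np

def lbp_histogram(gray):
    """Basic 3x3 local binary pattern histogram of a grayscale image.

    Each interior pixel is compared with its eight neighbours; a neighbour
    greater than or equal to the centre contributes a set bit, yielding an
    8-bit code. The normalized 256-bin histogram of codes is the feature."""
    g = np.asarray(gray, dtype=float)
    c = g[1:-1, 1:-1]  # centre pixels
    # Neighbour offsets in clockwise order starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy: g.shape[0] - 1 + dy, 1 + dx: g.shape[1] - 1 + dx]
        codes |= ((neigh >= c).astype(np.uint8) << bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

Multiple such descriptors (e.g., LBP plus HOG) can then be concatenated into one feature vector before classification, which is the combination strategy suggested above.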

Research challenge 4. Generalization versus time in motion-based face PAD techniques: This survey clearly reveals that motion-based face PAD methods ([33, 102,103,104,105,106,107,108,109,110,111]) exhibit superior generalization capabilities because they explore the cues and information of the entire video sequence. However, these methods require the whole video sequence, which results in higher time complexity.

Research opportunities: As future research, new motion-based face PAD methods may be designed to mitigate this trade-off between time and generalization capability while counter-measuring both photo and video face attacks.

Research Challenge 5. Speed versus variations in image quality-based face PAD approaches: The techniques presented in this study demonstrate that the majority of image quality-based face PAD methods ([37, 78–86]) have an edge in terms of simplicity and speed, but they are sensitive to the changing quality of an image or a video frame.

Research opportunities: To extract invariant quality features from images, a combination of quality assessment parameters may be used, or multiple quality features may be investigated separately to train parallel classifiers with suitable fusion techniques.
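The combination of quality assessment parameters could look like the following sketch, which concatenates a few simple no-reference cues into one feature vector. The particular cues (Laplacian-variance sharpness, global contrast, mean brightness) are illustrative choices of ours, not a standard from the cited works.

```python
import numpy as np

def quality_features(gray):
    """Concatenate simple no-reference image-quality cues into one vector."""
    g = np.asarray(gray, dtype=float)
    # Discrete Laplacian: sharper images yield a higher response variance.
    lap = (-4 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    sharpness = lap.var()
    contrast = g.std()      # global contrast
    brightness = g.mean()   # mean brightness
    return np.array([sharpness, contrast, brightness])
```

In a fusion scheme, each cue (or cue group) could instead feed its own classifier, with the decisions combined by score-level fusion as suggested above.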

Research challenge 6. Discriminative power of lightweight DL-based PAD methods: Our analysis reports that recent advancements have focused on designing lightweight face PAD models. These models use fewer layers, resulting in compact network structures with lower training overhead. However, the existing lightweight models exhibit lower discriminative power.

Research opportunities: A possible solution to overcome this challenge is to use handcrafted feature descriptors such as LBP ([32, 58, 64, 68, 73,74,75, 83, 99]) and BSIF ([53]), augmented as an additional perturbation layer in the CNN. Furthermore, the discriminative power of these approaches may be improved by exploring other feature descriptors such as SURF, SIFT and BRIEF.

Research challenge 7. Counter-measuring new fake faces for training PAD models: The available handcrafted feature-based face PAD mechanisms ([37, 58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77, 79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111]) use either a single classifier or a combination of classifiers. The learning process becomes challenging when a larger number of new fake faces is encountered during the testing phase of the recognition system. These fake faces may be used to improve the accuracy of face PAD mechanisms by including such samples during re-training. However, re-training the entire system incurs additional overhead for model resetting and hyperparameter tuning.

Research opportunities: One effective solution that is little explored in face PAD mechanisms is incremental learning with ensemble methods, where retraining the whole system is not required. In these approaches, the model is learned continuously in an incremental manner, where the samples incorrectly classified in the previous iteration are included in the training dataset to build an improved model.
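The core of this idea, updating an existing model on newly arriving data without retraining from scratch and letting misclassified samples drive the updates, can be sketched with a perceptron-style linear learner. This is an illustration of the concept only, not any published PAD system.

```python
import numpy as np

def incremental_train(w, batches, epochs=20, lr=0.1):
    """Incremental learning sketch: the linear model is updated batch by
    batch without a full retrain, and only the samples it currently
    misclassifies trigger an update (perceptron rule)."""
    w = w.copy()
    for _ in range(epochs):
        for X, y in batches:               # data arrives in increments
            for xi, yi in zip(X, y):       # labels yi in {-1, +1}
                if np.sign(xi @ w) != yi:  # misclassified sample ...
                    w += lr * yi * xi      # ... is folded into the update
    return w

rng = np.random.default_rng(1)
# Linearly separable toy data with a clear margin on the first feature.
X = rng.standard_normal((40, 3))
X[:, 0] += np.where(X[:, 0] >= 0, 1.0, -1.0)
y = np.sign(X[:, 0])
batches = [(X[:20], y[:20]), (X[20:], y[20:])]  # two data increments
w = incremental_train(np.zeros(3), batches)
accuracy = float(np.mean(np.sign(X @ w) == y))
```

In an ensemble variant, each increment would instead train a new weak learner on the samples the current ensemble gets wrong, in the spirit of boosting.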

Research challenge 8. Improved learning with deep feature-based face PAD: The existing DL-based PAD sub-systems with a large number of layers provide excellent classification accuracy but require additional training and testing time, involving millions of trainable parameters. Apart from this, these models are trained with millions of images, which may take several days to complete the learning process.

Research opportunities: Transfer learning is one of the most efficient ways to train large-scale face PAD techniques while significantly reducing the overall cost of the system. The literature presented in this work reports only a few articles ([70, 121, 131, 138, 140, 151, 154, 158, 159]) in which the authors designed a lightweight model by reusing pre-trained CNN networks. Therefore, a new research opportunity is to use transfer learning for building efficient DL-based face PAD methods.
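The economics of transfer learning come from freezing the expensive backbone and training only a small head. The following toy sketch (our own illustration: a frozen random projection stands in for a pre-trained CNN backbone, and a logistic-regression head is the only trainable part) shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for a pre-trained backbone: a frozen nonlinear feature extractor
# whose weights are never updated (a real system would reuse e.g. weights
# pre-trained on a large face recognition corpus).
W_backbone = rng.standard_normal((8, 20))

def backbone(X):
    return np.tanh(X @ W_backbone.T)  # frozen features, shape (n, 8)

# Toy data: 60 flattened 20-dim "face crops" with labels that are
# linearly separable in the frozen feature space.
X = rng.standard_normal((60, 20))
feats = backbone(X)
y = (feats @ rng.standard_normal(8) > 0).astype(float)

# Transfer learning: only the small logistic-regression head is trained.
w_head, b_head, lr = np.zeros(8), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w_head + b_head)))  # sigmoid head
    w_head -= lr * feats.T @ (p - y) / len(y)  # gradient step on the head
    b_head -= lr * np.mean(p - y)              # backbone stays untouched

pred = (1.0 / (1.0 + np.exp(-(feats @ w_head + b_head))) > 0.5).astype(float)
accuracy = float(np.mean(pred == y))
```

Because gradients are computed only for the 9 head parameters, training cost is independent of the backbone size, which is the main saving transfer learning offers for large face PAD models.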

Research challenge 9. Challenges amid the COVID-19 pandemic: Currently, the entire world is fighting the spread of the COVID-19 pandemic, which has posed a novel issue for existing face recognition systems. This is mainly because of the preventive measure of covering the face with a mask to protect against the spread of the virus. Masks may affect the overall efficiency of already deployed face recognition systems, and hence trained face PAD methods may fail to perform accurately.

Research opportunities: Recent research must focus on re-designing and re-training the existing face PAD models to counter novel face attacks. Furthermore, the overall performance of these systems requires significant improvement in the presence of surgical masks covering the faces of users.

8 Conclusion

We investigated the evolutionary advancements of the last decades based on the momentous accomplishments and scientific trends in the field of face liveness detection. We expounded an in-depth analysis of SOTA face spoof detection methods that countermeasure widely attempted spoof attacks. In addition, we presented an illustrative analysis of the various available benchmark face anti-spoofing datasets and the performance protocols extensively employed for evaluating these algorithms. To assess the performance of some recently introduced face PAD mechanisms under common protocols, we evaluated these methods on the frequently used REPLAY-ATTACK dataset. A substantial amount of work has been reported for counter-measuring various types of attacks; however, the most commonly addressed are photo and replay video attacks. One of the major findings of our study is the paradigm shift from traditional to modern data-driven CNN-based methodologies from 2014 onwards. The early successful attempts were broadly concentrated on building efficient classifiers via robust handcrafted features. However, these techniques suffer from the limitations of selecting sufficient features and an appropriate extraction process, and hence sometimes exhibit only satisfactory accuracies. Moreover, the face, being a dynamic biometric trait, results in variations during the acquisition of face images. To tackle these problems, DL-based face PAD models have recently yielded state-of-the-art results. Considering the existing databases, there is a clear indication of an inadequate amount of data for the successful deployment of DL-based face PAD models under various scenarios, such as cross-database, cross-sensor or cross-material evaluation. Beyond these limitations, generalization to unknown attacks with lower HTER is another issue that needs to be addressed further.
Although CNN-based architectures like ResNet, DenseNet, VGGNet, InceptionNet, etc. have achieved groundbreaking performance on face PAD tasks, researchers are still searching for efficient methods that involve lower training overhead, small datasets and lightweight structures. The recent conception of vision transformers (ViT) for face spoof detection is another area that can be further explored. Finally, a recent trend highlighted by our analysis directs investigators to design more proficient face PAD methods that consolidate the merits of both handcrafted and automatic feature engineering.