
Image and Vision Computing

Volume 72, April 2018, Pages 1-13

A novel local wavelet energy mesh pattern (LWEMeP) for heterogeneous face recognition

https://doi.org/10.1016/j.imavis.2018.01.004

Highlights

  • The challenging heterogeneous face recognition (HFR) problem is considered.

  • A local adaptive 2D wavelet packet energy feature is measured to capture local edge and texture information.

  • Inspired by the mesh topology of computer networks, a novel derived local mesh pattern (DLMeP) is proposed.

  • Combining the proposed LWEMeP with deep learning enhances recognition accuracy.

Abstract

This paper proposes a novel and accurate methodology for matching heterogeneous faces, such as sketch–photo and near-infrared (NIR)–visible (VIS) image pairs. Inspired by the mesh topology used in computer networking, a new local binary pattern is developed, called the derived local mesh pattern (DLMeP). DLMeP is computed from the relationships among all pixels in a local window. For heterogeneous face recognition, emphasis is placed on edge and texture features, because these features can be extracted invariantly across modalities. The wavelet transform is employed to capture edge and texture features simultaneously, and a local wavelet energy feature is then calculated to enhance local texture information and edges. Finally, DLMeP is used to measure the local variation, or pattern, of wavelet energy; we call the result the local wavelet energy mesh pattern (LWEMeP). To refine LWEMeP, a model-based weight learning scheme is suggested. The proposed methodology is evaluated on several sketch–photo and NIR-VIS benchmark databases. For viewed sketches, a rank-1 recognition accuracy of 99.37% is achieved on the CUFSF database. LWEMeP achieves a rank-1 accuracy of 65.31% on the challenging e-PRIP composite sketch database. For NIR-VIS matching, a rank-1 accuracy of 89.78% is achieved, which is superior to other state-of-the-art methods. Finally, the proposed LWEMeP is also compared with state-of-the-art deep learning based methods for composite sketch vs. photo matching and NIR vs. VIS matching. The superior results of the proposed feature when combined with deep learning suggest that handcrafted features combined with deep learning yield excellent classification results.

Introduction

Biometric authentication is becoming the most important and universal tool in security systems. Its ubiquitous influence can be felt in mobile devices, smart TVs, CCTV, forensics, and beyond. Fingerprints, DNA samples, faces, retinas, and ears are all used for biometric authentication [1], [2], [3]. Among them, faces are the most easily available and the most easily recognizable: any human can recognize and authenticate a face, whereas, to the naked eye of a non-expert, authentication based on fingerprints, DNA samples, retinas, or ears is very difficult. In recent years, a great deal of work has been done on face-based biometric authentication for real-life applications [4]. Different applications involve faces acquired in different environments or even captured with different modern equipment. Faces captured in the near infrared are very useful for illumination-invariant face recognition [5], and thermal-infrared (TIR) faces are useful for liveness detection. When no fingerprints or DNA samples are available and the images captured by surveillance devices are of poor quality, face sketches generated by interviewing eyewitnesses are the only resource for law enforcement agencies. Various scenarios and necessities thus produce faces of different modalities, which are difficult to handle with conventional face recognition systems. Consequently, an attractive and challenging field of face biometrics has emerged, called heterogeneous face recognition (HFR) [6], [7].

The utmost challenge in any HFR system is handling images of the same person in different modalities. Face sketches are drawn using repeated pencil strokes and shades. NIR faces are generated by capturing radiation with wavelengths in the range 0.7–1.1 μm using special cameras, whereas normal visual faces are generated by capturing radiation in the range 0.4–0.7 μm. These face images (sketch, NIR, and visual) have different gray-level distributions: sketch images have almost constant gray tones with different values, while infrared images usually have low contrast and a blurry appearance compared to visual images. A modality gap between these images is therefore inevitable.

The problem of matching heterogeneous face images has received increasing attention in recent years, and a flurry of techniques has been proposed to solve it. Ouyang et al. [16] provided a detailed review of existing techniques and recent developments in HFR. These solutions can be classified into three broad categories: image synthesis based, subspace learning based, and common feature representation based.

  • Image synthesis: In this category, a synthesized face image is generated by converting an image of one modality into another, and the result is then compared. Tang & Wang [8] introduced an eigentransformation-based sketch synthesis method. Liu et al. [9] proposed local linear embedding for sketch–photo synthesis; the same mechanism was also used by Chen et al. [10] for NIR-VIS synthesis. Wang & Tang [11] proposed a patch-based Markov random field (MRF) model for sketch–photo synthesis, and the same MRF model was used for TIR-VIS synthesis by Li et al. [12]. Selecting the best patch as a candidate of the MRF model creates deformations in the synthesized facial image, so Zhou et al. [13] proposed a Markov weight field for sketch–photo synthesis. A transductive face sketch–photo synthesis (TFSPS) framework was proposed by Wang et al. [14]. Song et al. [15] proposed a de-noising based real-time face sketch synthesis method. Recently, Peng et al. [17] proposed a multiple-representation-based face sketch–photo synthesis. In this category, the emphasis lies on synthesis, which is a time-consuming and task-specific procedure.

  • Common subspace learning: In this category, face images of different modalities are projected into a latent subspace for learning. Lin & Tang [18] introduced common discriminant feature extraction (CDFE) for face sketch–photo recognition. Yi et al. [19] proposed a canonical correlation analysis based regression method for NIR-VIS face images. A coupled spectral regression (CSR) based learning method for NIR-VIS face images was proposed by Lei & Li [20]. Sharma & Jacobs [21] proposed a partial least squares (PLS) based subspace learning method for both sketch and photo images. Mignon & Jurie [22] proposed cross-modal metric learning (CMML) for heterogeneous face matching. Lei et al. [23] proposed coupled discriminant analysis for HFR. Kan et al. [24] proposed a multi-view discriminant analysis (MvDA) technique that finds a single discriminant common space for multiple views using view-specific linear transformations. Because images are projected into a subspace, this category inevitably incurs some information loss, which reduces accuracy.

  • Common feature representation: In this category, images of different modalities are represented in a common feature domain to minimize the modality difference, and this representation is then used for recognition. Liao et al. [25] introduced difference of Gaussian (DoG) filtering and multi-block local binary pattern (MB-LBP) features for both NIR and VIS face images. Klare et al. [26] employed scale invariant feature transform (SIFT) and multi-scale local binary pattern (MLBP) features within a local feature-based discriminant analysis (LFDA) framework for forensic sketch recognition. A learning-based coupled information-theoretic encoding (CITE) feature was proposed by Zhang et al. [27]. Bhatt et al. [28] proposed a multi-scale circular Weber's local descriptor (MCWLD) for semi-forensic sketch matching. Klare & Jain [29] proposed kernel prototype random subspace (KP-RS) learning on MLBP features extracted from heterogeneous face images. Zhu et al. [30] proposed transductive learning (THFM) for NIR-VIS face images, using Log-DoG filtered LBP and histogram of oriented gradients (HOG) features. Gong et al. [31] combined HOG and MLBP with canonical correlation analysis (MCCA). Roy & Bhattacharjee [32] proposed a geometric edge-texture based feature with a hybrid multiple fuzzy classifier consisting of fuzzy PLS (FPLS) and fuzzy LDA (FLDA) for HFR. Roy & Bhattacharjee also proposed the illumination-invariant local gravity face (LG-face) for HFR [33], a local gradient checksum (LGCS) feature for face sketch–photo matching [34], and a local gradient fuzzy pattern (LGFP) based on a restricted equivalent function for face sketch–photo recognition [35]. Recently, graphical representation based HFR (G-HFR) was proposed by Peng et al. [38]. In this category, local handcrafted features are used directly, so no local information is lost, and the algorithms are faster than those of the other two categories. The only remaining problem is to identify features that are either common to, or invariant across, the different modalities.

Several methods [39], [40] have recently been developed for computer-generated composite sketch recognition.

Recently, deep learning based methods, mainly convolutional neural networks (CNNs), have achieved great success in face recognition, and a few successful attempts have been made on HFR, mainly in NIR-VIS matching. Ngiam et al. [41] first proposed a bimodal deep autoencoder based on the de-noising autoencoder. A multi-modal deep Boltzmann machine (DBM) approach was suggested by Srivastava & Salakhutdinov [42]. A CNN-based sketch–photo synthesis method was proposed by Zhang et al. [43]. Yi et al. [44] proposed shared representation learning using restricted Boltzmann machines (RBMs) and Gabor features. Mittal et al. [45] were the first to apply a transfer-learning-based deep network to composite sketch recognition. Liu et al. [46] proposed a deep transfer NIR-VIS HFR network (TRIVET) using a CNN. Saxena & Verbeek [47] applied CNNs to both composite sketch and NIR-VIS recognition. The recognition results of these deep learning based methods are excellent. However, because theoretical guidance is scarce, designing a good deep network is difficult: the architecture (number of convolution layers, number of neurons, etc.) varies from problem to problem. Most deep learning methods also depend heavily on huge amounts of training data; Hu et al. [48] found that CNNs trained on small datasets performed worse than handcrafted features. Different face recognition problems require separate training, e.g., a network trained for sketch–photo face recognition cannot be used for NIR-VIS, so a generalized network design is not guaranteed. Another obvious problem with deep learning is the time required for training. In most deep learning pipelines, raw image pixels are used to learn the features best suited to the classification problem. However, Yi et al. [44] used Gabor features and Huang et al. [49] used LBP features with deep learning to get better results, and in [36] we also used local quaternary pattern features with deep learning. The experimental results of these methods show that handcrafted features combined with deep learning improve classification. Inspired by this, we use LWEMeP instead of raw image pixels as input to the CNN, obtaining superior results on the CASIA NIR-VIS 2.0 database.
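As an illustrative sketch of this idea (not the authors' actual network; the architecture, layer sizes, and input resolution below are all assumptions), the following Python/PyTorch snippet feeds a precomputed pattern-code map, rather than raw pixels, into a small CNN:

```python
# Hedged sketch: a toy CNN that consumes a precomputed feature-code map
# (e.g., an LWEMeP code image) instead of raw pixels. All layer sizes and
# the 128x128 input resolution are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureMapCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 1-channel code map in
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 128 -> 64
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64 -> 32
        )
        self.classifier = nn.Linear(64 * 32 * 32, num_classes)

    def forward(self, code_map):
        return self.classifier(self.features(code_map).flatten(1))

# Integer pattern codes are normalized to [0, 1] before entering the network.
codes = torch.randint(0, 256, (4, 1, 128, 128)).float() / 255.0  # stand-in batch
logits = FeatureMapCNN(num_classes=10)(codes)
```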

Feature extraction is undoubtedly the most important stage in any face recognition system, and much research has gone into designing the best handcrafted features. Deep learning methods have since addressed feature extraction by replacing handcrafted features with learned ones, and their excellent results suggest that learned features are more robust than handcrafted ones. Despite this performance, deep learning based methods face a few major difficulties:

  • The theoretical framework is not well established.

  • The architecture is determined empirically and is application specific.

  • They depend heavily on large amounts of training data.

  • Many user-defined parameters and heuristic components (max-pooling, softmax classification) are required.

  • Generalization is not guaranteed.

In HFR, we deal with face images of different modalities, which have different image representations. Common feature representation methods depend neither on time-consuming, task-specific synthesis nor on common-subspace learning, yet they are still able to exploit local spatial features. In synthesis-based methods, accuracy depends on how faithfully the synthesized image reproduces the target modality, and different modalities require different synthesis techniques, so a general or common technique is not possible. The same generalization problem exists for deep learning, which additionally requires huge training data and long training times. In common-subspace methods, important information can be lost when images of different modalities are projected. In common feature representation, by contrast, local features are used directly, without information loss, and more efficiently than in the other two categories. Motivated by these advantages and by the superior accuracy of such methods in HFR, we propose in this paper a novel common feature representation for images of different modalities. The goal of the proposed method is to find facial features that are invariant across modalities. All common feature representation methods in the literature use handcrafted features, which have gained huge popularity owing to their good performance. After a thorough study of the literature and our own visual inspection, we conclude that edges are the most important modality-invariant feature. Psychological studies also indicate that a face can be recognized from its edges alone [50], and various edge-based features have accordingly been used for face recognition, such as the line edge map [51] and the string face [52]. In face sketches, the artist emphasizes edges, and edges can also be detected in NIR images. In [34], [35], facial images were considered to contain two kinds of texture: regular texture in smooth facial regions and irregular texture in facial components such as the eyes, nose, and eyebrows. The facial components carry most of the edges, while the regular texture information is also important for face matching. We therefore need a domain that not only preserves the edge and texture information present in the image but also enhances it. The wavelet transform is one such domain, frequently used in image processing for texture classification [53], [54], texture segmentation [55], [56], [57], and edge detection [58], [59]. The wavelet transform represents the image through multiple channels in different frequency sub-bands, offering good frequency resolution at low frequencies and good spatial localization at high frequencies. Although edges and textures in a face image are sensitive to illumination variations, the wavelet domain also provides a solution for illumination-invariant face recognition [60].
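As a minimal, self-contained illustration of this property (the wavelet choice and image here are stand-ins; the paper's adaptive wavelet packet transform is described in Section 2), a single-level 2D decomposition already separates smooth texture from edge responses:

```python
# Hedged sketch: one level of a 2-D discrete wavelet transform. The 'db2'
# wavelet choice is an assumption made for illustration only.
import numpy as np
import pywt

img = np.random.rand(128, 128)            # stand-in for a grayscale face image
cA, (cH, cV, cD) = pywt.dwt2(img, 'db2')

# cA carries the low-frequency (smooth texture) content; cH, cV and cD respond
# to horizontal, vertical and diagonal intensity transitions, i.e. edges.
print(cA.shape, cH.shape, cV.shape, cD.shape)
```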

In this paper, we generate a common feature representation for images of different modalities by selecting modality-invariant features, namely edges and texture. In face sketches, the artist pays most attention to edge and texture information, and in NIR images the high-frequency information is captured while only the color information is lost; selecting edges and texture as modality-invariant features is therefore justified. We use the wavelet domain to extract these features. Instead of the ordinary wavelet transform, we apply an adaptive wavelet packet transform to narrow the search for the best frequency channels. An average energy is then measured in a local window for every selected sub-band. Since the wavelet domain captures texture and edge information at a coarse, global level, a feature representation at the local micro level is essential. Motivated by the strong performance of the local binary pattern (LBP) and LBP-like features in face and texture recognition, we propose a novel local binary pattern, DLMeP, inspired by the local mesh pattern (LMeP) [61] and by the mesh topology of computer networks. The proposed DLMeP is applied to the local wavelet energy sub-bands to measure the pattern of wavelet energy, yielding the local wavelet energy mesh pattern (LWEMeP). The local features represented by LWEMeP carry discriminative, complementary information at the different spatial scales of the wavelet transform, so LWEMeP boosts HFR performance by combining the local wavelet energy feature with an LBP-like representation. Finally, to enhance the discriminative power of LWEMeP, a weight learning method is developed. Experimental results on different HFR databases show the excellent performance of the proposed methodology.
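The local energy step can be sketched as follows (a minimal illustration only: the 3x3 window and uniform averaging are assumptions, and the exact energy definition is given in Section 2):

```python
# Hedged sketch: average wavelet energy over a sliding local window.
import numpy as np
from scipy.ndimage import uniform_filter

def local_wavelet_energy(subband, window=3):
    """Mean of squared wavelet coefficients in a local window (assumed form)."""
    return uniform_filter(subband ** 2, size=window)

subband = np.random.randn(64, 64)         # stand-in detail sub-band
energy_map = local_wavelet_energy(subband)
# The energy map emphasizes regions of strong sub-band response (edges and
# facial components); DLMeP then encodes its local variation as a binary code.
```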

The major contributions are:

  1. An adaptive 2D wavelet packet transform followed by a novel local energy measure is proposed to capture local edge and texture information.

  2. A novel DLMeP is proposed for capturing local image patterns (an LMeP-style sketch follows this list).

  3. A weight learning model is developed to enhance the discriminative power of the proposed LWEMeP.
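Since the DLMeP definition itself appears in Section 2 and is not reproduced in this excerpt, the following hedged sketch shows only the LMeP-style idea it builds on: each circular neighbor is compared with another neighbor (a mesh link) rather than with the center pixel.

```python
# Hedged sketch of an LMeP-style code [61]; the actual DLMeP encoding in the
# paper may differ. Neighbors are compared pairwise at a mesh step k.
import numpy as np

def lmep_code(neighbors, k=1):
    """Binary code from neighbor-vs-neighbor comparisons at mesh step k."""
    P = len(neighbors)
    bits = [int(neighbors[(i + k) % P] >= neighbors[i]) for i in range(P)]
    return sum(b << i for i, b in enumerate(bits))

# The 8 neighbors of a 3x3 window, sampled clockwise (center pixel excluded).
window = np.array([[5, 9, 1],
                   [7, 6, 3],
                   [2, 8, 4]])
neigh = window[[0, 0, 0, 1, 2, 2, 2, 1], [0, 1, 2, 2, 2, 1, 0, 0]]
print(lmep_code(neigh, k=1))
```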

This paper is organized as follows: Section 2 describes the proposed LWEMeP in detail; experimental results and comparisons are presented in Section 3; and Section 4 concludes the paper.

Section snippets

Proposed work

In this section, we present the proposed local wavelet energy mesh pattern (LWEMeP) for heterogeneous face matching. We first give a brief overview of the adaptive wavelet packet transform and the local energy calculation on the wavelet coefficients. Then, we introduce the ideas behind the proposed derived local mesh pattern (DLMeP) and LWEMeP. Finally, the weighted LWEMeP model is described in detail.
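As a hedged illustration of how learned region weights are typically used with LBP-like histograms (whether the paper uses exactly this distance is an assumption; the weight learning model itself is derived later in this section), regional histograms can be matched with a weighted chi-square distance:

```python
# Hedged sketch: weighted chi-square matching of per-region pattern histograms.
import numpy as np

def weighted_chi2(hists_a, hists_b, weights, eps=1e-10):
    """hists_*: (regions, bins) histograms; weights: (regions,) learned weights."""
    d = (hists_a - hists_b) ** 2 / (hists_a + hists_b + eps)
    return float(np.sum(weights * d.sum(axis=1)))

# Discriminative regions (eyes, nose) would receive larger learned weights
# than smooth regions such as the cheeks.
a, b = np.random.rand(16, 256), np.random.rand(16, 256)   # stand-in histograms
w = np.ones(16) / 16
print(weighted_chi2(a, b, w))
```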

Experimental results

In this section, LWEMeP is evaluated in two different HFR scenarios, face sketch vs. photo recognition and NIR vs. VIS image recognition, on existing benchmark databases. For face sketch vs. photo recognition, we tested the proposed method on the CUHK Face Sketch FERET Database (CUFSF) [11], [27] and the more challenging software-generated e-PRIP composite sketch database [40]. The CASIA NIR-VIS 2.0 Face Database [70] is used for NIR vs. VIS face image recognition.

Conclusion

We have presented DLMeP, a novel local binary pattern inspired by mesh topology, and applied it to represent the local wavelet energy pattern. To calculate the local wavelet energy from the coefficients, we adopt an adaptive scheme that selects and decomposes a wavelet band into its sub-bands. The resulting LWEMeP method handles facial images of different modalities, namely sketch–photo and NIR-VIS, equally well.

Experimental results on sketch–photo and NIR-VIS benchmark databases demonstrate the effectiveness of the proposed method.

Hiranmoy Roy received a B.E. (CSE) degree from Burdwan University and an M.Tech (CT) degree from Jadavpur University, India, in the years of 2003 and 2009, respectively. He is now with the RCC Institute of Information Technology, India, as an Assistant Professor in the Department of IT. His research interests lie in the fields of Image Processing and Pattern Recognition, primarily Human Face Recognition.

References (72)

  • M.N. Shirazi et al.

    Texture classification based on Markov modeling in wavelet feature space

Image Vis. Comput.

    (2000)
  • I. Dagher et al.

    Subband effect of the wavelet fuzzy C-means features in texture classification

Image Vis. Comput.

    (2012)
  • Y. Wu et al.

    Optimal threshold selection algorithm in edge detection based on wavelet transform

Image Vis. Comput.

    (2005)
  • M. Shih et al.

    A wavelet-based multiresolution edge detection and tracking

Image Vis. Comput.

    (2005)
  • Y.Z. Goh et al.

    Wavelet local binary patterns fusion as illuminated facial image preprocessing for face verification

Expert Syst. Appl.

    (2011)
  • L. Liu et al.

    Extended local binary patterns for texture classification

Image Vis. Comput.

    (2012)
  • S. Li et al.

    Illumination invariant face recognition using NIR images

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2007)
  • S. Li

    Encyclopaedia of Biometrics

    (2009)
  • X. Tang et al.

    Face sketch recognition

    IEEE Trans. Circ. Syst. Video Technol.

    (2004)
  • Q. Liu et al.

    A nonlinear approach for face sketch synthesis and recognition

  • J. Chen et al.

    Learning mappings for face synthesis from near infrared to visual light images

  • X. Wang et al.

    Face photo-sketch synthesis and recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • J. Li et al.

    Hallucinating faces from thermal infrared images

  • H. Zhou et al.

    Markov weight fields for face sketch synthesis

  • N. Wang et al.

    Transductive face sketch-photo synthesis

    IEEE Trans. Neural Netw.

    (2013)
  • Y. Song et al.

    Real-time exemplar-based face sketch synthesis

  • C. Peng et al.

    Multiple representation-based face sketch-photo synthesis

    IEEE Trans. Neural Netw. Learn. Syst.

    (2016)
  • D. Lin et al.

    Inter-modality face recognition

  • D. Yi et al.

    Face matching between near infrared and visible light images

  • Z. Lei et al.

    Coupled spectral regression for matching heterogeneous faces

  • A. Sharma et al.

Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch

  • A. Mignon et al.

CMML: a new metric learning approach for cross modal matching

  • Z. Lei et al.

    Coupled discriminant analysis for heterogeneous face recognition

    IEEE Trans. Inf. Forensics Secur.

    (2012)
  • M. Kan et al.

    Multi-view discriminant analysis

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • S. Liao et al.

Heterogeneous face recognition from local structure of normalized appearance

  • B.F. Klare et al.

    Matching forensic sketches to mug shot photos

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)


Debotosh Bhattacharjee received MCSE and Ph.D. (Engineering) degrees from Jadavpur University, India, in 1997 and 2004, respectively. He was associated with different institutes in various capacities until March 2007. He then returned to his alma mater, Jadavpur University. His research interests pertain to applications of computational intelligence techniques such as Fuzzy Logic, Artificial Neural Networks, Genetic Algorithms, and Rough Set Theory in Face Recognition, OCR, and Information Security. He has authored or coauthored more than 150 journal and conference publications, as well as several book chapters in the area of Biometrics. During his postdoctoral research, Dr. Bhattacharjee visited several universities in Europe, including the University of Twente, The Netherlands; Instituto Superior Técnico, Lisbon, Portugal; and Heidelberg University, Germany. He is a life member of the Indian Society for Technical Education (ISTE, New Delhi) and the Indian Unit for Pattern Recognition and Artificial Intelligence (IUPRAI) as well as a senior member of IEEE (USA).

This paper has been recommended for acceptance by Matti Pietikäinen.
