Speech Communication
Volume 54, Issue 6, July 2012, Pages 836-843

Short Communication
Feature selection for reduced-bandwidth distributed speech recognition

https://doi.org/10.1016/j.specom.2012.01.003

Abstract

This paper examines the impact on speech recognition performance of two methods for reducing the dimension of the feature vectors in a distributed speech recognition (DSR) environment. The motivation for reducing the dimension of the feature set is to reduce the bandwidth required to send the feature vectors over a channel from the client front-end to the server back-end in a DSR system. In the first approach, the features are empirically chosen to maximise recognition performance. In the second, a data-centric transform-based dimensionality-reduction technique is applied. Test results for the empirical approach show that individual coefficients have different impacts on speech recognition performance, and that certain coefficients should always be present in an empirically selected reduced feature set for given training and test conditions. Initial results show that with the empirical method, the number of elements in a feature vector produced by an established DSR front-end can be reduced by 23% with low impact on recognition performance (less than 8% relative performance drop compared to the full-bandwidth case). Using the transform-based approach, for a similar impact on recognition performance, the number of feature vector elements can be reduced by 30%. Furthermore, the results indicate that, for best recognition performance under either approach, the SNR of the speech signal should be considered when selecting the feature vector elements to be included in a reduced feature set.

Introduction

The front-end in an automatic speech recognition (ASR) system extracts feature vectors from the input speech that are suitable for processing by a classifier. A training process is used to produce a set of acoustic models from a set of known utterances. These models are used to perform the recognition testing in the classifier. The parametric representation of the speech signal is an important factor in the design of an ASR system. Mel frequency cepstral coefficient (MFCC)-based analysis (Davis and Mermelstein, 1980), which is widely used in ASR systems, uses a cepstral feature vector that is derived by applying a discrete cosine transform (DCT) to the outputs of a mel-scaled triangular filter bank. The cepstral feature vectors are referred to as “static” features since they contain information from a given frame, typically 20–30 ms of speech. To improve the frame representation, it is common to use “dynamic” features such as delta (or velocity) and delta–delta (or acceleration) coefficients that attempt to capture some information about the time evolution of the signal.
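To make the frame representation concrete, the following minimal sketch (in Python with NumPy and SciPy; the filter bank size, number of cepstral coefficients and regression span are illustrative assumptions, not values taken from the paper) computes static MFCCs by applying a DCT to log mel filter bank outputs and then derives the delta and delta-delta streams:

    # Static MFCCs plus delta ("velocity") and delta-delta ("acceleration")
    # coefficients. Log mel filter bank energies are assumed to be given.
    import numpy as np
    from scipy.fftpack import dct

    def static_mfcc(log_mel_energies, num_ceps=13):
        # DCT of the log mel filter bank outputs, one cepstral vector per frame
        return dct(log_mel_energies, type=2, axis=1, norm='ortho')[:, :num_ceps]

    def deltas(features, span=2):
        # Standard regression formula over +/- span neighbouring frames
        padded = np.pad(features, ((span, span), (0, 0)), mode='edge')
        denom = 2 * sum(k * k for k in range(1, span + 1))
        return sum(k * (padded[span + k:len(features) + span + k]
                        - padded[span - k:len(features) + span - k])
                   for k in range(1, span + 1)) / denom

    # Example with 100 frames of 23 log mel energies (random stand-in data)
    log_mel = np.log(np.random.rand(100, 23) + 1e-8)
    static = static_mfcc(log_mel)                 # 100 x 13
    velocity = deltas(static)                     # 100 x 13
    acceleration = deltas(velocity)               # 100 x 13
    frame_vectors = np.hstack([static, velocity, acceleration])  # 100 x 39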

In a distributed speech recognition (DSR) system (Peinado and Segura, 2006; Tan and Lindberg, 2008), the speech recognition task is split between the terminal or client, where the front-end feature extraction is performed, and the network or server, where the back-end recognition is performed. The features that represent the speech are sent over an error-protected data channel to the classifier for processing. The European Telecommunications Standards Institute (ETSI) published its first standard for DSR in 2000 under the Aurora working group. In the ETSI standards for DSR (ETSI ES 201 108 Ver. 1.1.3, 2003; ETSI ES 202 050 Ver. 1.1.5, 2007), the static MFCC feature vector coefficients are sent over the channel from the client to the server, where the dynamic features are then calculated.

The size or dimension of the feature vector set is an important consideration in speech recognition. Typical implementations of MFCCs use thirteen static coefficients (cepstral coefficients C1–C12 and the frame log energy) and the associated velocity and acceleration coefficients, resulting in a 39-element feature vector. It is accepted that recognition accuracy is improved by the use of the dynamic coefficients; however, redundancies do exist, and some of these coefficients can be discarded without impacting recognition performance (Choi, 2002). A speech recognition system using large feature vectors requires a computationally complex recogniser in the back-end, with increased memory and computation requirements. Previous work on dimensionality reduction has generally focused on reducing the computational and storage requirements of the classifier. In a DSR system, however, more bandwidth is required to transmit the static feature vector coefficients from the client to the server as the feature vector dimension increases, thus increasing the demand on wireless bandwidth. It is therefore of interest to reduce the feature set size in order to reduce bandwidth requirements, subject to speech recognition performance being maintained at an acceptable level. This is the primary motivation for the work described in this paper.
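As a rough illustration of the bandwidth argument, the sketch below assumes a 10 ms frame rate and a hypothetical fixed bit budget per transmitted coefficient (the ETSI front-ends actually use split vector quantisation, so real savings depend on the codebook layout); dropping three of the thirteen static coefficients (about 23% of the elements) reduces the uplink rate proportionally:

    # Back-of-the-envelope DSR uplink bandwidth (illustrative numbers only)
    FRAME_RATE_HZ = 100        # one feature vector every 10 ms
    BITS_PER_COEFF = 6         # hypothetical per-coefficient bit budget

    def uplink_bps(num_static_coeffs):
        return num_static_coeffs * BITS_PER_COEFF * FRAME_RATE_HZ

    full = uplink_bps(13)      # 13 static coefficients -> 7800 bps
    reduced = uplink_bps(10)   # 10 static coefficients -> 6000 bps
    print(1 - reduced / full)  # ~0.23, i.e. ~23% bandwidth saving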

Different methods to reduce the size of the feature vector required for speech recognition are well documented in the literature. Paliwal (1992) examines four methods (F-ratio, training data, linear discriminant analysis (LDA) and principal component analysis (PCA)) for reducing the dimensions of the feature vectors used in an HMM-based speech recognition system. In a multi-speaker isolated digit recognition task, the results indicated that it is possible to reduce the feature vector size without impacting recognition performance. Bocchieri and Wilpon (1993) use discriminative analysis to select a subset of coefficients from the feature vectors in continuous speech recognition experiments. Nicholson et al. (1997) measured the correlation between the class discrimination of MFCC feature sets and the speech recognition results they produced, finding that the discriminative measures used correlated strongly with the recognition accuracy of a feature space; knowledge of the feature space separability can thus be used to select the best feature subset and to predict recognition performance. Kumar and Andreou (1998) apply heteroscedastic discriminant analysis to reduce the feature vector size in isolated digit recognition, improving recognition performance. Cernak et al. (2007), in an investigation of MFCC and perceptual linear prediction (PLP) features, found that the higher-index coefficients, along with the associated velocity and acceleration coefficients, were the most informative for predicting incorrect speech recognition. Koniaris et al. (2010a, 2010b) apply knowledge of the human auditory periphery to the reduction of the dimensions of the feature vectors used for ASR: perturbation theory and the sensitivity matrix (Plasberg and Kleijn, 2007) are used to select a subset of MFCC coefficients that maximise the similarity between the Euclidean geometry of the MFCC feature set extracted from the signal and the representation of the signal by the chosen model of the human auditory system. Their results show that applying knowledge of the auditory system to the reduction of feature vector dimension produces feature vectors that are more robust than those obtained with linear and heteroscedastic discriminant analysis methods. The Laplacian eigenmaps latent variable model (LEVLM) is proposed by Jafari and Almasganj (2010) as a dimension reduction method for speech recognition; applied to 39-dimensional MFCC feature vectors, it produces feature vectors with fewer elements without compromising speech recognition performance for the task investigated, and is shown to give better recognition results than PCA with smaller-dimension feature vectors. Finally, a technique for feature selection using singular value decomposition (SVD) followed by QR decomposition with column pivoting (QRcp) is proposed by Chakroborty and Saha (2010); SVD-QRcp is used to reduce the complexity of a speaker identification task and is shown to outperform the F-ratio feature selection method with an improved speaker identification rate.
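As a concrete example of the transform-based, data-centric family of methods surveyed above, the following minimal sketch projects 39-dimensional feature vectors onto their top principal components using plain PCA (note this is PCA only, not the HLDA or auditory-motivated methods discussed; the matrix sizes are illustrative stand-ins):

    # PCA projection of feature vectors onto a lower-dimensional subspace
    import numpy as np

    def pca_projection(features, out_dim):
        # Projection matrix onto the top out_dim principal components
        centered = features - features.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
        return eigvecs[:, ::-1][:, :out_dim]      # strongest components first

    train = np.random.randn(5000, 39)             # stand-in for MFCC + deltas
    P = pca_projection(train, out_dim=27)         # 39 -> 27 (~30% fewer elements)
    reduced = (train - train.mean(axis=0)) @ P    # 5000 x 27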

This paper first investigates the relative importance of the different feature vector coefficients produced by the ETSI advanced front-end (AFE) (ETSI ES 202 050 Ver. 1.1.5, 2007). Recognition tests are also carried out using empirically reduced feature vector sizes, based on published work, with a view to reducing the bandwidth requirements of a DSR system. Second, the combination of heteroscedastic linear discriminant analysis (HLDA) (Kumar and Andreou, 1998) and the semi-tied covariance transform (Gales, 1999) is investigated. The paper extends previous work reported in the literature by considering a range of different options for bandwidth reduction, and further by examining the relationship between the signal-to-noise ratio (SNR) of the speech signal and the optimal bandwidth reduction. The recognition problem examined is connected digit recognition using the Aurora 2 database (Hirsch and Pearce, 2000). The classifier used for the recognition experiments is the hidden Markov model (HMM) recogniser specified for use with the Aurora database.

The layout of the paper is as follows. The recognition system and the database used are described in Section 2. The detailed results from the experiments conducted are presented in Section 3. In Section 4 the results are discussed, with conclusions and suggestions for future work following in Section 5.

Section snippets

Experimental framework

This section presents the experimental framework used for the work discussed in this paper.

Effect of removing individual coefficients

As a first experiment, and to establish a baseline, the effect on the recognition performance of the ETSI AFE (ETSI ES 202 050 Ver. 1.1.5, 2007) of removing each of the thirteen static coefficients individually (C1–C12 and the combined C0–log E term) is investigated (Paliwal, 1992). Note that in a DSR context, removal of any coefficient implies removal of its associated velocity and acceleration coefficients as well, since these are calculated in the network or server after
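The removal operation itself can be sketched as follows (hypothetical code, not the paper's implementation): deleting static coefficient i from a 39-element vector of 13 static, 13 velocity and 13 acceleration coefficients also deletes the matching velocity and acceleration columns:

    # Remove one static coefficient and its derived dynamic coefficients
    import numpy as np

    def remove_static_coefficient(vectors, idx, num_static=13):
        cols = [idx, idx + num_static, idx + 2 * num_static]
        return np.delete(vectors, cols, axis=1)

    frames = np.random.randn(100, 39)                  # stand-in feature stream
    pruned = remove_static_coefficient(frames, idx=5)  # -> 100 x 36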

Discussion

The test results in Tables 1 and 2 for the case where a single static coefficient (along with its associated velocity and acceleration coefficients) is removed show that the impact on speech recognition performance depends on the particular coefficient and on the noise type (more specifically, the combination of noises used in the test set). Detailed analysis also shows that the SNR of the speech signal is another factor that influences how much the recognition performance changes with coefficient

Conclusion

This paper has presented results from an investigation into the effects of feature selection on speech recognition performance, with the aim of reducing transmission bandwidth in a DSR system using wireless communication. This topic has previously been addressed in the literature, mainly using data-centric rather than empirical approaches, and this study extends previous work through examination of a range of possible approaches, as well as analysis of the effect of SNR on the choice of the

