Deep neural network ensemble for reducing artificial noise in bandwidth extension

https://doi.org/10.1016/j.dsp.2020.102760

Abstract

In this paper, we propose a deep neural network (DNN) ensemble for reducing artificial noise in speech bandwidth extension (BWE). The proposed DNN ensemble consists of three DNN models: one classification model and two regression models. When sub-band energies of the high-frequency region are estimated with sequential DNNs in the frequency domain, over-estimation of those energies causes annoying artificial noise. To mitigate this noise, we design a DNN classification model that distinguishes over-estimation frames from normal frames. We then train two DNN regression models separately: one on half of the entire training set, and the other on a limited training set built from over-estimation frames and some normal frames, to improve performance on the over-estimation frames. Since the classification model outputs the probabilities that a frame is normal or over-estimated, the two regression models are combined adaptively using these probabilities as weights; the final output of the DNN ensemble is thus the weighted sum of the two estimated sub-band energies. As a result, artificial noise is significantly reduced, yielding improved speech quality. The proposed method is evaluated objectively and subjectively against conventional approaches.

Introduction

In traditional digital speech transmission systems based on narrowband (NB) speech codecs [1], the frequency band is limited to 300 Hz to 3.4 kHz. Recently, many services have adopted wideband (WB) speech codecs, which cover the frequency band from 50 Hz to 7 kHz [2]. However, WB service requires network support for the WB codec and the replacement of terminal devices, so bandwidth extension (BWE) algorithms operating on NB speech signals have been proposed to provide speech communication quality close to WB during this transition period.

BWE algorithms can be broadly categorized into those that use side information extracted at the encoder and those that use only NB speech information. In the former case, high-band (HB) feature vectors are extracted and transmitted as additional bits. In contrast, BWE algorithms using only NB speech information can be applied without modifying existing transmission systems, although they are more challenging than approaches that use side information [3]. BWE without side information is called blind or stand-alone BWE and can be further divided into two categories: one is based on a speech production model using linear predictive coding (LPC) coefficients [4], and the other directly estimates the HB spectra in the frequency domain using extrapolation [5] or machine learning [6]. In particular, deep neural network (DNN) regression models that estimate the log power magnitudes of the high-frequency band were first proposed for BWE in [7].

Recently, many studies based on DNN models have been reported. For example, a DNN was used to estimate the HB cepstrum from log magnitudes of the NB speech signal and autocorrelation coefficients in [8], [9]. Furthermore, convolutional neural network (CNN), recurrent neural network (RNN), convolutional recurrent neural network (CRNN), and generative adversarial network (GAN) models were employed to estimate HB spectra in [10], [11], [12], [13]. More recently, WaveNet, a dilated-convolution-based waveform generation model first proposed for speech synthesis, was adopted for speech bandwidth extension [14]. Deep learning-based methods have been shown to improve BWE performance by effectively estimating the high-frequency band. However, they generally suffer from two problems: over-smoothing [15] and over-estimation. In BWE tasks, mitigating over-estimation is the more critical of the two, because over-estimation generates artificial noise that sounds very annoying to the listener [16]. Although improving clarity and reducing artificial noise are in a trade-off relationship, many studies have focused only on improving clarity without considering the artificial noise they may introduce.

In this paper, we propose a DNN ensemble for the BWE task that significantly improves speech quality by reducing the artificial noise caused by over-estimation of HB energies. To this end, two different DNN regression models that output sub-band energies are trained: one on half of the entire training set, and the other on a limited training set composed of over-estimation frames and some normal frames. Additionally, a DNN classification model that assigns a probability to each regression model is trained. Finally, we estimate the sub-band energies of the HB as the probability-weighted sum of the two regression outputs. Experimental results show that the proposed approach provides better speech quality than a conventional DNN-based speech bandwidth extension algorithm.
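To make the combination step concrete, here is a minimal sketch of the probability-weighted sum described above. The array shapes, variable names, and example values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def combine_subband_energies(e_normal, e_over, p_over):
    """Weighted sum of two sub-band energy estimates.

    e_normal : (num_subbands,) estimate from the regressor trained on
               half of the full training set
    e_over   : (num_subbands,) estimate from the regressor trained on
               the limited, over-estimation-focused set
    p_over   : scalar in [0, 1], classifier probability that the current
               frame is an over-estimation frame
    """
    p_normal = 1.0 - p_over
    return p_normal * e_normal + p_over * e_over

# Example: one frame with 8 high-band sub-bands (values are arbitrary).
e_a = np.array([1.2, 0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
e_b = np.array([1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
print(combine_subband_energies(e_a, e_b, p_over=0.8))
```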

The remainder of this paper is organized as follows. In the next section, we review the conventional methods for speech bandwidth extension. In Section 3, we describe the design of the proposed DNN ensemble for reducing artificial noise. The experiments and results of the proposed method are discussed in Section 4, and conclusions are presented in Section 5.

Section snippets

Review of sequential DNNs for speech bandwidth extension

Here, we briefly review sequential DNNs for BWE, as described in [17]. Sequential DNNs use sub-band energies in the discrete Fourier transform (DFT) domain as input features and targets, because direct mapping with DNNs is known to suffer from over-smoothing, which can cause a lisping effect [15]. To extract sub-band energies, we first perform the short-time Fourier transform (STFT) on the windowed frame to obtain DFT coefficients and log power magnitudes. Then, sub-band
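As a rough sketch of this feature-extraction step (the excerpt above is truncated), the code below computes log-power sub-band energies from the DFT of one windowed frame. The window, FFT size, and uniform sub-band grouping are assumptions; the paper's exact sub-band layout is not given in the excerpt.

```python
import numpy as np

def log_subband_energies(frame, n_fft=512, n_subbands=16):
    """Assumed layout: log-power energies of equal-width sub-bands
    computed from one time-domain analysis frame."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.fft.rfft(windowed, n=n_fft)      # DFT coefficients
    power = np.abs(spectrum) ** 2                  # power magnitudes
    # Group the bins into equal-width sub-bands and average the power.
    usable = power[: (len(power) // n_subbands) * n_subbands]
    band_power = usable.reshape(n_subbands, -1).mean(axis=1)
    return np.log(band_power + 1e-12)              # log sub-band energies
```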

Proposed DNN ensemble for reducing artificial noise

In this section, we describe the proposed BWE algorithm, shown in Fig. 1. The proposed DNN ensemble consists of three models: one for classification and two for regression. The regression models are trained in the same way as the sequential DNNs, and the classification model estimates the reliability of each regression model as a probability. Since over-estimation of HB energies is the dominant cause of audible artifacts in BWE algorithms [16], the proposed method
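A minimal sketch of what the classification model could look like is given below, assuming a small fully connected network trained with two-class cross-entropy on frames labeled normal versus over-estimation. The layer sizes, input dimensionality, and optimizer settings are assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: NB feature vector in, 2 classes out
# (class 0 = normal frame, class 1 = over-estimation frame).
classifier = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

def train_step(features, labels):
    """features: (batch, 128) float tensor; labels: (batch,) long tensor."""
    optimizer.zero_grad()
    logits = classifier(features)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, the softmax of the logits gives the probabilistic weights
# used to blend the two regression models:
#   p = torch.softmax(classifier(features), dim=-1)  # p[:, 1] = P(over-est.)
```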

Experiments and results

We experimented with the TIMIT dataset, which consists of 10 sentences spoken by each of 630 American English speakers of both sexes covering eight major dialects. The recordings were made in a studio at a 16 kHz sampling frequency, in mono. The dataset contains 4620 sentences for the training set and 1680 sentences for the test set. To consider noisy environments, four different noise sources, including babble and white noise from the NOISEX-92 database [18] and additional office and street
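The excerpt cuts off before the full noisy-set description, but the following hedged sketch shows the usual way a noise source is mixed into a clean utterance at a target SNR; the scaling scheme reflects standard practice, not necessarily the paper's exact protocol.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` and add it to `clean` to reach the target SNR in dB."""
    # Tile or trim the noise to match the utterance length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve for the gain that yields 10*log10(p_clean / p_scaled) = snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```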

Conclusions

In this paper, a DNN ensemble for reducing artificial noise was proposed. First, we trained a DNN regression model that estimates the sub-band energies of the HB, using half of the training set. Next, an over-estimation measure was defined and computed by passing the rest of the training set through the trained regression model. Based on the degree of over-estimation, we prepared another training set composed of over-estimation frames and an equal number of normal frames; this was
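The excerpt truncates before the over-estimation measure is defined, so the sketch below substitutes one plausible definition, the mean positive excess of estimated over true log sub-band energy per frame, to flag over-estimation frames and pair them with an equal number of normal frames. Both the measure and the threshold are assumptions.

```python
import numpy as np

def build_limited_set(true_e, est_e, features, threshold=1.0, seed=0):
    """Select over-estimation frames plus an equal number of normal frames.

    true_e, est_e : (num_frames, num_subbands) true / estimated log energies
    features      : (num_frames, feat_dim) corresponding NB input features
    threshold     : assumed cutoff on the over-estimation measure
    """
    # Assumed measure: average positive excess of the estimate per frame.
    excess = np.maximum(est_e - true_e, 0.0).mean(axis=1)
    over_idx = np.flatnonzero(excess > threshold)
    normal_idx = np.flatnonzero(excess <= threshold)
    # Sample as many normal frames as over-estimation frames (assumes
    # normal frames are the majority, which holds for a sane threshold).
    rng = np.random.default_rng(seed)
    sampled = rng.choice(normal_idx, size=len(over_idx), replace=False)
    keep = np.concatenate([over_idx, sampled])
    return features[keep], true_e[keep]
```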

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1A2A1A17069651).

References (23)

  • P. Jax et al., "On artificial bandwidth extension of telephone speech," Signal Process., Aug. 2003.
  • A. Varga et al., "Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., Jul. 1993.
  • E. Ekudden et al., "The adaptive multi-rate speech coder."
  • B. Bessette et al., "The adaptive multirate wideband speech codec (AMR-WB)," IEEE Trans. Speech Audio Process., Nov. 2002.
  • P. Gajjar et al., "Artificial bandwidth extension of speech & its applications in wireless communication systems: a review."
  • H. Pulakka et al., "Evaluation of an artificial speech bandwidth extension method in three languages," IEEE/ACM Trans. Audio Speech Lang. Process., Aug. 2008.
  • H. Pulakka et al., "Speech bandwidth extension using Gaussian mixture model-based estimation of the highband mel spectrum."
  • K. Li et al., "A deep neural network approach to speech bandwidth expansion."
  • J. Abel et al., "A simple cepstral domain DNN approach to artificial speech bandwidth extension."
  • J. Abel et al., "Artificial speech bandwidth extension using deep neural networks for wideband spectral envelope estimation," IEEE/ACM Trans. Audio Speech Lang. Process., Jan. 2018.
  • Y. Gu et al., "Waveform modeling using stacked dilated convolutional neural networks for speech bandwidth extension."

Kyoungjin Noh was born in Seoul, Korea, in 1990. He received the B.S. degree in electronic engineering from Hanyang University, Seoul, Korea, in 2015. He is currently pursuing the Ph.D. degree in electronics and computer engineering at Hanyang University. His research interests include sound event detection, acoustic scene classification, acoustic signal processing, speech processing, deep/machine learning, and artificial intelligence (AI).

Joon-Hyuk Chang received the B.S. degree in electronics engineering from Kyungpook National University, Daegu, South Korea, in 1998, and the M.S. and Ph.D. degrees in electrical engineering from Seoul National University, South Korea, in 2000 and 2004, respectively. From 2000 to 2005, he was with Netdus Corp., Seoul, as CTO. From 2004 to 2005, he held a post-doctoral position with the University of California at Santa Barbara, where he was involved in adaptive signal processing and audio coding. In 2005, he joined the Korea Institute of Science and Technology, Seoul, as a Research Scientist working on speech recognition. From 2005 to 2011, he was an Assistant Professor with the School of Electronic Engineering, Inha University, Incheon, South Korea. He is currently a Full Professor with the School of Electronic Engineering, Hanyang University, Seoul, South Korea. His research interests include speech recognition, deep/machine learning, artificial intelligence (AI), speech processing, acoustic signal processing, and biomedical signal processing. He was the recipient of the IEEE/IEEK IT Young Engineer of the Year award in 2011. He currently serves on the editorial board of the Elsevier journal Digital Signal Processing.
