Signal Processing
Volume 82, Issue 9, September 2002, Pages 1283-1304
 
doi:10.1016/S0165-1684(02)00285-2
Copyright © 2002 Elsevier Science B.V. All rights reserved.

Time-varying MMSE modulated lapped transform and its applications to transform coding for speech and audio signals

Guangyu Wang (a) and Ulrich Heute (b)

(a) Infineon Technologies AG, D-81541 Munich, Germany
(b) Institute for Circuit and System Theory, University of Kiel, Kaiserstr. 2, D-24143 Kiel, Germany

Received 24 February 2000; revised 28 May 2001. Available online 3 June 2002.

Abstract

The time-varying modulated lapped transform (MLT) is used in speech and audio coding schemes to adjust the time-frequency resolution, to eliminate pre-echoes in the reconstructed signal, and to improve the coding quality. In order to maintain the perfect-reconstruction (PR) property in transition periods, an asymmetrical window has to be used, at the cost of poorer frequency characteristics. We first generalize a window-design method for transition periods in the time-varying MLT, with a rigorous proof of its PR property. We then present a new window-design method in which the prototype window is designed so that the total reconstruction distortion in the presence of coefficient quantization is minimized. This leads to the time-varying minimum mean-square error (MMSE) MLT. Experiments have shown that the designed windows have better frequency characteristics than the sine window in both transition and regular periods. A general formulation of the quantization distortion for different quantization-error models and for all coding systems is given. A simplified optimal window-design algorithm that avoids direct minimization of the distortion equation is suggested. As an example, a transform-coding scheme with time-varying MMSE MLT for speech and audio signals is presented.

Article Outline

1. Introduction
2. Time-varying modulated lapped transform
2.1. Time-invariant modulated lapped transform
2.2. Time-varying forms of the MLT
3. Analysis of the quantization effects in the time-varying MLT
3.1. Regular period
3.2. Transition period
4. Optimization algorithm
4.1. Review of some optimization algorithms
4.2. Simplified optimization for window
5. Application to speech and audio coding
5.1. Coder structure
5.2. Window design
6. Conclusion
Appendix
References

1. Introduction

The modulated lapped transform (MLT), also termed modified discrete cosine transform (MDCT) and time-domain alias cancellation (TDAC), is widely used in both speech and audio compression coding schemes [11, 15 and 19]. The main reason is that the MLT can eliminate the so-called blocking effects by introducing an overlap between adjacent windows of consecutive transform segments. As the overlap leads to a higher delay, an MLT with 50% window overlap offers a good compromise. Additionally, the MLT also has other advantageous properties, such as higher stopband attenuation than the DCT.

The classical MLT belongs to the time-invariant transforms, in which the basis functions do not vary with time. However, in speech and audio coding systems (see, e.g., [3, 7, 8, 9, 10, 12, 17, 18, 23 and 25]), the properties and statistics of the input signal vary over time. Hence, an MLT with fixed block size cannot match the short-term properties of the input signal appropriately. One of the potential degradations in low bit-rate speech and audio coding is the so-called pre-echo phenomenon shown in Fig. 1: a spreading of quantization noise occurs in segments where the input signal is characterized by a sharp attack in the ending part of the transform block. After the inverse MLT, the quantization noise extends throughout the segment, so that in the reconstructed signal the quantization noise begins before the signal itself. Hence, it is called pre-echo. In speech and audio coding, it can be clearly heard in the reconstructed signal. The longer the transform block, the stronger the pre-echo distortion.




Fig. 1. Spreading of the quantization noise in time: pre-echo phenomenon. The solid line represents the input signal, the dotted line represents the reconstructed signal.
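
The pre-echo mechanism can be reproduced numerically. The following Python sketch is an illustration only, not the coder studied in this paper. It assumes NumPy, a sine window, the standard MLT/MDCT cosine basis, a uniform coefficient quantizer with an arbitrarily chosen step size, and a synthetic attack signal; the names `mlt_codec` and `pre_echo_lead` are hypothetical helpers introduced for this example.

```python
import numpy as np

def mlt_codec(x, M, step):
    """Analyse x with a size-M MLT, quantize uniformly, and resynthesize."""
    n = np.arange(2 * M)
    k = np.arange(M)[:, None]
    w = np.sin(np.pi * (n + 0.5) / (2 * M))               # sine window (assumed)
    C = np.sqrt(2.0 / M) * np.cos(np.pi / M * (n + (M + 1) / 2) * (k + 0.5))
    y = np.zeros_like(x)
    for pos in range(0, len(x) - 2 * M + 1, M):           # hop M = 50% overlap
        X = C @ (w * x[pos:pos + 2 * M])                   # windowed forward transform
        Xq = step * np.round(X / step)                     # uniform quantizer (assumed)
        y[pos:pos + 2 * M] += w * (C.T @ Xq)               # windowed inverse + overlap-add
    return y

def pre_echo_lead(x, y, onset, tol=1e-9):
    """Number of samples by which reconstruction error precedes the attack."""
    noisy = np.nonzero(np.abs(y[:onset] - x[:onset]) > tol)[0]
    return 0 if noisy.size == 0 else int(onset - noisy[0])

# Synthetic test signal: silence followed by a decaying sinusoid.
N, onset = 2048, 1000
t = np.arange(N - onset)
x = np.zeros(N)
x[onset:] = np.sin(0.3 * t) * np.exp(-t / 200.0)

lead_long = pre_echo_lead(x, mlt_codec(x, 256, 0.05), onset)
lead_short = pre_echo_lead(x, mlt_codec(x, 32, 0.05), onset)
print("pre-echo lead, M=256:", lead_long, "samples")
print("pre-echo lead, M=32: ", lead_short, "samples")
```

Because each quantized coefficient contributes a basis function spanning the whole 2M-sample block, the error of any block that contains the attack is spread over that entire block. With the short block size the error can therefore start at most one short block before the attack, whereas with the long block size it starts much earlier.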


Pre-echo can be reduced by using a time-varying transform, in which the basis functions and the block sizes are changed with time to match the short-time properties of the signal. In a similar manner, the MDCT applied in the pre-processing of a speech recognizer may be adapted to the short-time signal behavior [1]. The time-varying MLT has been used in some audio coding schemes to obtain better time-frequency tiling and to minimize the pre-echo distortion, e.g. in MPEG layer-3, AAC, and Dolby AC-3 [17]. In a speech and audio coding system the implementation of a time-varying MLT involves a classification of the transform size as, e.g., in [13]. When a sharp attack is detected, the MLT is automatically switched into a short-frame state. Furthermore, the prototype-window function has to be designed such that perfect reconstruction is maintained during transition periods.

A good review of the modulated lapped transform and its time-varying forms is presented by Shlien in [21]. A general design of the time-varying MLT can be realized using different methods [22, 6 and 20]. The most practical approach maintaining the perfect-reconstruction (PR) property is to construct an asymmetrical window, or to set different phase shifts in the basis transform functions, during transition periods, as is done in the audio coding standards MPEG layer-3 and Dolby AC-3. In this paper, we generalize the method used in MPEG layer-3 with a rigorous mathematical proof of its PR property. However, all PR conditions for time-varying filter banks are only valid in the absence of quantization. Once quantization effects are considered, the coding system is no longer perfectly reconstructing. This leads to the question of how the reconstruction distortion can be minimized by designing the time-varying MLT. The theory of the time-varying MLT was also discussed in [14], which considers the problem of time-varying orthonormal tilings of the time-frequency plane. In contrast to [14], this paper focuses on the behavior of the time-varying MLT in the presence of quantization and describes an algorithm for designing the time-varying MLT so as to minimize the reconstruction distortion in the presence of quantization.

Recently, frequency-coefficient quantization effects in filter-bank systems have received some attention in the literature. The design of the filter bank in the mean-square-error (MSE) sense has been studied in several papers [2, 5, 27, 26 and 4] with different quantization models. A brief review of the past work in this field can be found in the introduction of [4]. In particular, the so-called convolutional effects in transform coding with the MLT are discussed in [2], where a closed-form expression is suggested to approximate an optimal transform window offering a desired trade-off between the reduction of the time artifacts and the reduction of the stop-band leakage. To some degree, our work can be seen as an extension of these studies to transform-coding systems with time-varying MLT, but with a different analysis approach. In our work we model the quantization operation as a linear operation involving a multiplication and an addition, from which other quantization-noise models, such as a purely additive-noise model, can be deduced.

The usual optimization algorithm for designing minimum-MSE (MMSE) filter banks uses the autocorrelation function of the input signal, which therefore has to be known before the design. Some of the simulation experiments used in the literature [5, 27 and 4] to support the theoretical analysis apply a synthetic signal such as AR(1), or estimate the autocorrelation function off-line before the design. However, in a practical application with a non-stationary input, as in speech and audio coding, the autocorrelation function has to be estimated adaptively. Additionally, the estimated autocorrelation function must be transmitted to the synthesis side as side information in order to design the optimal synthesis filter. The theoretical optimization algorithm for designing MMSE filter banks is therefore difficult to use in practice. In our contribution, we present a suboptimal algorithm which can be used practically in a time-varying MMSE MLT for speech and audio coding.

The paper is organized as follows: Section 2 gives a review of the time-varying MLT theory. The PR conditions imposed on the prototype-window function in the absence of quantization are presented with a proof of the PR property. In Section 3 a general quantization-error model is given, on the basis of which the MSE of the total quantization error is formulated. In Section 4 the general MSE formulation is further simplified using different simplifying quantization-noise assumptions, and the suboptimal simplified algorithm is given. In Section 5 we present an example that demonstrates how to design a speech and audio coder using the time-varying MMSE MLT, with emphasis on the design of the transition-period filter. The improvement of the frequency characteristics over those of a normal time-varying MLT is shown. Section 6 summarizes the main results of this study.

2. Time-varying modulated lapped transform

2.1. Time-invariant modulated lapped transform

The time-invariant modulated transform is a special case of the lapped orthogonal transform (LOT), or uniform paraunitary filter banks, where the filter length L is equal to twice the number of subbands, i.e., L=2M. Furthermore, the MLT is the same as the oddly stacked TDAC filter bank, while the evenly stacked TDAC filter bank is not a lapped transform.

We first define the modulated lapped transform matrix P as

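\[
P = \begin{bmatrix}
p_0(0) & p_0(1) & \cdots & p_0(2M-1) \\
p_1(0) & p_1(1) & \cdots & p_1(2M-1) \\
\vdots & \vdots & & \vdots \\
p_{M-1}(0) & p_{M-1}(1) & \cdots & p_{M-1}(2M-1)
\end{bmatrix} \tag{1}
\]
with the basis functions

\[
p_k(n) = h(n)\sqrt{\frac{2}{M}}\cos\left[\left(n+\frac{M+1}{2}\right)\left(k+\frac{1}{2}\right)\frac{\pi}{M}\right], \tag{2}
\]
where k=0,1,…,M−1, n=0,1,…,2M−1, and h(n) is the window function describing the prototype filter; p_k(n) is the (k,n)th element of the matrix P.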

For the derivations that follow, the MLT matrix P of dimension M×L (L=2M) is divided into two square M×M submatrices as

P=[P0 P1].
Alternatively, the MLT can be expressed by the direct transform equation

(3)
Image
and inverse-transform equation

(4)
Image
where x_m(n) denotes the mth segment of the input signal and y_m(n) that of the output signal. Note that only the first M transform coefficients in X_m(k) are transmitted, and that the overlap-add (OLA) operation is needed after the inverse transform to obtain the reconstructed segment.
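
The interplay of the direct transform, the inverse transform and the overlap-add step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes NumPy, takes the direct transform to be the inner product of each 2M-sample segment with the rows of P, uses the standard MLT cosine basis with a sine window, and introduces the hypothetical helper names `mlt_basis`, `mlt_forward` and `mlt_inverse`.

```python
import numpy as np

def mlt_basis(M, h):
    """M x 2M matrix P whose (k, n) entry is the windowed basis function p_k(n)."""
    n = np.arange(2 * M)
    k = np.arange(M)[:, None]
    return h * np.sqrt(2.0 / M) * np.cos((n + (M + 1) / 2) * (k + 0.5) * np.pi / M)

def mlt_forward(x, P):
    """M coefficients per segment; consecutive segments overlap by M samples."""
    M = P.shape[0]
    starts = range(0, len(x) - 2 * M + 1, M)
    return np.array([P @ x[s:s + 2 * M] for s in starts])

def mlt_inverse(X, P, length):
    """Inverse transform of every segment followed by overlap-add."""
    M = P.shape[0]
    y = np.zeros(length)
    for m, coeffs in enumerate(X):
        y[m * M:m * M + 2 * M] += P.T @ coeffs
    return y

M = 8
h = np.sin((np.arange(2 * M) + 0.5) * np.pi / (2 * M))   # sine window (assumed)
P = mlt_basis(M, h)

rng = np.random.default_rng(1)
x = rng.standard_normal(10 * M)
X = mlt_forward(x, P)
y = mlt_inverse(X, P, len(x))

# The first and last M samples are covered by one segment only, so
# reconstruction is checked on the fully overlapped interior.
interior_error = np.max(np.abs(y[M:-M] - x[M:-M]))
print("segments x coefficients:", X.shape)
print("max interior reconstruction error:", interior_error)
```

Each segment alone does not reproduce its input, because the inverse transform of M coefficients contains time-domain aliasing; the aliasing terms of neighbouring segments cancel only in the overlap-add sum.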

It has been shown [16] that the necessary and sufficient conditions for the PR property of the MLT are

\[ P_0^T P_0 + P_1^T P_1 = I, \tag{5} \]

\[ P_1^T P_0 = 0, \tag{6} \]

\[ P_0^T P_1 = 0. \tag{7} \]
Condition (5) requires the orthogonality of the basis functions, while (6) and (7) are called orthogonality of the tail. These conditions are met if the window function h(n) is chosen such that

\[ h(n) = h(L-1-n), \tag{8} \]

\[ h^2(n) + h^2(n+M) = 1. \tag{9} \]
Several window functions satisfy constraints (5), (6) and (7), e.g. the rectangular window, the trapezoidal window and the sine window [2]. Among them, the sine window provides better out-of-band energy isolation and better attenuation of block-edge effects. Therefore, the sine window is widely used in speech and audio coding schemes [7, 18 and 25]; it has the form

\[ h(n) = \sin\left[\left(n+\frac{1}{2}\right)\frac{\pi}{2M}\right], \quad n=0,1,\ldots,2M-1. \tag{10} \]
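
Conditions (5), (6) and (7) can be checked numerically for a given window. The Python sketch below is an illustration under assumptions: it uses NumPy, the standard MLT cosine basis, and the standard sine window sin[(n+1/2)π/(2M)]; the helper names `mlt_matrix` and `pr_residuals` are hypothetical. A squared-sine window is added as a counterexample, since it is symmetric as required by (8) but violates (9).

```python
import numpy as np

def mlt_matrix(h):
    """M x 2M MLT matrix for a length-2M window h."""
    M = len(h) // 2
    n = np.arange(2 * M)
    k = np.arange(M)[:, None]
    return h * np.sqrt(2.0 / M) * np.cos((n + (M + 1) / 2) * (k + 0.5) * np.pi / M)

def pr_residuals(h):
    """Largest absolute deviations from conditions (5), (6) and (7)."""
    P = mlt_matrix(h)
    M = P.shape[0]
    P0, P1 = P[:, :M], P[:, M:]
    return (
        np.abs(P0.T @ P0 + P1.T @ P1 - np.eye(M)).max(),   # condition (5)
        np.abs(P1.T @ P0).max(),                           # condition (6)
        np.abs(P0.T @ P1).max(),                           # condition (7)
    )

M = 16
theta = (np.arange(2 * M) + 0.5) * np.pi / (2 * M)
windows = {
    "sine": np.sin(theta),
    "rectangular": np.full(2 * M, np.sqrt(0.5)),
    "squared sine": np.sin(theta) ** 2,    # symmetric, but h^2(n) + h^2(n+M) != 1
}
residuals = {name: pr_residuals(h) for name, h in windows.items()}
for name, res in residuals.items():
    print(f"{name:>12s}: " + ", ".join(f"{r:.2e}" for r in res))
```

In this sketch the tail conditions (6) and (7) hold for any window, because they follow from the cosine kernel alone; the window constraints (8) and (9) are what make condition (5) hold.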

2.2. Time-varying forms of the MLT

In the description and analysis of the time-varying MLT, it is convenient to define an infinite transform matrix T(m) as

(11)
Image
where P0(m) and P1(m) are submatrices of size M×M that contain the basis functions p_k(n) in (2), and m is the block number of the input signal. Defining x^T = [⋯ x(−1) x(0) x(1) ⋯] and Image as the infinite vectors for the input and reconstructed signals, respectively, we have

Image

In the time-invariant MLT, T(m) remains unchanged with the block number m, i.e., the transform size M is constant. The PR conditions can then be simply expressed by (5), (6) and (7). In the time-varying MLT, however, T(m) is no longer constant, since the transform size M changes with the block number m. In this case, the PR conditions must be rewritten to ensure the orthogonality between P(m) and its neighbors P(m±1). From T^T(m) T(m) = I, the PR conditions for the time-varying MLT can be expressed as

\[ P_0^T(m) P_0(m) + P_1^T(m-1) P_1(m-1) = I, \tag{12} \]

\[ P_1^T(m) P_1(m) + P_0^T(m+1) P_0(m+1) = I, \tag{13} \]

\[ P_0^T(m) P_1(m) = 0. \tag{14} \]

In this case, choosing the prototype window h(n) according to criteria (5), (6) and (7) can no longer ensure the PR conditions, because the transform size changes in the transition region between MLTs of different sizes. Therefore, constructing a transition filter that satisfies the PR conditions is the central task in designing a time-varying MLT. The most practical and simple approaches construct an asymmetric prototype window h(n) for transition periods according to the upper and lower boundary transform windows. Among these, the method used in the audio coding standards MPEG layer-3 and AAC has been widely adopted because of its better frequency characteristics during transition periods [21]. In the following, we generalize this method with a mathematical proof of its PR property.

Theorem.  Suppose that in the time-varying MLT, the transform size (block size) is changed at block m from M1 to M2, where M1=rM2, r=1,2,… , and at block m+1 the transform size is changed again from M2 back to M1. This time-varying MLT has perfect reconstruction if the boundary window functions are constructed in the following way:

The window function for the size-M1 transform is

(15)
Image
The window function at the start transition from M1 to M2 is

(16)
Image
The window function for the size-M2 transform is

(17)
Image
Between the two transition periods (from start to stop), r size-M2 MLT operations are performed. The first size-M2 MLT starts at point (M1−M2)/2; the last size-M2 MLT ends at point (3M1+M2)/2.
The window function at the stop transition from M2 back to M1 is

(18)
Image

An example of the prototype window of a time-varying MLT as constructed according to the above prescription is illustrated in Fig. 2, with M1=4M2.




Fig. 2. An example of the prototype window function for the time-varying MLT with PR in transition periods, where M1/M2=4.
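
A window sequence of this kind can be tested numerically for perfect reconstruction. The Python sketch below is an assumption-laden illustration, not a transcription of (15) to (18): it assumes NumPy and builds the start and stop windows in the commonly described MPEG layer-3 style from a long sine half-window, a flat section equal to one, a short sine half-window and a zero section. The first short transform is placed (M1−M2)/2 samples into the block, and all helper names (`sine_rise`, `kernel`, `make_windows`, `analyse_synthesise`) are hypothetical.

```python
import numpy as np

def sine_rise(M):
    """Rising half of the length-2M sine window."""
    return np.sin(np.pi * (np.arange(M) + 0.5) / (2 * M))

def kernel(M):
    """Unwindowed M x 2M MLT cosine kernel (h(n) = 1)."""
    n = np.arange(2 * M)
    k = np.arange(M)[:, None]
    return np.sqrt(2.0 / M) * np.cos(np.pi / M * (n + (M + 1) / 2) * (k + 0.5))

def make_windows(M1, M2):
    """Long, start, short and stop windows (assumed MPEG layer-3 style shapes)."""
    a = (M1 - M2) // 2                       # length of the flat and the zero section
    r1, r2 = sine_rise(M1), sine_rise(M2)
    long_w = np.concatenate([r1, r1[::-1]])
    short_w = np.concatenate([r2, r2[::-1]])
    start_w = np.concatenate([r1, np.ones(a), r2[::-1], np.zeros(a)])
    stop_w = start_w[::-1]                   # time-reversed start window
    return long_w, start_w, short_w, stop_w

def analyse_synthesise(x, frames):
    """Windowed transform, inverse transform and overlap-add for each frame."""
    y = np.zeros_like(x)
    for pos, w in frames:
        M = len(w) // 2
        C = kernel(M)
        X = C @ (w * x[pos:pos + 2 * M])     # M coefficients for this frame
        y[pos:pos + 2 * M] += w * (C.T @ X)
    return y

M1, M2 = 16, 4                               # M1 = r * M2 with r = 4
r, a = M1 // M2, (M1 - M2) // 2
long_w, start_w, short_w, stop_w = make_windows(M1, M2)

# Frame schedule: long, start, r short transforms, stop, long.
frames = [(0, long_w), (M1, start_w)]
frames += [(2 * M1 + a + j * M2, short_w) for j in range(r)]
frames += [(3 * M1, stop_w), (4 * M1, long_w)]

rng = np.random.default_rng(0)
x = rng.standard_normal(6 * M1)
y = analyse_synthesise(x, frames)
max_error = np.max(np.abs(y[M1:5 * M1] - x[M1:5 * M1]))
print("max reconstruction error in the fully overlapped region:", max_error)
```

The r short transforms produce r·M2 = M1 coefficients, so the number of coefficients per M1 input samples is unchanged during the transition. Relative to the start of their block, the short transforms in this sketch span the interval from (M1−M2)/2 to (3M1+M2)/2.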


Before proving the theorem, we rewrite the PR conditions (12) and (13) in terms of the window functions in the theorem. In block-matrix notation, we have

(19)
Image


(20)
Image


(21)
Image
where C and C1 correspond to the blocks with constant windows and are defined as

Image


Image
in which the basis functions p_k(n) are defined as in (2), but with h(n)=1. Substituting (19), (20) and (21) into (11) gives

(22)
Image
Note that the block number m has been dropped in the above matrices because the transform size is assumed to be unchanged outside the transition periods. Using (12), (13) and (14), we obtain the PR conditions for the left transition period (start transition)

\[ C^T C = I, \tag{23} \]

\[ A_{11}^T A_{11} + A_0^T A_0 = I, \tag{24} \]

\[ P_0^T \, [\, C \;\; A_{11} \;\; 0 \,] = 0, \tag{25} \]

\[ A_{11}^T C = 0, \quad C A_{11}^T = 0 \tag{26} \]
and for the right transition period (stop transition)

\[ C_1^T C_1 = I, \tag{27} \]

\[ A_{10}^T A_{10} + A_1^T A_1 = I, \tag{28} \]

\[ [\, 0 \;\; A_{10} \;\; C_1 \,] \, P_1^T = 0, \tag{29} \]

\[ A_{10}^T C_1 = 0, \quad C_1 A_{10}^T = 0. \tag{30} \]
The time-varying MLT constructed according to the theorem satisfies the above PR conditions in the transition periods. The proof is given in the appendix.

3. Analysis of the quantization effects in the time-varying MLT

In this section we analyze the quantization effects in the time-varying MLT. In transform coding of speech and audio with the MLT, quantization is a central component: both the bit-rate reduction and the quality depend on proper quantization of the MLT coefficients. Fig. 3 depicts the MLT signal processing with quantization.




Fig. 3. Signal processing of the MLT with quantization.


In the frequency domain, the quantization effects can be considered as a spectral modulation. In general, this effect is modeled as a combination of both multiplicative terms Gm(k) and additive terms Em(k) in segment m. Then, the quantized MLT coefficients can be expressed as

\[ \hat{X}_m(k) = G_m(k)\,X_m(k) + E_m(k), \tag{31} \]
and in the time domain as

\[ \hat{x}_m(n) = x_m(n) \circledast g_m(n) + e_m(n), \tag{32} \]
where g_m(n) and e_m(n) are the inverse MLTs of G_m(k) and E_m(k), and ⊛ denotes the circular convolution of x(n) and g_m(n). Eq. (32) indicates that part of the quantization noise is introduced through a convolution operation. This is the so-called convolution effect in transform coding with the MLT [2].

The quantization effect appears especially in low-bit-rate coding systems, where the MLT coefficients are quantized with few bits, and in the high-frequency region some MLT coefficients may be dropped totally. In high-bit-rate coding, the multiplicative term Gm(k) varies slowly with k, and can be considered as constant. Actually, expression (31) is the description of a gain-plus-additive noise model. Commonly, the term em(n) has random behavior, and depends on the bit rate and quantizer. When a probability density function (pdf)-optimized quantizer is used, Gm(k) can be expressed as a function of the word length R allocated to quantize the kth MLT coefficient [5]

\[ G_m(k) = 1 - \beta(R)\,2^{-2R}. \tag{33} \]
Here β(R) depends on the pdf and R, and can be determined using rate-distortion theory. However, if the quantizer is not pdf-optimized, the term G_m(k) cannot be expressed using (33). Another simple method is to model G_m(k) as a lowpass filter, as done in [2]. This is based on the fact that some high-frequency MLT coefficients are set to zero at low bit rates.
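
The gain-plus-additive-noise decomposition can be illustrated numerically. The Python sketch below is an illustration under assumptions, not the quantizer used in this paper: it assumes NumPy, models a single MLT coefficient as a zero-mean Gaussian variable, trains a pdf-optimized (Lloyd-Max) quantizer on the samples themselves, and estimates the gain as the least-squares coefficient G = E[X Q(X)] / E[X^2]. The function name `lloyd_max_quantize` is hypothetical.

```python
import numpy as np

def lloyd_max_quantize(samples, bits, iters=50):
    """Quantize samples with a Lloyd-Max codebook trained on the same samples."""
    levels = np.quantile(samples, (np.arange(2 ** bits) + 0.5) / 2 ** bits)
    for _ in range(iters):
        edges = 0.5 * (levels[:-1] + levels[1:])            # nearest-neighbour cells
        idx = np.searchsorted(edges, samples)               # cell index of each sample
        counts = np.bincount(idx, minlength=levels.size)
        sums = np.bincount(idx, weights=samples, minlength=levels.size)
        filled = counts > 0
        levels[filled] = sums[filled] / counts[filled]      # centroid update
    return levels[idx]

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)       # stand-in for one MLT coefficient (assumed Gaussian)
power = np.mean(x ** 2)

results = {}
for R in (1, 2, 3, 4):
    q = lloyd_max_quantize(x, R)
    gain = np.mean(x * q) / power      # least-squares gain G
    distortion = np.mean((x - q) ** 2) # quantization distortion D
    results[R] = (gain, distortion / power)
    print(f"R={R}: G={gain:.4f}, 1 - D/power={1 - distortion / power:.4f}")
```

Because every output level is the centroid of its cell, the residual Q(X) − G·X is uncorrelated with X and the gain equals one minus the normalized distortion. If that normalized distortion is written as β(R)·2^(−2R), the result has the form of (33); for a quantizer that is not pdf-optimized, this identity does not hold in general.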

In speech and audio coding, quantization errors are unavoidable; therefore, their minimization is an important task in a coder design. Here, we use the MMSE as a criterion, and minimize it through optimal design of the window function for the MLT. Note that with such an optimal window the PR property will be lost, but the coder is optimal in the sense of minimum MSE. This optimal window leads to the “MMSE-MLT”. In the following we present an analytical formulation of the MSE.

Besides violating the PR condition, MMSE of course does not mean “optimum subjective quality” [10]. The latter, however, certainly does not require PR either. A formal (and reliable) optimization criterion aiming at the quality is difficult to formulate, as is well known, and it would in any case be much more difficult to evaluate in an optimization. A relatively simple step towards this aim might be the inclusion of a fixed frequency weighting following the hearing threshold; a more effective one would be the inclusion of adaptive masking curves and non-linear compressions in the minimization. The latter idea, however, would be really useful only if variable or, at least, very large bit rates were allowed, so that just (un-)noticeable errors could be aimed at. Such (simple or complex) weighted-MMSE concepts are beyond the scope of this work.

3.1. Regular period

A regular period is defined to be a period in which the MLT transform block size is constant. In a regular period the analysis and synthesis matrix can be obtained using the expression in (2).

We define the total quantization error as the difference between the reconstructed signal Image and the corresponding input signal Image with M samples (M being the transform-block size):

(34)
Image
where Image and Image are defined by

(35)
Image


(36)
Image

The delay between input and output is L=2M samples. We suppose that in all segments the mean-square quantization error has the same form. For analysis we define the following vectors and matrices:

(37)
Image


(38)
Image


\[ X_m^T = [\, X_{m-1}(0) \;\; X_{m-1}(1) \;\; \cdots \;\; X_{m-1}(M-1) \;\; X_m(0) \;\; X_m(1) \;\; \cdots \;\; X_m(M-1) \,], \tag{39} \]

\[ E^T = [\, E_{m-1}(0) \;\; E_{m-1}(1) \;\; \cdots \;\; E_{m-1}(M-1) \;\; E_m(0) \;\; E_m(1) \;\; \cdots \;\; E_m(M-1) \,], \tag{40} \]

\[ G = \mathrm{diag}\{\, G_{m-1}(0) \;\; G_{m-1}(1) \;\; \cdots \;\; G_{m-1}(M-1) \;\; G_m(0) \;\; G_m(1) \;\; \cdots \;\; G_m(M-1) \,\}, \tag{41} \]


(42)
Image


(43)
Image
Here T is the analysis matrix, and F is the synthesis matrix taking into account the overlap-add (OLA) operation.

From Fig. 3, Image can be written as

(44)
Image
Commonly, the MSE is defined as the average of the M variances. In the following analysis we suppose that the multiplicative term G_m(k) varies with k more slowly than the unquantized transform coefficients X(k), so that G_m(k) can be considered constant. Using (34) and (44), the MSE is obtained as

(45)
Image
where

(46)
Image


(47)
Image
The terms R_xx and R_ee are the autocorrelation matrices of the input signal and the additive noise, respectively. R_Xe is the covariance matrix of the unquantized transform coefficients and the additive noise, and f_i is the ith row of the matrix F. In (45) the MSE is divided into two parts: σ_{d1}^2 describes the deviation from perfect reconstruction due to the quantization operation, while σ_{d2}^2 is the random-noise component introduced by the quantization procedure.

The expression (45) can be considered as a general formulation of the mean-square quantization errors in coding systems with the MLT, which is suitable for different quantization models.

Additive model. In this model the multiplicative term is equal to unity, G=I. The quantization noise consists of the random term E_m(k) only, which can be assumed to be white or colored noise. An example is shown in [4], where an expression similar to (45) is also presented, but without the multiplicative term.
Gain-plus-additive model. This model is widely used in quantization-effect analyses. With a pdf-optimized quantizer, the multiplicative term becomes constant in (45). An application example is shown in [27], where the polyphase-decomposition method is used to calculate the MSE.
Multiplicative model. This model is mainly used to analyze the circular-convolution effect in (32). Here, the additive noise is considered to be much smaller than the “noise” due to circular convolution and can be ignored. An example is shown in [2], where the distortion is formulated in time domain using ODFT with the assumption that Gm(k) has a low-pass character.
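
The three models can be compared in a small simulation. The Python sketch below is an illustration only and does not evaluate expression (45): it assumes NumPy, a sine-windowed MLT, a white Gaussian input, and arbitrarily chosen hypothetical parameters (a noise standard deviation of 0.1, a constant gain of 0.95, and an ideal lowpass G that zeroes the upper half of the coefficients). The helper names `mlt_matrix` and `coded_mse` are made up for this example.

```python
import numpy as np

def mlt_matrix(M):
    """M x 2M MLT matrix with a sine window (assumed)."""
    n = np.arange(2 * M)
    k = np.arange(M)[:, None]
    h = np.sin(np.pi * (n + 0.5) / (2 * M))
    return h * np.sqrt(2.0 / M) * np.cos(np.pi / M * (n + (M + 1) / 2) * (k + 0.5))

def coded_mse(x, M, gain, noise_std, rng):
    """MSE after applying X_hat = gain * X + E to every block of MLT coefficients."""
    P = mlt_matrix(M)
    y = np.zeros_like(x)
    for pos in range(0, len(x) - 2 * M + 1, M):
        X = P @ x[pos:pos + 2 * M]
        X_hat = gain * X + noise_std * rng.standard_normal(M)   # quantization model
        y[pos:pos + 2 * M] += P.T @ X_hat                       # inverse MLT + overlap-add
    return np.mean((y[M:-M] - x[M:-M]) ** 2)                    # fully overlapped part only

M = 32
rng = np.random.default_rng(0)
x = rng.standard_normal(64 * M)
lowpass = (np.arange(M) < M // 2).astype(float)

# (G, standard deviation of E) for each model; all values are illustrative.
models = {
    "no quantization": (np.ones(M), 0.0),
    "additive": (np.ones(M), 0.1),
    "gain-plus-additive": (np.full(M, 0.95), 0.1),
    "multiplicative": (lowpass, 0.0),
}
mse = {name: coded_mse(x, M, g, s, rng) for name, (g, s) in models.items()}
for name, value in mse.items():
    print(f"{name:>20s}: MSE = {value:.4f}")
```

With G=I and no noise the system is perfectly reconstructing, so the measured MSE is at the level of rounding error. In the additive case the orthogonality of the transform keeps the output-noise power close to the coefficient-noise power, and in the multiplicative case the error consists entirely of the signal components removed by G.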

3.2. Transition period

A transition period was defined in Section 2.2 as a period in which one regular period changes into another. There are two kinds of transition periods, namely a “start” period from regular period 1 to regular period 2 and a “stop” period from regular period 2 back to regular period 1. The corresponding transform matrices have different basis functions. If the transition window functions are constructed according to (15), (16), (17) and (18), the basis functions are given by

(48)
Image


(49)
Image
From (19) and (21), it follows that the analysis and synthesis matrices T and F can be partitioned into three submatrices in a transition period (both for “start” and “stop”). Therefore, we can separate the mean-square quantization error (MSE) into three parts according to

\[ \sigma_d^2 = D_1 + D_2 + D_3, \tag{50} \]
where D1 corresponds to the MSE in the first regular period, D2 to that in the region in which the window is equal to unity, and D3 to that in the second regular period. Minimizing σ_d^2 then amounts to minimizing D1, D2 and D3 separately.

4. Optimization algorithm

The optimization procedure aims at minimizing the MSE in (45) through proper design of the prototype window. From (45), we know that the optimization algorithm depends on the input signal statistics. Furthermore, direct use of (45) is difficult due to its complexity. In order to make a numerical solution feasible, a further simplification of expression (45) is necessary. However, this depends on the quantization model to be used. The goal of this section is not to formulate a concrete simplified MSE expression for simulation purpose, because such a formulation for different quantization models has been reported in the literature [2, 5, 27, 26 and