Type-II/III DCT/DST algorithms with reduced number of arithmetic operations
Introduction
In this paper, we describe recursive algorithms for the type-II and type-III discrete cosine and sine transforms (DCT-II and DCT-III, and also DST-II and DST-III), of power-of-two sizes, that require fewer total real additions and multiplications than previously published algorithms (with an asymptotic reduction of about 5.6%), without sacrificing numerical accuracy. Our DCT and DST algorithms are based on a recently published fast Fourier transform (FFT) algorithm, which reduced the operation count for the discrete Fourier transform (DFT) compared to the previous best split-radix algorithm [1]. The gains in this new FFT algorithm, and consequently in the new DCTs and DSTs, stem from a recursive rescaling of the internal multiplicative factors within an algorithm called a “conjugate-pair” split-radix FFT [2], [3], [4], [5] so as to simplify some of the multiplications. In order to derive a DCT algorithm from this FFT, we simply consider the DCT-II to be a special case of a DFT with real input of a certain even symmetry, and discard the redundant operations [1], [6], [7], [8], [9], [10]. The DCT-III, DST-II, and DST-III have identical operation counts to the DCT-II of the same size, since the algorithms are related by simple transpositions, permutations, and sign flips [11], [12], [13].
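This symmetry argument can be made concrete with a short sketch (function names are ours, and the naive O(N^2) transforms are for illustration only): embedding the N real inputs into a length-4N sequence with the appropriate even symmetry makes the first N DFT outputs equal the DCT-II, here in the scale-2 convention used later in the paper.

```python
import cmath
import math

def dct2_direct(x):
    # DCT-II with a scale factor of 2 in front of the sum.
    N = len(x)
    return [2 * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                    for n in range(N))
            for k in range(N)]

def dct2_via_dft(x):
    # Embed x into a length-4N real sequence with even symmetry:
    # y[2n+1] = y[4N-2n-1] = x[n], all even-indexed entries zero.
    N = len(x)
    y = [0.0] * (4 * N)
    for n in range(N):
        y[2 * n + 1] = x[n]
        y[4 * N - 2 * n - 1] = x[n]
    # Naive DFT of y; its first N outputs are real and equal the DCT-II,
    # so a real-even DFT contains the DCT-II as a special case.
    X = []
    for k in range(N):
        s = sum(y[m] * cmath.exp(-2j * cmath.pi * m * k / (4 * N))
                for m in range(4 * N))
        X.append(s.real)  # imaginary parts vanish by symmetry
    return X
```

Pruning the operations on the zero and duplicated entries of y is what turns such an embedding into a fast DCT algorithm.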
Since 1968, the lowest total count of real additions and multiplications, herein called "flops" (floating-point operations), for the DFT of a power-of-two size N = 2^m was achieved by the split-radix algorithm, with 4N log2 N − 6N + 8 flops for N > 1 [6], [8], [14], [15], [16]. This count was recently surpassed by new algorithms achieving a flop count of (34/9)N log2 N + O(N) [1], [17]. Similarly, the lowest-known flop count for the DCT-II of size N = 2^m was previously 2N log2 N + O(N) for a unitary normalization (with the additive terms depending on the normalization) [6], [7], [12], [13], [18], [19], [20], [21], [22], [23], [24], [25], [26], and could be achieved by starting with the split-radix FFT and discarding redundant operations [6], [7]. (Many DCT algorithms with an unreported or larger flop count have also been described [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40].) Based on our new FFT algorithm, the flop counts for the various DCT types were reduced using a code generator [10], [41] that automatically pruned the redundant operations from an FFT with a given symmetry, but neither an explicit algorithm nor a general formula for the flop count was presented except for the DCT-I [1]. In this paper, we use the same starting point to "manually" derive a DCT-II algorithm by pruning redundant operations from a real-even FFT, and give the general formula for the new flop count (for N = 2^m), whose leading term is (17/9)N log2 N. The first savings over the previous record occur at small N and are summarized in Table 1 for several N. We also consider the effect of normalization on this flop count: the above count is for a unitary transform, but slightly different counts are obtained by other choices. Moreover, we show that a further N multiplications can be saved by individually rescaling every output of the DCT-II (or input of the DCT-III). In doing so, we generalize a result by Arai et al., who showed that eight multiplications could be saved by scaling the outputs of a DCT-II of size 8 [42], a result commonly applied to JPEG compression [43].
If we merely wished to show that the DCT-II/III could be computed in (17/9)N log2 N + O(N) operations, we could do so by using well-known techniques to re-express the DCT in terms of a real-input DFT of length N plus O(N) pre/post-processing operations [29], [30], [33], [44], [45], [46], and then substituting our new FFT, which requires (17/9)N log2 N + O(N) operations for real inputs [1]. However, with FFT and DCT algorithms there is great interest in obtaining not only the best possible asymptotic constant factor, but also the best possible exact count of arithmetic operations (which, for the DCT-II of power-of-two sizes, had not changed by even one operation for over 20 years). Our result (1) is intended as a new upper bound on this (still unknown) minimum exact count, and therefore we have done our best with the lower-order terms as well as the asymptotic constant. It turns out, in fact, that our algorithm achieving (1) is closely related to well-known algorithms expressing the DCT-II in terms of a real-input FFT of the same length, but it does not seem obvious a priori that this is what one obtains by pruning our FFT for symmetric data.
In the following sections, we first review how a DCT-II may be expressed as a special case of a DFT, and how the previous optimum flop count can be achieved by pruning redundant operations from a conjugate-pair split-radix FFT. Then, we briefly review the new FFT algorithm presented in [1], and derive the new DCT-II algorithm. We follow by considering the effect of normalization and scaling. Finally, we consider the extension of this algorithm to algorithms for the DCT-III, DST-II, and DST-III. We close with some concluding remarks about future directions. We emphasize that no proven tight lower bound on the DCT-II flop count is currently known, and we make no claim that Eq. (1) is the lowest possible (although we have endeavored not to miss any obvious optimizations).
Section snippets
DCT-II in terms of DFT
Various forms of DCT have been defined, corresponding to different boundary conditions on the transform [47]. Perhaps the most common form is the type-II DCT, used in image compression [43] and many other applications. The DCT-II is typically defined as a real, orthogonal (unitary), linear transformation by the formula (for k = 0, ..., N − 1):
X_k = \sqrt{\frac{2}{N(1+\delta_{k0})}} \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right]
for N inputs x_n and N outputs X_k, where \delta_{k0} is the Kronecker delta (\delta_{k0} = 1 for k = 0 and 0 otherwise). However, we wish to emphasize
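A direct O(N^2) implementation of this unitary definition (function names are ours) makes the orthogonality explicit: composing the transform with its transpose, which is the unitary DCT-III, recovers the input.

```python
import math

def dct2_unitary(x):
    # Unitary DCT-II: X_k = sqrt(2/N) * s_k * sum_n x_n cos(pi (n+1/2) k / N),
    # with s_0 = 1/sqrt(2) and s_k = 1 otherwise.
    N = len(x)
    out = []
    for k in range(N):
        s = math.sqrt(0.5) if k == 0 else 1.0
        out.append(math.sqrt(2.0 / N) * s *
                   sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                       for n in range(N)))
    return out

def dct3_unitary(X):
    # The transpose of the unitary DCT-II matrix, i.e. the unitary DCT-III;
    # by orthogonality it is also the inverse of dct2_unitary.
    N = len(X)
    return [sum((math.sqrt(0.5) if k == 0 else 1.0) * math.sqrt(2.0 / N) *
                X[k] * math.cos(math.pi * (n + 0.5) * k / N)
                for k in range(N))
            for n in range(N)]
```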
Conjugate-pair FFT and DCT-II
Although the previous minimum flop count for DCT-II algorithms can be derived from the ordinary split-radix FFT algorithm [6], [7] (and it can also be derived in other ways [12], [13], [18], [19], [20], [21], [22], [23], [24], [25], [26]), here we will do the same thing using a variant dubbed the “conjugate-pair” split-radix FFT. This algorithm was originally proposed to reduce the number of flops [2], but was later shown to have the same flop count as ordinary split-radix [3], [4], [5]. It
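A compact, unoptimized sketch of the conjugate-pair recursion (names are ours): the two odd subsequences x_{4m+1} and x_{4m−1} are paired so that their twiddle factors are the complex conjugates ω^k and ω^{−k}, which is the property that the rescaling of [1] later exploits.

```python
import cmath

def cp_split_radix_fft(x):
    # Conjugate-pair split-radix DFT: X_k = sum_n x_n exp(-2*pi*i*n*k/N),
    # for power-of-two N.
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    U = cp_split_radix_fft(x[0::2])    # even-indexed terms, size N/2
    Z = cp_split_radix_fft(x[1::4])    # x_{4m+1}, size N/4
    Zp = cp_split_radix_fft([x[(4 * m - 1) % N] for m in range(N // 4)])  # x_{4m-1}
    X = [0j] * N
    for k in range(N // 4):
        w = cmath.exp(-2j * cmath.pi * k / N)    # twiddle factor omega^k
        # Conjugate-pair twiddles: omega^k and omega^{-k} = conj(omega^k).
        t1 = w * Z[k] + w.conjugate() * Zp[k]
        t2 = 1j * (w * Z[k] - w.conjugate() * Zp[k])
        X[k] = U[k] + t1
        X[k + N // 2] = U[k] - t1
        X[k + N // 4] = U[k + N // 4] - t2
        X[k + 3 * N // 4] = U[k + N // 4] + t2
    return X
```

Restricting this recursion to real-even inputs and discarding the redundant operations yields the previous best DCT-II count.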
New FFT and DCT-II
In this section, we first review the new FFT algorithm introduced in our previous work based on a recursive rescaling of the conjugate-pair FFT [1], and then apply it to derive a fast DCT-II algorithm as in the previous section.
Normalizations
In the above definition of the DCT-II, we use a scale factor of 2 in front of the summation in order to make the DCT-II directly equivalent to the corresponding DFT. But in some circumstances it is useful to multiply by other factors, and different normalizations lead to slightly different flop counts. For example, one could use the unitary normalization from Eq. (2), which replaces the 2 by \sqrt{2/N} (or, for k = 0, by \sqrt{1/N}) and requires the same number of flops. If one uses the unitary normalization multiplied by
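Since the conventions differ only by a per-output scale, converting between them is a diagonal rescaling; a minimal sketch with our own function names:

```python
import math

def dct2_scale2(x):
    # DCT-II with the factor 2 in front of the sum (the convention used here).
    N = len(x)
    return [2 * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                    for n in range(N))
            for k in range(N)]

def to_unitary(X):
    # Rescale scale-2 outputs to the unitary normalization:
    # replace the factor 2 by sqrt(2/N), or by sqrt(1/N) for k = 0.
    N = len(X)
    return [X[k] * (math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) / 2
            for k in range(N)]
```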
Fast DCT-III, DST-II, and DST-III
Given any algorithm for the DCT-II, one can immediately obtain algorithms for the DCT-III, DST-II, and DST-III with exactly the same number of arithmetic operations. In this way, any improvement in the DCT-II, such as the one described in this paper, immediately leads to improved algorithms for those transforms and vice versa. In this section, we review the equivalence between those transforms.
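One of these equivalences is easy to check numerically: the DST-II of x equals the DCT-II of the sign-flipped sequence (−1)^n x_n with its outputs read in reverse order, so any DCT-II flop count carries over to the DST-II unchanged. A sketch with naive O(N^2) transforms (our names, unnormalized sums):

```python
import math

def dct2(x):
    # Unnormalized DCT-II: C_k = sum_n x_n cos(pi (n+1/2) k / N).
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

def dst2(x):
    # Unnormalized DST-II: S_k = sum_n x_n sin(pi (n+1/2) (k+1) / N).
    N = len(x)
    return [sum(x[n] * math.sin(math.pi * (n + 0.5) * (k + 1) / N) for n in range(N))
            for k in range(N)]

def dst2_via_dct2(x):
    # Flip the sign of odd-indexed inputs, take the DCT-II, reverse the outputs.
    flipped = [(-1) ** n * xn for n, xn in enumerate(x)]
    return dct2(flipped)[::-1]
```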
Conclusion
We have shown that an improved count of real additions and multiplications (flops), Eq. (1), can be obtained for the DCT/DST types II/III by pruning redundant operations from a recent improved FFT algorithm. We have also shown how to save N multiplications by rescaling the outputs of a DCT-II (or the inputs of a DCT-III), generalizing a well-known technique for N = 8 [42]. However, we do not claim that our new count is optimal—it may be that further gains can be obtained by, for example,
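The rescaling idea can be sketched abstractly (the scale factors below are hypothetical stand-ins, not the ones produced by the recursive rescaling of [1]): when each output is consumed by a subsequent multiplication anyway, as in JPEG quantization, the per-output factors fold into those constants for free, so the scaled transform omits one multiplication per output.

```python
import math

def dct2(x):
    # Scale-2 DCT-II (naive O(N^2) reference).
    N = len(x)
    return [2 * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                    for n in range(N))
            for k in range(N)]

def scales(N):
    # Hypothetical per-output scale factors, for illustration only.
    return [math.cos(math.pi * k / (4 * N)) for k in range(N)]

def scaled_dct2(x):
    # A scaled transform computes C_k / s_k; here we divide explicitly, but
    # a scaled fast algorithm would simply never perform these multiplies.
    s = scales(len(x))
    return [c / s[k] for k, c in enumerate(dct2(x))]

def quantize(X, q):
    # A consumer that multiplies each output by a constant anyway (as in JPEG).
    return [Xk * qk for Xk, qk in zip(X, q)]
```

Feeding the scaled outputs into quantization constants premultiplied by s_k gives the same result as quantizing the exact DCT-II, which is how the N saved multiplications are absorbed.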
Acknowledgments
This work was supported in part by a grant from the MIT Undergraduate Research Opportunities Program. The authors are also grateful to M. Frigo, co-author of FFTW with S.G. Johnson [10], for many helpful discussions.
References (53)
- Simple FFT and DCT algorithms with reduced number of operations, Signal Processing (1984)
- Fast Fourier transforms: a tutorial review and a state of the art, Signal Processing (April 1990)
- Two new algorithms based on product system for discrete cosine transform, Signal Processing (2001)
- Fast and numerically stable algorithms for discrete cosine transforms, Linear Algebra Appl. (2005)
- Vectorizing the FFTs
- A modified split-radix FFT with fewer arithmetic operations, IEEE Trans. Signal Process. (2007)
- Conjugate pair fast Fourier transform, Electron. Lett. (1989)
- Comment: conjugate pair fast Fourier transform, Electron. Lett. (1989)
- Comment: conjugate pair fast Fourier transform, Electron. Lett. (1990)
- Comment: conjugate pair fast Fourier transform, Electron. Lett. (1992)
- Implementation of “split-radix” FFT algorithms for complex, real, and real-symmetric data, IEEE Trans. Acoust. Speech Signal Process.
- The design and implementation of FFTW3, Proc. IEEE
- A fast algorithm for the discrete sine transform implemented by the fast cosine transform, IEEE Trans. Acoust. Speech Signal Process.
- Direct methods for computing discrete sinusoidal transforms, IEE Proc. F
- Restructured recursive DCT and DST algorithms, IEEE Trans. Signal Process.
- An economical method for calculating the discrete Fourier transform
- Split-radix FFT algorithm, Electron. Lett.
- Recursive cyclotomic factorization—a new algorithm for calculating the discrete Fourier transform, IEEE Trans. Acoust. Speech Signal Process.
- A new matrix approach to real FFTs and convolutions of length, Computing
- A new algorithm to compute the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process.
- On computing the discrete Fourier and cosine transforms, IEEE Trans. Acoust. Speech Signal Process.
- Fast algorithms for the DFT and other sinusoidal transforms, IEEE Trans. Acoust. Speech Signal Process.
- A fast algorithm for computing the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process.
- Fast cosine transform based on the successive doubling method, Electron. Lett.
- Fast algorithm for computing discrete cosine transform, IEEE Trans. Signal Process.