Type-II/III DCT/DST algorithms with reduced number of arithmetic operations
Introduction
In this paper, we describe recursive algorithms for the type-II and type-III discrete cosine and sine transforms (DCT-II and DCT-III, and also DST-II and DST-III), of power-of-two sizes, that require fewer total real additions and multiplications than previously published algorithms (with an asymptotic reduction of about 5.6%), without sacrificing numerical accuracy. Our DCT and DST algorithms are based on a recently published fast Fourier transform (FFT) algorithm, which reduced the operation count for the discrete Fourier transform (DFT) compared to the previous best split-radix algorithm [1]. The gains in this new FFT algorithm, and consequently in the new DCTs and DSTs, stem from a recursive rescaling of the internal multiplicative factors within an algorithm called a “conjugate-pair” split-radix FFT [2], [3], [4], [5] so as to simplify some of the multiplications. In order to derive a DCT algorithm from this FFT, we simply consider the DCT-II to be a special case of a DFT with real input of a certain even symmetry, and discard the redundant operations [1], [6], [7], [8], [9], [10]. The DCT-III, DST-II, and DST-III have identical operation counts to the DCT-II of the same size, since the algorithms are related by simple transpositions, permutations, and sign flips [11], [12], [13].
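This symmetry argument can be made concrete with a short sketch (function names are ours, and the naive O(N^2) transforms are for illustration only): embedding the N real inputs into a length-4N sequence with the appropriate even symmetry makes the first N DFT outputs equal the DCT-II, here in the scale-2 convention used later in the paper.

```python
import cmath
import math

def dct2_direct(x):
    # DCT-II with a scale factor of 2 in front of the sum.
    N = len(x)
    return [2 * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                    for n in range(N))
            for k in range(N)]

def dct2_via_dft(x):
    # Embed x into a length-4N real sequence with even symmetry:
    # y[2n+1] = y[4N-2n-1] = x[n], all even-indexed entries zero.
    N = len(x)
    y = [0.0] * (4 * N)
    for n in range(N):
        y[2 * n + 1] = x[n]
        y[4 * N - 2 * n - 1] = x[n]
    # Naive DFT of y; its first N outputs are real and equal the DCT-II,
    # so a real-even DFT contains the DCT-II as a special case.
    X = []
    for k in range(N):
        s = sum(y[m] * cmath.exp(-2j * cmath.pi * m * k / (4 * N))
                for m in range(4 * N))
        X.append(s.real)  # imaginary parts vanish by symmetry
    return X
```

Pruning the operations on the zero and duplicated entries of y is what turns such an embedding into a fast DCT algorithm.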
Since 1968, the lowest total count of real additions and multiplications, herein called "flops" (floating-point operations), for the DFT of a power-of-two size N = 2^m was achieved by the split-radix algorithm, with 4N log2 N − 6N + 8 flops for N > 1 [6], [8], [14], [15], [16]. This count was recently surpassed by new algorithms achieving a flop count of (34/9)N log2 N + O(N) [1], [17]. Similarly, the lowest-known flop count for the DCT-II of size N = 2^m was previously 2N log2 N + O(N) for a unitary normalization (with the additive terms depending on the normalization) [6], [7], [12], [13], [18], [19], [20], [21], [22], [23], [24], [25], [26], and could be achieved by starting with the split-radix FFT and discarding redundant operations [6], [7]. (Many DCT algorithms with an unreported or larger flop count have also been described [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40].) Based on our new FFT algorithm, the flop counts for the various DCT types were reduced using a code generator [10], [41] that automatically pruned the redundant operations from an FFT with a given symmetry, but neither an explicit algorithm nor a general formula for the flop count was presented except for the DCT-I [1]. In this paper, we use the same starting point to "manually" derive a DCT-II algorithm by pruning redundant operations from a real-even FFT, and give the general formula for the new flop count (for N = 2^m), whose leading term is (17/9)N log2 N. The first savings over the previous record occur at small N and are summarized in Table 1 for several N. We also consider the effect of normalization on this flop count: the above count is for a unitary transform, but slightly different counts are obtained by other choices. Moreover, we show that a further N multiplications can be saved by individually rescaling every output of the DCT-II (or input of the DCT-III). In doing so, we generalize a result by Arai et al., who showed that eight multiplications could be saved by scaling the outputs of a DCT-II of size 8 [42], a result commonly applied to JPEG compression [43].
If we merely wished to show that the DCT-II/III could be computed in (17/9)N log2 N + O(N) operations, we could do so by using well-known techniques to re-express the DCT in terms of a real-input DFT of length N plus O(N) pre/post-processing operations [29], [30], [33], [44], [45], [46], and then substituting our new FFT, which requires (17/9)N log2 N + O(N) operations for real inputs [1]. However, with FFT and DCT algorithms there is great interest in obtaining not only the best possible asymptotic constant factor, but also the best possible exact count of arithmetic operations (which, for the DCT-II of power-of-two sizes, had not changed by even one operation for over 20 years). Our result (1) is intended as a new upper bound on this (still unknown) minimum exact count, and therefore we have done our best with the lower-order terms as well as the asymptotic constant. It turns out, in fact, that our algorithm achieving (1) is closely related to well-known algorithms expressing the DCT-II in terms of a real-input FFT of the same length, but it does not seem obvious a priori that this is what one obtains by pruning our FFT for symmetric data.
In the following sections, we first review how a DCT-II may be expressed as a special case of a DFT, and how the previous optimum flop count can be achieved by pruning redundant operations from a conjugate-pair split-radix FFT. Then, we briefly review the new FFT algorithm presented in [1], and derive the new DCT-II algorithm. We follow by considering the effect of normalization and scaling. Finally, we consider the extension of this algorithm to algorithms for the DCT-III, DST-II, and DST-III. We close with some concluding remarks about future directions. We emphasize that no proven tight lower bound on the DCT-II flop count is currently known, and we make no claim that Eq. (1) is the lowest possible (although we have endeavored not to miss any obvious optimizations).
Section snippets
DCT-II in terms of DFT
Various forms of DCT have been defined, corresponding to different boundary conditions on the transform [47]. Perhaps the most common form is the type-II DCT, used in image compression [43] and many other applications. The DCT-II is typically defined as a real, orthogonal (unitary), linear transformation by the formula (for k = 0, ..., N − 1):
X_k = \sqrt{\frac{2}{N(1+\delta_{k0})}} \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right]
for N inputs x_n and N outputs X_k, where \delta_{k0} is the Kronecker delta (\delta_{k0} = 1 for k = 0 and 0 otherwise). However, we wish to emphasize
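A direct O(N^2) implementation of this unitary definition (function names are ours) makes the orthogonality explicit: composing the transform with its transpose, which is the unitary DCT-III, recovers the input.

```python
import math

def dct2_unitary(x):
    # Unitary DCT-II: X_k = sqrt(2/N) * s_k * sum_n x_n cos(pi (n+1/2) k / N),
    # with s_0 = 1/sqrt(2) and s_k = 1 otherwise.
    N = len(x)
    out = []
    for k in range(N):
        s = math.sqrt(0.5) if k == 0 else 1.0
        out.append(math.sqrt(2.0 / N) * s *
                   sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                       for n in range(N)))
    return out

def dct3_unitary(X):
    # The transpose of the unitary DCT-II matrix, i.e. the unitary DCT-III;
    # by orthogonality it is also the inverse of dct2_unitary.
    N = len(X)
    return [sum((math.sqrt(0.5) if k == 0 else 1.0) * math.sqrt(2.0 / N) *
                X[k] * math.cos(math.pi * (n + 0.5) * k / N)
                for k in range(N))
            for n in range(N)]
```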
Conjugate-pair FFT and DCT-II
Although the previous minimum flop count for DCT-II algorithms can be derived from the ordinary split-radix FFT algorithm [6], [7] (and it can also be derived in other ways [12], [13], [18], [19], [20], [21], [22], [23], [24], [25], [26]), here we will do the same thing using a variant dubbed the “conjugate-pair” split-radix FFT. This algorithm was originally proposed to reduce the number of flops [2], but was later shown to have the same flop count as ordinary split-radix [3], [4], [5]. It
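A compact, unoptimized sketch of the conjugate-pair recursion (names are ours): the two odd subsequences x_{4m+1} and x_{4m−1} are paired so that their twiddle factors are the complex conjugates ω^k and ω^{−k}, which is the property that the rescaling of [1] later exploits.

```python
import cmath

def cp_split_radix_fft(x):
    # Conjugate-pair split-radix DFT: X_k = sum_n x_n exp(-2*pi*i*n*k/N),
    # for power-of-two N.
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    U = cp_split_radix_fft(x[0::2])    # even-indexed terms, size N/2
    Z = cp_split_radix_fft(x[1::4])    # x_{4m+1}, size N/4
    Zp = cp_split_radix_fft([x[(4 * m - 1) % N] for m in range(N // 4)])  # x_{4m-1}
    X = [0j] * N
    for k in range(N // 4):
        w = cmath.exp(-2j * cmath.pi * k / N)    # twiddle factor omega^k
        # Conjugate-pair twiddles: omega^k and omega^{-k} = conj(omega^k).
        t1 = w * Z[k] + w.conjugate() * Zp[k]
        t2 = 1j * (w * Z[k] - w.conjugate() * Zp[k])
        X[k] = U[k] + t1
        X[k + N // 2] = U[k] - t1
        X[k + N // 4] = U[k + N // 4] - t2
        X[k + 3 * N // 4] = U[k + N // 4] + t2
    return X
```

Restricting this recursion to real-even inputs and discarding the redundant operations yields the previous best DCT-II count.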
New FFT and DCT-II
In this section, we first review the new FFT algorithm introduced in our previous work based on a recursive rescaling of the conjugate-pair FFT [1], and then apply it to derive a fast DCT-II algorithm as in the previous section.
Normalizations
In the above definition of the DCT-II, we use a scale factor of 2 in front of the summation in order to make the DCT-II directly equivalent to the corresponding DFT. But in some circumstances it is useful to multiply by other factors, and different normalizations lead to slightly different flop counts. For example, one could use the unitary normalization from Eq. (2), which replaces the 2 by \sqrt{2/N} (or, for k = 0, by \sqrt{1/N}) and requires the same number of flops. If one uses the unitary normalization multiplied by
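Since the conventions differ only by a per-output scale, converting between them is a diagonal rescaling; a minimal sketch with our own function names:

```python
import math

def dct2_scale2(x):
    # DCT-II with the factor 2 in front of the sum (the convention used here).
    N = len(x)
    return [2 * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                    for n in range(N))
            for k in range(N)]

def to_unitary(X):
    # Rescale scale-2 outputs to the unitary normalization:
    # replace the factor 2 by sqrt(2/N), or by sqrt(1/N) for k = 0.
    N = len(X)
    return [X[k] * (math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) / 2
            for k in range(N)]
```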
Fast DCT-III, DST-II, and DST-III
Given any algorithm for the DCT-II, one can immediately obtain algorithms for the DCT-III, DST-II, and DST-III with exactly the same number of arithmetic operations. In this way, any improvement in the DCT-II, such as the one described in this paper, immediately leads to improved algorithms for those transforms and vice versa. In this section, we review the equivalence between those transforms.
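One of these equivalences is easy to check numerically: the DST-II of x equals the DCT-II of the sign-flipped sequence (−1)^n x_n with its outputs read in reverse order, so any DCT-II flop count carries over to the DST-II unchanged. A sketch with naive O(N^2) transforms (our names, unnormalized sums):

```python
import math

def dct2(x):
    # Unnormalized DCT-II: C_k = sum_n x_n cos(pi (n+1/2) k / N).
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

def dst2(x):
    # Unnormalized DST-II: S_k = sum_n x_n sin(pi (n+1/2) (k+1) / N).
    N = len(x)
    return [sum(x[n] * math.sin(math.pi * (n + 0.5) * (k + 1) / N) for n in range(N))
            for k in range(N)]

def dst2_via_dct2(x):
    # Flip the sign of odd-indexed inputs, take the DCT-II, reverse the outputs.
    flipped = [(-1) ** n * xn for n, xn in enumerate(x)]
    return dct2(flipped)[::-1]
```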
Conclusion
We have shown that an improved count of real additions and multiplications (flops), Eq. (1), can be obtained for the DCT/DST types II/III by pruning redundant operations from a recent improved FFT algorithm. We have also shown how to save N multiplications by rescaling the outputs of a DCT-II (or the inputs of a DCT-III), generalizing a well-known technique for N = 8 [42]. However, we do not claim that our new count is optimal—it may be that further gains can be obtained by, for example,
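The rescaling idea can be sketched abstractly (the scale factors below are hypothetical stand-ins, not the ones produced by the recursive rescaling of [1]): when each output is consumed by a subsequent multiplication anyway, as in JPEG quantization, the per-output factors fold into those constants for free, so the scaled transform omits one multiplication per output.

```python
import math

def dct2(x):
    # Scale-2 DCT-II (naive O(N^2) reference).
    N = len(x)
    return [2 * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                    for n in range(N))
            for k in range(N)]

def scales(N):
    # Hypothetical per-output scale factors, for illustration only.
    return [math.cos(math.pi * k / (4 * N)) for k in range(N)]

def scaled_dct2(x):
    # A scaled transform computes C_k / s_k; here we divide explicitly, but
    # a scaled fast algorithm would simply never perform these multiplies.
    s = scales(len(x))
    return [c / s[k] for k, c in enumerate(dct2(x))]

def quantize(X, q):
    # A consumer that multiplies each output by a constant anyway (as in JPEG).
    return [Xk * qk for Xk, qk in zip(X, q)]
```

Feeding the scaled outputs into quantization constants premultiplied by s_k gives the same result as quantizing the exact DCT-II, which is how the N saved multiplications are absorbed.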
Acknowledgments
This work was supported in part by a grant from the MIT Undergraduate Research Opportunities Program. The authors are also grateful to M. Frigo, co-author of FFTW with S.G. Johnson [10], for many helpful discussions.
References (53)
- Simple FFT and DCT algorithms with reduced number of operations, Signal Processing (1984)
- Fast Fourier transforms: a tutorial review and a state of the art, Signal Processing (April 1990)
- Two new algorithms based on product system for discrete cosine transform, Signal Processing (2001)
- Fast and numerically stable algorithms for discrete cosine transforms, Linear Algebra Appl. (2005)
- Vectorizing the FFTs
- A modified split-radix FFT with fewer arithmetic operations, IEEE Trans. Signal Process. (2007)
- Conjugate pair fast Fourier transform, Electron. Lett. (1989)
- Comment: conjugate pair fast Fourier transform, Electron. Lett. (1989)
- Comment: conjugate pair fast Fourier transform, Electron. Lett. (1990)
- Comment: conjugate pair fast Fourier transform, Electron. Lett. (1992)
- Implementation of “split-radix” FFT algorithms for complex, real, and real-symmetric data, IEEE Trans. Acoust. Speech Signal Process.
- The design and implementation of FFTW3, Proc. IEEE
- A fast algorithm for the discrete sine transform implemented by the fast cosine transform, IEEE Trans. Acoust. Speech Signal Process.
- Direct methods for computing discrete sinusoidal transforms, IEE Proc. F
- Restructured recursive DCT and DST algorithms, IEEE Trans. Signal Process.
- An economical method for calculating the discrete Fourier transform
- Split-radix FFT algorithm, Electron. Lett.
- Recursive cyclotomic factorization—a new algorithm for calculating the discrete Fourier transform, IEEE Trans. Acoust. Speech Signal Process.
- A new matrix approach to real FFTs and convolutions of length, Computing
- A new algorithm to compute the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process.
- On computing the discrete Fourier and cosine transforms, IEEE Trans. Acoust. Speech Signal Process.
- Fast algorithms for the DFT and other sinusoidal transforms, IEEE Trans. Acoust. Speech Signal Process.
- A fast algorithm for computing the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process.
- Fast cosine transform based on the successive doubling method, Electron. Lett.
- Fast algorithm for computing discrete cosine transform, IEEE Trans. Signal Process.