A compression scheme for radio data in high performance computing
Introduction
The simultaneous drives to wider fields and higher sensitivity have led radio astronomy to the cusp of a big-data revolution. There is a multitude of instruments, including 21 cm cosmology experiments (Pober et al., 2013; Battye et al., 2013; the Canadian Hydrogen Intensity Mapping Experiment, CHIME; Pober et al., 2014; Greenhill et al., 2012; van Haarlem et al., 2013; Zheng et al., 2013; Parsons et al., 2010; Chen, 2012), Square Kilometer Array precursors (Johnston et al., 2008; Lonsdale et al., 2009; Booth et al., 2009), and ultimately the Square Kilometer Array itself (SKA Organization, 2015), whose rate of data production will be orders of magnitude higher than that of any existing radio telescope. An early example is the CHIME Pathfinder (Bandura et al., 2014; Newburgh et al., 2014), which will soon be producing data at a steady rate of over 4 TB per day. The cost of storing and handling these data can be considerable, so it is desirable to reduce their size as much as possible using compression. At the same time, these data volumes pose a significant data-processing challenge. Any compression/decompression scheme must be fast enough not to hinder data processing, and would ideally lead to a net increase in performance due to the reduced time required to read the data from disk.
Here, after discussing some general considerations for designing data storage formats in Section 2, we present a scheme for compressing astronomical radio data. Our procedure has two steps: a controlled (relative to the thermal noise) reduction of the precision of the data, which reduces its information entropy (Section 3), and a lossless compression algorithm—Bitshuffle—which exploits this reduction in entropy to achieve a very high compression ratio (Section 4). The two steps are independent: while they work very well together, either can be used without the other. When we evaluate our method in Section 5, we show that the precision reduction improves compression ratios for most lossless compressors. Likewise, Bitshuffle outperforms most other lossless compressors even in the absence of precision reduction.
Characteristics of radio-astronomy data and usage patterns
Integrated, post-correlation radio-astronomy data are typically at least three dimensional, containing axes representing spectral frequency, correlation product, and time. The correlation product refers to the correlation of all antenna …
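The correlation-product axis can be made concrete with a short sketch (the dimensions below are invented for illustration, not the CHIME Pathfinder's actual values; for n feed inputs there are n(n+1)/2 correlation products when each pair is counted once and autocorrelations are included):

```python
import numpy as np

# Illustrative dimensions only; not the actual instrument's values.
nfreq, nfeed, ntime = 1024, 16, 100

# For n feed inputs there are n * (n + 1) / 2 correlation products,
# counting each pair once and including the autocorrelations.
nprod = nfeed * (nfeed + 1) // 2

# Post-correlation visibilities: frequency x correlation product x time.
vis = np.zeros((nfreq, nprod, ntime), dtype=np.complex64)
print(vis.shape)  # (1024, 136, 100)
```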
Lossy entropy reduction: reduction of precision
All experiments must perform some amount of lossy compression simply by virtue of having to choose a finite-width data type, which reduces precision by truncation. Here, we focus on performing a reduction of precision in a manner that is both controlled, in that it has a well-understood effect on the data, and efficient, in that only the required precision is kept, allowing for better compression.
Reducing the precision of the data involves discarding some number of the least significant bits of each data value. …
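The idea can be sketched as follows (our illustrative code, not the paper's reference implementation; the function name and the rounding fraction `f` are assumptions, and the full text gives the actual prescription for choosing the granularity):

```python
import numpy as np

def reduce_precision(data, noise_std, f=0.5):
    """Round `data` to multiples of a power-of-two granularity g chosen
    as a fraction f of the thermal-noise standard deviation.

    Rounding to multiples of a power of two zeros the trailing mantissa
    bits of a float, lowering the entropy seen by a lossless compressor,
    while the added quantization noise has RMS g / sqrt(12), small next
    to the thermal noise for modest f.  (Names here are ours.)
    """
    g = float(2.0 ** np.floor(np.log2(f * noise_std)))
    return np.round(data / g) * g

rng = np.random.default_rng(0)
noisy = rng.normal(0.0, 1.0, size=4096).astype(np.float32)
reduced = reduce_precision(noisy, noise_std=1.0, f=0.5)

rms_err = float(np.sqrt(np.mean((noisy - reduced) ** 2)))
# g = 0.5 here, so the quantization RMS is about 0.5 / sqrt(12) ~ 0.14,
# well below the unit thermal noise.
```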
Lossless compression: Bitshuffle
Here we discuss lossless data compressors in the context of radio astronomical data. We seek a compressor that is fast enough for high performance applications but also obtains high compression ratios, especially in the context of the precision reduction discussed in the previous section. Satisfying both criteria is difficult and existing compressors are found to be inadequate. Therefore, a custom compression algorithm, Bitshuffle, was developed; it is both fast and obtains high compression ratios.
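The core bit-transposition behind Bitshuffle can be mimicked in a few lines of numpy (an unoptimized sketch of the idea only; the actual library transposes the bits of whole multi-byte elements within cache-sized blocks using vectorized kernels):

```python
import numpy as np

def bit_transpose(block):
    """Group bit i of every byte of `block` together (bit-plane order).

    After precision reduction the low bit-planes are mostly constant,
    so this rearrangement hands long runs of identical bits to the
    downstream lossless compressor.
    """
    bits = np.unpackbits(block.reshape(-1, 1), axis=1)  # shape (n, 8)
    return np.packbits(bits.T)                          # bit-plane major

def bit_untranspose(shuffled, n):
    """Invert bit_transpose for a block of n bytes."""
    bits = np.unpackbits(shuffled).reshape(8, n)
    return np.packbits(bits.T, axis=1).ravel()

block = np.arange(64, dtype=np.uint8)
roundtrip = bit_untranspose(bit_transpose(block), block.size)
assert np.array_equal(roundtrip, block)  # the transform is lossless
```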
Evaluation of method
In this section we apply the compression algorithm described above to data from the CHIME Pathfinder to assess its performance and to compare it with other compression schemes. The Pathfinder comprises two parabolic cylinders, each 20 m wide by 35 m long, with their axes running in a north–south direction. Sixty-four identical dual-polarization feeds are located at 0.3 m intervals along the central portion of each focal line.
The data used for the following comparisons were collected on …
Summary and conclusions
We have presented a high-throughput data compression scheme for astronomical radio data that obtains a very high compression ratio. Our scheme includes two parts: reducing the precision of the data in a controlled manner to discard noisy bits, hence reducing the entropy of the data; and the lossless compression of the data using the Bitshuffle algorithm.
The entire compression algorithm consists of the following steps, starting with the precision reduction:
1. Estimate the thermal noise on a …
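A minimal end-to-end sketch of these steps, with zlib standing in for the lossless stage (our illustrative reconstruction, not the production pipeline, which pairs Bitshuffle with LZ4; all names and parameters below are assumptions):

```python
import zlib
import numpy as np

def compress_visibilities(data, noise_std, f=0.5):
    # Step 1 (lossy, controlled): round to a power-of-two granularity
    # set by a fraction f of the estimated thermal noise.
    g = float(2.0 ** np.floor(np.log2(f * noise_std)))
    reduced = (np.round(data / g) * g).astype(np.float32)

    # Step 2 (lossless): transpose the bits so the now-constant low
    # bit-planes sit contiguously, then apply a generic byte compressor
    # (zlib here; the real scheme uses Bitshuffle with LZ4).
    bits = np.unpackbits(reduced.view(np.uint8).reshape(-1, 1), axis=1)
    shuffled = np.packbits(bits.T).tobytes()
    return zlib.compress(shuffled), g

rng = np.random.default_rng(1)
vis = rng.normal(0.0, 1.0, size=4096).astype(np.float32)

plain = zlib.compress(vis.tobytes())
packed, g = compress_visibilities(vis, noise_std=1.0)
# Noise-limited floats barely compress; after precision reduction and
# bit transposition the same compressor does much better.
assert len(packed) < len(plain)
```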
Acknowledgments
We are very grateful for the warm reception and skillful help we have received from the staff of the Dominion Radio Astrophysical Observatory, operated by the National Research Council Canada.
CHIME is Leading Edge Fund project 31170, funded by the Canada Foundation for Innovation, the B.C. Knowledge Development Fund, le Cofinancement gouvernement du Québec-FCI, and the Ontario Research Fund. K. Masui is supported by the Canadian Institute for Advanced Research, Global Scholars Program. M. Deng …
References
- Bandura, K., et al., 2014. Canadian Hydrogen Intensity Mapping Experiment (CHIME) pathfinder.
- Battye, R.A., et al., 2013. HI intensity mapping: a single dish approach. Mon. Not. R. Astron. Soc.
- Booth, R.S., de Blok, W.J.G., Jonas, J.L., Fanaroff, B., 2009. MeerKAT key project science, specifications, and...
- Chen, X., 2012. The Tianlai project: a 21CM cosmology experiment. Int. J. Mod. Phys. Conf. Ser.
- Denman, N., Amiri, M., Bandura, K., Cliche, J.-F., Connor, L., Dobbs, M., Fandino, M., Halpern, M., Hincks, A.,...
- Deutsch, L.P., 1996. DEFLATE Compressed Data Format Specification version 1.3, RFC 1951, RFC Editor. URL...
- Greenhill, L.J., et al., 2012. A broadband 512-element full correlation imaging array at VHF (LEDA).
- Hübbe, N., et al., 2012. Reducing the HPC-data storage footprint with MAFISC—multidimensional adaptive filtering improved scientific data compression. Comput. Sci. Res. Dev.
- Huffman, D.A., 1952. A method for the construction of minimum-redundancy codes. Proc. IRE.
- The IEEE, 2008. Standard for floating-point arithmetic, IEEE Std. 754-2008....
- Johnston, S., et al., 2008. Science with ASKAP. The Australian square-kilometre-array pathfinder. Exp. Astron.
- Kulkarni, S.R., 1989. Self-noise in interferometers — radio and infrared. Astron. J.
- Lonsdale, C.J., et al., 2009. The Murchison Widefield Array: design overview. Proc. IEEE.