Abstract
DNA storage has become a thriving interdisciplinary research area because of its high density, low maintenance cost, and long-term durability for information storage. However, the complexity of errors in DNA sequences, including substitutions, insertions, and deletions, hinders its application to massive data storage. Motivated by the divide-and-conquer algorithm, we propose a hierarchical error correction strategy for text DNA storage. The basic idea is to design robust codes for common characters that can correct a one-base error, including an insertion and/or a deletion. Errors are corrected progressively at three levels: by the codes within DNA reads, by multiple alignment of character lines, and finally by word spelling. On the one hand, the proposed encoding method provides a systematic way to design storage-friendly codes that satisfy constraints such as 50% GC content, homopolymers of no more than 2 bases, and robustness against secondary structures. On the other hand, the proposed error correction method not only corrects a single insertion or deletion but also handles multiple insertions or deletions. Simulation results demonstrate that the proposed method can correct more than 98% of errors when the error rate is at most 0.05. Thus, it is more powerful and better suited to complex DNA storage applications.
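The storage-friendly constraints named in the abstract can be illustrated with a small check. The following is a minimal sketch only, not the authors' code design procedure: the function names and the example sequences are hypothetical, and the sketch covers only the GC-content and homopolymer constraints, leaving secondary-structure robustness aside.

```python
from itertools import groupby


def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of identical consecutive bases."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return max(len(list(run)) for _, run in groupby(seq))


def is_storage_friendly(seq: str, max_run: int = 2) -> bool:
    """Check for exactly 50% GC content and no homopolymer longer than max_run.

    The 50% test uses integer arithmetic to avoid floating-point comparison.
    Secondary structure is NOT checked here (assumption: out of scope).
    """
    gc_count = sum(base in "GC" for base in seq)
    return 2 * gc_count == len(seq) and max_homopolymer_run(seq) <= max_run


if __name__ == "__main__":
    # Hypothetical candidate codewords, chosen only for illustration.
    for candidate in ("ACGTTGCA", "AAACGCGG", "ACGTAT"):
        print(candidate, is_storage_friendly(candidate))
```

A filter of this kind could be applied to candidate codewords before any error-correction property is examined, so that only sequences meeting the synthesis and sequencing constraints are considered further.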
Availability of Data and Material
The data that support the findings of this study are available from the corresponding authors upon reasonable request.
Code Availability
Code is available from the corresponding authors upon reasonable request.
Funding
This work was supported by the National Natural Science Foundation of China (Grant nos. 62072128, 61876047 and 62002079).
Ethics declarations
Conflict of Interest
There is no conflict of interest.
Ethical Approval
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
About this article
Cite this article
Zan, X., Yao, X., Xu, P. et al. A Hierarchical Error Correction Strategy for Text DNA Storage. Interdiscip Sci Comput Life Sci 14, 141–150 (2022). https://doi.org/10.1007/s12539-021-00476-x
DOI: https://doi.org/10.1007/s12539-021-00476-x