A Hierarchical Error Correction Strategy for Text DNA Storage

  • Original research article
  • Interdisciplinary Sciences: Computational Life Sciences

Abstract

DNA storage has become a thriving interdisciplinary research area because of its high density, low maintenance cost, and long durability for information storage. However, the complexity of errors in DNA sequences, including substitutions, insertions, and deletions, hinders its application to massive data storage. Motivated by the divide-and-conquer approach, we propose a hierarchical error correction strategy for text DNA storage. The basic idea is to design robust codes for common characters that can correct a one-base error, including an insertion and/or deletion. Errors are then corrected progressively at three levels: by the codes within DNA reads, by multiple alignment of character lines, and finally by word spelling. The proposed encoding method provides a systematic way to design storage-friendly codes that have 50% GC content, homopolymers of no more than 2 bases, and robustness against secondary structures. The proposed error correction method not only corrects a single insertion or deletion but also handles multiple insertions or deletions. Simulation results demonstrate that the proposed method can correct more than 98% of errors when the error rate is less than or equal to 0.05. It is therefore more powerful and better suited to complicated DNA storage applications.
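As an illustration of the storage-friendly constraints named in the abstract, the following minimal Python sketch filters candidate codewords by exactly 50% GC content and by homopolymer runs of at most 2 bases. It is not the authors' code design procedure: the function names, the brute-force enumeration, and the codeword length are assumptions made for illustration, and robustness against secondary structures and one-base error correction are not modeled here.

```python
from itertools import groupby, product


def has_balanced_gc(seq: str) -> bool:
    """Return True if exactly half of the bases are G or C."""
    gc_count = sum(base in "GC" for base in seq)
    # Integer comparison avoids any floating-point rounding.
    return 2 * gc_count == len(seq)


def max_homopolymer(seq: str) -> int:
    """Return the length of the longest run of identical bases (0 if empty)."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def is_storage_friendly(seq: str, max_run: int = 2) -> bool:
    """Check the two sequence constraints stated in the abstract.

    max_run=2 mirrors "no more than 2-base homopolymers"; secondary
    structure is deliberately not checked in this sketch.
    """
    if not seq:
        return False
    return has_balanced_gc(seq) and max_homopolymer(seq) <= max_run


def candidate_codewords(length: int):
    """Yield every sequence of the given (hypothetical) length that passes.

    Brute-force enumeration is an assumption for illustration only; it is
    practical only for short codewords.
    """
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        if is_storage_friendly(seq):
            yield seq


if __name__ == "__main__":
    for seq in ("ACGT", "AACCGGTT", "AAACGCGT", "ACAT"):
        print(seq, is_storage_friendly(seq))
```

A character code set would then be chosen from these candidates so that codewords remain distinguishable after a single substitution, insertion, or deletion; that selection step depends on details that are in the full article and is not sketched here.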

Availability of Data and Material

The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Code Availability

Code is available from the corresponding authors upon reasonable request.

Funding

This work was supported by the National Natural Science Foundation of China (Grant nos. 62072128, 61876047 and 62002079).

Author information

Corresponding author

Correspondence to Wenbin Liu.

Ethics declarations

Conflict of Interest

There is no conflict of interest.

Ethical Approval

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

About this article

Cite this article

Zan, X., Yao, X., Xu, P. et al. A Hierarchical Error Correction Strategy for Text DNA Storage. Interdiscip Sci Comput Life Sci 14, 141–150 (2022). https://doi.org/10.1007/s12539-021-00476-x

  • DOI: https://doi.org/10.1007/s12539-021-00476-x
