Interval Parsing Grammars for File Format Parsing

Published: 06 June 2023

Abstract

File formats specify how data is encoded for persistent storage. They cannot be formalized as context-free grammars since their specifications include context-sensitive patterns such as the random access pattern and the type-length-value pattern. We propose a new grammar mechanism called Interval Parsing Grammars (IPGs) for file format specifications. An IPG attaches to every nonterminal/terminal an interval, which specifies the range of input the nonterminal/terminal consumes. By connecting intervals and attributes, IPGs can handle the context-sensitive patterns in file formats. In this paper, we formalize the syntax and semantics of IPGs; the semantics naturally leads to a parser generator that generates a recursive-descent parser from an IPG. In general, IPGs are declarative, modular, and enable termination checking. We have used IPGs to specify a number of file formats including ZIP, ELF, GIF, PE, and part of PDF; we have also evaluated the performance of the generated parsers.
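
To make the two context-sensitive patterns concrete, the following is a minimal hand-written sketch in Python, not IPG syntax and not the output of the authors' parser generator. It assumes a hypothetical toy format: a 4-byte header holding a 16-bit offset and a 16-bit size (random access), which locate a record made of a 1-byte type, a 16-bit length, and a value (type-length-value). Every function name and the format itself are assumptions made for illustration. Each parsing function receives an explicit interval [lo, hi) and may only read inside it, which loosely mirrors the idea of attaching intervals to nonterminals.

```python
import struct


def parse_u16(data: bytes, lo: int, hi: int) -> int:
    """Parse a big-endian 16-bit integer that must fill exactly [lo, hi)."""
    if lo < 0 or hi > len(data) or hi - lo != 2:
        raise ValueError(f"bad interval [{lo}, {hi}) for u16")
    return struct.unpack_from(">H", data, lo)[0]


def parse_tlv(data: bytes, lo: int, hi: int):
    """Type-length-value: the parsed length fixes the interval of the value."""
    if lo < 0 or hi > len(data) or hi - lo < 3:
        raise ValueError(f"interval [{lo}, {hi}) too small for a TLV record")
    tag = data[lo]
    length = parse_u16(data, lo + 1, lo + 3)
    v_lo, v_hi = lo + 3, lo + 3 + length  # interval computed from an attribute
    if v_hi > hi:
        raise ValueError("value exceeds the enclosing interval")
    return tag, data[v_lo:v_hi]


def parse_file(data: bytes):
    """Random access: the header says where in the file the record lives."""
    offset = parse_u16(data, 0, 2)
    size = parse_u16(data, 2, 4)
    if offset + size > len(data):
        raise ValueError("record interval lies outside the file")
    return parse_tlv(data, offset, offset + size)


# Hypothetical sample file: header (offset=6, size=8), two padding bytes,
# then a record with type 1, length 5, and value b"hello".
sample = struct.pack(">HH", 6, 8) + b"\x00\x00" + b"\x01" + struct.pack(">H", 5) + b"hello"
result = parse_file(sample)
```

In this sketch the interval bounds are ordinary function arguments checked by hand; the abstract's point is that an IPG states such intervals declaratively in the grammar, so the bounds checks and the jump to the record's offset follow from the specification rather than from ad hoc code.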
