Abstract
File formats specify how data is encoded for persistent storage. They cannot be formalized as context-free grammars since their specifications include context-sensitive patterns such as the random access pattern and the type-length-value pattern. We propose a new grammar mechanism called Interval Parsing Grammars IPGs) for file format specifications. An IPG attaches to every nonterminal/terminal an interval, which specifies the range of input the nonterminal/terminal consumes. By connecting intervals and attributes, the context-sensitive patterns in file formats can be well handled. In this paper, we formalize IPGs' syntax as well as its semantics, and its semantics naturally leads to a parser generator that generates a recursive-descent parser from an IPG. In general, IPGs are declarative, modular, and enable termination checking. We have used IPGs to specify a number of file formats including ZIP, ELF, GIF, PE, and part of PDF; we have also evaluated the performance of the generated parsers.
Supplemental Material
Available for Download
The appendix.
- 2008. Document managementPortable Document 493 FormatPart 1: PDF 1.7. Google Scholar
- Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA. Google ScholarDigital Library
- Godmar Back. 2002. Datascript – A specification and scripting language for binary data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2487, Springer Verlag, 66–77. isbn:3540442847 issn:16113349 https://doi.org/10.1007/3-540-45821-2_4 Google ScholarCross Ref
- Julian Bangert and Nickolai Zeldovich. 2014. Nail: A practical tool for parsing and generating data formats. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). 615–628. Google Scholar
- Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. 337–340. Google ScholarCross Ref
- 1995. Executable and Linking Format (ELF) Specification. Version 1.2. Google Scholar
- Kathleen Fisher and Robert Gruber. 2005. PADS: a domain-specific language for processing ad hoc data. In ACM Conference on Programming Language Design and Implementation (PLDI). ACM Press, New York, NY, USA. 295–304. Google ScholarDigital Library
- Kathleen Fisher, Yitzhak Mandelbaum, and David Walker. 2006. The next 700 data description languages. In ACM Symposium on Principles of Programming Languages (POPL). ACM Press, New York, NY, USA. 2–15. Google ScholarDigital Library
- Bryan Ford. 2004. Parsing expression grammars: A recognition-based syntactic foundation. Conference Record of the Annual ACM Symposium on Principles of Programming Languages, 31 (2004), 111–122. isbn:158113729X issn:07308566 Google ScholarDigital Library
- Bryan Ford. 2004. Parsing Expression Grammars: A Recognition-based Syntactic Foundation. In ACM Symposium on Principles of Programming Languages (POPL). 111–122. Google ScholarDigital Library
- Alexei Hmelnov Hmelnov and Andrei Mikhailov. 2019. Generation of Code for Reading Data from the Declarative File Format Specifications Written in Language FlexT. Proceedings - 2018 Ivannikov Isp Ras Open Conference, ISPRAS 2018, 23–30. isbn:9781728112756 https://doi.org/10.1109/ISPRAS.2018.00011 Google ScholarCross Ref
- Suman Jana and Vitaly Shmatikov. 2012. Abusing file processing in malware detectors for fun and profit. In 2012 IEEE Symposium on Security and Privacy. 80–94. Google ScholarDigital Library
- Trevor Jim, Yitzhak Mandelbaum, and David Walker. 2010. Semantics and Algorithms for Data-dependent Grammars. 417–430. Google Scholar
- Donald B. Johnson. 1975. Finding All the Elementary Circuits of a Directed Graph. SIAM J. Comput., 4, 1 (1975), 77–84. Google ScholarDigital Library
- 2015. Kaitai Struct User Guide. https://doc.kaitai.io/user_guide.html Google Scholar
- Donald E. Knuth. 1968. Semantics of context-free languages. Mathematical Systems Theory, 2, 2 (1968), 127–145. Google ScholarCross Ref
- Ashish Kumar, Bill Harris, and Gang Tan. 2023. DISV: Domain Independent Semantic Validation of Data Files. In 9th Workshop on Language-Theoretic Security (LangSec). Google Scholar
- Zephyr S Lucas, Joanna Y Liu, Prashant Anantharaman, and Sean W Smith. 2021. Parsing PEGs with Length Fields in Software and Hardware. In 2021 IEEE Security and Privacy Workshops (SPW). 128–133. Google Scholar
- Prashanth Mundkur, Linda Briesemeister, Natarajan Shankar, Prashant Anantharaman, Sameed Ali, Zephyr Lucas, and Sean W. Smith. 2020. Research Report: The Parsley Data Format Definition Language. In 6th Workshop on Language-Theoretic Security (LangSec). 300–307. Google Scholar
- Terence Parr, Sam Harwell, and Kathleen Fisher. 2014. Adaptive LL(*) parsing: the power of dynamic analysis. In ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). 579–598. Google ScholarDigital Library
- Tahina Ramananandro, Antoine Delignat-Lavaud, Cédric Fournet, Nikhil Swamy, Tej Chajed, Nadim Kobeissi, and Jonathan Protzenko. 2019. EverParse: Verified Secure Zero-Copy Parsers for Authenticated Message Formats. In Usenix Security Symposium. 1465–1482. Google Scholar
- William Underwood. 2012. Grammar-Based Specification and Parsing of Binary File Formats. International Journal of Digital Curation, 7, 1 (2012), 95–106. issn:1746-8256 https://doi.org/10.2218/ijdc.v7i1.217 Google ScholarCross Ref
- W3C. 2018. Scalable Vector Graphics (SVG) 2. https://www.w3.org/TR/SVG2/ Google Scholar
- Jialun Zhang, Greg Morrisett, and Gang Tan. 2023. Interval Parsing Grammars for File Format Parsing. arxiv:2304.04859. Google Scholar
- Jialun Zhang, Greg Morrisett, and Gang Tan. 2023. Reproduction Package for article "Interval Parsing Grammars for File Format Parsing". https://doi.org/10.5281/zenodo.7811236 Google ScholarDigital Library
Index Terms
- Interval Parsing Grammars for File Format Parsing
Recommendations
Parsing expression grammars: a recognition-based syntactic foundation
POPL '04: Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languagesFor decades we have been using Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. The power of generative grammars to express ...
Parsing expression grammars: a recognition-based syntactic foundation
POPL '04For decades we have been using Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. The power of generative grammars to express ...
Left recursion in Parsing Expression Grammars
Parsing Expression Grammars (PEGs) are a formalism that can describe all deterministic context-free languages through a set of rules that specify a top-down parser for some language. PEGs are easy to use, and there are efficient implementations of PEG ...
Comments