Abstract
Regular expressions (shortened as regexp) are widely used to parse data and detect recurring patterns, and are a common choice for defining configurable rules in a variety of systems. In fact, many data-intensive applications rely on regexp matching as the first line of defense for on-line data filtering. Unfortunately, few solutions can keep up with increasing data rates and with the complexity of sets containing hundreds of expressions. In this paper we present DotStar (.*), a complete algorithmic solution and software tool-chain that compiles large sets of regexp into an automaton able to take advantage of the vector/SIMD extensions available on many commodity multi-core processors. DotStar relies on several algorithmic innovations to transform the user-provided regexp set into a sequence of manageable intermediate representations. The resulting automaton is both space and time efficient, and can search in a single pass without backtracking. The experimental evaluation, performed on a family of state-of-the-art processors, shows that DotStar can efficiently handle both small sets of regexp, used in protocol parsing, and larger sets designed for Network Intrusion Detection Systems (NIDS), achieving between 1 and 5 Gbit/s per core.
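To make the idea of "one automaton for a whole pattern set, scanned in a single pass" concrete, the following is a minimal sketch and not DotStar's actual construction. It assumes the patterns are plain literal strings rather than full regexp, and it swaps in the classic Aho-Corasick automaton as a stand-in. The function names `build_automaton` and `scan` are hypothetical and chosen only for this illustration. The input position only ever moves forward; on a mismatch the automaton follows precomputed failure links instead of re-reading input.

```python
from collections import deque


def build_automaton(patterns):
    """Compile literal patterns into one Aho-Corasick automaton.

    Illustrative assumption: literal strings only, not full regexp.
    """
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    out = [set()]  # indices of patterns recognized on entering each state

    # Phase 1: build a trie that shares common prefixes across patterns.
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(idx)

    # Phase 2: breadth-first pass that sets failure links and merges
    # the outputs reachable through them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def scan(text, automaton):
    """Scan text once, left to right; return (end_position, pattern_index) pairs."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for pos, ch in enumerate(text):
        # Follow failure links until some state accepts ch (or we reach the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in sorted(out[state]):
            matches.append((pos, idx))
    return matches


patterns = ["he", "she", "his", "hers"]
automaton = build_automaton(patterns)
print(scan("ushers", automaton))  # [(3, 0), (3, 1), (5, 3)]
```

Each reported pair gives the index of the last matched character and the index of the pattern in the input list, so `(3, 1)` means "she" ends at position 3 of "ushers". Handling real regexp operators such as `.*`, and mapping the transitions onto SIMD instructions, is the harder problem the paper addresses and is outside the scope of this sketch.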
Cite this article
Pasetto, D., Petrini, F. & Agarwal, V. DotStar: breaking the scalability and performance barriers in parsing regular expressions. Comput Sci Res Dev 25, 93–104 (2010). https://doi.org/10.1007/s00450-010-0106-4
DOI: https://doi.org/10.1007/s00450-010-0106-4