Combining SIMD and Many/Multi-core Parallelism for Finite-state Machines with Enumerative Speculation

Authors:
Peng Jiang (The University of Iowa, IA), Yang Xia (The Ohio State University), Gagan Agrawal (Augusta University, GA)

ACM Transactions on Parallel Computing, Volume 7, Issue 3, Article No.: 15, pp 1–26, https://doi.org/10.1145/3399714

Published: 21 June 2020

Abstract

The finite-state machine (FSM) is the key kernel behind many popular applications, including regular expression matching, text tokenization, and Huffman decoding. Parallelizing FSMs is extremely difficult because of their strong dependencies and unpredictable memory accesses. Previous efforts have largely focused on multi-core parallelization and have used different approaches, including speculative and enumerative execution, both of which have been effective but also have limitations. With the increasing width and improving flexibility of SIMD instruction sets, this article focuses on combining SIMD and many/multi-core parallelism for FSMs. We have developed a novel strategy called enumerative speculation. Instead of speculating on a single state, as in speculative execution, or enumerating all possible states, as in enumerative execution, our strategy speculates transitions from several possible states, reducing both the prediction overheads of the speculative approach and the large amount of redundant work in the enumerative approach. A simple lookback approach produces a set of guessed states that achieves high speculation success rates in our enumerative speculation. In addition, to enable continued scalability of enumerative speculation with a large number of threads, we have developed a parallel merge method. We evaluate our method with four popular FSM applications: Huffman decoding, regular expression matching, HTML tokenization, and Div7. We obtain up to 2.5× speedup using SIMD on 1 core and up to 95× when combining SIMD with 60 cores of an Intel Xeon Phi. On a single core, we outperform the best single-state speculative execution version by an average of 1.6×, and when combining SIMD and many-core parallelism, we outperform enumerative execution by an average of 2×. Finally, when evaluated on a GPU, our parallel merge implementations are 2.02–6.74× more efficient than the corresponding sequential merge implementations and achieve better scalability on an Nvidia V100 GPU.
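The idea of speculating from several guessed states per chunk can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy FSM (a matcher for the substring "abc"), the function names, and the `lookback` and `max_guesses` parameters are all assumptions made for illustration. The per-chunk work is written as an ordinary loop where a real implementation would use SIMD lanes and threads, and the merge shown here is a plain sequential one rather than the parallel merge the abstract describes.

```python
from typing import Dict, List

# Hypothetical toy FSM (not from the article): recognizes the substring "abc".
# TRANSITIONS[state][symbol] -> next state; state 3 means "abc was just read".
TRANSITIONS: List[Dict[str, int]] = [
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 2, "c": 0},
    {"a": 1, "b": 0, "c": 3},
    {"a": 1, "b": 0, "c": 0},
]


def run(state: int, text: str) -> int:
    """Run the FSM sequentially from `state` over `text`."""
    for symbol in text:
        state = TRANSITIONS[state][symbol]
    return state


def guess_states(data: str, start: int, lookback: int, max_guesses: int) -> List[int]:
    """Lookback: run every state over the symbols just before `start` and
    keep at most `max_guesses` of the distinct end states as guesses."""
    window = data[max(0, start - lookback):start]
    ends = sorted({run(s, window) for s in range(len(TRANSITIONS))})
    return ends[:max_guesses]


def enumerative_speculation(data: str, num_chunks: int, lookback: int = 2,
                            max_guesses: int = 2, initial: int = 0) -> int:
    """Return the final FSM state, processing `data` chunk by chunk."""
    if not data:
        return initial
    size = -(-len(data) // num_chunks)  # ceiling division
    bounds = [(i, min(i + size, len(data))) for i in range(0, len(data), size)]

    # Phase 1 (independent per chunk, so it could run in parallel):
    # execute each chunk from every guessed start state.
    partial = []
    for start, end in bounds:
        if start == 0:
            guesses = [initial]  # the first chunk's start state is known
        else:
            guesses = guess_states(data, start, lookback, max_guesses)
        partial.append({g: run(g, data[start:end]) for g in guesses})

    # Phase 2 (sequential merge): chain the chunks using the true state.
    state = initial
    for (start, end), results in zip(bounds, partial):
        if state in results:
            state = results[state]  # speculation succeeded
        else:
            state = run(state, data[start:end])  # misspeculation: re-execute
    return state
```

In this sketch, a guess set of size one corresponds to single-state speculation, and a guess set covering every state corresponds to full enumeration; intermediate sizes trade redundant work against the chance of re-executing a chunk. The fallback in the merge phase keeps the result equal to a sequential run regardless of how good the guesses are.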

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Logic circuits
      1. Finite state machines

Published in
ACM Transactions on Parallel Computing Volume 7, Issue 3
Special Issue on PPoPP 2017 (Part 2) and Regular Papers
September 2020
182 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3407694
Editor:
David A. Bader
New Jersey Institute of Technology, USA
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 June 2020
- Revised: 1 April 2020
- Accepted: 1 April 2020
- Received: 1 July 2018
Author Tags
Finite-state machine
SIMD
break dependence
Qualifiers
- research-article
- Research
- Refereed