Fused State Machines for Fault Tolerance in Distributed Systems

Balasubramanian, Bharath; Garg, Vijay K.

doi:10.1007/978-3-642-25873-2_19

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7109)

Abstract

Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct f crash faults among n machines, replication requires nf additional backup machines. We present a fusion-based solution that requires just f additional backup machines (called fusions or fused backups). In this paper, we first propose a fundamental problem regarding DFSMs, independent of fault tolerance, that has not been explored in the literature so far: given a machine M, with a set of states and a set of events, can we replace it with machines that each contain fewer events than M? To formalize this, we define a (k,e)-event decomposition of a given machine M: a set of k machines, each with at least e fewer events than the event set of M, that, acting in parallel, are equivalent to M. We present an algorithm to generate such machines with time complexity O(|X_M|³|Σ_M|^e), where X_M is the set of states and Σ_M is the set of events of M. Second, we use our event decomposition algorithm to generate fused backups that can correct faults among a given set of machines. We show that these backups are minimal with respect to the number of states they contain and the number of events in their event set. Third, we use the notion of locality sensitive hashing to present algorithms for the detection and correction of faults for the fusion-based solution. The algorithm for the detection of Byzantine faults has time complexity O(nf) on average, which is the same as that for replication. The algorithm for the correction of both crash and Byzantine faults has time complexity O(nρf) with high probability (w.h.p.), where ρ is the average state reduction achieved by fusion. We show that for small values of n (for most practical systems, n < 10) and ρ (the average value of ρ was below 2 in our experiments), this results in almost no overhead compared to replication. Finally, we evaluate fusion on the widely used MCNC'91 benchmarks for DFSMs; the results show that the average state-space savings of fusion over replication is 38% (range 0-99%), while the average event reduction is 4% (range 0-45%).
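
As a rough illustration of two ideas in the abstract, the following Python sketch compares the backup counts quoted for replication (nf) and fusion (f), and builds the reachable cross product of machines "acting in parallel". It is not the authors' algorithm or library: the names `backups_needed`, `DFSM`, and `reachable_cross_product` are hypothetical, and the sketch assumes that a machine ignores events outside its own event set.

```python
def backups_needed(n, f):
    """Backup-machine counts quoted in the abstract for correcting
    f crash faults among n machines."""
    return {"replication": n * f, "fusion": f}


class DFSM:
    """A deterministic finite state machine (illustrative helper)."""

    def __init__(self, events, delta, start):
        self.events = set(events)   # event set of this machine
        self.delta = dict(delta)    # (state, event) -> next state
        self.start = start

    def step(self, state, event):
        # Assumption: an event outside the machine's event set
        # (or with no transition defined) leaves the state unchanged.
        return self.delta.get((state, event), state)


def reachable_cross_product(machines):
    """States and transitions reachable when all machines run in
    parallel on the union of their event sets."""
    events = sorted(set().union(*(m.events for m in machines)))
    start = tuple(m.start for m in machines)
    seen = {start}
    frontier = [start]
    delta = {}
    while frontier:
        current = frontier.pop()
        for event in events:
            nxt = tuple(m.step(q, event) for m, q in zip(machines, current))
            delta[(current, event)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen, delta


# Two mod-3 counters: one counts event "0", the other counts event "1".
counter_a = DFSM({"0"}, {(i, "0"): (i + 1) % 3 for i in range(3)}, 0)
counter_b = DFSM({"1"}, {(i, "1"): (i + 1) % 3 for i in range(3)}, 0)

states, transitions = reachable_cross_product([counter_a, counter_b])
print(len(states))            # number of reachable joint states
print(backups_needed(2, 1))   # backup counts for n = 2, f = 1
```

In this toy case each counter sees only one of the two events, and the pair running in parallel tracks the same information as a single nine-state machine over both events, which is the sense of equivalence the event-decomposition question asks about.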

Author information

Authors and Affiliations

Parallel and Distributed Systems Laboratory, Dept. of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA, 78712
Bharath Balasubramanian & Vijay K. Garg

Editor information

Editors and Affiliations

Institute IMDEA Networks, Avenida del Mar Mediterraneo, 22, 28918, Leganes, Madrid, Spain
Antonio Fernàndez Anta
CEIICP, RETIS Lab, Scuola Superiore Sant’Anna, Via Moruzzi 1, 56127, Pisa, Italy
Giuseppe Lipari
Dependability Group (TSF), LAAS-CNRS, 7 av du colonel Roche, 31077, Toulouse, France
Matthieu Roy

Cite this paper

Balasubramanian, B., Garg, V.K. (2011). Fused State Machines for Fault Tolerance in Distributed Systems. In: Fernàndez Anta, A., Lipari, G., Roy, M. (eds) Principles of Distributed Systems. OPODIS 2011. Lecture Notes in Computer Science, vol 7109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25873-2_19

DOI: https://doi.org/10.1007/978-3-642-25873-2_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25872-5
Online ISBN: 978-3-642-25873-2
eBook Packages: Computer Science, Computer Science (R0)