Fused State Machines for Fault Tolerance in Distributed Systems

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7109)

Abstract

Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct f crash faults among n machines, replication requires nf additional backup machines. We present a fusion-based solution that requires just f additional backup machines (called fusions or fused backups). In this paper, we first propose a fundamental problem regarding DFSMs, independent of fault tolerance, that has not been explored in the literature so far: given a machine M with a set of states and a set of events, can we replace it with machines that each contain fewer events than M? To formalize this, we define a (k,e)-event decomposition of a given machine M: a set of k machines, each with at least e fewer events than the event set of M, that, acting in parallel, are equivalent to M. We present an algorithm to generate such machines with time complexity $O(|X_M|^3 |\Sigma_M|^e)$, where $X_M$ is the set of states and $\Sigma_M$ the set of events of M. Second, we use our event decomposition algorithm to generate fused backups that can correct faults among a given set of machines. We show that these backups are minimal with respect to the number of states they contain and the number of events in their event set. Third, we use the notion of locality sensitive hashing to present algorithms for the detection and correction of faults for the fusion-based solution. The algorithm for the detection of Byzantine faults has time complexity $O(nf)$ on average, which is the same as that for replication. The algorithm for the correction of both crash and Byzantine faults has time complexity $O(n\rho f)$ with high probability (w.h.p.), where ρ is the average state reduction achieved by fusion. We show that for small values of n (for most practical systems, n < 10) and ρ (the average value of ρ was below 2 in our experiments), this results in almost no overhead compared to replication. Finally, we evaluate fusion on the widely used MCNC'91 benchmarks for DFSMs. The results show that the average state space savings of fusion over replication are 38% (range 0-99%), while the average event reduction is 4% (range 0-45%).
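
To make the contrast between replication and fusion concrete, the following is a minimal sketch of crash-fault recovery with a single fused backup. It is an illustrative toy, not the paper's event decomposition or backup generation algorithm. The two mod-3 counter machines, the event names "0" and "1", and the helper names (`step_counter`, `run`, `recover_a`, `recover_b`, `backups_needed`) are all assumptions introduced here.

```python
MOD = 3  # every machine in this toy example is a mod-3 counter (assumption)


def step_counter(state, event, counted):
    """Advance a mod-3 counter only when `event` is one it counts."""
    return (state + 1) % MOD if event in counted else state


def run(events):
    """Feed the same event sequence to primaries A, B and fused backup F."""
    a = b = f = 0
    for ev in events:
        a = step_counter(a, ev, {"0"})       # primary A counts event "0"
        b = step_counter(b, ev, {"1"})       # primary B counts event "1"
        f = step_counter(f, ev, {"0", "1"})  # hypothetical fusion: counts both
    return a, b, f


def recover_a(b, f):
    """Recover A's state after a crash, using B and F (f = a + b mod 3)."""
    return (f - b) % MOD


def recover_b(a, f):
    """Recover B's state after a crash, using A and F."""
    return (f - a) % MOD


def backups_needed(n, f_faults):
    """Backup counts stated in the abstract: n*f for replication, f for fusion."""
    return {"replication": n * f_faults, "fusion": f_faults}


if __name__ == "__main__":
    a, b, f = run("0110100")
    print("states:", a, b, f)
    print("A recovered from (B, F):", recover_a(b, f))
    print("B recovered from (A, F):", recover_b(a, f))
    print(backups_needed(n=2, f_faults=1))
```

In this sketch, one three-state backup F tolerates a single crash of either primary, whereas replication would need one copy of A and one copy of B. The toy backup also happens to observe every event, so it shows only the saving in the number of backup machines, not the event reduction the paper additionally targets.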




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Balasubramanian, B., Garg, V.K. (2011). Fused State Machines for Fault Tolerance in Distributed Systems. In: Fernàndez Anta, A., Lipari, G., Roy, M. (eds) Principles of Distributed Systems. OPODIS 2011. Lecture Notes in Computer Science, vol 7109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25873-2_19


  • DOI: https://doi.org/10.1007/978-3-642-25873-2_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25872-5

  • Online ISBN: 978-3-642-25873-2

  • eBook Packages: Computer Science, Computer Science (R0)
