A decentralized and fault tolerant convergence detection algorithm for asynchronous iterative algorithms

The Journal of Supercomputing

Abstract

This article presents a fault-tolerant algorithm that detects, in a decentralized way, the global convergence of parallel asynchronous iterative applications. It runs a decentralized saving procedure so that, after a node crashes, the dead node can be replaced by a new one that continues the computing task from the last checkpoint. Combined with the advantages of the asynchronous iteration model, this method makes it possible to solve very large-scale problems on highly volatile parallel architectures such as Peer-to-Peer networks and distributed clusters. We also present the implementation of this algorithm in the JaceP2P platform, which is dedicated to designing and executing parallel asynchronous iterative applications in volatile environments. Numerous experiments show the robustness and efficiency of our algorithm.
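To make the ideas in the abstract concrete, the following toy simulation may help. It is a minimal sketch under stated assumptions, not the algorithm of the article and not JaceP2P code: the linear system, the function name `simulate`, and the parameters `checkpoint_every`, `crash_at` and `crash_node` are all hypothetical. Each "node" owns one unknown and updates it at its own pace from whatever values are currently visible, periodically saves a checkpoint, and one node is rolled back to its last checkpoint to mimic a crash and replacement. The global convergence test used here is a simple centralized snapshot of per-node residuals, a stand-in for the decentralized detection protocol the abstract describes.

```python
import random


def simulate(seed=0, eps=1e-10, checkpoint_every=3,
             crash_at=20, crash_node=2, max_steps=100_000):
    """Toy asynchronous-style relaxation with checkpoint-based recovery.

    Hypothetical illustration only: single process, random update order.
    """
    rng = random.Random(seed)

    # Strictly diagonally dominant system A x = b (assumed example data),
    # built so that the exact solution is [1, 2, 3, 4].
    A = [[10.0, 1.0, 2.0, 0.0],
         [1.0, 12.0, 0.0, 3.0],
         [2.0, 0.0, 9.0, 1.0],
         [0.0, 3.0, 1.0, 11.0]]
    exact = [1.0, 2.0, 3.0, 4.0]
    n = len(A)
    b = [sum(A[i][j] * exact[j] for j in range(n)) for i in range(n)]

    x = [0.0] * n          # x[i] is the value owned by node i
    updates = [0] * n      # number of local iterations done by node i
    # Last saved state of every node, imagined as stored on other nodes.
    backup = {i: (x[i], updates[i]) for i in range(n)}
    recovered = False

    def local_residual(i):
        # Scaled residual of equation i, evaluated on the current values.
        return abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) / abs(A[i][i])

    for step in range(max_steps):
        if step == crash_at:
            # Simulated crash: the replacement node restarts from the
            # last checkpoint instead of from scratch.
            x[crash_node], updates[crash_node] = backup[crash_node]
            recovered = True

        # Nodes progress at their own pace: pick any node, no global barrier.
        i = rng.randrange(n)
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        updates[i] += 1

        if updates[i] % checkpoint_every == 0:
            backup[i] = (x[i], updates[i])

        # Centralized snapshot test (stand-in for decentralized detection).
        if all(local_residual(k) < eps for k in range(n)):
            return x, step + 1, recovered

    raise RuntimeError("no convergence within max_steps")
```

Calling `simulate()` returns the computed solution, the number of single-node updates performed, and a flag saying whether the simulated recovery took place. Because the system is strictly diagonally dominant, the iteration contracts from any starting point, so rolling one node back to an older checkpoint only delays convergence rather than breaking it.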



Author information

Corresponding author

Correspondence to Raphaël Couturier.

About this article

Cite this article

Charr, JC., Couturier, R. & Laiymani, D. A decentralized and fault tolerant convergence detection algorithm for asynchronous iterative algorithms. J Supercomput 53, 269–292 (2010). https://doi.org/10.1007/s11227-009-0293-6


  • DOI: https://doi.org/10.1007/s11227-009-0293-6
