A decentralized and fault tolerant convergence detection algorithm for asynchronous iterative algorithms

Charr, Jean-Claude; Couturier, Raphaël; Laiymani, David

doi:10.1007/s11227-009-0293-6

Published: 01 April 2009

Volume 53, pages 269–292 (2010)

The Journal of Supercomputing

Abstract

This article presents an algorithm that detects, in a decentralized manner, the global convergence of parallel asynchronous iterative applications. The algorithm is fault tolerant: it runs a decentralized saving procedure that allows a crashed node to be replaced by a new one, which resumes the computing task from the last checkpoint. Combined with the advantages of the asynchronous iteration model, this method makes it possible to solve very large scale problems on highly volatile parallel architectures such as Peer-to-Peer networks and distributed clusters. We also present the implementation of this algorithm in the JaceP2P platform, which is dedicated to designing and executing parallel asynchronous iterative applications in volatile environments. Numerous experiments show the robustness and efficiency of our algorithm.

Author information

Authors and Affiliations

Laboratory of Computer Science of Franche-Comté, University of Franche-Comté, IUT de Belfort-Montbéliard, Rue Engel Gros, BP 527, 90016, Belfort, France
Jean-Claude Charr, Raphaël Couturier & David Laiymani

Corresponding author

Correspondence to Raphaël Couturier.

About this article

Cite this article

Charr, JC., Couturier, R. & Laiymani, D. A decentralized and fault tolerant convergence detection algorithm for asynchronous iterative algorithms. J Supercomput 53, 269–292 (2010). https://doi.org/10.1007/s11227-009-0293-6

Received: 03 October 2008
Accepted: 20 March 2009
Published: 01 April 2009
Issue Date: August 2010
DOI: https://doi.org/10.1007/s11227-009-0293-6
