Abstract
This article presents a fault-tolerant algorithm for the decentralized detection of global convergence in parallel asynchronous iterative applications. The algorithm runs a decentralized saving procedure so that, after a node crashes, the dead node can be replaced by a new one that resumes the computing task from the last checkpoint. Combined with the advantages of the asynchronous iteration model, this method makes it possible to solve very large-scale problems on highly volatile parallel architectures such as Peer-to-Peer networks and distributed clusters. We also present the implementation of this algorithm in the JaceP2P platform, which is dedicated to designing and executing parallel asynchronous iterative applications in volatile environments. Numerous experiments show the robustness and efficiency of our algorithm.
About this article
Cite this article
Charr, JC., Couturier, R. & Laiymani, D. A decentralized and fault tolerant convergence detection algorithm for asynchronous iterative algorithms. J Supercomput 53, 269–292 (2010). https://doi.org/10.1007/s11227-009-0293-6
DOI: https://doi.org/10.1007/s11227-009-0293-6