ABSTRACT
X10's Global Load Balancing framework (GLB) implements a user-level task pool for inter-place load balancing. It is based on work stealing and uses the lifeline algorithm. A single worker per place alternates between processing tasks and answering steal requests. We have devised an efficient fault-tolerance scheme for this algorithm that improves on a simpler resilience scheme from our previous work. Its core ideas include incremental backups of "stable" tasks and an actor-like communication structure. This paper reports on our ongoing work to extend the GLB framework accordingly. Although the details of the scheme are omitted, we discuss implementation issues and preliminary experimental results.
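The worker behaviour described above, where a single worker per place alternates between processing tasks and answering steal requests, can be illustrated with a small round-based simulation. This is only a sketch under stated assumptions, not GLB's implementation and not the fault-tolerance scheme: `Place`, `run`, the chunk size, the "give away half the pool" policy, and the ring of buddies (a simplified stand-in for the lifeline graph) are all hypothetical names and choices made for illustration.

```python
from collections import deque


class Place:
    """One place with a single worker and a local task pool (hypothetical model)."""

    def __init__(self, pid, tasks=()):
        self.pid = pid
        self.pool = deque(tasks)
        self.requests = deque()  # ids of places waiting for work from this place
        self.done = 0

    def process(self, n):
        """Process up to n local tasks; a task is simply removed and counted."""
        for _ in range(min(n, len(self.pool))):
            self.pool.popleft()
            self.done += 1

    def answer_steals(self, places):
        """Answer queued steal requests by handing over half of the pool (assumed policy)."""
        while self.requests and len(self.pool) > 1:
            thief = places[self.requests.popleft()]
            for _ in range(len(self.pool) // 2):
                thief.pool.append(self.pool.pop())


def run(places, chunk=4):
    """Round-based simulation in which each worker alternates between its two duties."""
    while any(p.pool for p in places):
        for p in places:
            if p.pool:
                p.process(chunk)          # duty 1: process a chunk of tasks
                p.answer_steals(places)   # duty 2: answer pending steal requests
            else:
                # Idle: register once with the next place in a ring
                # (a simplified stand-in for a lifeline buddy).
                buddy = places[(p.pid + 1) % len(places)]
                if p.pid not in buddy.requests:
                    buddy.requests.append(p.pid)
    return [p.done for p in places]


places = [Place(0, range(100)), Place(1), Place(2), Place(3)]
done = run(places)
```

In this toy run all work starts at place 0 and spreads to the idle places through the recorded requests. Tasks are only moved between pools, never copied or dropped, so the per-place counts in `done` add up to the initial number of tasks.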