DOI: 10.1145/2771774.2771779
short-paper

Towards an efficient fault-tolerance scheme for GLB

Published: 14 June 2015

ABSTRACT

X10's Global Load Balancing framework GLB implements a user-level task pool for inter-place load balancing. It is based on work stealing and deploys the lifeline algorithm. A single worker per place alternates between processing tasks and answering steal requests. We have devised an efficient fault-tolerance scheme for this algorithm, improving on a simpler resilience scheme from our own previous work. Among the basic ideas of the new scheme are incremental backups of "stable" tasks and an actor-like communication structure. The paper reports on our ongoing work to extend the GLB framework accordingly. While details of the scheme are left out, we discuss implementation issues and preliminary experimental results.
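
To make the worker behaviour described in the abstract more concrete, the following is a minimal single-process sketch in Python of a worker that alternates between processing tasks and answering steal requests, with a lifeline fallback for failed steals. It is not GLB's API and not the fault-tolerance scheme of the paper. All names (`Worker`, `give`, `step`, `run`) are hypothetical, the lifeline graph is assumed to be a simple ring, a task is just an integer to be summed, and no backups or failures are modelled.

```python
from collections import deque
import random


class Worker:
    """One worker per place (hypothetical toy model, not the GLB API)."""

    def __init__(self, place):
        self.place = place
        self.pool = deque()     # local task pool; a task is just an int here
        self.inbox = deque()    # pending steal requests: (thief, is_lifeline)
        self.lifelines = []     # idle thieves waiting for work from this place
        self.waiting = False    # True while an own steal request is unanswered
        self.result = 0

    def split(self):
        """Remove and return half of the local tasks (may be empty)."""
        return [self.pool.pop() for _ in range(len(self.pool) // 2)]


def give(victim, thief):
    """Move loot from victim to thief; return False if there is none."""
    loot = victim.split()
    if loot:
        thief.pool.extend(loot)
        thief.waiting = False
    return bool(loot)


def step(w, workers, n, rng):
    """One turn of a worker: process up to n tasks, then answer requests."""
    for _ in range(min(n, len(w.pool))):
        w.result += w.pool.popleft()

    # Serve thieves recorded on the lifeline first.
    still_idle = []
    for thief in w.lifelines:
        if not give(w, thief):
            still_idle.append(thief)
    w.lifelines = still_idle

    # Then answer newly arrived steal requests.
    while w.inbox:
        thief, is_lifeline = w.inbox.popleft()
        if give(w, thief):
            continue
        if is_lifeline:
            w.lifelines.append(thief)   # remember the thief; it stays idle
        else:
            # Random steal failed: the thief falls back to its lifeline
            # buddy (assumption: the next place on a ring).
            buddy = workers[(thief.place + 1) % len(workers)]
            buddy.inbox.append((thief, True))

    # Out of work and no request pending: try a random victim.
    if not w.pool and not w.waiting and len(workers) > 1:
        victim = rng.choice([v for v in workers if v is not w])
        victim.inbox.append((w, False))
        w.waiting = True


def run(num_places=4, num_tasks=1000, n=8, seed=0):
    """Simulate the places round-robin until every task is processed."""
    rng = random.Random(seed)
    workers = [Worker(p) for p in range(num_places)]
    workers[0].pool.extend(range(num_tasks))    # all work starts at place 0
    while any(w.pool for w in workers):
        for w in workers:
            step(w, workers, n, rng)
    return [w.result for w in workers]
```

Calling `run()` returns one partial sum per place; because tasks are only ever moved between pools and never copied, the partial sums add up to `sum(range(num_tasks))`. Making this kind of loop resilient to place failures is the problem the paper addresses, and the sketch does not attempt it.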

              Published in

              X10 2015: Proceedings of the ACM SIGPLAN Workshop on X10
              June 2015
              38 pages
              ISBN: 9781450335867
              DOI: 10.1145/2771774

              Copyright © 2015 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Acceptance Rates

              Overall Acceptance Rate: 5 of 5 submissions, 100%
