SpotMPI: A Framework for Auction-Based HPC Computing Using Amazon Spot Instances

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7017)

Abstract

The economy of scale offers cloud computing virtually unlimited, cost-effective processing potential. In theory, prices under fair market conditions should reflect the most reasonable costs of computation. Fairness is ensured by mutual agreement between sellers and buyers, and resource-use efficiency is optimized automatically in the process. While cloud providers have no lack of incentives to offer auction-based computing platforms, using these volatile platforms for practical computing is a challenge for existing programming paradigms. This paper reports a methodology and a toolkit designed to tame these challenges for MPI applications.

Unlike existing MPI fault tolerance tools, we emphasize dynamically adjusted optimal checkpoint-restart (CPR) intervals. We introduce a formal model, then an HPC application toolkit named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Our models capture the intrinsic dependencies between critical time-consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. We study algorithms with different computing vs. communication complexities. Our results offer non-trivial insights into optimal bidding and application scaling strategies.
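To make the idea of a bid-dependent checkpoint interval concrete, the following is a minimal sketch and not the SpotMPI model itself, which is not given in this abstract. It assumes the classical first-order approximation attributed to Young, interval = sqrt(2 × checkpoint cost × mean time between failures), and it treats an out-of-bid event (spot price rising above the bid) as a failure. The function names, the fixed sampling step, and the price samples are all hypothetical.

```python
import math


def mean_uptime_s(prices, bid, step_s):
    """Crude estimate of mean time between out-of-bid events, in seconds.

    `prices` is a list of spot prices sampled every `step_s` seconds
    (a hypothetical, evenly sampled bidding history). An instance is
    assumed to run while price <= bid. Returns None if the bid never
    clears the market in this history.
    """
    runs = []      # lengths (in samples) of consecutive in-bid periods
    current = 0
    for price in prices:
        if price <= bid:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        # The final run is still in progress; counting it as complete
        # underestimates uptime slightly.
        runs.append(current)
    if not runs:
        return None
    return step_s * sum(runs) / len(runs)


def young_interval_s(checkpoint_cost_s, mtbf_s):
    """First-order optimal checkpoint interval: sqrt(2 * C * M)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical hourly price samples (USD/hour) and bid.
    history = [0.10, 0.10, 0.30, 0.10, 0.10, 0.10, 0.10]
    uptime = mean_uptime_s(history, bid=0.20, step_s=3600)
    print("estimated mean uptime (s):", uptime)
    print("checkpoint interval (s):", young_interval_s(50.0, uptime))
```

Under these assumptions, a higher bid lengthens the estimated uptime and therefore the checkpoint interval, which is one way the bidding history and the CPR interval can be coupled.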

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Taifi, M., Shi, J.Y., Khreishah, A. (2011). SpotMPI: A Framework for Auction-Based HPC Computing Using Amazon Spot Instances. In: Xiang, Y., Cuzzocrea, A., Hobbs, M., Zhou, W. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2011. Lecture Notes in Computer Science, vol 7017. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24669-2_11

  • DOI: https://doi.org/10.1007/978-3-642-24669-2_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24668-5

  • Online ISBN: 978-3-642-24669-2

  • eBook Packages: Computer Science, Computer Science (R0)
