
ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures

International Journal of Parallel Programming

Abstract

Exploiting the full computational power of current hierarchical multiprocessor machines requires a careful distribution of threads and data across the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help with such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints about thread-memory affinity. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing balancing strategies and next-touch data distribution policies. These results also point to further optimization opportunities.
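To make concrete the kind of application structure such hints are derived from, the following minimal C/OpenMP sketch is illustrative only and not taken from the paper: the number of teams, the slice size, and the work_on_slice helper are hypothetical. Nested parallel regions expose a two-level structure that a NUMA-aware runtime such as ForestGOMP can map onto the machine, with the outer team spread over NUMA nodes and each inner team confined to the cores of one node, close to the data it first touched.

```c
#include <stdlib.h>
#include <omp.h>

#define NODES  4          /* hypothetical number of NUMA nodes */
#define SLICE  (1 << 20)  /* hypothetical per-team working set */

/* Stand-in for the real per-slice kernel. */
static void work_on_slice(double *slice, size_t n)
{
    #pragma omp parallel for    /* inner team: the cores of one node */
    for (size_t i = 0; i < n; i++)
        slice[i] = slice[i] * 2.0 + 1.0;
}

int main(void)
{
    double *data[NODES];

    omp_set_nested(1);          /* enable nested parallel regions */

    #pragma omp parallel num_threads(NODES)  /* outer team: one thread per node */
    {
        int node = omp_get_thread_num();

        /* First touch from the team that will use the slice, so its
           pages are allocated on the local memory node. */
        data[node] = malloc(SLICE * sizeof(double));
        for (size_t i = 0; i < SLICE; i++)
            data[node][i] = 0.0;

        work_on_slice(data[node], SLICE);
        free(data[node]);
    }
    return 0;
}
```

With such a structure, a scheduler can keep an inner team and its slice together on one node, or migrate both when the load becomes imbalanced, which corresponds to the mixed thread-and-data migration strategy the abstract describes.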



Author information

Correspondence to François Broquedis.


About this article

Cite this article

Broquedis, F., Furmento, N., Goglin, B. et al. ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures. Int J Parallel Prog 38, 418–439 (2010). https://doi.org/10.1007/s10766-010-0136-3

