Journal of Parallel and Distributed Computing
Volume 64, Issue 1, January 2004, Pages 108-134
 
doi:10.1016/j.jpdc.2003.09.005
Copyright © 2003 Elsevier Inc. All rights reserved.

Improving effective bandwidth through compiler enhancement of global cache reuse

Chen Ding a (corresponding author) and Ken Kennedy b

a Department of Computer Science, University of Rochester, P.O. Box 270226, Rochester, NY 14627, USA
b Center for High Performance Software Research (HiPerSoft), Rice University, Houston, TX, USA

Received 20 November 2002; 
revised 11 September 2003. 
Available online 2 December 2003.


Abstract

The performance of modern machines is increasingly limited by insufficient memory bandwidth. One way to alleviate this bandwidth limitation for a given program is to minimize the aggregate data volume the program transfers from memory. In this article we present compiler strategies for accomplishing this minimization. Following a discussion of the underlying causes of bandwidth limitations, we present a two-step strategy to exploit global cache reuse—the temporal reuse across the whole program and the spatial reuse across the entire data set used in that program. In the first step, we fuse computation on the same data using a technique called reuse-based loop fusion to integrate loops with different control structures. We prove that optimal fusion for bandwidth is NP-hard and we explore the limitations of computation fusion using perfect program information. In the second step, we group data used by the same computation through the technique of affinity-based data regrouping, which intermixes the storage assignments of program data elements at different granularities. We show that the method is compile-time optimal and can be used on array and structure data. We prove that two extensions—partial and dynamic data regrouping—are NP-hard problems. Finally, we describe our compiler implementation and experiments demonstrating that the new global strategy, on average, reduces memory traffic by over 40% and improves execution speed by over 60% on two high-end workstations.

Author Keywords: Reference affinity; Data locality; Program analysis; Loop fusion; Data transformation; Global cache reuse
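
The two steps named in the abstract can be illustrated by hand on a toy kernel. The sketch below is not taken from the article: the function names (sum_unfused, sum_fused, sum_regrouped) and the type pair_t are hypothetical, and the transformations are applied manually to two loops with identical bounds, whereas the article describes a compiler that handles loops with different control structures. It only shows the intended effect: fusion lets each element of a[] be reused while it is still in cache, and regrouping places data that is always accessed together in the same cache block.

```c
#include <assert.h>
#include <stddef.h>

/* Before fusion: two loops each traverse a[]. When n is large, a[] has
   been evicted by the time the second loop runs and is loaded twice. */
double sum_unfused(const double *a, double *b, size_t n) {
    assert(a != NULL && b != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
    for (size_t i = 0; i < n; i++)
        s += a[i] + b[i];
    return s;
}

/* Step 1, loop fusion (hand-applied): both computations on a[i] and b[i]
   run in the same iteration, so each element is brought in once. */
double sum_fused(const double *a, double *b, size_t n) {
    assert(a != NULL && b != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        b[i] = 2.0 * a[i];
        s += a[i] + b[i];
    }
    return s;
}

/* Step 2, data regrouping (hand-applied): a[i] and b[i] are always used
   together, so their storage is interleaved into one array of pairs. */
typedef struct {
    double a;
    double b;
} pair_t;

double sum_regrouped(pair_t *ab, size_t n) {
    assert(ab != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        ab[i].b = 2.0 * ab[i].a;
        s += ab[i].a + ab[i].b;
    }
    return s;
}
```

All three functions compute the same result; only the order of memory accesses and the data layout differ. Whether the fused or regrouped version is actually faster depends on the array size relative to the cache and on the target machine, which this sketch does not measure.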


 