doi:10.1016/j.cpc.2008.01.045
Copyright © 2008 Elsevier B.V. All rights reserved.
On ERI sorting for SIMD execution of large-scale Hartree–Fock SCF
aCentre for Telecommunications and Information Engineering, Monash University, Melbourne, Australia
bCentre for Distributed Systems and Software Engineering, Monash University, Melbourne, Australia
cOrganic Chemistry Institute, University of Zurich, Switzerland
Received 23 August 2007;
revised 18 January 2008;
accepted 22 January 2008.
Available online 1 February 2008.
References and further reading may be available for this article. To view references and further reading you must
purchase this article.
Abstract
Given the resurgent attractiveness of single-instruction-multiple-data (SIMD) processing, it is important for high-performance computing applications to be SIMD-capable. The Hartree–Fock SCF (HF-SCF) application, in it's canonical form, cannot fully exploit SIMD processing. Prior attempts to implement Electron Repulsion Integral (ERI) sorting functionality to essentially “SIMD-ify” the HF-SCF application have met frustration because of the low throughput of the sorting functionality. With greater awareness of computer architecture, we discuss how the sorting functionality may be practically implemented to provide high-performance. Overall system performance analysis, including memory locality analysis, is also conducted, and further emphasises that a system with ERI sorting is capable of very high throughput. We discuss two alternative implementation options, with one immediately accessible software-based option discussed in detail. The impact of workload characteristics on expected performance is also discussed, and it is found that in general as basis set size increases the potential performance of the system also increases. Consideration is given to conventional CPUs, GPUs, FPGAs, and the Cell Broadband Engine architecture.
Keywords: Hartree–Fock self consistent field; Electron repulsion integrals; Single-instruction-multiple-data processing
PACS classification codes: 02.60.Pn; 07.05.Bx; 31.15.Ar; 89.20.Ff
Fig. 1. Thread matching with symmetry.
Fig. 2. Converting TLP to DLP. Each basic scheduling block (BSB) corresponds to a contiguous chunk of instructions. The BSB2 block has a cyclic arc with an associated variable integer—the arc corresponds to a loop identifier and the associated integer specifies the loop count; e.g., in thread 2 BSB2 is executed 3 times, whereas in thread 3 BSB2 is executed once. The difference in loop count is effectively a difference in control-flow. In this example the loop count of BSB2 is the only source of control-flow variation between different threads, therefore to identify threads with identical control-flow one merely has to check the loop count specified for BSB2. Matching threads can then be processed with exactly the same sequence of instructions, and may therefore be issued for SIMD processing.
Fig. 3. Architecture overview.
Fig. 4. Logical view of content-addressable memory structure.
Fig. 5. System block diagram. Values in “(…)” parentheses within blocks indicate storage capacity. Values in “[…]” parentheses on arrows indicate provisional bit-width allocation. Only major data busses are indicated; control busses have been omitted for clarity.
Fig. 6. ERI threads and SIMD utilisation. The ERI thread population contains 10 threads of three classes; 6 class A threads, 3 class B threads, and 1 class C thread. ERI threads of the same class may be computed in SIMD fashion. With a SIMD width of 6, a maximum of 6 ERIs may be computed simultaneously. If fewer than 6 ERIs are computed simultaneously, some processing resources are left idle/unutilised, and are basically wasted. In the example above, an aggregate efficiency of 55.56% is achieved.
Fig. 7. Impact of Wthreads on SIMD unit utilisation. All workloads in 6-31G** basis.
Fig. 8. SIMD unit utilisation for candidate workloads. All workloads in 6-31G** basis. V=64 and Wthreads=16k.
Fig. 9. Spatial locality for Zinc Chloride (6-31G** basis) workload.
Fig. 10. Spatial locality for Butane (6-31G** basis) workload.
Fig. 11. Temporal locality for candidate workloads. All workloads in 6-31G** basis.
Fig. 12. SIMD unit utilisation for candidate workloads with various treatments of contraction. All workloads in 6-31G** basis. Wthreads=16k.
Fig. 13. SIMD unit utilisation with contraction unrolling only. V=4, Wthreads=1 (i.e. ERI sorting effectively disabled). A “late-contraction” ERI routine (such as Rys quadrature) is assumed.
Fig. 15. ERI Sorting with conventional systems. Recall that the control tuple is a token comprising all the data that expresses the control-flow of the thread—in this application the control tuple contains the 3 angular momenta terms (ax,ay,az) for each of the 4 basis functions constituent in an ERI quartet, and optionally also the level of contraction K of each of the basis functions. A 64 bit control tuple is sufficiently large for this application.
Fig. 16. Software-based ERI Sorting time and SIMD unit utilisation. V=64, S=1k, CRC hash function, 256 hash keys. Contractions are matched (as opposed to unrolled or ignored—see Section 6 for explanation of these different matching schemes).
Table 1.
Basis set size and thread count for molecules in 6-31G** basis


Catenane is a “chain-like” molecular structure, which finds use in applications such as nanomachines
[9].
Table 2.
Vector utilisation with contraction unrolling only
