ABSTRACT
Researchers and software developers require effective performance evaluation. Researchers must evaluate optimizations or measure overhead. Software developers use automatic performance regression tests to discover when changes improve or degrade performance. The standard methodology is to compare execution times before and after applying changes.
Unfortunately, modern architectural features make this approach unsound. Statistically sound evaluation requires multiple samples to test whether one can or cannot (with high confidence) reject the null hypothesis that results are the same before and after. However, caches and branch predictors make performance dependent on machine-specific parameters and the exact layout of code, stack frames, and heap objects. A single binary constitutes just one sample from the space of program layouts, regardless of the number of runs. Since compiler optimizations and code changes also alter layout, it is currently impossible to distinguish the impact of an optimization from that of its layout effects.
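The hypothesis-testing step described above can be illustrated with a small sketch. It is not the paper's procedure: it uses a permutation test on the difference of means as a stand-in, and the timings, sample sizes, and the name `permutation_test` are all hypothetical. It assumes each sample was measured under a different program layout; repeated runs of one binary would not satisfy that assumption.

```python
import random


def permutation_test(before, after, rounds=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value for the null hypothesis that `before`
    and `after` were drawn from the same distribution.
    """
    rng = random.Random(seed)
    n = len(before)
    observed = abs(sum(after) / len(after) - sum(before) / n)
    pooled = list(before) + list(after)
    hits = 0
    for _ in range(rounds):
        # Under the null hypothesis the group labels are arbitrary,
        # so shuffle them and recompute the difference of means.
        rng.shuffle(pooled)
        left, right = pooled[:n], pooled[n:]
        diff = abs(sum(right) / len(right) - sum(left) / n)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


# Hypothetical execution times in seconds, one per randomized layout.
rng = random.Random(1)
before = [rng.gauss(10.0, 0.2) for _ in range(30)]
unchanged = [rng.gauss(10.0, 0.2) for _ in range(30)]
faster = [rng.gauss(9.0, 0.2) for _ in range(30)]

p_unchanged = permutation_test(before, unchanged)
p_faster = permutation_test(before, faster)
print(f"p (no real change): {p_unchanged:.4f}")
print(f"p (real speedup):   {p_faster:.4f}")
```

A small p-value supports rejecting the null hypothesis; a large one means the data cannot distinguish the change from noise.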
This paper presents Stabilizer, a system that enables the use of the powerful statistical techniques required for sound performance evaluation on modern architectures. Stabilizer forces executions to sample the space of memory configurations by repeatedly re-randomizing layouts of code, stack, and heap objects at runtime. Stabilizer thus makes it possible to control for layout effects. Re-randomization also ensures that layout effects follow a Gaussian distribution, enabling the use of statistical tests like ANOVA. We demonstrate Stabilizer's efficiency (<7% median overhead) and its effectiveness by evaluating the impact of LLVM's optimizations on the SPEC CPU2006 benchmark suite. We find that, while -O2 has a significant impact relative to -O1, the performance impact of -O3 over -O2 optimizations is indistinguishable from random noise.
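One way to see why repeated re-randomization would push execution times toward a Gaussian distribution is the central limit theorem: a run becomes a sum of many independently drawn layout effects. The toy simulation below is a sketch under strong assumptions, not a model of Stabilizer itself. It assumes each re-randomization period's cost is an independent draw from an arbitrary skewed (exponential) distribution, and `PERIODS`, `RUNS`, and `skewness` are illustrative names.

```python
import random
from statistics import mean, pstdev

PERIODS = 100   # hypothetical number of re-randomization periods per run
RUNS = 2000     # hypothetical number of simulated runs


def skewness(xs):
    """Sample skewness: near 0 for symmetric (e.g. Gaussian) data."""
    m, s = mean(xs), pstdev(xs)
    return mean([((x - m) / s) ** 3 for x in xs])


rng = random.Random(42)

# Fixed layout: one layout cost is drawn per run and paid in every period,
# so the run time inherits the skew of the layout-cost distribution.
fixed = [PERIODS * rng.expovariate(1.0) for _ in range(RUNS)]

# Re-randomized layout: a fresh, independent cost is drawn each period,
# so the run time is a sum of many independent draws.
rerandomized = [
    sum(rng.expovariate(1.0) for _ in range(PERIODS)) for _ in range(RUNS)
]

skew_fixed = skewness(fixed)
skew_rerandomized = skewness(rerandomized)
print(f"skewness, fixed layout:         {skew_fixed:.2f}")
print(f"skewness, re-randomized layout: {skew_rerandomized:.2f}")
```

In this model the re-randomized run times are far more symmetric than the fixed-layout ones, which is the property that parametric tests such as ANOVA rely on.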