ABSTRACT
Full correlation matrix analysis (FCMA) is an unbiased approach for exhaustively studying interactions among brain regions in functional magnetic resonance imaging (fMRI) data from human participants. To answer neuroscientific questions efficiently, we are developing a closed-loop analysis system based on FCMA, running on a cluster of nodes equipped with Intel® Xeon Phi™ coprocessors. Here we propose several data-driven algorithmic modifications to improve performance on the coprocessor. Our experiments with real datasets show that the optimized single-node code runs 5x–16x faster than the baseline implementation using the well-known Intel® MKL and LibSVM libraries, and that the cluster implementation achieves near-linear speedup on 5760 cores.
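For orientation, the core computational kernel of FCMA reduces to dense matrix multiplication: once each voxel's time series is z-scored, every pairwise Pearson correlation is a dot product, so the full voxel-by-voxel correlation matrix is a single GEMM over the data matrix. The sketch below is our own NumPy illustration of this reduction, not the authors' implementation; the function name and array shapes are hypothetical. It shows why an optimized BLAS such as Intel® MKL is the natural single-node baseline.

```python
# Minimal sketch (not the authors' code) of the core FCMA kernel:
# the full Pearson correlation matrix for one epoch computed as a
# single dense matrix multiplication over z-scored time series.
import numpy as np

def full_correlation_matrix(data: np.ndarray) -> np.ndarray:
    """data: (n_voxels, n_timepoints) array for one epoch.
    Returns the (n_voxels, n_voxels) Pearson correlation matrix."""
    n_voxels, n_tp = data.shape
    # Z-score each voxel's time series (zero mean, unit variance).
    z = data - data.mean(axis=1, keepdims=True)
    z /= z.std(axis=1, keepdims=True)
    # For z-scored series, corr(i, j) is a scaled inner product, so
    # the whole matrix is one GEMM: (V x T) @ (T x V).
    return (z @ z.T) / n_tp

# Example: 1,000 voxels, 12 timepoints -> a 1,000 x 1,000 matrix.
rng = np.random.default_rng(0)
corr = full_correlation_matrix(rng.standard_normal((1000, 12)))
assert np.allclose(np.diag(corr), 1.0)  # self-correlation is 1
```

In the paper's setting this GEMM is repeated for every epoch and subject, and the resulting correlation rows feed an SVM classifier (LibSVM in the baseline), which is what makes the end-to-end pipeline a target for the coprocessor-specific optimizations the abstract describes.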