ABSTRACT
Full correlation matrix analysis (FCMA) is an unbiased approach for exhaustively studying interactions among brain regions in functional magnetic resonance imaging (fMRI) data from human participants. To answer neuroscientific questions efficiently, we are developing a closed-loop analysis system based on FCMA, running on a cluster of nodes equipped with Intel® Xeon Phi™ coprocessors. Here we propose several data-driven algorithmic modifications to improve performance on the coprocessor. Our experiments with real datasets show that the optimized single-node code runs 5x–16x faster than the baseline implementation using the well-known Intel® MKL and LibSVM libraries, and that the cluster implementation achieves near-linear speedup on 5760 cores.
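For orientation, the core computational kernel of FCMA reduces to dense matrix multiplication: once each voxel's time series is z-scored, every pairwise Pearson correlation is a dot product, so the full voxel-by-voxel correlation matrix is a single GEMM over the data matrix. The sketch below is our own NumPy illustration of this reduction, not the authors' implementation; the function name and array shapes are hypothetical. It shows why an optimized BLAS such as Intel® MKL is the natural single-node baseline.

```python
# Minimal sketch (not the authors' code) of the core FCMA kernel:
# the full Pearson correlation matrix for one epoch computed as a
# single dense matrix multiplication over z-scored time series.
import numpy as np

def full_correlation_matrix(data: np.ndarray) -> np.ndarray:
    """data: (n_voxels, n_timepoints) array for one epoch.
    Returns the (n_voxels, n_voxels) Pearson correlation matrix."""
    n_voxels, n_tp = data.shape
    # Z-score each voxel's time series (zero mean, unit variance).
    z = data - data.mean(axis=1, keepdims=True)
    z /= z.std(axis=1, keepdims=True)
    # For z-scored series, corr(i, j) is a scaled inner product, so
    # the whole matrix is one GEMM: (V x T) @ (T x V).
    return (z @ z.T) / n_tp

# Example: 1,000 voxels, 12 timepoints -> a 1,000 x 1,000 matrix.
rng = np.random.default_rng(0)
corr = full_correlation_matrix(rng.standard_normal((1000, 12)))
assert np.allclose(np.diag(corr), 1.0)  # self-correlation is 1
```

In the paper's setting this GEMM is repeated for every epoch and subject, and the resulting correlation rows feed an SVM classifier (LibSVM in the baseline), which is what makes the end-to-end pipeline a target for the coprocessor-specific optimizations the abstract describes.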