Optimizing the fMRI data-processing pipeline using prediction and reproducibility performance metrics: I. A preliminary group analysis
Introduction
Neuroimaging researchers typically focus on extracting “neuroscientifically relevant” results from their data sets. Almost always this is done without attempting to optimize, or even understand, the relative influence of the pipeline processing choices made in analyzing the data. Moreover, the generation of a “plausible result” that can be linked to the neuroscientific literature is often taken as justification of the pipeline choices made, introducing a systematic bias in the field towards prevailing neuroscientific expectations and away from unexpected, new results (Skudlarski et al., 1999, Strother et al., 1995a, Strother et al., 1995b, Strother et al., 2002). In addition, there is accumulating evidence in the literature that applying a new processing pipeline to a raw data set may yield significantly modified spatial activation patterns as a result of changing/optimizing preprocessing techniques (Della-Maggiore et al., 2002, Friston et al., 2000, LaConte et al., 2003a, Shaw et al., 2003a, Tanabe et al., 2002) and/or the data analysis approach (Beckmann and Smith, 2004, Friston et al., 1996, Kherif et al., 2002, Liou et al., 2003, Muley et al., 2001, Nandy and Cordes, 2003, Shaw et al., 2002, Strother et al., 1995a, Tegeler et al., 1999). These real-data results are supported by several simulation studies, which indicate that significant differences in signal detection performance should be expected for different preprocessing (Gavrilescu et al., 2002, Skudlarski et al., 1999) and data analysis (Beckmann and Smith, 2004, Lange et al., 1999, Lukic et al., 2002, Lukic et al., 2004, Tzikas et al., 2004) approaches. These published results demonstrate the likelihood that new insights into human brain function may be obscured by poor and/or limited choices in the image-processing pipeline (McIntosh, private communication).
Simulations in which the true activation signal is known allow different pipeline choices to be ranked using standard signal detection metrics based on receiver operating characteristic (ROC) curves (Swets, 1988). However, for fMRI, this is problematic because the vascular, blood oxygenation level dependent (BOLD) signal and noise structure are not well understood, and it is generally unknown whether a particular set of simulation results is relevant for any given fMRI data set, a problem that is compounded if we are interested in the BOLD fMRI signal and noise structure as a function of age and/or disease (D'Esposito et al., 2003).
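When the truth map is known from simulation, ranking pipelines by ROC performance reduces to a simple rank statistic. As an illustration (our own sketch, not code from this study; the function name is hypothetical), the area under the ROC curve for a statistical parametric image (SPI) scored against a binary truth map can be computed via the Mann-Whitney formulation:

```python
import numpy as np

def roc_auc(spi, truth):
    """Area under the ROC curve for an SPI scored against a known
    simulation truth map (1 = active voxel, 0 = inactive).
    Rank-based (Mann-Whitney) computation; assumes no tied values."""
    ranks = spi.argsort().argsort() + 1  # 1-based rank of each voxel value
    pos = truth.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # U statistic of the positive class, normalized to [0, 1]
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A pipeline variant whose SPI ranks every truly active voxel above every inactive one scores 1.0; chance-level detection scores about 0.5.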
In an attempt to avoid the need for simulations, researchers have proposed data-driven techniques that estimate performance metrics from the available data. Le and Hu (1997) suggested estimating the true distribution based on highly averaged results. However, the large number of repeat scanning runs required makes this approach impractical, even setting aside the bias introduced by requiring the mean to tend towards the true signal.
Other researchers have focused on the reproducibility, or reliability, of activation patterns, based on the recognition that smaller p values do not imply a stronger likelihood of getting the same result in another replication of the same experiment, and on the historical importance of replication as a fundamental criterion for a result to be considered scientific (Carver, 1993, Genovese et al., 1997, Kiehl and Liddle, 2003, Liou et al., 2003, Maitra et al., 2002, Moeller et al., 1999, Strother et al., 1997, Strother et al., 1998, Tegeler et al., 1999). This is one reason why minimizing p values is a poor choice of quantitative performance measure for pipeline optimization, although it has been used repeatedly in the literature (e.g., Hopfinger et al., 2000, Tanabe et al., 2002).
Provided at least three repeat runs are available, an empirical-ROC curve may be estimated from the data (Genovese et al., 1997), and by incorporating local spatial correlation into the same framework a minimum of two runs is sufficient (Maitra et al., 2002). An interesting application of this empirical-ROC generation framework together with a technique for selecting the optimal operating point on the resulting ROC curve has been recently published by Liou et al. (2003). An alternative procedure for generating empirical ROC curves that requires a “control state” run to estimate false-positive rates together with a standard experimental run has been proposed by Nandy and Cordes (2003).
Strother et al. (1997, 1998) proposed an alternative reproducibility metric based on a principal components analysis (PCA) of two or more independently replicated statistical parametric images (SPIs). This approach was further developed in Kjems et al. (2002), LaConte et al. (2003a), Shaw et al. (2002, 2003a), Strother et al. (2002), and Tegeler et al. (1999). A correlation coefficient summarizes the reproducibility of two independent SPIs as reflected in their scatter plot. This reproducibility correlation coefficient also directly measures the overall signal-to-noise level of the single, reproducible, Z-scored, activation SPI that is extracted from the principal PCA axis of the scatter plot (Strother et al., 2002). However, this reproducibility metric is a biased measure because it inherits any data-analysis model biases that exist when measuring SPIs. It seems likely that the empirical-ROC metrics share this bias and that, like Strother's reproducibility metric, they should not be considered measures of true signal detection performance.
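The scatter-plot idea can be illustrated with a short sketch (our own illustration, not the NPAIRS implementation; function and variable names are hypothetical). For two standardized SPIs, the major axis of their voxel-wise scatter plot carries the reproducible signal, the minor axis carries the noise, and the consensus image is Z-scored by the minor-axis noise standard deviation:

```python
import numpy as np

def split_half_reproducibility(spi1, spi2):
    """Reproducibility correlation and consensus Z-scored SPI from two
    independently replicated SPIs, via PCA of their 2-D scatter plot.
    A sketch of the idea attributed to Strother et al. above."""
    # Standardize each SPI across voxels (zero mean, unit variance)
    z1 = (spi1 - spi1.mean()) / spi1.std()
    z2 = (spi2 - spi2.mean()) / spi2.std()

    # Reproducibility = Pearson correlation of the scatter plot
    r = float(np.mean(z1 * z2))

    # For standardized SPIs the principal (major) PCA axis lies along
    # the line of identity; the minor axis is orthogonal to it.
    signal = (z1 + z2) / np.sqrt(2.0)  # variance along major axis = 1 + r
    noise = (z1 - z2) / np.sqrt(2.0)   # variance along minor axis = 1 - r

    # Z-score the consensus image by the minor-axis (noise) std
    rspi = signal / noise.std()
    return r, rspi
```

With uncorrelated replications r is near zero, and the consensus SPI carries no more signal than either input; as r approaches one, the Z-scores of the consensus image grow accordingly.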
Simultaneously, Hansen and Strother, guided by the field of predictive learning in statistics (Hastie et al., 2001, Larsen and Hansen, 1997, Mjolsness and DeCoste, 2001), introduced the idea of using potentially unbiased cross-validation-based prediction metrics to measure data-analytic performance in functional neuroimaging (Hansen et al., 1999, Kjems et al., 2002, Kustra and Strother, 2001, Lautrup et al., 1995, Morch et al., 1997). Similar prediction metrics have recently been used by others (McKeown, 2000, Ngan et al., 2000). In addition, prediction metrics have been used to gain new insight into the debate over the spatially modular versus spatially distributed nature of human brain processing (Cox and Savoy, 2003, Haxby et al., 2001). We expect both prediction and reproducibility metrics to play an increasingly important role in the future optimization and interpretation of fMRI studies.
With this in mind, Strother et al. (2002) proposed the unique approach of simultaneously measuring and combining data-driven prediction and reproducibility metrics for pipeline and data analysis optimization using split-half resampling (a combination of two-fold cross-validation and delete-d jackknife resampling) to produce a ROC-like plot. They developed the NPAIRS (Nonparametric Prediction, Activation, Influence and Reproducibility reSampling) software package to implement and test this idea (Kjems et al., 2002, LaConte et al., 2003a, Shaw et al., 2003a; Web distribution and documentation at http://neurovia.umn.edu/incweb/npairs_info.html). In preliminary comparisons using simulations, Shaw et al. (2003b) have shown that prediction-reproducibility plots seem to perform at least as well as standard ROC curves.
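The split-half scheme can be sketched as follows (a minimal illustration, not the NPAIRS IDL code; we substitute a nearest-class-mean classifier for the CVA model, and all names are hypothetical). Each resampling split divides the subjects into two independent halves; prediction is measured by training on one half and classifying the other, while reproducibility is the correlation of the two halves' activation-difference maps:

```python
import numpy as np

def split_half_npairs(data, labels, subjects, n_splits=10, seed=0):
    """Sketch of split-half resampling: for each random split of
    subjects into two halves, (1) train a simple classifier on one
    half to predict the other (prediction metric), and (2) correlate
    the two halves' mean activation-difference maps (reproducibility).
    Returns a list of (prediction, reproducibility) pairs."""
    rng = np.random.default_rng(seed)
    uniq = np.unique(subjects)
    results = []
    for _ in range(n_splits):
        perm = rng.permutation(uniq)
        half1, half2 = perm[: len(uniq) // 2], perm[len(uniq) // 2:]
        m1, m2 = np.isin(subjects, half1), np.isin(subjects, half2)

        pred = []
        for train, test in ((m1, m2), (m2, m1)):
            # Nearest-class-mean classifier as a stand-in for CVA
            means = {c: data[train & (labels == c)].mean(0)
                     for c in np.unique(labels)}
            classes = np.array(sorted(means))
            centers = np.stack([means[c] for c in classes])
            d = ((data[test][:, None, :] - centers[None]) ** 2).sum(-1)
            pred.append(np.mean(classes[d.argmin(1)] == labels[test]))

        # One SPI per half: mean task-minus-baseline difference map
        spi = [data[m & (labels == 1)].mean(0)
               - data[m & (labels == 0)].mean(0) for m in (m1, m2)]
        r = np.corrcoef(spi[0], spi[1])[0, 1]
        results.append((float(np.mean(pred)), float(r)))
    return results
```

Plotting the resulting (prediction, reproducibility) pairs across splits and pipeline variants yields the ROC-like plot described above.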
This paper is concerned with the combined use of prediction and reproducibility metrics to test the relative importance of different processing pipeline choices in the detection of large-scale brain networks from the combined BOLD-fMRI scans of 16 subjects performing a block-design, parametric-static-force task. We have investigated the impact and interaction of interpolation, within-plane spatial smoothing, temporal detrending, between-subject alignment using affine and nonlinear polynomial registration (i.e., warps), and “tuning” the data analysis approach. The large-scale brain networks were detected for separate, uncorrelated OFF–ON and parametric force responses using canonical variates analysis (CVA), a flexible multivariate form of linear discriminant analysis that may be tuned to fit the data. The prediction and reproducibility metrics were measured using split-half resampling of the 16-subject group within the NPAIRS framework. We found that the metrics could easily detect the smoothing difference between sinc and trilinear-based interpolation, and that even a small amount of smoothing together with tuning the CVA model were by far the most important processing parameters. Detrending was found to be essential to remove low-frequency time trends, and to allow a reliable parametric force response to emerge despite pseudo-randomization of the force levels across two runs per session. In contrast, using an affine registration compared with 3rd to 7th order polynomial warps had only a minor impact on the performance metrics. However, our results make it clear that both prediction and reproducibility metrics are required for optimization as they individually select different optimal pipeline parameter settings that are associated with somewhat different activation patterns. 
In addition, the parameter settings of components in the pipeline interact so that the current practice of reporting the optimization of components tested in relative isolation is unlikely to lead to optimized processing pipelines.
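The CVA model referred to above can be sketched as linear discriminant analysis on a PCA-reduced subspace, with the number of retained principal components acting as the tuning parameter. The following numpy-only illustration (our own, not the paper's implementation; names hypothetical) solves the standard between/within scatter generalized eigenproblem on the PC scores and maps the leading canonical direction back to voxel space as an SPI:

```python
import numpy as np

def cva_on_pcs(X, y, n_pcs):
    """CVA sketch: reduce (n_scans, n_voxels) data X to n_pcs principal
    components, then solve Sb v = lambda Sw v on the PC scores.
    n_pcs is the tuning knob trading prediction against reproducibility."""
    # PCA via SVD of the mean-centered data
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_pcs].T                      # (n_scans, n_pcs)

    # Within-class (Sw) and between-class (Sb) scatter in PC space
    grand = scores.mean(0)
    Sw = np.zeros((n_pcs, n_pcs))
    Sb = np.zeros((n_pcs, n_pcs))
    for c in np.unique(y):
        Sc = scores[y == c]
        mc = Sc.mean(0)
        Sw += (Sc - mc).T @ (Sc - mc)
        Sb += len(Sc) * np.outer(mc - grand, mc - grand)

    # Whiten Sw, then eigendecompose the whitened Sb (symmetric form
    # of the generalized eigenproblem)
    w_vals, w_vecs = np.linalg.eigh(Sw)
    W = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T
    b_vals, b_vecs = np.linalg.eigh(W @ Sb @ W)
    order = np.argsort(b_vals)[::-1]                # largest eigenvalue first
    canon = (W @ b_vecs)[:, order]                  # canonical directions

    # Map the leading canonical direction back to voxel space -> SPI
    spi = Vt[:n_pcs].T @ canon[:, 0]
    return spi, b_vals[order]
```

Raising `n_pcs` gives the discriminant more flexibility (typically better prediction) at the cost of fitting noise (typically worse reproducibility), which is exactly the trade-off the combined metrics are designed to expose.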
Section snippets
Data acquisition
For a detailed description of data acquisition protocols, see LaConte et al., 2003a, LaConte et al., 2003b.
Software
The NPAIRS software used for this work is written in IDL™ (Research Systems Inc., Boulder, CO). The NPAIRS algorithm is part of the VAST software library from the VA Medical Center, Minneapolis, Minnesota, and the distributed NPAIRS module may now be run without an IDL license (see http://neurovia.umn.edu/incweb/npairs_info.html).
Preprocessing
After removal of the initial nonequilibrium scans per run, we (1) aligned each fMRI volume and resampled it into a Talairach reference space using either sinc or
Results
Figs. 3 and 4 demonstrate the basic behavior of the NPAIRS prediction and reproducibility metrics for the 11-class CVA model as a function of polynomial warp order and within-slice smoothing. They illustrate the median of 50 split-half prediction medians for an 11-class CVA model built on the first 100 principal components and detrended with 0 and 1.5 cycle cosine-basis-function cut-offs, respectively. In panels B and C, model performance is split into the underlying, uncorrelated
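The cosine-basis-function detrending referred to above can be sketched as a DCT-style high-pass regression (our own illustration, not the paper's code; we assume a cut-off of c cycles per run retains basis functions up to 2c half-cycles, and that a cut-off of 0 removes only the run mean):

```python
import numpy as np

def cosine_detrend(ts, cutoff_cycles):
    """Remove low-frequency drift from a voxel time series by
    regressing out cosine basis functions up to `cutoff_cycles`
    cycles per run (a DCT-based high-pass filter). With
    cutoff_cycles = 0, only the mean is removed."""
    n = len(ts)
    k = np.arange(1, int(2 * cutoff_cycles) + 1)  # retained half-cycles
    if k.size == 0:
        return ts - ts.mean()
    t = np.arange(n)
    # DCT-II style basis: one column per retained low frequency
    B = np.cos(np.pi * np.outer(t + 0.5, k) / n)
    B = np.column_stack([np.ones(n), B])
    # Least-squares fit of drift model, return the residual series
    beta, *_ = np.linalg.lstsq(B, ts, rcond=None)
    return ts - B @ beta
```

With a 1.5-cycle cut-off this removes the mean plus three slow cosines, which suffices to absorb linear and other slowly varying scanner drifts while leaving task-frequency signal essentially untouched.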
Discussion
Our choices for the pipeline components to manipulate in this study were based on some preliminary testing, computational expediency and standard practice in our laboratory. We acknowledge that we have not exhaustively optimized even the components tested, which would require further testing of the preprocessing components (interpolation, smoothing, detrending and warps) for 150 and 200 PCs passed to the CVA to cover the parameterization between optimal reproducibility and optimal prediction
Acknowledgments
Important contributions to this study were made by James Ashe, MD and Suraj Muley, MD of the University of Minnesota and VA Minneapolis Neurology Departments who, respectively, designed the behavioral static-force experiment and recruited and scanned the volunteer subjects. We also thank Xiaoping Hu, PhD and Essa Yacoub, PhD of the Center for Magnetic Resonance Research at the University of Minnesota for designing the fMRI/MRI acquisition strategies and running the MRI scanner, and the
References (59)
- et al. Functional magnetic resonance imaging (fMRI) “brain reading”: detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage (2003)
- et al. An empirical comparison of SPM preprocessing parameters to the analysis of fMRI data. NeuroImage (2002)
- et al. To smooth or not to smooth. NeuroImage (2000)
- et al. Simulation of the effects of global normalisation procedures in functional MRI. NeuroImage (2002)
- et al. Generalizable patterns in neuroimaging: how many principal components? NeuroImage (1999)
- et al. Consensus inference in neuroimaging. NeuroImage (2001)
- et al. A study of analysis parameters that influence the sensitivity of event related fMRI analyses. NeuroImage (2000)
- et al. Improved optimization for the robust and accurate linear registration and motion correction of brain images. NeuroImage (2002)
- et al. Multivariate model specification for fMRI data. NeuroImage (2002)
- et al. The quantitative evaluation of functional neuroimaging experiments: mutual information learning curves. NeuroImage (2002)
- Evaluating preprocessing choices in single-subject BOLD-fMRI studies using data-driven performance metrics. NeuroImage
- Plurality and resemblance in fMRI data analysis. NeuroImage
- An evaluation of methods for detecting brain activations from PET or fMRI images. Artif. Intell. Med.
- Detection of consistently task-related activations in fMRI data with hybrid independent component analysis. NeuroImage
- Effects of changes in experimental design on PET studies of isometric force. NeuroImage
- Temporal filtering of event-related fMRI data using cross-validation. NeuroImage
- The LONI pipeline processing environment. NeuroImage
- Abnormal functional connectivity in post-traumatic stress disorder. NeuroImage
- Evaluating subject specific preprocessing choices in multi-subject BOLD fMRI data sets using data driven performance metrics. NeuroImage
- ROC analysis of statistical methods used in functional MRI: individual subjects. NeuroImage
- The quantitative evaluation of functional neuroimaging experiments: the NPAIRS data analysis framework. NeuroImage
- Comparison of detrending methods for optimal fMRI preprocessing. NeuroImage
- Anatomic and functional variability: the effects of filter size in group fMRI data analysis. NeuroImage
- Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Trans. Med. Imag.
- The case against statistical significance testing, revisited. J. Exp. Educ.
- Alterations in BOLD fMRI signal with ageing and disease: a challenge for neuroimaging. Nat. Rev., Neurosci.
- Fiswidgets: a graphical computing environment for neuroimaging analysis. Neuroinformatics
- A multivariate analysis of PET activation studies. Hum. Brain Mapp.
- Estimating test–retest reliability in functional MR imaging. I. Statistical methodology. Magn. Reson. Med.
1. Now at Biomedical Engineering Department, Georgia Institute of Technology/Emory University.