NeuroImage

Volume 23, Supplement 1, 2004, Pages S196-S207

Optimizing the fMRI data-processing pipeline using prediction and reproducibility performance metrics: I. A preliminary group analysis

https://doi.org/10.1016/j.neuroimage.2004.07.022

Abstract

We argue that published results demonstrate that new insights into human brain function may be obscured by poor and/or limited choices in the data-processing pipeline, and review the work on performance metrics for optimizing pipelines: prediction, reproducibility, and related empirical receiver operating characteristic (ROC) curve metrics. Using the NPAIRS split-half resampling framework for estimating prediction/reproducibility metrics (Strother et al., 2002), we illustrate its use by testing the relative importance of selected pipeline components (interpolation, in-plane spatial smoothing, temporal detrending, and between-subject alignment) in a group analysis of BOLD-fMRI scans from 16 subjects performing a block-design, parametric-static-force task. Large-scale brain networks were detected using a multivariate linear discriminant analysis (canonical variates analysis, CVA) that was tuned to fit the data. We found that tuning the CVA model and spatial smoothing were the most important processing parameters. Temporal detrending was essential to remove reproducible low-frequency time trends; the number of cosine basis functions for detrending was optimized by assuming that separate epochs of baseline scans have constant, equal means, and this assumption was assessed with prediction metrics. Higher-order polynomial warps had only a minor impact on the performance metrics compared to affine alignment. We found that both prediction and reproducibility metrics were required for optimizing the pipeline and that they gave somewhat different results. Moreover, the parameter settings of components in the pipeline interact, so the current practice of reporting the optimization of components tested in relative isolation is unlikely to lead to fully optimized processing pipelines.

Introduction

Neuroimaging researchers typically focus on extracting “neuroscientifically relevant” results from their data sets. Almost always this is done without attempting to optimize and/or understand the relative influence of the pipeline processing choices that were made in analyzing the data. Moreover, the generation of a “plausible result” that can be linked to the neuroscientific literature is often taken as justification of the pipeline choices made, introducing a systematic bias in the field towards prevailing neuroscientific expectations and away from unexpected, new results (Skudlarski et al., 1999, Strother et al., 1995a, Strother et al., 1995b, Strother et al., 2002). In addition, there is accumulating evidence in the literature that applying a new processing pipeline to a raw data set may yield significantly modified spatial activation patterns as a result of changing/optimizing preprocessing techniques (Della-Maggiore et al., 2002, Friston et al., 2000, LaConte et al., 2003a, Shaw et al., 2003a, Tanabe et al., 2002) and/or the data analysis approach (Beckmann and Smith, 2004, Friston et al., 1996, Kherif et al., 2002, Liou et al., 2003, Muley et al., 2001, Nandy and Cordes, 2003, Shaw et al., 2002, Strother et al., 1995a, Tegeler et al., 1999). These real-data results are supported by several simulation studies, which indicated that significant differences in signal detection performance should be expected for different preprocessing (Gavrilescu et al., 2002, Skudlarski et al., 1999) and data analysis (Beckmann and Smith, 2004, Lange et al., 1999, Lukic et al., 2002, Lukic et al., 2004, Tzikas et al., 2004) approaches. These published results demonstrate the likelihood that new insights into human brain function may be obscured by poor and/or limited choices in the image processing pipeline (McIntosh, private communication).

Simulations in which the true activation signal is known allow different pipeline choices to be ranked using standard signal detection metrics based on receiver operating characteristic (ROC) curves (Swets, 1988). For fMRI, however, this is problematic because the vascular, blood oxygenation level dependent (BOLD) signal and noise structure are not well understood, and it is generally unknown whether a particular set of simulation results is relevant to any given fMRI data set, a problem that is compounded if we are interested in the BOLD fMRI signal and noise structure as a function of age and/or disease (D'Esposito et al., 2003).
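
For concreteness, the following is a minimal sketch of how such ROC-based ranking works when the true activation pattern is known, as in a simulation study. The synthetic data, threshold sweep, and all variable names are illustrative assumptions, not any published implementation.

```python
# Sketch: ROC curve from a simulated statistic map with known ground truth.
import numpy as np

def empirical_roc(stat_map, truth_mask, n_thresholds=100):
    """Sweep a threshold over a statistic map and return (FPR, TPR) pairs."""
    thresholds = np.linspace(stat_map.max(), stat_map.min(), n_thresholds)
    active = truth_mask.astype(bool)
    tpr, fpr = [], []
    for t in thresholds:
        detected = stat_map >= t
        tpr.append(np.mean(detected[active]))    # true-positive rate
        fpr.append(np.mean(detected[~active]))   # false-positive rate
    return np.array(fpr), np.array(tpr)

# Simulated example: 10,000 voxels, 5% truly active, signal added to noise.
rng = np.random.default_rng(0)
truth = rng.random(10_000) < 0.05
z_map = rng.normal(size=10_000) + 2.0 * truth
fpr, tpr = empirical_roc(z_map, truth)
auc = np.trapz(tpr, fpr)                         # area under the ROC curve
print(f"AUC = {auc:.3f}")
```

Pipelines (or analysis models) would then be ranked by their areas under such curves, an option unavailable for real fMRI data where the truth mask is unknown.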

In an attempt to avoid the need for simulations, researchers have proposed data-driven techniques that estimate performance metrics from the available data. Le and Hu (1997) suggested estimating the true distribution from highly averaged results. However, the large number of repeat scanning runs required makes this approach impractical, even if it is not biased by the requirement that the mean tend towards the true signal.

Other researchers have focused on the reproducibility, or reliability, of activation patterns, based on the recognition that smaller p values do not imply a stronger likelihood of obtaining the same result in another replication of the same experiment, and on the historical importance of replication as a fundamental criterion for a result to be considered scientific (Carver, 1993, Genovese et al., 1997, Kiehl and Liddle, 2003, Liou et al., 2003, Maitra et al., 2002, Moeller et al., 1999, Strother et al., 1997, Strother et al., 1998, Tegeler et al., 1999). This is one reason why minimizing p values is a poor choice of quantitative performance measure for pipeline optimization, although it has been used repeatedly in the literature (e.g., Hopfinger et al., 2000, Tanabe et al., 2002).

Provided at least three repeat runs are available, an empirical-ROC curve may be estimated from the data (Genovese et al., 1997), and by incorporating local spatial correlation into the same framework a minimum of two runs is sufficient (Maitra et al., 2002). An interesting application of this empirical-ROC generation framework together with a technique for selecting the optimal operating point on the resulting ROC curve has been recently published by Liou et al. (2003). An alternative procedure for generating empirical ROC curves that requires a “control state” run to estimate false-positive rates together with a standard experimental run has been proposed by Nandy and Cordes (2003).

Strother et al. (1997, 1998) proposed an alternative reproducibility metric based on a principal components analysis (PCA) of two or more independently replicated statistical parametric images (SPIs). This approach was further developed by Kjems et al. (2002), LaConte et al. (2003a), Shaw et al. (2002, 2003a), Strother et al. (2002), and Tegeler et al. (1999). A correlation coefficient summarizes the reproducibility of two independent SPIs as reflected in their scatter plot. This reproducibility correlation coefficient also directly measures the overall signal-to-noise level of the single, reproducible, Z-scored activation SPI that is extracted from the principal PCA axis of the scatter plot (Strother et al., 2002). However, this reproducibility metric is a biased measure because it inherits any data-analysis model biases that exist when measuring SPIs. It seems likely that the empirical-ROC metrics share this bias and that, like Strother's reproducibility metric, they should not be considered measures of true signal detection performance.
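
A minimal sketch of this scatter-plot reproducibility metric follows, under the simplifying assumption (consistent with the description in Strother et al., 2002) that for two standardized split-half SPIs the signal lies along the major principal axis of the scatter plot and the noise along the minor axis. The toy data and names are illustrative, not the NPAIRS implementation.

```python
# Sketch: reproducibility correlation and Z-scored consensus SPI from two
# independently replicated statistical parametric images (SPIs).
import numpy as np

def reproducibility(spi_a, spi_b):
    """Scatter-plot correlation of two SPIs and a Z-scored consensus SPI."""
    a = (spi_a - spi_a.mean()) / spi_a.std()
    b = (spi_b - spi_b.mean()) / spi_b.std()
    r = np.corrcoef(a, b)[0, 1]         # reproducibility correlation
    signal = (a + b) / np.sqrt(2.0)     # projection on major PCA axis
    noise = (a - b) / np.sqrt(2.0)      # projection on minor PCA axis
    z_spi = signal / noise.std()        # consensus SPI in Z-score units
    return r, z_spi

rng = np.random.default_rng(1)
common = rng.normal(size=5000)          # shared "activation" pattern
spi1 = common + rng.normal(size=5000)   # two noisy, independent replications
spi2 = common + rng.normal(size=5000)
r, z_spi = reproducibility(spi1, spi2)
print(f"reproducibility r = {r:.2f}")
```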

Simultaneously, Hansen and Strother, guided by the field of predictive learning in statistics (Hastie et al., 2001, Larsen and Hansen, 1997, Mjolsness and DeCoste, 2001), introduced the idea of using potentially unbiased cross-validation-based prediction metrics to measure data-analytic performance in functional neuroimaging (Hansen et al., 1999, Kjems et al., 2002, Kustra and Strother, 2001, Lautrup et al., 1995, Morch et al., 1997). Similar prediction metrics have recently been used by others (McKeown, 2000, Ngan et al., 2000). In addition, prediction metrics have been used to gain new insight into the debate over the spatially modular versus spatially distributed nature of human brain processing (Cox and Savoy, 2003, Haxby et al., 2001). We expect both prediction and reproducibility metrics to play an increasingly important role in the future optimization and interpretation of fMRI studies.
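
The following sketch illustrates the general idea of a cross-validation-based prediction metric: a model trained on one half of the scans must predict the experimental labels of the unseen half. The synthetic data, the linear discriminant classifier, and the accuracy score are illustrative stand-ins, not the exact models or posterior-probability metrics of the cited work.

```python
# Sketch: held-out prediction accuracy as a data-analytic performance metric.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_scans, n_voxels = 120, 500
labels = np.repeat([0, 1], n_scans // 2)        # e.g., OFF vs. ON scans
X = rng.normal(size=(n_scans, n_voxels))
X[labels == 1, :20] += 0.5                      # weak "activation" signal

# Mean accuracy on held-out scans is the prediction performance metric.
acc = cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=2).mean()
print(f"2-fold prediction accuracy = {acc:.2f}")
```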

With this in mind, Strother et al. (2002) proposed the unique approach of simultaneously measuring and combining data-driven prediction and reproducibility metrics for pipeline and data analysis optimization using split-half resampling (a combination of two-fold cross-validation and delete-d jackknife resampling) to produce a ROC-like plot. They developed the NPAIRS (Nonparametric Prediction, Activation, Influence and Reproducibility reSampling) software package to implement and test this idea (Kjems et al., 2002, LaConte et al., 2003a, Shaw et al., 2003a; Web distribution and documentation at http://neurovia.umn.edu/incweb/npairs_info.html). In preliminary comparisons using simulations, Shaw et al. (2003b) have shown that prediction-reproducibility plots seem to perform at least as well as standard ROC curves.
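
To make the split-half idea concrete, here is a self-contained toy sketch in the spirit of NPAIRS: each random split of the subjects into two half-groups yields one (prediction, reproducibility) pair, and the cloud of such pairs forms the ROC-like plot described above. The simple mean-difference "SPI", the discriminant classifier, and all data are illustrative assumptions, not the NPAIRS code.

```python
# Sketch: split-half resampling yielding (prediction, reproducibility) pairs.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n_subj, n_scans, n_vox = 16, 40, 300
labels = np.repeat([0, 1], n_scans // 2)        # per-subject OFF/ON labels
data = []                                       # one (scans x voxels) array per subject
for _ in range(n_subj):
    X = rng.normal(size=(n_scans, n_vox))
    X[labels == 1, :15] += 0.4                  # weak shared activation
    data.append(X)

def spi(half):
    """Toy SPI for a half-group: mean ON-minus-OFF difference image."""
    X = np.vstack([data[i] for i in half])
    y = np.tile(labels, len(half))
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

pairs = []
for _ in range(50):                             # 50 split-half resamples
    perm = rng.permutation(n_subj)
    half1, half2 = perm[:n_subj // 2], perm[n_subj // 2:]
    # Reproducibility: correlation of the two independent half-group SPIs.
    r = np.corrcoef(spi(half1), spi(half2))[0, 1]
    # Prediction: train on one half, score held-out scan labels, both ways.
    X1 = np.vstack([data[i] for i in half1])
    X2 = np.vstack([data[i] for i in half2])
    y = np.tile(labels, n_subj // 2)
    p = (LinearDiscriminantAnalysis().fit(X1, y).score(X2, y)
         + LinearDiscriminantAnalysis().fit(X2, y).score(X1, y)) / 2
    pairs.append((p, r))

p_med, r_med = np.median(pairs, axis=0)
print(f"median prediction = {p_med:.2f}, median reproducibility = {r_med:.2f}")
```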

This paper is concerned with the combined use of prediction and reproducibility metrics to test the relative importance of different processing pipeline choices in the detection of large-scale brain networks from the combined BOLD-fMRI scans of 16 subjects performing a block-design, parametric-static-force task. We have investigated the impact and interaction of interpolation, within-plane spatial smoothing, temporal detrending, between-subject alignment using affine and nonlinear polynomial registration (i.e., warps), and “tuning” the data analysis approach. The large-scale brain networks were detected for separate, uncorrelated OFF–ON and parametric force responses using canonical variates analysis (CVA), a flexible multivariate form of linear discriminant analysis that may be tuned to fit the data. The prediction and reproducibility metrics were measured using split-half resampling of the 16-subject group within the NPAIRS framework. We found that the metrics could easily detect the smoothing difference between sinc and trilinear-based interpolation, and that even a small amount of smoothing together with tuning the CVA model were by far the most important processing parameters. Detrending was found to be essential to remove low-frequency time trends, and to allow a reliable parametric force response to emerge despite pseudo-randomization of the force levels across two runs per session. In contrast, using an affine registration compared with 3rd to 7th order polynomial warps had only a minor impact on the performance metrics. However, our results make it clear that both prediction and reproducibility metrics are required for optimization as they individually select different optimal pipeline parameter settings that are associated with somewhat different activation patterns. In addition, the parameter settings of components in the pipeline interact so that the current practice of reporting the optimization of components tested in relative isolation is unlikely to lead to optimized processing pipelines.
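
Since "tuning" the CVA amounts to choosing how many principal components are passed to the discriminant model, a hedged sketch of that tuning loop is given below. CVA on a PCA subspace is approximated here by PCA followed by linear discriminant analysis; the data and the grid of candidate component counts are illustrative assumptions.

```python
# Sketch: tuning a CVA-style model by varying the retained PCA dimension.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
n_scans, n_vox = 200, 1000
labels = rng.integers(0, 4, size=n_scans)       # e.g., four force levels
X = rng.normal(size=(n_scans, n_vox))
X[:, :30] += labels[:, None] * 0.3              # parametric signal

for q in (10, 50, 100):                         # candidate PCA cut-offs
    model = make_pipeline(PCA(n_components=q), LinearDiscriminantAnalysis())
    acc = cross_val_score(model, X, labels, cv=2).mean()
    print(f"{q:3d} PCs: held-out accuracy = {acc:.2f}")
```

In the NPAIRS setting, both the prediction and reproducibility values would be examined across such a grid, since, as reported here, the two metrics can favor different settings.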

Section snippets

Data acquisition

For a detailed description of data acquisition protocols, see LaConte et al. (2003a, 2003b).

Software

The NPAIRS software used for this work is written in IDL™ (Research Systems Inc., Boulder, CO). The NPAIRS algorithm is part of the VAST software library from the VA Medical Center, Minneapolis, Minnesota, and the distributed NPAIRS module may now be run without an IDL license (see http://neurovia.umn.edu/incweb/npairs_info.html).

Preprocessing

After removal of the initial nonequilibrium scans per run, we (1) aligned each fMRI volume and resampled it into a Talairach reference space using either sinc or trilinear interpolation…
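
As a concrete illustration of the cosine-basis temporal detrending referred to in the abstract, the following sketch regresses a discrete-cosine basis, up to a chosen cycles-per-run cut-off, out of each voxel time series. The 1.5-cycle cut-off value and the toy drifting time series are illustrative assumptions.

```python
# Sketch: low-frequency detrending with discrete cosine basis functions.
import numpy as np

def cosine_detrend(ts, cycles_cutoff):
    """Regress cosine basis functions up to `cycles_cutoff` cycles/run out of
    a (time x voxels) array; the k=0 constant term removes the mean."""
    n = ts.shape[0]
    t = np.arange(n)
    k = np.arange(int(2 * cycles_cutoff) + 1)         # 0, 0.5, 1.0, ... cycles
    basis = np.cos(np.pi * np.outer(t + 0.5, k) / n)  # DCT-II style regressors
    beta, *_ = np.linalg.lstsq(basis, ts, rcond=None)
    return ts - basis @ beta

rng = np.random.default_rng(5)
n = 160
drift = np.linspace(0, 3, n)[:, None]                 # slow scanner drift
ts = rng.normal(size=(n, 10)) + drift
clean = cosine_detrend(ts, cycles_cutoff=1.5)         # drift regressed out
```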

Results

Figs. 3 and 4 demonstrate the basic behavior of the NPAIRS prediction and reproducibility metrics for the 11-class CVA model as a function of polynomial warp order and within-slice smoothing. They illustrate the median of 50 split-half prediction medians for an 11-class CVA model built on the first 100 principal components and detrended with 0 and 1.5 cycle cosine-basis-function cut-offs, respectively. In panels B and C, model performance is split into the underlying, uncorrelated OFF–ON and parametric force responses…

Discussion

Our choices for the pipeline components to manipulate in this study were based on some preliminary testing, computational expediency, and standard practice in our laboratory. We acknowledge that we have not exhaustively optimized even the components tested, which would require further testing of the preprocessing components (interpolation, smoothing, detrending, and warps) for 150 and 200 PCs passed to the CVA to cover the parameterization between optimal reproducibility and optimal prediction…

Acknowledgments

Important contributions to this study were made by James Ashe, MD and Suraj Muley, MD of the University of Minnesota and VA Minneapolis Neurology Departments who, respectively, designed the behavioral static-force experiment and recruited and scanned the volunteer subjects. We also thank Xiaoping Hu, PhD and Essa Yacoub, PhD of the Center for Magnetic Resonance Research at the University of Minnesota for designing the fMRI/MRI acquisition strategies and running the MRI scanner, and the…

References (59)

  • S. LaConte et al. Evaluating preprocessing choices in single-subject BOLD-fMRI studies using data-driven performance metrics. NeuroImage (2003)
  • N. Lange et al. Plurality and resemblance in fMRI data analysis. NeuroImage (1999)
  • A.S. Lukic et al. An evaluation of methods for detecting brain activations from PET or fMRI images. Artif. Intell. Med. (2002)
  • M.J. McKeown. Detection of consistently task-related activations in fMRI data with hybrid independent component analysis. NeuroImage (2000)
  • S.A. Muley et al. Effects of changes in experimental design on PET studies of isometric force. NeuroImage (2001)
  • S.-C. Ngan et al. Temporal filtering of event-related fMRI data using cross-validation. NeuroImage (2000)
  • D.E. Rex et al. The LONI pipeline processing environment. NeuroImage (2003)
  • M.E. Shaw et al. Abnormal functional connectivity in post-traumatic stress disorder. NeuroImage (2002)
  • M.E. Shaw et al. Evaluating subject specific preprocessing choices in multi-subject BOLD fMRI data sets using data driven performance metrics. NeuroImage (2003)
  • P. Skudlarski et al. ROC analysis of statistical methods used in functional MRI: individual subjects. NeuroImage (1999)
  • S.C. Strother et al. The quantitative evaluation of functional neuroimaging experiments: the NPAIRS data analysis framework. NeuroImage (2002)
  • J. Tanabe et al. Comparison of detrending methods for optimal fMRI preprocessing. NeuroImage (2002)
  • T. White et al. Anatomic and functional variability: the effects of filter size in group fMRI data analysis. NeuroImage (2001)
  • C.F. Beckmann et al. Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Trans. Med. Imag. (2004)
  • R.P. Carver. The case against statistical significance testing, revisited. J. Exp. Educ. (1993)
  • M. D'Esposito et al. Alterations in BOLD fMRI signal with ageing and disease: a challenge for neuroimaging. Nat. Rev., Neurosci. (2003)
  • K. Fissell et al. Fiswidgets: a graphical computing environment for neuroimaging analysis. Neuroinformatics (2003)
  • K.J. Friston et al. A multivariate analysis of PET activation studies. Hum. Brain Mapp. (1996)
  • C.R. Genovese et al. Estimating test–retest reliability in functional MR imaging. I. Statistical methodology. Magn. Reson. Med. (1997)
¹ Now at Biomedical Engineering Department, Georgia Institute of Technology/Emory University.
