NeuroImage

Volume 47, Issue 1, 1 August 2009, Pages 88–97

Assessing certainty of activation or inactivation in test–retest fMRI studies

https://doi.org/10.1016/j.neuroimage.2009.03.073

Abstract

Functional Magnetic Resonance Imaging (fMRI) is widely used to study activation in the human brain. In most cases, the acquired data are used to construct activation maps corresponding to a given paradigm. Results can be highly variable, so quantifying the certainty of identified activation and inactivation across studies is important. This paper provides a model-based approach to certainty estimation from data acquired over several replicates of the same experimental paradigm. Specifically, the p-values derived from the statistical analysis of the data are explicitly modeled as a mixture of their underlying distributions; thus, unlike the methodology currently in use, no subjective thresholding is required in the estimation process. The parameters governing the mixture model are readily obtained by the principle of maximum likelihood. Further, the estimates can also be used to optimally identify voxel-specific activation regions along with their corresponding certainty measures. The methodology is applied to a study involving a motor paradigm performed by a single subject several times over a period of two months. Simulation experiments used to calibrate the performance of the method yield promising results. The methodology is also seen to be robust in determining areas of activation and their corresponding certainties.

Introduction

Functional Magnetic Resonance Imaging (fMRI) has become an extremely popular noninvasive imaging modality for understanding human cognitive and motor functions. The main goal of fMRI is to identify regions of the brain that are activated by a given stimulus or while performing some task, but high variability among replicated studies often leads to inconsistent results, causing concern among researchers (see, for instance, Buchsbaum et al., 2005; Derrfuss et al., 2005; Ridderinkhof et al., 2004; Uttal, 2001). A number of factors affect the identification of activated voxels. A typical fMRI paradigm consists of the application of a stimulus or performance of a cognitive or motor task over time. Any neural stimulus passes through the so-called hemodynamic filter (Maitra et al., 2002), resulting in a delay of several seconds before the blood-oxygen-level-dependent (BOLD) response occurs. Other factors also affect the acquired data (Genovese et al., 1997). For example, the cardiac and respiratory motion of a subject may result in physiological variation, giving rise to flow artifacts that may need to be monitored or digitally filtered (Biswal et al., 1996). Subjects also often exhibit voluntary, involuntary and/or stimulus-correlated motion during scans (Hajnal et al., 1994). Another factor is scanner variability, which is essentially controlled through effective quality control programs. Most signal differences between activated and control or resting states are small, typically on the order of 1–5% (Chen and Small, 2007), and sub-pixel motions can induce large apparent signal changes that result in the detection of false positives. Therefore, fMRI data are subjected to image registration algorithms which align the sequence of images to sub-pixel accuracy (Wood et al., 1998). Such pre-processing improves the quality of the acquired data, but identified regions of activation still vary from one replication to another. This variability needs to be quantified in order to determine regions of activation with precision and accuracy (McGonigle et al., 2000; Noll et al., 1997; Wei et al., 2004).

Repeatability of results across multiple studies is one way of assessing variability, and measures that calibrate repeatability are called reliability measures. Many authors working in the area of fMRI image variability have used the term reliability to describe the extent to which activation was consistently identified in multiple fMRI images. However, there is another, perhaps more useful, quantity of interest to practitioners: quantification of the true status of voxels identified as activated or inactivated. Measures that attempt to quantify the probability of the true status of a voxel given its identified state are more correctly termed measures of confidence or certainty, even though these too were introduced, perhaps confusingly, as reliability measures by earlier authors, including me. In this paper, I move towards adopting the nomenclature of certainty in these contexts in order to better distinguish it from simple reliability. Before proceeding further, I specify that I use the term "replication" to denote the repetition of the task or experimental condition to study variability. These replications are necessarily independent and, in the context of single-subject studies, occur in different scanning sessions, reasonably separated in time.

The issue of quantifying variability (whether reliability or certainty) of activation has interested researchers in two different frameworks. The first involves the analysis of grouped fMRI data, which arise when multiple subjects are studied under multiple stimulus or task-performance levels, e.g., fMRI data acquired while subjecting multiple volunteers to noxious painful stimuli at several graded levels. I will refer to these stimulus or task levels as experimental conditions. The second scenario, which is the focus of this paper, is the test–retest case, where replicated fMRI data are acquired on the same subject under the same experimental condition.

For grouped fMRI data, the goal is to determine where the effect of the stimulus is larger than subject-to-subject variation. Reliability of activation in response to stimulus has been quantified in terms of the intra-class correlation (ICC), which is calculated using voxels identified as activated in each subject after thresholding separately for each combination of experimental condition and subject (Aron et al., 2004; Fernández et al., 2003; Friedman et al., 2008; Manoach et al., 2001; Miezin et al., 2000; Raemaekers et al., 2007; Specht et al., 2003). The ICC (Shrout and Fleiss, 1979; Koch, 1982; McGraw and Wong, 1996) provides a measure of correlation or conformity between regions identified as activated in multiple subjects under two or more experimental replications and/or conditions. It is thus inapplicable to the test–retest framework on a single subject considered in this paper.
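
For reference, here is a minimal sketch of the one-way random-effects ICC of Shrout and Fleiss (1979); the data layout (an n-targets-by-k-replications matrix of voxel effect estimates) and the function name are illustrative assumptions of mine, not taken from the paper or the cited works.

```python
# Hedged sketch of the one-way random-effects ICC(1) of Shrout and
# Fleiss (1979): (MSB - MSW) / (MSB + (k - 1) * MSW).
import numpy as np

def icc_oneway(y: np.ndarray) -> float:
    """y: n targets (e.g., voxels) by k raters/replications."""
    n, k = y.shape
    grand = y.mean()
    # Between-target mean square (df = n - 1).
    ms_between = k * ((y.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    # Within-target mean square (df = n * (k - 1)).
    ms_within = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```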

For the test–retest setting, Rombouts et al. (1998) and Machielsen et al. (2000) proposed a global reliability measure: the percent overlap in voxels identified as activated between any two experimental replications. For any two replications (say, j and m), this measure is calculated as Rjm = 2Vjm / (Vj + Vm), where Vjm is the number of three-dimensional image voxels identified as activated in both the jth and mth replications, and Vj and Vm are the numbers of voxels identified as activated in the jth and mth replicated experiments, respectively. Rjm takes a value between 0 and 1, representing no to perfect overlap in identified activation at the two ends of the scale.
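
For concreteness, a minimal sketch of Rjm follows; the function name and the assumption that each activation map is a boolean NumPy array (True marking voxels identified as activated) are mine, not from Rombouts et al. (1998).

```python
# Sketch of the percent overlap measure R_jm between two replications,
# assuming boolean activation maps of identical shape.
import numpy as np

def percent_overlap(act_j: np.ndarray, act_m: np.ndarray) -> float:
    """R_jm = 2 * V_jm / (V_j + V_m), ranging over [0, 1]."""
    v_jm = np.logical_and(act_j, act_m).sum()  # voxels activated in both
    v_j, v_m = act_j.sum(), act_m.sum()        # voxels activated in each
    if v_j + v_m == 0:                         # no activation anywhere
        return float("nan")
    return 2.0 * v_jm / (v_j + v_m)
```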

The percent overlap measure Rjm provides a measurement of the agreement in activation between any two replications, but it is sensitive to the method used to identify activation, unusable for voxel-level analysis, and awkward for more than two replicates. To illustrate its sensitivity to the identification method, consider a procedure that liberally identifies activation (e.g., a naive testing approach with no correction for multiple testing): the denominator Vj + Vm would then be large, so that small disagreements in the voxels identified as activated would have very little impact on Rjm. In contrast, small differences in Vjm would severely affect Rjm when Vj + Vm is small, as expected under a conservative method (e.g., testing with the Bonferroni correction for multiple testing). Another shortcoming is that Rjm is a global measure of agreement between replicated experiments, giving no sense of voxel-level reliability of activation. One could compute separate Rjm values for specific brain regions, but it would never be a high-resolution measure of activation reliability. A third concern is that Rjm is a reliability measure based only on the pair (j, m) of experimental replicates. When there are M replicates, or M studies combined in a composite meta-analysis, there are M(M − 1)/2 overlap measures Rjm, and there is no obvious way to combine them into a single measure of activation reliability. Thus, there is a need for a measure that quantifies the reliability or certainty of true activation at the voxel level across an arbitrary number of replicates. Ideally, such an assessment would be independent of the experimental condition and of the method used to identify activation.
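
To illustrate the combination problem just described, the hypothetical snippet below enumerates all M(M − 1)/2 pairwise overlap values from a list of M activation maps, using the percent_overlap sketch above; how to pool the resulting dictionary into a single reliability measure is exactly the open question.

```python
# Hypothetical illustration: all C(M, 2) pairwise overlaps R_jm for a
# list of M boolean activation maps; no canonical pooling rule exists.
from itertools import combinations

def all_pairwise_overlaps(maps):
    return {(j, m): percent_overlap(maps[j], maps[m])
            for j, m in combinations(range(len(maps)), 2)}
```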

Some more formal statistical approaches to assessing reliability in the test–retest fMRI framework have also been proposed. Genovese et al. (1997) and Noll et al. (1997) specified probabilities that voxels were correctly or incorrectly identified as activated at particular thresholds of the test statistic used to determine significance of activation for a given experimental paradigm. Their approach modeled the total frequency (out of M replications) of a voxel being identified as activated at given thresholds in terms of a mixture of binomial distributions. To combine data, they assumed independence over the thresholdings. All parameters, such as the mixing proportion of truly active voxels (denoted as $\lambda$ in their work) or the probabilities of voxels being correctly ($\pi_A$) or incorrectly ($\pi_I$) identified as active, were assumed to be spatially independent and were estimated using maximum likelihood (ML) methods. Maitra et al. (2002) extended their proposals by incorporating a more accurate model of mixtures of conditional binomial distributions, and by generalizing $\lambda$ to be voxel-specific. Specifically, they let $\lambda_i$ be the probability that the ith voxel is truly active. Letting L be the number of activation threshold levels (assumed without loss of generality to be in increasing order), define $X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,L})$, where $x_{i,l}$ is the number of replications for which the ith voxel is identified as activated at the lth threshold. Let $\eta_A^{l,l-1}$ (and $\eta_I^{l,l-1}$) be the (global) probability of a truly active (correspondingly, truly inactive) voxel being identified as activated at the lth threshold, given that it has been so identified at the (l − 1)th threshold level. Also, let $\pi_A^l$ (or $\pi_I^l$) be the probability that a truly active (correspondingly, inactive) voxel is identified as activated at the lth threshold. Then the likelihood function for the ith voxel is given by

$$\lambda_i \prod_{l=1}^{L} \binom{x_{i,l-1}}{x_{i,l}} \left(\eta_A^{l,l-1}\right)^{x_{i,l}} \left(1 - \eta_A^{l,l-1}\right)^{x_{i,l-1} - x_{i,l}} + (1 - \lambda_i) \prod_{l=1}^{L} \binom{x_{i,l-1}}{x_{i,l}} \left(\eta_I^{l,l-1}\right)^{x_{i,l}} \left(1 - \eta_I^{l,l-1}\right)^{x_{i,l-1} - x_{i,l}} \tag{1}$$

where $x_{i,0} \equiv M$, $\eta_A^{1,0} = \pi_A^1$ and $\eta_I^{1,0} = \pi_I^1$. A further generalization incorporated spatial context by regularizing $\lambda$ through a Markov Random Field component in the penalty term of the estimation process; estimates were obtained by maximizing the penalized likelihood. Maitra et al. (2002) introduced a novel approach to quantifying certainty about the true status of voxels identified as activated/inactivated by defining a measure of reliability (the probability that a voxel identified as activated is truly active) and anti-reliability (the probability that a voxel incorrectly identified as inactivated is actually active). In naming these certainty measures (as mentioned earlier), they aligned them with the layman's notion of reliability: trustworthiness of identified activation. Maitra et al. (2002) also extended the approach of Genovese et al. (1997) to provide a voxel-specific method for choosing the optimal threshold for detecting activation by maximizing the "reliability efficient frontier", i.e., the probability of making a correct decision on the state of a voxel (whether activated or inactivated) at a given threshold. Their emphasis was on assessing certainty of activation and inactivation in a test–retest setting, but the method was subsequently extended to grouped functional MR imaging data by Gullapalli et al. (2005).
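
A minimal sketch of the per-voxel mixture likelihood in Eq. (1) follows, assuming eta_A and eta_I are length-L arrays holding the conditional identification probabilities and x is the length-(L+1) vector (x_{i,0} = M, x_{i,1}, ..., x_{i,L}) of nested activation counts; the function and variable names are mine, not from Maitra et al. (2002).

```python
# Sketch of Eq. (1): mixture of conditional binomials for one voxel.
from scipy.stats import binom

def voxel_likelihood(lam_i, x, eta_A, eta_I):
    """lam_i: P(voxel truly active); x: counts with x[0] = M."""
    active = inactive = 1.0
    for l in range(1, len(x)):
        # x[l] identifications out of x[l-1] "trials" at threshold l,
        # conditional on identification at threshold l - 1.
        active *= binom.pmf(x[l], x[l - 1], eta_A[l - 1])
        inactive *= binom.pmf(x[l], x[l - 1], eta_I[l - 1])
    return lam_i * active + (1.0 - lam_i) * inactive
```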

The methodology of Genovese et al. (1997) and Maitra et al. (2002) is implemented by obtaining a test statistic and thresholding it (or, more commonly, its p-value) at different levels. This is integral to obtaining the $x_{i,l}$s used in Eq. (1). However, there is no clear guideline for choosing the thresholds, which is left to the researcher. The choice of the number L and the values of these thresholds is subjective and can greatly impact the reliability and certainty estimates. Too few threshold levels can result in severely biased estimates, while too many may be computationally burdensome and yield highly variable estimates. An additional issue, also without a satisfactory answer, is the subjective choice of spacing between the thresholds. In this paper, we reformulate the problem to eliminate the requirement of threshold choice altogether. Specifically, we model the distribution of the voxel-wise p-value of the test statistic as a mixture of two distributions. The first component of the mixture is the standard uniform density, corresponding to the distribution of the p-value under the null hypothesis of no activation. The second is the distribution of the p-value when there is activation. While a mixture of beta distributions is sometimes used to approximate this latter distribution (Pounds and Morris, 2003; Allison et al., 2002), we note that it is possible to derive exact distributions in many standard scenarios, such as t-tests. Also, the mixing proportion of the component representing the distribution of the p-value under activation is the same as the $\lambda$ of Genovese et al. (1997) or Maitra et al. (2002). Estimation is done using ML. Once again, optimal cutoffs can be estimated by maximizing the reliability efficient frontier. To better reflect the fact that we are quantifying certainty in the true status of voxels identified as activated and inactivated, we rename the erstwhile reliability measure the true activation certainty, and recast the awkwardly-termed anti-reliability measure in terms of its complement from unity, calling the latter the true inactivation certainty. Estimates for both measures are also provided. The methodology is applied to an experiment involving a motor paradigm that was replicated on the same subject twelve times over the course of two months. The performance of the suggested method is validated via simulation experiments over a range of replication sizes. Further, we randomly subdivide the dataset into two subsets of six replications each, and study the robustness of the identified activation and the corresponding true activation and inactivation certainties. We conclude with some discussion.
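
As a sketch of the idea (not the paper's exact formulation, which derives the alternative p-value density exactly for scenarios such as t-tests), the snippet below fits a two-component mixture with a Beta(a, 1) alternative, in the spirit of Pounds and Morris (2003), by direct maximum likelihood; the variable names are mine, and lam plays the role of the activation mixing proportion $\lambda$.

```python
# Sketch: ML fit of f(p) = (1 - lam) * 1 + lam * a * p**(a - 1),
# i.e., a uniform null component plus a Beta(a, 1) alternative with
# a < 1 (density concentrating near p = 0 under activation).
import numpy as np
from scipy.optimize import minimize

def fit_pvalue_mixture(p):
    """p: array of voxel-wise p-values, assumed strictly in (0, 1)."""
    def neg_loglik(theta):
        lam, a = theta
        dens = (1.0 - lam) + lam * a * np.power(p, a - 1.0)
        return -np.sum(np.log(dens))
    res = minimize(neg_loglik, x0=[0.1, 0.5],
                   bounds=[(1e-6, 1 - 1e-6), (1e-6, 1.0)])
    lam_hat, a_hat = res.x
    return lam_hat, a_hat
```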

Section snippets

Theory

The p-value of a test statistic To is the probability, under the null distribution, of obtaining a value more extreme (in the direction of the alternative) than To. For a one-sided t-test of the null hypothesis H0: β = 0 against the alternative Ha: β > 0, with β the regression coefficient of a general linear model fit to the time series at a voxel, this is given by ℙr(tν ≥ To), where ℙr abbreviates probability and tν denotes a t-distributed random variable with ν degrees of freedom and cumulative
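
In code, this p-value is simply the survival function of the t distribution; a minimal sketch (variable names mine):

```python
# Pr(t_nu >= T_o): upper-tail p-value for the one-sided t-test above,
# with nu degrees of freedom.
from scipy.stats import t

def one_sided_p(T_o: float, nu: float) -> float:
    return t.sf(T_o, df=nu)  # survival function = 1 - CDF
```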

Imaging

All MR images were acquired on a GE 1.5 Tesla Signa system equipped with echo-planar gradients and using v5.8 software. Structural T1-weighted images were obtained using a standard spin-echo sequence with TE/TR of 10/500 ms, and slice-positioning following the recommendations of Noll et al. (1997) to minimize intersession differences. For the fMRI sessions, twenty-four 6 mm-thick slices parallel to the AC–PC line and with no gap between them were acquired using a single-shot spiral sequence

Variability in activation

Fig. 2 shows the observed p-values of activation for slices 7 through 22 in the first and last replications of the experiment. All displays reported in this paper are in radiologic view and overlaid on the corresponding T1-weighted anatomical images. Note the large amount of variability in the observed p-values between the two replications. In both cases, the region of the left primary motor cortex appears to be significantly activated in response to the task of right

Discussion

Genovese et al. (1997) and Maitra et al. (2002) provided novel approaches to estimating the test–retest certainty of a voxel using ML and its penalized version, the latter enforcing spatial dependence between the estimated parameters. In both cases, the acquired fMRI data must first be processed by thresholding before the models can be applied. The number of threshold levels and the thresholding values are subjective and depend entirely on the investigator. This paper removes the need for this step

Acknowledgments

I thank Rao P. Gullapalli of the University of Maryland School of Medicine for providing me with the data used in this study and for many other helpful discussions related to this paper. I am also very grateful to two reviewers whose detailed suggestions and insightful comments greatly improved the quality of this paper. This material is based, in part, upon work supported by the National Science Foundation (NSF) under its CAREER Grant No. DMS-0437555 and by the National Institutes of Health

References (40)

  • Buchsbaum, B.R., et al. (2005). Meta-analysis of neuroimaging studies of the Wisconsin Card-sorting task and component processes. Hum. Brain Mapp.
  • Cox, R.W., et al. (1997). Software tools for analysis and visualization of fMRI data. NMR Biomed.
  • Dempster, A.P., et al. (1977). Likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B.
  • Derrfuss, J., et al. (2005). Involvement of inferior frontal junction in cognitive control: meta-analyses of switching and Stroop studies. Hum. Brain Mapp.
  • Fernández, G., et al. (2003). Intrasubject reproducibility of presurgical language lateralization and mapping using fMRI. Neurology.
  • Friedman, L., et al. (2008). Test–retest and between-site reliability in a multicenter fMRI study. Hum. Brain Mapp.
  • Genovese, C.R., et al. (1997). Estimating test–retest reliability in functional MR imaging. I. Statistical methodology. Magn. Reson. Med.
  • Gullapalli, R.P., et al. (2005). Reliability estimation of grouped functional imaging data using penalized maximum likelihood. Magn. Reson. Med.
  • Hajnal, J.V., et al. (1994). Artifacts due to stimulus-correlated motion in functional imaging of the brain. Magn. Reson. Med.
  • Hanley, J.A., et al. (1982). The meaning and use of the area under a receiver operating characteristic curve. Radiology.