In individual-differences studies, a number of variables are measured for each individual. The goal is to decompose the covariation among these variables into a lower-dimensional, theoretically relevant structure (Bollen, 1989; Skrondal & Rabe-Hesketh, 2004). Critical in this endeavor is understanding the psychometric properties of the measurements. Broadly speaking, variables used in individual-differences studies come from three classes: The first is the class of rather natural and easy-to-measure variables such as age, weight, and gender. The second is the class of instruments, such as personality and psychopathology inventories. Instruments have a fixed battery of questions and a fixed scoring algorithm. Most instruments have been benchmarked, and their reliability has been well established. The final class of variables is performance on experimental tasks. These experimental tasks are often used to assess cognitive abilities in memory, attention, and perception.

On the face of it, individual-difference researchers should be confident about using scores from experimental tasks for the following reasons: First, many of these tasks are robust in that the effects are easy to obtain in a variety of circumstances. Take, for example, the Stroop task, which may be used as a measure of inhibition. The Stroop effect is so robust, it is considered universal (Haaf & Rouder, 2017; MacLeod, 1991). Second, many of these tasks have been designed to isolate specific cognitive processes. The Stroop task, for example, requires participants to inhibit the prepotent process of reading. Third, because these tasks are laboratory based and center on experimenter-controlled manipulations, they often have a high degree of internal validity. Fourth, because these tasks are used so often, there is usually a large literature about them to guide implementation and interpretation. It is no wonder task-based measures have become popular in the study of individual differences.

Yet, there has been a wrinkle in the setup. As it turns out, task-based measures designed to measure the same latent concepts sometimes do not correlate strongly with one another. The best example of this wrinkle is perhaps in the individual-differences study of inhibition tasks (Friedman & Miyake, 2004; Ito et al., 2015; Pettigrew & Martin, 2014; Rey-Mermet, Gade, & Oberauer, 2018; Stahl et al., 2014). In these large-scale individual-differences studies, researchers correlated scores in a variety of tasks that require inhibitory control. Two examples of such tasks are the Stroop task (Stroop, 1935) and the flanker task (Eriksen & Eriksen, 1974). Correlations among inhibition measures are notoriously low; most do not exceed .2 in value. The Stroop and flanker measures, in particular, seemingly do not correlate at all (Hedge, Powell, & Sumner, 2018; Rey-Mermet, Gade, & Oberauer, 2018; Stahl et al., 2014). These low correlations are not limited to measures of inhibition. Ito et al. (2015) considered several implicit attitude tasks used for measuring implicit bias. Here again, there is surprisingly little correlation among bias measures that purportedly measure the same concept.

Why correlations are so low among these task-based measures remains a mystery (Hedge, Powell, & Sumner, 2018; Rey-Mermet et al., 2018). After all, the effects are robust, easy to obtain, and seemingly measure the same basic concepts. There are perhaps two leading explanations: One is that the problem is mostly substantive. The presence of low correlations in large-sample studies may imply that the tasks are truly measuring different sources of variation. In the inhibition case, the ubiquity of low correlations in large-sample studies has led Rey-Mermet et al. (2018) to make the substantive claim that the psychological concept of inhibition is not unified at all. The second explanation is that the lack of correlation is mostly statistical. High correlations may only be obtained if the tasks are highly reliable, and low correlations are not interpretable without high reliability. Hedge and colleagues document a range of test–retest reliabilities across tasks, with most tasks having reliabilities below .7 in value, and some tasks having reliabilities below .4 in value.

Researchers typically use classical test theory, at least implicitly, to inform methodology when studying individual differences. They make this choice when they tally up an individual’s data into a performance score. For example, in the Stroop task, we may score each person’s performance by the mean difference in response times between the incongruent and congruent trials. Because there is a single performance score per person per task, tasks are very much treated as instruments. When trials are aggregated into a single score, the instrument (or experiment) is a test and classical test theory serves as the analytic framework.
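
To make this scoring step concrete, here is a minimal R sketch of the aggregation; the data frame and column names (dat, id, condition, rt) are illustrative assumptions, not the authors' code.

```r
# Score each person by the mean response-time difference:
# incongruent minus congruent (assumed columns: id, condition, rt).
cell_means <- aggregate(rt ~ id + condition, data = dat, FUN = mean)
wide <- reshape(cell_means, idvar = "id", timevar = "condition",
                direction = "wide")
wide$stroop_score <- wide$rt.incongruent - wide$rt.congruent
head(wide)   # one aggregated score per person
```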

The unit of analysis in classical test theory is the test, or in our case, the experimental task. Theoretically important quantities, say effect size within a task and correlations across tasks, are assigned to the task itself and not to a particular sample or a particular sample size. For instance, when a test developer states that an instrument has a test–retest correlation of .80, that value holds as a population truth for all samples. We may see more noise if we re-estimate that value in smaller samples, but, on the whole, the underlying population value of an instrument does not depend on the researcher’s sample size. We call this desirable property portability.

We will show here that one problem with using classical test theory for experimental tasks is that portability is violated. Effect sizes within a task and correlations across tasks depend critically on design choices, in this case the number of trials per individual per task (Green et al., 2016). As a result, researchers who use different numbers of trials per task, which they invariably do, will be measuring different concepts.

In the next section, we show how dramatically portability is violated in practice, and how much these violations affect the interpretability of classical statistics. These failures serve as motivation for a call to use hierarchical linear models that model trial-by-trial variation as well as variation across individuals. We develop these models, and then apply them to address the low-correlation mystery between flanker and Stroop tasks.

The dramatic failure of portability

In classical test theory, an instrument, say a depression inventory, is a fixed unit. It has a fixed number of questions that are often given in a specific order. When we speak of portability, we speak of porting characteristics of this fixed unit to other settings, usually to other sample sizes in similar populations. In experimental tasks, we have several different sample sizes. One is the number of individuals in the sample; another is the number of trials each individual completes within the various conditions of the task. We wish to gain portability across both of these sample-size dimensions. For example, suppose one team runs 50 people each observing 50 trials in each condition and another team runs 100 people each observing 25 trials in each condition. If measures of reliability and effect size are to be considered a property of a task, then the expected values or underlying true values should be the same across both teams. Moreover, if performance is correlated with other variables, portability implies that the true correlation is the same as well.

However, standard psychometric measures that are portable for instruments may fail dramatically on tasks because effect size and correlation measures are critically dependent on the number of trials per condition. To show this failure in a realistic setting, we reanalyze data from a few tasks from Hedge et al. (2018). Hedge et al. compiled an impressively large data set. They asked individuals to perform a large number of trials on several tasks. For example, in their Stroop task, participants completed a total of 1440 trials each. The usual number for individual-differences studies is on the order of 10s or 100s of trials each. Moreover, Hedge et al. explicitly set up their design to measure test–retest reliability. Individuals performed 720 trials in the first session, and then, three weeks later, performed the remaining 720 trials in a second session. Here, we use the Hedge et al. data set to assess how conventional measures of effect size and reliability are dependent on the number of trials per condition.

The following notation is helpful for specifying the conventional approach to effect size and reliability. Let $\bar{Y}_{ijk}$ be the mean response time for the $i$th individual in the $j$th session (either the first or second) in the $k$th condition. For the Stroop task, the conditions are congruent ($k = 1$) or incongruent ($k = 2$). Effects in a session are denoted $d_{ij}$. These are the differences between mean incongruent and congruent response times:

$$ d_{ij}=\bar{Y}_{ij2}-\bar{Y}_{ij1}. $$

Two quantities of interest can be calculated from these individual effects: the effect size and the test–retest reliability. To calculate the effect size, we first average the session-by-individual effects across sessions to obtain individuals’ effects, denoted $\bar{d}_i$. The effect size is the ratio of the grand mean of the individual effects to the standard deviation of these effects, i.e., $\text{es} = \text{mean}(\bar{d}_i)/\text{sd}(\bar{d}_i)$. The test–retest reliability is the correlation of session-by-individual effect scores for the first session ($j = 1$) with those for the second session ($j = 2$).

Figure 1 shows a lack of portability in these two measures. We randomly selected different subsets of Hedge et al.’s data and computed effect sizes and test–retest reliabilities. We varied the number of trials per condition per individual from 20 to 400, and for each level, we collected 100 different randomly chosen subsets. For each subset, we computed effect size and reliability; the plotted points are the means across these 100 subsets. The line denoted “S” shows the case for the Stroop task, and both effect size and reliability measures increase appreciably with the number of trials per condition per individual. The line denoted “F” is from Hedge et al.’s flanker task, and the results are similar.
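
For readers who wish to run this style of demonstration on their own data, the following R sketch shows the logic of the subsampling; the data frame and column names (dat, id, session, condition, rt) are assumptions here, and the authors' actual analysis code is available in their repository.

```r
# For a given number of trials L: draw L trials per person per session per
# condition, score each person in each session, then compute the sample
# effect size and the test-retest correlation (sessions assumed coded 1 and 2).
subsample_stats <- function(dat, L) {
  cells <- split(dat, list(dat$id, dat$session, dat$condition), drop = TRUE)
  sub <- do.call(rbind, lapply(cells, function(d) d[sample(nrow(d), L), ]))
  means <- aggregate(rt ~ id + session + condition, data = sub, FUN = mean)
  wide <- reshape(means, idvar = c("id", "session"), timevar = "condition",
                  direction = "wide")
  wide <- wide[order(wide$id, wide$session), ]
  wide$d <- wide$rt.incongruent - wide$rt.congruent   # session-by-person effects
  d1 <- wide$d[wide$session == 1]
  d2 <- wide$d[wide$session == 2]
  d_bar <- (d1 + d2) / 2                               # individuals' effects
  c(effect_size = mean(d_bar) / sd(d_bar), reliability = cor(d1, d2))
}

# Average over 100 random subsets at, say, 20 trials per condition.
rowMeans(replicate(100, subsample_stats(dat, L = 20)))
```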

Fig. 1

The effect of the number of trials per individual on sample effect sizes and sample reliability for Stroop and flanker data from Hedge et al. (2018). Each point is the mean over 100 different samples at the indicated sample size

These increases in effect size and reliability with the number of replicates are well known in classical test theory as an example of increasing test length to reduce measurement error (Kuder and Richardson, 1937). As the number of trials increases, cell means are known to greater precision. Classical measures are portable when applied to instruments because the numbers of items are fixed. They are not portable when applied to task performance, where the number of trials per individual will vary across studies.

The critical and outsized role of replicates is seemingly under-appreciated in practice. Many researchers are quick to highlight the numbers of individuals in studies. You may find this number repeatedly in tables, abstracts, method sections, and sometimes in general discussions. Researchers do not, however, highlight the number of replicates per condition per individual. These numbers rarely appear in abstracts or tables; in fact, it usually takes careful hunting to find them in the method section if they are there at all. And researchers are far less likely to discuss the numbers of replicates in interpreting their results. Yet, as shown here, this number of replicates is absolutely critical in understanding classical results.

A hierarchical model

Gaining portability in the limit of infinite trials

In this paper, we strive for portability—the meaning of correlation and effect size should be independent of the number of trials in component tasks. Portability is largely solved in classical test theory by using standardized instruments, say intelligence tests of fixed length. Yet, we think it is unreasonable to expect the same of experimentalists. Experimentalists will invariably use different numbers of trials depending on many factors, including their access to subject pools, their overall goals, and the number of tasks and conditions in their experiments. We suspect recommending a standard number of trials per experiment will be unsuccessful.

We show the portability problem may be addressed by using rather conventional hierarchical statistical models (Raudenbush & Bryk, 2002; Snijders & Bosker, 1999) in analysis. In our case, where theoretical targets are sizes of effects within tasks and correlations of these effects across tasks, noise from finite trials serves as a nuisance. Researchers are most interested in measurement of individuals’ true scores, that is, the measurement of individuals’ true flanker-effect and Stroop-effect scores, and the true correlation between them. While individuals perform finite trials and have data confounded by trial noise, we may use hierarchical models to estimate what may reasonably happen in the large-trial limit. Estimates of variability in this limit are theoretically unaffected by trial noise. Estimates of effect sizes and correlations in the large-trial limit become portable estimates of effect sizes and correlations.

The application of hierarchical models to trial-by-trial performance data in experimental tasks is not new. Previous applications in individual-differences research include Schmiedek, Oberauer, Wilhelm, Süß, and Wittmann (2007) and Voelkle, Brose, Schmiedek, and Lindenberger (2014). Moreover, trial-by-trial hierarchical modeling is well known in cognitive psychology (Lee & Webb, 2005; Rouder & Lu, 2005) and linguistics (Baayen, Tweedie, & Schreuder, 2002). That said, to our knowledge, using hierarchical models to address the low-correlation mystery and to make classical test-theory concepts portable in experimental settings is novel.

Practitioners of classical test theory sometimes account for attenuation from measurement error using the Spearman correction (Spearman, 1904a). We find that our hierarchical model estimates often agree with these corrections, which is not too surprising because the hierarchical models and the Spearman correction are based on similar specifications of noise. The models, however, are preferable because they offer a deeper understanding of the statistical and theoretical dynamics at play.

In the next section, we provide a fairly formal exposition of the model. We use an equation-based rather than a graphical presentation of the model because it is from the equations that the main results flow. Using the equations, we derive an expression for the sample effects and show how the values are attenuated by trial variability. In the following section, we do the same for sample correlation between two tasks, and again show how the values are attenuated by trial noise. We show how and why the hierarchical model provides for unattenuated estimates.

Model specification

Understanding what the model is and how it works relies on some basic notation. Let $I$ be the number of people, $J$ the number of tasks, $K$ the number of conditions, and $L$ the number of trials per person per task per condition, or the number of replicates. Subscripts are used to denote individuals, tasks, conditions, and replicates. Let $Y_{ijk\ell}$ denote the $\ell$th observation, $\ell = 1,\ldots,L$, for the $i$th individual, $i = 1,\ldots,I$, in the $j$th task, $j = 1,\ldots,J$, and $k$th condition, $k = 1, 2$. Observations are usually performance variables; in our case, and for concreteness, they are response times on trials. For now, we model response times in just one task, and in this case the subscript $j$ may be omitted. Consider a trial-level base model:

$$ Y_{ik\ell} \sim \text{Normal}(\mu_{ik},\sigma^{2}), $$

where μik is the true mean response time of the ith person in the kth condition and σ2 is the true trial-by-trial variability.

It is important to differentiate between true parameter values like μik and their sample estimates.

We develop this model for the Stroop task. In this task, the key contrast is between the congruent (k = 1) and incongruent (k = 2) conditions. This contrast is embedded in a model where each individual has an average speed effect, denoted αi, and a Stroop effect, denoted θi:

$$ Y_{ik\ell} \sim \text{Normal}(\alpha_{i}+x_{k}\theta_{i},\sigma^{2}), $$

where x1 = − 1/2 for the congruent condition and x2 = 1/2 for the incongruent condition.

The goal then is to study θi, the ith person’s Stroop effect. In modern mixed models, individuals’ parameters αi and θi are considered latent traits for the ith person, and are modeled as random effects:

$$ \begin{array}{@{}rcl@{}} \alpha_{i} &\sim& \text{Normal}(\mu_{\alpha},\sigma^{2}_{\alpha}),\\ \theta_{i} &\sim& \text{Normal}(\mu_{\theta},\sigma^{2}_{\theta}), \end{array} $$

where μα and μθ are population means and σα2 and σθ2 are population variances.
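
To make the specification concrete, here is a minimal R simulation from this one-task model; the parameter values (in seconds) are illustrative choices, not estimates from the Hedge et al. data.

```r
# Simulate trial-level data from the hierarchical Stroop model.
set.seed(123)
I <- 50       # number of individuals
L <- 100      # trials per person per condition
mu_alpha <- 0.70; sigma_alpha <- 0.15   # overall speed and its spread across people
mu_theta <- 0.06; sigma_theta <- 0.02   # mean Stroop effect and its spread
sigma    <- 0.25                        # trial-by-trial noise

alpha <- rnorm(I, mu_alpha, sigma_alpha)   # individual speeds
theta <- rnorm(I, mu_theta, sigma_theta)   # individual Stroop effects
dat <- expand.grid(id = 1:I, cond = c(-0.5, 0.5), trial = 1:L)
dat$rt <- rnorm(nrow(dat), alpha[dat$id] + dat$cond * theta[dat$id], sigma)
```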

Hierarchical regularization and the portability of effect size

Even before applying the model, a bit of analysis shows the flaws in conventional analysis and the expected improvements from hierarchical modeling. The conventional analysis of the Stroop task centers on sample means aggregated over trials. The critical quantity is $d_i = \bar{Y}_{i2} - \bar{Y}_{i1}$, the observed Stroop effect for the $i$th individual. It is helpful to express the distribution of the sample effects $d_i$ in terms of the model parameters:

$$ d_{i} \sim \text{Normal}(\mu_{\theta}, 2\sigma^{2}/L + \sigma^{2}_{\theta}), $$

where $L$ is the number of trials per individual per condition. The term $2\sigma^2/L$ corresponds to variability from trial-by-trial noise, and $\sigma^2_\theta$ corresponds to variability of individuals’ effects. A classical effect size measure, $\text{mean}(d_i)/\text{sd}(d_i)$, therefore estimates $\mu_\theta/\sqrt{2\sigma^2/L + \sigma^2_\theta}$. The problem is the inclusion of $2\sigma^2/L$, the nuisance trial-by-trial variation. This included nuisance trial variation results in individual effect estimates that are too variable and in effect size measures that are too small. More importantly, portability is violated because there is an explicit dependence on $L$, the number of trials. In the model-based approach, the critical parameter is $\theta_i$, and its distribution is:

$$ \theta_{i} \sim \text{Normal}(\mu_{\theta},\sigma^{2}_{\theta}). $$

The effect size calculated from these individual effects is an estimate of $\mu_\theta/\sigma_\theta$, a portable quantity.

The model specification makes it clear that the sample effects—those from aggregating over trials—are accurate estimators of the true effect only in the large-trial limit (as $L \rightarrow \infty$). In this limit, the term $2\sigma^2/L$ becomes vanishingly small. The hierarchical model provides an estimate of the same quantity, but it estimates the correct quantity even with finite trials! By using it, results are portable to designs with varying numbers of individuals and trials per individual. These results are estimates of underlying, theoretically useful properties of tasks.
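
For completeness, the distribution of the sample effects used above follows in two steps from the trial-level model: average the $L$ trials within each condition, take the difference, and then marginalize over individuals:

$$ \bar{Y}_{ik} \mid \alpha_i,\theta_i \sim \text{Normal}\left(\alpha_i + x_k\theta_i, \sigma^2/L\right) \quad\Rightarrow\quad d_i \mid \theta_i \sim \text{Normal}\left(\theta_i, 2\sigma^2/L\right) \quad\Rightarrow\quad d_i \sim \text{Normal}\left(\mu_\theta, 2\sigma^2/L + \sigma^2_\theta\right). $$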

How do the sample effects $d_i$ compare to model-based estimates of the effects $\theta_i$? The hierarchical model is an example of a linear mixed model, and as such, may be implemented in a variety of packages including SAS PROC MIXED, lmer, lavaan, Mplus, stan, and JAGS. In the Appendix, we complete the specification of the model as a hierarchical Bayesian model and use the R package BayesFactor for analysis. We have also made our analysis code available at https://github.com/PerceptionAndCognitionLab/ctx-reliability.
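
As one concrete possibility, here is a minimal frequentist sketch using lme4's lmer; it is not the authors' Bayesian implementation, and the column names (rt, cond, id) are assumptions. The condition variable is coded −1/2 and +1/2 to match $x_k$ in the model.

```r
library(lme4)

# Random intercept (alpha_i) and random condition slope (theta_i) per person.
fit <- lmer(rt ~ 1 + cond + (1 + cond | id), data = dat)

summary(fit)        # fixed effects estimate mu_alpha and mu_theta
VarCorr(fit)        # SDs estimate sigma_alpha, sigma_theta, and residual sigma
coef(fit)$id$cond   # regularized (shrunken) estimates of each person's effect
```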

Figure 2a and b show the comparison for a single block of Stroop data in Hedge et al. (2018). In this block, there are about L = 42 trials per individual per condition. There are two panels—one for sample effects (di), and one for model-based estimates of θi. Participants are ordered from the ones that show the smallest Stroop effect to the ones that show the largest. In the left panel, the solid line shows the sample effect for each participant. The shaded area is the extent of each individual’s 95% confidence interval. In the right panel, the solid line shows the point estimate of θi and the shaded area is the analogous 95% credible interval. The three horizontal lines are the mean (solid) and one-standard-deviation markers (dashed).

Fig. 2

Analysis of a single task. a Sample effects from one block of data for all individuals with 95% CIs. b Model estimates from one block of data for all individuals with 95% credible intervals. c Hierarchical shrinkage in practice. The bottom row shows sample estimates from one block; the middle row shows the same from the hierarchical model; the top row shows sample estimates from all the blocks. The shrinkage from the hierarchical model attenuates the variability so that it matches that from much larger samples

Although the grand mean effect is about the same for model estimates and sample effects, individuals’ effects are different. The sample effects are spread out further from the grand mean than are the model estimates. This increased spread is expected and, as discussed above, comes about because the variability of di is the sum of true across-individual variation (σθ2) and trial-by-trial variation (2σ2/L). The spread of model-based estimates, in contrast, reflects true individual variation alone. This reduction of variability in hierarchical models is known as shrinkage or regularization (Efron and Morris, 1977). Regularization of estimates is a well-known property of hierarchical models, and regularized estimates are often more accurate than sample statistics (Lehmann & Casella, 1998).
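
The mechanics of this shrinkage are easy to see in a simplified case. If the variance components were known, the conditional mean of θi given a person's sample effect would be a precision-weighted compromise between that sample effect and the population mean,

$$ E(\theta_i \mid d_i) = w d_i + (1 - w)\mu_\theta, \qquad w = \frac{\sigma^2_\theta}{\sigma^2_\theta + 2\sigma^2/L}. $$

With few trials, w is small and estimates are pulled strongly toward the mean; as L grows, w approaches 1 and the shrunken estimates approach the sample effects.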

Because the benefits of regularization remain opaque in many hierarchical applications in cognitive psychology, we explore them in a bit more depth here. The estimates in Fig. 2b were from a single block. Hedge et al. ran ten such blocks distributed across two sessions, and we may ask how well the estimates from one of the blocks predict the whole ten-block set. The ten-block set consists of over 400 trials per person and condition. Figure 2c, the style of which is borrowed from Efron and Morris (1977), shows the sample estimates from one block (bottom row), the regularized model estimates from the same block (middle row), and the sample estimates from all ten blocks (top row). The most variable estimators are the sample estimators for a single block (bottom row), and these are much more variable than the sample estimators for all ten blocks (top row) due to the increased imprecision from smaller numbers of trials. The shrinkage is evident in the comparison of the sample estimates and hierarchical estimates from one block (bottom row vs. middle row). The model-based estimates from one block are in line with the variability from all ten blocks. The benefits of hierarchical shrinkage fall off as the numbers of trials are increased only because the sample effects approach their true values.

The model-based one-block estimators are better predictors of the overall data than are the sample one-block estimators. This improvement may be formalized by a root-mean-squared error measure. The error is about 1.62 times larger for the sample means than for the hierarchical model estimates. This benefit is general—hierarchical models provide for more accurate estimates of individuals’ latent abilities (James & Stein, 1961).

A two-task model for reliability and correlation

Correlations among sessions

The hierarchical model may be expanded from accounting for only one task to account for two tasks or two sessions. For two sessions, the goal is to estimate a portable measure of test-retest reliability; for two tasks, the goal is to measure a portable estimate of correlation. Let’s take reliability first. The Hedge et al. data set, which we highlight here, has a novel feature. These researchers sought to measure the test–retest reliability of several cognitive tasks. They had individuals perform 720 trials of a task one day, and 3 weeks later the individuals returned and performed another 720 trials. We let the subscript j = 1, 2 index the session. The trial-level model is expanded to:

$$ Y_{ijk\ell} \sim \text{Normal}(\alpha_{ij}+x_{k}\theta_{ij},\sigma^{2}). $$

Here, we simply expand the parameters for individual-by-session combinations.

The parameters of interest are θij and there are several specifications that may be made here. We start with the most general of these, an additive-components decomposition into common and opposite components:

$$ \begin{array}{@{}rcl@{}} \theta_{i1} &=& \nu_{1} + \omega_{i} - \gamma_{i},\\ \theta_{i2} &=& \nu_{2} + \omega_{i} + \gamma_{i}. \end{array} $$

The parameter νj is the main effect of the jth session, and by having separate main effect parameters for each session, the model captures the possibility of a systematic effect of session on the Stroop effect. The parameter ωi is the common effect of the ith individual; individuals that have large Stroop effects on both sessions have high values of ω. The parameter, γi, is the oppositional component. It captures idiosyncratic deviations where one individual may have a large Stroop effect in one session and a smaller one in another. These individual common effects and oppositional effects are random effects, and we place a hierarchical constraint on them:

$$ \begin{array}{@{}rcl@{}} \omega_{i} &\sim& \text{Normal}(0,\sigma^{2}_{\omega}),\\ \gamma_{i} &\sim& \text{Normal}(0,\sigma^{2}_{\gamma}). \end{array} $$

To gain an expression for the correlation among two sessions, it is helpful to express the multivariate distribution of the θs. The easiest way to do this is to write out the multivariate distribution for two individuals’ effects across two sessions:

$$ \left[\begin{array}{c} \theta_{11} \\ \theta_{12} \\ \theta_{21} \\ \theta_{22} \end{array}\right] \!\!\sim \mathrm{N}_{4}\!\left( \! \left[\begin{array}{c} \nu_{1} \\ \nu_{2} \\ \nu_{1} \\ \nu_{2} \end{array}\right], \left[\begin{array}{cccc} \sigma^{2}_{\omega}+\sigma^{2}_{\gamma} & \sigma^{2}_{\omega}-\sigma^{2}_{\gamma}& 0 & 0\\ \sigma^{2}_{\omega}-\sigma^{2}_{\gamma} & \sigma^{2}_{\omega}+\sigma^{2}_{\gamma} & 0 & 0\\ 0 & 0 &\sigma^{2}_{\omega}+\sigma^{2}_{\gamma} & \sigma^{2}_{\omega}-\sigma^{2}_{\gamma} \\ 0 & 0 & \sigma^{2}_{\omega}-\sigma^{2}_{\gamma} & \sigma^{2}_{\omega}+\sigma^{2}_{\gamma} \end{array}\right] \right)_{.} $$

From these distributions, it follows that the test–retest reliability, the correlation across θi1 and θi2, is

$$ \rho = \frac{\sigma^{2}_{\omega}-\sigma^{2}_{\gamma}}{\sigma^{2}_{\omega}+\sigma^{2}_{\gamma}} $$

Importantly, this quantity is portable as it does not include trial-by-trial variability.
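
The entries of the covariance matrix above, and hence ρ, follow directly from the additive decomposition and the independence of ω and γ:

$$ \text{Var}(\theta_{ij}) = \sigma^2_\omega + \sigma^2_\gamma, \qquad \text{Cov}(\theta_{i1},\theta_{i2}) = \text{Cov}(\omega_i - \gamma_i,\ \omega_i + \gamma_i) = \sigma^2_\omega - \sigma^2_\gamma. $$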

For comparison, we may also express the multivariate distribution of the sample effects for two individuals in two sessions. Box 1 shows this distribution.

Box 1

Multivariate distribution of sample effects

From this distribution, it is clear that the sample correlation between sample effects is estimating $(\sigma^2_\omega - \sigma^2_\gamma)/(\sigma^2_\omega + \sigma^2_\gamma + 2\sigma^2/L)$. It is the added trial noise in the denominator, $2\sigma^2/L$, that renders the sample test–retest correlation not portable and too small.

Figure 3 provides a real-data comparison of model-based and sample correlations. The top row is for a single block of Hedge et al.’s Stroop task; the bottom row is for all ten blocks of the same data set. The first column shows scatter plots of the sample effects from the first session against those from the second session. There is almost no test–retest correlation using a single block of data (A), but when all the data are considered, there is a moderate correlation (B). This pattern is the same as in Fig. 1, where reliability was diminished for smaller numbers of trials.

Fig. 3

Test–retest reliability for Hedge et al.’s Stroop-task data set. The top row shows the analysis for one block per session; the bottom row shows the analysis for all blocks. a, b Scatter plots of individuals’ sample effects show almost no relation for one block and a moderate one for all blocks. c, d Scatter plots of model estimates for individuals show a modest relationship for one block and a strong one for all blocks. e, f Posterior distribution of the model-based correlation parameter ρ. The correlation is modest and not well localized with one block; it is strong and well localized with all blocks

The model estimates of individual effects are shown in the middle column (C, D). Here, there is shrinkage, especially for the single block (C). This shrinkage, however, is to a line rather than to a point in the center. Why? Shrinkage in hierarchical models reflects the specific model specification of the latent parameters. For two tasks, we explicitly modeled the latent effects as being linearly related. This specification is then manifest in the pattern of shrinkage.

The last column shows the posterior distributions of the test–retest reliability coefficient. Here, for the single-block data there is much uncertainty (E). The uncertainty shrinks as the sample sizes grow, as indicated by the same plot for all the data (F). The posterior mean, a point-estimate for the test–retest reliability of the Stroop task, is 0.72. We also analyzed the flanker task, and the test–retest reliability of this task is 0.68.

Figure 3 highlights the dramatic difference between conventional and hierarchical-model analysis, especially for smaller numbers of trials per participant. If one uses the conventional correlation coefficient for one block of data, the resulting value is small, 0.10, and somewhat well localized (the 95% confidence interval is from −0.17 to 0.36). We emphasize this result is misguided. When all the data are used, the conventional correlation coefficient is 0.55, which is well outside this 95% confidence interval. Contrast this result with that from the hierarchical model. Here, not only is the correlation coefficient larger, 0.31, but the uncertainty is quite large as well (the 95% credible interval is from −0.32 to 0.82). Moreover, the value of the correlation from the hierarchical model with all of the data, 0.72, is within this credible interval. In summary, using the conventional analysis results in overconfidence in a wrong answer. The hierarchical model, however, tempers this overconfidence by accounting for trial-by-trial uncertainty as well as uncertainty across individuals.

It should be alarming that, in this real-world demonstration, sample estimates of correlations are so badly attenuated. The one-block case was based on a design with 48 trials per individual per condition. This is a typical number for individual-differences batteries with a large number of tasks. Yet, the attenuation is so severe that test–retest correlations are barely detectable. No wonder observed correlations across tasks are so low. The hierarchical model provides a more sensitive view of the structure in the data by allowing for shrinkage to a regression line.

The attenuation of correlation from measurement error is well known in classical test theory, and one approach is to apply a correction formula. The basic idea underlies the Spearman–Brown prophecy formula and the Spearman formula for the disattenuation of correlations (Spearman, 1904b). We used the latter to compute adjusted test–retest correlations for both tasks and for both the one-block and full data sets. The results are in Table 1, and there is a high degree of concordance between the model correlations and the corrected correlations. This overall concordance is expected because the correction for attenuation and the hierarchical models share the same foundational assumption about noise.
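
For reference, the classical disattenuation formula divides the observed correlation between two measures by the geometric mean of their reliabilities; how those reliabilities are estimated (e.g., from split halves or repeated sessions) is a choice of the analyst:

$$ \hat{\rho} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}, $$

where $r_{xy}$ is the observed correlation and $r_{xx}$ and $r_{yy}$ are the reliabilities of the two measures.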

Table 1 Test–retest correlations for Hedge et al., (2018) data sets

Correlations among tasks

The correlation among tasks may be analyzed with the same model that was used to analyze the correlation among sessions. Here we compare Stroop-task performance to flanker-task performance. Each of Hedge et al.’s participants completed both tasks. To correlate tasks, we combined data across sessions and fit the model where j indexes task rather than session.
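
A rough frequentist sketch of this two-task structure can again be written with lme4; the column names (id, task, cond, rt) are assumptions, cond is coded −1/2 and +1/2, and the correlation between the two random condition slopes plays the role of the model's cross-task correlation.

```r
library(lme4)

# Separate condition-effect predictors for each task so that each person gets
# a Stroop slope and a flanker slope; their correlation is estimated from the
# random-effects covariance matrix.
dat$stroop_x  <- ifelse(dat$task == "stroop",  dat$cond, 0)
dat$flanker_x <- ifelse(dat$task == "flanker", dat$cond, 0)

fit <- lmer(rt ~ task + stroop_x + flanker_x +
              (0 + task + stroop_x + flanker_x | id), data = dat)
VarCorr(fit)   # includes the correlation between the stroop_x and flanker_x slopes
```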

The results of the analysis are shown in Fig. 4. As can be seen, there appears to be no correlation. Inference about the lack of correlation will be made in the next section where model comparison is discussed.

Fig. 4

The lack of correlation between flanker task and Stroop task performance. Left: Scatter plot of individuals’ model-based estimates. Right: Posterior distribution of the correlation coefficient

Model comparison

The above analyses were focused on parameter estimation. The model-based estimates provided here are portable analogs of sample-based measures of effect size, reliability, and correlation. The difference is that they account for variation at the trial level and, consequently, may be ported to designs with varying numbers of trials.

Researchers, however, are often interested in stating evidence for theoretically meaningful propositions. In the next section, we describe a set of theoretically meaningful propositions and their model implementation. Following this, we present a Bayes factor method for model comparison.

Theoretical positions and model implementation

When assessing the relationship between two tasks, the main target is the true latent correlation in the large-trial limit. There are two opposing, theoretically important positions: (1) there is no correlation, and (2) there is full correlation. A lack of true correlation indicates that the two tasks are measuring independent psychological processes or abilities. Conversely, if there is full correlation, then the two tasks are measuring the same psychological processes or abilities.

In the preceding section, we presented an estimation model, which we now call the general model. The critical specification is that of θij, the individual-by-task effect. We modeled these as:

$$ \begin{array}{@{}rcl@{}} \mathcal{M}_{g}: && \theta_{ij}=\nu_{j}+\omega_{i}+u_{j}\gamma_{i},\\ && \omega_{i} \sim \text{Normal}(0,\sigma^{2}_{\omega}),\\ && \gamma_{i} \sim \text{Normal}(0,\sigma^{2}_{\gamma}), \end{array} $$

where u = (−1, 1) for the two tasks. In this model, the correlation among individuals’ effects reflects the balance of the variabilities of ω and γ. All values of correlation on the open interval (−1, 1) are possible. Full correlation is not possible, and there is no special credence given to no correlation. To represent the two theoretical positions, we develop alternative models on θij.

A no-correlation model is given by putting uncorrelated noise on θij:

$$ \mathcal{M}_{0}: \quad \theta_{ij} \sim \text{Normal}(\nu_{j},\sigma^{2}_{\theta}). $$

The no-correlation and the general models provide for different constraints. The general model has regularization to a regression line reflected by the balance of the variabilities of ω and γ. The no-correlation model has regularization to the point (ν1, ν2).

A full-correlation model is given by simply omitting the γ parameters from the general model:

$$ \begin{array}{@{}rcl@{}} \mathcal{M}_{1}: && \theta_{ij}=\nu_{j}+\omega_{i},\\ && \omega_{i} \sim \text{Normal}(0,\sigma^{2}_{\theta}) \end{array} $$

Here, there is a single random parameter, ωi, for both tasks per individual. In the full-correlation model, regularization is to a line with a slope of 1.0.

Bayes factor analysis

We use the Bayes factor (Edwards, Lindman, & Savage, 1963; Jeffreys, 1961) to measure the strength of evidence for the three models. The Bayes factor is the probability of the observed data under one model relative to the probability of the observed data under a competitor model.
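
In symbols, for any two competing models $\mathcal{M}_a$ and $\mathcal{M}_b$, the Bayes factor is

$$ B_{ab} = \frac{\Pr(\boldsymbol{Y} \mid \mathcal{M}_a)}{\Pr(\boldsymbol{Y} \mid \mathcal{M}_b)}, $$

where each marginal likelihood integrates the likelihood over the prior distribution of that model's parameters.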

Table 2 shows the Bayes factor results for the Stroop and flanker task data sets from Hedge et al. (2018). The top two rows are for the Stroop and flanker data, and the correlation being tested is the test–retest reliability. The posterior means of the correlation coefficients are 0.72 and 0.68 for the Stroop and flanker tasks, respectively. The Bayes factors confirm that there is ample evidence that the correlation is neither null nor full. Hence, we may conclude that there is indeed some, though not a lot of, added variability between the first and second sessions in these tasks. The next row shows the correlation between the two tasks. Here, the posterior mean of the correlation coefficient is –0.06, and the Bayes factors confirm that the no-correlation model is preferred. The final row is a demonstration of the utility of the approach for finding dimension reductions. Here, we split the flanker-task data in half by odd and even trials rather than by sessions. We then submitted these two sets to the model and calculated the correlation. It was, of course, quite high; the posterior mean of the correlation was 0.82. The Bayes factor analysis concurred, and the full-correlation model was favored by 31-to-1 over the general model, the nearest competitor.

Table 2 Bayes factor values for competing models of correlation

The Appendix provides the prior settings for the above analyses. It also provides a series of alternative settings for assessing how sensitive Bayes factors are to reasonable variation in priors. With these alternative settings, the Bayes factors attain different values. Table 3 shows the range of Bayes factors corresponding to these alternative settings. This table provides context for understanding the limits of the data and the diversity of opinion they support.

Table 3 Sensitivity of Bayes factor to prior settings

General discussion

In this paper, we examined classical test theory analysis of experimental tasks. The main difficulty in classical analysis occurs when researchers aggregate across trials to form individual-by-task scores. If aggregated scores are used as input, then conventional sample measures are contaminated by removable trial-by-trial variation. With this contamination, effect sizes, reliabilities, and correlations are too low, sometimes dramatically so. We take a standard statistical approach and advocate rather conventional mixed linear models (e.g., Raudenbush and Bryk, 2002). The key for this application is to apply the hierarchical models to the trial-level data to remove trial-by-trial variation, and, to the best of our knowledge, this usage is novel in the inhibition context. Concepts such as effect size, reliability, and correlation are portable when defined in the asymptotic limit of unbounded trials per individual. In the current models, performance at these asymptotic limits corresponds to explicit model parameters.

With this development, it is possible to assess whether observed low correlations across tasks reflect low reliability or a true lack of association. We examine this problem for the Stroop and flanker task data reported in Hedge et al. (2018). We find relatively high test–retest reliability for both tasks. This high reliability allows for the interpretation of the correlation between the Stroop and flanker tasks. There is direct evidence for a null correlation.

Individual-difference researchers are intimately familiar with mixed linear models, and these are used regularly to decompose variability. Adding one additional level, the trial level, is conceptually trivial and computationally straightforward. Indeed, modeling at this level is common in high-stakes testing (IRT, Lord and Novick, 1968), cognition (Lee and Webb, 2005; Rouder & Lu, 2005), and linguistics (Baayen, Tweedie, & Schreuder, 2002). In the Appendix, we show how the nWayAOV function in the BayesFactor package may be used for implementation.

Models vs. corrections

One of the seemingly promising results here is that existing corrections for measurement noise in the Hedge data set yield similar results to hierarchical models. Perhaps this is not surprising as these corrections are based on the same assumptions about noise. Yet, we believe that researchers should invest in modeling even though correction formulas are much easier to implement in practice. Here is why: First, the Spearman correction yields a point estimate of the true latent value without a sense of uncertainty. Model-based point estimates, like the ones presented here, often come with a measure of uncertainty. Second, models may be compared to answer questions about whether the correlation is zero, one, or intermediary. Corrections offer no such inferences. Third, models may be extended to represent more complex structure in data. We know of no comparable correction for measurement noise in say factor analysis. Fourth, and most importantly, modeling provides for a deeper and more nuanced understanding of structure in data. Knowing how models represent structure and how data influences model-based conclusions allows researchers to add value in addressing theoretically important substantive issues in a way that correction procedures cannot.

More advanced task models

The field of individual differences has moved far beyond the consideration of two tasks or instruments. The field is dominated by multivariate, latent-variable models including factor models, structural equation models, state-space models, and growth models (e.g., Bollen, 1989). When task scores are used with these advanced models, the results are not portable and, consequently, difficult to interpret unless trial-by-trial variation is modeled. A critical question is whether the hierarchical models specified here extend well beyond two tasks. Can, for example, a factor analysis of task covariation be performed while extending the model down to the trial level?

The good news is that these extensions, at least for estimation, are seemingly straightforward in both Bayesian and frequentist contexts. For frequentist modeling, the extensions still fall within mixed linear modeling classes and can be implemented in packages such as Mplus (Muthén & Muthén, 2005). Likewise, for Bayesian modeling, general-purpose packages such as stan (Carpenter et al., 2017) and JAGS (Plummer, 2003) are well suited for developing advanced latent variable models that account for trial-by-trial variation.

Although estimation in these models may be straightforward, model comparison strikes us as a thorny issue. There are many model comparison strategies. Of these, we find Bayes factors to be the most intellectually appealing (e.g., Rouder & Morey 2012; Rouder, Morey, & Wagenmakers, 2016). Developing Bayes factors for mixed models is certainly timely and topical, but the computational issues may be difficult. As a result, there is much work to be done if one wishes to state the evidence for theoretically motivated constraints on covariation across several tasks.

A caveat

In the beginning of this paper, we asked whether low correlations reflect statistical considerations or substantive considerations. The answer here, with a large data set, is substantive: we state evidence for a lack of correlation across the Stroop and flanker tasks. That said, we should not understate the statistical considerations.

We suspect there is less true individual variability in many tasks than has been realized, and that apparent variability comes more from the trial level than from true individual differences. Consider a typical priming task. Trials in these tasks take about 500 ms to complete, and a large effect is about 50 ms. If the average is 50 ms, how much could individuals truly vary? The answer, we believe, is “not that much,” especially if we assume that no individual has true negative effects (Haaf & Rouder, 2017; Rouder & Haaf, 2018). Indeed, we have analyzed the observed and true (latent) variation across individuals in several tasks (Haaf & Rouder, 2017; in press). Observed variation is usually on the order of hundreds of milliseconds. Yet, once trial-by-trial variation is modeled, individuals’ true values vary from 10 to 40 ms. For such a narrow range, accurate understanding of individuals would require estimating each individual’s effect to within a few milliseconds. Even the hierarchical models we advocate cannot mitigate this fact; if there is too little resolution, then the posterior reliabilities and correlations will not be well localized. Obtaining the requisite level of resolution to study individual differences in these tasks may require several hundred trials per individual per condition, and this design choice may prove more critical than collecting data from hundreds of individuals.
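
To make this resolution argument concrete, here is a back-of-the-envelope calculation in which the trial-level standard deviation is assumed, purely for illustration, to be σ = 200 ms. The standard deviation of an individual's sample effect due to trial noise alone is $\sqrt{2\sigma^2/L}$:

$$ L = 50: \sqrt{2(200)^2/50} \approx 40 \text{ ms}; \qquad L = 200: \approx 20 \text{ ms}; \qquad L = 800: \approx 10 \text{ ms}. $$

If true individual effects span only 10 to 40 ms, trial noise of this magnitude dominates the observed scores until several hundred trials per condition have been collected.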