Theoretical background

The impact of the internet on modern daily life is immense. It is therefore not surprising that the internet has also found its way into experimental psychology, with 11%–31% of the studies published in major cognitive science journals relying on internet-based data collection (Stewart, Chandler, & Paolacci, 2017). Over the last few years, the systematic investigation of internet-based response time (RT) assessments has provided evidence against most of the preconceptions that initially hindered their broad application (Germine et al., 2012). As a result, internet-based mental chronometry is now known to be almost as precise as traditional assessments conducted offline in laboratory settings.

To date, a considerable number of studies have investigated absolute RTs collected online. Keller, Gunasekharan, Mayo, and Corley (2009) evaluated the precision of internet-based recordings of known time intervals and found it to be remarkably high (up to a 22-ms offset in Windows operating systems). However, data based on human RTs usually indicate a small degree of overestimation in internet-assessed RT measures (typically 10–100 ms; Brand & Bradley, 2012) as compared to offline measures. Nonetheless, this systematic overestimation of internet-assessed RT data is usually negligible (e.g., Chetverikov & Upravitelev, 2016; Germine et al., 2012; Reimers & Stewart, 2007) and, therefore, hardly compromises the replicability of most cognitive paradigms. Crump, McDonnell, and Gureckis (2013) conducted an internet-based replication of eight widely used cognitive paradigms (e.g., Stroop, switching, flanker, Simon, and attentional blink). In this study, only the masked-priming effect could not be successfully replicated, due to a lack of precision with presentation times shorter than 65 ms. A substantial body of evidence supports the successful online replication of cognitive paradigms, including the studies of Simcox and Fiez (2014; replication of a flanker effect and the lexical-decision paradigm), Barnhoorn, Haasnoot, Bocanegra, and van Steenbergen (2015; replication of the Stroop effect, attentional blink effect, and masked-priming effect), and Reimers and Maylor (2005; replication of the task-switching paradigm).

In light of these findings, internet-based mental chronometry appears to have a significant advantage: Cognitive paradigms can be presented to a wide range of participants while data are collected almost instantly and at little cost (Reips, 2002). However, performance assessments via the internet seem to come with the drawback of some additional, unexplained variability (e.g., Neath, Earle, Hallett, & Surprenant, 2011; Schubert, Murteira, Collins, & Lopes, 2013). Brand and Bradley (2012) distinguished two putative main sources of this variability—that is, technical variability and environmental variability. It is well known that the use of different testing hardware (e.g., CPU, keyboard, screen, or mouse) drives some variability in RT data (Neath et al., 2011; Plant & Turner, 2009). The same applies to diverse types of software, such as varying operating systems, driver versions (Plant & Quinlan, 2013), and web browsers (Reimers & Stewart, 2015; Semmelmann & Weigelt, 2017). Beyond these technical aspects, the general lack of experimental control (to avoid effect confounding by person and environmental factors) and standardization (to minimize performance variance due to different assessment environments) probably also influences internet-based mental chronometry. Thus, the difficulty of controlling for distraction (Brand & Bradley, 2012), as well as the missing guidance and control provided by an experimenter (Reips, 2002), can decrease data quality.

Research rationale

Although many studies in past years have aimed to prove the general precision of internet-based mental chronometry, certain information about the quality of such assessments is still missing. According to Germine et al. (2012), the quality of performance measures is reflected by three main aspects—that is, (1) their central tendency, (2) their variance, and (3) their reliability. These three quality indicators will help us to identify several properties of internet-based mental chronometry that have not been investigated so far.

With regard to the first aspect, most of the above-mentioned studies (e.g., Barnhoorn et al., 2015; Crump et al., 2013; Simcox & Fiez, 2014) focused on differences in the central tendencies of laboratory- versus internet-assessed performance measures. Given that such differences in the central tendencies were largely negligible, the respective paradigms were considered replicable in both settings. Notably, such replicability statements hinge on the type of performance measure. Commonly, the mean or median RT of each individual is considered the primary performance measure for mental chronometry. Recently, however, the joint analysis of RTs and errors (ERR) using the diffusion model (DM; Voss, Nagler, & Lerche, 2013) has gained popularity, because its measures provide additional information based on higher-order moments (e.g., the skewness) of the individual RT distributions (see also Wagenmakers, 2009). Because the central tendencies of such measures have not been compared between laboratory- and internet-based assessments, the present study went beyond the investigation of conventional performance measures by also fitting the DM and replicating its effects in different cognitive paradigms.

With regard to the second aspect, increased variability in internet-assessed performance measures has often been presumed (Neath et al., 2011; Reimers & Stewart, 2015). Nonetheless, we still lack estimates of the practical extent of this variance increase, which has important implications for the statistical power to successfully replicate cognitive paradigms. One attempt to provide such data was carried out by Brand and Bradley (2012), but their estimates were informed only by simulated data and considered only the portion of technical (i.e., setup) variability in performance measures. The portion of environmental variability was disregarded, although the latter aspect has been debated for almost as long as RTs have been assessed online (Hecht, Oesker, Kaiser, Civelek, & Stecker, 1999). Another attempt to estimate the additional variance introduced by online data collection was carried out by de Leeuw and Motz (2016). However, those authors focused exclusively on software differences in offline versus online assessments and disregarded actual setting variability. Consequently, in the present study we aimed to further investigate the impact of deviations from standardized laboratory settings on the variability of internet-assessed performance measures (including those obtained by diffusion modeling). Detailed insights into this variability will help us quantify the putative loss of statistical power whenever cognitive performance is assessed in domestic environments.

Judging by the number of published studies, the reliabilities of performance measures (i.e., the third aspect) are seldom explicitly quantified in cognitive psychology, although they convey important information about the maximum effect that can be attained in a given paradigm (e.g., Paap & Sawi, 2016). To our knowledge, only Germine et al. (2012) have provided reliability information about internet-assessed performance measures, and this information was restricted to the internal consistency of ERR data. Hence, neither the internal consistency of internet-assessed RT measures nor the test–retest stability of internet-assessed RT and ERR measures has yet been reported. In the present study we therefore estimated the within- and between-session reliabilities of internet-based mental chronometry in different cognitive paradigms.

Method

Sample

A total of 137 students at Technische Universität Dresden took part in the study. However, only 127 of the participants (33 male, 94 female; between 18 and 40 years, Mage = 23.6 years, SD = 4.1) completed both sessions and, hence, provided a complete set of data. Informed consent was given by each participant prior to the procedure. Each participant received €18 in compensation or, in the case of psychology students, the equivalent in credit points. Approval was granted by the local ethics committee.

Apparatus and materials

Internet-based paradigm presentation was implemented in both the lab and home settings with Inquisit4Web (Millisecond Software, Seattle, USA), client-side software for Windows and OS X systems.

In the lab setting (lab), participants used a Windows computer (Windows XP Professional, Version 2002, Service Pack 3) in combination with a 19-in. TFT display, a QWERTZ keyboard, and an optical mouse. The online experiment was initiated using the Firefox browser (version 39.0). For the domestic setting (home), the only technical restriction was to use a desktop or laptop PC in combination with a physical mouse and keyboard; tablets, smartphones, and responses via touch-sensitive surfaces were not permitted.

To allow application of our findings to a broad field of RT research, we aimed to investigate a diverse range of cognitive constructs. According to Miyake et al. (2000), the three interrelated constructs “shifting,” “updating,” and “inhibition” cover the majority of performance variance in the RT tasks commonly used to assess executive functioning. We chose three RT tasks, one tapping each of these constructs: a number–letter task (Rogers & Monsell, 1995; Fig. 1A) to assess “shifting,” a go/no-go task (Wolff et al., 2016; Fig. 1B) to assess “inhibition,” and a spatial two-back task (Friedman et al., 2008; Fig. 1C) to assess “updating.” In each trial of the number–letter task, a character–digit pair was presented below or above a black bar. Participants were asked to classify the character (a or b) when the stimulus appeared below the bar, and the number (1 or 2) when it appeared above the bar. The “y” key (in response to a and 1) and the “m” key (in response to 2 and b) served as response keys. Switch and repeat trials were defined by whether the target type (i.e., number or letter) differed from or matched that of the previous trial. The task consisted of 256 trials in total (50% switch trials, 50% repeat trials). The go/no-go task required the classification of the alignment of two circles as vertical (go, the predominant trial type) or horizontal (no-go). Because DM analyses require two-choice data, the classical go/no-go task was slightly modified: Both go trials (“y” key) and no-go trials (“m” key) required a key response. The task consisted of 400 trials (87.5% go trials, 12.5% no-go trials). Hence, the predominant response tendency established by the go stimuli needed to be inhibited in the rarer no-go trials. The two-back task required participants to continuously judge whether the currently presented stimulus array matched the array presented two trials before (target vs. nontarget), using the same response keys as in the other paradigms. Hence, correct performance required the continuous updating of working memory. The task consisted of 160 trials (70% nontargets, 30% targets).

Fig. 1 Illustration of the task battery employed, which consisted of three executive-functioning tasks: (A) number–letter, (B) go/no-go, and (C) spatial two-back
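The trial numbers and trial-type proportions described above can be summarized in a brief sketch. This is purely illustrative (it is not the original Inquisit script), and note that in the actual number–letter task the switch/repeat status of a trial follows from the sequence of target types rather than from independent random assignment:

```r
# Illustrative trial-type frequencies for the three tasks, following the
# proportions described in the text (not the original Inquisit scripts).
set.seed(1)
number_letter <- sample(rep(c("switch", "repeat"), each = 128))             # 256 trials, 50%/50%
go_nogo       <- sample(rep(c("go", "no-go"), times = c(350, 50)))          # 400 trials, 87.5%/12.5%
two_back      <- sample(rep(c("nontarget", "target"), times = c(112, 48)))  # 160 trials, 70%/30%

sapply(list(number_letter = number_letter, go_nogo = go_nogo, two_back = two_back), length)
round(prop.table(table(go_nogo)), 3)
```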

Procedure

Participants were randomly assigned to the first setting—that is, “lab” or “home.” Prior to the first session, each participant received an email that contained all information about the testing (i.e., the consent form, instructions for using the individualized participation code, and assessment dates), as well as a web link leading to the fully automatized online experiment. After starting the experiment, all participants were asked to enter their participation code and to indicate whether the current session was being performed in the lab or at home. Next, they were asked to report their sex and age. Thereafter, the tasks were presented in the following order: number–letter task, go/no-go task, and two-back task. Each task was preceded by task instructions and practice trials. Between tasks, participants had the opportunity to take a short break. The whole procedure took approximately 1 h per session. Participants were instructed to complete the second session seven days after the first.

Only in the lab was an experimenter present, who welcomed the participants, guided them to the computer, and dismissed them after the procedure had been completed. Notably, the experimenter did not interact with the participants during the administration of the online task battery. For the home session, participants were told to conduct the experiment in a calm and nondistracting environment.

Performance measures

RT and ERR data enable the investigation of a broad range of performance measures. However, we focused on those measures that are commonly used as performance indicators in studies employing the respective tasks. For the number–letter task, we focused on the differences in mean log(RT) and in relative error frequencies (ERR) between switch and repeat trials. For the go/no-go task and the spatial two-back task, we focused on mean log(RT) and ERR in the no-go trials and the target trials, respectively. Besides these six conventional performance measures, which were of primary interest to our analyses, further secondary measures are listed in the Appendix.
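As an illustration, the following sketch shows one way these conventional measures could be derived from trial-level data. The data frame trials and its columns (id, task, trial_type, rt, error) are hypothetical stand-ins for the actual data format, and the original analysis may additionally have excluded error trials from the RT averages; the definitive implementation is the authors' analysis script on OSF.

```r
# Hedged sketch of the primary conventional performance measures; column and
# object names are illustrative assumptions, not the original file format.

# Number–letter task: switch-minus-repeat differences in mean log(RT) and ERR
nl      <- subset(trials, task == "number_letter")
nl_rt   <- with(nl, tapply(log(rt), list(id, trial_type), mean))
nl_err  <- with(nl, tapply(error,   list(id, trial_type), mean))
C_NLRT  <- nl_rt[, "switch"]  - nl_rt[, "repeat"]
C_NLERR <- nl_err[, "switch"] - nl_err[, "repeat"]

# Go/no-go task: mean log(RT) and ERR in the critical no-go trials
gn      <- subset(trials, task == "go_nogo" & trial_type == "nogo")
C_GNRT  <- with(gn, tapply(log(rt), id, mean))
C_GNERR <- with(gn, tapply(error,   id, mean))

# The two-back measures (C-2BRT, C-2BERR) follow the same scheme for target trials.
```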

Lately, the DM has become increasingly popular for joint analyses of RT and ERR data from cognitive two-choice tasks. To account for this development, we also performed DM analyses. The DM measures allow for a finer, theory-driven interpretation of the underlying cognitive processes, as compared to conventional performance measures based on aggregate data, such as mean RTs (Voss et al., 2013; Voss, Rothermund, & Voss, 2004). The most commonly estimated measures are the boundary separation (a), the relative starting point (zr), the drift rate (v), and the nondecision time (t0). The most prominent measure is v (typical range: – 5 < v < 5), which describes the mean speed of the information accumulation process toward the correct response option. The higher the boundary separation a (0.5 < a < 2), the more cautious is the response style of the individual. A preference for one of the two response options is expressed by zr (.3 < zr < .7), with zr = .5 indicating no preference. All residual processes (i.e., sensory encoding and motor execution of the response) are expressed by t0 (.1 < t0 < .5). For all trials of the respective task, we estimated common a, zr, and t0 measures, whereas v was estimated separately for each trial type of the go/no-go task and the two-back task. In the case of the number–letter task, all DM measures were estimated separately for switch and repeat trials, and performance measures were then calculated as the differences between the two conditions. An overview of all analyzed performance measures and their labels is presented in Table 1.

Table 1 Overview of the analyzed performance measures
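To illustrate how these parameters are commonly interpreted, the following didactic sketch simulates single trials of the underlying diffusion process (a simple Euler approximation with a diffusion constant of 1). The parameter values are arbitrary examples within the ranges given above; this is not the estimation routine implemented in fast-dm, which instead fits the model to observed RT distributions (see below).

```r
# Didactic random-walk simulation of single diffusion-model trials, showing how
# a, zr, v, and t0 jointly shape RTs and accuracy (illustration only).
simulate_dm_trial <- function(a = 1.0, zr = 0.5, v = 1.5, t0 = 0.3,
                              dt = 0.001, s = 1) {
  x <- zr * a                        # starting point on the decision axis
  t <- 0
  while (x > 0 && x < a) {           # accumulate evidence until a boundary is reached
    x <- x + v * dt + s * sqrt(dt) * rnorm(1)
    t <- t + dt
  }
  c(rt = t + t0,                     # decision time plus nondecision time t0
    response = as.integer(x >= a))   # 1 = upper (correct) boundary, 0 = lower boundary
}

set.seed(1)
sim <- t(replicate(1000, simulate_dm_trial()))
mean(sim[, "response"])              # accuracy implied by the chosen parameter values
quantile(sim[, "rt"], c(.1, .5, .9)) # corresponding RT distribution
```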

Statistical analyses

All two-stage analyses reported in this article were performed using the R 3.3.1 statistical software (R Core Team, 2017). The raw data and the R script of our analyses can be downloaded from https://osf.io/64q2z/.

In internet-based mental chronometry, thorough outlier removal at the first analysis stage is common practice to control for unwanted environmental variability, such as transient distractions (Brand & Bradley, 2012; Keller et al., 2009). Because we were specifically interested in these effects, only minimal outlier control was applied to our data. Prior to the analyses, all timed-out responses were excluded (0.5% of all trials). Furthermore, trials with log(RT) more than 2.5 standard deviations (SDs) below or above the conditional mean log(RT) of each session, task, and trial type were excluded (1.9% of all trials). Moreover, three participants were excluded from the analyses of the two-back task, due to technical problems during this paradigm in the home session.
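These exclusion criteria could be implemented along the following lines. The data frame trials and its columns are again hypothetical, and the original analysis may additionally have computed the conditional means per participant:

```r
# Hedged sketch of the trial-level exclusions described above.
trials <- subset(trials, !timeout)                    # drop timed-out responses (0.5%)

trials$logRT <- log(trials$rt)
cell <- with(trials, interaction(session, task, trial_type, drop = TRUE))
m    <- ave(trials$logRT, cell, FUN = mean)           # conditional mean log(RT)
s    <- ave(trials$logRT, cell, FUN = sd)             # conditional SD of log(RT)
trials <- subset(trials, abs(logRT - m) <= 2.5 * s)   # exclude trials beyond +/- 2.5 SD (1.9%)
```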

For each task, the above-mentioned performance measures were calculated and then submitted to the second analysis stage. DM measures were estimated by minimizing the Kolmogorov–Smirnov statistic between the observed and the model-implied distribution of correct and error RTs using fast-dm (Voss & Voss, 2007).

At the second analysis stage, hierarchical regression analyses of each measure were conducted using generalized least squares estimation. The conditional mean of the respective performance measure (y) was modeled by an intercept (β0) and by effects of the setting (β1; home = 0, lab = 1), the session number (β2; first session = 0, second session = 1), and the setting of the initial session (β3; home = 0, lab = 1). Accordingly, the structural part of the regression model can be expressed by Formula 1:

$$ y={\beta}_0+{\beta}_1\,\mathrm{Setting}+{\beta}_2\,\mathrm{Session}+{\beta}_3\,\mathrm{InitialSetting}. $$
(1)
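One possible way to specify this model in R is via nlme::gls, combining the structural part above with the setting-specific residual SDs and the between-session correlation described in the following paragraphs. This is a hedged sketch under assumed variable names (a long-format data frame d with one row per participant and session and variables y, setting, session, initial_setting, and id); the authors' actual implementation is available in their OSF script and may differ.

```r
# Sketch of a generalized least squares model with heteroscedastic residuals
# (by setting) and correlated residuals within participants (across sessions).
library(nlme)

d$setting <- factor(d$setting, levels = c("home", "lab"))   # home = 0, lab = 1

fit <- gls(y ~ setting + session + initial_setting,         # structural part (Formula 1)
           data        = d,
           weights     = varIdent(form = ~ 1 | setting),    # separate residual SDs for lab and home
           correlation = corCompSymm(form = ~ 1 | id))      # correlation of the two sessions

summary(fit)$tTable   # estimates and p values for beta_0 to beta_3
intervals(fit)        # relative residual SD (i.e., the SD ratio) and between-session correlation
```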

Formal hypothesis testing for setting differences (β1) in the 18 primary performance measures was performed on the basis of a tail-area false discovery rate of FDR = 5% (Benjamini & Hochberg, 1995).
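For illustration, the same decision rule can be reproduced with base R's p.adjust; the vector of p values below is random and merely stands in for the 18 observed p values of β1:

```r
# Benjamini–Hochberg procedure at FDR = 5% (illustrative p values only).
set.seed(1)
p_setting <- runif(18)                                  # placeholder for the observed p values
reject    <- p.adjust(p_setting, method = "BH") <= .05  # equivalent to the FDR = 5% criterion
sum(reject)
```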

The employed regression models also accounted for differences in the residual standard deviation (SD) of y between the lab (σ) and home (ω·σ). The resulting difference in the variability of the performance measures between lab and home was therefore expressed as an SD ratio (ω), with the SD in the lab (σ) serving as the reference. Hence, the following formula (Eq. 2) applied:

$$ \omega ={SD}_{\mathrm{Home}}/{SD}_{\mathrm{Lab}}={SD}_{\mathrm{Home}}/\sigma . $$
(2)

Each regression model further provided an estimate of the correlation between the two sessions (rTR). Given the design of the present study, rTR was interpretable as an estimate of the test–retest stability (time lag: one week) of the respective performance measure. To provide reference points for these stability estimates, the internal reliability of each performance measure was also quantified by correlating scores derived from odd- versus even-numbered trials and applying the Spearman–Brown correction to the resulting half-test correlations.
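The split-half procedure amounts to a few lines of R. The per-half score vectors are assumed to have been computed in the same way as the full measures, but separately for odd- and even-numbered trials:

```r
# Odd/even split-half reliability with the Spearman–Brown step-up correction.
split_half_reliability <- function(score_odd, score_even) {
  r_half <- cor(score_odd, score_even, use = "pairwise.complete.obs")
  2 * r_half / (1 + r_half)   # corrects for the halved number of trials
}
# Example (hypothetical vectors, one value per participant):
# r_home <- split_half_reliability(score_odd_home, score_even_home)
```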

To obtain confidence intervals (CIs) for the estimated SD ratios (ω), test–retest stabilities (rTR), and internal reliabilities (rLab and rHome), a nonparametric bootstrap with n = 50,000 replicates was performed for each model.
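A participant-level nonparametric bootstrap of these quantities could look as follows, continuing the sketch above (again with assumed variable names, and shown here only for the SD ratio; the test–retest and split-half correlations can be resampled in the same way). The number of replicates can be lowered for a quick trial run:

```r
# Participant-level nonparametric bootstrap for the SD ratio of one model.
library(nlme)

omega_hat <- function(dat) {
  fit <- gls(y ~ setting + session + initial_setting, data = dat,
             weights     = varIdent(form = ~ 1 | setting),
             correlation = corCompSymm(form = ~ 1 | id))
  exp(coef(fit$modelStruct$varStruct))   # residual SD ratio relative to the reference setting
}

set.seed(1)
boot_omega <- replicate(50000, {
  ids    <- sample(unique(d$id), replace = TRUE)          # resample participants
  d_star <- do.call(rbind, lapply(seq_along(ids),
                    function(k) transform(d[d$id == ids[k], ], id = k)))
  tryCatch(omega_hat(d_star), error = function(e) NA)     # skip non-converging resamples
})
quantile(boot_omega, c(.025, .975), na.rm = TRUE)         # percentile 95% CI
```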

Results

To provide a general description of the investigated performance measures, Table 2 lists their means and standard deviations in both settings (i.e., lab vs. home). The inferential analyses are based on multiple hierarchical regression modeling to compare the performance measures between the different settings. A comprehensive list of all estimates is provided in Table 3. On the basis of the Benjamini–Hochberg procedure (FDR = 5%), p ≤ .02 was considered the significance threshold for formal hypothesis testing.

Table 2 Mean performance measures and their mean standard deviations across participants
Table 3 Regression estimates with variance estimation according to settings and the reliabilities of the respective performance measures

Systematic setting differences

Regression modeling yielded no evidence of significant differences between the settings (i.e., lab and home), either for the conventional performance measures (– 0.01 ≤ β1 ≤ 0.01, .24 ≤ p ≤ .61) or for the DM-based measures (– 0.04 ≤ β1 ≤ 0.20, .05 ≤ p ≤ .81).

The change in performance measures from the first to the second session was estimated by β2 (see Table 3). Some models—that is, C-GNRT, C-GNERR, C-2BRT, C-2BERR, DM-NLt0, DM-GNt0, DM-GNv, DM-2Ba, and DM-2Bt0—suggested improvements in task performance (p ≤ .01). This probably reflects the well-known practice effects that occur in repeatedly conducted cognitive paradigms (e.g., Davidson, Zacks, & Williams, 2003; Enge et al., 2014). All remaining estimates of β2 provided no considerable evidence of performance improvements (.08 ≤ p ≤ .87).

To control for potential asymmetries in the initial condition, β3 estimated differences in performance measures due to the initial setting. Models DM-NLzr, DM-NLt0, DM-NLv, and DM-2Bt0 revealed borderline associations (.01 ≤ p ≤ .04), whereas all other models suggested that differences due to initial setting were negligible (.21 ≤ p ≤ .99).

Variability in the two settings

Each regression model quantified the difference in the residual variability of the 18 performance measures between the two settings. The estimated SD ratio ω quantified the relative change in SD in the home condition as compared to the lab condition (σ). The majority of models yielded ω > 1 (1.01 ≤ ω ≤ 1.24), indicating higher variability of the performance measures at home (see Fig. 2). However, seven of the 18 models—that is, C-NLERR, C-2BRT, DM-NLa, DM-NLt0, DM-GNa, DM-2Ba, and DM-2Bt0—yielded ω < 1 (0.86 ≤ ω ≤ 0.98). In some cases the variability of performance measures might therefore have been smaller in the domestic setting than in the standardized lab setting.

Fig. 2 SD ratios ω (home/lab) of the respective performance measures. Error bars indicate the 95% confidence intervals based on bootstrapping (n = 50,000). SD ratios ω > 1 indicate more residual variance at home, whereas ratios ω < 1 indicate more residual variance in the lab. CP = conventional measures; DM = measures from the diffusion model; RT = response time; ERR = error rate; a = boundary separation; zr = relative starting point; t0 = nondecision time; v = drift rate

Bootstrapping was used to estimate the sampling variability (i.e., the 95% confidence intervals [CIs]) of all the estimated ω values. According to these analyses, only two models showed an ω that differed considerably from 1. Model C-GNERR, representing the ERR of no-go trials in the go/no-go task, yielded ω = 1.19 [1.02, 1.39]. Model DM-GNzr, estimating zr in the go/no-go task, yielded ω = 1.20 [1.05, 1.36]. Both models provided evidence that the variability of some performance measures may have increased in the less standardized setting.

To increase the precision of the estimated ωs, we finally pooled the performance measures of each task through Bayesian meta-analyses (while accounting for the between-measure variability of ω using the Berger–Bernardo reference prior; see Bodnar, Link, Arendacká, Possolo, & Elster, 2017). In the go/no-go task, the mean SD ratio increased by 14.9% in the domestic setting (ω = 1.15, CI95% = 0.96–1.37). By contrast, the number–letter and two-back tasks yielded no considerable evidence for such an increase, with ω = 1.00 (CI95% = 0.83–1.20) and ω = 1.01 (CI95% = 0.84–1.23), respectively. Note that ω did not differ systematically between the conventional and DM-based performance measures. A precise record of all estimated ωs and their CIs is given in Table 3. Additional calculations using further performance measures confirmed these findings and are provided in the Appendix.

Reliability in both settings

The test–retest stability rTR (time lag: seven days) between the two sessions was also estimated by the regression models. For the conventional measures the test–retest stabilities fell within the range .33 ≤ rTR ≤ .73. The DM-based measures generally showed lower test–retest stabilities—that is, .04 ≤ rTR ≤ .56 (see Fig. 3). All test–retest stabilities and their CIs are listed in Table 3.

Fig. 3 Test–retest stabilities (interval: seven days) and internal reliabilities of the respective performance measures in both settings. Confidence intervals below 0 are truncated

To provide benchmarks for rTR, the internal reliabilities (i.e., rLab and rHome) were subsequently estimated for each performance measure (see Table 3). For the conventional measures, the internal reliabilities fell within the ranges .33 ≤ rLab ≤ .96 in the lab and .35 ≤ rHome ≤ .94 at home. The internal reliabilities of the DM measures fell within the ranges .01 ≤ rLab ≤ .91 in the lab and .38 ≤ rHome ≤ .95 at home, with the reliabilities for the number–letter task appearing generally lower than those for the other two tasks (see Fig. 3).

Replicability of cognitive effects

Multiple one-tailed Welch tests were conducted to assess the presence of the respective cognitive effects in each task. For each combination of task and setting, the mean RTs and mean ERRs were compared between trial types (see Fig. 4). All comparisons yielded ps ≤ .001 for the task-switching effect, the response inhibition effect, and the working memory load effect, and thus met the significance criterion. The effect sizes fell within the range 0.70 ≤ d ≤ 2.72 (see Fig. 4). These findings indicate that the prominent cognitive effects in the conventional measures were successfully replicated in both the laboratory and domestic settings.

Fig. 4 Mean RTs (upper panels) and mean ERR (lower panels) according to setting (i.e., lab and home) for all tested tasks. Error bars indicate SDs. d = standardized mean change. *p < .001

To replicate previously published DM analyses of task switching, Welch tests were also used to investigate the differences between switch and repeat trials with regard to a, t0, and v in both settings. Neither the boundary separations a nor the drift rates v differed significantly between the two trial types in the lab (ps ≥ .07) or at home (ps ≥ .03). By contrast, the nondecision time t0 increased notably in switch trials across both settings (p < .001).

Discussion

The aim of this study was to gain a better understanding of data quality when mental chronometry is performed in domestic environments via the internet. To this end, we focused on three aspects: setting-related differences in the central tendencies of different performance measures (including conventional and DM-based measures), in their variability, and in their reliability.

Systematic differences in performance measures

With regard to conventional performance measures—that is, aggregated RT and ERR—our results are consistent with those from other studies replicating the cognitive effects of different internet-based cognitive tasks (e.g., Crump et al., 2013). For all three presented paradigms we were able to replicate the known effects in both investigated settings, in the lab and at home.

Regarding the DM measures, it is more difficult to compare the reported results with previously published findings. To the best of our knowledge, no DM analyses of the spatial two-back task have been published so far. For the go/no-go task, Gomez, Ratcliff, and Perea (2007) have extensively discussed how to fit the diffusion model. However, their finding of a bias z toward the go response is (in our experience) not universally accepted, because the go/no-go task is commonly regarded as a one-choice task (see R. Miller, Scherbaum, Heck, Goschke, & Enge, 2017). For the task-switching paradigm, Schmitz and Voss (2012) reported a decrease of the drift rate v and an increase of the nondecision time t0 in switch trials, whereas the boundary separation a was insensitive to trial type. Our analyses replicated the difference in t0 in both settings, whereas the difference in v did not survive the multiplicity adjustment. Similarly, the boundary separation a did not differ between the trial types.

Notably, we did not find meaningful setting differences in the conventional performance measures. This is of particular interest with respect to ERR, because error rates, unlike RTs, are presumably far less sensitive to technical variance in online response logging. Hence, increased ERR can be regarded as an indicator of environmental influences, such as distraction (Semmelmann & Weigelt, 2017). The loss of standardization and experimental control in domestic assessments often goes hand in hand with the preconception that distraction is increased and diligence is decreased (e.g., Chetverikov & Upravitelev, 2016; Gosling, Vazire, Srivastava, & John, 2004; Hilbig, 2016). Yet task completion in the domestic setting had no considerable influence on ERR. These results are consistent with the previous findings of Semmelmann and Weigelt (2017). Beyond these conventional performance measures, we also performed diffusion modeling of the internet-assessed data. Overall, the DM measures showed a pattern similar to that of the conventional measures when lab and home performance were compared. Considering the large size of the investigated per-protocol analysis set (N = 127), we can therefore conclude that any setting effects that went undetected are likely to be small (statistical power of 90% for any d > 0.32).

In sum, the general comparison of lab and home environments suggested that the relative loss of standardization and environmental control in the domestic setting did not systematically or substantially influence measures of cognitive performance.

Variability of the performance measures

One major aim of this study was to estimate changes in the variability of performance measures that might be caused by domestic data assessment. In general, we observed a slight increase in variability for most of the performance measures at home (5% larger variance, which was primarily attributable to the go/no-go task). However, several aspects need to be pointed out.

Although one might expect smaller variability in the standardized laboratory setting, some of our investigated measures in fact suggested numerically lower variance in the less standardized and controlled domestic setting. This pattern was found in seven of the 18 reported measures across all three paradigms, and pertained to both the conventional measures (i.e., aggregated RT/ERR data) and the DM measures. Because in all of these cases the CIs of the reported SD ratios included ω = 1, the lower SDs at home did not reach a practically considerable extent and most likely occurred by chance. Nevertheless, these results highlight that the variability differences due to a loss of standardization and experimental control might be smaller than is usually expected. Similar patterns of decreased performance variability in the domestic setting were also found in the analyses of additional performance measures (see the Appendix).

Only two of the 18 measures displayed a likely increase in setting-related variability, namely ERR and zr in the go/no-go task. However, in both cases the lower limit of the 95% CI was very close to ω = 1. Furthermore, both measures displayed differences that we consider relatively negligible (d ≤ .13).

Brand and Bradley’s (2012) simulation study illustrates that the increased variability in unstandardized settings can be compensated for by larger sample sizes. This idea follows directly from the logic underlying the estimation of statistical power, which requires the smallest detectable effect of interest to be scaled by its variability within the investigated sample. Thus, larger sample sizes are needed to achieve the same power level whenever variability increases. Because the required sample size scales approximately linearly with the outcome variance, the overall 5% increase in variance observed here would require roughly 5% more participants in total, and the 15% increase in SD observed for the go/no-go task (ω = 1.15) roughly 32% more, in order to detect the same performance difference between two participant groups of equal size with the same probability. Informed researchers should therefore carefully evaluate whether their study would actually benefit from the convenience of recruiting large samples online.
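This scaling can be checked quickly with base R's power.t.test, using illustrative values for the effect size and power:

```r
# How the required per-group sample size scales with outcome variability
# (illustrative values: delta = 0.5, 80% power, two-sided alpha = .05).
n_ref <- power.t.test(delta = 0.5, sd = 1,          power = .80)$n  # reference SD
n_var <- power.t.test(delta = 0.5, sd = sqrt(1.05), power = .80)$n  # 5% larger variance
n_gng <- power.t.test(delta = 0.5, sd = 1.15,       power = .80)$n  # SD ratio omega = 1.15
c(variance_plus_5pct = n_var / n_ref,   # ~1.05
  omega_1.15         = n_gng / n_ref)   # ~1.32
```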

Data reliability

First and foremost, no major differences in internal reliability were observed between the two settings, suggesting that the setting had little influence on internal reliability. In general, the internal reliabilities of the performance measures for the go/no-go and two-back tasks were in a satisfactory range (i.e., moderate to good). By contrast, the number–letter task displayed generally lower (i.e., poor) reliabilities for most of the investigated measures. These findings align well with the internal reliabilities of performance measures derived from similar tasks in laboratory assessments. Wolff et al. (2016), for example, reported internal reliabilities of inverse efficiency scores of around r = .80 (go/no-go task), r = .89 (two-back task), and r = .59 (number–letter task). Interestingly, the internal reliabilities also did not seem to differ much between the conventional and DM measures.

Second, and less surprisingly, the test–retest stabilities (time lag: seven days) were considerably lower than the internal reliabilities. Moreover, we observed a small tendency for the conventional performance measures to reach slightly higher test–retest stabilities than the DM measures (cf. Lerche & Voss, 2017). The large variability of the test–retest stabilities we report conforms to results from Willoughby and Blair (2011), who compiled similar findings for different executive-functioning tasks. They stated that conventional measures (from laboratory assessments) display reliabilities in the range .4 ≤ r ≤ .7; in the present study, only the ERR in the number–letter task showed a slightly lower reliability. Lerche and Voss (2017) found the test–retest stabilities of different DM parameters to be roughly .0 < r < .8 (based on optimization of the Kolmogorov–Smirnov statistic), which covers the range of the DM measures reported here. Similar to the internal reliabilities, the test–retest stabilities were lower in the number–letter task than in the other two tasks. This can probably be attributed to the fact that, unlike the measures from the other tasks, those from the number–letter task were calculated as difference scores, and it has repeatedly been shown that difference scores are prone to lower reliabilities (J. Miller & Ulrich, 2013; Paap & Sawi, 2016).

Limitations

When generalizing our findings, two aspects need to be pointed out. First, data were collected using Millisecond’s Inquisit4Web. This Java-based software prioritizes task presentation and thereby minimizes distraction by background applications. Thus, the reported findings will not necessarily generalize to less invasive presentation software, such as Adobe Flash or JavaScript (e.g., de Leeuw & Motz, 2016; Reimers & Stewart, 2015). Second, our study assessed only undergraduate students. Hence, the modulatory impact of broader education and age ranges will need to be clarified in further investigations.

Conclusion

The findings of this study provide new insights into the data quality of internet-based mental chronometry in different settings. We were able to show that conventional as well as DM-based performance measures are only marginally influenced by unobserved setting variables. Similarly, our findings show that setting-related differences in the variability of performance measures are probably small. Although variance increased overall by approximately 5% in self-chosen experimental environments, internet-based assessments can nevertheless be used without a net loss of statistical power, because they facilitate the recruitment of correspondingly larger samples. Finally, our data indicate that both the internal consistency and the test–retest stability of internet-assessed cognitive performance are in a satisfactory range and comparable to laboratory reliabilities. This supports the utility and precision of internet-based assessments of cognitive performance in domestic settings.

Author note

This work was supported by the German Research Foundation (DFG, Grant No. SFB 940/2).