Confidence in masked orientation judgments is informed by both evidence and visibility

Rausch, Manuel; Hellmann, Sebastian; Zehetleitner, Michael

doi:10.3758/s13414-017-1431-5

Published: 17 October 2017

Volume 80, pages 134–154, (2018)

Attention, Perception, & Psychophysics


Manuel Rausch (ORCID: orcid.org/0000-0002-5805-5544), Sebastian Hellmann & Michael Zehetleitner


Abstract

How do human observers determine their degree of belief that they are correct in a decision about a visual stimulus—that is, their confidence? According to prominent theories of confidence, the quality of stimulation should be positively related to confidence in correct decisions, and negatively to confidence in incorrect decisions. However, in a backward-masked orientation task with a varying stimulus onset asynchrony (SOA), we observed that confidence in incorrect decisions also increased with stimulus quality. Model fitting to our decision and confidence data revealed that the best explanation for the present data was the new weighted evidence-and-visibility model, according to which confidence is determined by evidence about the orientation as well as by the general visibility of the stimulus. Signal detection models, postdecisional accumulation models, two-channel models, and decision-time-based models were all unable to explain the pattern of confidence as a function of SOA and decision correctness. We suggest that the metacognitive system combines several cues related to the correctness of a decision about a visual stimulus in order to calculate decision confidence.


Introduction

Humans are usually faced with the need to make decisions about stimuli on the basis of distorted or ambiguous sensory signals. To deal with that uncertainty, various strategies exist, such as looking at the stimulus for a longer time or asking other people for their opinions. To know when these strategies are appropriate, observers need to have a sense of the probability that their decisions will be correct. The resulting subjective belief that a decision is correct is what we here refer to as confidence (Pouget, Drugowitsch, & Kepecs, 2016). To assess confidence experimentally, the standard procedure is to present observers, on each of several trials, with one of two possible stimuli. Observers are asked to decide which of the two stimuli occurred and to indicate how confident they are about the accuracy of that decision (Baranski & Petrusic, 1994; Fleming & Dolan, 2012; Kepecs & Mainen, 2012). How do observers determine their degree of belief that a decision is correct? Although numerous different theories have been proposed in recent years (Fleming & Daw, 2017; Jang, Wallsten, & Huber, 2012; Kiani, Corthell, & Shadlen, 2014; Maniscalco & Lau, 2016; Moran, Teodorescu, & Usher, 2015; Paz, Insabato, Zylberberg, Deco, & Sigman, 2016; Pleskac & Busemeyer, 2010; Sanders, Hangya, & Kepecs, 2016), there is still no consensus as to the explanatory mechanism. In the present study, we suggest a novel alternative model for confidence in decisions about visual stimuli: the weighted-evidence-and-visibility model (WEV model).

Signal detection theory (SDT) and related models

The most prominent class of models of confidence relies on signal detection theory (SDT; Green & Swets, 1966; Macmillan & Creelman, 2005; Wickens, 2002). SDT models assume the same mechanism underlying the decision, but they extend SDT in different ways to account for confidence.

The mechanism proposed by SDT for making a decision is the following (Green & Swets, 1966; Macmillan & Creelman, 2005; Wickens, 2002): First, the stimulus generates sensory data within the visual system of the observer. Not all sensory data caused by the stimulus are relevant for the decision. Those aspects of the representation that are relevant for the task are transformed into sensory evidence, a continuous internal variable that differentiates between the two stimulus alternatives. Because there is noise in the system, the sensory evidence varies across presentations of the stimuli, which is why it is described as a random sample out of one of two probability distributions, one for each stimulus. If the observer is unable to differentiate between the stimuli, the two distributions created by the two stimuli are identical. The better the observer is able to differentiate between the two possible stimuli, the greater the distance between the two distributions. It is assumed that observers, to select a response, compare their sample of evidence against a criterion. They respond “A” if the sample is greater than the criterion, and “B” otherwise.
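The decision mechanism described above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the evidence distributions are taken to be Gaussian with unit variance and means of plus or minus half the sensitivity d, the criterion is placed at zero, and the function names are hypothetical.

```python
import random


def sample_evidence(stimulus, d, rng):
    """Draw one evidence sample; stimulus is 1 for "A" and 0 for "B".

    Assumption: unit-variance Gaussian centered at +d/2 (A) or -d/2 (B).
    """
    return rng.gauss((stimulus - 0.5) * d, 1.0)


def decide(evidence, criterion=0.0):
    """Respond "A" if the sample exceeds the criterion, "B" otherwise."""
    return "A" if evidence > criterion else "B"


def accuracy(d, n_trials=20000, seed=1):
    """Proportion of correct decisions for a given distance d."""
    rng = random.Random(seed)
    n_correct = 0
    for _ in range(n_trials):
        stimulus = rng.randint(0, 1)
        response = decide(sample_evidence(stimulus, d, rng))
        expected = "A" if stimulus == 1 else "B"
        n_correct += int(response == expected)
    return n_correct / n_trials
```

With d = 0 the two distributions coincide and accuracy stays near chance; larger values of d separate the distributions and accuracy rises accordingly.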

How can SDT be extended to account for confidence, as well? The various proposals can be sorted into three categories (see Fig. 1):

(i) Decision and confidence are based on identical samples of the sensory evidence (SDT rating model).
(ii) Confidence is based on the same sample of sensory evidence as the decision, but the sample available for confidence is distorted or is overlaid with noise (noisy SDT models).
(iii) Confidence is based on a second sample of the sensory evidence (two-channel models).

According to Proposal (i), the SDT rating model, decision options and levels of confidence are considered to form an ordered set of responses, such as “I am sure it is A,” “I am guessing A,” “I am guessing B,” and “I am sure it is B.” Each adjacent pair of response options is delineated by one criterion. Both the decision and confidence are selected by the comparison of a single sample of evidence with the set of criteria. This means that observers respond, “I am guessing B” if the sample falls between that criterion separating “I am guessing B” from “I am guessing A” and that criterion separating “I am guessing B” from “I am sure it is B.”
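The mapping from a single evidence sample to one of the ordered responses can be sketched as follows. The criterion values are hypothetical and chosen only for illustration; the responses are ordered so that larger evidence favors "A".

```python
from bisect import bisect_left

# Ordered response set, from strongest evidence for B to strongest for A.
RESPONSES = [
    "I am sure it is B",
    "I am guessing B",
    "I am guessing A",
    "I am sure it is A",
]

# Hypothetical criteria: one criterion between each adjacent pair of responses.
CRITERIA = [-1.0, 0.0, 1.0]


def rating_response(evidence, criteria=CRITERIA):
    """Select decision and confidence by comparing one sample with all criteria.

    bisect_left counts the criteria that lie strictly below the sample, so a
    sample exactly on a criterion falls into the lower response category.
    """
    return RESPONSES[bisect_left(criteria, evidence)]
```

For example, a sample of -0.5 lies between the first two criteria and therefore yields "I am guessing B".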

Proposal (ii) is shared by several models that we here collectively refer to as noisy SDT models. These models assume that confidence is informed by the sample of evidence used to select the decision, but that the sample is distorted, incomplete, or overlaid with noise. The degree of confidence is then determined by comparing the distorted sample against a set of confidence criteria. The different theories categorized as noisy SDT models imply different mechanisms by which the evidence used for confidence is less informative than the evidence for the decision: First, the sample could be overlaid by unsystematic noise (Hebart, Schriever, Donner, & Haynes, 2016; Sanders et al., 2016). Second, the evidence might be reduced (Maniscalco & Lau, 2016). Finally, confidence could be based exclusively on the evidence in favor of the selected option, but not on evidence against it (Zylberberg, Barttfeld, & Sigman, 2012).

Proposal (iii) is a tenet of models that we here refer to as two-channel models. These models were motivated by the observation that observers occasionally realize that they have made an error, which is not possible within the SDT rating model or the noisy SDT model (Fleming & Daw, 2017; Yeung & Summerfield, 2012). Two-channel models are characterized by the assumption that confidence is not based on the sample of evidence used to make the decision. Instead, it is based on a second sample of evidence generated in parallel to the first. Although the term two-channel model seems to suggest that the parallel samples are generated by two separate processes (Maniscalco & Lau, 2016), it is also possible that the same process generates first the sample for the decision and afterward the sample for confidence (Moran et al., 2015). Observers are then confident to the degree that the second sample confirms the decision. Although it is sometimes implied that the evidence used for confidence is stochastically independent from that used in the decision (Rausch & Zehetleitner, 2017), other models allow for a correlation between the two samples (Fleming & Daw, 2017; Jang et al., 2012).

The interaction between the quality of the stimulus and the correctness of the decision

How can SDT models of confidence be tested? A benchmark test may be found in the interaction between the quality of stimulation and the correctness of the decision (Moran et al., 2015). Perceptual tasks can be made harder or easier by adjusting the physical features of the stimulus, for example, contrast, luminance, or presentation time. Those experimental manipulations intended to facilitate or complicate the task are what we collectively refer to here as the quality of stimulation. SDT and many models that assume that confidence is based on evidence predict the same qualitative pattern (Kellen & Klauer, 2015; Kepecs, Uchida, Zariwala, & Mainen, 2008; Sanders et al., 2016; Urai, Braun, & Donner, 2017): When the decision is correct, confidence should be positively associated with the stimulus quality. In contrast, when the decision is incorrect, the correlation between confidence and the stimulus quality should be negative. The reason for the interaction can be seen in Fig. 2: To inspire confidence, the sample of evidence must be more extreme than the criteria separating decisions without confidence from decisions with confidence. When the stimulus quality is low, and thus the two stimuli are very hard to distinguish, the two distributions of evidence overlap almost entirely (left panel). When the stimulus quality is better, the two distributions are shifted away from each other (right panel). Therefore, greater portions of the distributions extend beyond the criteria for confidence in the correct decision, implying that confidence in the correctness of a decision becomes more and more likely. However, when the distance between the two distributions is large, the portions of the distributions that exceed the confidence criteria for incorrect decisions, which are located at the other sides of the distributions, become smaller and smaller (highlighted in black). As a consequence, when the quality of stimulation increases, confidence in incorrect decisions will become less likely.
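The predicted interaction can be checked numerically with a small sketch. It assumes unit-variance Gaussian evidence centered at plus or minus d/2, a decision criterion at zero, and symmetric confidence criteria at plus or minus c. The value of c and the function name are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def p_high_confidence(d, c=1.0):
    """Probability of a high-confidence response, split by correctness.

    Assumptions: stimulus A has evidence mean +d/2, the decision criterion is
    0, and high confidence requires the sample to lie beyond +c (response A)
    or below -c (response B). By symmetry the same values hold for stimulus B.
    """
    p_correct = PHI(d / 2)    # P(x > 0 | A)
    p_error = PHI(-d / 2)     # P(x < 0 | A)
    high_given_correct = PHI(d / 2 - c) / p_correct  # P(x > c | x > 0)
    high_given_error = PHI(-d / 2 - c) / p_error     # P(x < -c | x < 0)
    return high_given_correct, high_given_error
```

Under these assumptions, increasing d raises the probability of high confidence given a correct decision and lowers it given an incorrect decision, which is the crossover pattern described above.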

Is there empirical support for such an interaction pattern? Previous studies have consistently observed the predicted positive correlation between stimulus quality and confidence in correct decisions across a variety of tasks (Kiani et al., 2014; Moran et al., 2015; Rausch & Zehetleitner, 2016; Sanders et al., 2016). However, these studies were inconsistent regarding the predicted negative correlation between stimulus quality and confidence in incorrect decisions: The correlation was negative, as predicted by SDT, in an auditory discrimination task and a general-knowledge task (Sanders et al., 2016), as well as in a discrimination task concerning the relative amounts of white and black areas in a visual texture (Moran et al., 2015). However, in contrast to the SDT prediction, coherence of motion was positively, not negatively, associated with confidence in incorrect trials in a random-dot motion discrimination task (Kiani et al., 2014). Furthermore, in a low-contrast orientation discrimination task, the average correlation between stimulus contrast and confidence in incorrect responses was close to zero with a narrow confidence interval (Rausch & Zehetleitner, 2016). Overall, the observations of both positive and negative correlations between stimulus quality and confidence in incorrect decisions pose a challenge to models of confidence derived from SDT.

The weighted evidence and visibility model

The present study proposes a new model of confidence in visual decisions, the weighted-evidence-and-visibility model (WEV). The core idea is that confidence judgments are influenced by two internal variables: the evidence as well as the visibility of the stimulus (see Fig. 3 and the Appendix for details).

Visibility is informative for confidence because stimuli always comprise several features: for instance, size, form, color, duration, and so forth. In many tasks, each stimulus falls into one of two categories, and observers need to decide which category the current stimulus belongs to. Only one or two features of the stimulus usually determine the category of the stimulus. Nevertheless, the visual system represents not only the task-relevant but also the task-irrelevant features (Marshall & Bays, 2013; Xu, 2010). The strengths of the representations of task-relevant and task-irrelevant features vary from trial to trial and may to some degree be independent of each other, as a consequence of the parallel processing of features in the visual system (Kyllingsbæk & Bundesen, 2007). To make a decision about the category of the stimulus, the representation of task-relevant features is transformed into evidence about the two stimulus categories. The representation of task-irrelevant features is not informative about the stimulus category, and therefore cannot contribute to the evidence. However, the representation of task-irrelevant features is not entirely useless, because the strength of this representation is informative about the quality of stimulation. This is particularly true for experiments in which the category of the stimulus and the quality of stimulation are both experimentally varied across trials. When many features of the stimulus are highly visible, it is reasonable to assume that the representation of the task-relevant feature will be accurate, and thus a high degree of confidence would be appropriate. Likewise, when the task-irrelevant features of the stimulus cannot be perceived, it is likely that the representation of the task-relevant feature is also poor, and thus the evidence could be misleading.

Therefore, the WEV model assumes that confidence is based not only on evidence. The second source of input is an estimate of the physical quality of the stimulus, which we refer to here as the visibility of the stimulus. For computational simplicity, the WEV model assumes that evidence and visibility both depend on the quality of stimulation but are stochastically independent when the stimulus quality is held constant. To determine the degree of confidence, the evidence and visibility of the stimulus are weighted and combined into one decision variable. The weights between evidence and visibility are expected to depend on the characteristics of the stimulation and the task: Some stimulus materials may allow observers to estimate the quality of stimulation with some precision. In this case, a strong weight on visibility would be expected. Since the visibility of the stimulus is determined by the quality of stimulation, but not by the category of the stimulus, the consequence is a positive correlation between stimulus quality and confidence in incorrect decisions. In contrast, other stimulus materials may leave observers without any cues to estimate the quality of stimulation, resulting in a strong weight on evidence. A strong weight on evidence implies a negative correlation between stimulus quality and confidence during incorrect decisions. Overall, the weighting of evidence and visibility allows the WEV model to be consistent with both positive and negative correlation patterns.

Confidence and decision time

An alternative explanation for positive correlations between stimulus quality and confidence in incorrect decisions is provided by two sequential-sampling models of decision making: the diffusion model with internal deadlines (Ratcliff, 1978) and the bounded-accumulation model (Kiani et al., 2014). These models share the assumption that confidence is informed by the elapsed time before a decision is reached. According to the internal deadlines model, observers set themselves variable deadlines for making the decision: When the decision is made before any of the deadlines has passed, the observer feels maximally confident that the decision is correct. The more deadlines are missed, the less confident is the observer that the decision is correct (Ratcliff, 1978). Can internal deadlines explain positive correlations between stimulus quality and confidence in incorrect decisions? Because the model assumes that confidence is informed only by decision time, when increasing the stimulus quality speeds up the decision time for incorrect decisions, increasing the stimulus quality should also make observers more confident in incorrect decisions. In contrast, when increasing the stimulus quality slows down the decision time for incorrect decisions, increasing the stimulus quality should decrease confidence in incorrect decisions. As a consequence, the internal-deadlines model can be tested by comparing the correlation between stimulus quality and confidence with the correlation between reaction time and confidence. If the model was correct, the correlations should be of different signs.
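The core of the internal-deadlines account, confidence as a count of deadlines not yet missed, can be sketched as follows. The deadline values are hypothetical and held fixed here for simplicity, whereas the model described above lets them vary.

```python
def deadline_confidence(decision_time, deadlines):
    """Confidence level implied by the number of missed internal deadlines.

    Returns len(deadlines) when no deadline has passed (maximal confidence)
    and 0 when every deadline has been missed.
    """
    missed = sum(1 for deadline in deadlines if decision_time > deadline)
    return len(deadlines) - missed


# Hypothetical deadlines in seconds, for illustration only.
EXAMPLE_DEADLINES = [0.5, 1.0, 1.5]
```

Because confidence in this sketch depends on nothing but the decision time, any manipulation that speeds up incorrect decisions necessarily raises confidence in them.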

The bounded-accumulation model implies a more complex relationship between decision times and confidence. There, it is argued that sensory evidence is accumulated within two processes. The first process to hit a decision boundary determines which of the two stimulus categories will be selected. Over time, observers learn to associate decision times and states of the losing accumulator with the probability of being correct. When observers make a confidence judgment, they compare the decision time and the state of the losing accumulator with the distributions of decision times and accumulator states of correct and incorrect decisions they have learned over time. When the current decision time and accumulator state are more likely to stem from the distributions known from correct trials than from those of incorrect trials, observers are confident that the decision is correct (Kiani et al., 2014; van den Berg, Anandalingam, et al., 2016; van den Berg, Zylberberg, Kiani, Shadlen, & Wolpert, 2016).

Can the bounded-accumulation model explain positive correlations between stimulus quality and confidence in incorrect decisions? The state of the losing accumulator alone cannot account for these positive correlations because, as it accumulates evidence, it implies a negative correlation between stimulus quality and confidence in incorrect decisions, just as SDT does. However, the positive correlations between stimulus quality and confidence in incorrect decisions can be explained by observers learning the distribution of decision times. There are two possibilities. First, observers might learn that correct decisions are faster than incorrect decisions. In this case, if increasing the stimulus quality speeds up decision times for incorrect decisions, confidence in these incorrect decisions should be increased, as well. Second, observers could learn that correct decisions are slower than incorrect decisions. In this case, confidence in incorrect decisions should increase with stimulus quality only if increasing the stimulus quality slows down the decision times for incorrect decisions.

To summarize, models of confidence based on decision times are in principle consistent with positive correlations between stimulus quality and confidence in incorrect decisions. However, positive correlations between stimulus quality and confidence imply specific patterns of decision times, which can be tested.

Rationale of the present study

The present study was designed to test whether the WEV model provides a better account of confidence in visual decisions than previously established models of confidence. For this purpose, we presented participants with horizontal and vertical gratings. In Experiment 1, the participants reported both the orientation of the stimuli and their confidence in being correct with just one single response. In Experiment 2, confidence judgments were made after the orientation judgment. To manipulate the quality of stimulation, we varied the stimulus onset asynchrony (SOA) between the grating and a backward mask. The backward mask was intended to interfere with both the representation of the task-relevant stimulus feature—that is, the orientation—and the representation of task-irrelevant features, for example, the shape.

In both experiments, the associations between SOA and confidence and between SOA and response times were assessed separately for correct and incorrect decisions. Moreover, we formally assessed the goodness of fit of seven models fitted separately to the orientation judgments and confidence reports of each single participant. The seven models included the WEV model, the SDT model, the noisy SDT model, the postdecisional accumulation model, and the two-channel model. In addition, two variants of the WEV model and the SDT model were used: one model in which the variability of evidence increased as a function of stimulus quality, and one in which the variability remained constant. These two models with SOA-dependent variance of the evidence were included because pilot studies had indicated that allowing variance to increase with SOA improved the model fit. Although previous studies had determined the parameters of the postdecisional accumulation models based on judgments, confidence, and response times (Moran et al., 2015; Pleskac & Busemeyer, 2010), in the present study we did not model response times, because the WEV and SDT models do not make predictions about response times, and model comparisons need to be made on the same data.

We hypothesized that if the WEV model provides a better account of confidence in masked orientation decisions than any of the other five competing models, the two variants of the WEV model should result in better goodness-of-fit measures than any of the competing models. If confidence was informed only by evidence, as is argued by SDT and related models, the correlation between SOA and confidence in incorrect trials should be negative. A positive correlation of SOA and confidence in incorrect trials would be consistent with the WEV model. Finally, if the correlations between SOA and confidence in incorrect trials can be explained by confidence being informed by decision times, the correlations between SOA and confidence should be consistent with the correlation between SOA and response time.

Experiment 1

The full experimental program, the analysis code, and the full data are openly available at the Open Science Framework website (https://osf.io/ty4h8), to facilitate reproduction of the present study and its results (Ince, Hatton, & Graham-Cumming, 2012; Morin et al., 2012). In addition, the hypotheses and analyses, including the participant exclusion criteria, were recorded at the same website (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). At the time of this registration, the data had already been collected but had not yet been analyzed.

Material and methods

Participants

A total of 20 participants (three male, 17 female) took part in the experiment. The age of the participants ranged between 18 and 40 years (M = 23.3). All participants reported normal or corrected-to-normal vision, no history of neuropsychological or psychiatric disorders, and not being on psychoactive medications. All participants gave written informed consent and received either course credits or €8 per hour for participation.

Apparatus and stimuli

The experiment was performed in a darkened room. The stimuli were presented on a Display++ LCD monitor (Cambridge Research Systems, UK) with a screen diagonal of 81.3 cm, set at a resolution of 1,920 × 1,080 pixels and a refresh rate of 120 Hz. The viewing distance, not enforced by restraints, was approximately 60 cm. The experiment was conducted using PsychoPy version 1.83.04 (Peirce, 2007, 2008) on a Fujitsu ESPRIMO P756/E90+ desktop computer with Windows 8.1. The target stimulus was a square (size 3° × 3°), textured with a sinusoidal grating with one cycle per degree of visual angle (maximal luminance: 64 cd/m²; minimal luminance: 21 cd/m²). The mask consisted of a square (4° × 4°) with a black (0 cd/m²) and white (88 cd/m²) checkered pattern of five columns and five rows. All stimuli were presented at fixation against a gray (44 cd/m²) background. The orientation of the grating varied randomly between horizontal and vertical. The participants simultaneously reported the orientation of the target and their confidence in being correct using a Cyborg V1 joystick (Cyborg Gaming, UK). Confidence was recorded using a continuous scale because continuous scales provide a maximum amount of information per single trial (Rausch & Zehetleitner, 2014).

Trial structure

The time course of one trial is shown in Fig. 4. Each trial began with the presentation of a fixation cross for 1,000 ms. Then the target stimulus was shown for a short period of time until it was replaced by the mask. There were five possible SOAs, or time periods between target and mask onset: 8.3, 16.7, 33.3, 66.7, and 133.3 ms. The mask was always presented for 500 ms. When it disappeared, two visual analog scales were displayed on the screen. The upper scale represented the response that the orientation of the target was vertical, and the bottom one represented the response that the orientation of the target was horizontal. Observers reported the orientation by moving the joystick forward or back, which moved an index from the lower to the upper scale or vice versa. The ends of both scales were labeled unsure and sure. Observers reported their confidence by moving the joystick toward the left or the right, thus moving the index on the scales. To avoid any bias by the initial position of the index, the index appeared only when observers had started moving the joystick. The orientation response and degree of confidence were recorded when observers pulled the trigger of the joystick. Finally, if the orientation response was wrong, the trial ended by presentation of the word error for 1,000 ms.

Design and procedure

Participants were instructed to report the orientation of the grating and their confidence in being correct on the orientation judgment as accurately as possible without time pressure. The experiment consisted of one training block and nine experimental blocks of 60 trials each. Each SOA occurred 12 times in each block, in random order. The orientation of the target stimulus varied randomly across trials. After each block, the percentage of errors was displayed in order to provide participants with feedback about their accuracy.
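The block structure described above (five SOAs, 12 repetitions each, 60 trials in random order) can be sketched as follows. The frame counts are an assumption: they are the whole numbers of frames at a 120-Hz refresh rate that reproduce the reported SOAs, and the source does not state how the SOAs were implemented. The function and field names are hypothetical and do not reproduce the published experimental program.

```python
import random

REFRESH_HZ = 120
SOA_FRAMES = [1, 2, 4, 8, 16]  # assumption: SOAs as whole frames at 120 Hz
REPETITIONS_PER_BLOCK = 12


def soa_ms(frames):
    """Convert a number of monitor frames into milliseconds."""
    return frames * 1000.0 / REFRESH_HZ


def make_block(rng):
    """Build one block: every SOA 12 times, random orientation, random order."""
    trials = [
        {
            "soa_frames": frames,
            "soa_ms": soa_ms(frames),
            "orientation": rng.choice(["horizontal", "vertical"]),
        }
        for frames in SOA_FRAMES
        for _ in range(REPETITIONS_PER_BLOCK)
    ]
    rng.shuffle(trials)
    return trials
```

Rounded to one decimal place, the five frame counts give 8.3, 16.7, 33.3, 66.7, and 133.3 ms.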

Analysis

All analyses were conducted using the free software R (R Development Core Team, 2014).

Correlation analysis

To assess the relationships of confidence with both SOA and response times, we calculated the correlations between confidence and the other variables separately for each participant and for correct and incorrect decisions. Correlations were measured by Goodman and Kruskal’s gamma coefficient (Γ), because Γ does not make any scaling assumptions beyond the ordinal level and can attain its maximum value irrespective of ties (Nelson, 1984).
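For readers unfamiliar with the coefficient, Γ is the difference between the numbers of concordant and discordant pairs divided by their sum, with tied pairs ignored. The following is a minimal sketch of that textbook definition, not the analysis code of the present study.

```python
def goodman_kruskal_gamma(x, y):
    """Goodman and Kruskal's gamma for two equally long sequences.

    Pairs tied on either variable are ignored, which is why gamma can reach
    +1 or -1 even when the data contain ties.
    """
    concordant = 0
    discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    if concordant + discordant == 0:
        return float("nan")  # undefined when every pair is tied
    return (concordant - discordant) / (concordant + discordant)
```

In the present design, x would hold the SOA (or the response time) and y the confidence rating of the trials of one participant, computed separately for correct and incorrect decisions.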

Modeling analysis

We fitted seven different models derived from SDT to the joint distributions of orientation responses and confidence data:

(i) the SDT rating model
(ii) the noisy SDT model
(iii) the SDT model with SOA-dependent variance of evidence
(iv) the WEV model
(v) the WEV model with SOA-dependent variance of evidence
(vi) the two-channel model
(vii) the postdecisional accumulation model

In all seven models, we assumed that each presentation of the stimulus created a sample of evidence $x$ drawn from a Gaussian distribution. The location of the distribution was determined by the stimulus category $S \in \{0, 1\}$, as well as the sensitivity parameter $d$ specific to each SOA. Thus, each model involved five different sensitivity parameters $d_1$ to $d_5$, one for each SOA. The mean of the distribution was calculated as $\left(S-\frac{1}{2}\right)\times d_i$: Thus, when $S = 1$, the distribution is centered at $d_i/2$, and when $S = 0$, the center is at $-d_i/2$. In models (iii) and (v), the variance $s^2$ of the distributions of $x$ increased as a function of $d$: $s^2 = 1 + k \times d_i^2$, where $k$ is a free parameter quantifying the slope of the increase of variance with $d$. These two models were included in the analysis because an analysis of pilot experiments revealed that allowing the variance of evidence to increase as a function of SOA improved the fits of the SDT and WEV models. In all other models, the variance $s^2$ was set to 1. In all models, participants’ responses $R$ were assumed to be 0 when $x$ was lower than the primary task criterion $\theta$, and 1 otherwise.
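Given these definitions, the probability of the response R = 1 follows directly from the Gaussian distribution of x. The sketch below computes it for one stimulus category and one SOA. The function name and the example parameter values are hypothetical.

```python
from statistics import NormalDist


def p_response_one(stimulus, d_i, theta=0.0, k=0.0):
    """P(R = 1 | S, SOA): probability that x is not lower than theta.

    The mean is (S - 1/2) * d_i and the variance is 1 + k * d_i**2.
    Setting k = 0 gives the models with constant unit variance.
    """
    mean = (stimulus - 0.5) * d_i
    sd = (1.0 + k * d_i ** 2) ** 0.5
    return 1.0 - NormalDist(mean, sd).cdf(theta)
```

With d_i = 2, theta = 0, and k = 0, the function returns about .84 for S = 1 and about .16 for S = 0. A positive k widens the distribution and moves both values toward .5.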

The degree of confidence was determined by comparing the decision variable $y$ with a set of confidence criteria $c$. Each confidence criterion delineated between two adjacent confidence categories. For example, participants would select Confidence Category 2 if $y$ fell between $c_1$ (which separated Categories 1 and 2) and $c_2$ (which separated Categories 2 and 3). To be consistent with the standard SDT rating model (Green & Swets, 1966), two separate sets of confidence criteria were assumed, one for each response option. The different models were characterized by the way that $y$ was determined:

According to the SDT rating model and the SDT model with an SOA-dependent variance of evidence, $y$ was identical to $x$.
According to the noisy SDT model, $y$ was sampled from a Gaussian distribution with a mean of $x$ and with the standard deviation $\sigma$, which is an additional free parameter.
According to the WEV model and the WEV model with SOA-dependent variance of evidence, $y$ was again sampled from a Gaussian distribution with the standard deviation $\sigma$. The mean of the distribution from which $y$ was drawn was calculated as $(1 - w) \times x + w \times (2R - 1) \times (d_i - \mathrm{mean}(d))$. The parameter $w$ described the degree to which participants relied on evidence or on visibility when they reported their degree of confidence. When $w = 0$, the mean of $y$ reduced to $x$, meaning that the model was then identical to the noisy SDT model. The closer $w$ was to 1, the more $y$ depended on the term $(2R - 1) \times (d_i - \mathrm{mean}(d))$. The term $d_i - \mathrm{mean}(d)$ ensured that $y$ depended on the quality of stimulation, as indexed by $d_i$, independent of $x$. The term $2R - 1$ ensured that highly visible stimuli tended to shift $y$ in such a way that high-confidence responses were more likely, and likewise, low-visibility stimuli shifted $y$ in such a way that the probability of low-confidence responses increased.
According to the two-channel model, $y$ was stochastically independent of $x$. The mean of the distribution of $y$ was given by $a \times d_i$, where $a$ is a free parameter that expresses the fraction of signal available to the second channel as compared to the signal available to the first channel. The variance of $y$ was set at 1.
According to the postdecisional accumulation model, $y$ was again sampled from a Gaussian distribution. The mean of the distribution was given by $x + (2S - 1) \times d_i \times b$, where $b$ indicated the amount of postdecisional accumulation, and the term $2S - 1$ ensured that postdecisional accumulation tended to decrease $y$ when $S = 0$, and to increase $y$ when $S = 1$. The variance of the distribution of $y$ was $b^2$.
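To make the differences between these definitions concrete, the sketch below draws the confidence variable y for four of the models. It is an illustration rather than the fitting code of the study: the function names and default parameter values are hypothetical, and the two-channel model is omitted for brevity.

```python
import random


def wev_mean(x, R, d_i, d_mean, w):
    """Mean of y in the WEV model: weighted evidence plus weighted visibility."""
    return (1 - w) * x + w * (2 * R - 1) * (d_i - d_mean)


def postdecisional_mean(x, S, d_i, b):
    """Mean of y in the postdecisional accumulation model."""
    return x + (2 * S - 1) * d_i * b


def sample_y(model, x, S, R, d_i, d_mean, rng, sigma=1.0, w=0.5, b=0.5):
    """Draw the confidence variable y for one trial under the named model."""
    if model == "sdt":
        return x  # confidence uses the decision sample itself
    if model == "noisy_sdt":
        return rng.gauss(x, sigma)
    if model == "wev":
        return rng.gauss(wev_mean(x, R, d_i, d_mean, w), sigma)
    if model == "postdecisional":
        # variance b**2 corresponds to a standard deviation of |b|
        return rng.gauss(postdecisional_mean(x, S, d_i, b), abs(b))
    raise ValueError("unknown model: " + model)
```

With w = 0 the WEV branch reproduces the noisy SDT branch. With w = 1 the mean of y depends only on the response and on how far the current sensitivity lies above or below the average sensitivity.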

Model fitting was performed separately for each single participant. The fitting procedure involved the following computational steps. First, the continuous confidence ratings were discretized by dividing the continuous scale into five equal partitions. Second, the frequency of each confidence category was calculated for both orientations of the stimulus and the orientation response. Third, for each model, the set of parameters was determined that minimized the negative log-likelihood of the data (Dorfman & Alf, 1969). The formulae for calculating the probability of an orientation response in conjunction with a specific degree of confidence given the stimulus and the set of parameters are described in the Appendix. Minimization of the negative log-likelihood was performed using a general SIMPLEX minimization routine (Nelder & Mead, 1965).
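The first and third steps can be sketched as follows, assuming that the continuous rating has been rescaled to the interval from 0 to 1 and that the model supplies one predicted probability per response cell. Both function names are hypothetical, and the constant multinomial coefficient is dropped because it does not depend on the parameters.

```python
import math


def discretize(rating, n_bins=5):
    """Map a rating in [0, 1] onto categories 1..n_bins of equal width."""
    return min(int(rating * n_bins), n_bins - 1) + 1


def negative_log_likelihood(counts, probabilities):
    """Negative log-likelihood of observed cell counts under a model.

    counts and probabilities are aligned sequences over all combinations of
    orientation response and confidence category for one stimulus and SOA.
    Cells with zero observations contribute nothing to the sum.
    """
    return -sum(
        n * math.log(p) for n, p in zip(counts, probabilities) if n > 0
    )
```

A function of this kind, summed over stimuli and SOAs, can be handed to any general-purpose simplex routine. In Python, one option would be scipy.optimize.minimize with method="Nelder-Mead".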

The relative goodness of fit of the seven candidate models was assessed using the Bayes information criterion (Schwarz, 1978) and the AIC_c (Burnham & Anderson, 2002), a variant of the Akaike information criterion (Akaike, 1974).
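Both criteria can be computed directly from the minimized negative log-likelihood. The sketch below uses the standard textbook definitions; taking the number of observations n to be the number of trials of a participant is an assumption, and the function names are hypothetical.

```python
import math


def bic(nll, k, n):
    """Bayes information criterion: 2 * NLL + k * ln(n).

    nll: minimized negative log-likelihood, k: number of free parameters,
    n: number of observations (assumed here to be the number of trials).
    """
    return 2.0 * nll + k * math.log(n)


def aic_c(nll, k, n):
    """AIC with small-sample correction: AIC + 2k(k + 1) / (n - k - 1)."""
    aic = 2.0 * nll + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)
```

For both criteria, smaller values indicate a better trade-off between fit and number of parameters; BIC penalizes additional parameters more heavily than AIC_c once ln(n) exceeds roughly 2.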

Group-level statistics

All statistical analysis at the group level was based on Bayesian statistics (Dienes, 2011; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wetzels et al., 2011) using the R library BayesFactor (Morey & Rouder, 2014). To test whether the mean Γ correlation coefficients were different from zero and to compare the BICs and AIC_cs between two models, we used the Bayesian equivalents of t tests. Recommended as a standard prior in psychology, a Cauchy distribution with a scale parameter of 1 was assumed as the prior distribution for standardized effect sizes (Rouder et al., 2009). In addition, Markov chain Monte Carlo resampling was used to determine the 95% credible intervals of the posterior distribution of the mean Γ as well as the mean differences for BIC and AIC_c. In addition, the Bayesian equivalent of an analysis of variance was performed with confidence as the dependent variable and with SOA and trial correctness as factors, testing all models that can be created by removing or leaving in a main effect or interaction term from the full model. As priors, we used a default g prior with a scale parameter of √2/2 to maintain consistency with the prior of the Bayesian t tests (see Rouder & Morey, 2012, for the model details).

Results

Two participants were excluded from the analysis because their error rates were not below chance level, BF₁₀s ≤ .20. For the remaining 18 participants, the error rates were at chance at the SOA of 8.3 ms (M = 50.3%, SD = 6.5) and dropped to M = 2.6% (SD = 6.7) at the maximum SOA of 133.3 ms (see Fig. 5, left panel). Confidence judgments averaged 11.5% (SD = 11.7) of the width of the visual analog scale at an SOA of 8.3 ms and increased to a mean of 90.2% (SD = 14.0) at an SOA of 133.3 ms.

Correlation between SOA and confidence

As can be seen from the central panel of Fig. 5, confidence increased as a function of SOA in correct as well as in incorrect trials, although to a lesser extent in the latter case. The Bayesian equivalent of an analysis of variance revealed effects of SOA, BF₁₀ = 2.1 · 10⁴⁵, and trial correctness on confidence, BF₁₀ = 2.3 · 10³, as well as an interaction between the two, BF₁₀ = 31.4. The mean correlation coefficient between confidence and SOA was large for correct trials, M = .80, SD = .08, and medium for incorrect trials, M = .38, SD = .19. A Bayesian analysis indicated that the mean correlation coefficients were different from zero for correct trials, posterior of the mean 95% CI = [.76, .84], BF₁₀ = 3.57 · 10¹⁵, as well as for incorrect trials, posterior of the mean 95% CI = [.27, .46], BF₁₀ = 8.77 · 10⁴.
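For readers unfamiliar with Γ, the following sketch shows how such a coefficient can be computed for one participant. It assumes that Γ denotes the Goodman and Kruskal gamma rank correlation, which is not restated in this section; the function name is hypothetical.

```python
def goodman_kruskal_gamma(xs, ys):
    """Rank correlation based on concordant and discordant pairs; tied pairs are ignored."""
    concordant = 0
    discordant = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            # The sign of the product tells whether both variables order the pair the same way.
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    if concordant + discordant == 0:
        return float("nan")  # undefined when every pair is tied
    return (concordant - discordant) / (concordant + discordant)
```

Applied separately to the correct and the incorrect trials of a participant, with SOA as `xs` and confidence as `ys`, this yields one coefficient per participant and trial type, which can then be averaged across participants.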

Correlation between SOA and response times

As is shown in the right panel of Fig. 5, correct responses were faster than incorrect responses, posterior of the mean RT difference 95% CI = [54.9, 262.2], BF₁₀ = 11.3. Correct responses also appeared to become faster with increasing SOA. However, the mean correlation was only weakly negative, M = – .11, SD = .26, and there was insufficient evidence to ascertain whether the mean correlation was different from 0, posterior of the mean 95% CI = [– .22, .02], BF₁₀ = .74. Concerning incorrect responses, the right panel of Fig. 5 suggests that response times remained constant for shorter SOAs between 8.3 and 33.0 ms, and sharply increased for the longer SOAs. However, the mean correlation was negligibly small, M = .05, SD = .14. The reason is that the number of errors at longer SOAs was low, which is why errors at shorter SOAs contributed more strongly to the correlation coefficients. Consequently, there was insufficient evidence to ascertain whether the mean correlation was zero, posterior of the mean 95% CI = [– .02, .11], BF₁₀ = .49.

Modeling results

As is suggested by the upper left and upper central panels of Fig. 6, the only two models that could reproduce the increase of confidence with SOA in incorrect trials were the WEV model and the WEV model with SOA-dependent variance of the evidence. All the other models predicted a decrease of confidence in incorrect trials with SOA, but no such decrease was observed.

Figure 7 shows that the two flavors of the WEV model could reproduce the correlation between SOA and confidence at the level of each single participant. The other models systematically failed to account for confidence in incorrect decisions.

The two WEV models also fitted the data best in terms of BIC and AIC_c: The best model overall was the WEV model with SOA-dependent variance of the evidence, BIC: M = 1,386, AIC_c: M = 1,314, followed by the WEV model with constant variance, BIC: M = 1,388, AIC_c: M = 1,320. The third-best model was the SDT model with SOA-dependent variance of evidence, but its fit to the data was worse according to both metrics (BIC: M = 1,425; AIC_c: M = 1,362). All the other models performed even worse (all BICs: M ≥ 1,506; all AIC_cs: M ≥ 1,447).

A series of Bayes factors was used to compare the AIC_c and BIC of the WEV model with SOA-dependent variance of evidence against those of all six of the other models. The Bayes factors revealed some evidence against a difference in model fits between the WEV model with SOA-dependent variance and the WEV model with constant variance:

BIC: posterior of the mean ΔBIC: 95% CI = [– 14.1, 11.4], BF₁₀ = 0.18
AIC_c: posterior of the mean ΔAIC_c: 95% CI = [– 18.1, 7.6], BF₁₀ = 0.25.

However, the Bayes factors indicated strongly that the WEV model with SOA-dependent variance fitted the data better than each of the other five models:

SDT model with SOA-dependent variance of evidence:
- BIC: posterior of the mean ΔBIC: 95% CI = [– 55.1, – 18.9], BF₁₀ = 109.2
- AIC_c: posterior of the mean ΔAIC_c: 95% CI = [– 63.6, – 27.1], BF₁₀ = 695.8
SDT rating model:
- BIC: posterior of the mean ΔBIC: 95% CI = [– 162.9, – 67.6], BF₁₀ = 546.9
- AIC_c: posterior of the mean ΔAIC_c: 95% CI = [– 175.3, – 79.3], BF₁₀ = 1.5 · 10³
Postdecisional accumulation model:
- BIC: posterior of the mean ΔBIC: 95% CI = [– 167.9, – 71.8], BF₁₀ = 801.4
- AIC_c: posterior of the mean ΔAIC_c: 95% CI = [– 176.8, – 79.9], BF₁₀ = 1.6 · 10³
Noisy SDT model:
- BIC: posterior of the mean ΔBIC: 95% CI = [– 168.8, – 73.0], BF₁₀ = 897.6
- AIC_c: posterior of the mean ΔAIC_c: 95% CI = [– 177.5, – 81.2], BF₁₀ = 1.8 · 10³
Two-channel model:
- BIC: posterior of the mean ΔBIC: 95% CI = [– 177.4, – 80.3], BF₁₀ = 1.5 · 10³
- AIC_c: posterior of the mean ΔAIC_c: 95% CI = [– 184.7, – 88.8], BF₁₀ = 3.0 · 10³.

In view of these results, we performed an additional simulation to assess whether the present results can be explained by a failure of model identification. For this purpose, we simulated data based on the SDT rating model and the SDT model with SOA-dependent variance of evidence (i.e., the two models that received the most empirical support without assuming an influence of visibility weighting). We first created 1,000 bootstrap samples from the parameter sets obtained by fitting the empirical data. Then, for each sampled parameter set, we randomly created a data set with the same number of trials as the empirical data. Using each simulated data set, we fitted both the WEV model with SOA-dependent variance and the correct generative model (that is, the SDT model or the SDT model with SOA-dependent variance, respectively). Comparing model fits between the WEV model with SOA-dependent variance and the SDT model based on data conforming to the SDT model revealed not a single BIC difference erroneously in favor of the WEV model with SOA-dependent variance, f = 0.0%, and only a few misleading AIC_c differences, f = 3.6%. Likewise, comparing the WEV model with SOA-dependent variance and the SDT model with SOA-dependent variance using data in line with the latter model suggested hardly any BIC differences incorrectly favoring the WEV model with SOA-dependent variance, f = 0.3%, and a moderate number of misleading AIC_c differences, f = 7.7%.
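The bookkeeping of such a model recovery check can be sketched as follows. The simulation and fitting steps themselves are model specific and omitted; the function names are hypothetical, and counting a comparison as misleading whenever the WEV model obtains a strictly lower criterion value than the generative model is an assumption about how f was determined.

```python
import random


def bootstrap_parameter_sets(fitted_sets, n_samples, rng):
    """Resample fitted parameter sets with replacement (one set per simulated data set)."""
    return [rng.choice(fitted_sets) for _ in range(n_samples)]


def misleading_fraction(criterion_wev, criterion_generative):
    """Fraction of simulated data sets in which the WEV model wins by mistake.

    Both arguments hold one BIC (or AIC_c) value per simulated data set.
    The data were generated by the other model, so a lower value for the
    WEV model counts as a misleading difference.
    """
    pairs = list(zip(criterion_wev, criterion_generative))
    wrong = sum(1 for wev, gen in pairs if wev < gen)
    return wrong / len(pairs)
```

For example, `misleading_fraction([10, 12, 9, 15], [11, 11, 10, 14])` counts two of four comparisons as misleading and therefore returns 0.5.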

Discussion

The present experiment suggested that the WEV model with SOA-dependent variance of evidence provides a better account of confidence in masked orientation decisions than the SDT model, the noisy SDT model, the postdecisional accumulation model, and the two-channel model. Moreover, longer SOAs increased confidence in both correct and incorrect responses, an observation consistent with the WEV model but not in accordance with the SDT rating model and the many other confidence models. Concerning response times, there was insufficient evidence to determine whether SOA and response time were correlated in correct or incorrect trials.

Can the positive correlation between SOA and confidence in incorrect decisions be explained by the WEV model alone, or are models based on decision times able to provide an alternative explanation? Both the internal-deadline model and the bounded accumulation model could account for the confidence data only if there were a negative correlation between SOA and response times. The reason is that according to the internal-deadline model, when observers are more confident, this means that the decision process was faster, so fewer internal deadlines were missed. According to the bounded accumulation model, because correct responses were faster than incorrect responses, observers would associate fast responses with a greater probability of a correct response, meaning that confidence in incorrect decisions would increase with SOA only if response times decreased with SOA. Unfortunately, the data were inconclusive as to whether the correlation of RT and SOA was absent, although the credible interval indicated that if there were a negative correlation, its size would be close to zero. Nevertheless, since there was still the possibility of a correlation between SOA and response times, the deadline diffusion model and the bounded accumulation model could not be ruled out as explanations of the positive correlation between SOA and confidence in incorrect decisions in this experiment. There are several possible reasons why the correlation between SOA and response times might have remained undetected. First, it would be expected that longer SOAs would speed up correct responses. However, once again there was insufficient evidence for such a correlation. This might indicate that the assessment of response times is not very precise. An obstacle to assessing response times with precision was that observers had to make two decisions at the same time, one about the orientation of the stimulus and one about their degree of confidence.
As a consequence, the response times might reflect not only the time needed to make a decision about the orientation, but also the time required to select a degree of confidence. The decision time related to confidence might have obscured the correlation between SOA and decision time related to the decision about the stimulus (Kiani et al., 2014). For this reason, we conducted a second experiment, instructing observers to report their degree of confidence only after the orientation response.

Is it possible that the simultaneous measurement of confidence and orientation responses interfered with the reported degree of confidence as well, not only with its timing? In line with this hypothesis, the other study that observed a positive correlation between confidence in incorrect decisions and stimulus quality had also assessed confidence and task response with one single response (Kiani et al., 2014). The studies that found a negative correlation had all assessed confidence only after the task response (Moran et al., 2015; Sanders et al., 2016). These previous studies offered two explanations for why measuring confidence simultaneously with or subsequent to the decision might result in different correlations: One possibility is that assessing confidence only after the response might allow observers to accumulate additional sensory evidence (Kiani et al., 2014). Because sensory evidence implies negative correlations between stimulus quality and confidence in incorrect decisions, assessing confidence after the decision might miss a positive correlation between confidence and task accuracy at the time of the decision. A second possibility is that asking observers to report the orientation and their confidence at the same time might induce a decision strategy based no longer on evidence, but on heuristics (Aitchison, Bang, Bahrami, & Latham, 2015). Again, a second experiment seemed necessary to investigate whether the results of Experiment 1 would generalize to an experiment in which observers reported their degree of confidence only after the orientation response.

Experiment 2

The materials, methods, analysis, preregistration, and availability of all materials of Experiment 2 were the same as for Experiment 1, except for the differences outlined below.