Cognition

Volume 131, Issue 1, April 2014, Pages 75-91

Traditional difference-score analyses of reasoning are flawed

https://doi.org/10.1016/j.cognition.2013.12.003

Highlights

  • Simulations showed that traditional analyses of reasoning may be incorrect.

  • Two experiments studied how instructions affect the belief bias effect.

  • Signal detection analyses better matched the data than traditional analyses.

  • Signal detection analyses showed that belief bias is mainly a response bias.

  • Past work on conditional reasoning with traditional analyses is also at risk.

Abstract

Studies of the belief bias effect in syllogistic reasoning have relied on three traditional difference score measures: the logic index, belief index, and interaction index. Dube, Rotello, and Heit (2010, 2011) argued that the interaction index incorrectly assumes a linear receiver operating characteristic (ROC). Here, all three measures are addressed. Simulations indicated that traditional analyses of reasoning experiments are likely to lead to incorrect conclusions. Two new experiments examined the role of instructional manipulations on the belief bias effect. The form of the ROCs violated assumptions of traditional measures. In comparison, signal detection theory (SDT) model-based analyses were a better match for the form of the ROCs, and implied that belief bias and instructional manipulations are predominantly response bias effects. Finally, reanalyses of previous studies of conditional reasoning also showed non-linear ROCs, violating assumptions of traditional analyses. Overall, reasoning research using traditional measures is at risk of drawing incorrect conclusions.

Introduction

One of the central research issues in cognition is how prior beliefs are put together with new observations. For example, this issue arises in perception (e.g., Schyns & Oliva, 1999), memory (e.g., Bartlett, 1932), comprehension (e.g., Bransford & Johnson, 1972), categorization (e.g., Heit & Bott, 2000), social cognition (e.g., Sherman et al., 2008) and contingency judgment by humans as well as animals (e.g., Alloy & Tabachnik, 1984). Here our focus is reasoning. Broadly speaking, when reasoning is uncertain, it is normative to take account of prior beliefs, indeed any knowledge, in an effort to improve inferences (Skyrms, 2000; see also Heit, Hahn, & Feeney, 2005). However, when the task is to reason according to standard rules of logic, it is normative to focus on the form of an argument only, and not how it connects with other knowledge. For example, in typical studies of syllogistic reasoning, participants are explicitly instructed to focus on whether the conclusion logically follows from the premises. Researchers can then measure how prior beliefs, despite instructions, influence reasoning (e.g., Evans et al., 1983, Oakhill and Johnson-Laird, 1985).

One result of this research strategy is the belief bias effect, which is the tendency for conclusions of syllogisms to be accepted when they are consistent with prior beliefs, regardless of their validity. For example, Evans et al. (1983) found that syllogisms with invalid, but believable conclusions, like

No addictive things are inexpensive.
Some cigarettes are inexpensive.
Therefore, some addictive things are not cigarettes.

were judged to be “valid” 71% of the time. In contrast, structurally identical invalid problems with unbelievable conclusions, such as

No cigarettes are inexpensive.
Some addictive things are inexpensive.
Therefore, some cigarettes are not addictive.

were accepted only 10% of the time. Evans et al. also observed a smaller discrepancy in the acceptance rates for logically valid problems with believable and unbelievable conclusions (89% and 56%, respectively). The different sizes of the belief effect for valid and invalid problems resulted in a statistically reliable interaction between the validity of the conclusion and its believability. The three basic effects—higher acceptance rates for valid than invalid conclusions, higher acceptance rates for believable than unbelievable conclusions, and an interaction between validity and believability—have been studied extensively. Evans et al. referred to these three key measures as the logic index, the belief index, and the interaction index.

Most researchers have measured belief bias effects using a 2 × 2 ANOVA on the raw scores. A convincing belief bias effect is observed whenever both main effects and the interaction are statistically significant. As reviewed by Dube, Rotello, and Heit (2010), this work has served as the empirical basis for three decades of research on reasoning. Historically, the interaction effect has been taken to support important theories of reasoning (e.g., dual-process theory, Evans & Curtis-Holmes, 2005; mental models theory, Oakhill & Johnson-Laird, 1985; see also Evans et al., 1983, Klauer et al., 2000, Newstead et al., 1992, Polk and Newell, 1995, Quayle and Ball, 2000). Assessment of the interaction index continues to be a key point in recent studies of belief bias in syllogistic reasoning in a variety of arenas (e.g., neuroscience, Stollstorff, Bean, Anderson, Devaney, & Vaidya, 2013; emotion and cognition, Blanchette and Campbell, 2012, Eliades et al., 2012, Goel and Vartanian, 2011; individual differences, Stupple, Ball, Evans, & Kamal-Smith, 2011; informal argumentation, Thompson & Evans, 2012).

The logic effect is also a matter of extensive interest in reasoning research, beyond syllogistic reasoning tasks. For example, Pollard and Evans (1987) proposed that the logic index based on raw difference scores (logically correct answers minus logically incorrect answers) should be used to analyze performance on the Wason (1968) selection task. This proposal has been influential (e.g., Griggs, 1989, Platt and Griggs, 1993, Stanovich and West, 2008). The logic index has also been used extensively in studies of conditional reasoning (e.g., Evans et al., 1999, Sellen et al., 2005). Therefore, the critiques in this paper of the logic index apply not only to syllogistic reasoning but to the selection task and conditional reasoning as well. Similarly, the belief index has also been used to study belief bias effects in conditional reasoning (e.g., Evans et al., 2009, Handley et al., 2004).

What the aforementioned studies have in common is that they rely on analyses of simple difference scores and interactions. A few experiments have used these measures to investigate the important topic of whether the belief bias effect can be reduced or eliminated intentionally (Evans et al., 1994, Newstead et al., 1992). In other words, can an experimenter’s instructions lead a participant to avoid using prior beliefs when evaluating logical validity? This is an important theoretical question because it addresses a core issue in dual-process accounts of reasoning, namely whether automatic processes can be inhibited or substituted with more controlled processes (referred to as an intervention by Evans, 2008, and an override by Stanovich, 2009). In an experiment with syllogisms, Newstead et al. found that highly detailed instructions eliminated both the belief effect and the interaction effect. In contrast, two of the three experiments on syllogisms reported by Evans et al. (1994) found no reduction in the belief effect or the interaction effect. Their favored explanation for the inconsistent results focused on stimulus and instruction effects. But another possibility is that the traditional measures they considered have a tendency to lead to distorted or unreliable conclusions.

Dube et al. (2010) raised a related concern. We showed that the theoretical receiver operating characteristic (ROC) curves—which plot correct response rates (hits, H) against error response rates (false alarms, F) as a function of changing response bias but constant accuracy level—implied by traditional measures are linear (see also Macmillan and Creelman, 2005, Swets, 1986). In contrast, the empirical ROCs obtained in reasoning tasks, including both syllogistic belief bias and inductive reasoning, are curved and therefore inconsistent with the assumptions of the raw score approach (Dube et al., 2011, Dube et al., 2010, Heit and Rotello, 2005, Heit and Rotello, 2008, Heit and Rotello, 2010, Heit and Rotello, 2012, Heit et al., 2012, Rotello and Heit, 2009, Trippas et al., 2013). Note that in Dube et al., 2010, Dube et al., 2011 we focused on the interaction index and did not consider potential problems with the logic index or the belief index that are addressed here for the first time.
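To make the contrast concrete, even the simplest equal-variance Gaussian SDT model (an illustrative simplification; the model fits discussed below allow the valid and invalid evidence distributions to have unequal variances) implies a curved ROC:

```latex
% Equal-variance Gaussian SDT (illustrative simplification)
F = \Phi(-c), \qquad H = \Phi(d' - c)
\quad\Longrightarrow\quad
H = \Phi\bigl(d' + \Phi^{-1}(F)\bigr).
```

This function is curved in (F, H) coordinates but linear with unit slope in z-transformed coordinates; allowing unequal variances changes the zROC slope (e.g., the 0.8 used in the simulations below) without making the probability-coordinate ROC linear.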

Applying a measurement statistic, such as a difference between acceptance rates, when the assumptions of that measure are not met has been shown to result in a high probability that the data will be misinterpreted. This point has been made often in memory research (e.g., Evans et al., 2009, Masson and Rotello, 2009, Verde and Rotello, 2003, Wixted and Mickes, 2012). For example, what are actually response bias differences between two experimental conditions may be falsely interpreted as accuracy differences. That negative consequence of violated assumptions cannot be overcome by collecting larger sample sizes, which often, insidiously, worsen the problem (Rotello, Masson, & Verde, 2008).

Dube et al. (2010) applied a signal detection (SDT) model of belief bias to our curved ROC data, and concluded that the belief effect and the interaction effect could be fully accounted for by a simple response bias shift for believable and unbelievable problems: Reasoning accuracy did not differ with believability, though subjects’ willingness to say “valid” did. In contrast, traditional analyses had indicated that reasoning accuracy was greater for unbelievable arguments than for believable arguments. Accuracy differences are often used to justify theoretical claims of differential or extra processing for some argument types; for example Evans et al. (1983) concluded that when reasoners are faced with an unbelievable conclusion, they undertake additional processing to scrutinize an argument’s premises. If belief bias is simply a tendency to respond more positively to some arguments than others, these extra processes are unnecessary.

In Dube et al. (2010), difference score analyses led to the usual conclusion that there is an interaction between logic and belief, but SDT analyses concluded that there was no interaction. Indeed, in one experiment we eliminated the belief content of the syllogisms, replacing the content words with letters and imposing a between-subjects manipulation of response bias. The resulting response rates were analyzed using a standard 2 × 2 ANOVA and revealed significant main effects of logic and bias, and a significant interaction. In other words, despite being presented with identical problems, participants in the conservative condition appeared to reason more consistently with the rules of logic than those in the liberal condition. In contrast, SDT model fits led to the conclusion that only the response bias parameters need to be free to vary to account well for the data; accuracy parameters do not differ.

To see how these vastly different interpretations of the data are possible, consider Fig. 1, which shows hypothetical data that might be observed in an experiment on belief bias. A typical result is represented by points B and D, where point B reflects more conservative responding to unbelievable problems, and point D reflects more liberal responding to believable problems. Notice that point B falls on a linear ROC implied by a higher value of the traditional difference score measure, H − F, relative to point D. In a traditional analysis, these points would be interpreted as showing an interaction between validity and believability, in which reasoning accuracy is higher for unbelievable conclusions. Points A and C would also be interpreted as showing an interaction, although for that pair of points higher reasoning accuracy would be inferred for the believable problems (point C). If points B and C were observed empirically, the experiment might be deemed a failure: They fall on the same H − F ROC and therefore difference score analysis would conclude that the interaction index was zero. The signal detection interpretation of these points, shown as the smooth curve, is that they all reflect the same reasoning accuracy, differing only in response bias.

We emphasize that linear ROCs are a necessary assumption of traditional difference-score analyses of reasoning. The difference score approach subtracts the positive response rate to one type of stimulus (say, invalid problems) from that to another stimulus type (valid problems). This assumes that the difference, H − F, measures accuracy; Snodgrass and Corwin (1988) called this measure Pr. If only bias changes across conditions, then response rates to both types of stimuli will increase or decrease, but Pr should be constant. Indeed, because ROCs are isosensitivity curves, connecting the data points from conditions that differ in response bias but not accuracy defines the theoretical ROC for that accuracy measure. In the case of Pr, we note that Pr = H − F, or, equivalently, that H = Pr + F. Because Pr is necessarily constant in an ROC, this simple equation shows that the hit rate is a linear function of the false alarm rate, with intercept equal to Pr and a slope of 1. All points that have equal Pr must fall on the same line (Swets, 1986, p. 111). For example, consider one experimental condition that results in a correct response rate of 0.61 and an error rate of 0.11, yielding Pr = 0.50 (this is point B in Fig. 1). If another condition produces a liberal response bias shift, the false alarm rate might increase, say to 0.39. If accuracy is unchanged, then the hit rate must be 0.89, because H = Pr + F = 0.50 + 0.39 (this is point C in Fig. 1). Similarly, if responding is very conservative, so that the false alarm rate is 0, then the hit rate must be 0.50 if accuracy is constant; these data would appear at the y-intercept of the H − F = 0.50 line in Fig. 1. Data points that do not fall on that line reflect different values of Pr and therefore different accuracy levels.
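As a quick check on this arithmetic, the sketch below (Python with scipy; the helper names are ours, purely for illustration) reproduces points B and C on the Pr = 0.50 line and then shows where that line parts company with an equal-variance Gaussian ROC matched to point B.

```python
from scipy.stats import norm

def pr(hit, fa):
    """Traditional difference-score accuracy: Pr = H - F."""
    return hit - fa

# Points B and C from Fig. 1: same Pr, different response bias.
B = (0.11, 0.61)   # (false-alarm rate, hit rate)
C = (0.39, 0.89)
print(pr(B[1], B[0]), pr(C[1], C[0]))      # both 0.50

# Equal-variance Gaussian ROC through point B (illustrative only;
# the fits reported in the paper allow unequal variances).
d_prime = norm.ppf(B[1]) - norm.ppf(B[0])  # roughly 1.51

def gaussian_hit(fa, d):
    """Hit rate on the Gaussian ROC with sensitivity d at false-alarm rate fa."""
    return norm.cdf(d + norm.ppf(fa))

# At a very conservative bias (F = 0.05), the two iso-accuracy curves diverge:
print(0.50 + 0.05)                   # linear Pr model: H = 0.55
print(gaussian_hit(0.05, d_prime))   # Gaussian model: H is roughly 0.44
```

For the same underlying accuracy as point B, the two iso-accuracy curves predict hit rates of .55 and roughly .44 at a false alarm rate of .05; a difference-score analysis would misread that gap as a change in accuracy rather than a change in response bias.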

The current state of the field leaves researchers in a muddle. The belief bias effect is a central phenomenon in reasoning research, but there are differing views on the nature of the belief bias effect, how to measure it, and how to explain it. Traditional raw score measures suggest a dramatically different interpretation of the data than the newly-proposed signal detection measures and model. Traditional measures have implied that believability influences reasoning itself, whereas signal detection measures based on ROCs have generally suggested that believability influences only participants’ willingness to say “valid.” In this paper, we further contrast traditional analyses of reasoning tasks and SDT model-based analyses. We first show via simulation that traditional approaches can easily lead to different conclusions than SDT model-based analyses. Next, in two experiments, we investigate the important issue of how instructions affect the three belief bias effects; these data show that the assumptions of the traditional analyses (but not SDT analyses) are violated. Finally, we present ROC data from other deductive reasoning tasks, such as conditional reasoning, that are also consistent with the assumptions of SDT and inconsistent with traditional analytic approaches.

We first present new simulations that demonstrate that all three traditional difference-score measures for reasoning experiments are at risk. We take as a starting point the observation that all extant empirical ROC curves for reasoning experiments are curved (Dube et al., 2010, Dube et al., 2011, Heit and Rotello, 2005, Heit and Rotello, 2008, Heit and Rotello, 2010, Heit and Rotello, 2012, Heit et al., 2012, Rotello and Heit, 2009, Trippas et al., 2013), and are consistent with arguments that vary in strength according to Gaussian distributions. We generate simulated data for hypothetical experiments containing an instructional manipulation that affects response bias and not reasoning accuracy. We show that traditional raw score analyses will tend to incorrectly imply that there is a difference in the size of the logic effect across instructional conditions when there is none, and will likewise tend to incorrectly imply that the interaction between logic and belief differs from one instructional condition to the other. Then we address the case in which instructions affect reasoning accuracy rather than response bias. Although traditional measures will correctly pick up the difference in the logic effect, they may incorrectly imply that the belief effect differs across conditions when it does not.

To generate the simulated data, we sampled evidence values from Gaussian distributions like those shown in Fig. 2a. Evidence values sampled from the valid distribution had a higher mean than those sampled from the invalid distribution, and the magnitude of the difference in mean strength was varied over several levels (see Table 1 for details). In this set of simulations, we assumed that the believability of an argument’s conclusion influenced only response bias (i.e., criterion location), with believable problems yielding a more liberal bias (see Fig. 2a). Although there was a logic effect, it was the same size for both believable and unbelievable problems, and thus there was no interaction between belief and logic. The magnitude of the bias shift was also varied over several levels (Table 1). Layered on top of the bias effect attributable to believability, we assumed in this simulation that the instructional manipulation itself affected response bias. For simplicity, the two bias effects were assumed to be additive. For each simulated trial, the sampled evidence value was compared to the appropriate decision criterion; values above the criterion led to “valid” responses, and those below it led to “invalid” decisions. The number of simulated trials per subject was varied parametrically, as was the number of simulated subjects per condition in each experiment.
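A minimal trial-level sketch of this setup follows (Python; the function and parameter names, and the default values, are our own illustrative choices, and the actual parameter grids are given in Table 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_subject(d=1.0, sigma_valid=1.25, crit=0.0,
                     belief_shift=0.5, instr_shift=0.0, n_trials=16):
    """Simulate one subject's "valid" response rates in a 2 (validity) x 2 (belief) design.

    Evidence for invalid problems ~ N(0, 1); for valid problems ~ N(d, sigma_valid),
    so sigma_valid = 1.25 corresponds to a zROC slope of 0.8. Believable conclusions
    lower the decision criterion by belief_shift (a more liberal bias), and the
    instructional manipulation adds a further additive shift of instr_shift.
    Accuracy is identical across belief conditions, so there is no true interaction.
    """
    rates = {}
    for believable in (True, False):
        c = crit + instr_shift - (belief_shift if believable else 0.0)
        for valid in (True, False):
            mu, sd = (d, sigma_valid) if valid else (0.0, 1.0)
            evidence = rng.normal(mu, sd, n_trials)
            rates[(valid, believable)] = np.mean(evidence > c)  # proportion "valid"
    return rates
```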

For each combination of parameter values, we simulated 1000 experiments, computing the logic, belief, and interaction indices in both instructional conditions. These values were then compared using two-sample t-tests. Because the distance between the valid and invalid distributions did not vary with instructions or conclusion believability, significant t-tests for the logic effect or the interaction effect each represent Type I errors in which the two instructional conditions are erroneously concluded to yield different accuracy or an interaction. Although there was a belief effect within each condition as a result of the response bias shift for believable problems, the magnitudes of the effect were identical, so significant t-tests for the belief effect also represent Type I errors in this simulation.
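Continuing that sketch, the fragment below computes the three traditional indices for each simulated subject and tallies how often two instruction conditions that differ only in response bias are nonetheless declared to differ at alpha = .05. The index definitions follow one common formulation; sign conventions for the interaction index vary across papers.

```python
import numpy as np
from scipy.stats import ttest_ind

def indices(rates):
    """Traditional difference-score indices from one subject's acceptance rates."""
    vb, vu = rates[(True, True)], rates[(True, False)]    # valid: believable / unbelievable
    ib, iu = rates[(False, True)], rates[(False, False)]  # invalid: believable / unbelievable
    logic = (vb + vu) / 2 - (ib + iu) / 2        # valid minus invalid acceptance
    belief = (vb + ib) / 2 - (vu + iu) / 2       # believable minus unbelievable acceptance
    interaction = (ib - iu) - (vb - vu)          # belief effect larger for invalid problems
    return logic, belief, interaction

def type_i_rates(n_experiments=1000, n_subjects=20, **kwargs):
    """Proportion of simulated experiments declaring a difference on each index.

    Uses simulate_subject() from the sketch above; the two instruction conditions
    differ only in the additive response bias shift, so every significant t-test
    is a Type I error.
    """
    false_alarms = np.zeros(3)
    for _ in range(n_experiments):
        cond_a = np.array([indices(simulate_subject(instr_shift=0.0, **kwargs))
                           for _ in range(n_subjects)])
        cond_b = np.array([indices(simulate_subject(instr_shift=0.4, **kwargs))
                           for _ in range(n_subjects)])
        p = ttest_ind(cond_a, cond_b, axis=0).pvalue
        false_alarms += (p < .05)
    return false_alarms / n_experiments   # error rates for logic, belief, interaction
```

Calling type_i_rates(n_subjects=20) versus type_i_rates(n_subjects=60) mirrors the comparison shown in the left and right panels of Fig. 3.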

The results for the logic effect are shown in the upper row of Fig. 3, for a representative set of parameters (overall d = 1, zROC slope = 0.8, the most liberal criterion placed at the mean of the invalid distribution, and instruction effect of 0.4 standard deviations). The left panel shows the probability that the two instruction conditions are falsely declared to yield a different validity effect when there are 20 simulated subjects per condition, and the right panel shows the results for 60 simulated subjects. This simulation shows that there is a substantial risk of erroneously inferring that there is a different logic effect across instruction conditions. Moreover, increasing the number of subjects (left vs. right panel) or the number of sampled trials (x-axis values within a panel) increases the probability of drawing the wrong conclusion. This probability increases as the magnitude of the belief bias shift increases (functions within a panel). The probability of this error also increases as the bias difference between conditions increases (not shown). In sum, the circumstances that would usually lead to a more powerful experiment—more subjects, more data per subject, larger effect size—actually lead to more incorrect conclusions for the traditional difference-score measure of the logic effect.

The middle row of Fig. 3 presents the results for the interaction index. The same general patterns are revealed, indicating that the traditional measure of the interaction index often leads to the incorrect conclusion that an interaction is present. Finally, the bottom row of Fig. 3 shows that the belief effect had a Type I error rate near 0. This result obtains for the simple reason that the bias shifts were always the same size in both conditions.

In a second set of simulations, analogous to our first set, we assumed that the effect of instructional condition was to influence overall reasoning accuracy and not response bias. In brief, traditional measures on the simulated data showed belief and interaction effects that were not really there, and were more likely to do so when sample sizes were larger. Finally, note that both sets of simulations are independent of the nature of the reasoning task, e.g., they would apply equally well to syllogistic reasoning and conditional reasoning.

Having shown the potential for traditional analyses to draw incorrect conclusions from simulated data, we next turn to two new experiments. Dube et al. (2010) argued that the traditional interaction measure is inappropriate and can lead to incorrect inferences. Here, we extend our arguments to the logic index and the belief index as well. In our simulations, we have shown that in experiments designed to measure the influence of instructions on belief bias, or, indeed, with any experimental manipulation that could influence either response bias or reasoning accuracy, all three traditional measures will often lead to incorrect conclusions. In our new experiments, we revisit the issue of how instructions affect belief bias in syllogistic reasoning (Evans et al., 1994, Newstead et al., 1992).

Evans et al. (1994) varied the level of detail provided in instructions to participants. Some conditions used elaborate instructions that emphasized the nature of logical necessity; other instruction sets were long, complex, and included the logical definition of the quantifier “some.” These complex instructions were found to reduce the size of the belief effect and to render the interaction non-significant. Evans et al. interpreted those data as indicating that the believability of a conclusion strongly influences the accuracy of a participant’s reasoning processes. In contrast to previous work, we compared the three traditional measures of reasoning performance to measures derived from SDT analyses. If the resulting ROCs are curved and consistent with SDT, then the SDT analysis will more accurately identify the contributing effects.

Section snippets

Experiment 1

Two sets of instructions were used, standard and augmented. The standard instructions were based on Evans et al. (1994, Experiment 1) as well as Newstead et al. (1992, Experiment 5). The augmented instructions, also based on those experiments, included additional reminders to “only endorse a conclusion if it definitely follows from the information given.” We focused on whether the augmented instructions led participants to adhere more closely to the rules of logic (thus increasing the logic

Experiment 2

Having shown in Experiment 1 that traditional analyses and SDT model-based analyses lead to different conclusions regarding belief bias and the effects of instructions, we next investigated an alternative instructional manipulation, in which participants were told to be more conservative, namely that when they are guessing, they should respond “not valid” rather than “valid.” In a recognition memory paradigm, Rotello, Macmillan, Hicks, and Hautus (2006) showed that an analogous instructional

General discussion

Our initial simulations suggested that when argument strength has an underlying distribution that is Gaussian in form, there is great potential for traditional analyses such as the logic index, belief index, and interaction index to draw incorrect conclusions. For example, when conclusion believability and instructions are known to affect only response bias, our simulations showed that there is still a good chance that the traditional logic effect and interaction effect measures will

Conclusion

Our argument can be summarized as follows: Experimental analyses are model-dependent; different models may lead to different conclusions; and analyses based on signal detection theory are justified based on the form of the ROC data whereas traditional analyses are not. Therefore, traditional difference-score analyses of reasoning performance in general, and the belief bias effect in syllogistic reasoning in particular, are flawed and likely to lead to erroneous conclusions.

The belief bias

Acknowledgements

We thank Marios Eliades and Isabelle Blanchette for providing the raw data from Eliades et al. (2012), Henry Markovits for providing raw data from Markovits and Handley (2005) and Markovits et al. (2010), Lance Rips for providing raw data from Rips (2001), Wendy Contreras, Graham Ellis, and Aljanee Whitaker for their assistance in running experiments, Nicolas Raboy for programming assistance, John Dunn and Dries Trippas for feedback on a previous version of this manuscript, and Chad Dubé for

References (75)

  • A.M. Cleary (2005). ROCs in recognition with and without identification. Memory.

  • J. Cohen (1988). Statistical power analysis for the behavioral sciences.

  • T.D. Cook et al. (1979). Quasi-experimentation: Design and analysis issues for field settings.

  • C. Dube et al. (2012). Binary ROCs in perception and recognition memory are curved. Journal of Experimental Psychology: Learning, Memory, and Cognition.

  • C. Dube et al. (2010). Assessing the belief bias effect with ROCs: It’s a response bias effect. Psychological Review.

  • C. Dube et al. (2011). The belief bias effect is aptly named: A reply to Klauer and Kellen (2011). Psychological Review.

  • M. Eliades et al. (2012). An investigation of belief-bias and logicality in reasoning with emotional contents. Thinking & Reasoning.

  • J.St.B.T. Evans (2006). The heuristic-analytic theory of reasoning: Extension and evaluation. Psychonomic Bulletin & Review.

  • J.St.B.T. Evans (2007). Hypothetical thinking: Dual processes in reasoning and judgement.

  • J.St.B.T. Evans (2008). Dual-processing accounts of reasoning, judgement and social cognition. Annual Review of Psychology.

  • J.St.B.T. Evans et al. (1983). On the conflict between logic and belief in syllogistic reasoning. Memory & Cognition.

  • J.St.B.T. Evans et al. (2005). Rapid responding increases belief bias: Evidence for the dual-process theory of reasoning. Thinking & Reasoning.

  • J.St.B.T. Evans et al. (2009). Reasoning under time pressure. Experimental Psychology (formerly Zeitschrift für Experimentelle Psychologie).

  • J.St.B.T. Evans et al. (1999). The influence of linguistic form on reasoning: The case of matching bias. The Quarterly Journal of Experimental Psychology: Section A.

  • J.St.B.T. Evans et al. (1994). Debiasing by instruction: The case of belief bias. European Journal of Cognitive Psychology.

  • K. Evans et al. (2009). Scene perception and memory revealed by eye movements and ROC analyses: Does a cultural difference truly exist? Quarterly Journal of Experimental Psychology.

  • V. Goel et al. (2011). Negative emotions can attenuate the influence of beliefs on logical reasoning. Cognition & Emotion.

  • R.A. Griggs (1989). To “see” or not to “see”: That is the selection task. The Quarterly Journal of Experimental Psychology.

  • W. Guido et al. (1995). Receiver operating characteristic (ROC) analysis of neurons in the cat’s lateral geniculate nucleus during tonic and burst response mode. Visual Neuroscience.

  • S.J. Handley et al. (2004). Working memory, inhibitory control and the development of children’s reasoning. Thinking & Reasoning.

  • E. Heit et al. Defending diversity.

  • E. Heit et al. Are there two kinds of reasoning?

  • E. Heit et al. Modeling two kinds of reasoning.

  • E. Heit et al. (2010). Relations between inductive reasoning and deductive reasoning. Journal of Experimental Psychology: Learning, Memory, and Cognition.

  • E. Heit et al. (2012). The pervasive effects of argument length on inductive reasoning. Thinking & Reasoning.

  • R.A. Kinchla (1994). Comments on Batchelder and Riefer’s multinomial model for source monitoring. Psychological Review.

  • K.C. Klauer et al. (2011). Assessing the belief bias effect with ROCs: Reply to Dube, Rotello, and Heit (2010). Psychological Review.