Why does guessing incorrectly enhance, rather than impair, retention?

Yan, Veronica X.; Yu, Yue; Garcia, Michael A.; Bjork, Robert A.

doi:10.3758/s13421-014-0454-6

Published: 14 August 2014

Volume 42, pages 1373–1383, (2014)

Memory & Cognition


Abstract

The finding that trying, and failing, to predict the upcoming to-be-remembered response to a given cue can enhance later recall of that response, relative to studying the intact cue–response pair, is surprising, especially given that the standard paradigm (e.g., Kornell, Hays, & Bjork, 2009) involves allocating what would otherwise be study time to generating an error. In three experiments, we sought to eliminate two potential heuristics that participants might use to aid recall of correct responses on the final test and to explore the effects of interference both at an immediate and at a delayed test. In Experiment 1, by intermixing strongly associated to-be-remembered pairs with weakly associated pairs, we eliminated a potential heuristic participants can use on the final test in the standard version of the paradigm—namely, that really strong associates are incorrect responses. In Experiment 2, by rigging half of the participants’ responses to be correct, we eliminated another potential heuristic—namely, that one’s initial guesses are virtually always wrong. In Experiment 3, we examined whether participants’ ability to remember—and discriminate between—their incorrect guesses and correct responses would be lost after a 48-h delay, when source memory should be reduced. Across all experiments, we continued to find a robust benefit of trying to guess to-be-learned responses, even when incorrect, versus studying intact cue–response pairs. The benefits of making incorrect guesses are not an artifact of the paradigm, nor are they limited to short retention intervals.


An abundance of research on testing and generation effects has shown that the act of retrieval is a learning event—and often a powerful learning event—in the sense that the retrieved information becomes more retrievable in the future than it would have been otherwise (see, e.g., Roediger & Karpicke, 2006). The retrieval processes triggered by testing are, therefore, opportunities for learning—a basic fact about human learning that is often not appreciated or, at least, is underappreciated, by students (see, e.g., Karpicke, Butler, & Roediger, 2009; Kornell & Bjork, 2007).

Testing effects and generation effects, however, typically refer to the consequences of successful retrieval or generation. One justifiable concern about testing or generation is that what is retrieved, whether correct or incorrect, will be learned: That is, by virtue of the very power of retrieval as a learning event, it seems likely that any errors that are produced will persist. One influential school of thought, for example, inspired by Skinnerian principles of learning, has emphasized “errorless learning” procedures (Skinner, 1958; Terrace, 1963), and a number of studies have, in fact, shown that initially incorrect responses often persist on subsequent tests (e.g., Cunningham & Anderson, 1968; Elley, 1966; Kaess & Zeaman, 1960; Marsh, Roediger, Bjork, & Bjork, 2007). Additionally, generating errors before being given feedback mirrors a classic A-B/A-D interference paradigm (e.g., Briggs, 1954), in which researchers have found that participants do, indeed, become more likely to output the initial “B” response as the retention interval increases.

The picture, though, is not so clear. Other studies investigating the effects of errors on multiple-choice tests (e.g., Butler, Marsh, Goode, & Roediger, 2006), for example, have shown no effect of generating errors, and other recent—and not so recent—findings suggest that there might, in fact, be benefits of trying to generate a correct response, even when the effort fails.

That even failed efforts to generate a to-be-remembered response might have benefits is suggested by the results of early research by Slamecka and Fevreiski (1983). Participants were presented with a list of related cue–target word pairs and were asked to say the target word aloud. In a study-only condition, the participants were shown the intact pair (e.g., pursue–avoid); in a generate condition, participants were shown the full cue word together with a fragment of the target word (e.g., pursue–av--d). If they failed to generate the target word within a 4-s interval, they were provided with corrective feedback immediately for 3 s. On a subsequent free recall test of the targets, there was a benefit of generate over study-only, even when only those items for which participants failed to generate the correct response were examined. The authors argued that failed generations were, in fact, incomplete generations, where semantic features, but not surface features, were processed.

In Slamecka and Fevreiski’s (1983) study, however, 93 % of the errors were errors of omission, not errors of commission, so their findings leave open the possibility that generating overt errors has negative, not positive, effects. Recently, though, Kornell, Hays, and Bjork (2009), using a procedure in which participants’ guesses of to-be-learned responses are wrong with high probability (thus, eliminating differences between items in the errorful and errorless conditions—a confound in some previous studies), extended the finding to cases where participants do not simply omit responses, but produce errors. Their results suggest that producing errors, at least under some circumstances, enhances subsequent learning.

Kornell et al.’s (2009) findings have stirred considerable interest, not only because producing incorrect guesses does not seem, intuitively, to be a good learning technique, but also because their specific procedure involved taking what would otherwise be study time to predict (erroneously) an upcoming to-be-learned response. In the guess-first condition of their Experiment 4, for example, participants were shown cues such as Whale: ______ for 8 s and were asked to predict the upcoming to-be-learned associate of that cue. Immediately after, they were then shown the cue together with the to-be-learned response (Whale: Mammal) for 5 s (97 % of the guesses were incorrect, and the trials on which guesses matched the target were removed from analyses). In their study-only condition, on the other hand, pairs such as Whale: Mammal were shown for the full 13 s. The guess-first condition produced better later recall of the correct target than did the study-only condition, despite the shorter study time and the reasonable expectation that generating a competing associate would create proactive interference. Kornell et al.’s basic finding has now been replicated by a number of other investigators (Grimaldi & Karpicke, 2012; Hays, Kornell, & Bjork, 2013; Huelser & Metcalfe, 2012; Knight, Ball, Brewer, DeWitt, & Marsh, 2012; Vaughn & Rawson, 2012), as well as with foreign language learning (Potts & Shanks, 2014) and more semantically rich text passages (Richland, Kornell, & Kao, 2009) and trivia facts (Kornell, 2014).

Questions and issues motivating the present research

Why do we not find interference in these experimental paradigms? In the present series of experiments, we seek to address two issues: (1) whether the design of the experimental paradigm allows participants to distinguish between their guess and the correct answer at the time of the final test, and (2) whether the guess-first benefit is maintained, or the generated guesses instead interfere with target recall, at a longer retention interval.

One explanation of the benefits of guessing incorrectly is that a participant’s incorrect guess acts as a mediator between the cue and the correct response. An assumption that underlies this explanation is that learners have a means of knowing, at the time of the final test, which response—the one they generated or the one they then studied—is the correct response.

In Experiment 1, we set out to examine whether a feature intrinsic to Kornell et al.’s (2009) paradigm might play a key role in learners being able to make that judgment. Because Kornell et al. wanted to examine whether making incorrect guesses would help or hinder learning, they chose weak associates of the cue word as to-be-learned response targets—that is, words that were unlikely to come to mind and be guessed in advance by participants. In Experiment 1, we explored whether participants in prior experiments may have been able to use the fact that generated errors tended to be strong associates of the cue words, whereas target responses were always weak associates of a given cue. Could participants have mitigated interference at the final test between competing responses, generated errors and targets, by learning that targets are weak associates? We nullified that possible heuristic in Experiment 1 by designing the materials so that the correct answer for half the pairs was a strong associate of the cue word.

In Experiment 2, we sought to nullify another possible heuristic that participants could be using in this paradigm: that their guesses are always wrong. The errorful generation paradigm—as used by Kornell et al. and in subsequent follow-up studies (Grimaldi & Karpicke, 2012; Huelser & Metcalfe, 2012; Knight et al., 2012; Potts & Shanks, 2014; Vaughn & Rawson, 2012)—ensures that the guess is almost always wrong, leaving open the possibility that when presented with the cue at final test, participants are able to simply select whatever response they did not generate for themselves. Therefore, we rigged Experiment 2 so that, in one condition, half of participants’ guesses were always deemed to be correct, and compared the benefit of making errors in this condition with the original condition where just about all the guesses were incorrect.

Variations of the original paradigm have been investigated to test different theories as to why there is a benefit of generating incorrect guesses, and these theories are further discussed in the General Discussion section. Despite variations on this original design, however, whether participants could use a heuristic remains an open question. One variant (e.g., Grimaldi & Karpicke, 2012; Hays, Kornell, & Bjork, 2013) found that delaying feedback of the correct answer eliminates the benefit of making incorrect guesses. While one explanation is that delaying feedback means that the correct target is not encoded into an activated semantic network, it could also be that having first generated guesses to all the guess-first word pairs before receiving the correct answer makes it more difficult for learners to recognize that all the correct responses are less obvious associates of the cue or even that all their initial guesses are wrong. Another variant on the original design showed that the benefit of generating responses was eliminated when participants’ guesses were constrained to a particular word (through the provision of a two-letter stem—e.g., tide–wa____; Grimaldi & Karpicke, 2012). By constraining the guess to one obvious target response, the experimenters created a very different task than is experienced by participants making unconstrained anticipatory “guesses.” Instead of interpreting the constrained generations as “wrong answers,” participants may simply interpret them as other correct answers that are simply not required on the later test.

Experiment 3 was designed to examine whether the ability of participants to discriminate at the time of test between the response they guessed and the actual correct responses depends on the retention interval to the final testing being relatively short. Prior studies have used very short retention intervals in which participants are readily able to retrieve their initial guesses and to distinguish their guesses from the correct targets (e.g., Knight et al., 2012; Vaughn & Rawson, 2012). A question that remains, however, is whether incorrect guesses might become interfering at a long delay. At a delay, we expect that participants will display weaker episodic discrimination and a relatively stronger memory trace for generated guesses, as compared with studied targets. The combination of these two factors could create a case where generated guesses proactively interfere with access to the correct targets. If there is indeed no benefit (or even a detriment) to making guesses at long delays, this finding would have implications for applications of generating errors in education. In Experiment 3, therefore, we investigate whether making erroneous guesses starts to interfere after a longer retention interval (48 h).

Experiment 1

In Experiment 1, we replicated Kornell et al.’s (2009) Experiment 4, but with two changes. First, as was mentioned above, to nullify participants being able to use relative associative strength as a discriminative cue at the time of the final test, we made half of the to-be-learned responses strong, rather than weak, associates of the cue words. If the advantage of guessing before study is due to the use of a “the-answer-is-always-weakly-associated” heuristic at the final test and mixing high associates with the low associates prevents the usage of this strategy, the benefit of guessing-first over only studying on the final test should be eliminated. Second, in addition to asking participants to recall the correct targets on the final cued-recall test, we also asked them to recall their initial guesses. We reasoned that if guesses competed and interfered with the ability to retrieve the targets, we should see better recall of targets when participants are unable to recall their incorrect guesses.

Method

Participants and design

Thirty-four undergraduates from the University of California, Los Angeles (UCLA) participated in Experiment 1. The participants received partial course credit as compensation. We manipulated study condition (guess-first vs. study-only) and word-pair association strength (strong vs. weak) within subjects. In the cued-recall test phase, participants were asked—in response to each cue word—to recall the correct target and then the target they had guessed during the study phase prior to seeing the correct response.

Materials and apparatus

Sixty paired associates were used. Half were weakly associated word pairs with forward association strength between 0.05 and 0.054 (e.g., Olive: Branch); half were strongly associated word pairs with a forward association strength between 0.3 and 0.4 (e.g., Table: Chair). The weak associates were a randomly selected set of 30 pairs, taken from the materials of Kornell et al. (2009). All the words were, at minimum, four letters long. Half of the word associates were randomly assigned to the guess-first condition, which comprised 15 strong associates and 15 weak associates, and the remaining 30 were assigned to the study-only condition. Assignment of these two sets of 30 word pairs was counterbalanced across participants. The order in which the four within-subjects conditions (strong vs. weak; guess-first vs. study-only) appeared was block randomized; the list was divided into 15 blocks of four trials, where each block consisted of one pair from each within-subjects condition (therefore controlling for serial position effects between the conditions). From the participants’ point of view, however, they saw only one long list of 60 word pairs. Finally, the order of the word pairs was fully randomized.
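The block-randomized ordering described above can be illustrated with a minimal Python sketch. It is not the authors' Collector configuration: the function name `build_study_list`, the placeholder word pairs, and the random split into conditions (standing in for the counterbalanced assignment) are all assumptions made for illustration.

```python
import random

def build_study_list(strong_pairs, weak_pairs, seed=None):
    """Return a block-randomized list of (cue, target, strength, condition) tuples.

    Hypothetical helper: each block holds one pair from each of the four
    strength x condition cells, shuffled within the block.
    """
    rng = random.Random(seed)
    strong, weak = list(strong_pairs), list(weak_pairs)
    rng.shuffle(strong)
    rng.shuffle(weak)
    half_s, half_w = len(strong) // 2, len(weak) // 2
    # Assumption: a random split stands in for counterbalanced assignment.
    cells = {
        ("strong", "guess-first"): strong[:half_s],
        ("strong", "study-only"): strong[half_s:],
        ("weak", "guess-first"): weak[:half_w],
        ("weak", "study-only"): weak[half_w:],
    }
    n_blocks = min(len(items) for items in cells.values())
    study_list = []
    for b in range(n_blocks):
        # One item per cell, then shuffle the order inside the block.
        block = [(items[b][0], items[b][1], strength, condition)
                 for (strength, condition), items in cells.items()]
        rng.shuffle(block)
        study_list.extend(block)
    return study_list

# Placeholder materials (not the actual stimuli): 30 strong and 30 weak pairs.
strong = [(f"s_cue{i}", f"s_target{i}") for i in range(30)]
weak = [(f"w_cue{i}", f"w_target{i}") for i in range(30)]
study_list = build_study_list(strong, weak, seed=1)
```

With 30 pairs of each strength, this yields 15 blocks of four trials, so every run of four consecutive trials contains each within-subjects condition exactly once.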

The experiment was created using Collector (https://github.com/gikeymarcia/Collector), an open-source PHP-based program designed to run psychology experiments and conducted via an Internet browser. Participants came into the laboratory and were administered the study on 21.5-in. Apple iMac desktop computers. The web browser was opened full-screen, and instructions and word pairs were all presented in the center of the screen.

Procedure

The study was composed of two phases: a study phase and a final cued-recall test phase. For the study phase, participants were told that they would study pairs of related words. Sometimes they would see complete pairs, whereas other times the second word would be missing. When pairs were shown incomplete, participants were told that they should try to guess the upcoming to-be-learned response, after which they would be shown the correct answer. Participants were shown the 60 word pairs one at a time. In the guess-first condition, they were presented with a cue and a blank (e.g., Olive: _____) and were given 8 s to make a guess (e.g., they might guess “Martini”). Participants were instructed to always make a guess, rather than to leave the space blank. The full cue–target pair (e.g., Olive: Branch) was then shown for 5 s immediately after participants made their guess. In the study-only condition, participants were presented with the full cue–target pair twice consecutively, for 8 and 5 s, respectively.

After a 5-min retention interval, participants were then given a final cued-recall test on all 60 word pairs. During the final cued-recall test, participants were shown a given cue twice followed by a blank line each time. Participants were informed that every cue word would be presented twice consecutively and were instructed to fill in the first blank with the correct target. For example, if they were presented with the cue: “Olive:_____,” they should type “Branch” (the correct target) the first time they see “Olive: _____” in the final test. They were instructed that for the second blank, they should type in their original guesses, if the pair was in the guess-first condition. In the example given then, that means that they should type in “Martini” for the immediately subsequent, second presentation of “Olive: _____.” If they had not been asked to make a guess for the cue word in the study phase (i.e., the pair had been in the study-only condition), participants were told to type in “Read,” instead of an initial guess. It was not indicated during the final test whether the pair was in the guess-first or the study-only condition, and the second blank appeared regardless of whether participants were able to fill in the first blank (i.e., recall the correct target). Participants were not given any explicit instruction about whether they should always fill in the blank, and many left the space blank if they could not recall the answers. The pairs were presented in a randomized order, and the test was self-paced.

Results and discussion

Although comparison of the weak and strong associates is not of primary concern—the strong associates were included simply to reduce the possible use of a “the answer-is-always-weakly-associated” heuristic—we analyzed the strong and weak pairs separately. Successful guess rates were 4 % for the weak associates and 9 % for the strong associates for the guess-first pairs during the study phase. The rates for the weak associates were about as expected, but the rates for the strong associates were lower than expected. This lower rate may reflect that the pairs were intermixed, meaning that participants could learn that the most obvious associates were only infrequently the correct responses, leading to a reduced success rate for the strong associates. All analyses reported in this article are restricted to the items where the guess was incorrect. Additionally, responses were counted as correct only if they were typed into the appropriate spaces; in other words, recalled targets were counted as correct if entered into the first blank, but not if entered into the second.
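The scoring rules in the preceding paragraph can be sketched as follows. This is a minimal illustration, not the authors' scoring script: the function names are hypothetical, and matching responses by exact string comparison after trimming and lowercasing is an assumption, since the text does not say how spelling variants were handled.

```python
def normalize(text):
    """Assumed matching rule: ignore case and surrounding whitespace."""
    return text.strip().lower()

def score_guess_first_trial(target, study_guess, first_blank, second_blank):
    """Score one guess-first trial from the final test.

    Returns None when the study-phase guess matched the target, because
    such trials are excluded from the analyses. Otherwise a response counts
    only in its designated blank: targets in the first, guesses in the second.
    """
    if normalize(study_guess) == normalize(target):
        return None
    return {
        "target_recalled": normalize(first_blank) == normalize(target),
        "guess_recalled": normalize(second_blank) == normalize(study_guess),
        # Responses typed into the wrong blank are tracked as intrusions.
        "guess_intruded": normalize(first_blank) == normalize(study_guess),
        "target_intruded": normalize(second_blank) == normalize(target),
    }

# Example using the cue from the text (Olive: Branch, with "Martini" guessed).
in_place = score_guess_first_trial("Branch", "Martini", "branch", "martini")
swapped = score_guess_first_trial("Branch", "Martini", "Martini", "Branch")
```

In the swapped example, the target was produced but in the second blank, so it is not credited as a recalled target and is instead flagged as an intrusion.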

Recall of correct targets

As is shown in Fig. 1, we replicated the basic finding—namely that the guess-first condition produced better later recall of the target response than did the study-only condition, despite the presence of strongly associated to-be-learned pairs.

Furthermore, the benefit of making incorrect guesses was present for both strongly associated to-be-learned pairs and weakly associated pairs. A two-way (study condition × association strength) within-subjects ANOVA showed that there was a main effect of study condition, F(1, 33) = 45.06, MSE = .03, p < .001, ηp² = .58: Pairs in the guess-first condition (M = .79, SD = .11) were recalled significantly better than the pairs in the study-only condition (M = .59, SD = .18), t(33) = 6.73, p < .001, Cohen’s d = 1.15. There was no significant effect of association strength, F(1, 33) = 0.04, MSE = .01, p > .05, ηp² = .001.

The study condition × association strength interaction was marginally significant, F(1, 33) = 4.04, MSE = .02, p = .053, ηp² = .11. The benefit of making an incorrect guess appears to have been marginally larger for the strong associates (.81 vs. .57 for guess-first and study-only conditions, respectively) than for the weak associates (.76 vs. .62), although the benefit was significant for both weak and strong associates.
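As a reading aid, the effect sizes reported above can be recovered from the test statistics. The sketch below uses the standard identity for partial eta squared, F·df_effect / (F·df_effect + df_error), and assumes that Cohen's d for the within-subjects comparison was computed as t divided by the square root of the sample size. That second formula is an assumption about the authors' method, although it is consistent with the reported numbers.

```python
import math

def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

def cohens_d_paired(t_value, n):
    """Cohen's d for a paired comparison, assuming d = t / sqrt(n)."""
    return t_value / math.sqrt(n)

# Main effect of study condition, F(1, 33) = 45.06
eta_condition = partial_eta_squared(45.06, 1, 33)
# Study condition x association strength interaction, F(1, 33) = 4.04
eta_interaction = partial_eta_squared(4.04, 1, 33)
# Guess-first vs. study-only, t(33) = 6.73 with 34 participants
d_condition = cohens_d_paired(6.73, 34)

print(round(eta_condition, 2), round(eta_interaction, 2), round(d_condition, 2))
```

Rounded to two decimals, these values agree with the reported .58, .11, and 1.15.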

Whatever the reason for the strongly associated pairs showing at least as large a benefit of error generation, the key point is that the benefits of the guess-first condition found by Kornell et al. (2009) and subsequent research findings appear not to be a consequence of participants being able to adopt a heuristic at the time of the final test—namely, that the correct response is the weaker of the two remembered associates to a given cue word.

Participants’ ability to recall their initial guesses

Participants’ ability to recall their initial guesses (M = .79, SD = .13) and the correct answers in the guess-first condition (M = .79, SD = .11) did not differ, t(33) = 0.15, p > .05, Cohen’s d = 0.026; neither was there a difference in their recall of their guesses to the strong-associate cues (M = .81, SD = .14) and to the weak-associate cues (M = .77, SD = .17), t(33) = 1.30, p > .05, Cohen’s d = 0.22. Intrusion rates of guesses into the blank space provided for targets and vice versa were very low: Guesses intruded into recall of targets only 1.4 % of the time (SD = 3.1 %), and targets intruded into recall of guesses only 2.9 % of the time (SD = 4.8 %). Thus, there was no evidence that initial guesses were suppressed, replicating the findings of Vaughn and Rawson (2012) and Knight et al. (2012).

For the study-only trials, participants correctly typed “Read” in the second blank provided for each given cue 78.2 % (SD = 30.4 %) of the time. The large standard deviation reflects the 6 participants who may have misunderstood the instructions (3 of whom mostly left the space blank, and 3 of whom either provided the target a second time or entered completely new cue-related words).

Target recall conditional on guess recall

When we examine target recall, conditional upon ability to recall initial guesses, we see interesting patterns. A 2 (strength: weak, strong) × 3 (study condition: guess recalled, guess unrecalled, study only) within-subjects ANOVA revealed a main effect of study condition, F(2, 58) = 31.6, MSE = .04, p < .001, ηp² = .52, but no main effect of strength and no interaction, Fs < 1. Data from 4 participants were not included in this ANOVA because they had perfectly recalled their initial guesses to either all the weak-associate or all the strong-associate pairs. In Fig. 2, we collapse across strength and compare correct target recall performance of all of the study-only items with that of the guess-first items for which guesses were also recalled and with the guess-first items for which the guesses were not recalled. As is shown in Fig. 2, there is a benefit of generating guesses, but only when those initially incorrect guesses are later recallable.

Post hoc t-tests showed that while there was a benefit of making incorrect guesses (M = .85, SD = .11) over pure study (M = .59, SD = .19) when guesses were retrieved, t(33) = 8.38, p < .001, Cohen’s d = 1.43, there was no difference between recall of targets of guess-first items when participants could not recall their guesses (M = .59, SD = .22) and recall of targets of study-only items, t(33) = 0.01, p > .05, Cohen’s d < 0.01. Additionally, there was a significant difference between recall of the guess-first targets when guesses were recalled and when they were not, t(33) = 7.29, p < .001, Cohen’s d = 1.25. These analyses suggest that participants’ access to their guesses also allows for greater accessibility of the targets, and they replicate the patterns found in prior studies (Butler, Fazio, & Marsh, 2011; Knight et al., 2012; Vaughn & Rawson, 2012).
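The conditional analysis in this section amounts to splitting each participant's guess-first trials by whether the initial guess was later recalled and computing target recall within each subset. A minimal sketch follows, assuming each trial is stored as a dictionary with two boolean fields; the function name, field names, and demo data are hypothetical.

```python
def conditional_target_recall(trials):
    """Proportion of targets recalled, split by whether the guess was recalled.

    `trials` is one participant's guess-first trials, each a dict with boolean
    keys 'guess_recalled' and 'target_recalled' (assumed field names).
    A subset with no trials yields None, since its proportion is undefined.
    """
    subsets = {True: [], False: []}
    for trial in trials:
        subsets[bool(trial["guess_recalled"])].append(bool(trial["target_recalled"]))

    def proportion(flags):
        return sum(flags) / len(flags) if flags else None

    return {
        "guess_recalled": proportion(subsets[True]),
        "guess_unrecalled": proportion(subsets[False]),
    }

# Made-up data for one participant, for illustration only.
demo_trials = [
    {"guess_recalled": True, "target_recalled": True},
    {"guess_recalled": True, "target_recalled": True},
    {"guess_recalled": True, "target_recalled": False},
    {"guess_recalled": False, "target_recalled": True},
    {"guess_recalled": False, "target_recalled": False},
]
demo_result = conditional_target_recall(demo_trials)
```

A participant who recalls every guess has no trials in the guess-unrecalled subset, so that proportion is undefined. This is why participants with perfect guess recall cannot contribute to an analysis that includes the guess-unrecalled cell.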

Experiment 2

In Experiment 1, despite our expectations, successful guess rates between high- and low-associate word pairs were not dramatically different. Prior research by Koriat, Fiedler, and Bjork (2006) also suggests that hindsight bias can make it difficult for participants to accurately judge the likelihood of generating a target given a cue, particularly when the cue–target pair is related: When shown a list of word pairs of zero, low, or high association, participants grossly overestimated the percentage of people who would generate the target given the cue and showed a remarkable underappreciation of the difference between high- and low-associate pairs. If, in hindsight, participants are unable to judge which word is a stronger associate to the cue word, then Experiment 1 might not have worked to fully address the heuristic that the correct target is always a weak associate.

In an attempt to address these concerns, we conducted an experiment similar to Koriat et al.’s (2006) study. We presented 33 participants with the word pairs and asked them, first, to judge the number of people out of 100 who would generate the target given the cue and then to categorize half of the pairs as “strong” and the other half as “weak” associates. As with Koriat et al., participants greatly overestimated the likelihood of generating the target given the cue for both strong (M = 60 %, SD = 12 %) and weak (M = 47 %, SD = 13 %) associate pairs, and they miscategorized 39 % (SD = 5 %) of the pairs as strong or weak. These findings suggest, therefore, that the subjective experience of participants in Experiment 1 did not differ as markedly as we expected for strongly and weakly associated pairs.

Another heuristic that would be easy for participants to use at the time of test in the original error generation paradigm is that almost every response they generate is incorrect. That is, if it is easy to distinguish between their generated response and the correct response at the time of test, then the one should not interfere with the other. We attempted to eliminate the use of this heuristic in Experiment 2 by rigging half of participants’ responses to be correct. If the benefit of guessing first is a result of participants using a “my-guess-is-always-wrong” heuristic, then mixing correct guesses with the incorrect guesses should eliminate this strategy and eliminate the benefit or, at least, reduce the size of the benefit, as compared with when guesses are always incorrect.