Traditionally, errors during learning have been considered detrimental for subsequent memory, stemming from behaviorist claims that effective training should minimize error production (Skinner, 1958). Consistent with this idea, Marsh, Roediger, Bjork, and Bjork (2007) reported that errors on multiple-choice tests often perseverate on subsequent tests, highlighting potential negative consequences of failed tests. Other studies investigating the errors produced on multiple-choice tests have reported contrary evidence, suggesting minimal negative consequences (Butler, Marsh, Goode, & Roediger, 2006). However, the latter studies were not optimally designed for establishing the mnemonic consequences of errors, as items were not assigned to errorful versus nonerrorful conditions; rather, analyses were conducted post-hoc on the basis of the subset of items producing errors versus another, nonmanipulated subset of items. Thus, any difference in subsequent performance may have reflected item-selection effects (e.g., due to item difficulty). Ideally, an investigation of the mnemonic consequences of incorrect responses would minimize the differences between items in the errorful versus nonerrorful conditions.

Recently, Kornell, Hays, and Bjork (2009) utilized a guessing procedure to minimize item-selection effects, in order to more clearly compare the effects of errors (in this case, incorrect guesses) versus study (see Footnote 1). To illustrate, we will describe their Experiment 3 (the one most similar methodologically to the present experiments). Their materials included weakly associated English word pairs (e.g., factory–plant). Every item received two trials, and the second trial was always a study trial. The key manipulation concerned the type of first trial (i.e., the “pretrial”). In the guess condition, participants were shown only the cue (factory–?) and were prompted to guess a possible target. Without prior study, the guesses were almost always incorrect (96 % incorrect), minimizing item-selection concerns (see Footnote 2). In the prestudy condition, both the cue and the target were presented for study. In both conditions, the pretrial was always followed immediately by the study trial for that item. The final cued-recall performance was significantly greater for guess than for prestudy items, suggesting that incorrect guesses are not inherently detrimental to performance, and can actually enhance performance to a greater degree than can additional study.

This initial demonstration that incorrect guessing can boost memory performance more than studying is intriguing, but what factors moderate the advantage of incorrect guesses over study? Kornell et al. (2009) suggested that one factor may be the extent to which meaningful associations can be formed between the cue and target information. For example, Kornell et al. consistently found the benefit of incorrect guessing over study with weakly related word pairs but failed to find the effect with trivia facts (e.g., “Who invented snow golf? Rudyard Kipling”), many of which do not easily afford a semantic connection between question and answer. Corroborating the idea that the benefit of incorrect guessing over study may be limited to materials that afford meaningful associations, Izawa, Hayden, and Franklin (2005) found that failed retrieval attempts were worse than additional study on an interim criterion test with arbitrarily related consonant–vowel–consonant (CVC)–two-digit pairs.

Other than the suggestion that the benefit of incorrect guessing over study may depend on the kind of materials, no other moderators of this effect have been proposed or systematically explored. By comparison, the testing-effect literature has shown that several factors moderate the effectiveness of successful tests over study, including the amount of test practice and the timing of subsequent study (Roediger & Karpicke, 2006). Do these factors also moderate the effectiveness of incorrect guessing over study? To find out, we conducted three experiments examining factors that may moderate the effectiveness of incorrect guessing over study.

To examine whether the number of incorrect guesses and the timing of subsequent study moderate the effectiveness of incorrect guessing over study, we adopted the guessing procedure used by Kornell et al. (2009). Each experiment included pretrials (either guess or prestudy), followed by a study opportunity. In all three experiments, the key outcome was performance on an immediate final cued-recall test. In Experiment 1, we investigated the number of incorrect guess trials (one vs. three), whereas in Experiment 2 we manipulated the timing of subsequent study following an incorrect guess (delayed vs. immediate; Fig. 1 illustrates the sequence of trials in each condition). To foreshadow, the results indicated that the effectiveness of incorrect guessing over study was moderated by the timing of study but not by the number of incorrect guesses. Experiment 3 evaluated one account of why the timing of study may moderate the effect.

Fig. 1 A diagram illustrating the sequence of trials as a function of type of pretrial and timing of study

Experiment 1

In Experiment 1, we manipulated the number of incorrect guess trials (one vs. three). Previous research has shown that increasing the number of successful tests during practice increases subsequent memory (e.g., Vaughn & Rawson, 2011). Does the same hold true for incorrect guesses? In Kornell et al. (2009), the advantage of incorrect guessing over study was observed with only one pretrial. If the number of incorrect guesses moderates the effect, one might expect an even greater advantage with three pretrials.

Method

A group of 52 undergraduates participated for course credit. The number of pretrials (one vs. three) was manipulated between participants, and the type of pretrial (guess vs. prestudy) was manipulated within participants.

The materials included 48 weakly associated English–English word pairs (D. L. Nelson, McEvoy, & Schreiber, 2004). The items were selected on the basis of low forward associative strength (.050 to .054), the criterion used by Kornell et al. (2009). For each participant, half of the items were randomly assigned to each pretrial condition.

During guess trials, only the cue was presented, and participants had 6 s to guess the corresponding target. The instructions stressed that participants must guess on each trial, but no instructions were given regarding how participants should generate a guess (e.g., the participants were not explicitly told to generate a weak associate). Participants overtly typed their guesses into a response field on the screen and were not told whether their guess was right or wrong. The computer recorded all responses so that we could eliminate correctly guessed items from the subsequent analyses. During prestudy trials, the cue and target were presented together for 6 s, and the participants were instructed to do their best to learn the pair for an upcoming memory test. In the one-trial group, each item received one pretrial (either guess or prestudy). Items in the three-trial group received one pretrial in each of three blocks. The presentation order of items was randomized anew for each participant in each pretrial block.

After all 48 items had been either guessed or prestudied one or three times, the study phase began (see the top panel of Fig. 1). During the study phase, all 48 items were presented one at a time for one 6-s study trial apiece, with presentation order randomized anew for each participant. After the study phase, participants immediately completed a self-paced final cued-recall test. The cue words were presented one at a time, and participants were instructed to recall the corresponding target. The instructions emphasized that participants should type in the correct target that they had learned during the study trial and not their initial guess response.
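The practice sequence described above (random assignment of half the items to each pretrial type, one or three pretrial blocks, then a separate study phase) can be sketched as follows. This is a minimal illustration and not the authors' experiment software: the function name `build_delayed_schedule`, the tuple layout, and the placeholder item labels are assumptions made for the example.

```python
import random

def build_delayed_schedule(items, n_pretrial_blocks, rng):
    """Return (assignment, trials) for one hypothetical participant.

    assignment maps each item to "guess" or "prestudy" (half of the items each);
    trials is a list of (phase, trial_type, item) tuples in presentation order.
    """
    shuffled = list(items)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    assignment = {item: "guess" for item in shuffled[:half]}
    assignment.update({item: "prestudy" for item in shuffled[half:]})

    trials = []
    # Pretrial phase: every item appears once per block, reshuffled for each block.
    for block in range(1, n_pretrial_blocks + 1):
        order = list(items)
        rng.shuffle(order)
        trials += [(f"pretrial block {block}", assignment[item], item) for item in order]

    # Study phase: one study trial per item, only after all pretrials are finished.
    order = list(items)
    rng.shuffle(order)
    trials += [("study", "study", item) for item in order]
    return assignment, trials

items = [f"pair_{i:02d}" for i in range(48)]  # placeholders for the 48 word pairs
assignment, trials = build_delayed_schedule(items, n_pretrial_blocks=3, rng=random.Random(0))
print(len(trials))  # 3 pretrial blocks of 48 trials plus 48 study trials = 192
```

Calling the same function with `n_pretrial_blocks=1` gives the one-trial group's schedule of 96 trials.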

Results

Items correctly guessed on the guess trials were excluded from the subsequent analyses (5.2 % and 6.4 % in the one- and three-trial groups, respectively).
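The exclusion rule can be made concrete with a short scoring sketch. The record layout (keys `condition`, `guessed_correctly`, `recalled`) and the function name `recall_percent` are hypothetical, and the records below are invented for illustration; they are not data from the experiment.

```python
def recall_percent(records, condition):
    """Percent of targets recalled on the final test for one pretrial condition.

    Each record is a dict with hypothetical keys: "condition" ("guess" or
    "prestudy"), "guessed_correctly" (bool) and "recalled" (bool).
    """
    kept = [
        r for r in records
        if r["condition"] == condition
        # Guess items whose target was guessed correctly during practice are dropped.
        and not (r["condition"] == "guess" and r["guessed_correctly"])
    ]
    if not kept:
        return None  # nothing left to score
    return 100.0 * sum(r["recalled"] for r in kept) / len(kept)

# Made-up records for illustration only.
records = [
    {"condition": "guess", "guessed_correctly": True, "recalled": True},
    {"condition": "guess", "guessed_correctly": False, "recalled": True},
    {"condition": "guess", "guessed_correctly": False, "recalled": False},
    {"condition": "prestudy", "guessed_correctly": False, "recalled": True},
]
print(recall_percent(records, "guess"))     # 50.0
print(recall_percent(records, "prestudy"))  # 100.0
```

Excluding correct guesses before scoring keeps the guess condition restricted to items on which an error was actually produced.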

Mean performance on the final cued-recall test is reported in Fig. 2. A 2 (number of pretrials) × 2 (kind of pretrial) mixed-factor ANOVA revealed a significant main effect of number of pretrials, with greater performance following three pretrials than following one, F(1, 48) = 7.40, MSE = 445.41, p = .009. Although it is not surprising that performance was greater following three prestudy trials rather than one, the finding of greater performance following three incorrect guesses rather than one is potentially surprising and intriguing (we will revisit this outcome later).

Fig. 2 Mean performance on the final cued-recall test in Experiment 1 as a function of type of pretrial and number of pretrials during practice. Error bars report standard errors of the means

Most importantly for the present purposes, the main effect of pretrial type was significant [F(1, 48) = 11.71, MSE = 88.53, p = .001], but the interaction was not [F(1, 48) = 0.424, MSE = 88.53, p = .518]: Prestudied items outperformed incorrectly guessed items in both the one-trial and three-trial groups.

Experiment 2

In contrast to Kornell et al.’s (2009) finding that incorrect guessing outperformed prestudy, we observed greater performance following prestudy than following incorrect guessing, regardless of the number of pretrials. Why was performance greater for prestudied than for guessed items? A key methodological difference between Experiment 1 and those of Kornell et al. involved the timing of study: Whereas Kornell et al. administered the study trial immediately after the pretrial, study was delayed in our Experiment 1. Experiment 2 was designed to evaluate whether the timing of study moderates the effectiveness of incorrect guessing over prestudy by manipulating the timing of study (delayed vs. immediate). If the timing of study moderates the effect, we would expect an interaction in which prestudy outperforms guessing when study is delayed (as in Exp. 1), and the opposite pattern when study is immediate (replicating Kornell et al., 2009).

Method

A group of 66 undergraduates participated for course credit. The type of pretrial (guess vs. prestudy) was manipulated within participants, and the timing of study (delayed vs. immediate) was manipulated between participants.

The materials and procedure were identical to those for the one-trial group in Experiment 1, with an exception concerning the timing of study. The timing of study for the delayed-study group was the same as in Experiment 1, but for the immediate-study group, each pretrial was immediately followed by the study trial for that item (see the bottom panel of Fig. 1). After all pretrials and study trials were completed, the participants immediately completed a final cued-recall test.
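The difference between the two timing groups comes down to where each item's study trial falls relative to its pretrial. A minimal sketch of that contrast is given below; the names `build_schedule` and `study_lags`, and the four-item toy assignment, are assumptions for illustration and are not taken from the original procedure.

```python
import random

def build_schedule(assignment, timing, rng):
    """Order the practice trials for one hypothetical participant.

    assignment maps item -> "guess" or "prestudy"; timing is "delayed" or
    "immediate". Returns a list of (trial_type, item) tuples.
    """
    order = list(assignment)
    rng.shuffle(order)
    if timing == "immediate":
        trials = []
        for item in order:
            trials.append((assignment[item], item))  # pretrial
            trials.append(("study", item))           # study trial follows at once
        return trials
    if timing == "delayed":
        pretrials = [(assignment[item], item) for item in order]
        study_order = list(assignment)
        rng.shuffle(study_order)
        # Study trials begin only after every pretrial has been completed.
        return pretrials + [("study", item) for item in study_order]
    raise ValueError(f"unknown timing: {timing!r}")

def study_lags(trials):
    """Number of trials intervening between each item's pretrial and its study trial."""
    position, lags = {}, {}
    for index, (trial_type, item) in enumerate(trials):
        if trial_type == "study":
            lags[item] = index - position[item] - 1
        else:
            position[item] = index
    return lags

assignment = {"A": "guess", "B": "prestudy", "C": "guess", "D": "prestudy"}
print(study_lags(build_schedule(assignment, "immediate", random.Random(0))))
print(study_lags(build_schedule(assignment, "delayed", random.Random(0))))
```

In the immediate schedule every lag is zero, whereas in the delayed schedule the lag depends on the two random orders and is usually several trials long.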

Results

Items correctly guessed on guess trials were excluded from the subsequent analyses (6.4 % and 6.9 % in the delayed- and immediate-study groups, respectively).

Mean final cued-recall performance is reported in Fig. 3. Neither main effect was significant (Fs < 1.45); however, the interaction was significant [F(1, 64) = 36.31, MSE = 96.25, p < .001]. In the delayed-study group, performance was greater for prestudied than for guessed items, t(33) = 3.81, p < .001. In the immediate-study group, performance was greater for guessed than for prestudied items, t(31) = 4.63, p < .001. Thus, the timing of study moderated the effectiveness of incorrect guessing over studying.
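The follow-up comparisons within each group are paired-samples t tests, for which a group of n participants yields n − 1 degrees of freedom; the reported t(33) and t(31) are therefore consistent with groups of 34 and 32, which sum to the 66 participants tested. The sketch below shows the computation itself, using only the Python standard library. The function name `paired_t` and the scores are hypothetical and are not the study's data.

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired-samples t statistic and degrees of freedom for two matched score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Made-up percent-correct scores for four hypothetical participants.
guess = [70.0, 65.0, 80.0, 75.0]
prestudy = [60.0, 62.0, 70.0, 72.0]
t, df = paired_t(guess, prestudy)
print(f"t({df}) = {t:.2f}")
```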

Fig. 3 Mean performance on the final cued-recall test in Experiment 2 as a function of type of pretrial and timing of study during practice. Error bars report standard errors of the means

Experiment 3

Why is guessing better than prestudy when subsequent study is immediate but not when it is delayed? Note that whereas the prestudied items showed a classic spacing effect (performance increased 15 % with delayed vs. immediate study), guess items did not show such an effect (performance actually decreased 6 % with delayed vs. immediate study).

Why did guess items not show a spacing effect? To facilitate our discussion of possible explanations, Table 1 illustrates plausible assumptions about what is encoded during practice in each condition. The assumptions concern two dimensions of encoding: the number of traces stored and the information contained in those traces. In the simplest case, consider the assumption about what is encoded when prestudy is followed by immediate study (for brevity, referred to as prestudy/immediate-study hereafter). Given research demonstrating that massed restudy of paired associates typically yields performance similar to that from a single study episode (e.g., Carpenter & DeLosh, 2005; T. O. Nelson & Leonesio, 1988), we assume that the prestudy/immediate-study condition functionally yields one trace containing the cue–target pair. In contrast, when guesses are followed by immediate study (guess/immediate-study), the guess is still active in working memory during study, and thus may result in the encoding of an elaborated trace in which the guess functions as a mediator between the cue and the target. Prior evidence that mediators enhance subsequent memory (e.g., Carpenter, 2011; Carpenter, Sachs, Martin, Schmidt, & Looft, 2012; Dunlosky et al., 2005; Pyc & Rawson, 2010) supports this potential explanation for why performance is greater in the guess/immediate-study condition than in the prestudy/immediate-study condition (as in Exp. 2).

Table 1 Assumptions regarding the number and content of traces encoded during practice, as a function of the timing of study and the type of pretrial

When prestudy is followed by delayed study (prestudy/delayed-study), we assume that two compatible traces of the cue–target pair are encoded, one at each presentation. When guessing is followed by delayed study (guess/delayed-study), we also assume that two traces are encoded, but that the content of the traces differs, with the trace from the guess trial containing the cue and the guess and the trace from the study trial containing the cue and the target. These conflicting traces may produce a source-monitoring problem during the final test if both are retrieved but participants are unable to remember which trace represents their guess. This explanation rests on the assumption that participants have poorer memory for their guesses in the guess/delayed-study than in the guess/immediate-study condition. Experiment 3 was designed to test this assumption.

Method

A group of 59 undergraduates participated for course credit. The design, materials, and procedure were identical to those of Experiment 2, with one modification to the final test. During each final test trial, participants were shown a cue word along with three response options: (1) “I was not asked to guess for this item during practice,” (2) “I did make a guess for this item during practice, but now I can’t remember what I guessed,” (3) “I did make a guess for this item during practice, and I remember what my guess was.” Option 3 was accompanied by a field prompting participants to type their initial guess. After making their selection (Option 1, 2, or 3), participants were shown the cue word on a new screen and were instructed to type in the correct target word for that cue.

Results and discussion

Items correctly guessed on guess trials were excluded from the subsequent analyses (7.1 % and 7.8 % in the delayed- and immediate-study groups, respectively).

Mean final cued-recall performance is reported in Fig. 4. Replicating Experiment 2, neither main effect was significant (Fs < 3.1), whereas the interaction was significant [F(1, 57) = 29.87, MSE = 100.76, p < .001]. In the delayed-study group, performance was greater for prestudied than for guessed items, t(29) = 4.61, p < .001. In the immediate-study group, performance was greater for guessed than for prestudied items, t(28) = 3.42, p = .002.

Fig. 4 Mean performance on the final cued-recall test in Experiment 3 as a function of type of pretrial and timing of study during practice. Error bars report standard errors of the means

Was this pattern due to differences in memory for guesses? Several outcomes reported in Table 2 suggest not. First, participants in the two groups were similarly accurate in identifying the items for which they had generated a guess: Participants in both groups chose Option 1 (“I was not asked to guess for this item during practice”) for a similarly sizeable majority of the prestudy items (F < 1) and for a similarly minimal number of guess items (F < 1).

Table 2 Mean number of items for which each response option was selected, as a function of timing of study and type of pretrial in Experiment 3

Given that participants in the two groups accurately remembered when they had guessed, were they similarly likely to remember their original guess response? Of greatest interest are the conditional outcomes reported in the last row of Table 2. On those trials in which participants actually guessed and remembered guessing, the likelihoods of correctly remembering their original guess responses were equally high across both conditions (F < 1). Overall, these results cast doubt on a source-monitoring account of our findings.
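One way to read the conditional outcome described above is as a proportion computed over guess items only: the denominator is the number of guess items that participants recognized as guessed (Option 2 or 3), and the numerator is the number of those on which the typed guess matched the original. The sketch below follows that reading, which is an assumption about how the measure was scored. The function name `guess_memory_rate`, the record keys, and the example trials are all hypothetical.

```python
def guess_memory_rate(trials):
    """Proportion of recognized guess trials on which the original guess was reproduced.

    Each trial is a dict with hypothetical keys: "option" (1, 2 or 3),
    "typed_guess" (str or None) and "original_guess" (str).
    """
    # Keep only guess items that the participant reported having guessed.
    remembered_guessing = [t for t in trials if t["option"] in (2, 3)]
    if not remembered_guessing:
        return None
    correct = sum(
        1 for t in remembered_guessing
        if t["option"] == 3
        and (t["typed_guess"] or "").strip().lower() == t["original_guess"].strip().lower()
    )
    return correct / len(remembered_guessing)

# Made-up final-test responses for four guess items from one hypothetical participant.
trials = [
    {"option": 1, "typed_guess": None, "original_guess": "worker"},
    {"option": 2, "typed_guess": None, "original_guess": "machine"},
    {"option": 3, "typed_guess": "Smoke", "original_guess": "smoke"},
    {"option": 3, "typed_guess": "car", "original_guess": "steel"},
]
print(guess_memory_rate(trials))  # one correct reproduction out of three recognized guesses
```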

General discussion

Kornell et al. (2009) demonstrated an advantage of incorrect guessing over study for subsequent memory. Three experiments investigated potential moderators of this effect. Experiment 1 showed that the number of incorrect guesses did not moderate the effect of incorrect guessing versus study and revealed a surprising reversal of the pattern observed by Kornell et al. (2009). Experiment 2 reconciled the apparent inconsistency by showing that the timing of the subsequent study trial moderated the benefit of incorrect guessing over study, largely because prestudy items showed a classic spacing effect, whereas guess items did not. Experiment 3 provided evidence against a source-monitoring account of this pattern, revealing that participants in both groups were similarly likely to correctly remember when and what they had originally guessed.

Given the evidence against the source-monitoring account, what other accounts might explain our findings? For the source-monitoring account, we proposed the possibility that the different traces in the guess/delayed-study condition (see Table 1) might both be retrieved and cause subsequent source-monitoring errors during the final test. A related possibility arises from classic AB–AD interference research, in which an originally learned A–B pair interferes with memory for a subsequently presented A–D pair (e.g., Postman, 1962; Postman & Gray, 1977). By analogy, in the present research, the cue–guess trace encoded during guess trials may have proactively interfered with retrieval of the cue–target trace at the time of the final test.

Another possible account concerns the differential involvement of mediators. Earlier, we proposed a mediator-based account regarding the advantage of the guess/immediate-study condition over the prestudy/immediate-study condition (see Table 1). To revisit, we assumed that with immediate study, the guess is still active in working memory and can be encoded as a mediator of the cue–target pair. Likewise, individuals in the guess/delayed-study condition may have covertly retrieved their original guess during study and also encoded it along with the cue–target pair. Assuming that this occurred on some proportion of trials would explain the numerical advantage of the guess/delayed-study condition over the prestudy/immediate-study condition. However, encoding the guess as a mediator would presumably occur less frequently in the guess/delayed-study condition than in the guess/immediate-study condition, due to the additional difficulty entailed in covertly retrieving the original guess response after a delay, potentially explaining the numerical disadvantage of the guess/delayed-study versus guess/immediate-study condition. Furthermore, one could assume that retrieving the guess during delayed study was more likely after guessing three times than after guessing once. If so, enhanced mediator encoding may also provide an explanation for why final test performance in Experiment 1 was greater after three than after one guess trial. Finally, a mediator-based account would explain why the effectiveness of guessing over prestudying may be limited to semantically related materials (as suggested by Kornell et al., 2009).

Although we did not directly test this mediator-based account, the outcomes from exploratory post-hoc analyses are suggestive. In brief, we compared the final cued recall for trials in which participants said that they had guessed and correctly remembered their guess response versus trials in which they said that they had guessed but did not actually remember their guess. Final cued-recall performance was greater when participants correctly remembered versus failed to remember their original guess (78 % vs. 62 %), t(47) = 3.09, p = .003. Future research could test the mediator-based account by cuing individuals with their guess responses to retrieve the corresponding targets at final test (cf. Mäntylä, 1986; Pyc & Rawson, 2010). Presumably, guesses would be more effective cues in the guess/immediate-study than in the guess/delayed-study condition. Furthermore, this account predicts that the effectiveness of guess cues would vary as a function of the number of previous incorrect guesses.

In the method used by Kornell et al. (2009) and in the present work, incorrect guesses were induced by initiating guess trials without prior study. Similarly, encouraging students to guess without a prior study opportunity is a common device in educational practice (e.g., questions at the beginning of textbook chapters). Would our results apply to scenarios in which a prior study opportunity had occurred, but with low levels of initial performance? To revisit a result mentioned earlier, Izawa et al. (2005) had participants first study a list of CVC–two-digit pairs before initiating either repeated studying or repeated testing. The repeated study and test trials were administered in blocks, functionally delaying the timing of subsequent study before an interim criterion test for both sets of items. As in the results from Kornell et al. and the present experiments, initial test performance for the repeated test items was very low (failure rate of approximately 95 % across test trials). Despite the initial study phase, performance on a subsequent interim test was greater for the studied than for the tested items, mirroring our results (see Footnote 3).

Our conclusion that prestudying is more effective than guessing when the timing of study is delayed might seem inconsistent with results reported by Richland, Kornell, and Kao (2009). In their study, participants in the guess group were given 2 min to answer five prequestions and then 8 min to study the corresponding text, whereas participants in the study-only group were given 10 min to study. Final test performance was greater in the guess group than in the study-only group with delayed study (i.e., with other prequestions or text material intervening between generating an answer to a given prequestion and encountering the answer in the text). However, the timing of encounters with the target information differed in the study-only group, and the number of times that learners read the target sentences within the allotted study time was not recorded (nor was the amount of time spent on target sentences or the spacing of rereading, if any occurred). Thus, these conditions do not map cleanly onto those used here and in Kornell et al. (2009).

In sum, the timing of study moderates the effectiveness of guessing over prestudy. Incorrect guesses are not inherently detrimental to performance, and can even enhance performance, but they do so much less when subsequent study is delayed. Thus, we have identified one of relatively few conditions in which studying outperforms testing.