Introduction

Intentional forgetting plays an important role in cognition by preventing unwanted thoughts from monopolizing our limited-capacity cognitive resources (e.g., Fawcett & Hulbert, 2020). However, recent evidence suggests that some memories are less easily forgotten than others (e.g., Hauswald et al., 2010). Emotional experiences in particular appear to be more resilient than neutral experiences (Bailey & Chapman, 2012; Hauswald et al., 2010), although this finding has been inconsistent (e.g., Yang et al., 2012). The purpose of the current meta-analysis was to synthesize research exploring the intentional forgetting of emotional material within the context of item-method directed forgetting to determine (a) the typical magnitude and interstudy variability of item-method directed forgetting, (b) whether emotional items are less affected by instructions to forget, and (c) whether the magnitude of any such difference is moderated by specific study characteristics.

Although several paradigms have been used to study intentional forgetting (for a review, see Anderson & Hanslmayr, 2014), we intend to focus exclusively on item-method directed forgetting. In this paradigm, participants are presented with a series of items, each followed by an instruction to Remember (R) or Forget (F). After the presentation of all items, memory performance is tested using recall or recognition. Typically, performance is greater for R than F items; this is known as the directed forgetting effect and has often been explained by the selective encoding of R items at study (MacLeod, 1998). Specifically, participants engage in maintenance rehearsal until a memory instruction is received. Following an R instruction, they elaboratively rehearse the item, whereas following an F instruction they cease rehearsal, eliminating the item from the rehearsal set (Basden et al., 1993; Basden & Basden, 1998; Jing et al., 2019).

The manner in which F items are eliminated from the rehearsal set has become a topic of debate, with some arguing for the involvement of one or more effortful processes (e.g., Fawcett & Taylor, 2008; Zacks et al., 1996). Fawcett and Taylor (2008) demonstrated that intentional forgetting requires more effort than remembering in the period immediately following the memory instruction, as evidenced by slower secondary probe responses (see also Fawcett et al., 2016; but see Lee, 2018). Later work demonstrated processing resources to be re-allocated away from the representation of the F item in working memory during this same period (e.g., Fawcett & Taylor, 2012; Taylor, 2005; Taylor & Fawcett, 2011). Supporting this idea, neuroimaging studies have revealed frontal brain regions distinct from those involved in incidental forgetting are involved in intentional forgetting (for a review, see Anderson & Hanslmayr, 2014; see also Rizio & Dennis, 2013; Gallant et al., 2018). Notably, others have argued that intentional forgetting is a natural consequence of selectively rehearsing the R items while passively disregarding the F items (Basden & Basden, 1996).

Although directed forgetting (DF) is robust using neutral stimuli such as common nouns (e.g., Bjork, 1970; MacLeod, 1975; MacLeod, 1989; Woodward & Bjork, 1971), objects (e.g., Quinlan et al., 2010), videos (e.g., Fawcett et al., 2013), or visual scenes (e.g., Hauswald & Kissler, 2008), there are a growing number of studies using emotionally valenced materials (e.g., Bailey & Chapman, 2012; Berger et al., 2018; Yang et al., 2016). This line of research provides an important link to real-world applications, as keeping unwanted memories from coming to mind is paramount to our cognitive well-being (e.g., Fawcett & Hulbert, 2020). For this reason, the present study evaluates whether the same process that helps us intentionally forget neutral experiences, such as an outdated phone number, may also be applied to emotional experiences.

Past experiments exploring this question in the context of item-method directed forgetting have proven inconsistent. Some studies have found no difference in the magnitude of DF for emotional and neutral words (e.g., Berger et al., 2018; Gallant & Yang, 2014), whereas others have found a smaller or non-significant effect for emotional words (e.g., Bailey & Chapman, 2012; Yang et al., 2016) or sentences (Lee & Hsu, 2013). Studies using emotionally valenced images have produced similarly mixed results, with some showing no difference in the magnitude of DF for emotional and neutral images (e.g., Quinlan & Taylor, 2014; Taylor et al., 2018; Yang et al., 2012), and others showing a smaller or non-significant effect for emotional images (Hauswald et al., 2010; Nowicka et al., 2011; Zwissler et al., 2011).

Arousal has been proposed as one explanation for these discrepancies (e.g., Gallant & Dyson, 2016; Hauswald et al., 2010). Matching valence conditions for arousal can equate DF for emotional and neutral items (e.g., Gallant et al., 2018; Yang et al., 2012), whereas easing these constraints results in reduced DF for the arousing stimuli (e.g., Hauswald et al., 2010). This fits with research demonstrating that arousal influences memory to a greater extent than valence (Dolcos et al., 2004; Szőllősi & Racsmány, 2020). However, the impact of arousal on DF has not always been replicated, with some studies observing no effect of emotion even when the emotional stimuli were more arousing than the neutral stimuli (e.g., Taylor et al., 2018).

Although there is no consensus as to whether DF is reduced for emotional items, there are several reasons we would predict such a reduction. Emotional items tend to be processed faster (Kissler & Herbert, 2013), capture attention more easily (Hindi Attar & Müller, 2012), and lead to enhanced memory (Adelman & Estes, 2013). Increased processing of items – preceding (Hourihan & Taylor, 2006) or following (Lee et al., 2007) the memory instruction – reduces item-method DF (e.g., Hauswald et al., 2010; Hourihan & MacLeod, 2008). Consistent with this finding, neuroimaging studies using negative pictures have identified a negative correlation between neural markers of enhanced encoding and the magnitude of the directed forgetting effect (Hauswald et al., 2010).

In summary, the purpose of the current meta-analysis was to (a) determine whether emotional memories are truly more resistant to intentional forgetting than neutral memories, and (b) investigate whether the magnitude of any such difference is moderated by study characteristics, including whether the emotional items are more arousing than the neutral items.

Method

Literature search

We conducted a search of the online resources Google Scholar (full-text searched), PsycINFO, PsycARTICLES, and PubMed (title, keywords, and abstract searched) using the following Boolean search phrase: ("item-method" OR "item method") AND ("directed forgetting" OR "intentional forgetting") AND ("emotion" OR "emotional" OR "valence" OR "negative" OR "positive"). The search was conducted until November 2018, restricted to English-language articles, and supplemented by reference review and expert consultation. Authors of included studies were also contacted for raw data and access to missed or unpublished studies.

Study inclusion criteria

Articles reporting at least one estimate of item-method directed forgetting as measured by recall or recognition, within a non-clinical population, using emotional images, words, faces, or events were considered for inclusion. Articles were excluded if they (a) used only clinical samples; (b) reported no experimental data; (c) used a different task (e.g., list-method directed forgetting); (d) reported only samples with a mean age < 17 years or > 40 years; (e) did not include an emotional condition; (f) were in a language other than English; (g) were inaccessible and the corresponding author failed to provide a copy; (h) reported only an animal model; (i) reported data from an already included source; and/or, (j) provided insufficient information to calculate effect sizes. Studies including multiple comparisons, only some of which were eligible for inclusion (e.g., an article with both young adult and elderly samples) were still included, although only the eligible comparisons were incorporated into our analyses. All studies intermixed the emotional and neutral items, rather than presenting them in blocked or pure lists.

Data extraction

The first author coded each article in consultation with the remaining authors; all coding decisions were documented and discussed until a consensus emerged. In this manner we also coded methodological features for use as moderators, including the stimulus type (words, other), memory task (recall, recognition), average valence and arousal of the stimulus set, whether participants engaged in a recall task prior to the recognition task (if included), list size (the total number of items included in the study phase), whether buffer items were included preceding and/or following the study items (buffers, no buffers), and whether or not a secondary measure (e.g., EEG) was gathered concurrent to the study portion of the task. Because the scale used for valence and arousal varied across studies, these values were standardized by dividing each by the maximum possible value for that study, producing a value ranging from 0 to 1. Moderator analyses applied to differences in the magnitude between neutral and emotional conditions used instead the difference in the arousal and valence ratings for those conditions (reflecting the degree to which emotional items were more arousing or valenced than the neutral items).
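The rating standardization described above can be sketched as follows. This is a minimal illustration only: the function names, the 9-point scale, and the ratings are hypothetical, and the direction of the difference (emotional minus neutral) is an assumption based on the description of the moderator as the degree to which emotional items were more arousing.

```python
def scale_rating(mean_rating, scale_max):
    """Divide a mean rating by the maximum possible value of its scale."""
    if scale_max <= 0:
        raise ValueError("scale_max must be positive")
    return mean_rating / scale_max


def rating_difference(emotional, neutral, scale_max):
    """Scaled emotional rating minus scaled neutral rating (assumed direction)."""
    return scale_rating(emotional, scale_max) - scale_rating(neutral, scale_max)


# Hypothetical mean arousal ratings collected on a 9-point scale.
emotional_arousal = scale_rating(6.3, 9)
neutral_arousal = scale_rating(3.6, 9)
arousal_gap = rating_difference(6.3, 3.6, 9)
```

Scaling in this way places ratings from studies with different response scales on a shared metric before they enter a moderator model.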

Effect size calculation

Effect sizes were calculated as raw mean differences using the equations appropriate for within-subject designs provided by Borenstein et al. (2010, Chapter 4). We first calculated differences estimating the magnitude of DF (R – F) separately for the neutral, negative, and positive conditions within each experiment. A difference was then calculated for the combination of the negative and positive conditions, which we refer to as the emotional condition. We next calculated the difference in the magnitude of DF between conditions. We did this by subtracting performance for the F items from performance for the R items within each condition and then subtracting the resulting values from one another. For comparisons aggregating DF within each valence condition, positive values indicate greater performance for R items compared to F items; for comparisons between the magnitude of DF across valence conditions (e.g., Neutral – Emotional), positive values indicate greater DF in the neutral condition, except for the comparison between the positive and negative conditions (i.e., Positive – Negative) wherein positive values indicate greater DF in the positive condition. Our focus was on the combined (positive + negative) emotional condition, as it made best use of the available data, and because each of the comparisons between the neutral and remaining conditions produced qualitatively similar results.
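As a rough illustration of these calculations, the sketch below uses the standard paired-design formula for a raw mean difference, in which the standard deviation of the difference scores is sqrt(S1^2 + S2^2 - 2*r*S1*S2) and the standard error is that value divided by sqrt(n). The correlation r between conditions, the sample values, and all names are hypothetical; the source does not state how correlations were obtained.

```python
import math
from typing import NamedTuple


class PairedDifference(NamedTuple):
    effect: float   # raw mean difference (a - b), in percentage points
    sd_diff: float  # standard deviation of the difference scores
    se: float       # standard error of the mean difference


def paired_raw_difference(mean_a, mean_b, sd_a, sd_b, n, r):
    """Raw mean difference for a within-subject design.

    r is the correlation between the two sets of scores (an assumed input).
    """
    sd_diff = math.sqrt(sd_a**2 + sd_b**2 - 2 * r * sd_a * sd_b)
    return PairedDifference(mean_a - mean_b, sd_diff, sd_diff / math.sqrt(n))


# Hypothetical study with n = 25: DF (R - F) within each valence condition.
neutral = paired_raw_difference(70.0, 50.0, 20.0, 20.0, n=25, r=0.5)
emotional = paired_raw_difference(68.0, 56.0, 20.0, 20.0, n=25, r=0.5)

# Neutral - Emotional contrast: positive values mean greater DF for neutral items.
# The correlation between the two DF scores (0.5 here) is again an assumption.
contrast = paired_raw_difference(
    neutral.effect, emotional.effect, neutral.sd_diff, emotional.sd_diff, n=25, r=0.5
)
```

The same function is reused for the contrast because a difference between two within-subject difference scores is itself a paired difference.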

For recall, the mean percentage of items recalled in each condition was used; for recognition, we instead used the percentage of “hits” for old items. Our preference was to analyze recall accuracy and “hits” because they share a common scale and could therefore be aggregated using raw difference scores. Nonetheless, supplementary analyses were also undertaken to verify whether findings observed in the main analyses were also present using a measure of sensitivity (d’). We used values of d’ reported in-text or derived from raw data where possible, with the remaining values calculated using aggregate hits and false alarms.
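For readers unfamiliar with the sensitivity measure, d’ is conventionally computed as the z-transformed hit rate minus the z-transformed false alarm rate. The sketch below shows that calculation from aggregate rates; the rates are hypothetical, and any correction for rates of exactly 0 or 1 (which the inverse normal cannot handle) is left out because the source does not describe one.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity: z(hit rate) - z(false alarm rate).

    Rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical aggregate rates with one false alarm rate shared by R and F items.
hits_r, hits_f, false_alarms = 0.80, 0.65, 0.20
dprime_r = d_prime(hits_r, false_alarms)
dprime_f = d_prime(hits_f, false_alarms)
df_dprime = dprime_r - dprime_f  # directed forgetting expressed in d' units
```

With a shared false alarm rate, the false alarm term cancels in the R minus F difference, so the d’ version of the effect depends only on the two hit rates.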

Throughout, standard deviations were required to estimate the standard error of the difference scores. In cases where standard deviations were unavailable from the text or study’s authors, we imputed the relevant value as the average of the standard deviations available for that dependent measure. Only a small number of standard deviations were imputed in this manner for our analysis of recall accuracy and recognition “hits”; however, few studies reported d’ directly, requiring us to impute most of the standard deviations used in the calculation of those models. For that reason, additional sensitivity analyses were undertaken for our d’ models weighting studies by sample size. Sample-size weighted models produced results qualitatively similar to the models using standard errors.
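The imputation rule described above amounts to replacing each missing standard deviation with the mean of the available ones for the same dependent measure. A minimal sketch, with hypothetical values and function name:

```python
from statistics import mean


def impute_missing_sds(sds):
    """Replace missing standard deviations (None) with the mean of the available ones."""
    available = [sd for sd in sds if sd is not None]
    if not available:
        raise ValueError("at least one standard deviation must be available")
    fill = mean(available)
    return [fill if sd is None else sd for sd in sds]


# Hypothetical standard deviations for one dependent measure; None marks a gap.
reported = [18.0, None, 22.0, 20.0]
completed = impute_missing_sds(reported)
```

Because the imputed value is the same for every gap, heavy imputation flattens differences in precision between studies, which is one reason to also check models weighted by sample size.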

Following calculation of the effect sizes, separate Bayesian random- and mixed-effects models were generated using the brms 2.9.0 (Bürkner, 2017, 2018) package in R 3.6.1 (R Core Team, 2018). Each model incorporated random intercepts accounting for variability across samples and dependencies across measured effects. Because several studies reported both recall and recognition, we included a random slope in our analysis of task effects. Models were fit and evaluated for convergence using standard practices (e.g., R-hat < 1.01; Gelman & Hill, 2007). We direct interested readers to past work from our laboratory for more information on our modelling approach (e.g., Fawcett et al., 2016; Fawcett & Ozubko, 2016).

The priors for our primary analyses are summarized below, but more detailed information is available from the senior author. When comparing the magnitude of directed forgetting within a given valence, our prior expectations relating to the intercept of each model assumed that the average effect in a typical sample should range between -30% and 30%. We further assumed the standard deviation pertaining to random effects should range between 0% and 30%; this broadly permits the “true” effect within any given sample to vary anywhere from -90% to 90%. Our priors for slopes within the moderator models were represented by a normal distribution centred at 0 with a standard deviation of 30.

For the comparison of the magnitude of directed forgetting across valences, our prior expectations relating to the intercept of each model assumed that the average effect in a typical sample should range somewhere between -20% and 20%. We further assumed that the standard deviations pertaining to random effects should range between 0% and 20%; this broadly permits the “true” effect within any given sample to vary anywhere from -60% to 60%. Our priors for slopes within the moderator models were represented by a normal distribution centred at 0 with a standard deviation of 20.
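One reading that reproduces the stated bounds is that the most extreme plausible intercept is combined with two standard deviations of the largest plausible random-effect spread. This is an interpretation rather than something the text states, and the function name and the factor of two are assumptions made for illustration.

```python
def implied_sample_range(intercept_bound, max_random_sd, n_sd=2):
    """Range of 'true' sample effects implied by the prior bounds.

    Assumes the extreme intercept is shifted by n_sd random-effect standard deviations.
    """
    half_width = intercept_bound + n_sd * max_random_sd
    return -half_width, half_width


within_valence = implied_sample_range(30, 30)  # priors for DF within a valence
across_valence = implied_sample_range(20, 20)  # priors for DF across valences
```

Under that assumption the two sets of priors yield ranges of plus or minus 90% and plus or minus 60%, matching the values given in the text.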

Due to inconsistent reporting across studies, we fit a separate model for each moderator. Further, heterogeneity was quantified within each of our non-moderator models using prediction intervals (IntHout et al., 2016), which reflect the range of probable “true” effects that would be expected should a new study be conducted like those included in the analysis. For each moderator, Bayesian p-values were calculated reflecting our confidence in the direction of the observed effect (e.g., a value of .95 pertaining to a positive effect would indicate that we are 95% confident that the effect is positive).
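Both summaries can be computed directly from posterior draws. The sketch below is a simplified illustration and not the authors' code: the draws are simulated rather than taken from a fitted model, and the names are hypothetical. The Bayesian p-value is taken as the share of draws on the expected side of zero, and the prediction interval is obtained by simulating the effect in a new study from each draw of the average effect and the between-study standard deviation.

```python
import random


def probability_of_direction(draws, positive=True):
    """Share of posterior draws falling on the hypothesised side of zero."""
    if positive:
        return sum(d > 0 for d in draws) / len(draws)
    return sum(d < 0 for d in draws) / len(draws)


def prediction_interval(mu_draws, tau_draws, level=0.95, seed=1):
    """Interval for the 'true' effect in a new study, one simulated effect per draw."""
    rng = random.Random(seed)
    new_effects = sorted(rng.gauss(mu, tau) for mu, tau in zip(mu_draws, tau_draws))
    last = len(new_effects) - 1
    lower = new_effects[int((1 - level) / 2 * last)]
    upper = new_effects[int((1 + level) / 2 * last)]
    return lower, upper


# Simulated stand-ins for posterior draws of the average effect (mu) and the
# between-study standard deviation (tau); a fitted model would supply the real ones.
rng = random.Random(2024)
mu_draws = [rng.gauss(4.0, 1.5) for _ in range(4000)]
tau_draws = [abs(rng.gauss(4.0, 1.0)) for _ in range(4000)]

p_positive = probability_of_direction(mu_draws)
pi_low, pi_high = prediction_interval(mu_draws, tau_draws)
```

The prediction interval is wider than the interval around the average effect because it adds between-study variability to the uncertainty in the average.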

Results

Description of studies

Of the 607 studies identified, 31 were included in the final sample (see Table 1), providing 36 effect sizes (see Fig. 1). References contributing to our analyses are marked by an asterisk (*).

Table 1 Study characteristics
Fig. 1 Meta-analysis inclusion flowchart

Directed forgetting for neutral and emotional conditions

As depicted in Fig. 2, DF was numerically largest in the neutral condition, whereas positive and negative effects were of similar magnitude. These comparisons are explored further below.

Fig. 2 Mean directed forgetting effect (%) as a function of valence (neutral, emotional, negative, positive) and task type (recall, recognition). Note. Yellow Circles: Recall, Blue Triangles: Recognition. Symbols and error bars represent posterior estimates and their corresponding 95% confidence intervals. Xs represent the empirical values reported in the relevant article. Symbol size is scaled to reflect relative sample size. Estimates provided in the bottom panel represent aggregate effects; in this panel, thick lines reflect 95% confidence intervals and thin lines reflect 95% prediction intervals. Data are sorted in descending order according to the effect for the neutral condition

Despite clear evidence of an effect in a “typical” study, prediction intervals revealed heterogeneity across studies. Of these prediction intervals, all but the positive valence condition excluded negative values, indicating that the “true” effect for most studies with methods like those in the present analysis should indicate at least some degree of DF, although some of those effects may be close to 0%.

As summarized in Table 2, the effect of each moderator was consistent across the neutral and emotional conditions. In particular, the magnitude of DF was greater for comparisons (a) using words rather than other stimuli; (b) using recall rather than recognition (this was less convincing for the neutral condition); (c) using recognition that was preceded by recall; (d) using shorter lists; and (e) for which the study phase included buffers. Minimal evidence was observed supporting the remaining moderators. These effects persisted when measured using d’, although the effect of list size was no longer credible for either the neutral or emotional model (confidence dropped to .88 and .83, respectively).

Table 2 Moderators influencing the magnitude of the directed forgetting effect (%) for neutral and emotional items

Comparing the magnitude of directed forgetting across conditions

As depicted in Fig. 3, DF tended to be smaller for emotional than neutral items but did not differ between negative and positive items. Prediction intervals again revealed heterogeneity between studies. Of particular interest, prediction intervals for the neutral-emotional comparison revealed the probable “true” effects for studies with methods like those in the present sample would be expected to range from as low as -5.3% to as high as 13.7%. That is to say that whereas we would expect most (75%) studies to support the claim that emotional stimuli are less likely than neutral stimuli to be forgotten intentionally, this may not always be the case: For the remaining studies, similar or even slightly superior DF would be expected for emotional items.

Fig. 3 Mean difference in the magnitude of the directed forgetting effect (%) as a function of valence contrast (neutral-emotional, neutral-negative, neutral-positive, positive-negative) and task type (recall, recognition). Note. Yellow Circles: Recall, Blue Triangles: Recognition. Symbols and error bars represent posterior estimates and their corresponding 95% confidence intervals. Xs represent the empirical values reported in the relevant article. Symbol size is scaled to reflect relative sample size. Estimates provided in the bottom panel represent aggregate effects; in this panel, thick lines reflect 95% confidence intervals and thin lines reflect 95% prediction intervals. Data are sorted in descending order according to the effect for the neutral – emotional comparison

Table 3 provides insight into when we might expect emotional material to be less likely to be forgotten intentionally. Differences between emotional and neutral items were greatest (a) when the emotional items were more arousing than the neutral items, and (b) when the study phase included buffers. As depicted in Fig. 4, the difference in DF for neutral and emotional items was predicted to be numerically equivalent when arousal was matched perfectly, M = 0.0%, CI95% [-5.5, 5.6].

Table 3 Moderators influencing differences in the magnitude of the directed forgetting effect (%) between neutral and emotional conditions
Fig. 4 Difference in the magnitude of the directed forgetting effect (%) between the neutral and emotional conditions as a function of the difference in arousal (neutral – negative) and task type (recall, recognition). Note. Yellow Circles: Recall, Blue Triangles: Recognition. X-axis indicates the mean difference in (scaled) arousal between conditions. The Y-axis indicates the difference in the magnitude of directed forgetting between conditions (%). The left panel represents the effect fitted as a linear function; the right panel represents the effect fitted as a non-linear function using thin-plate regression splines. Symbol size is scaled to reflect relative sample size

However, inspection of Fig. 4 also reveals two influential effects (Gallant et al., 2018; Gallant & Yang, 2014) with arousal ratings matched more closely than the remaining data. These points were confirmed as multivariate outliers using the minimum covariance determinant (Fauconnier & Haesbroeck, 2009). Their influence was addressed in two ways. First, the model was refit excluding those studies; this produced a similar, albeit stronger relation (confidence in the effect became .99). Second, a non-linear model was fit to the data using thin-plate regression splines (Wood, 2003), permitting the possibility that performance would asymptote or even invert as arousal was matched between conditions. As also depicted in Fig. 4, the non-linear model predicted an initial, rapid decline, leveling off close to 0 (no difference).

Although our analyses of d’ produced a comparable pattern of differences in the overall magnitude of DF between the emotional and neutral conditions, the same was not true for the moderator analyses. Despite the directionality of the effects matching those described in the preceding paragraph, none of the moderators were credible. This could be because the analysis of d’ was based on far fewer studies than the overall models. Alternatively, it is possible that the moderators were driven by differences in response bias. We view this possibility as relatively less likely: because DF was most often measured as the difference in “hits” for R and F items using a shared false alarm rate, there was little opportunity for response bias to contribute.

Tests of publication bias

Although our analyses provide clear evidence both of a DF effect within each valence condition and a reduction in the magnitude of DF for emotional (relative to neutral) material, it remains possible that the magnitude of these effects might be driven in part by publication bias, whereby non-supportive findings (particularly with small sample sizes) are preferentially not published. To evaluate this possibility, a series of regression tests were undertaken using the (scaled) standard error or sample size as a moderator, conducted separately for each dependent measure and valence condition (mirroring our moderator analyses, we have limited our models to the neutral and emotional conditions). This approach is analogous to the regression tests included in the regtest function of the metafor package (Viechtbauer, 2010; see also Sterne & Egger, 2005).
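The logic of such a regression test can be illustrated with a simpler, non-Bayesian stand-in: a weighted least-squares regression of each study's effect on its standard error, with inverse-variance weights. This is a swapped-in simplification of the models described above, and the data and function name are hypothetical.

```python
def weighted_regression(x, y, weights):
    """Weighted least-squares fit of y on x; returns (intercept, slope)."""
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical directed forgetting effects (%) and their standard errors.
standard_errors = [2.0, 3.0, 4.0, 5.0, 6.0]
effects = [10.0, 12.0, 14.0, 16.0, 18.0]
inverse_variance = [1 / se**2 for se in standard_errors]

intercept, slope = weighted_regression(standard_errors, effects, inverse_variance)
```

A positive slope means that less precise studies tend to report larger effects, the pattern expected if small non-supportive studies go unpublished.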

Three of the possible 12 tests demonstrated evidence of a relation between either standard error or sample size and the aggregate effect. Standard error, B = 7.8, CI 95% [5.4, 10.3], and sample size, B = -2.3, CI 95% [-4.5, 0.0], predicted the magnitude of DF in the emotional condition, whereas standard error, B = 10.1, CI 95% [7.2, 13.3], but not sample size, B = -1.2, CI 95% [-5.4, 2.1], credibly predicted the magnitude of DF in the neutral condition. Inspection of the funnel plots pertaining to each of these analyses demonstrated an apparent tendency for small, imprecise, supportive studies to be published at a rate greater than small, imprecise, non-supportive studies. Importantly, whereas this might suggest some degree of publication bias in our estimate of the magnitude of DF, the neutral and emotional conditions appeared to be influenced to a similar degree. Supporting this assertion, neither standard error, B = 1.5, CI 95% [-1.1, 4.1], nor sample size, B = -0.9, CI 95% [-5.1, 3.5], credibly predicted the comparison between emotional and neutral conditions. None of the conducted tests revealed evidence of small-sample effects in our analyses of d’ (all remaining Bayesian p-values indicated < 75% confidence that the slope tended in a direction suggestive of publication bias). Funnel plots corresponding to each of the regression tests are provided in the Online Supplementary Material.

Discussion

The current analysis addressed whether emotional memories are more resilient to intentional forgetting than neutral memories. Initial models demonstrated significant DF for each condition, although these effects were moderated by methodological features. We also demonstrated reduced DF on average for emotional relative to neutral items, with minimal difference between positive and negative items. Importantly, our data suggest that reduced DF for emotional memories may be driven in part by differences in arousal. Owing to their greater theoretical implications, we begin our discussion with the between-valence comparisons.

Comparing the magnitude of directed forgetting between neutral and emotional items

The present meta-analysis revealed DF to be of lesser magnitude for emotional than neutral items, resolving a previous debate exhibiting equivocal evidence for (Yang et al., 2016) or against (e.g., Taylor et al., 2018) such an effect. In terms of the mechanisms driving this difference, there are several possibilities. First, emotion leads to less efficient performance on tasks requiring cognitive control (e.g., the Stroop or stop-signal inhibition tasks; Rebetez et al., 2015; Song et al., 2017), so it is possible that active control mechanisms evoked during F trials might have been similarly disadvantaged (e.g., Fawcett & Taylor, 2008). Emotional stimuli are also encoded in a manner that is both preferential and more automatic than neutral stimuli (Kissler & Herbert, 2013; Knott et al., 2018; Minor & Herzmann, 2019), potentially converting the maintenance rehearsal thought to occur between item and instruction onset into something more elaborate, frustrating disengagement upon receipt of an F instruction. Related to this point, emotional items also tend to be more semantically interconnected and distinctive than neutral items (Talmi et al., 2007; Talmi & Moscovitch, 2004), either of which would be expected to undermine forgetting (e.g., Golding et al., 1994; Hourihan & MacLeod, 2008). Regardless of the mechanism, present findings suggest that not only are emotional memories typically easier to remember, they are also more difficult to forget intentionally.

Despite demonstrating reduced DF relative to neutral items, present findings suggest no difference between positive and negative items. This was unexpected as past research has suggested that negative items are harder to forget intentionally than positive items (Gallant & Dyson, 2016; Otani et al., 2012) and enjoy a stronger enhancement in memory (e.g., Inaba et al., 2005; Szőllősi & Racsmány, 2020). Migita et al. (2011) found such enhancements occur mostly due to pre-attentive processes at item presentation that might undermine DF, as discussed previously. However, Migita et al. (2011) found no enhancement for positive compared to neutral items. The current analysis conflicts with this finding, indicating instead that either valence produces DF of similar magnitude.

Moderators influencing the magnitude of directed forgetting between valence conditions

The present findings offer preliminary support for arousal as one possible explanation for the inconsistent effect of emotion across studies (e.g., Hauswald et al., 2010; Taylor et al., 2018). For example, Bailey and Chapman (2012) compared neutral, negative, and positive items at high and low arousal levels and found reduced DF for high-arousal emotional items in their initial study. Gallant et al. (2018) used arousal-matched items and found no effect of valence on DF. This is not surprising as emotionally arousing information tends to be moved into memory relatively automatically (Kensinger & Corkin, 2004) and has been shown to contribute to greater memory enhancement than emotional valence (Szőllősi & Racsmány, 2020). Emotionally arousing stimuli attract more attention at encoding than do neutral stimuli (e.g., Simola et al., 2013), and this additional attention may be one of the primary factors in predicting an immediate memory enhancement (Talmi, 2013; Talmi & McGarry, 2012). Given that item-method DF may involve withdrawal of attention from F items (e.g., Fawcett & Taylor, 2010; Taylor, 2005), this withdrawal may therefore be less successful for high-arousal items. Together, these results and the current meta-analysis suggest that reduced DF for emotional items may be partly due to unconstrained differences in arousal rather than differences in valence, although this relationship requires further investigation.

The only other moderator to demonstrate a credible impact on the difference in DF between emotional and neutral conditions was the inclusion of buffer items to mitigate the influence of primacy and recency effects (e.g., Wiswede et al., 2007). However, classic theoretical accounts cannot explain this difference. Primacy effects are historically thought to reflect enhanced rehearsal of items presented early in the list (Glenberg et al., 1980), but F items should not receive enhanced rehearsal regardless of their serial position. Recency effects are historically attributed to the final items being maintained in working memory until test (Craik, 1970); however, following an F instruction, participants work to push the item from mind (e.g., Fawcett & Taylor, 2008). Other theoretical accounts instead suggest primacy and recency effects occur due to the temporal distinctiveness of the items presented at the beginning and end of the learning phase (Bireta et al., 2018; Neath, 1993a, 1993b), and distinctiveness has been shown to reduce DF (Hourihan & MacLeod, 2008). In the context of this comparison, these effects might lead to enhanced memory for both neutral and emotional items presented at the beginning or the end of the study phase. Eliminating these effects would lead to a larger difference between conditions as the emotional items are already processed preferentially relative to the neutral items.

Directed forgetting for neutral and emotional conditions

In addition to differences in DF across valences, each valence condition also demonstrated better memory for R compared to F items. This is consistent with previous research observing DF for emotional items (Gallant et al., 2018; Quinlan & Taylor, 2014; Taylor et al., 2018). Although past work would have predicted greater DF for recall than recognition (e.g., Titz & Verhaeghen, 2010), the current analysis demonstrated weak support for this claim in the neutral condition. Nonetheless, there does appear to be a tendency for larger DF as measured using recall. The contrast between words and more elaborative stimuli (e.g., Quinlan et al., 2010; Titz & Verhaeghen, 2010) was more convincing, and could be attributed to the fact that images are often processed in a distinctive manner (Ensor et al., 2019), granting them an advantage relative to words (Nelson et al., 1976). This result supports earlier claims that distinctive encoding attenuates DF (Hourihan & MacLeod, 2008).

The present study also found greater DF for shorter lists. This might speculatively be attributed to fatigue known to occur during long tasks requiring repeated bouts of cognitive control (e.g., the Stroop task; Rauch & Schmitt, 2009). In the context of DF, this could mean that as the task progresses participants rehearse the R items less effectively while also becoming less efficient at pushing the F items from mind. This could lead to both a reduction in memory for R items and a (paradoxical) improvement in memory for F items, reducing the overall magnitude of DF. Further research is needed to evaluate this possibility.

Studies including buffers demonstrated greater DF than those that did not. As discussed previously, items presented at the beginning and end of the study list may be temporally distinct, leading to less DF for those items. Using buffers would work to eliminate this enhancement for F items presented at the beginning and end of the list, producing greater DF as a result.

Among studies using recognition, those that also had a recall test preceding it demonstrated greater DF than those that did not. This could be attributed to testing effects (Roediger & Butler, 2011), as R items have been shown to be recalled prior to F items (e.g., Lee, 2013). This could lead to enhanced memory for R items and therefore greater DF on the recognition test. The recall of the R items might also operate like the retrieval practice phase of a retrieval-induced forgetting paradigm (Anderson et al., 1994), with F items competing for attentional focus, resulting in their down-regulation and failure to be recognized during the following recognition task.

Conclusion

The present meta-analysis demonstrates that emotional memories are typically harder to forget intentionally than neutral memories, at least in the context of an item-method directed forgetting paradigm. However, the magnitude of this difference varies broadly from one study to the next, driven by factors such as variation in arousal. Future research should further isolate the role played by these – and other – moderators in determining the impact of emotion on our ability to exert control over unwanted experiences.