Psycholinguists apply speed and accuracy measures from a large number of behavioral tasks to the goal of understanding how readers and listeners comprehend language. Those who use event-related potentials have similarly applied amplitude and latency variation of a large number of event-related potential (ERP) components to understand different aspects of language processing, from the identification of individual letters and phonemes, to building and revising syntactic structure, to the appreciation of irony (Friederici, Hahne, & Mecklinger, 1996; Massol, Grainger, Midgley, & Holcomb, 2012; Näätänen et al., 1997; Regel, Coulson, & Gunter, 2010). Among this diversity of dependent measures, perhaps the most commonly used behavioral measure is lexical decision (LD) time—the reaction time (RT) in the task of deciding whether a letter string is actually a word—and the most commonly used ERP measure is amplitude of the N400 component. Both measures are sensitive to a range of variables that include: (1) sublexical characteristics of words, such as their orthographic similarity to other words in the language (Holcomb, Grainger, & O’Rourke, 2002); (2) lexical characteristics, such as a word’s frequency of usage, and whether it refers to a concrete or abstract concept (Gullick, Priya, & Coch, 2013; Kroll & Merves, 1986; Smith & Halgren, 1987; Van Petten & Kutas, 1991; West & Holcomb, 2000), and (3) semantic relationships among words (Kutas, Van Petten, & Kluender, 2006; McNamara, 2005; Neely, 1991; Van Petten & Luka, 2006, for reviews). The degree of overlap between the factors that influence lexical decisions and those that influence N400 amplitudes is high, which both facilitates integration of behavioral and ERP data on the same topic and has led to at least one computational model that is designed to simulate LD and N400 results within the same formal architecture (Laszlo & Plaut, 2012). 
Other, more pragmatic benefits of the overlapping sensitivities of the two measures are the ability to validate a difficult-to-create stimulus set with LD times before proceeding to an ERP experiment (e.g., Rommers, Dijkstra, & Bastiaansen, 2013; Van Petten & Rheinfelder, 1995) and the ability to collect two dependent measures at the same time by asking the subjects in an experiment to perform an LD task while electroencephalography (EEG) is being recorded.

When LD times and N400 amplitudes have been collected in the same paradigm, the results from the two measures are frequently parallel: experimental manipulation X has a statistically significant impact on both measures, whereas experimental manipulation Y has a null impact on both measures (e.g., Batterink & Neville, 2011; Grossi, 2006; Macizo, Van Petten, & O’Rourke, 2012). In other cases, however, a pair of conditions has produced equivalent LD times but different N400 amplitudes (see, e.g., Borovsky, Kutas, & Elman, 2013; Heil, Rolke, & Pecchinenda, 2004; Justus et al., 2011; Kielar & Joanisse, 2011; Küper & Heil, 2009). The parallel results encourage the idea that LD times and N400 amplitudes provide windows onto much the same sets of cognitive processes, whereas the dissociations have usually been interpreted as an indication that the N400 amplitude is “more sensitive,” but authors have been reluctant to conclude that the two measures are qualitatively distinct.

Do graded independent variables produce graded LD times and/or N400 amplitudes?

Three variables that clearly influence both LD time and N400 amplitude (word frequency, number of orthographically similar words, and concreteness) are all continuous in nature. The exact shape of the distribution of word frequencies in English usage is debatable, but to a first approximation, the logarithm of word frequency is normally distributed (Baayen, 1992; Carroll, 1967). Concreteness is defined by subjects’ ratings on a scale from 1 to 7; two large studies show that distributions of these ratings have a bimodal character with peaks around 3.5 and 5.5, but that some 20 %–25 % of the ratings occur in the “trough” between the two major clusters (Nelson & Schreiber, 1992; Wiemer-Hastings, Krug, & Xu, 2001). The exact distribution of orthographic neighborhood size has not been described (to our knowledge), perhaps because the number of neighbors that a word possesses decreases rapidly with increasing word length (Yarkoni, Balota, & Yap, 2008), but words of a fixed length clearly show a range of neighborhood sizes (see Footnote 1).

To the extent that these variables influence basic processes in word recognition or comprehension, and that our dependent measures offer a window onto these processes, one would like to see graded levels of the independent variable reflected in gradations of the dependent variable (whether that consists of a linear relationship or some weaker but at least monotonic relationship). Does this occur for either LD times or N400 amplitudes? Both behavioral and ERP psycholinguistic research have been dominated by experiments with two extreme conditions instead of multiple conditions with graded levels of some independent variable. However, two experiments with three or four levels of word frequency have shown a corresponding gradient of LD times (Allen & Emerson, 1991; Johnson, Allen, & Strand, 1989). Similarly, two experiments with multiple levels of word frequency have shown graded N400 amplitudes (Dambacher, Kliegl, Hofmann, & Jacobs, 2006; Van Petten & Kutas, 1990). For the orthographic similarity between a target word and other words in the language, we have found only one LD experiment with more than two levels of orthographic neighborhood size; this experiment showed graded RTs (Sears, Hino, & Lupker, 1995), whereas other studies have shown significant linear regression coefficients for the relationship between neighborhood size and LD time (Macizo & Van Petten, 2007; Yarkoni et al., 2008). For the N400, Laszlo and Federmeier (2011) showed a continuous range of amplitudes across a range of neighborhood sizes from 0 to 23. For the third variable—the concreteness of a word’s meaning—Balota and colleagues reported significant linear regression coefficients for LD times (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004), but we were unable to find an ERP experiment that included some intermediate level of concreteness or imageability. 
Overall, to the extent that there are any data, it appears that lexical and sublexical variables have graded influences on both LD time and N400 amplitudes, as one would hope.

What of semantic context? For sentence contexts, the standard measure of contextual strength is cloze probability, the percentage of subjects who offer a particular word to complete a sentence fragment when asked to generate “the best completion” or “the first word that comes to mind” (Taylor, 1953). A sentence provides strong context for a given word if a large proportion of readers offer that word (e.g., “George keeps his dog on a LEASH”), and weaker context if a small number offer that word (e.g., “George keeps his dog on a DIET”). For word pairs, the strength of relationship is derived from a parallel generative procedure: the percentage of subjects who produce Word B in response to cue Word A in a free association task. This is referred to as association strength. The distribution of cloze probabilities in natural language use is difficult or impossible to estimate—the set of possible sentences is infinite, and the set of actual sentences expands every second as people speak and write. Sentences created by experimenters have appeared to show a continuous gradient of cloze probabilities (e.g., Bloom & Fischler, 1980). Characterizing the natural distribution of relationship strength between pairs of words is a somewhat more tractable problem, because the number of words in a language is finite, and two large-scale sets of association norms are available for English (Kiss, Armstrong, Milroy, & Piper, 1973; Nelson, McEvoy, & Schreiber, 1998). In Fig. 1, we plot the distribution of associative strengths for the responses to the 8,211 words used as cues in the Edinburgh Associative Thesaurus (EAT; Kiss et al., 1973), for the most popular response to each cue. Although the distribution shows substantial skew from normal, this exercise shows that the strength of the word-pair relationship is certainly a continuous variable. We can thus ask whether gradations in relationship strength are reflected in gradations of LD times and N400 amplitudes.
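The generative measures described above (cloze probability for sentences, association strength for word pairs) are both simple percentages of a normative group. As a minimal sketch, assuming the norms are available as a flat list of one response per subject for a given cue, association strength and the primary associate could be computed as follows. The function names and the response data are hypothetical illustrations, not values taken from the EAT.

```python
from collections import Counter


def association_strengths(responses):
    """Return {response: percentage of subjects who gave it} for one cue."""
    counts = Counter(responses)
    n_subjects = len(responses)
    return {word: 100.0 * c / n_subjects for word, c in counts.items()}


def primary_associate(responses):
    """Return (word, strength) for the most popular response to a cue."""
    strengths = association_strengths(responses)
    word = max(strengths, key=strengths.get)
    return word, strengths[word]


# Hypothetical responses from 10 subjects to a single cue word (made-up data).
cue_responses = ["dog"] * 6 + ["mouse"] * 3 + ["fur"]
strengths = association_strengths(cue_responses)
primary = primary_associate(cue_responses)
```

The same computation applied to sentence-completion responses would give cloze probabilities; only the elicitation procedure differs.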

Fig. 1

Association strength is the percentage of subjects who offered a given word in response to a cue word (e.g., 57 % responded “dog” to a cue of “cat,” and 29 % responded “fright” to a cue of “scare”). Plotted are association strengths for the most popular response (primary associate) to the 8,211 cue words in the Edinburgh Associative Thesaurus (Kiss et al., 1973). Up arrow heads indicate the mean associative strengths for the strong and weak conditions in Experiment 1 (both conditions included primary associates only). Down arrow heads indicate the mean associative strengths for the strong primary and weak primary conditions in Experiment 3

Four studies have examined the ERPs elicited by the second words of pairs that were strongly associated, more weakly associated, or semantically unrelated. These have uniformly shown graded N400 amplitudes: largest amplitudes for unrelated words, intermediate for weak associates, and smallest for strong associates (Frishkoff, 2007; Kutas & Hillyard, 1989; Kandhadai & Federmeier, 2010; Ortu, Allan, & Donaldson, 2013). The graded response to word-pair relationships parallels the monotonic relationship between the strength of a sentence context (cloze probability) and the N400 amplitude observed in other experiments (DeLong, Groppe, Urbach, & Kutas, 2012; DeLong, Urbach, & Kutas, 2005; Kutas & Hillyard, 1984; Kutas, Lindamood, & Hillyard, 1984; Thornhill & Van Petten, 2012; Wlotko & Federmeier, 2013; see Van Petten & Luka, 2012, for review). A larger number of studies have searched for a graded influence of word-pair association in LD times, but largely in vain. Although a handful of experiments have obtained graded RTs—strong associates faster than weak associates, which are faster than unrelated words (Cañas, 1990; Coney, 2002; De Groot, Thomassen, & Hudson, 1982)—the majority have shown no RT gradient across association strengths (Anaki & Henik, 2003; Bonnotte & Casalis, 2009; Fischler, 1977; Fischler & Goodman, 1978; Hodgson, 1991; Koriat, 1981; Kroll & Potter, 1984; Nation & Snowling, 1999; Sánchez-Casas, Ferré, Demestre, García-Chico, & García-Albea, 2012). 
Some studies have suggested that the presence or absence of a strength effect depends on the temporal delay (the stimulus onset asynchrony, or SOA) between the members of a pair, but these have not been consistent (Hutchison, Balota, Cortese, & Watson, 2008, reported an effect of forward association strength at a 250-ms, but not at a 1,250-ms, SOA; Stolz & Neely, 1995, reported an effect of association strength at an SOA of 800 ms, but not at 200 ms; Perea & Rosa, 2002, reported a strength effect at an SOA of 66 ms, but not at 83, 100, 116, or 166 ms). The null results have come in two flavors: either strongly and weakly associated targets elicited equivalent LD times (both faster than unrelated), or RTs to weakly associated targets were equivalent to unrelated RTs. In both versions, the RT measure displayed a dichotomous response to a graded manipulation.

A similar dissociation between LD times and N400 amplitudes seems to be present in a paradigm in which word associations are more constrained. Instead of freely associating, subjects in a normative group can be asked to generate an exemplar in response to a category name (e.g., PANCAKE for “a breakfast food”), and their responses can be ranked so as to create category typicality scores. When a different group of subjects are presented with the category names followed by highly typical, less typical, and out-of-category items, N400 amplitudes show a three-way gradient: largest for out-of-category, intermediate for less typical exemplars, and smallest for very typical exemplars (Federmeier, Kutas, & Schul, 2010; Heinze, Muente, & Kutas, 1998; Kiang, Kutas, Light, & Braff, 2007). Much like experiments using free association norms, LD times instead show only a dichotomy between in-category versus out-of-category exemplars (Becker, 1980; Young, Newcombe, & Hellawell, 1989).

The discrepancy between LD times and N400 amplitudes in response to strength-of-context manipulations may suggest that these measures are less similar than has been thought, and that entirely different processes are responsible for the overall influence of semantic context in the two measures. It remains possible, however, that the discrepancy is spurious, particularly given that a minority of lexical decision experiments have shown strength effects. A priori, a strength effect can only be observed if there is sufficient convergence between the semantic intuitions of some normative group of subjects who provide the free association data and an experimental group who provide the lexical decision or ERP data. Purely behavioral experiments tend to use much smaller stimulus sets than do ERP experiments (although often a larger number of subjects), and it is possible that larger stimulus samples will be more likely to yield graded effects. A second technical issue is that all of the ERP and behavioral experiments cited above (with the exception of Becker, 1980) used different target words in the strongly and weakly associated conditions, with varying degrees of attention to matching these targets for a variety of lexical characteristics that were extraneous to the strong/weak manipulation. Here, too, random variation between conditions would have a better chance of washing out with the larger stimulus samples used in ERP experiments. A related possibility is that poor control over extraneous variables (some of which may not even be known) is especially problematic for LD times and frequently obscures an underlying strength effect, whereas the N400 is less sensitive to those variables—thus allowing a strength effect to emerge.

To date, only a single study has reported a direct comparison of the impacts of word-pair association strength on N400 amplitudes and LD times. Frishkoff (2007) found graded N400 amplitudes across unrelated, weakly associated, and strongly associated target words in two experiments. LD times from the same subjects showed a statistically ambiguous pattern: In one experiment, strong associates yielded faster RTs than weak ones, but weak associates differed from unrelated words at p = .07; significantly graded RT effects were then observed in a second experiment with the same stimuli.

The present study

The present study includes three experiments with three different stimulus sets, in order to probe the reliability of strength effects on N400 amplitudes and LD times. For Experiment 1, we used stimuli that were like those from prior work, in that strongly related and weakly related pairs had different target words. In Experiments 2 and 3, we examined responses to the same targets contingent on strengths of association, so that any extraneous variation in the lexical characteristics of the targets would be eliminated.

In Experiments 1 and 3, lexical decision and ERP data were collected concurrently from the same subjects. In Experiment 2, one group of subjects provided lexical decision data, whereas a different group was assigned a letter-probe task during EEG recording instead of lexical decision. In the letter-probe task, subjects view a pair of words and then decide whether a subsequently presented letter occurred in either word. Because the upcoming letter is not known in advance, semantic context effects can be observed prior to the time that subjects make a yes/no decision. This task is designed to disentangle decision-related ERP components from semantic context effects per se. Experiment 2 thus isolates the N400’s sensitivity to association strength by removing the possible contributions from decision confidence that are typically evident in a temporally overlapping P300 component.

Across experiments, the stimuli were also constructed to examine a specific proposal about the surprising absence of strength-of-association effects in LD times. Anaki and Henik (2003) suggested that LD times reflect competition across a set of associates activated by the presentation of a cue word, so that the strongest associate in a set is facilitated, regardless of its absolute strength. For instance, the most popular response to QUACK in the Edinburgh Associative Thesaurus is DUCK, offered by 42 % of the subjects, and the most popular response to POINT is SHARP, offered by 9 % of the subjects. Although these differ dramatically in association strength, both DUCK and SHARP are the primary associates of their cues. Anaki and Henik found that strong and weak primary associates elicited equally fast LD times (i.e., no strength effect). Critically, the weak primary associates elicited faster responses than did equally weak nonprimary associates—such as SUN preceded by BEACH, a pair that also has 9 % association strength but that suffers from internal competition with BEACH–SAND and BEACH–SEA—according to Anaki and Henik’s account. In Experiment 1, we compared weak to strong primary associates, for which Anaki and Henik’s proposal predicts no strength effect. In Experiment 3, we compared strong primary, weak primary, and weak nonprimary associates, for which their proposal also predicts no strength effect, but a distinction between primary and nonprimary associates.

Experiment 1

Method

Subjects

The subjects in all experiments were native speakers of English with no history of neurological disorder, psychiatric disorder, or learning disability by self-report, nor any medications known to affect the central nervous system. A group of 32 young adults were paid for their participation in Exp. 1 (17 men, 15 women). Their mean age was 23.8 years (SD = 5.4). All had some college education (mean years of formal education = 15.8, SD = 1.8, using a formula that assigns 12 years for a high school degree or 16 for a Bachelor’s degree, and adds years up to a maximum of 5 years for any postgraduate education). The data from three additional subjects were not analyzed: One offered no behavioral response on roughly a third (32 %) of the trials; RTs for a second person were more than two standard deviations slower than the mean of the retained subjects; and for a third person, more than 80 % of the trials included non-EEG electrical artifacts.

Stimuli

Sets of 160 related and 160 unrelated word pairs were initially constructed as control items for a sentence-processing study (Coulson, Federmeier, Van Petten, & Kutas, 2005). The related and unrelated pairs shared context (or cue) words (e.g., SPARE TIRE vs. SPARE PENCIL; see Table 1 for other examples). Of the related pairs, 130 were selected for analysis here, because the target items were primary associates—the most popular response to their cue words—in the Edinburgh Associative Thesaurus (EAT; Kiss et al., 1973). These primary associates were divided into equal-size sets of strong and weak target words and were contrasted with semantically unrelated targets preceded by the same context words. Strong targets were offered as responses to their context words by an average of 47.5 of the 100 subjects in the EAT norms (ranging from 34 to 85), and weak targets were offered as responses by an average of 23.1 subjects (ranging from 10 to 33). Figure 1 shows the locations of the mean association strengths along the distribution of association strengths in the EAT. The strong, weak, and unrelated targets did not differ in frequency of usage or word length, as is shown in Table 2. The numbers of orthographic neighbors (other words that can be formed by changing one letter) also did not differ between strong and weak targets [t(128) = 1.54], but strong targets had slightly more neighbors than unrelated targets [t(128) = 2.33, p = .02]. Prior results showed that, all else being equal, words with more orthographic neighbors elicit larger N400s (Holcomb et al., 2002; Laszlo & Federmeier, 2011), so that the small imbalance here would tend to act against the predicted influence of word-pair association strength.

Table 1 Sample stimuli from Experiment 1
Table 2 Target characteristics in Experiment 1 (mean and SE)

Procedures

Each subject viewed 32 or 33 strongly associated pairs (65 strong pairs divided across subjects), 32 or 33 weakly associated pairs, 65 unrelated pairs, and 80 pairs comprising words and pronounceable nonwords. Each subject also viewed an additional 15 associated and 15 unrelated pairs not analyzed here, because the associated targets were not primary associates of their cue words. Each item of a pair was presented for 200 ms in the center of a video monitor, with a 750-ms interstimulus interval and a 4.7-s interval between trials. The subjects made speeded lexical decisions on the second item of each pair, signaled by buttonpresses with the right and left thumbs. The mapping between the right and left hands and word versus nonword decisions was counterbalanced across subjects.

Electrophysiological methods

The EEG was recorded with tin electrodes mounted in a commercially available elastic cap. Midline frontal (Fz), central (Cz), and parietal (Pz) recording sites were used, along with lateral pairs of electrodes over the posterior temporal (T5, T6) and occipital (O1, O2) scalp, as defined by the 10–20 system (Jasper, 1958). Three additional lateral pairs were used: a fronto-temporal pair placed midway between F7/F8 and T3/T4, a midtemporal pair placed 33 % lateral to Cz (left and right midtemporal), and a posterior temporal pair placed 30 % of the interaural distance lateral to and 12.5 % of the inion–nasion distance posterior to Cz (left and right posterior temporal). Each scalp site was referred to the left mastoid during recording and was re-referenced to an average of the left and right mastoids prior to data analyses. Vertical eye movements and blinks were monitored via an electrode placed below the right eye, referred to the left mastoid. Horizontal eye movements were monitored via a right-to-left bipolar montage at the external canthi. The EEG was amplified by a Grass Model 12 polygraph with half-amplitude cutoffs of 0.01 and 100 Hz, digitized at a sampling rate of 250 Hz. Trials with eye movement, muscle, or amplifier blocking artifacts were rejected prior to averaging (see Footnote 2). After artifact rejection and exclusion of trials with incorrect lexical decisions, the ERPs for individual subjects included means of 30 trials in the strong condition, 31 in the weak condition, 58 in the unrelated condition, and 69 in the nonword condition (minimum 24 trials).

Statistical methods

To reduce the influence of exceptionally long RTs in the lexical decision task, we calculated medians for each condition in each subject, and also trimmed the means after excluding trials with RTs longer than two standard deviations above a subject’s mean in that condition. Both methods were used to ensure that the results did not hinge on how outlying RTs were handled, given that a diversity of methods occur in the lexical decision literature. Medians and trimmed means were analyzed via paired t tests to contrast each of the associated conditions in an experiment to the unrelated condition, and to compare the associated conditions to each other. N400 amplitudes were measured as the mean amplitudes from 250 to 450 ms after the onset of target items at all scalp sites, relative to a 200-ms prestimulus baseline, and analyzed via analyses of variance (ANOVAs) using condition and scalp site as repeated measures. Because the goal of the present article is to compare ERP and lexical decision results, main effects of condition will be emphasized, and interactions between condition and scalp site are not noted unless they qualify a nonsignificant main effect of condition. For F ratios with more than one degree of freedom in the numerator, the Huynh–Feldt correction for nonsphericity of variances is applied; for significant results, we report the original dfs, the corrected probability level, and the ε correction factor. In a final analysis, the sensitivities of the RT and ERP measures to the linear effect of association strength are compared via an ANOVA with a polynomial contrast.
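The two ways of handling outlying RTs described above can be illustrated with a short sketch for a single subject in a single condition. The text does not say whether the standard deviation was the sample (n − 1) or population form, so the use of the sample form here is an assumption, and the RT values are made up for illustration.

```python
from statistics import mean, median, stdev


def trimmed_mean_rt(rts, n_sd=2.0):
    """Mean RT after excluding trials more than n_sd SDs above the mean.

    Only slow outliers are removed, matching the one-sided rule in the text.
    Assumption: the sample (n - 1) standard deviation is used.
    """
    cutoff = mean(rts) + n_sd * stdev(rts)
    kept = [rt for rt in rts if rt <= cutoff]
    return mean(kept)


# Hypothetical correct-trial RTs (ms) for one subject in one condition.
rts = [520, 540, 560, 580, 600, 610, 630, 650, 1400]

median_rt = median(rts)            # robust to the single slow trial
trimmed_rt = trimmed_mean_rt(rts)  # the 1400-ms trial exceeds the cutoff
```

Both summaries would then be computed per subject and per condition before entering the paired t tests.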

Results

Lexical decision times

Table 3 shows the error rates and RTs for trials with correct responses (see also Fig. 3b below). The unrelated target words elicited substantially faster responses than did the nonwords [t(31) = 5.38 for medians, t(31) = 6.55 for trimmed means; both ps < .0001]. Both the median and trimmed mean RTs showed robust semantic context effects for the strongly and weakly associated targets, as compared to the unrelated targets [all four paired ts(31) > 4.56, all ps < .0001]. However, the strongly and weakly associated conditions did not differ from each other [t(31) = 0.14 for median RTs, t(31) = 1.05 for trimmed mean RTs, both ps > .30].

Table 3 Lexical decision results in Experiment 1 (mean and SE)

Event-related potentials

The left side of Fig. 2 shows that nonword targets elicited both a larger N400 and a larger P300 than did the unrelated word targets. N400 amplitudes were measured in the latency range from 250 to 450 ms after target onset, in order to minimize overlap with the P300, and were analyzed with a repeated measures ANOVA including Scalp Site as a factor (13 levels); this confirmed the larger N400 for nonwords than for unrelated words [F(1, 31) = 5.73, p < .05]. Figure 3a shows the mean amplitudes across all 13 scalp sites in the 250- to 450-ms latency range, and Fig. 3c shows the ERPs elicited by strongly associated, weakly associated, and unrelated target words at selected scalp sites. Both strongly and weakly associated targets elicited smaller N400s than did unrelated targets [Fs(1, 31) = 35.8 and 17.3, respectively; ps < .0002]. In contrast to the null effect of association strength on LD times, strongly associated targets elicited smaller N400s than did weakly associated targets [F(1, 31) = 7.15, p < .02].
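The N400 measure used here is a mean voltage in the 250- to 450-ms window, expressed relative to the 200-ms prestimulus baseline given in the statistical methods. A minimal sketch of that computation for one averaged waveform at one scalp site is shown below. It assumes the epoch starts 200 ms before target onset and is sampled at 250 Hz (4 ms per sample); the half-open window and the rounding of window edges to sample indices are assumptions, since 250 and 450 ms do not fall exactly on sample points at this rate. The waveform is synthetic.

```python
def mean_amplitude(epoch, srate_hz, epoch_start_ms, win_start_ms, win_end_ms):
    """Mean voltage over samples with latency in [win_start_ms, win_end_ms)."""
    ms_per_sample = 1000.0 / srate_hz
    # Window edges are rounded to sample indices (an assumption).
    first = round((win_start_ms - epoch_start_ms) / ms_per_sample)
    last = round((win_end_ms - epoch_start_ms) / ms_per_sample)
    window = epoch[first:last]
    return sum(window) / len(window)


def n400_amplitude(epoch, srate_hz=250, epoch_start_ms=-200):
    """Mean amplitude from 250 to 450 ms minus the prestimulus baseline."""
    baseline = mean_amplitude(epoch, srate_hz, epoch_start_ms, epoch_start_ms, 0)
    n400 = mean_amplitude(epoch, srate_hz, epoch_start_ms, 250, 450)
    return n400 - baseline


# Synthetic averaged waveform in microvolts: 50 baseline samples (-200 to 0 ms)
# at 2.0, followed by 200 poststimulus samples (0 to 800 ms) at -3.0.
epoch = [2.0] * 50 + [-3.0] * 200
amplitude = n400_amplitude(epoch)
```

In the analyses reported here, a value of this kind would be obtained for each subject, condition, and scalp site before entering the repeated measures ANOVA.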

Fig. 2

Left column: Grand average event-related potentials (ERPs) from 32 subjects in Experiment 1. Right column: Grand average ERPs from 24 subjects in Experiment 3. The frontal midline site is Fz, central midline is Cz, and parietal midline is Pz. The locations of the midtemporal and posterior-temporal sites in Experiment 1 are described in the text. For Experiment 3, these are the (nearby) standardized locations of T3/T4 and T5/T6

Fig. 3

(a) Mean amplitudes of event-related potentials (ERPs) in the 250- to 450-ms latency range, measured across all scalp sites in Experiment 1. Brackets between adjacent bars show the effect sizes of the difference between them, calculated as unbiased Cohen’s d. (b) Median lexical decision times from the same 32 subjects. Brackets between adjacent bars show the effect sizes of the difference between them, as unbiased Cohen’s d. Negative numbers indicate effects in the unpredicted direction (i.e., slower RTs for strong than for weak associates). (c) Grand average ERPs from midline frontal, central, and parietal scalp sites, along with left and right midtemporal sites (LT, RT) and a pair of posterior-temporal sites (LpT and RpT). The association strengths for the related pairs are provided as percentages of subjects in an independent normative group (Kiss et al., 1973) who offered the critical word as a response to the context word that preceded it

Linear effect of association strength: Direct comparison between N400 amplitude and LD time

Median LD times and N400 amplitudes were jointly analyzed in an ANOVA with repeated measures of measure (RT vs. N400 amplitude collapsed across scalp sites) and strength (unrelated, weak, or strong). The three levels of strength were defined via their numerical association strengths of 0, 23, and 47 and subjected to a polynomial contrast to examine whether the linear effect of association strength was greater for N400 amplitudes than for LD times. In this analysis, confirmation of the more detailed comparisons above would consist of a significant interaction between the linear effect of the strength variable and the nature of the dependent measure. The expected interaction of strength_lin with measure was observed [F(1, 31) = 21.8, p < .0001, ηp² = .41], in addition to an overall effect of strength_lin [F(1, 31) = 18.5, p < .0002, ηp² = .37]. The analysis also yielded an interaction between the quadratic component of strength and the type of measure [F(1, 31) = 7.19, p < .02, ηp² = .19], along with an overall effect of strength_quad [F(1, 31) = 6.70, p < .02, ηp² = .18]. The quadratic component reflects the shape of the function relating LD time to association strength, with a single-step decrease from the unrelated condition to the weakly associated condition, but no further decrease between the weakly and strongly associated conditions. Follow-up tests showed that the quadratic component of strength was significant for LD times [F(1, 31) = 6.93, p < .02, ηp² = .18] but not for N400 amplitudes [F(1, 31) = 0.71, ηp² = .02]. Given that both dependent measures yielded large differences between the extreme conditions of unrelated and strongly associated pairs, both follow-up tests yielded significant linear effects of strength [RT, F(1, 31) = 20.2, p < .0001, ηp² = .40; N400, F(1, 31) = 35.6, p < .0001, ηp² = .54].
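To make the linear and quadratic components concrete, the sketch below derives orthogonal polynomial contrast weights for the three strength levels (0, 23, and 47) and applies them to condition means. This is only an illustration of how such weights separate a graded pattern from a step-like one; it is not a reconstruction of the ANOVA reported above, and the condition means are invented. The weights are left unnormalized, which is a simplifying assumption.

```python
def polynomial_contrasts(levels):
    """Linear and quadratic contrast weights for possibly unequal spacing.

    Linear weights are the centered levels. Quadratic weights are the squared
    levels with the intercept and linear trend removed (Gram-Schmidt step).
    """
    n = len(levels)
    mean_x = sum(levels) / n
    linear = [x - mean_x for x in levels]

    squares = [x * x for x in levels]
    mean_sq = sum(squares) / n
    sq_centered = [s - mean_sq for s in squares]
    slope = (sum(s * l for s, l in zip(sq_centered, linear))
             / sum(l * l for l in linear))
    quadratic = [s - slope * l for s, l in zip(sq_centered, linear)]
    return linear, quadratic


def contrast_score(condition_means, weights):
    """Weighted sum of condition means for one subject."""
    return sum(m * w for m, w in zip(condition_means, weights))


levels = [0, 23, 47]  # unrelated, weak, strong (association strength)
linear, quadratic = polynomial_contrasts(levels)

# Hypothetical RT patterns (ms) for illustration only.
graded = [600 - x for x in levels]   # decreases in proportion to strength
step_like = [600, 560, 560]          # one step, then flat

graded_quad = contrast_score(graded, quadratic)      # approximately zero
step_quad = contrast_score(step_like, quadratic)     # clearly nonzero
step_lin = contrast_score(step_like, linear)         # negative: RTs decrease
```

A step-like pattern of the kind described for LD times loads on both components, whereas a strictly graded pattern loads only on the linear one.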

Discussion

Strength of association influenced N400 amplitudes, as in the small number of prior ERP experiments that have examined word pairs varying in association strength (Frishkoff, 2007; Kandhadai & Federmeier, 2010; Kutas & Hillyard, 1989; Ortu et al., 2013). The subjects performed a lexical decision task as their EEG was recorded, but as in a number of prior experiments with RT data only, association strength had no impact on LD times. The RT data instead showed a simple division between related pairs (independent of strength of relationship) and unrelated pairs.

One potential complication for interpreting the dissociation between N400 amplitude and LD time was created by the use of different words in the strong and weak conditions. Both N400 amplitudes and LD times are sensitive to variables other than semantic context, including, at least, frequency of usage, orthographic neighborhood density, and the concrete versus abstract nature of a word’s meaning. Some of these variables act in the same direction for the two dependent measures, such that uncommon words elicit both larger N400s and longer LD times than do more commonly used words (Allen & Emerson, 1991; Balota & Chumbley, 1984; Dambacher et al., 2006; Van Petten, 1995). Other variables act in opposing directions, such that concrete words elicit larger N400s but shorter LD times than abstract words (Gullick et al., 2013; Kroll & Merves, 1986; Smith & Halgren, 1987; Van Petten & Kutas, 1991; West & Holcomb, 2000). We equated the strong, weak, and unrelated words on some but not all of these variables, and other, less well-characterized lexical variables may also have an influence. One could thus worry that some poorly understood combination of extraneous variables acted to inflate the apparent impact of association strength in the ERPs and deflate the impact on LD times. For Experiments 2 and 3, we thus used the same target words in conditions that varied in association strength. Both experiments also examined a lower range of association strengths than in Exp. 1. The “strong” condition in Exp. 2 was close in association strength to the “weak” condition of Exp. 1, and the weaker conditions had yet lower association strengths.

A second potential concern about the results of Experiment 1 was that the strength effect observed in the ERPs arose not from the N400, but from a P300 triggered by the need to make a word/nonword judgment. P300 amplitude often varies with decision confidence (Hillyard, Squires, Bauer, & Lindsay, 1971; Paul & Sutton, 1972; Squires, Squires, & Hillyard, 1975). The strongly related pairs may have elicited more confident “yes” decisions, such that the ERPs included task-specific contributions (increased positivity from the P300) in addition to the more general semantic context effect indexed by the N400. Although the measurement epoch of 250–450 ms was designed to exclude the P300, one might worry that this strategy was not entirely successful. In the ERP portion of Experiment 2, we thus used a task that excluded decision-related ERP components until more than 1,500 ms after target word presentation.

Experiment 2

Two groups of subjects participated in Experiment 2: One viewed only pairs composed of real words and judged whether a probe letter presented after each pair had been present in one of the words or in neither. A different group viewed the same word pairs intermixed with pairs containing a pronounceable nonword and made lexical decisions without EEG recording. Both items in a pair were thus relevant to the assigned task for both the lexical decision and ERP subjects. The task assignments were motivated by our desire to ensure continued attention to all items across the very large number of pairs delivered. Mandating attention to both members of a word pair should increase the sensitivity to their semantic relationship in our dependent measures, and thus afford a strong test of the impact of relationship strength. The data from the ERP version of the experiment (only) have been presented elsewhere (combined with a different experiment in which the two members of each pair were presented simultaneously; Luka & Van Petten, 2013).

Method

Subjects

A group of 14 men and 16 women participated in the ERP version (mean age = 23.4 years, SD = 3.4; mean years of formal education = 16.2, SD = 1.9). The data from four additional subjects were not analyzed: Three had high numbers of trials contaminated by non-EEG artifacts (primarily blinks), and one showed very low accuracy in the letter-probe task (59.7 % vs. a mean of 95.6 % for the retained subjects). Eight men and 16 women (college students between the ages of 18 and 22) participated in the lexical decision version. In the lexical decision group, the data from one additional subject were not analyzed because his or her RTs were more than two standard deviations slower (960 ms for words, 1,138 ms for nonwords) than the mean of the retained subjects.

Stimuli

A total of 240 nouns were paired with three cue words each to form triplets of semantically related pairs, as in ORE–METAL, WELD–METAL, and SCRAP–METAL (see Table 4 for other examples). The mean associative strengths in the EAT were 23.6 % for strong pairs, 11.7 % for moderate pairs, and 5.9 % for weak pairs. The mean association ranks for the strong, moderate, and weak pairs were 1.8, 3.0, and 5.0, respectively. Table 5 shows other characteristics of the critical target nouns. Each subject received all 240 critical nouns: one quarter with a strong cue, one quarter with a moderate cue, one quarter with a weak cue, and one quarter with an unrelated cue. The unrelated pairs were formed by recombining the cue words and critical nouns. Cue words were rotated across subjects so that each critical noun appeared equally often in a strong, moderate, weak, and unrelated pair, although an individual subject viewed each critical noun only once. To equate the proportions of semantically related and unrelated pairs, 120 semantically unrelated word pairs (unanalyzed fillers) were added to each stimulus list, so that subjects viewed 180 related and 180 unrelated word pairs in total. The four pair types (strong, moderate, weak, and unrelated) and the unrelated filler pairs were randomly intermixed. The ERP subjects viewed only the word pairs; the lexical decision subjects viewed the same word pairs mixed with 360 pairs containing a pronounceable nonword in the first or second position (180 of each). Nonwords were presented in both positions in order to mandate attention to both items of a pair, as in the letter-probe task.
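The rotation of cue conditions across subjects can be sketched as a Latin-square assignment. The Python sketch below is an illustration only: the modular rotation scheme, the function name `assign_conditions`, and the use of exactly four counterbalancing lists are assumptions, since the text states only that each critical noun appeared equally often in each condition across subjects. The recombination of cues and targets to form unrelated pairs is not modeled.

```python
from collections import Counter

CONDITIONS = ["strong", "moderate", "weak", "unrelated"]

def assign_conditions(n_targets, list_index):
    """Return the condition of each target for one counterbalancing list.

    Hypothetical scheme: target i receives condition (i + list_index) mod 4,
    so the four lists together place every target in every condition once.
    """
    return [CONDITIONS[(i + list_index) % len(CONDITIONS)]
            for i in range(n_targets)]

# Four hypothetical lists covering the 240 critical nouns.
lists = [assign_conditions(240, k) for k in range(4)]

# Within a list, each condition holds one quarter of the 240 nouns.
counts_per_list = [Counter(lst) for lst in lists]

# Across the four lists, a given target occurs once in each condition.
conditions_for_target0 = sorted(lst[0] for lst in lists)

# Related versus unrelated totals after adding the 120 unrelated fillers.
n_related = sum(counts_per_list[0][c] for c in ("strong", "moderate", "weak"))
n_unrelated = counts_per_list[0]["unrelated"] + 120
```

Under these assumptions, each list contains 60 critical nouns per condition, and the filler pairs bring the related and unrelated totals to 180 each, matching the counts given in the text.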

Table 4 Sample stimuli from Experiment 2
Table 5 Target characteristics in Experiment 2 (mean and SE)

Procedures

The screen continuously displayed a central frame in which all of the text stimuli appeared. On each trial, the first member of a pair was displayed for 200 ms, followed 500 ms later by the second member of a pair for 200 ms (700-ms SOA). The lexical decision subjects used the index fingers of each hand to indicate whether both items were words or whether one was a nonword. For ERP subjects, the second item of a pair was followed 1,500 ms later by a single letter of the alphabet with a question mark. The index fingers of each hand were used to indicate whether the probe letter occurred in either of the preceding words or in neither word (the mapping between hand and response was counterbalanced across subjects). For both related and unrelated pairs, half of the correct answers were “present” and half were “absent.” “Present” letters were equally likely to occur in the first or second word of a pair. For lexical decision subjects, the next trial began 4 s after the second item of a pair; for ERP subjects, the next trial began 4 s after the probe letter.

Electrophysiological methods

The scalp sites in Exp. 2 were Fpz, Fz, Fcz, Cz, Cpz, Pz, Oz, Fp1, Fp2, F3, F4, Fc3, Fc4, C3, C4, Cp3, Cp4, P3, P4, O1, O2, Ft7, Ft8, T3, T4, Tp7, Tp8, T5, and T6. Other methods were like those of Experiment 1. After artifact rejection, the ERPs for each subject comprised a mean of 54 or 55 trials each for the conditions of strong, moderate, weak, and unrelated (minimum of 32 trials).

Statistical methods

Repeated measures ANOVAs were used to compare RTs for the three associated conditions. For negative lexical decisions, we consider only the trials with nonwords in the second position of a pair, since responses to pairs with nonwords in the initial position could be produced before the second member of a pair was presented. Other methods were like those of Experiment 1.

Results

Lexical decision times

Table 6 shows the error rates and RTs for trials with correct responses (see also Fig. 4b). The unrelated target words received substantially faster RTs than the nonwords [t(23) = 9.45 for medians, t(23) = 9.28 for trimmed means, both ps < .0001]. Both the median and trimmed mean RTs showed robust semantic context effects for the strongly, moderately, and weakly associated targets as compared to the unrelated targets [all six paired ts(23) > 2.77, all ps = .01 or less]. However, no significant impact of association strength was evident when the strong, moderate, and weak RTs were compared to one another [medians, F(2, 46) = 1.34; trimmed means, F(2, 46) = 0.49].
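The paired comparisons reported above (df = 23 for 24 subjects) follow the usual logic of reducing each subject's trials to one summary value per condition and then testing the per-subject differences. The sketch below shows that logic for medians only. All RT values are hypothetical, the function name `paired_t` is an assumption, and the trimmed-mean procedure is not reproduced because its definition is not given in this section.

```python
import math
from statistics import mean, median, stdev

def paired_t(condition_a, condition_b):
    """Paired t statistic and degrees of freedom for two matched lists."""
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical single-trial RTs (ms) for one subject, shown only to
# illustrate how a per-subject median is formed before the group test.
one_subject_unrelated_trials = [612, 598, 655, 701, 630]
one_subject_median = median(one_subject_unrelated_trials)

# Hypothetical per-subject median RTs (ms) for six subjects.
unrelated = [640, 655, 610, 700, 625, 690]
related = [630, 635, 580, 680, 615, 660]

t_value, df = paired_t(unrelated, related)
```

With 24 subjects, as in the lexical decision group, the same function would return df = 23.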

Table 6 Lexical decision results in Experiment 2 (mean and SE)
Fig. 4 (a) Mean amplitudes of event-related potentials (ERPs) in the 250- to 450-ms latency range, measured across all scalp sites in Experiment 2 from 30 subjects who performed a letter-probe task. Brackets between adjacent bars show the effect sizes of the difference between them, calculated as unbiased Cohen’s d. (b) Median lexical decision times from 24 subjects. Brackets between adjacent bars show the effect sizes of the difference between them, as unbiased Cohen’s d. Negative numbers indicate effects in the unpredicted direction (e.g., slower RTs for strong than for moderate associates). (c) Grand average ERPs from midline prefrontal, frontal, frontocentral, central, centroparietal, parietal, and occipital scalp sites. The association strengths for the related pairs are provided as percentages of subjects in an independent normative group (Kiss et al., 1973) who offered the critical word as a response to the context word that preceded it

Event-related potentials

Figure 4a shows the mean amplitudes across all 29 scalp sites in the 250- to 450-ms latency range, and Fig. 4c shows the ERPs elicited by unrelated, weakly, moderately, and strongly associated words at midline scalp sites. Each of the associated conditions elicited smaller N400s than did the unrelated words [weak, F(1, 29) = 8.03, p < .01; moderate, F(1, 29) = 6.53, p < .02; strong, F(1, 29) = 16.7, p < .0005]. An ANOVA comparing the three associated conditions yielded a main effect of association strength [F(2, 58) = 6.54, p < .005, ε = 1.0]. Pairwise comparisons showed that the weak and moderate conditions elicited indistinguishable N400s [F(1, 29) = 0.08], but that both elicited a larger N400 than did the strongly associated words [weak vs. strong, F(1, 29) = 7.99, p < .01; moderate vs. strong, F(1, 29) = 9.50, p < .005] (see Footnote 3).

Linear effect of association strength: Direct comparison between N400 amplitude and LD time

Median LD times and N400 amplitudes were jointly analyzed in an ANOVA with a repeated measure of strength (unrelated, weak, moderate, or strong) and a between-subjects variable of dependent measure (RT vs. N400 amplitude collapsed across scalp sites). The four levels of strength were defined via their numerical association strengths of 0, 6, 12, and 24 and were subjected to a polynomial contrast to examine whether the linear effect of association strength was greater for N400 amplitudes than for LD times. In this analysis, confirmation of the more detailed comparisons above would consist of a significant interaction between the linear effect of the strength variable and the nature of the dependent measure. The expected interaction of strength_lin and measure was observed [F(1, 52) = 10.3, p < .002, ηp² = .17], in addition to an overall effect of strength_lin [F(1, 52) = 10.8, p < .002, ηp² = .17]. The analysis also yielded an interaction between the quadratic component of strength and the type of measure [F(1, 52) = 17.5, p < .0005, ηp² = .25], along with an overall effect of strength_quad [F(1, 52) = 19.6, p < .0001, ηp² = .27]. The quadratic component reflects the shape of the function relating LD time to association strength, with a single-step decrease from the unrelated condition to the weakly associated condition, but no further decrease with higher association strengths. Follow-up tests showed that the quadratic component of strength was significant for LD time [F(1, 23) = 14.8, p < .001, ηp² = .39], but not for N400 amplitude [F(1, 29) = 1.19, ηp² = .04]. Given that both dependent measures yielded large differences between the extreme conditions of unrelated and strongly associated pairs, both follow-up tests yielded significant linear effects of strength [RT, F(1, 23) = 8.42, p < .01, ηp² = .27; N400, F(1, 29) = 17.4, p < .0005, ηp² = .38].
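Because the four strength levels (0, 6, 12, and 24) are unequally spaced, the linear and quadratic contrast weights differ from the textbook values for equally spaced levels. The sketch below derives orthogonal weights by centering and residualizing. It is an illustration under stated assumptions: the function names and all condition means are hypothetical, and the text does not report the weights or software actually used.

```python
def orthogonal_poly_contrasts(levels):
    """Linear and quadratic contrast weights for possibly unequal spacing."""
    n = len(levels)
    mean_x = sum(levels) / n
    linear = [x - mean_x for x in levels]
    squares = [x * x for x in levels]
    mean_sq = sum(squares) / n
    centered_sq = [s - mean_sq for s in squares]
    # Remove the part of the squared term that the linear term explains.
    slope = (sum(c * l for c, l in zip(centered_sq, linear))
             / sum(l * l for l in linear))
    quadratic = [c - slope * l for c, l in zip(centered_sq, linear)]
    return linear, quadratic

def contrast_score(weights, means):
    """Weighted sum of condition means for one contrast."""
    return sum(w * m for w, m in zip(weights, means))

strengths = [0, 6, 12, 24]
linear_w, quadratic_w = orthogonal_poly_contrasts(strengths)

# Hypothetical condition means: a single-step pattern and a graded pattern.
step_means = [600.0, 570.0, 570.0, 570.0]
graded_means = [600.0 - 2.0 * s for s in strengths]

step_quadratic = contrast_score(quadratic_w, step_means)
graded_quadratic = contrast_score(quadratic_w, graded_means)
```

In this toy case the single-step pattern yields a nonzero quadratic score, whereas a strictly graded (linear) pattern yields a quadratic score of zero, which is the distinction the quadratic component is used to capture above.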

Discussion

The pattern of results in Experiment 2 was much like that in Experiment 1. The lexical decision RTs showed a binary division between related words (of all strengths) and unrelated words. Although the ERPs did not distinguish two closely spaced levels of association strength (weak vs. moderate, 6 % vs. 12 %), N400 amplitudes showed a gradation across levels, in that strongly associated words elicited smaller N400s than did more weakly associated words, which in turn elicited smaller N400s than did completely unrelated words. Because subjects were unable to make a task-related decision during the epoch of interest (the probe letter was not displayed until 1,500 ms after the second word of a pair), the association strength effect in the ERPs can be attributed to semantic processing per se, and not to decision confidence.

However, before concluding that lexical decision RTs are insensitive to any variation in semantic association, we should more carefully consider Anaki and Henik’s (2003) claim that RTs reflect association rank but not association strength. These investigators found equivalent LD times for first-rank associates—the most popular response—to a cue word, regardless of whether those primary associates were offered by 42 % of the subjects in their normative group or 10 %. Anaki and Henik’s subjects appeared to treat nonprimary associates like unrelated words; these conditions had equivalent RTs. The LD results of Experiment 1 are not incompatible with Anaki and Henik’s results—we also compared strong and weak primary associates and found equivalent RTs. The lexical decision results of Experiment 2 are at least partially inconsistent with their results, given that our weak and moderate conditions contained very few primary associates (mean association ranks of 5.0 and 3.0, respectively) but elicited faster RTs than did unrelated words. However, those stimuli were selected on the basis of the association strength between members of a pair, not association rank per se. Experiment 3 more closely paralleled Anaki and Henik’s design of three related conditions: strong primary associates of a cue word, weak primary associates of a cue word, and weak nonprimary associates of a cue word. In contrast to Anaki and Henik’s stimulus set, we used the same critical (target) words in each of these conditions (as well as in an unrelated condition), alleviating any concern that the results might reflect accidental differences in the lexical characteristics of the target words. Experiment 3 also returned to a within-subjects design in which ERPs and LD times were collected on the same trials in the same subjects.

Experiment 3

Method

Subjects

Nine men and 15 women participated, with a mean age of 27.7 years (SD = 6.3) and a mean of 17.4 years of formal education (SD = 3.0).

Stimuli

A total of 200 words were paired with three cue words each, to form triplets of related pairs, as in HARD–SOFT, FLUFFY–SOFT, and PILLOW–SOFT (see Table 7 for other examples). For strong primary pairs, the target word was the most popular response to its cue and was offered by a substantial number of the EAT subjects (mean associative rank 1.0, mean associative strength 39.5 %). For weak primary pairs, the target word was also the most popular response to its cue (associative rank 1.0) but was offered by a smaller number of EAT subjects (associative strength 9.2 %) because the cue elicited a greater diversity of responses. Figure 1 shows the locations of these two primary association strengths along the distribution of association strengths in the EAT. Weak nonprimary pairs were selected to have association strengths very close to that in the weak primary condition—7.3 %—but were drawn from fourth- or higher-ranked associates. Unrelated pairs were formed by recombining cues and targets. Each subject received the same 200 target words, evenly divided between the strong primary, weak primary, weak nonprimary, and unrelated conditions. Materials were rotated across subjects so that each target occurred in each of the four conditions, but neither targets nor cues were repeated within subjects. The general characteristics of the critical words are shown in Table 8. The remainder of the trials consisted of 100 unrelated word pairs (to make related and unrelated pairs equally probable) and 150 pairs with a pronounceable nonword in the first or second position (75 each).

Table 7 Sample stimuli from Experiment 3
Table 8 Target word characteristics in Experiment 3 (mean and SE)

Procedures and statistical methods

The screen continuously displayed a central frame in which all text stimuli appeared. On each trial, the first member of a pair was displayed for 200 ms, followed 500 ms later by the second member of a pair for 200 ms (700-ms SOA). Lexical decisions were signaled with the left and right index fingers (the mapping between word/nonword and hands was counterbalanced across subjects). The statistical methods were like those of Experiment 2.

Electrophysiological methods

The scalp sites in Experiment 3 were Fpz, Fz, Fcz, Cz, Cpz, Oz, C3, C4, Cp3, Cp4, P3, P4, T3, T4, T5, and T6. All other methods were like those of Experiments 1 and 2. After artifact rejection and exclusion of trials with incorrect lexical decisions, the ERPs for individual subjects included a mean of 42 or 43 trials in the strong primary, weak primary, weak nonprimary, and unrelated conditions, and a mean of 64 trials in the nonword condition (minimum 19).

Results

Lexical decision times

Table 9 shows the error rates and RTs for trials with correct responses (see also Fig. 5b). The unrelated target words received substantially faster RTs than did the nonwords [t(23) = 8.06 for medians, t(23) = 8.73 for trimmed means, both ps < .0001]. Only the strong primary associates elicited faster RTs than did unrelated words [t(23) = 3.01, p < .01, for medians, t(23) = 3.76, p < .001, for the trimmed means]. The LD times for neither weak primary nor weak nonprimary associates differed from those in the unrelated condition [medians and trimmed means, all four ts(23) < 1.51, ps > .14]. ANOVAs comparing the three associated conditions to each other thus yielded main effects of association strength [medians, F(2, 46) = 3.96, p < .05, ε = .90; trimmed means, F(2, 46) = 5.60, p < .01, ε = 1.0]. Pairwise comparisons showed that strong primary RTs were faster than both weak primary and weak nonprimary RTs [medians and trimmed means, all four ts(23) > 2.42, all ps < .05], whereas the weak primary and weak nonprimary conditions did not differ from each other [means and medians, both ts(23) < 0.1].

Table 9 Lexical decision results in Experiment 3 (mean and SE)
Fig. 5 (a) Mean amplitudes of event-related potentials (ERPs) in the 250- to 450-ms latency range, measured across all scalp sites in Experiment 3 from 24 subjects. Brackets between adjacent bars show the effect sizes of the difference between them, calculated as unbiased Cohen’s d. (b) Median lexical decision times from the same 24 subjects. Brackets between adjacent bars show the effect sizes of the difference between them, as unbiased Cohen’s d. (c) Grand average ERPs from midline prefrontal, frontal, frontocentral, central, centroparietal, parietal, and occipital scalp sites. The association strengths for the related pairs are provided as percentages of subjects in an independent normative group (Kiss et al., 1973) who offered the critical word as a response to the context word that preceded it

Event-related potentials

The right side of Fig. 2 shows that nonword targets elicited a larger N400 and a larger P300 than did the unrelated word targets. N400 amplitudes were measured in a latency range of 250 to 450 ms after target onset, to minimize overlap with the P300, and were analyzed with a repeated measures ANOVA including Scalp Site as a factor (19 levels). This yielded an interaction between word/nonword status and scalp site [F(18, 414) = 4.61, p < .001, ε = .30] without a main effect of word/nonword (F < 1). This outcome suggests that the relatively early latency window of 250–450 ms did not fully succeed in separating the N400 from the subsequent P300; the interaction between word/nonword and scalp site reflects the dominance of the P300 at midline scalp sites seen in Fig. 2. Analysis of a more restricted set of eight lateral posterior scalp sites (Cp3, Cp4, P3, P4, T3, T4, T5, and T6) showed significantly more negative ERPs (larger N400s) for the nonwords than for the unrelated words [F(1, 23) = 5.22, p < .05]. In the comparisons of ERPs across the word conditions reported below, we continued to analyze all scalp sites.

Figure 5a shows the mean amplitudes across all 19 scalp sites in the 250- to 450-ms latency range, and Fig. 5c shows the ERPs elicited by strong primary, weak primary, weak nonprimary, and unrelated words at midline scalp sites. Each of the associated conditions elicited smaller N400s than did unrelated words [weak nonprimary, F(1, 23) = 4.47, p < .05; weak primary, F(1, 23) = 4.49, p < .05; strong primary, F(1, 23) = 22.2, p < .0001]. An ANOVA comparing the three associated conditions yielded a main effect of association strength [F(2, 46) = 3.70, p < .05, ε = 0.99]. Pairwise comparisons showed that the weak primary and weak nonprimary conditions elicited indistinguishable N400s [F(1, 23) = 0.12], but that both elicited a larger N400 than did the strongly associated words [weak primary vs. strong primary, F(1, 23) = 9.03, p < .01; weak nonprimary vs. strong primary, F(1, 23) = 4.10, p < .05].

Linear effect of association strength: Direct comparison between N400 amplitude and LD time

Median LD times and N400 amplitudes were jointly analyzed in an ANOVA with repeated measures of measure (RT vs. N400 amplitude collapsed across scalp sites) and strength (unrelated, weak nonprimary, weak primary, or strong). The four levels of strength were defined via their numerical association strengths of 0, 7, 9, and 40 and were subjected to a polynomial contrast to examine whether the linear effect of association strength was greater for N400 amplitude than for LD time. The expected interaction of strength_lin with measure was observed [F(1, 23) = 12.3, p < .002, ηp² = .35], in addition to an overall effect of strength_lin [F(1, 23) = 9.98, p < .005, ηp² = .30]. The quadratic component and its interaction with measure were nonsignificant. Given that both dependent measures yielded large differences between the extreme conditions of unrelated and strongly associated pairs, follow-up tests on both measures yielded significant linear effects of strength [RT, F(1, 23) = 16.1, p < .005, ηp² = .33; N400, F(1, 23) = 21.8, p = .0001, ηp² = .49].
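Partial eta squared can be recovered from an F ratio and its degrees of freedom with the standard identity: partial eta squared = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies that identity to three of the F ratios reported above. The function and variable names are hypothetical, and the calculation is offered as a reader's check rather than as a description of how the values were originally computed.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# F ratios taken from the joint analysis of Experiment 3 reported above.
interaction = partial_eta_squared(12.3, 1, 23)     # linear strength x measure
overall_linear = partial_eta_squared(9.98, 1, 23)  # overall linear effect
n400_linear = partial_eta_squared(21.8, 1, 23)     # N400 follow-up test
```

Rounded to two decimals, these three values agree with the effect sizes reported for the corresponding tests.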

Summary

The lexical decision results in Experiment 3 were like those from the first two experiments, in that RTs showed only a dichotomous split among conditions. In this case, the split was between strongly associated words and all others (weak primary, weak nonprimary, and unrelated). The placement of the threshold that divided fast from slow differed from those found in Experiments 1 and 2, in that RTs for weakly related target words were no faster than the responses to unrelated words.

The ERP results of Experiment 3 were very similar to those of Experiment 2: Even very weak associations between members of a word pair led to smaller N400s than did no relationship at all, but weak associations were less effective than strong ones for reducing N400 amplitude.

Both LD times and N400 amplitudes were insensitive to association rank, in that the weak primary and weak nonprimary conditions elicited indistinguishable responses, so that the results did not replicate Anaki and Henik’s (2003) findings. This may reflect the closer control over the characteristics of the target words afforded by using the same items across conditions rather than different targets in the three associated conditions, as in Anaki and Henik’s stimulus set.

General discussion

Across three experiments with different stimuli and subjects, target words preceded by strong semantic associates elicited faster LD times and smaller N400s, replicating many prior results for each measure. However, the inclusion of more weakly related word pairs here revealed a clear difference between the two measures. In each experiment, the amplitude of the N400 elicited by weak associates fell somewhere between those for unrelated words and strong associates (although closely spaced levels of weak association could not be distinguished—i.e., 6 % from 12 % in Exp. 2, and 7 % from 9 % in Exp. 3). LD times instead showed dichotomies rather than gradations: Weak associates were like strong associates in Experiments 1 and 2, and like unrelated words in Experiment 3. As we reviewed in the introduction, a number of previous studies provided hints about the graded versus dichotomous sensitivity to semantic context displayed by the two measures. Those hints were confirmed here by examining the two measures with the same stimuli in the same subjects (in Exps. 1 and 3), and by using the same target words across different association strengths (in Exps. 2 and 3); these precautions ruled out explanations of the N400/LD time dissociation that might have been grounded in individual differences among readers or among words.

Before concluding that the divergent results for lexical decisions and N400 amplitudes indicate a qualitative dissociation between the two measures, one quantitative explanation needs to be ruled out. Namely, if Measure A is more sensitive to some process than Measure B, it will produce a larger difference between two extreme conditions. With a larger separation between extremes, it would be easier—that is, require less statistical power—to shoehorn some intermediate condition (like the weak associates) into the space between the extremes. We thus examined the Cohen’s d effect sizes for the RT and N400 context effects in the extreme comparison between strongly associated and unrelated words in each experiment, as well as combined across experiments. Table 10 shows that all of the effect sizes were moderate to large, ranging from 0.50 to 1.03. The N400 effect sizes were somewhat larger than the LD time effect sizes, but the 95 % confidence limits around the N400 and LD time effect sizes show considerable overlap. Thus, we found little support for the idea that the RT measures were generically less sensitive than the ERP measures for the detection of any semantic relationship. Instead, N400 amplitudes showed stronger linear gradations from stronger to weaker relationships than did LD times, and no sharp discontinuities between levels of association strength (see the analyses of linear effects in each Results section). Overall, the results show not differential sensitivity of the two measures, but that the graded semantic activity visible in the ERPs is transformed into an all-or-none semantic-priming effect in LD times.

Table 10 Cohen’s d effect sizes (95 % confidence limits)
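One common way to obtain an unbiased standardized effect size for a within-subjects comparison is to divide the mean difference by the standard deviation of the differences and then apply the small-sample correction 1 - 3/(4df - 1). The sketch below follows that convention as an assumption: this section does not state which variant of Cohen's d or which correction was used for Table 10, and other variants (for example, standardizing by the average of the two condition SDs) give different values. The data and the function name are hypothetical.

```python
from statistics import mean, stdev

def unbiased_d_paired(condition_a, condition_b):
    """Bias-corrected standardized mean difference for paired scores.

    Assumed convention: d = mean(diff) / SD(diff), multiplied by the
    approximate small-sample correction 1 - 3 / (4 * df - 1).
    """
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    df = len(diffs) - 1
    d = mean(diffs) / stdev(diffs)
    correction = 1 - 3 / (4 * df - 1)
    return d * correction

# Hypothetical per-subject median RTs (ms) for six subjects.
unrelated = [640, 655, 610, 700, 625, 690]
strong = [630, 635, 580, 680, 615, 660]

d_unbiased = unbiased_d_paired(unrelated, strong)
```

The correction shrinks d slightly, and the shrinkage becomes negligible as the number of subjects grows.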

Some prior investigations of the impact of semantic context on ERPs and lexical decision have shown a more dramatic dissociation than the one observed here. If subjects are asked to decide whether a context word contains a simultaneously presented letter, the LD time for a subsequent word no longer shows the standard advantage for related over unrelated words (Smith, Theodor, & Franklin, 1983). This result has been replicated numerous times, and it was initially taken to indicate that the letter search task blocked semantic processing of the context word, so that the relationship between context and target words was never noticed (Stolz & Besner, 1998, 1999). However, multiple studies have shown that N400 semantic context effects are still present in this paradigm, indicating that semantic activity can be unlinked from lexical decision (Heil et al., 2004; Küper & Heil, 2009; see Van Petten, 2013, for a review). The most parsimonious interpretation of this dissociation is that making one decision about letters encourages subjects to also base their lexical decisions on orthography alone, although semantic processing persists. The dissociation between N400 amplitudes and LD times in the letter search paradigm provides a clear indication that semantic information can be used or discarded when making lexical decisions, but also that the absence of a semantic effect on LD times does not indicate the absence of semantic processing.

The present results suggest that even when semantic information does contribute to lexical decisions, its influence is often thresholded rather than continuous. If we take the lexical decision task at face value, as an attempt to optimally discriminate words from nonwords, then detection of even a minimal relationship between a target letter string and the preceding context would be sufficient to signal that the letter string was indeed a word, and additional information about the strength or nature of that relationship would be largely irrelevant.

Much less clear is how strong a semantic relationship must be to count as being positive evidence for a “word” response, or what factors determine the placement of the threshold. Recall that in Experiments 1 and 2, the RTs to weakly associated targets were indistinguishable from those to strongly associated targets, but in Experiment 3 they were indistinguishable from those to unrelated targets. One possibility is that the placement of a threshold is influenced by the distribution of relationship strengths across the entire set of stimuli presented. In Experiment 3, the weakly associated pairs (7 % and 9 % strengths) were much closer to the unrelated pairs (0 %) than to the strongly associated pairs (40 %) in association strength, whereas in Experiments 1 and 2 we used more equally spaced levels of association strength (0 %, 23 %, and 47 %, or 0 %, 6 %, 12 %, and 24 %, respectively). The wide gap between weakly and strongly related pairs in Experiment 3 may have isolated the strong pairs and encouraged the weak and unrelated pairs to be clustered together. Alternatively, threshold placement in the lexical decision task may depend on factors that are outside experimental control, such as prior experience or familiarity with specific semantic relationships on the part of individual subjects.
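The contrast between a graded signal and a thresholded decision can be made concrete with a toy sketch. Everything in it is hypothetical and chosen only for illustration: the 30-ms benefit, the linear gain, the two threshold values, and the function names are assumptions, not estimates from the data. The sketch shows only that moving a single threshold is enough to reproduce the two RT patterns described above, while a graded measure keeps tracking strength.

```python
def thresholded_benefit(strength, threshold, benefit_ms=30.0):
    """All-or-none RT benefit: full benefit once strength reaches threshold."""
    return benefit_ms if strength >= threshold else 0.0

def graded_reduction(strength, gain=0.05):
    """Graded (here linear) N400 reduction, in arbitrary units."""
    return gain * strength

# Nominal association strengths (%) from Experiments 2 and 3.
exp2_strengths = [0, 6, 12, 24]
exp3_strengths = [0, 7, 9, 40]

# Hypothetical low threshold: every related condition gets the benefit.
exp2_rt_benefit = [thresholded_benefit(s, threshold=5) for s in exp2_strengths]

# Hypothetical high threshold: only the strong condition gets the benefit.
exp3_rt_benefit = [thresholded_benefit(s, threshold=20) for s in exp3_strengths]

# A graded measure changes with every step in strength.
exp3_n400_reduction = [graded_reduction(s) for s in exp3_strengths]
```

A strictly linear graded function predicts a small difference between the 7 % and 9 % conditions; the ERP data did not resolve that difference, so the linear form should be read as a simplification.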

In the reams that have been written about LD times, it has frequently been noted that differences between experimental conditions may arise from the processes that lead to identification of a word and its meaning, and/or from the need to make a binary word/nonword decision (Balota & Chumbley, 1984; Neely, Keefe, & Ross, 1989; Norris, 2006; Plaut & Booth, 2000; Ratcliff, Gomez, & McKoon, 2004; Yap, Balota, & Tan, 2013). Different theorists have espoused a fairly direct mapping from semantic activity in lexical units to LD times (e.g., Plaut & Booth, 2000), multistage models with an explicit decision stage (e.g., Balota & Chumbley, 1984), and models that emphasize a flexible mapping between input and behavior, depending on the assigned task (e.g., Norris, 2006). The relative contributions of lexical–semantic processes versus task-specific decision processes have been debated by examining the interactions between word frequency, stimulus degradation, proportions of words to nonwords, proportions of related to unrelated words, and orthographic similarity between word and nonword targets presented for lexical decision. Our understanding of the relationship between the lexical decision task and the neural activity reflected in ERPs is not sufficiently advanced to favor one model or another, but it does stress the importance of the binary nature of the decision options in shaping lexical decision RTs. We suggest that these theoretical accounts of the lexical decision task are likely to benefit from an independent measure of semantic processing that is distinct from the measure being modeled.