Semantic features are the focus of a large area of research that attempts to delineate the semantic representation of a concept. These features are key to models of semantic memory (i.e., memory for facts; Collins & Quillian, 1969; Collins & Loftus, 1975), and they have been used to create both feature-based (Cree & McRae, 2003; Smith, Shoben, & Rips, 1974; Vigliocco, Vinson, Lewis, & Garrett, 2004) and distributional models (Griffiths, Steyvers, & Tenenbaum, 2007; Jones & Mewhort, 2007; Riordan & Jones, 2011). Distributional models build semantic representation by examining the co-occurrence of words in large bodies of text, with the idea that similar contexts for concepts indicate similarity in meaning. Feature-based models instead define similarity between concepts by their overlapping features. To create feature-based similarity measures, participants were often asked to list properties for categories of words, a seminal task whose norms have been prevalent in the literature (Ashcraft, 1978; Rosch & Mervis, 1975; Toglia, 2009; Toglia & Battig, 1978). Feature production norms are created by asking participants to list properties or features of a target concept without focusing on category. These features are then compiled into feature sets that are thought to represent the memory representation of a particular concept (Collins & Loftus, 1975; Collins & Quillian, 1969; Jones, Willits, & Dennis, 2015; McRae & Jones, 2013).

For example, when queried on what features define a cat, participants may list tail, animal, and pet. These features capture the most common types of descriptions: “is a” and “has a”. Additionally, feature descriptions may include uses, locations, behavior, and gender (e.g., actor denotes both a person and gender). The goal of these norms is often to create a set of high-probability features with which to explore the nature of concept structure, as many idiosyncratic features can and will be listed in this task. In the classic view of category structure, concepts have defining features or properties, while the probabilistic view suggests that categories are fuzzy, with features that are typical of a concept (Medin, 1989). These norms have now been published in Italian (Montefinese, Ambrosini, Fairfield, & Mammarella, 2013; Reverberi, Capitani, & Laiacona, 2004), German (and Italian; Kremer & Baroni, 2011), Portuguese (Stein & de Azevedo Gomes, 2009), Spanish (Vivas, Vivas, Comesaña, Coni, & Vorano, 2017), and Dutch (Ruts et al., 2004), as well as for the blind (Lenci, Baroni, Cazzolli, & Marotta, 2013).

Previous work on semantic feature production norms in English includes databases by McRae et al., (2005), Vinson and Vigliocco (2008), Buchanan et al., (2013), and Devereux et al., (2014). The McRae et al., (2005) feature production norms focused on 541 nouns, specifically living and nonliving objects. Vinson and Vigliocco (2008) expanded the stimuli by contributing norms for 456 concepts that included both nouns and verbs. Buchanan et al., (2013) broadened the scope to concepts other than nouns and verbs, with 1808 concepts normed. The Devereux et al., (2014) norms included a replication of the McRae et al., (2005) concepts with the addition of several hundred more concrete concepts. The current paper adds nearly 2000 new concepts to these previous projects and reanalyzes the original data.

Creation of norms is vital to provide investigators with concepts that can be used in future research. The concepts presented in the feature production norming task are usually called cues, and the responses to the cue are called features. In semantic priming tasks, the cue (first word) is paired with a target (second word). In a lexical decision task, participants are shown cue words before a related or unrelated target word; their task is to decide as quickly as possible whether the target is a word or a nonword. A similar task, naming, involves reading the target word aloud after viewing a related or unrelated cue word. Semantic priming occurs when the target word is recognized (responded to or read aloud) faster after a related cue word than after an unrelated cue word (Moss, Ostrin, Tyler, & Marslen-Wilson, 1995). The feature list data created from the production task can be used to determine the strength of the relation between cue and target word, often by calculating the feature overlap, or number of shared features between concepts (McRae, Cree, Seidenberg, & McNorgan, 2005). Both the cue-feature lists and the cue-cue combinations (i.e., the relation between two cues in a feature production dataset, which becomes a cue-target combination in the priming task) are useful and important data for researchers exploring various semantic-based phenomena.

These feature lists can provide insight into the probabilistic nature of language and conceptual structure. Some features are considered more typical (e.g., probable) and are listed more often than others. Further, processing time is speeded for concepts with more listed features, which is referred to as the number-of-features effect (Cree & McRae, 2003; McRae, Sa, & Seidenberg, 1997; Moss, Tyler, & Devlin, 2002; Pexman, Holyk, & Monfils, 2003). Feature production norms can also serve as the underlying conceptual data for models of semantic priming and cognition focusing on the cue-target relation (Cree, McRae, & McNorgan, 1999; Rogers & McClelland, 2004; Vigliocco, Vinson, Lewis, & Garrett, 2004). By selecting stimuli from these norms, others have studied semantic word-picture interference (i.e., slower naming times when distractor words are related category concepts in a picture naming task; Vieth, McMahon, & de Zubicaray, 2014), recognition memory (Montefinese, Zannino, & Ambrosini, 2015), meaning-syntactic differences (i.e., differences in naming times based on semantic or syntactic similarity; Vigliocco, Vinson, Damian, & Levelt, 2002; Vigliocco, Vinson, & Siri, 2005), and semantic richness, a measure of shared defining features (Grondin, Lupker, & McRae, 2009; Kounios, Green, Payne, Fleck, Grondin, & McRae, 2009; Yap, Lim, & Pexman, 2015; Yap & Pexman, 2016). Last, neuropsychological research has benefited from feature production norms, as Vinson and Vigliocco (2002) and Vinson et al., (2003) have used them to explore aphasia (i.e., impaired comprehension or production of language).

However, it would be unwise to consider these norms an exact representation of a concept in memory (McRae, Cree, Seidenberg, & McNorgan, 2005). The norms capture the salient features that participants can recall, likely because salient information is considered central to our understanding of concepts (Cree & McRae, 2003). Additionally, Barsalou (2003) suggested that participants likely create a mental model of the concept based on experience and use that model to generate the property list. This model may represent a specific instance of a category (i.e., their pet dog), and the feature lists will reflect that particular memory. One potential solution to overcome saliency effects would be to solicit applicability ratings for features across multiple exemplars of a category, as De Deyne et al., (2008) have shown that this procedure yields reliable ratings across exemplars and provides more connections than the sparse representations that can occur when producing features.

Computational modeling of memory requires sufficiently large datasets to accurately portray semantic memory; therefore, the advantage of big data in psycholinguistics cannot be overstated. There are many large corpora that could be used for exploring the structure of language and memory through frequency (see the SUBTLEX projects: Brysbaert & New, 2009; New, Brysbaert, Veronis, & Pallier, 2007). Additionally, there are large lexicon projects that explore how the basic features of words affect semantic priming, such as orthographic neighborhood (words that are one letter different from the cue), length, and part of speech (Balota, Yap, Hutchison, Cortese, Kessler, Loftis, & Treiman, 2007; Keuleers, Lacey, Rastle, & Brysbaert, 2012). In contrast to these basic linguistic features of words, other norming efforts have involved subjective ratings of concepts. Large databases of age of acquisition (i.e., rated age of learning the concept; Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012), concreteness (i.e., rating of how perceptible a concept is; Brysbaert, Warriner, & Kuperman, 2014), and valence (i.e., rating of emotion in a concept; Warriner, Kuperman, & Brysbaert, 2013) provide further avenues for understanding the impact these rated properties have on semantic memory. For example, age of acquisition and concreteness ratings have been shown to predict performance on recall tasks (Brysbaert, Warriner, & Kuperman, 2014; Dewhurst, Hitch, & Barry, 1998), while valence ratings are useful for gauging the effects of emotion on meaning (Warriner, Kuperman, & Brysbaert, 2013). These projects represent a small subset of the larger set of normed stimuli available (Buchanan, Valentine, & Maxwell, 2018); however, research is still limited by the overlap between these datasets. If a researcher wishes to control for lexical characteristics and subjective rating variables, each new variable added to the study further restricts the available item pool. Large, overlapping datasets are crucial for exploring the entire range of an effect, ensuring that the stimuli set is not the only contributing factor to the results of a study.

Therefore, the purpose of this study was to expand the number of cue and feature word stimuli available, which additionally increases the possible cue-target pairings for studies using word-pair stimuli (such as semantic priming tasks). To accomplish these goals, we expanded our original semantic feature production norms (Buchanan, Holmes, Teasley, & Hutchison, 2013) to include all cues and targets from the Semantic Priming Project (Hutchison, Balota, Neely, Cortese, Cohen-Shikora, Tse, & Buchanan, 2013). The existing norms were reprocessed along with these new norms to provide new feature coding and affixes (i.e., word additions that modify meaning, such as pre or ing) to explore the impact of word form. Previously, Buchanan et al., (2013) illustrated convergent validity with McRae et al., (2005) and Vinson and Vigliocco (2008) even with a different approach to processing feature production data. In McRae et al., (2005) and Vinson and Vigliocco (2008), features were coded with complexity, matching the “is a” and “has a” format first found in Collins and Quillian (1969) and Collins and Loftus (1975). Buchanan et al., (2013) took a count-based approach, wherein each feature word is treated as a separate concept (i.e., four legs would be treated as two features, rather than one complex feature). Both approaches allow for the computation of similarity by comparing feature lists for cue words; however, the count-based approach matches popular computational models, such as Latent Semantic Analysis (Landauer & Dumais, 1997) and Hyperspace Analogue to Language (Lund & Burgess, 1996). These models treat each word in a document or text as a cue word, and similarity is computed by assessing a matrix of frequency counts between concepts and texts, which is similar to comparing overlapping feature lists.

In contrast, hybrid models include both a compositional view (i.e., words are first broken down into their components, cat and s; Jarvella & Meijers, 1983; Mackay, 1978) and a full-listing view (i.e., each word form is represented completely separately, cat and cats; Bradley, 1980; Butterworth, 1983), and processing occurs as a race between the two types of representation. Given these various models, we created a coding system to capture the feature word meaning, in addition to morphology, to provide different levels of information about each cue-feature combination. In the previous study by Buchanan et al., (2013), features that denoted the same concept were converted to a common form (i.e., most features were translated to their root form). To reduce the sparsity of the matrix, features such as beauty or beautiful were grouped together to help capture the essential features. However, we previously included a few exceptions to this coding system, such as act and actor, when the differences in features denoted a change of action (noun/verb) or gender, or when cue sets did not overlap (i.e., features like will and willing did not have overlapping associated cues). These exceptions were designed to capture how changes in morphology might be important cues to word meaning, as hybrid models of word identification have outlined that morpheme processing can be complex (Caramazza, Laudanna, & Romani, 1988; Marslen-Wilson, Tyler, Waksler, & Older, 1994). In this study, we reduced words to their root form but additionally coded the affixes, ensuring both a reduction in sparsity and the inclusion of morphological information.

The entire dataset is available at http://wordnorms.com/, which allows detailed queries to search for specific stimuli. The data collection, (re)processing, website, and finalized dataset are detailed below. We first describe the basic properties of the cue-feature data, such as the average number of features each cue elicited across parts of speech and datasets. The cue-feature data are then examined for divergent validity from the free association norms, to show that the new feature production norms provide information not found in the Nelson et al., (2004) dataset. We then explain how to calculate semantic similarity and use these values to portray convergent validity by correlating multiple measures of meaning. Additionally, the similarity measures are compared to the priming times from the Semantic Priming Project (Hutchison, Balota, Neely, Cortese, Cohen-Shikora, Tse, & Buchanan, 2013) to demonstrate the relation between semantic similarity and priming.

Method

Participants

A total of 198 new participants were recruited from Amazon’s Mechanical Turk, a large, diverse participant pool wherein users complete surveys for small sums of money (Buhrmester, Kwang, & Gosling, 2011). Participants signed up for Human Intelligence Tasks (HITS) through Amazon’s Mechanical Turk website and completed the study within the Mechanical Turk framework. These data were combined with previously collected datasets, for which we list the location of testing, sample size, and number of concepts in Table 1. Participant answers were screened for errors, and incorrect or incomplete surveys were rejected without payment. Surveys were usually rejected if they included definitions copied from Wikipedia, “I don’t know” responses, or a paragraph written about the concept. Each participant was paid five cents per survey, and they could complete multiple HITS. Participants were required to be located in the United States with a HIT approval rate of at least 80%; no other qualifications were required. HITS remained active until n = 30 valid survey answers were obtained.

Table 1 Sample size and concept norming size for each data collection location/time point

Materials

The 1914 new concepts provided in this study expand upon the 1808 concepts previously published in Buchanan et al., (2013) and provide complete coverage of the Semantic Priming Project (Hutchison, Balota, Neely, Cortese, Cohen-Shikora, Tse, & Buchanan, 2013). The concept set from Buchanan et al., (2013) was selected primarily from the Nelson et al., (2004) database, with small overlaps with the McRae et al., (2005) and Vinson and Vigliocco (2008) datasets for convergent validity. To create the final database of 4436 concepts, the Buchanan et al., (2013), McRae et al., (2005), and Vinson and Vigliocco (2008) feature lists were combined into one larger dataset. Concepts were labeled by their most frequent part of speech using the English Lexicon Project (Balota, Yap, Hutchison, Cortese, Kessler, Loftis, & Treiman, 2007) and Google’s define search. Of the complete dataset of 4436 concepts, 70.4% were nouns, 14.9% adjectives, 12.4% verbs, and 2.3% other parts of speech, such as adverbs and conjunctions. The new concepts from this norming set showed a similar distribution: 72.0% nouns, 14.9% adjectives, 12.4% verbs, and 2.3% other parts of speech.

Procedure

The survey instructions were copied from Appendix B of McRae et al., (2005), which was also used in the previous publication of these norms. Because the McRae et al., (2005) data were collected on paper, we modified these instructions slightly: the original lines for writing in responses were changed to an online text box. The detailed instructions also no longer told participants to consider only the noun sense of the target concept, as the words in our study included multiple parts of speech and senses. Participants were encouraged to list the properties or features of each concept in the following areas: physical (looks, sounds, and feels), functional (uses), and categorical (what it belongs to). The exact instructions were as follows:

We want to know how people read words for meaning. Please fill in features of the word that you can think of. Examples of different types of features would be: how it looks, sounds, smells, feels, or tastes; what it is made of; what it is used for; and where it comes from. Here is an example:

duck: is a bird, is an animal, waddles, flies, migrates, lays eggs, quacks, swims, has wings, has a beak, has webbed feet, has feathers, lives in ponds, lives in water, hunted by people, is edible

Complete this questionnaire reasonably quickly, but try to list at least a few properties for each word. Thank you very much for completing this questionnaire.

Data processing

The entire dataset, at each processing stage described here, can be found at https://osf.io/cjyzw/. First, each concept’s answers were separated into an individual text file that is included as the “raw” data online. Each of these files was then spell checked and corrected if it was clear that the participant answer was a typo. As noted earlier, participants often cut and pasted Wikipedia or other online dictionary sources into their answers. These entries were easily spotted because the formatting of the webpage was included in the answer, and we processed these data by opening the raw text files compiled for each cue, looking for these large blocks of formatted text, and deleting that information. In total, 113 HITS were rejected because of poor data, and 4524 HITS were paid; therefore, we estimate that approximately 2% of the HITS included Wikipedia articles or other ineligible entries.

Next, each concept was processed for feature frequency. In this stage, the raw frequency counts of each cue-feature combination were calculated and compiled into one large file. Cue-cue combinations were discarded, as they were often instances of participants writing the definition of a concept in a sentence. English stop words such as the, an, and of were then discarded, as were terms often used as part of a definition (like, means, describes). Figure 1 portrays the cue-feature dataset provided online. The first column in the dataset (“where”) indicates the source of the cue’s norming: b = Buchanan et al., (2013) or this expansion, m = McRae et al., (2005), and v = Vinson and Vigliocco (2008). The next column is the “cue” or concept word, followed by the “feature” or raw, unprocessed feature listed with the cue.

Fig. 1 Example of the cue-feature dataset created from the feature listing task
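To make this frequency step concrete, the following is a minimal Python sketch, assuming raw responses are stored as free-text strings per cue; the stop word and definition-term lists shown are illustrative placeholders, not the full lists used for the norms (those are in the OSF materials).

```python
from collections import Counter

# Illustrative stop word and definition-term lists; the full lists used
# for the norms are provided in the OSF materials.
STOP_WORDS = {"the", "a", "an", "of", "is", "it", "to"}
DEFINITION_TERMS = {"like", "means", "describes"}

def feature_frequencies(cue, responses):
    """Count raw cue-feature frequencies for one cue, dropping the cue
    itself, stop words, and definition-style terms."""
    counts = Counter()
    for response in responses:  # one participant's feature listing
        for word in response.lower().split():
            if word == cue or word in STOP_WORDS or word in DEFINITION_TERMS:
                continue
            counts[word] += 1
    return counts

# feature_frequencies("cat", ["has a tail", "is an animal", "pet"])
# -> Counter({'has': 1, 'tail': 1, 'animal': 1, 'pet': 1})
```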

We then created a “translated” column for each feature listed, using a Snowball stemmer (Porter, 2001) and hand coding. This column indicates the root word for each feature. The “frequency_feature” column portrays the frequency of the “feature” column (raw word), while “frequency_translated” contains the frequency of the “translated” column. As shown in Fig. 1, leave, leaving, and left were combined into leave for the “translated” column, and the frequencies of the raw words in the “frequency_feature” column were totaled for the “frequency_translated” column. The affixes were added in the columns “a1”, “a2”, and “a3” (not pictured). For example, the original feature cats would be translated to cat and s, wherein cat would be the translated feature and s would be the affix code.
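A minimal sketch of this translation step follows, assuming NLTK’s Snowball stemmer. Note that the stemmer alone yields stems such as leav and cannot handle irregular forms such as left, which is where the hand coding described above comes in; the HAND_CODED mapping here is purely illustrative.

```python
from collections import Counter
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hand-coded corrections: the stemmer yields stems like "leav" and
# cannot map irregular forms such as "left" to "leave".
HAND_CODED = {"leav": "leave", "left": "leave"}

def translate(feature):
    """Map a raw feature to its root ('translated') form."""
    stem = stemmer.stem(feature)
    return HAND_CODED.get(stem, stem)

raw_counts = Counter({"leave": 4, "leaving": 2, "left": 1})
translated_counts = Counter()
for raw, freq in raw_counts.items():
    translated_counts[translate(raw)] += freq
# translated_counts -> Counter({'leave': 7})
```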

The “n” column denotes the sample size for that cue word, as the sample sizes varied across experiment time points, as shown in Table 1. The “normalized_feature” and “normalized_translated” columns are the two frequency columns divided by the sample size and multiplied by 100 (i.e., the percent of participants who used each raw and translated feature for that cue word). At this stage, the data were reduced to cue-feature combinations that were listed by at least 16% of participants (matching the McRae et al., 2005, procedure) or were in the top five features listed for that cue. This calculation was performed on the feature percent for the root word (the “normalized_translated” column). Table 2 indicates the average number of cue-feature pairs found for each data collection site/time point and part of speech of the cue word. The data from McRae et al., (2005) and Vinson and Vigliocco (2008) were added by including all the cue-feature combinations listed in their supplemental files, with their original feature in the “feature” column. If features could be translated into root words with affixes, the same procedure as described above was applied. The cue-feature file includes 69,284 cue-raw feature combinations, of which 48,925 are from our dataset; these reduce to 24,449 unique cue-translated feature combinations.
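The normalization and cutoff step could be sketched as follows in pandas, using the column names described above; the function name and exact tie-breaking behavior are assumptions for illustration, not the original processing code.

```python
import pandas as pd

def filter_features(df, pct_cutoff=16, top_n=5):
    """Keep cue-feature rows listed by at least pct_cutoff percent of
    participants, or in the top_n features for that cue."""
    df = df.copy()
    df["normalized_translated"] = df["frequency_translated"] / df["n"] * 100
    kept = []
    for _, group in df.groupby("cue"):
        ranked = group.sort_values("normalized_translated", ascending=False)
        keep = ranked["normalized_translated"] >= pct_cutoff
        keep.iloc[:top_n] = True  # always retain the top five features
        kept.append(ranked[keep])
    return pd.concat(kept, ignore_index=True)
```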

Table 2 Average (SD) cue-feature pairs by location/time point

The parts of speech for the cue (“pos_cue”), raw feature (“pos_feature”), and translated feature (“pos_translated”) are the next columns in this file. Table 3 depicts the pattern of feature responses for cue-feature part of speech combinations. Statistics in Table 3 only include information from the reprocessed Buchanan et al., (2013) norms and the new cues collected for this project. The overall percentages of part of speech combinations are presented in the “% Raw” and “% Root” columns in Table 3, indicating, for example, the percent of time that the cue and feature were both adjectives (38.09%). The mean frequency columns portray the average of the “normalized_feature” (raw) and “normalized_translated” (root) columns from Fig. 1 for each cue-feature part of speech combination.

Table 3 Percent and average percent of frequency for cue-feature part of speech combinations

The final data processing step was to code the affixes found on the original features. Multiple affix codes were often needed for a feature; for example, beautifully would have been translated to beauty, ful, and ly (the “translated”, “a1”, and “a2” columns). A coding schema was created from online searches of affixes (provided in the supplemental materials). Table 4 displays the list of affix types, common examples of each type of affix, and the percent of affixes that fell into each category. Generally, affixes were tagged in a one-to-one match; however, special care was taken with numbers (cats) and verb tenses (walks).

Table 4 Example of affix coding and percent of affixes found

To create similarity measures, we used cosine similarity calculated in three different ways: from the “feature” + “normalized_feature” percentages, the “translated” + “normalized_translated” percentages, and the affixes + “normalized_feature” percentages (as the frequency of affixes is tied to the original raw word). Cosine values were calculated for each of these feature sets using the following formula:

$$ \cos(A, B) = \frac{\sum_{i=1}^{n} A_{i} \times B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \times \sqrt{\sum_{i=1}^{n} B_{i}^{2}}} $$

This formula is similar to a dot-product correlation, where $A_i$ and $B_i$ indicate the feature frequency percents for cue A and cue B. The subscript $i$ denotes the current feature; when features match, the frequencies are multiplied together and summed across all matches ($\Sigma$). For the denominator, each cue’s feature frequencies are first squared and summed from $i$ to $n$ features, and the square roots of these sums are multiplied together. In essence, the numerator captures the overlap in feature frequency for matching features, while the denominator accounts for the entire feature frequency set of each cue. Cosine values range from 0 (no overlapping features) to 1 (completely overlapping features). With over 4000 cue words from all data sources (i.e., the current paper plus Buchanan, Holmes, Teasley, & Hutchison, 2013; McRae, Cree, Seidenberg, & McNorgan, 2005; Vinson & Vigliocco, 2008), just under 20 million cue-cue cosine combinations can be calculated.
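For illustration, the formula translates directly into code; the feature dictionaries below are invented toy values, not entries from the norms.

```python
import math

def cosine(a, b):
    """Cosine between two cues' feature dictionaries
    (feature -> normalized frequency percent)."""
    shared = set(a) & set(b)
    numerator = sum(a[f] * b[f] for f in shared)
    denominator = (math.sqrt(sum(v ** 2 for v in a.values())) *
                   math.sqrt(sum(v ** 2 for v in b.values())))
    return numerator / denominator if denominator else 0.0

cat = {"tail": 40.0, "animal": 55.0, "pet": 30.0}
dog = {"tail": 45.0, "animal": 60.0, "bark": 25.0}
print(round(cosine(cat, dog), 2))  # -> 0.87
```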

Website

In addition to our OSF page, we present a revamped website for these data at http://www.wordnorms.com/. The single word norms page includes information about each of the cue words, including cue set size, concreteness, word frequency from multiple sources, length, full part of speech, orthographic/phonographic neighborhood, and number of phonemes, syllables, and morphemes. These values were taken from Nelson et al., (2004), Balota et al., (2007), and Brysbaert and New (2009). A definition of each of these variables is provided along with the minimum, maximum, mean, and standard deviation of numeric values. On the word pair norms page, all information about cue-feature and cue-cue statistics can be found. The cue-feature data include the cue, features, and their processed information, as described above. The cue-cue data include the cue and target words from this project (cue-cue combinations); the root, raw, and affix cosines described above; and the original Buchanan et al., (2013) cosines. Additional semantic information includes Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) and Jiang-Conrath (JCN; see explanation below; Jiang & Conrath, 1997) values provided in the Maki et al., (2004) norms, along with forward strength and backward strength (FSG, BSG) from the Nelson et al., (2004) association norms. Users can search and save filtered output in a csv or Excel file. The complete data are also provided for download.

The website thus provides both the data needed to calculate a broad range of linguistic information and precomputed values ready for use. From our OSF page (also linked to GitHub: https://github.com/doomlab/Word-Norms-2), you can find the data at each stage of processing and the final data from this manuscript. Interested researchers could use our raw feature files to create their own coding schemes (or ones similar to McRae et al., (2005)), use the processed files to calculate set sizes for each cue or feature, and use these files plus the cosine files to create their own experimental stimuli. These data could also be used to calculate other measures of interest, such as pointwise positive mutual information, entropy, and random walk statistics (De Deyne, Navarro, Perfors, & Storms, 2016).
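As one example of such a measure, a minimal sketch of positive pointwise mutual information over a cue-by-feature count matrix might look like the following; weighting choices vary in practice (see De Deyne et al., 2016, for discussion), and this sketch makes no claim about the authors’ preferred implementation.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information for a cue-by-feature
    count matrix (rows = cues, columns = translated features)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_joint = counts / total
    p_cue = counts.sum(axis=1, keepdims=True) / total
    p_feature = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_joint / (p_cue * p_feature))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts produce -inf or nan
    return np.maximum(pmi, 0.0)   # clip negative PMI to zero
```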

Results

Research questions

In this section, we will detail the results of the new data collection and reprocessing of previous data.

  1. Descriptive statistics: First, we provide descriptive statistics on the cue-feature lists to compare the newly collected concepts (n = 1914) to the Buchanan et al., (2013) data (n = 1808). The data were then examined for general trends in parts of speech for cue-feature pairs, for both raw and root translated words. The affixes were a new and important component of this study, and their descriptive statistics are detailed.

  2. Divergent validity: When collecting semantic feature production norms, there can be a concern that the information produced will simply mimic the free association norms and thus be more a representation of association (context) than of meaning. Association and meaning do overlap; however, the variables used to represent these concepts have been shown to tap different underlying constructs (Maki & Buchanan, 2008). Therefore, it is important to show that, while some overlap is expected, the semantic feature production norms provide useful information separate from the free association norms. To assess divergent validity, we examined the percent overlap and correlations between the cue-feature data and the free association norms (Nelson, McEvoy, & Schreiber, 2004).

  3. Convergent validity: The new data and Buchanan et al., (2013) were then compared to the McRae et al., (2005) and Vinson and Vigliocco (2008) norms to portray convergent validity. We calculated the cosine values between matching cue sets and correlated the cosine scores between overlapping cue-cue pairs in these datasets. For a second form of convergent validity, the correlations between other semantic similarity measures (LSA, JCN) and the cosine values are provided.

  4. Relation to semantic priming: Last, we examined the correlation between semantic similarity values and semantic priming using the data in the Semantic Priming Project (Hutchison et al., 2013). Because this project was designed to provide complete coverage of the Semantic Priming Project, we wished to explore the relation between similarity measures and the priming scores provided, as a potential use for the new norms.

Descriptive data

An examination of the cue-feature lists indicated that the newly collected data were similar to the previous semantic feature production norms. As shown in Table 2, the new Mechanical Turk data showed roughly the same number of listed features for each cue concept, usually between five and seven features. These numbers represent, for each cue and part of speech, the average number of distinct cue-feature pairs provided by participants after processing. Table 3 shows that adjective cues generally included other adjectives or nouns as features, while noun cues were predominantly described by other nouns. Verb cues elicited a large feature list of nouns and other verbs, followed by adjectives and other word forms. Lastly, the other cue types generally elicited nouns and verbs. Frequency percentages were generally between 7 and 20% when examining the raw words. These words included multiple forms, as the percentages increased to around 30% when features were translated into their root words. Indeed, nearly half of the 48,925 cue-feature pairs were repeated, as 24,449 cue-feature pairs were unique when examining translated features. Generally, because of the translation process, word forms shifted toward nouns and verbs and away from adjectives, because adjectives are often formed by adding an affix to a noun or verb.

Table 4 shows the distribution of these affix values. A total of 36,030 affix values were found across 4407 of the 4436 cue concepts, broken down as follows: first affix n = 33,052, second n = 2832, and third n = 146. Most affixes fell in the numbers and characteristic categories, indicating that participants marked quantity and type (i.e., to/from a noun). Verb tenses comprised another large set of affixes, portraying the action of the cue word. Person and object affixes were used on about 7% of features to explain cues, while action and process affixes were added to features about 8% of the time.

Divergent validity

Table 5 portrays the overlap with the Nelson et al., (2004) norms. The percent of time a cue-feature combination was present in the free association norms was calculated, along with the average forward strength for those overlapping pairs. First, these values were calculated on the complete dataset including the McRae et al., (2005) and Vinson and Vigliocco (2008) norms (as we are presenting them as a combined dataset), using the translated cue-feature set only. Because we used the translated cue-feature set, repeated instances of cue-feature pairs would occur (i.e., the original abandon-leave and abandon-leaving become one line when using the translated abandon-leave), and thus only the unique set was considered. Second, we calculated these values on each dataset separately, as well as for the 26 cues that overlapped in all three datasets. The overall overlap between the database cue-feature sets and the free association cue-target sets was approximately 37%, ranging from 32% for verbs to nearly 52% for adjectives.

Table 5 Percent and mean overlap to the free association norms

Next, we investigated the strength of the relation for cue-feature combinations that were present in the Nelson et al., (2004) norms. Forward strength indicates the proportion of times a target word was listed in response to a cue word in a free association task, which simply asks participants to name the first word that comes to mind when presented with a cue word. Backward strength is the proportion of times a cue word was listed in response to a target word, as free association is directional (i.e., how often cheese is listed in response to cheddar is not the same as how often cheddar is listed in response to cheese).

Similar to our previous results, the range of the forward strength was large (.01-.94); however, the average forward strength was low for overlapping pairs, M = .11 (SD = .14). These results indicate that, while it will always be difficult to separate association and meaning, the dataset presented here shows low association for the overlapping values, and more than 60% of the data are completely separate from the free association norms. A limitation to this finding is the removal of idiosyncratic responses from the Nelson et al., (2004) norms; but even if these were included in some form, the average forward strength would still be quite low when comparing cue-feature lists to cue-target lists. In examining these values by dataset, it appears that the new norms have the highest overlap with the Nelson et al., (2004) data, while the average, standard deviation, minimum, and maximum values were roughly similar for each dataset and the overlapping cues. This effect is likely driven by the inclusion of adjectives and other parts of speech, which show higher overlap than the nouns and verbs that make up the cues in McRae et al., (2005) and Vinson and Vigliocco (2008).

In the last column of Table 5, we calculated the correlation between forward strength and the frequency percent for the root (translated) cue-feature pairs. This correlation provides information about the relation between the strength of association and the frequency of cue-feature mentions. Correlations were similar across parts of speech, except that, notably, the other category showed the lowest relation. This result is likely because the instructions of a semantic feature production task can exclude the concepts a typical “first word that pops into your mind” association task would elicit. The correlations across datasets and the overlapping cues were also similar, denoting that as forward strength increased, the likelihood of cue-feature mentions also increased. In general, these cue-feature pairs were still of low associative strength, as shown in the mean column of Table 5.

Convergent validity

For convergent validity, we calculated the overlap between the different data sources and the correlation between cosine and other measures of semantic similarity. First, the matching cue-cue cosines between data sources were calculated ($n_{cue}$ = 188, $n_{cosines}$ = 240). Buchanan et al., (2013) and the new dataset are listed with the subscript B, while McRae et al., (2005) is referred to with M and Vinson and Vigliocco (2008) with V. For root cosine values, we found high overlap between all three datasets: $M_{BM}$ = .67 (SD = .14), $M_{BV}$ = .66 (SD = .18), and $M_{MV}$ = .72 (SD = .11). The raw cosine values also showed overlap, even though the McRae et al., (2005) and Vinson and Vigliocco (2008) datasets were already mostly preprocessed for word stems: $M_{BM}$ = .55 (SD = .15), $M_{BV}$ = .54 (SD = .20), and $M_{MV}$ = .45 (SD = .19). Last, the affix cosines overlapped similarly between the Buchanan et al., (2013) and McRae et al., (2005) datasets, $M_{BM}$ = .43 (SD = .29), but did not overlap with the Vinson and Vigliocco (2008) dataset: $M_{BV}$ = .04 (SD = .14) and $M_{MV}$ = .09 (SD = .19), likely because of the Vinson and Vigliocco (2008) dataset preprocessing.

These values were then correlated with Latent Semantic Analysis scores (LSA) and Jiang-Conrath semantic distance (JCN). LSA is one of the most well-known semantic memory models (Landauer & Dumais, 1997; McRae & Jones, 2013), wherein a large text corpus (i.e., many texts) is used to create a word by document (i.e., each text) matrix. From this matrix, words are weighted relative to their frequency, and singular value decomposition is then used to select only the largest semantic components. This process creates a word space that can be used to calculate the relation between two cues by examining the patterns of their occurrence across documents, usually with cosine or correlation. JCN is calculated from an online dictionary (WordNet; Fellbaum, 1998) by measuring the semantic distance between concepts in a hierarchical structure. JCN is backwards coded: zero values indicate close semantic neighbors (low dictionary distance), and high values indicate low semantic relation. These two measures were selected for convergent validity because they are well-cited measures of meaning. To examine whether the type of processing impacted the convergent validity of the dataset, we calculated the McRae et al., (2005) and Vinson and Vigliocco (2008) cosine values based on their original cue-feature matrices provided in their publications. These datasets were coded for more complex features in a propositional style (“is a”, “has a”), while our processing took a single word, count-based approach. Therefore, providing the original processing correlations allows one to examine whether the cosine values provided are convergent, as well as similarly correlated with other measures of meaning.
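To make the LSA pipeline concrete, a toy sketch of the dimensionality reduction step follows; a full LSA implementation applies a log-entropy weighting before the singular value decomposition, which is omitted here for brevity.

```python
import numpy as np

def lsa_vectors(word_by_document, k=300):
    """Toy LSA: truncate the SVD of a word-by-document count matrix to
    k dimensions (frequency weighting omitted for brevity)."""
    u, s, vt = np.linalg.svd(word_by_document, full_matrices=False)
    return u[:, :k] * s[:k]  # one reduced vector per word

def lsa_similarity(a, b):
    """Relation between two word vectors in the reduced space."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```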

Table 6 displays the correlations between similarity measures. Of particular interest were the different processing styles between previous publications and the current paper (“MV COS”, “PCOS”, “Raw”, and “Root”); these correlations were all r > .80, indicating convergent validity. The affix measures showed medium to large correlations with the cosine measures, and approximately the same size correlations with the other similarity measures, implying that our affix values capture a different but related piece of information. The small negative correlations between JCN and cosine measures replicated previous findings (Buchanan, Holmes, Teasley, & Hutchison, 2013). LSA values showed small positive correlations with cosine values, indicating some overlap between thematic information and semantic feature overlap (Maki & Buchanan, 2008). The correlation between propositional processing (“MV COS” column) and JCN was higher than that for the new root cosine measure (−.39 versus −.18, respectively). JCN is created through a hierarchical dictionary with a structure similar to the complex propositional coding provided in McRae et al., (2005) and Vinson and Vigliocco (2008), and correspondingly, the relation between them is stronger.

Table 6 Correlations and 95% CI between semantic and associative variables

Relation to semantic priming

We examined the correlations between our cosine values and the Z-priming values from the Semantic Priming Project. The Semantic Priming Project includes lexical decision (i.e., responding whether a presented string is a word or nonword) and naming (i.e., reading a concept aloud) response latencies for priming at 200- and 1200-ms stimulus onset asynchronies (SOA). In these experiments, participants were shown cue-target pairs in which the target was either the first associate of the cue or an other associate (second response or higher in the Nelson et al., 2004, norms), with the delay between cue and target at a 200- or 1200-ms SOA. The response latency for the target word in the related condition (either first or other associate) was subtracted from the response latency in the unrelated condition to create a priming response latency. We selected the Z-scored priming values from the dataset to correlate with our data, as Hutchison et al., (2013) demonstrated that the Z-scored data more accurately capture priming by controlling for individual differences in response latencies.
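A minimal sketch of this Z-scoring logic follows, assuming a trial-level data frame with subject, target, condition, and rt columns (hypothetical names); the Semantic Priming Project’s actual processing includes additional trimming steps not shown here.

```python
import pandas as pd

def z_priming(trials):
    """Z-score latencies within each subject, then compute priming per
    target as mean z(unrelated) minus mean z(related)."""
    trials = trials.copy()
    trials["z_rt"] = trials.groupby("subject")["rt"].transform(
        lambda rt: (rt - rt.mean()) / rt.std()
    )
    cell_means = trials.groupby(["target", "condition"])["z_rt"].mean().unstack()
    return cell_means["unrelated"] - cell_means["related"]
```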

In addition to root, raw, and affix cosines, we also calculated feature set size for the cue and target of the primed pairs. Feature set size is the number of features listed by participants when creating the norms for that concept. Because of the nature of our norms, we calculated feature set size for both the raw, untranslated features and the translated features. The average feature set sizes for our dataset can be found in Table 2. The last variable included was cosine set size, defined as the number of other concepts with which each cue or target had a nonzero cosine. Feature set size indicates the number of features listed for each cue or target, while cosine set size indicates the number of other semantically related concepts for each cue or target. Feature and cosine set size are often called semantic richness, representing the variability or extent of associated information for a cue (Buchanan, Westbury, & Burgess, 2001; Pexman, Hargreaves, Edwards, Henry, & Goodyear, 2007; Pexman, Hargreaves, Siakaluk, Bodner, & Pope, 2008). Several studies have shown positive effects of semantic richness on semantic tasks depending on task demands (Duñabeitia, Avilés, & Carreiras, 2008; Pexman, Hargreaves, Siakaluk, Bodner, & Pope, 2008; Yap, Pexman, Wellsby, Hargreaves, & Huff, 2012; Yap, Tan, Pexman, & Hargreaves, 2011), and thus, these were included as important variables to examine.
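Both set size measures are straightforward to compute from the released files; the following sketch assumes the cue-feature file and a cue-cue cosine file with the column names described above (the root_cosine column name is illustrative).

```python
import pandas as pd

def set_sizes(cue_feature, cue_cosines):
    """Feature set size: distinct translated features per cue.
    Cosine set size: concepts paired with a nonzero cosine per cue."""
    feature_set_size = cue_feature.groupby("cue")["translated"].nunique()
    nonzero = cue_cosines[cue_cosines["root_cosine"] > 0]
    cosine_set_size = nonzero.groupby("cue")["target"].nunique()
    return feature_set_size, cosine_set_size
```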

Tables 7 (for the lexical decision task) and 8 (for the naming task) display the correlations between the new semantic variables described above, as well as forward strength, backward strength, Latent Semantic Analysis score, and Jiang-Conrath semantic distance for reference. Only cue-target pairs with complete values were included in this analysis, to allow comparison between correlations. Both tables reveal that most of the correlations between semantic/associative similarity and priming are near zero or very small. The notable exceptions are the lexical decision priming times and semantic richness, which showed medium correlations (rs ≈ .30) for feature set sizes; however, this effect did not appear in the naming data.

Table 7 Lexical decision response latencies’ correlation and 95% CI with semantic and associative variables
Table 8 Naming response latencies’ correlation and 95% CI with semantic and associative variables

Discussion

This research project focused on expanding the availability of English semantic feature overlap norms, in an effort to provide more coverage of concepts that occur in other large database projects like the Semantic Priming and English Lexicon Projects. The number and breadth of linguistic variables and normed databases have increased over the years; however, researchers can still be limited by the concept overlap between them. Projects like the Small World of Words provide newly expanded datasets for association norms (De Deyne, Navarro, Perfors, Brysbaert, & Storms, 2018), and our work helps fill the corresponding voids for semantic norms. To provide the largest dataset of similar data, we combined the newly collected data with the previous work of Buchanan et al., (2013), McRae et al., (2005), and Vinson and Vigliocco (2008). These norms were reprocessed to explore the impact of feature coding on feature overlap. As shown in the correlation between root and raw cosines, the parsing of words to root form created very similar results across other variables. This finding does not imply that these cosine values are the same, as root cosines were larger than their corresponding raw cosines. It does, however, imply that the cue-feature coding can produce similar results in raw or translated format. Because the correlations between the current paper’s cosine values and the previous cosine values were high (rs = .91 and .94), we suggest using the new values, simply for the increase in dataset size.

Of particular interest was the information that is often lost when translating raw features back to a root word. One surprising result in this study was the sheer number of affixes present on the features for each cue word. With these values, we believe we have captured some of the nuance that is often discarded in this type of research. Affix cosines were less related to the root and raw cosines than those measures were to each other. Potentially, affix overlap can add small but meaningful predictive value for related semantic phenomena. Further investigation into the compound prediction of these variables is warranted to fully explore how these, and other lexical variables, may be used to understand semantic priming. An examination of the cosine values for the Semantic Priming Project cue-target set indicates that these values were low, with many zeros (i.e., no feature overlap between cues and targets). This restriction of range in cosine relatedness could explain the small correlations with priming: the semantic priming was variable, but the cosine values were not.

One important limitation of the instructions in this study is that multiple senses of concepts were not distinguished. Because we wished to capture the features of multiple senses of a concept, we did not prime participants for specific senses; however, this procedure could lead to lower cosine values for concepts that might intuitively seem very related. The affixes could shed light on the polysemy of cues, as normal processing of features might exclude characteristic-, location-, or magnitude-type information. The cue-feature lists could also be examined for different senses and categorized by their ontology.

We encourage readers to use the website associated with these norms to download the data, explore the Shiny apps, and use the options provided for controlled experimental stimuli creation. We previously documented the limitations of feature production norms that rely on single word instances as their features (i.e., four and legs), rather than combined phrase sets. One potential limitation, then, is the inability to make fine distinctions between cues; however, the small feature set sizes imply that the granularity of features is coarse, since many distinguishing features are often never listed in these tasks. For instance, dogs are living creatures, but has lungs or has skin would usually not be listed during a feature production task; thus, feature sets should not be considered a complete snapshot of mental representation (Rogers & McClelland, 2004). Additionally, the cue-feature lists could be explored for the type of cue-feature representation listed for each part of speech (i.e., physical, functional, etc.), and the complexity of the coding could be increased or decreased depending on the researcher’s goal. The previous data and other norms were purposely combined in the recoded format so that researchers could use the entire set of available norms, which increases comparability across datasets. Given the strong correlations between databases, we suspect that using single word features does not reduce their reliability or validity. We found high correlations between the different types of feature coding (i.e., complex/propositional versus single word/count), suggesting that either dataset could be used for future work; the advantage of the current project is the size of the norms.