Annotated free paraphrases and difficulty of interpretation for 45 novel English noun-noun compounds

dataset

Posted on 2020-09-10, 10:02; authored by Martin Schäfer and Melanie Bell

This is a dataset containing free paraphrases of the meaning of 45 novel English compound nouns, presented in isolation, together with Likert scale ratings of the difficulty of coming up with that meaning. Each compound has a variety of possible interpretations, and the free paraphrases are coded for the interpretation they represent.

All the novel compounds occur just once in ukWaC (Baroni et al. 2009), a 2-billion-word corpus of English. The constituent nouns in the compounds are the same as the constituents of the compounds used in Reddy et al. (2011) and in the annotated compound family dataset published with Bell and Schäfer (2016). The compounds were presented in randomly ordered lists, with each random list of 45 items distributed over two participants so that no participant saw a constituent twice.

The study was designed as a computer-based questionnaire, presented using a Python 3 script. For each compound, participants were first asked ‘What does [COMPOUND] mean?’. After typing their interpretation in the box provided, they were asked ‘How difficult was it for you to think of a meaning for [COMPOUND]?’. For this question, an explicitly labelled Likert-type scale was used. The labels were: Extremely difficult, Moderately difficult, Slightly difficult, Slightly easy, Moderately easy, Extremely easy. In the dataset, these ratings have been mapped onto the values 6 to 1, respectively.
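The label-to-value mapping described above can be sketched in Python as follows. This is only an illustration and not the script used in the study: the names `LIKERT_LABELS`, `encode_rating` and `decode_rating` are hypothetical, and the README file remains the authority on how the ratings are actually stored in the dataset.

```python
# Hypothetical helper names; only the labels and the 6-to-1 mapping
# are taken from the dataset description.
LIKERT_LABELS = [
    "Extremely difficult",   # mapped to 6
    "Moderately difficult",  # mapped to 5
    "Slightly difficult",    # mapped to 4
    "Slightly easy",         # mapped to 3
    "Moderately easy",       # mapped to 2
    "Extremely easy",        # mapped to 1
]

# Build the label -> value mapping: first label gets 6, last label gets 1.
LABEL_TO_VALUE = {label: 6 - i for i, label in enumerate(LIKERT_LABELS)}

# Inverse mapping, useful for turning stored values back into labels.
VALUE_TO_LABEL = {value: label for label, value in LABEL_TO_VALUE.items()}


def encode_rating(label: str) -> int:
    """Return the numeric value (6 to 1) for a Likert label."""
    return LABEL_TO_VALUE[label.strip()]


def decode_rating(value: int) -> str:
    """Return the Likert label for a numeric value (6 to 1)."""
    return VALUE_TO_LABEL[value]
```

Under this mapping, higher values indicate greater reported difficulty, so `encode_rating("Slightly easy")` gives 3 and `decode_rating(6)` gives "Extremely difficult".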

A bottom-up categorisation process was employed to group the paraphrases into a set of distinct interpretations for each compound. The data was categorised by two coders, both native speakers of English: Melanie J. Bell and Diane Nicholls, an experienced professional lexicographer. For each compound, the coders saw the 20 paraphrases in different random orders, and independently identified those paraphrases that they thought converged on the same interpretation. This was done using the constant comparative method, i.e. there was no a priori set of categories; instead, the coders introduced categories as needed to capture the data adequately. After completing the categorisation individually, the two coders resolved any discrepancies through discussion and consensus over a series of meetings. The coders further grouped interpretations under their superordinate interpretation if such a superordinate interpretation occurred in the paraphrases. The coders also added a key in the form of minimal definitions of the interpretations. For further details and illustration of this coding procedure, see Schäfer and Bell (2020).

The raw data was collected from January to April 2018 at Anglia Ruskin University, Cambridge. Participants were sixth form students from Cambridge, all between 16 and 18 years old and all monolingual native speakers of English who had grown up in the East of England. Personal information, including gender, was collected using free-text fields. Half the participants self-identified as ‘male’, and half as ‘female’. These gender self-identifications are contained in the dataset.

For a detailed description of the structure of the dataset, see the README file.

This project received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 752436.

References

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi and Eros Zanchetta (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226. https://doi.org/10.1007/s10579-009-9081-4.

Bell, Melanie J. and Martin Schäfer (2016). Modelling semantic transparency. Morphology, 26(2), 157–199. https://doi.org/10.1007/s11525-016-9286-3.

Reddy, Siva, Diana McCarthy and Suresh Manandhar (2011). An empirical study on compositionality in compound nouns. In Proceedings of 5th International Conference on Natural Language Processing, AFNLP, Chiang Mai, Thailand (pp. 210–218). https://www.aclweb.org/anthology/I11-1024.

Schäfer, Martin and Melanie J. Bell (2020). Constituent polysemy and interpretational diversity in attested English novel compounds. The Mental Lexicon, 15(1), 42–61. https://doi.org/10.1075/ml.00013.sch.