A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010
Introduction
Acceptability judgments provide the primary empirical foundation of many syntactic theories (Chomsky, 1965; Schütze, 1996). The vast majority of the acceptability judgments that have been reported in the syntax literature were collected using methods that appear relatively informal compared to the data collection methods in other domains of cognitive science. However, over the past 15 years or so there has been a shift in data collection practices, with the number of studies employing formal experimental methods, sometimes known as experimental syntax techniques following Cowart (1997), increasing every year. This development means that there are two methods for collecting acceptability judgments currently in widespread use in the field of syntax: the relatively informal traditional methods that have largely established the foundation of the field for the past 60 years (henceforth informal methods), and the more formal experimental methods that have been gaining popularity over the past 15 years (henceforth formal methods). This methodological dichotomy has led a number of researchers to ask which method is empirically superior (e.g., Bard et al., 1996; Keller, 2000; Edelman and Christiansen, 2003; Phillips and Lasnik, 2003; Featherston, 2005a, 2005b, 2007, 2008, 2009; Ferreira, 2005; Sorace and Keller, 2005; Wasow and Arnold, 2005; den Dikken et al., 2007; Alexopoulou and Keller, 2007; Bornkessel-Schlesewsky and Schlesewsky, 2007; Fanselow, 2007; Grewendorf, 2007; Haider, 2007; Newmeyer, 2007; Sprouse, 2007; Culbertson and Gross, 2009; Myers, 2009a, 2009b; Phillips, 2010; Bader and Häussler, 2010; Dąbrowska, 2010; Gibson and Fedorenko, 2010; Culicover and Jackendoff, 2010; Gross and Culbertson, 2011; Sprouse, 2011b; Weskott and Fanselow, 2011; Gibson et al., 2011; Sprouse and Almeida, 2012, 2013; Gibson and Fedorenko, 2013).
Our goal in this paper is to substantially increase the empirical basis of this line of research by comparing the results of informal and formal methods for a very large and random sample of phenomena from the cutting edge of syntactic theorizing.
The goal of the present study is to provide a direct comparison of the results of informal judgment collection methods with the results of formal judgment collection methods. We believe that the first step in understanding the relative merits of each family of methods is to determine to what extent the two methods converge (or diverge). Although several previous studies have compared the results of informal methods with the results of formal methods (e.g., Gordon and Hendrick, 1997; Clifton et al., 2006; Gibson and Fedorenko, 2013), these studies cannot in principle be used to estimate a convergence rate between informal and formal methods, for two reasons. First, these studies investigated a relatively small number of phenomena (e.g., Gibson and Fedorenko (2013) test seven data points comprising three distinct phenomena in their criticism of informal methods) compared to the sheer number of data points published in the syntax literature. With a small sample size, it is unclear whether the number of divergences is high relative to the number of convergences; testing this requires a much larger sample of phenomena. Second, the phenomena tested in previous studies were selected using a biased selection procedure. We cannot know exactly how the phenomena were chosen, but previous studies do not claim to have selected them randomly from the literature. Only random selection provides confidence that the divergences and convergences are representative of the population they are drawn from. This confidence is quantified by the margin of error, which can be calculated from the size of the sample relative to the population. For these reasons we decided to randomly select a large sample from the population of phenomena published in Linguistic Inquiry (LI) from 2001 to 2010.
One added benefit of random sampling is that a sufficiently large random sample will likely (although not strictly necessarily) mean that a large number of distinct syntactic phenomena will be investigated, providing a broad empirical base for the comparison of the results of the two methods.
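The margin-of-error logic described above can be made concrete with a short sketch. The sample size, population size, and worst-case proportion below are illustrative assumptions for exposition, not the study's exact figures:

```python
import math

def margin_of_error(p, n, population=None, z=1.96):
    """95% margin of error for a sampled proportion p with sample size n.
    If the population size is known, apply the finite population correction."""
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None:
        moe *= math.sqrt((population - n) / (population - 1))
    return moe

# Worst-case (p = 0.5) margin for a hypothetical sample of 300 data points:
print(round(margin_of_error(0.5, 300), 3))  # 0.057, i.e. about ±5.7%
```

Note that the worst-case margin shrinks as the sample grows relative to the population, which is why a large random sample licenses population-level inferences that a handful of hand-picked phenomena cannot.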
The rest of this article is organized as follows. In Section 2 we present the design of the random sampling study, along with the rationale for each of the design choices that we made. In Section 3 we present the details and results of the acceptability judgment experiments. The results of the three tasks and five statistical analyses suggest convergence rates with the published informal judgments ranging between 86% and 99%, depending on the analysis. In Section 4 we discuss a principled way of selecting a point estimate for the convergence rate, and the potential inferences licensed by that selection. In Section 5 we discuss the information that would be necessary to make additional inferences about syntactic methodology, and the general design of the follow-up experiments that would be necessary to gather that information. Section 6 concludes.
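A convergence rate of this kind is an estimated binomial proportion, so one standard way to attach an interval to it is a Wilson score interval on the observed counts. The counts below (3 divergences out of 150 phenomena) are purely illustrative, not the study's actual results:

```python
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(3, 150)  # e.g., 3 observed divergences out of 150
print(f"divergence rate in [{lo:.1%}, {hi:.1%}]")
```

The Wilson interval is preferable to the naive normal approximation when the observed proportion is near 0 or 1, as a divergence rate of a few percent is.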
The design of the random sampling study
Any random sampling study requires a number of methodological decisions, such as what constitutes the appropriate population of study, what constitutes an appropriate sample, how best to calculate the estimate of interest, etc. In this section we discuss, in detail, the rationale underlying each of the methodological choices that we made in the construction of the present study. For readers under time constraints, a succinct summary of our methodology is as follows: First, we randomly sampled
Division into nine sub-experiments
As discussed in Section 2, the full test sample consists of 300 conditions that form 150 pairwise phenomena. This means that in order to have a repeated-measures design in which each participant rates each condition once, the three primary experiments (ME, LS, and FC) would each be 300 sentences long. As a general rule, we prefer to keep the length of acceptability judgment experiments to approximately 100 sentences in order to minimize fatigue-based artifacts. In order to meet this length
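The length constraint follows from simple arithmetic: with two conditions per phenomenon, three lists of 50 phenomena keep each list at 100 sentences, and crossing three lists with the three tasks yields nine sub-experiments. A minimal sketch of such a partition (the function and seed are illustrative, not the authors' actual procedure):

```python
import random

def partition_phenomena(n_phenomena=150, n_lists=3, seed=42):
    """Randomly split pairwise phenomena into lists; each phenomenon
    contributes 2 sentences, so 50 phenomena per list = 100 sentences."""
    rng = random.Random(seed)
    ids = list(range(n_phenomena))
    rng.shuffle(ids)
    size = n_phenomena // n_lists
    return [sorted(ids[i * size:(i + 1) * size]) for i in range(n_lists)]

lists = partition_phenomena()
print([2 * len(lst) for lst in lists])  # sentences per list: [100, 100, 100]
```

Keeping both conditions of a phenomenon in the same list preserves the repeated-measures logic: every participant who rates one member of a pair also rates the other.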
The convergence rate
The central question at hand is to what extent informal and formal methods yield convergent results, in this case defined over pairwise phenomena. The current study shows that the results of informal and formal methods are not identical, and suggests that the proportion of divergences is between 1% and 14% (±5.3–5.8%) of the phenomena published in LI between 2001 and 2010. The two questions we would like to discuss in this section are (i) whether we can choose a more precise point estimate of the
How to move the conversation forward
The present study is the first large-scale comparison of informal and formal acceptability judgment collection methods using randomly sampled phenomena from the cutting edge of syntactic theory. The results suggest that the differences between the two methods are relatively small, with a convergence rate of 95% (±5.3–5.8%). Although this is a substantial new piece of information in its own right, in this section we would like to highlight additional kinds of information about the collection of
Conclusion
We have conducted the first large-scale comparison of informal and formal methods based on a random sample of phenomena from the cutting edge of syntactic theory, and obtained a convergence rate of 95% with a margin of error of 5.3–5.8%. As we move forward as a field in the conversation about judgment methods, these results suggest that we can no longer assume or assert that the choice of methods would affect a large proportion of the empirical base of syntax, at least with respect to standard
Acknowledgments
We would like to thank audiences at the following universities for helpful comments on earlier stages of this project: Harvard University, Johns Hopkins University, Michigan State University, Pomona College, Princeton University, University of Connecticut, University of Michigan, and the attendees of TEAL 7 at Hiroshima University. We would also like to thank Colin Phillips and one anonymous reviewer for helpful comments on an earlier draft. This work was supported in part by NSF grant
References (74)
- Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language (2008).
- The language-as-fixed-effect fallacy: a critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior (1973).
- Quantitative methods alone are not enough: response to Gibson and Fedorenko. Trends in Cognitive Sciences (2010).
- How seriously should we take Minimalist syntax? Trends in Cognitive Sciences (2003).
- Magnitude estimation and what it can do for your syntax: some wh-constraints in German. Lingua (2005).
- Weak quantitative standards in linguistics research. Trends in Cognitive Sciences (2010).
- Context effects and their interaction with development: area judgments. Cognitive Development (1990).
- Intuitive knowledge of linguistic co-reference. Cognition (1997).
- Informant elicitation of data on syntactic variation. Lingua (1973).
- Categorical data analysis: away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language (2008).
- The design and analysis of small-scale syntactic judgment experiments. Lingua.
- Linguistics and empirical evidence: reply to Edelman and Christiansen. Trends in Cognitive Sciences.
- How to deal with the "Language-as-fixed-effect fallacy": common misconceptions and alternative solutions. Journal of Memory and Language.
- Gradience in linguistic data. Lingua.
- Intuitions in linguistic argumentation. Lingua.
- On the choice of design and of test statistic in the analysis of experiments with sampled materials. Journal of Verbal Learning and Verbal Behavior.
- Comments on Clark's "The language-as-fixed-effect fallacy". Journal of Verbal Learning and Verbal Behavior.
- Locality, cyclicity and resumption: at the interface between the grammar and the human sentence processor. Language.
- Analyzing Linguistic Data: A Practical Introduction to Statistics using R.
- Toward a model of grammaticality judgments. Journal of Linguistics.
- Null hypothesis significance testing revisited. Methodology.
- Magnitude estimation of linguistic acceptability. Language.
- lme4: Linear Mixed-effects Models using S4 Classes. R Package Version 0.999999-0.
- The wolf in sheep's clothing: against a new judgment-driven imperialism. Theoretical Linguistics.
- On the distribution of null complementizers. Linguistic Inquiry.
- Aspects of the Theory of Syntax.
- Amnestying superiority violations: processing multiple questions. Linguistic Inquiry.
- Random means random. Journal of Verbal Learning and Verbal Behavior.
- The Earth is round (p < .05). American Psychologist.
- Experimental Syntax: Applying Objective Methods to Sentence Judgments.
- Are linguists better subjects? British Journal for the Philosophy of Science.
- Naïve v. expert intuitions: an empirical study of acceptability judgments. Linguistic Review.
- Data and grammar: means and individuals. Theoretical Linguistics.
- Carrots – perfect as vegetables, but please not as a main dish. Theoretical Linguistics.
- Universals and grammaticality: Wh-constraints in German and English. Linguistics.
- Data in generative grammar: the stick and the carrot. Theoretical Linguistics.
- Thermometer judgments as linguistic evidence.