Lingua

Volume 134, September 2013, Pages 219–248

A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010

https://doi.org/10.1016/j.lingua.2013.07.002

Highlights

  • We compare the results of informal and formal acceptability judgment methods.

  • We test 296 sentence types randomly sampled from Linguistic Inquiry.

  • 936 participants, 3 tasks: magnitude estimation, Likert scale, or forced choice.

  • We find 95% (±5%) convergence between the results of the two methods.

  • We explore the consequences of this convergence for syntactic methodology.

Abstract

The goal of the present study is to provide a direct comparison of the results of informal judgment collection methods with the results of formal judgment collection methods, as a first step in understanding the relative merits of each family of methods. Although previous studies have compared small samples of informal and formal results, this article presents the first large-scale comparison based on a random sample of phenomena from a leading theoretical journal (Linguistic Inquiry). We tested 296 data points from the approximately 1743 English data points that were published in Linguistic Inquiry between 2001 and 2010. We tested this sample with 936 naïve participants using three formal judgment tasks (magnitude estimation, 7-point Likert scale, and two-alternative forced-choice) and report five statistical analyses. The results suggest a convergence rate of 95% between informal and formal methods, with a margin of error of 5.3–5.8%. We discuss the implications of this convergence rate for the ongoing conversation about judgment collection methods, and lay out a set of questions for future research into syntactic methodology.

Introduction

Acceptability judgments provide the primary empirical foundation of many syntactic theories (Chomsky, 1965, Schütze, 1996). The vast majority of the acceptability judgments that have been reported in the syntax literature were collected using methods that appear relatively informal compared to the data collection methods in other domains of cognitive science. However, over the past 15 years or so there has been a shift in data collection practices, with the number of studies employing formal experimental methods, sometimes known as experimental syntax techniques following Cowart (1997), increasing every year. This development means that there are two methods for collecting acceptability judgments currently in widespread use in the field of syntax: the relatively informal traditional methods that have largely established the foundation of the field for the past 60 years (henceforth informal methods), and the more formal experimental methods that have been gaining popularity over the past 15 years (henceforth formal methods). This methodological dichotomy has led a number of researchers to ask which method is empirically superior (e.g., Bard et al., 1996, Keller, 2000, Edelman and Christiansen, 2003, Phillips and Lasnik, 2003, Featherston, 2005a, Featherston, 2005b, Featherston, 2007, Featherston, 2008, Featherston, 2009, Ferreira, 2005, Sorace and Keller, 2005, Wasow and Arnold, 2005, den Dikken et al., 2007, Alexopoulou and Keller, 2007, Bornkessel-Schlesewsky and Schlesewsky, 2007, Fanselow, 2007, Grewendorf, 2007, Haider, 2007, Newmeyer, 2007, Sprouse, 2007, Culbertson and Gross, 2009, Myers, 2009a, Myers, 2009b, Phillips, 2010, Bader and Häussler, 2010, Dąbrowska, 2010, Gibson and Fedorenko, 2010, Culicover and Jackendoff, 2010, Gross and Culbertson, 2011, Sprouse, 2011b, Weskott and Fanselow, 2011, Gibson et al., 2011, Sprouse and Almeida, 2012, Sprouse and Almeida, 2013, Gibson and Fedorenko, 2013). Our goal in this paper is to substantially increase the empirical basis of this line of research by comparing the results of informal and formal methods for a very large and random sample of phenomena from the cutting edge of syntactic theorizing.

The goal of the present study is to provide a direct comparison of the results of informal judgment collection methods with the results of formal judgment collection methods. We believe that the first step in understanding the relative merits of each family of methods is to determine to what extent the two methods converge (or diverge). Although several previous studies have compared the results of informal methods with the results of formal methods (e.g., Gordon and Hendrick, 1997, Clifton et al., 2006, Gibson and Fedorenko, 2013), these previous studies cannot in principle be used to estimate a convergence rate between informal and formal methods, for two reasons. First, these studies have investigated a relatively small number of phenomena (e.g., Gibson and Fedorenko (2013) test seven data points comprising three distinct phenomena in their criticism of informal methods) compared to the sheer number of data points published in the syntax literature. With a small sample, it is unclear whether the number of divergences is high relative to the number of convergences; testing this requires a much larger sample of phenomena. Second, the phenomena that have been tested in previous studies were selected using a biased selection procedure. We cannot know exactly how the phenomena were chosen, but previous studies do not claim to have selected them randomly from the literature. Only a random selection provides confidence that the divergences and convergences are representative of the population they are drawn from. This confidence is quantified with the margin of error, which can be calculated from the size of the sample relative to the population. For these reasons we decided to randomly select a large sample from the population of phenomena published in Linguistic Inquiry (LI) from 2001 to 2010. One added benefit of random sampling is that a sufficiently large random sample will likely (although not strictly necessarily) cover a large number of distinct syntactic phenomena, providing a broad empirical base for the comparison of the results of the two methods.
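To make the margin-of-error calculation concrete, the following sketch computes it for a proportion estimated from a random sample, using the standard normal approximation with a finite population correction. This is a minimal illustration using the sample and population sizes of the present study; the paper's own calculation may differ in its details (e.g., in the assumed proportion).

    # Margin of error for a proportion, with finite population correction (FPC).
    # Illustrative values: n = 296 sampled data points out of N = 1743
    # published data points; p = 0.5 gives the most conservative (widest) margin.
    n <- 296
    N <- 1743
    p <- 0.5
    z <- qnorm(0.975)                       # two-sided 95% confidence level
    fpc <- sqrt((N - n) / (N - 1))          # correction for sampling without replacement
    moe <- z * sqrt(p * (1 - p) / n) * fpc
    round(moe, 3)                           # ~0.052, on the order of the reported 5.3-5.8%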

The rest of this article is organized as follows. In Section 2 we present the design of the random sampling study, along with the rationale for each of the design choices that we made. In Section 3 we present the details and results of the acceptability judgment experiments. The results of the three tasks and five statistical analyses suggest convergence rates with the published informal judgments ranging between 86% and 99%, depending on the analysis. In Section 4 we discuss a principled way of selecting a point estimate for the convergence rate, and the potential inferences licensed by that selection. In Section 5 we discuss the information that would be necessary to make additional inferences about syntactic methodology, and the general design of the follow-up experiments that would be necessary to gather that information. Section 6 concludes.

Section snippets

The design of the random sampling study

Any random sampling study requires a number of methodological decisions, such as what constitutes the appropriate population of study, what constitutes an appropriate sample, how best to calculate the estimate of interest, etc. In this section we discuss, in detail, the rationale underlying each of the methodological choices that we made in the construction of the present study. For readers under time constraints, a succinct summary of our methodology is as follows: First, we randomly sampled…
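Purely as an illustration of the sampling step, a simple random sample of the kind described here can be drawn as follows (a minimal sketch in R; the identifiers are hypothetical stand-ins, not the authors' actual materials).

    # Hypothetical sketch: draw a simple random sample of data points from
    # the coded population of LI 2001-2010 judgments.
    set.seed(2013)                          # any fixed seed makes the draw reproducible
    population_ids <- 1:1743                # stand-ins for the coded data points
    sampled_ids <- sample(population_ids, size = 296)
    length(sampled_ids)                     # 296 sampled data points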

Division into nine sub-experiments

As discussed in Section 2, the full test sample consists of 300 conditions that form 150 pairwise phenomena. This means that in order to have a repeated-measures design in which each participant rates each condition once, the three primary experiments (ME, LS, and FC) would each be 300 sentences long. As a general rule, we prefer to keep the length of acceptability judgment experiments to approximately 100 sentences in order to minimize fatigue-based artifacts. In order to meet this length…
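One way such a division could be implemented is sketched below (hypothetical R code assuming an even three-way split of the 150 phenomena; the snippet above does not specify the authors' actual assignment procedure).

    # Hypothetical sketch: partition 150 pairwise phenomena (300 conditions)
    # into three lists of 50 phenomena (100 sentences) each, keeping both
    # conditions of a phenomenon in the same list.
    set.seed(2013)
    shuffled <- sample(1:150)               # randomize phenomenon order
    lists <- split(shuffled, rep(1:3, each = 50))
    lengths(lists)                          # 50 phenomena, i.e., 100 sentences, per list
    # Crossing the three lists with the three tasks (ME, LS, FC) yields
    # the nine sub-experiments.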

The convergence rate

The central question at hand is to what extent informal and formal methods yield convergent results, in this case defined over pairwise phenomena. The current study shows that the results of informal and formal methods are not identical, and suggests that the number of divergences is between 1% and 14% (±5.3–5.8%) of the phenomena published in LI between 2001 and 2010. The two questions we would like to discuss in this section are (i) whether we can choose a more precise point estimate of the…
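As a purely illustrative example of how a convergence rate and its uncertainty over pairwise phenomena can be quantified, consider the sketch below. The counts are invented for the illustration and are not the paper's results, and the paper's five statistical analyses are more involved than a single binomial test.

    # Hypothetical counts: suppose 142 of 150 sampled pairwise phenomena
    # replicate the published informal judgment under a formal task.
    converged <- 142
    total <- 150
    converged / total                       # point estimate of the convergence rate (~0.947)
    binom.test(converged, total)$conf.int   # exact 95% binomial confidence interval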

How to move the conversation forward

The present study is the first large-scale comparison of informal and formal acceptability judgment collection methods using randomly sampled phenomena from the cutting edge of syntactic theory. The results suggest that the differences between the two methods are relatively small, with a convergence rate of 95% (±5.3–5.8%). Although this is a substantial new piece of information in its own right, in this section we would like to highlight additional kinds of information about the collection of…

Conclusion

We have conducted the first large-scale comparison of informal and formal methods based on a random sample of phenomena from the cutting edge of syntactic theory, and obtained a convergence rate of 95% with a margin of error of 5.3–5.8%. As we move forward as a field in the conversation about judgment methods, these results suggest that we can no longer assume or assert that the choice of methods would affect a large proportion of the empirical base of syntax, at least with respect to standard…

Acknowledgments

We would like to thank audiences at the following universities for helpful comments on earlier stages of this project: Harvard University, Johns Hopkins University, Michigan State University, Pomona College, Princeton University, University of Connecticut, University of Michigan, and the attendees of TEAL 7 at Hiroshima University. We would also like to thank Colin Phillips and one anonymous reviewer for helpful comments on an earlier draft. This work was supported in part by NSF grant…

References (74)

  • J. Myers

    The design and analysis of small-scale syntactic judgment experiments

    Lingua

    (2009)
  • C. Phillips et al.

    Linguistics and empirical evidence: reply to Edelman and Christiansen

    Trends in Cognitive Sciences

    (2003)
  • J.G. Raaijmakers et al.

    How to deal with the “Language-as-fixed-effect fallacy”: common misconceptions and alternative solutions

    Journal of Memory and Language

    (1999)
  • A. Sorace et al.

    Gradience in linguistic data

    Lingua

    (2005)
  • T. Wasow et al.

    Intuitions in linguistic argumentation

    Lingua

    (2005)
  • T.D. Wickens et al.

    On the choice of design and of test statistic in the analysis of experiments with sampled materials

    Journal of Verbal Learning and Verbal Behavior

    (1983)
  • E.L. Wike et al.

    Comments on Clark's “The language-as-fixed-effect fallacy”

    Journal of Verbal Learning and Verbal Behavior

    (1976)
  • T. Alexopoulou et al.

    Locality, cyclicity and resumption: at the interface between the grammar and the human sentence processor

    Language

    (2007)
  • R.H. Baayen

    Analyzing Linguistic Data: A Practical Introduction to Statistics using R

    (2007)
  • M. Bader et al.

    Toward a model of grammaticality judgments

    Journal of Linguistics

    (2010)
  • N. Balluerka et al.

    Null hypothesis significance testing revisited

    Methodology

    (2005)
  • E.G. Bard et al.

    Magnitude estimation of linguistic acceptability

    Language

    (1996)
  • D.M. Bates et al.

    lme4: Linear Mixed-effects Models using S4 Classes. R Package Version 0.999999-0

    (2012)
  • I. Bornkessel-Schlesewsky et al.

    The wolf in sheep's clothing: against a new judgment-driven imperialism

    Theoretical Linguistics

    (2007)
  • Ž. Bošković et al.

    On the distribution of null complementizers

    Linguistic Inquiry

    (2003)
  • N. Chomsky

    Aspects of the Theory of Syntax

    (1965)
  • C. Clifton et al.

    Amnestying superiority violations: processing multiple questions

    Linguistic Inquiry

    (2006)
  • J. Cohen

    Random means random

    Journal of Verbal Learning and Verbal Behavior

    (1976)
  • J. Cohen

    The Earth is round (p < .05)

    American Psychologist

    (1994)
  • W. Cowart

    Experimental Syntax: Applying Objective Methods to Sentence Judgments

    (1997)
  • J. Culbertson et al.

    Are linguists better subjects?

    British Journal for the Philosophy of Science

    (2009)
  • E. Dąbrowska

    Naïve v. expert intuitions: an empirical study of acceptability judgments

    Linguistic Review

    (2010)
  • M. den Dikken et al.

    Data and grammar: means and individuals

    Theoretical Linguistics

    (2007)
  • G. Fanselow

    Carrots – perfect as vegetables, but please not as a main dish

    Theoretical Linguistics

    (2007)
  • S. Featherston

    Universals and grammaticality: Wh-constraints in German and English

    Linguistics

    (2005)
  • S. Featherston

    Data in generative grammar: the stick and the carrot

    Theoretical Linguistics

    (2007)
  • S. Featherston

    Thermometer judgments as linguistic evidence

    (2008)