Parsing models have many applications in AI, ranging from natural language processing (NLP) and computational music analysis to logic programming and computational learning. Broadly conceived, a parsing model seeks to uncover the underlying structure of an input: the ways in which elements of the input combine to form phrases or constituents, and how those phrases recursively combine to form a tree structure for the whole input. During the last fifteen years, a major shift has taken place from rule-based, deterministic parsing to corpus-based, probabilistic parsing. A quick glance over the NLP literature from the last ten years, for example, indicates that virtually all natural language parsing systems are currently probabilistic. The same development can be observed in (stochastic) logic programming and (statistical) relational learning. This trend towards probabilistic parsing is not surprising: the increasing availability of very large collections of text, music, images and the like allows statistically motivated parsing systems to be induced from actual data.
A corpus-based parsing approach that has been quite successful in various fields of AI is known as Data-Oriented Parsing, or DOP. DOP was originally developed as an NLP technique but has been generalized to music analysis, problem-solving and unsupervised structure learning [7, 8, 14, 81]. The distinctive feature of the DOP approach, when it was first presented, was to model sentence structures on the basis of previously observed frequencies of sentence-structure fragments, without imposing any constraints on the size of these fragments. Fragments include, for instance, subtrees of depth 1 (corresponding to context-free rules), as well as entire trees.
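The notion of a fragment can be made concrete with a minimal sketch. The tuple-based tree representation, the function names and the toy sentence below are illustrative assumptions, not taken from the chapter; the sketch only assumes the usual reading of a fragment as a connected piece of a tree in which every node keeps either all of its children or none of them.

```python
from itertools import product

# Assumed representation: a tree node is a (label, children) tuple and a
# word is a plain string. A node with empty children stands for a frontier
# nonterminal, i.e. a site left open when the fragment was cut out.

def fragments_rooted_at(node):
    """Return every fragment whose root is `node`."""
    label, children = node
    options = []
    for child in children:
        if isinstance(child, str):
            # Words are always kept as they are.
            options.append([child])
        else:
            # Either cut below this child (bare label) or expand it further.
            options.append([(child[0], ())] + fragments_rooted_at(child))
    return [(label, tuple(combo)) for combo in product(*options)]

def all_fragments(tree):
    """Collect the fragments rooted at every nonterminal node of `tree`."""
    found = fragments_rooted_at(tree)
    for child in tree[1]:
        if not isinstance(child, str):
            found.extend(all_fragments(child))
    return found

def depth(node):
    """Words and frontier nonterminals have depth 0."""
    if isinstance(node, str) or not node[1]:
        return 0
    return 1 + max(depth(child) for child in node[1])

# Hypothetical toy tree for "she sleeps".
tree = ("S", (("NP", ("she",)), ("VP", ("sleeps",))))
frags = all_fragments(tree)
rules = [f for f in frags if depth(f) == 1]
print(len(frags), len(rules), tree in frags)
```

For this toy tree the sketch yields six fragments: three of depth 1, which correspond to the context-free rules S → NP VP, NP → she and VP → sleeps, and three of depth 2, one of which is the entire tree. The number of fragments grows very quickly with tree size, so explicit enumeration of this kind is only practical for small examples.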
References
Abeillé A (ed.) (2003) Treebanks. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Alonso M, Finn E (1996) Physics. Addison Wesley, Reading, MA.
Baader F, Nipkow T (1998) Term Rewriting and All That. Cambridge University Press, UK.
Black E, Abney S, Flickinger D, Gdaniec C, Grishman R, Harrison P, Hindle D, Ingria R, Jelinek F, Klavans J, Liberman M, Marcus M, Roukos S, Santorini B, Strzalkowski T (1991) A procedure for quantitatively comparing the syntactic coverage of English. In: Proc. 5th DARPA Speech and Natural Language Workshop, Pacific Grove, CA, Morgan Kaufmann, San Mateo, CA: 306-311.
Black E, Lafferty J, Roukos S (1992) Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals. In: Proc. 30th Association Computer Linguistics Conf. (ACL’92), Newark, DE, Association for Computer Linguistics, Stroudsburg, PA: 185-192.
Black E, Garside R, Leech G (1993) Statistically-Driven Computer Grammars of English: The IBM/Lancaster Approach. Rodopi, Amsterdam, The Netherlands.
Bod R (1992) Data-oriented parsing. In: Proc. Computational Linguistics Conf. (COLING’92), Nantes, France, Association for Computer Linguistics, Stroudsburg, PA: 854-859.
Bod R (1998) Beyond Grammar: An Experience-Based Theory of Language. Stanford: CSLI Publications (Lecture Notes number 88), distributed by Cambridge University Press, Cambridge, UK.
Bod R (1999) Context-sensitive spoken dialogue processing with the DOP model. Natural Language Engineering, 5(4): 309-323.
Bod R (2000) Parsing with the shortest derivation. In: Proc. 18th ACL Computational Linguistics Conf. (COLING’2000), Saarbrücken, Germany, Association for Computer Linguistics, Stroudsburg, PA: 69-75.
Bod R (2001) What is the minimal set of subtrees that achieves maximal parse accuracy? In: Proc. 39th Association Computer Linguistics Conf. (ACL’2001), Toulouse, France, Association for Computer Linguistics, Stroudsburg, PA: 66-73.
Bod R (2002) A unified model of structural organization in language and music. J. Artificial Intelligence Research, 17: 289-308.
Bod R (2002) Memory-based models of melodic analysis: challenging the Gestalt principles. J. New Music Research, 31(1): 27-37.
Bod R (2003) An efficient implementation of a new DOP model. In: Proc. 10th European Association Computer Linguistics Conf. (EACL’03), 12-17 April, Budapest, Hungary, Association for Computer Linguistics, Stroudsburg, PA: 19-26.
Bod R (2004) Exemplar-based explanation. In: Proc. Computation and Philosophy Conf. (ECAP04), 3-5 June, Pavia, Italy.
Bod R (2005) Modeling scientific problem solving by DOP. In: Proc. Cognitive Science Conf. (CogSci’05). Stresa, Italy: 103.
Bod R (2006) Unsupervised parsing with U-DOP. In: Proc. 10th Computational Natural Language Learning Conf. (CONLL’2006), 8-9 June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 85-92.
Bod R (2006) An all-subtrees approach to unsupervised parsing. In: Proc. ACL Computational Linguistics Conf. (COLING’2006), Sydney, Australia, Association for Computer Linguistics, Stroudsburg, PA: 865-872.
Bod R (2006) Towards a general model of applying science. Intl. Studies in the Philosophy of Science, 20(1): 5-25.
Bod R (2006) Exemplar-based reasoning with the shortest derivation. In: Magnani L (ed.) Model-Based Reasoning in Science and Engineering. College Publications, London, UK: 119-140.
Bod R (2006) Exemplar-based syntax: how to get productivity from examples. The Linguistic Review (Special Issue on Exemplar-Based Models in Linguistics), 23(3): 289-318.
Bod R, Kaplan R (1998) A probabilistic corpus-driven model for lexical-functional analysis. In: Proc. ACL Computational Linguistics Conf. (COLING’98), 10-14 August, Montreal, Canada, Association for Computer Linguistics, Stroudsburg, PA: 145-152.
Bod R, Hay J, Jannedy S (eds.) (2003) Probabilistic Linguistics. MIT Press, Cambridge, MA.
Bod R, Scha R, Sima’an K (eds.) (2003) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Bod R, Kaplan R (2003) A DOP model for lexical-functional grammar. In: Bod R, Scha R, Sima’an K (eds.) (2003) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Bonnema R, Bod R, Scha R (1997) A DOP model for semantic interpretation. In: Proc. 4th European Association Computer Linguistics Conf. (EACL’97), Madrid, Spain, Association for Computer Linguistics, Stroudsburg, PA: 159-167.
Briscoe T, Waegner N (1992) Robust stochastic parsing using the inside-outside algorithm. In: Proc. AAAI Workshop Statistically-Based Techniques in Natural Language Processing, Menlo Park, CA, AAAI Press/MIT Press, Cambridge, MA: 39-53.
Carbonell J (1993) Derivational analogy: a theory of reconstructive problem solving and expertise acquisition. In: Michalski RS, Carbonell J, Mitchell T (eds.) Machine Learning II. Morgan Kaufmann, San Francisco, CA: 371-392.
Charniak E (1997) Statistical techniques for natural language parsing. AI Magazine, Winter: 32-43.
Charniak E (2000) A maximum-entropy-inspired parser. In: Proc. 1st North American ACL Chapter Conf. (ANLP-NAACL’2000), Seattle, WA, Morgan Kaufmann, San Francisco, CA: 132-139.
Chater N (1999) The search for simplicity: a fundamental cognitive principle? The Quarterly J. Experimental Psychology, 52A(2): 273-302.
Chiang D (2000) Statistical parsing with an automatically extracted tree adjoining grammar. In: Proc. 38th Association Computer Linguistics Conf. (ACL’2000), October, Hong Kong, China, Association for Computer Linguistics, Stroudsburg, PA: 456-463.
Clark A (2001) Unsupervised induction of stochastic context-free grammars using distributional clustering. In: Proc. Computational Natural Language Learning Conf. (CoNLL’2001), July, Toulouse, France, Association for Computer Linguistics, Stroudsburg, PA: 97-104.
Chomsky N (1965) Aspects of the Theory of Syntax. MIT Press, Cambridge MA.
Collins M (1996) A new statistical parser based on bigram lexical dependencies. In: Proc. 34th Association Computer Linguistics Conf. (ACL’96), 23-28 June, Santa Cruz, CA, Association for Computer Linguistics, Stroudsburg, PA: 184-191.
Collins M (1997) Three generative lexicalised models for statistical parsing. In: Proc. 35th Association Computer Linguistics Conf. (ACL’97), July, Madrid, Spain, Association for Computer Linguistics, Stroudsburg, PA: 16-23.
Collins M (1999) Head-Driven Statistical Models for Natural Language Parsing. PhD Thesis, University of Pennsylvania, PA.
Collins M (2000) Discriminative reranking for natural language parsing. In: Proc. 17th Intl. Conf. Machine Learning (ICML-2000), Stanford, CA: 175-182.
Collins M, Duffy N (2001) Convolution kernels for natural language. In: Dietterich TG, Becker S, Ghahramani Z (eds.) Advances in NIPS 14 (Proc. NIPS’2001), 3-8 December, Vancouver, Canada, MIT Press, Cambridge, MA: 617-624.
Collins M, Duffy N (2002) New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In: Proc. 38th Association Computer Linguistics Conf. (ACL’2002), Philadelphia, PA, Association for Computer Linguistics, Stroudsburg, PA: 263-270.
Conklin D (2006) Melodic analysis with segment classes. Machine Learning, 65(2-3): 349-360.
Cussens J (2001) Parameter estimation in stochastic logic programs. Machine Learning, 44(3): 245-271.
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, 39: 1-38.
De Raedt L, Kersting K (2004) Probabilistic inductive logic programming. In: Proc. Algorithmic Learning Theory (ALT) Conf., Lecture Notes in Computer Science 3244, Springer-Verlag, Berlin: 19-36.
Douglas J, Matthews R (1996) Fluid Mechanics 1 (3rd ed.). Longman, Essex, UK.
Eisner J (1996) Three new probabilistic models for dependency parsing: an exploration. In: Proc. 18th ACL Computational Linguistics Conf. (COLING’96), August, Copenhagen, Denmark, Association for Computer Linguistics, Stroudsburg, PA: 340-345.
Ferrand M, Nelson P, Wiggins G (2003) Unsupervised learning of melodic segmentation: a memory-based approach. In: Proc. 5th European Society for the Cognitive Sciences of Music Conf. (ESCOM’2003), 8-13 September, Hanover, Germany.
Frazier L (1978) On Comprehending Sentences: Syntactic Parsing Strategies. PhD Thesis, University of Connecticut.
Fujisaki T, Jelinek F, Cocke J, Black E, Nishino T (1989) A probabilistic method for sentence disambiguation. In: Proc. 1st Intl. Workshop Parsing Technologies, 28-31 August, Pittsburgh, PA: 85-94.
Gahl S, Garnsey S (2004) Knowledge of grammar, knowledge of usage: syntactic probabilities affect pronunciation variation. Language, 80(4): 748-775.
Giere R (1988) Explaining Science: A Cognitive Approach. University of Chicago Press, Chicago, IL.
Goldberg A (2006) Constructions at Work. Oxford University Press, Oxford, UK.
Goodman J (1996) Efficient algorithms for parsing the DOP model. In: Proc. Empirical Methods in Natural Language Processing, Philadelphia, PA: 143-152.
Goodman J (2003) Efficient parsing of DOP with PCFG-reductions. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Hearne M, Way A (2003) Seeing the wood for the trees: data-oriented translation. In: Proc. Machine Translation Summit IX, September, New Orleans, LA: 165-172.
Hearne M, Way A (2004) Data-oriented parsing and the Penn Chinese Treebank. In: Proc. 1st Intl. Joint Conf. Natural Language Processing, May, Hainan Island, China: 406-413.
Hearne M, Way A (2006) Disambiguation strategies for data-oriented translation. In: Proc. 11th Intl. Conf. European Association for Machine Translation, 19-20 June, Oslo, Norway.
Hoogweg L (2003) Extending DOP with insertion. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Huron D (1996) The melodic arch in western folksongs. Computing in Musicology, 10: 2-23.
Johnson M (1998) PCFG models of linguistic tree representations. Computational Linguistics, 24(4): 613-632.
Johnson M (2002) The DOP estimation method is biased and inconsistent. Computational Linguistics, 28(1): 71-76.
Jurafsky D (2003) Probabilistic modeling in psycholinguistics. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL: 39-96.
Klein D (2005) The unsupervised learning of natural language structure. PhD Thesis, Department of Computer Science, Stanford University, CA.
Klein D, Manning C (2002) A general constituent-context model for improved grammar induction. In: Proc. 40th Association Computer Linguistics Conf. (ACL’2002), July, Philadelphia, PA, Association for Computer Linguistics, Stroudsburg, PA: 128-135.
Klein D, Manning C (2004) Corpus-based induction of syntactic structure: models of dependency and constituency. In: Proc. 42nd Association Computer Linguistics Conf. (ACL’2004), 21-26 July, Barcelona, Spain, Association for Computer Linguistics, Stroudsburg, PA: 438.
Kudo T, Suzuki J, Isozaki H (2005) Boosting-based parse reranking with subtree features. In: Proc. 43rd Association Computer Linguistics Conf. (ACL’2005), June, Ann Arbor, MI, Association for Computer Linguistics, Stroudsburg, PA: 189-196.
Kuhn T (1970) The Structure of Scientific Revolutions (2nd ed.). University of Chicago Press, Chicago, IL.
Lerdahl F, Jackendoff R (1983) A Generative Theory of Tonal Music. MIT Press, Cambridge, MA.
Longuet-Higgins H (1976) Perception of melodies. Nature, 263, October 21: 646-653.
Longuet-Higgins H, Lee C (1987) The rhythmic interpretation of monophonic music. In: Longuet-Higgins H (ed.) Mental Processes: Studies in Cognitive Science, MIT Press, Cambridge, MA.
Makatchev M, Jordan P, VanLehn K (2004) Abductive theorem proving for analyzing student explanations to guide feedback in intelligent tutoring systems. J. Automated Reasoning (Special Issue: Automated Reasoning and Theorem Proving in Education), 32(3): 187-226.
Manning C (2003) Probabilistic syntax. In: Bod R, Hay J, Jannedy S (eds.) Probabilistic Linguistics. MIT Press, Cambridge, MA: 289-342.
Manning C, Schuetze H (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Marcus M, Santorini B, Marcinkiewicz M (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2): 313-330.
McClosky D, Charniak E, Johnson M (2006) Effective self-training for parsing. In: Proc. North American Chapter of ACL Conf. Human Language Technology (NAACL-HLT 2006), June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 152-159.
Mitchell T, Keller R, Kedar-Cabelli S (1986) Explanation-based learning: a unifying view. Machine Learning, 1: 47-80.
Mooney J, Zelle J (1994) Integrating ILP and EBL. SIGART Bulletin, 5(1): 12-21.
Muggleton S (1996) Stochastic logic programs. In: De Raedt L (ed.) Advances in Inductive Logic Programming (Proc. 5th Intl. Workshop Inductive Logic Programming), IOS Press, Amsterdam, The Netherlands: 254-264.
Neumann G (2003) A data-oriented approach to HPSG. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Pereira F, Schabes Y (1992) Inside-outside reestimation from partially bracketed corpora. In: Proc. 30th Association Computer Linguistics Conf. (ACL’92), Newark, DE, Association for Computer Linguistics, Stroudsburg, PA: 128-135.
Scha R (1990) Taaltheorie en taaltechnologie; competence en performance. In: de Kort Q, Leerdam G (eds.) Computertoepassingen in de Neerlandistiek. Landelijke Vereniging van Neerlandici (LVVN-jaarboek), Almere, The Netherlands.
Schaffrath H (1995) The Essen Folksong Collection in the Humdrum Kern Format. In: Huron D (ed.) Probabilistic Grammars for Music. Center for Computer Assisted Research in the Humanities, Menlo Park, CA.
Sima’an K (1996) Computational complexity of probabilistic disambiguation by means of tree grammars. In: Proc. 14th Computational Linguistics Conf. (COLING’96), 5-9 August, Copenhagen, Denmark, Association for Computer Linguistics, Stroudsburg, PA: 1175-1180.
Sima’an K (1999) Learning Efficient Disambiguation. ILLC Dissertation Series 1999-02, Utrecht University, The Netherlands.
Sima’an K, Itai A, Winter Y, Altman A, Nativ N (2001) Building a tree-bank of modern Hebrew text. J. Traitement Automatique des Langues (Special Issue on Natural Language Processing and Corpus Linguistics), 42(2): 347-380.
Temperley D (2001) The Cognition of Basic Musical Structures. MIT Press, Cambridge, MA.
Tomasello M (2003) Constructing a Language. Harvard University Press, Harvard, MA.
Van Lehn K (1998) Analogy events: how examples are used during problem solving. Cognitive Science, 22(3): 347-388.
van Zaanen M (2000) ABL: alignment-based learning. In: Proc. 18th Computational Linguistics Conf. (COLING’2000), 31 July - 4 August, Saarbrücken, Germany, Association for Computer Linguistics, Stroudsburg, PA: 961-967.
van Zaanen M (2002) Bootstrapping Structure into Language. PhD thesis. School of Computing, University of Leeds, UK.
van Zaanen M, Bod R, Honing H (2003) A memory-based approach to meter induction. In: Proc. 5th European Society for the Cognitive Sciences of Music Conf. (ESCOM5), September, Hanover, Germany: 250-253.
Veloso M, Carbonell J (1993) Derivational analogy in PRODIGY: automating case acquisition, storage, and utilization. Machine Learning, 10(3): 249-278.
Wertheimer M (1923) Untersuchungen zur Lehre von der Gestalt. Psychologische Forschung, 4: 301-350.
Younger D (1967) Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2): 189-208.
Zollmann A, Sima’an K (2005) A consistent and efficient estimator for data-oriented parsing. J. Automata, Languages and Combinatorics, 10: 367-388.
Zuidema W (2006) What are the productive units of natural language grammar? A DOP approach to the automatic identification of constructions. In: Proc. 10th Computational Natural Language Learning Conf. (CONLL’2006), 8-9 June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 29-36.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bod, R. (2008). The Data-Oriented Parsing Approach: Theory and Application. In: Fulcher, J., Jain, L.C. (eds) Computational Intelligence: A Compendium. Studies in Computational Intelligence, vol 115. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78293-3_7
DOI: https://doi.org/10.1007/978-3-540-78293-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78292-6
Online ISBN: 978-3-540-78293-3
eBook Packages: Engineering, Engineering (R0)