The Data-Oriented Parsing Approach: Theory and Application

Bod, Rens

doi:10.1007/978-3-540-78293-3_7


Part of the book series: Studies in Computational Intelligence (SCI, volume 115)

Parsing models have many applications in AI, ranging from natural language processing (NLP) and computational music analysis to logic programming and computational learning. Broadly conceived, a parsing model seeks to uncover the underlying structure of an input, that is, the various ways in which elements of the input combine to form phrases or constituents and how those phrases recursively combine to form a tree structure for the whole input. During the last fifteen years, a major shift has taken place from rule-based, deterministic parsing to corpus-based, probabilistic parsing. A quick glance over the NLP literature from the last ten years, for example, indicates that virtually all natural language parsing systems are currently probabilistic. The same development can be observed in (stochastic) logic programming and (statistical) relational learning. This trend towards probabilistic parsing is not surprising: the increasing availability of very large collections of text, music, images and the like allows statistically motivated parsing systems to be induced from actual data.

A corpus-based parsing approach that has been quite successful in various fields of AI is known as Data-Oriented Parsing, or DOP. DOP was originally developed as an NLP technique but has been generalized to music analysis, problem-solving and unsupervised structure learning [7, 8, 14, 81]. The distinctive feature of the DOP approach, when it was first presented, was to model sentence structures on the basis of previously observed frequencies of sentence-structure fragments, without imposing any constraints on the size of these fragments. Fragments include, for instance, subtrees of depth 1 (corresponding to context-free rules), as well as entire trees.
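
As a minimal sketch of what "fragments of any size" means, the Python below enumerates every fragment of one toy tree. It assumes the usual DOP convention that each node in a fragment keeps either all of its children or none of them; the tuple encoding, the example sentence and the function names are illustrative choices, not taken from the chapter.

```python
from collections import Counter
from itertools import product

# Encoding (an assumption of this sketch): an internal node is a tuple
# (label, child, ...), a word is a str, and a cut-off (frontier)
# nonterminal inside a fragment is a 1-tuple (label,).

def rooted_fragments(node):
    """Yield every fragment whose root is the internal node `node`."""
    label, *children = node
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])  # a word is always kept
        else:
            # Either cut at the child or continue with one of its fragments.
            options.append([(child[0],)] + list(rooted_fragments(child)))
    for combo in product(*options):
        yield (label, *combo)

def all_fragments(tree):
    """Yield the fragments rooted at every internal node of `tree`."""
    if isinstance(tree, str):
        return
    yield from rooted_fragments(tree)
    for child in tree[1:]:
        yield from all_fragments(child)

def depth(fragment):
    """Length of the longest path from the fragment's root to a leaf."""
    if isinstance(fragment, str) or len(fragment) == 1:
        return 0
    return 1 + max(depth(child) for child in fragment[1:])

tree = ("S", ("NP", "John"), ("VP", ("V", "likes"), ("NP", "Mary")))
fragments = list(all_fragments(tree))
counts = Counter(fragments)  # fragment frequencies over this one-tree "corpus"
rules = [f for f in fragments if depth(f) == 1]  # the context-free rules
```

For this tree the enumeration gives 17 fragments, 5 of which have depth 1 (for example `("S", ("NP",), ("VP",))`, i.e. S → NP VP), and the largest fragment is the tree itself. Over a real treebank the same counting would be applied to every tree, and the number of fragments grows very quickly with tree size.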


References

Abeillé A (ed.) (2003) Treebanks. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Alonso M, Finn E (1996) Physics. Addison Wesley, Reading, MA.
Baader F, Nipkow T (1998) Term Rewriting and All That. Cambridge University Press, UK.
Black E, Abney S, Flickinger D, Gdaniec C, Grishman R, Harrison P, Hindle D, Ingria R, Jelinek F, Klavans J, Liberman M, Marcus M, Roukos S, Santorini B, Strzalkowski T (1991) A Procedure for quantitatively comparing the syntactic coverage of English. In: Proc. 5th DARPA Speech and Natural Language Workshop, Pacific Grove, CA, Morgan Kaufmann, San Mateo, CA: 306-311.
Black E, Lafferty J, Roukos S (1992) Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals. In: Proc. 30th Association Computer Linguistics Conf. (ACL’92), Newark, DE, Association for Computer Linguistics, Stroudsburg, PA: 185-192.
Black E, Garside R, Leech G (1993) Statistically-Driven Computer Grammars of English: The IBM/Lancaster Approach. Rodopi, Amsterdam, The Netherlands.
Bod R (1992) Data-oriented parsing. In: Proc. Computational Linguistics Conf. (COLING’92), Nantes, France, Association for Computer Linguistics, Stroudsburg, PA: 854-859.
Bod R (1998) Beyond Grammar: An Experience-Based Theory of Language. Stanford: CSLI Publications (Lecture Notes number 88), distributed by Cambridge University Press, Cambridge, UK.
Bod R (1999) Context-sensitive spoken dialogue processing with the DOP model. Natural Language Engineering, 5(4): 309-323.
Bod R (2000) Parsing with the shortest derivation. In: Proc. 18th ACL Computational Linguistics Conf. (COLING’2000), Saarbrücken, Germany, Association for Computer Linguistics, Stroudsburg, PA: 69-75.
Bod R (2001) What is the minimal set of subtrees that achieves maximal parse accuracy? In: Proc. 39th Association Computer Linguistics Conf. (ACL’2001), Toulouse, France, Association for Computer Linguistics, Stroudsburg, PA: 66-73.
Bod R (2002) A unified model of structural organization in language and music. J. Artificial Intelligence Research, 17: 289-308.
Bod R (2002) Memory-based models of melodic analysis: challenging the Gestalt principles. J. New Music Research, 31(1): 27-37.
Bod R (2003) An efficient implementation of a new DOP model. In: Proc. 10th European Association Computer Linguistics Conf. (EACL’03), 12-17 April, Budapest, Hungary, Association for Computer Linguistics, Stroudsburg, PA: 19-26.
Bod R (2004) Exemplar-based explanation. In: Proc. Computation and Philosophy Conf. (ECAP04), 3-5 June, Pavia, Italy.
Bod R (2005) Modeling scientific problem solving by DOP. In: Proc. Cognitive Science Conf. (CogSci’05). Stresa, Italy: 103.
Bod R (2006) Unsupervised parsing with U-DOP. In: Proc. 10th Computational Natural Language Learning Conf. (CONLL’2006), 8-9 June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 85-92.
Bod R (2006) An all-subtrees approach to unsupervised parsing. In: Proc. ACL Computational Linguistics Conf. (COLING’2006), Sydney, Australia, Association for Computer Linguistics, Stroudsburg, PA: 865-872.
Bod R (2006) Towards a general model of applying science. Intl. Studies in the Philosophy of Science, 20(1): 5-25.
Bod R (2006) Exemplar-based reasoning with the shortest derivation. In: Magnani L (ed.) Model-Based Reasoning in Science and Engineering. College Publications, London, UK: 119-140.
Bod R (2006) Exemplar-based syntax: how to get productivity from examples. The Linguistic Review (Special Issue on Exemplar-Based Models in Linguistics), 23(3): 289-318.
Bod R, Kaplan R (1998) A probabilistic corpus-driven model for lexical-functional analysis. In: Proc. ACL Computational Linguistics Conf. (COLING’98), 10-14 August, Montreal, Canada, Association for Computer Linguistics, Stroudsburg, PA: 145-152.
Bod R, Hay J, Jannedy S (eds.) (2003) Probabilistic Linguistics. MIT Press, Cambridge, MA.
Bod R, Scha R, Sima’an K (eds.) (2003) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Bod R, Kaplan R (2003) A DOP model for lexical-functional grammar. In: Bod R, Scha R, Sima’an K (eds.) (2003) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Bonnema R, Bod R, Scha R (1997) A DOP model for semantic interpretation. In: Proc. 4th European Association Computer Linguistics Conf. (EACL’97), Madrid, Spain, Association for Computer Linguistics, Stroudsburg, PA: 159-167.
Briscoe T, Waegner N (1992) Robust stochastic parsing using the inside-outside algorithm. In: Proc. AAAI Workshop Statistically-Based Techniques in Natural Language Processing, Menlo Park, CA, AAAI Press/MIT Press, Cambridge, MA: 39-53.
Carbonell J (1993) Derivational analogy: a theory of reconstructive problem solving and expertise acquisition. In: Michalski RS, Carbonell J, Mitchell T (eds.) Machine Learning II. Morgan Kaufmann, San Francisco, CA: 371-392.
Charniak E (1997) Statistical techniques for natural language parsing. AI Magazine, Winter: 32-43.
Charniak E (2000) A maximum-entropy-inspired parser. In: Proc. 1st North American ACL Chapter Conf. (ANLP-NAACL’2000), Seattle, WA, Morgan Kaufmann, San Francisco, CA: 132-139.
Chater N (1999) The search for simplicity: a fundamental cognitive principle? The Quarterly J. Experimental Psychology, 52A(2): 273-302.
Chiang D (2000) Statistical parsing with an automatically extracted tree adjoining grammar. In: Proc. 38th Association Computer Linguistics Conf. (ACL’2000), October, Hong Kong, China, Association for Computer Linguistics, Stroudsburg, PA: 456-463.
Clark A (2001) Unsupervised induction of stochastic context-free grammars using distributional clustering. In: Proc. Computational Natural Language Learning Conf. (CoNLL’2001), July, Toulouse, France, Association for Computer Linguistics, Stroudsburg, PA: 97-104.
Chomsky N (1965) Aspects of the Theory of Syntax. MIT Press, Cambridge MA.
Collins M (1996) A new statistical parser based on Bigram lexical dependencies. In: Proc. 34th Association Computer Linguistics Conf. (ACL’96), 23-28 June, Santa Cruz, CA, Association for Computer Linguistics, Stroudsburg, PA: 184-191.
Collins M (1997) Three generative lexicalised models for statistical parsing. In: Proc. 35th Association Computer Linguistics Conf. (ACL’97), July, Madrid, Spain, Association for Computer Linguistics, Stroudsburg, PA: 16-23.
Collins M (1999) Head-Driven Statistical Models for Natural Language Parsing. PhD Thesis, University of Pennsylvania, PA.
Collins M (2000) Discriminative reranking for natural language parsing. In: Proc. 17th Intl. Conf. Machine Learning (ICML-2000), Stanford, CA: 175-182.
Collins M, Duffy N (2001) Convolution kernels for natural language. In: Dietrich TG, Becker S, Gharamani Z (eds.) Advances in NIPS 14 (Proc. NIPS’2001), 3-8 December, Vancouver, Canada, MIT Press, Cambridge, MA: 617-624.
Collins M, Duffy N (2002) New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In: Proc. 38th Association Computer Linguistics Conf. (ACL’2002), Philadelphia, PA, Association for Computer Linguistics, Stroudsburg, PA: 263-270.
Conklin D (2006) Melodic analysis with segment classes. Machine Learning, 65(2-3): 349-360.
Cussens J (2001) Parameter estimation in stochastic logic programs. Machine Learning, 44(3): 245-271.
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, 39: 1-38.
De Raedt L, Kersting K (2004) Probabilistic inductive logic programming. In: Proc. Algorithmic Learning Theory (ALT) Conf., Lecture Notes in Computer Science 3244, Springer-Verlag, Berlin: 19-36.
Douglas J, Matthews R (1996) Fluid Mechanics 1 (3rd ed.). Longman, Essex, UK.
Eisner J (1996) Three new probabilistic models for dependency parsing: an exploration. In: Proc. 18th ACL Computational Linguistics Conf. (COLING’96), August, Copenhagen, Denmark, Association for Computer Linguistics, Stroudsburg, PA: 340-345.
Ferrand M, Nelson P, Wiggins G (2003) Unsupervised learning of melodic segmentation: a memory-based approach. In: Proc. 5th European Society for the Cognitive Sciences of Music Conf. (ESCOM’2003), 8-13 September, Hanover, Germany.
Frazier L (1978) On Comprehending Sentences: Syntactic Parsing Strategies. PhD Thesis, University of Connecticut.
Fujisaki T, Jelinek F, Cocke J, Black E, Nishino T (1989) A probabilistic method for sentence disambiguation. In: Proc. 1st Intl. Workshop Parsing Technologies, 28-31 August, Pittsburgh, PA: 85-94.
Gahl S, Garnsey S (2004) Knowledge of grammar, knowledge of usage: syntactic probabilities affect pronunciation variation. Language, 80(4): 748-775.
Giere R (1988) Explaining Science: A Cognitive Approach. University of Chicago Press, Chicago, IL.
Goldberg A (2006) Constructions at Work. Oxford University Press, Oxford, UK.
Goodman J (1996) Efficient algorithms for parsing the DOP model. In: Proc. Empirical Methods in Natural Language Processing, Philadelphia, PA: 143-152.
Goodman J (2003) Efficient parsing of DOP with PCFG-reductions. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Hearne M, Way A (2003) Seeing the wood for the trees: data-oriented translation. In: Proc. Machine Translation Summit IX, September, New Orleans, LO: 165-172.
Hearne M, Way A (2004) Data-oriented parsing and the Penn Chinese Treebank. In: Proc. 1st Intl. Joint Conf. Natural Language Processing, May, Hainan Island, China: 406-413.
Hearne M, Way A (2006) Disambiguation strategies for data-oriented translation. In: Proc. 11th Intl. Conf. European Association for Machine Translation, 19-20 June, Oslo, Norway.
Hoogweg L (2003) Extending DOP with insertion. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Huron D (1996) The melodic arch in western folksongs. Computing in Musicology, 10: 2-23.
Johnson M (1998) PCFG models of linguistic tree representations. Computational Linguistics, 24(4): 613-632.
Johnson M (2002) The DOP estimation method is biased and inconsistent. Computational Linguistics, 28(1): 71-76.
Jurafsky D (2003) Probabilistic modeling in psycholinguistics. In: Bod R, Scha R, Sima’an K (eds) Data-Oriented Parsing. University of Chicago Press, Chicago, IL: 39-96.
Klein D (2005) The unsupervised learning of natural language structure. PhD Thesis, Department of Computer Science, Stanford University, CA.
Klein D, Manning C (2002) A general constituent-context model for improved grammar induction. In: Proc. 40th Association Computer Linguistics Conf. (ACL’2002), July, Philadelphia, PA, Association for Computer Linguistics, Stroudsburg, PA: 128-135.
Klein D, Manning C (2004) Corpus-based induction of syntactic structure: models of dependency and constituency. Proc. 42nd Association Computer Linguistics Conf. (ACL’2004), 21-26 July, Barcelona, Spain, Association for Computer Linguistics, Stroudsburg, PA: 438.
Kudo T, Suzuki J, Isozaki H (2005) Boosting-based parse reranking with subtree features. In: Proc. 43rd Association Computer Linguistics Conf. (ACL’2005), June, Ann Arbor, MI, Association for Computer Linguistics, Stroudsburg, PA: 189-196.
Kuhn T (1970) The Structure of Scientific Revolutions (2nd ed.). University of Chicago Press, Chicago, IL.
Lerdahl F, Jackendoff R (1983) A Generative Theory of Tonal Music. MIT Press, Cambridge, MA.
Longuet-Higgins H (1976) Perception of melodies. Nature, 263, October 21: 646-653.
Longuet-Higgins H, Lee C (1987) The rhythmic interpretation of monophonic music. In: Longuet-Higgins H (ed.) Mental Processes: Studies in Cognitive Science, MIT Press, Cambridge, MA.
Makatchev M, Jordan P, VanLehn K (2004) Abductive theorem proving for analyzing student explanations to guide feedback in intelligent tutoring systems. J. Automated Reasoning (Special Issue: Automated Reasoning and Theorem Proving in Education), 32(3): 187-226.
Manning C (2003) Probabilistic syntax. In: Bod R, Hay J, Jannedy S (eds.) Probabilistic Linguistics. MIT Press, Cambridge, MA: 289-342.
Manning C, Schuetze H (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Marcus M, Santorini B, Marcinkiewicz M (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2): 313-330.
McClosky D, Charniak E, Johnson M (2006) Effective self-training for parsing. In: Proc. North American Chapter of ACL Conf. Human Language Technology (NAACL-HLT 2006), June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 152-159.
Mitchell T, Keller R, Kedar-Cabelli S (1986) Explanation-based learning: a unifying view. Machine Learning, 1: 47-80.
Mooney J, Zelle J (1994) Integrating ILP and EBL. SIGART Bulletin, 5(1): 12-21.
Muggleton S (1996) Stochastic logic programs. In: De Raedt L (ed.) Advances in Inductive Logic Programming (Proc. 5th Intl. Workshop Inductive Logic Programming), IOS Press, Amsterdam, The Netherlands: 254-264.
Neumann G (2003) A data-oriented approach to HPSG. In: Bod R, Scha R, Sima’an K (eds) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.
Pereira F, Schabes Y (1992) Inside-outside reestimation from partially bracketed corpora. In: Proc. 30th Association Computer Linguistics Conf. (ACL’92), Newark, DE, Association for Computer Linguistics, Stroudsburg, PA: 128-135.
Scha R (1990) Taaltheorie en taaltechnologie; competence en performance. In: de Kort Q, Leerdam G (eds) Computertoepassingen in de Neerlandistiek. Landelijke Vereniging van Neerlandici (LVVN-jaarboek), Almere, The Netherlands.
Schaffrath H (1995) The Essen Folksong Collection in the Humdrum Kern Format. In: Huron D (ed.) Probabilistic Grammars for Music. Center for Computer Assisted Research in the Humanities, Menlo Park, CA.
Sima’an K (1996) Computational complexity of probabilistic disambiguation by means of tree grammars. In: Proc. 14th Computational Linguistics Conf. (COLING’96), 5-9 August, Copenhagen, Denmark, Association for Computer Linguistics, Stroudsburg, PA: 1175-1180.
Sima’an K (1999) Learning Efficient Disambiguation. ILLC Dissertation Series 1999-02, Utrecht University, The Netherlands.
Sima’an K, Itai A, Winter Y, Altman A, Nativ N (2001) Building a tree-bank of modern Hebrew text. J. Traitement Automatique des Langues (Special Issue on Natural Language Processing and Corpus Linguistics), 42(2): 347-380.
Temperley D (2001) The Cognition of Basic Musical Structures. MIT Press, Cambridge, MA.
Tomasello M (2003) Constructing a Language. Harvard University Press, Harvard, MA.
Van Lehn K (1998) Analogy events: how examples are used during problem solving. Cognitive Science, 22(3): 347-388.
van Zaanen M (2000) ABL: alignment-based learning. In: Proc. 18th Computational Linguistics Conf. (COLING’2000), 31 July - 4 August, Saarbrücken, Germany, Association for Computer Linguistics, Stroudsburg, PA: 961-967.
van Zaanen M (2002) Bootstrapping Structure into Language. PhD thesis. School of Computing, University of Leeds, UK.
van Zaanen M, Bod R, Honing H (2003) A memory-based approach to meter induction. In: Proc. 5th European Society for the Cognitive Sciences of Music Conf. (ESCOM5), September, Hanover, Germany: 250-253.
Veloso M, Carbonell J (1993) Derivational analogy in PRODIGY: automating case acquisition, storage, and utilization. Machine Learning, 10(3): 249-278.
Wertheimer M (1923) Untersuchungen zur lehre von der gestalt. Psychologische Forschung, 4: 301-350.
Younger D (1967) Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2): 189-208.
Zollmann A, Sima’an K (2005) A consistent and efficient estimator for data-oriented parsing. J. Automata, Languages and Combinatorics, 10: 367-388.
Zuidema W (2006) What are the productive units of natural language grammar? A DOP approach to the automatic identification of constructions. In: Proc. 10th Computational Natural Language Learning Conf. (CONLL’2006), 8-9 June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 29-36.


Author information

Authors and Affiliations

School of Computer Science, University of St. Andrews, Scotland
Rens Bod


Editor information

Editors and Affiliations

School of Computer Science and Software Engineering Faculty of Informatics, University of Wollongong, Northfields Ave, Wollongong, NSW, 2522, Australia
John Fulcher
Knowledge-Based Engineering Founding Director of the KES Centre, University of South Australia, SCT-Building Mawson Lakes Campus, Adelaide, South Australia SA, 5095, Australia
L. C. Jain


About this chapter

Cite this chapter

Bod, R. (2008). The Data-Oriented Parsing Approach: Theory and Application. In: Fulcher, J., Jain, L.C. (eds) Computational Intelligence: A Compendium. Studies in Computational Intelligence, vol 115. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78293-3_7


DOI: https://doi.org/10.1007/978-3-540-78293-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78292-6
Online ISBN: 978-3-540-78293-3
eBook Packages: Engineering, Engineering (R0)
