The Data-Oriented Parsing Approach: Theory and Application

Part of the book series: Studies in Computational Intelligence (SCI, volume 115)

Parsing models have many applications in AI, ranging from natural language processing (NLP) and computational music analysis to logic programming and computational learning. Broadly conceived, a parsing model seeks to uncover the underlying structure of an input, that is, the various ways in which elements of the input combine to form phrases or constituents and how those phrases recursively combine to form a tree structure for the whole input. During the last fifteen years, a major shift has taken place from rule-based, deterministic parsing to corpus-based, probabilistic parsing. A quick glance over the NLP literature from the last ten years, for example, indicates that virtually all natural language parsing systems are currently probabilistic. The same development can be observed in (stochastic) logic programming and (statistical) relational learning. This trend towards probabilistic parsing is not surprising: the increasing availability of very large collections of text, music, images and the like allows statistically motivated parsing systems to be induced from actual data.

A corpus-based parsing approach that has been quite successful in various fields of AI is known as Data-Oriented Parsing, or DOP. DOP was originally developed as an NLP technique but has been generalized to music analysis, problem-solving and unsupervised structure learning [7, 8, 14, 81]. The distinctive feature of the DOP approach, when it was first presented, was to model sentence structures on the basis of previously observed frequencies of sentence-structure fragments, without imposing any constraints on the size of these fragments. Fragments include, for instance, subtrees of depth 1 (corresponding to context-free rules), as well as entire trees.
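To make the notion of a fragment concrete, the following is a minimal Python sketch that enumerates every fragment of one toy parse tree. The nested-tuple tree encoding, the example sentence, and the helper names `fragments_at`, `all_fragments` and `depth` are illustrative assumptions, not taken from the chapter. The sketch assumes the usual reading of a fragment as a connected piece of a tree in which each non-terminal node keeps either all of its children or none of them.

```python
from collections import Counter
from itertools import product

# Hypothetical toy tree: (label, child, ...); a bare string is a word.
tree = ("S", ("NP", "Mary"), ("VP", "sleeps"))


def fragments_at(node):
    """Return every fragment whose root is `node` (a non-terminal tuple)."""
    label, *children = node
    options = []
    for child in children:
        if isinstance(child, str):
            # A word is always kept with its parent.
            options.append([child])
        else:
            # Either cut below the child, leaving an open site (label,),
            # or continue with any fragment rooted at the child.
            options.append([(child[0],)] + fragments_at(child))
    return [(label, *combo) for combo in product(*options)]


def all_fragments(node):
    """Collect the fragments rooted at every non-terminal in the tree."""
    frags = fragments_at(node)
    for child in node[1:]:
        if not isinstance(child, str):
            frags.extend(all_fragments(child))
    return frags


def depth(frag):
    """Longest path from the fragment root to a word or an open site."""
    if isinstance(frag, str) or len(frag) == 1:
        return 0
    return 1 + max(depth(child) for child in frag[1:])


frags = all_fragments(tree)
depth_one = [f for f in frags if depth(f) == 1]
counts = Counter(frags)  # over a whole treebank these would be corpus frequencies
```

For this toy tree the sketch yields six fragments. Three of them have depth 1 and correspond to the context-free rules S → NP VP, NP → Mary and VP → sleeps; the full tree is itself one of the fragments. Counting fragments over a whole treebank, rather than a single tree, would give the observed frequencies that the approach builds on; how those frequencies are turned into probabilities is not shown here.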

References

  1. Abeillé A (ed.) (2003) Treebanks. Kluwer Academic Publishers, Dordrecht, The Netherlands.

  2. Alonso M, Finn E (1996) Physics. Addison Wesley, Reading, MA.

  3. Baader F, Nipkow T (1998) Term Rewriting and All That. Cambridge University Press, UK.

  4. Black E, Abney S, Flickinger D, Gdaniec C, Grishman R, Harrison P, Hindle D, Ingria R, Jelinek F, Klavans J, Liberman M, Marcus M, Roukos S, Santorini B, Strzalkowski T (1991) A procedure for quantitatively comparing the syntactic coverage of English. In: Proc. 5th DARPA Speech and Natural Language Workshop, Pacific Grove, CA, Morgan Kaufmann, San Mateo, CA: 306-311.

  5. Black E, Lafferty J, Roukos S (1992) Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals. In: Proc. 30th Association Computer Linguistics Conf. (ACL’92), Newark, DE, Association for Computer Linguistics, Stroudsburg, PA: 185-192.

  6. Black E, Garside R, Leech G (1993) Statistically-Driven Computer Grammars of English: The IBM/Lancaster Approach. Rodopi, Amsterdam, The Netherlands.

  7. Bod R (1992) Data-oriented parsing. In: Proc. Computational Linguistics Conf. (COLING’92), Nantes, France, Association for Computer Linguistics, Stroudsburg, PA: 854-859.

  8. Bod R (1998) Beyond Grammar: An Experience-Based Theory of Language. Stanford: CSLI Publications (Lecture Notes number 88), distributed by Cambridge University Press, Cambridge, UK.

  9. Bod R (1999) Context-sensitive spoken dialogue processing with the DOP model. Natural Language Engineering, 5(4): 309-323.

  10. Bod R (2000) Parsing with the shortest derivation. In: Proc. 18th ACL Computational Linguistics Conf. (COLING’2000), Saarbrücken, Germany, Association for Computer Linguistics, Stroudsburg, PA: 69-75.

  11. Bod R (2001) What is the minimal set of subtrees that achieves maximal parse accuracy? In: Proc. 39th Association Computer Linguistics Conf. (ACL’2001), Toulouse, France, Association for Computer Linguistics, Stroudsburg, PA: 66-73.

  12. Bod R (2002) A unified model of structural organization in language and music. J. Artificial Intelligence Research, 17: 289-308.

  13. Bod R (2002) Memory-based models of melodic analysis: challenging the Gestalt principles. J. New Music Research, 31(1): 27-37.

  14. Bod R (2003) An efficient implementation of a new DOP model. In: Proc. 10th European Association Computer Linguistics Conf. (EACL’03), 12-17 April, Budapest, Hungary, Association for Computer Linguistics, Stroudsburg, PA: 19-26.

  15. Bod R (2004) Exemplar-based explanation. In: Proc. Computation and Philosophy Conf. (ECAP04), 3-5 June, Pavia, Italy.

  16. Bod R (2005) Modeling scientific problem solving by DOP. In: Proc. Cognitive Science Conf. (CogSci’05). Stresa, Italy: 103.

  17. Bod R (2006) Unsupervised parsing with U-DOP. In: Proc. 10th Computational Natural Language Learning Conf. (CONLL’2006), 8-9 June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 85-92.

  18. Bod R (2006) An all-subtrees approach to unsupervised parsing. In: Proc. ACL Computational Linguistics Conf. (COLING’2006), Sydney, Australia, Association for Computer Linguistics, Stroudsburg, PA: 865-872.

  19. Bod R (2006) Towards a general model of applying science. Intl. Studies in the Philosophy of Science, 20(1): 5-25.

  20. Bod R (2006) Exemplar-based reasoning with the shortest derivation. In: Magnani L (ed.) Model-Based Reasoning in Science and Engineering. College Publications, London, UK: 119-140.

  21. Bod R (2006) Exemplar-based syntax: how to get productivity from examples. The Linguistic Review (Special Issue on Exemplar-Based Models in Linguistics), 23(3): 289-318.

  22. Bod R, Kaplan R (1998) A probabilistic corpus-driven model for lexical-functional analysis. In: Proc. ACL Computational Linguistics Conf. (COLING’98), 10-14 August, Montreal, Canada, Association for Computer Linguistics, Stroudsburg, PA: 145-152.

  23. Bod R, Hay J, Jannedy S (eds.) (2003) Probabilistic Linguistics. MIT Press, Cambridge, MA.

  24. Bod R, Scha R, Sima’an K (eds.) (2003) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.

  25. Bod R, Kaplan R (2003) A DOP model for lexical-functional grammar. In: Bod R, Scha R, Sima’an K (eds.) (2003) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.

  26. Bonnema R, Bod R, Scha R (1997) A DOP model for semantic interpretation. In: Proc. 4th European Association Computer Linguistics Conf. (EACL’97), Madrid, Spain, Association for Computer Linguistics, Stroudsburg, PA: 159-167.

  27. Briscoe T, Waegner N (1992) Robust stochastic parsing using the inside-outside algorithm. In: Proc. AAAI Workshop Statistically-Based Techniques in Natural Language Processing, Menlo Park, CA, AAAI Press/MIT Press, Cambridge, MA: 39-53.

  28. Carbonell J (1993) Derivational analogy: a theory of reconstructive problem solving and expertise acquisition. In: Michalski RS, Carbonell J, Mitchell T (eds.) Machine Learning II. Morgan Kaufmann, San Francisco, CA: 371-392.

  29. Charniak E (1997) Statistical techniques for natural language parsing. AI Magazine, Winter: 32-43.

  30. Charniak E (2000) A maximum-entropy-inspired parser. In: Proc. 1st North American ACL Chapter Conf. (ANLP-NAACL’2000), Seattle, WA, Morgan Kaufmann, San Francisco, CA: 132-139.

  31. Chater N (1999) The search for simplicity: a fundamental cognitive principle? The Quarterly J. Experimental Psychology, 52A(2): 273-302.

  32. Chiang D (2000) Statistical parsing with an automatically extracted tree adjoining grammar. In: Proc. 38th Association Computer Linguistics Conf. (ACL’2000), October, Hong Kong, China, Association for Computer Linguistics, Stroudsburg, PA: 456-463.

  33. Clark A (2001) Unsupervised induction of stochastic context-free grammars using distributional clustering. In: Proc. Computational Natural Language Learning Conf. (CoNLL’2001), July, Toulouse, France, Association for Computer Linguistics, Stroudsburg, PA: 97-104.

  34. Chomsky N (1965) Aspects of the Theory of Syntax. MIT Press, Cambridge MA.

  35. Collins M (1996) A new statistical parser based on bigram lexical dependencies. In: Proc. 34th Association Computer Linguistics Conf. (ACL’96), 23-28 June, Santa Cruz, CA, Association for Computer Linguistics, Stroudsburg, PA: 184-191.

  36. Collins M (1997) Three generative lexicalised models for statistical parsing. In: Proc. 35th Association Computer Linguistics Conf. (ACL’97), July, Madrid, Spain, Association for Computer Linguistics, Stroudsburg, PA: 16-23.

  37. Collins M (1999) Head-Driven Statistical Models for Natural Language Parsing. PhD Thesis, University of Pennsylvania, PA.

  38. Collins M (2000) Discriminative reranking for natural language parsing. In: Proc. 17th Intl. Conf. Machine Learning (ICML-2000), Stanford, CA: 175-182.

  39. Collins M, Duffy N (2001) Convolution kernels for natural language. In: Dietterich TG, Becker S, Ghahramani Z (eds.) Advances in NIPS 14 (Proc. NIPS’2001), 3-8 December, Vancouver, Canada, MIT Press, Cambridge, MA: 617-624.

  40. Collins M, Duffy N (2002) New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In: Proc. 38th Association Computer Linguistics Conf. (ACL’2002), Philadelphia, PA, Association for Computer Linguistics, Stroudsburg, PA: 263-270.

  41. Conklin D (2006) Melodic analysis with segment classes. Machine Learning, 65(2-3): 349-360.

  42. Cussens J (2001) Parameter estimation in stochastic logic programs. Machine Learning, 44(3): 245-271.

  43. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, 39: 1-38.

  44. De Raedt L, Kersting K (2004) Probabilistic inductive logic programming. In: Proc. Algorithmic Learning Theory (ALT) Conf., Lecture Notes in Computer Science 3244, Springer-Verlag, Berlin: 19-36.

  45. Douglas J, Matthews R (1996) Fluid Mechanics 1 (3rd ed.). Longman, Essex, UK.

  46. Eisner J (1996) Three new probabilistic models for dependency parsing: an exploration. In: Proc. 18th ACL Computational Linguistics Conf. (COLING’96), August, Copenhagen, Denmark, Association for Computer Linguistics, Stroudsburg, PA: 340-345.

  47. Ferrand M, Nelson P, Wiggins G (2003) Unsupervised learning of melodic segmentation: a memory-based approach. In: Proc. 5th European Society for the Cognitive Sciences of Music Conf. (ESCOM’2003), 8-13 September, Hanover, Germany.

  48. Frazier L (1978) On Comprehending Sentences: Syntactic Parsing Strategies. PhD Thesis, University of Connecticut.

  49. Fujisaki T, Jelinek F, Cocke J, Black E, Nishino T (1989) A probabilistic method for sentence disambiguation. In: Proc. 1st Intl. Workshop Parsing Technologies, 28-31 August, Pittsburgh, PA: 85-94.

  50. Gahl S, Garnsey S (2004) Knowledge of grammar, knowledge of usage: syntactic probabilities affect pronunciation variation. Language, 80(4): 748-775.

  51. Giere R (1988) Explaining Science: A Cognitive Approach. University of Chicago Press, Chicago, IL.

  52. Goldberg A (2006) Constructions at Work. Oxford University Press, Oxford, UK.

  53. Goodman J (1996) Efficient algorithms for parsing the DOP model. In: Proc. Empirical Methods in Natural Language Processing, Philadelphia, PA: 143-152.

  54. Goodman J (2003) Efficient parsing of DOP with PCFG-reductions. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.

  55. Hearne M, Way A (2003) Seeing the wood for the trees: data-oriented translation. In: Proc. Machine Translation Summit IX, September, New Orleans, LA: 165-172.

  56. Hearne M, Way A (2004) Data-oriented parsing and the Penn Chinese Treebank. In: Proc. 1st Intl. Joint Conf. Natural Language Processing, May, Hainan Island, China: 406-413.

  57. Hearne M, Way A (2006) Disambiguation strategies for data-oriented translation. In: Proc. 11th Intl. Conf. European Association for Machine Translation, 19-20 June, Oslo, Norway.

  58. Hoogweg L (2003) Extending DOP with insertion. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.

  59. Huron D (1996) The melodic arch in western folksongs. Computing in Musicology, 10: 2-23.

  60. Johnson M (1998) PCFG models of linguistic tree representations. Computational Linguistics, 24(4): 613-632.

  61. Johnson M (2002) The DOP estimation method is biased and inconsistent. Computational Linguistics, 28(1): 71-76.

  62. Jurafsky D (2003) Probabilistic modeling in psycholinguistics. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL: 39-96.

  63. Klein D (2005) The unsupervised learning of natural language structure. PhD Thesis, Department of Computer Science, Stanford University, CA.

  64. Klein D, Manning C (2002) A general constituent-context model for improved grammar induction. In: Proc. 40th Association Computer Linguistics Conf. (ACL’2002), July, Philadelphia, PA, Association for Computer Linguistics, Stroudsburg, PA: 128-135.

  65. Klein D, Manning C (2004) Corpus-based induction of syntactic structure: models of dependency and constituency. In: Proc. 42nd Association Computer Linguistics Conf. (ACL’2004), 21-26 July, Barcelona, Spain, Association for Computer Linguistics, Stroudsburg, PA: 438.

  66. Kudo T, Suzuki J, Isozaki H (2005) Boosting-based parse reranking with subtree features. In: Proc. 43rd Association Computer Linguistics Conf. (ACL’2005), June, Ann Arbor, MI, Association for Computer Linguistics, Stroudsburg, PA: 189-196.

  67. Kuhn T (1970) The Structure of Scientific Revolutions (2nd ed.). University of Chicago Press, Chicago, IL.

  68. Lerdahl F, Jackendoff R (1983) A Generative Theory of Tonal Music. MIT Press, Cambridge, MA.

  69. Longuet-Higgins H (1976) Perception of melodies. Nature, 263, October 21: 646-653.

  70. Longuet-Higgins H, Lee C (1987) The rhythmic interpretation of monophonic music. In: Longuet-Higgins H (ed.) Mental Processes: Studies in Cognitive Science, MIT Press, Cambridge, MA.

  71. Makatchev M, Jordan P, VanLehn K (2004) Abductive theorem proving for analyzing student explanations to guide feedback in intelligent tutoring systems. J. Automated Reasoning (Special Issue: Automated Reasoning and Theorem Proving in Education), 32(3): 187-226.

  72. Manning C (2003) Probabilistic syntax. In: Bod R, Hay J, Jannedy S (eds.) Probabilistic Linguistics. MIT Press, Cambridge, MA: 289-342.

  73. Manning C, Schuetze H (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

  74. Marcus M, Santorini B, Marcinkiewicz M (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2): 313-330.

  75. McClosky D, Charniak E, Johnson M (2006) Effective self-training for parsing. In: Proc. North American Chapter of ACL Conf. Human Language Technology (NAACL-HLT 2006), June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 152-159.

  76. Mitchell T, Keller R, Kedar-Cabelli S (1986) Explanation-based learning: a unifying view. Machine Learning, 1: 47-80.

  77. Mooney J, Zelle J (1994) Integrating ILP and EBL. SIGART Bulletin, 5(1): 12-21.

  78. Muggleton S (1996) Stochastic logic programs. In: De Raedt L (ed.) Advances in Inductive Logic Programming (Proc. 5th Intl. Workshop Inductive Logic Programming), IOS Press, Amsterdam, The Netherlands: 254-264.

  79. Neumann G (2003) A data-oriented approach to HPSG. In: Bod R, Scha R, Sima’an K (eds.) Data-Oriented Parsing. University of Chicago Press, Chicago, IL.

  80. Pereira F, Schabes Y (1992) Inside-outside reestimation from partially bracketed corpora. In: Proc. 30th Association Computer Linguistics Conf. (ACL’92), Newark, DE, Association for Computer Linguistics, Stroudsburg, PA: 128-135.

  81. Scha R (1990) Taaltheorie en taaltechnologie; competence en performance. In: de Kort Q, Leerdam G (eds.) Computertoepassingen in de Neerlandistiek. Landelijke Vereniging van Neerlandici (LVVN-jaarboek), Almere, The Netherlands.

  82. Schaffrath H (1995) The Essen Folksong Collection in the Humdrum Kern Format. In: Huron D (ed.) Probabilistic Grammars for Music. Center for Computer Assisted Research in the Humanities, Menlo Park, CA.

  83. Sima’an K (1996) Computational complexity of probabilistic disambiguation by means of tree grammars. In: Proc. 14th Computational Linguistics Conf. (COLING’96), 5-9 August, Copenhagen, Denmark, Association for Computer Linguistics, Stroudsburg, PA: 1175-1180.

  84. Sima’an K (1999) Learning Efficient Disambiguation. ILLC Dissertation Series 1999-02, Utrecht University, The Netherlands.

  85. Sima’an K, Itai A, Winter Y, Altman A, Nativ N (2001) Building a tree-bank of modern Hebrew text. J. Traitement Automatique des Langues (Special Issue on Natural Language Processing and Corpus Linguistics), 42(2): 347-380.

  86. Temperley D (2001) The Cognition of Basic Musical Structures. MIT Press, Cambridge, MA.

  87. Tomasello M (2003) Constructing a Language. Harvard University Press, Harvard, MA.

  88. Van Lehn K (1998) Analogy events: how examples are used during problem solving. Cognitive Science, 22(3): 347-388.

  89. van Zaanen M (2000) ABL: alignment-based learning. In: Proc. 18th Computational Linguistics Conf. (COLING’2000), 31 July - 4 August, Saarbrücken, Germany, Association for Computer Linguistics, Stroudsburg, PA: 961-967.

  90. van Zaanen M (2002) Bootstrapping Structure into Language. PhD thesis. School of Computing, University of Leeds, UK.

  91. van Zaanen M, Bod R, Honing H (2003) A memory-based approach to meter induction. In: Proc. 5th European Society for the Cognitive Sciences of Music Conf. (ESCOM5), September, Hanover, Germany: 250-253.

  92. Veloso M, Carbonell J (1993) Derivational analogy in PRODIGY: automating case acquisition, storage, and utilization. Machine Learning, 10(3): 249-278.

  93. Wertheimer M (1923) Untersuchungen zur Lehre von der Gestalt. Psychologische Forschung, 4: 301-350.

  94. Younger D (1967) Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2): 189-208.

  95. Zollmann A, Sima’an K (2005) A consistent and efficient estimator for data-oriented parsing. J. Automata, Languages and Combinatorics, 10: 367-388.

  96. Zuidema W (2006) What are the productive units of natural language grammar? A DOP approach to the automatic identification of constructions. In: Proc. 10th Computational Natural Language Learning Conf. (CONLL’2006), 8-9 June, New York, NY, Association for Computer Linguistics, Stroudsburg, PA: 29-36.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Bod, R. (2008). The Data-Oriented Parsing Approach: Theory and Application. In: Fulcher, J., Jain, L.C. (eds) Computational Intelligence: A Compendium. Studies in Computational Intelligence, vol 115. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78293-3_7

  • DOI: https://doi.org/10.1007/978-3-540-78293-3_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78292-6

  • Online ISBN: 978-3-540-78293-3

  • eBook Packages: Engineering, Engineering (R0)
