
Combining corpus and machine-readable dictionary data for building bilingual lexicons

Machine Translation

Abstract

This paper describes and discusses some theoretical and practical problems that arise in developing a system to combine the structured but incomplete information in machine-readable dictionaries (MRDs) with the unstructured but more complete information available in corpora, for the creation of a bilingual lexical database. It presents a methodology for integrating information from both sources into a single lexical data structure. The BICORD system (BIlingual CORpus-enhanced Dictionaries) links entries in the Collins English-French and French-English bilingual dictionary with a large English-French and French-English bilingual corpus. We have concentrated on the class of action verbs of movement, building on earlier work on lexical correspondences between languages that are specific to this verb class (Klavans and Tzoukermann, 1989, 1990a, 1990b). We first examine the way prototypical verbs of movement are translated in the Collins-Robert bilingual dictionary (Atkins, Duval, and Milne, 1978), and then analyze the behavior of some of these verbs in a large bilingual corpus. We draw on linguistic research on the theory of verb types to motivate corpus analysis coupled with data from MRDs, with the aim of establishing lexical correspondences that carry the full range of associated translations, with statistical data attached to the relevant nodes.
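To make the idea of attaching statistical data to translation nodes concrete, the following is a minimal sketch and not the BICORD implementation. It assumes a sentence-aligned toy corpus, whitespace tokenization, and no lemmatization; the function name `attach_corpus_counts`, the entry layout, and the example sentences are all hypothetical.

```python
from collections import Counter


def attach_corpus_counts(headword, dict_translations, aligned_pairs):
    """Annotate each dictionary translation of `headword` with the number of
    aligned sentence pairs in which the headword and that translation co-occur.

    Assumptions (hypothetical, for illustration only): sentences are already
    aligned, tokens are separated by whitespace, and words appear in the same
    form as in the dictionary (no lemmatization is attempted).
    """
    counts = Counter()
    total = 0  # aligned pairs whose source side contains the headword
    for source, target in aligned_pairs:
        if headword not in source.lower().split():
            continue
        total += 1
        target_tokens = set(target.lower().split())
        for translation in dict_translations:
            if translation in target_tokens:
                counts[translation] += 1
    return {
        "headword": headword,
        "corpus_occurrences": total,
        "translations": [
            {
                "target": t,
                "count": counts[t],
                "relative_frequency": counts[t] / total if total else 0.0,
            }
            for t in dict_translations
        ],
    }


# Toy aligned English-French pairs (invented for this sketch).
pairs = [
    ("prices will rise", "les prix vont augmenter"),
    ("the smoke will rise", "la fumée va monter"),
    ("costs may rise", "les coûts peuvent augmenter"),
    ("he sat down", "il s'est assis"),
]

entry = attach_corpus_counts("rise", ["monter", "augmenter"], pairs)
for node in entry["translations"]:
    print(node["target"], node["count"], round(node["relative_frequency"], 2))
```

In this sketch the dictionary supplies the structure (the list of candidate translations) and the corpus supplies the counts. A fuller treatment would also need morphological analysis and a way to record corpus translations that the dictionary does not list.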


References

  • Atkins, Beryl T. 1987. Semantic ID tags: Corpus evidence for dictionary senses. In Proceedings of the Third Conference of the University of Waterloo. Centre for the New Oxford English Dictionary and Text Research: Electronic Text Research.

  • Atkins, Beryl T., Alain Duval, and Rosemary C. Milne. 1978. Collins Robert French Dictionary: French-English. English-French. Collins Publishers, London.

  • Atkins, Beryl T., Judith Kegl, and Beth Levin. 1988. Anatomy of a verb entry: from linguistics theory to lexicographic practice. International Journal of Lexicography, 1:84–126.

  • Boguraev, Branimir. 1991. Building a lexicon: The contribution of computers. International Journal of Lexicography, 4(3).

  • Boguraev, Branimir, Roy Byrd, Judith Klavans, and Mary Neff. 1989. From structural analysis of lexical resources to semantics in a lexical knowledge base. In First International Lexical Acquisition Workshop, Detroit, Michigan. International Joint Conference on Artificial Intelligence.

  • Brent, Michael R. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262.

  • Brill, Eric. 1992. A simple rule-based part of speech tagger. In Third Conference on Applied Computational Linguistics, Trento, Italy.

  • Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. 1988. A statistical approach to language translation. In Proceedings of the Twelfth International Conference on Computational Linguistics, Budapest, Hungary.

  • Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.

  • Brown, P., J. Lai, and R. Mercer. 1991. Aligning sentences in parallel corpora. In Proceedings of the Twenty-ninth Annual Meeting of the Association for Computational Linguistics, pages 169–176, Berkeley, California.

  • Brown, P., S. Della Pietra, V. Della Pietra, M. Goldsmith, J. Hajic, R. Mercer, and S. Mohanty. 1993. But dictionaries are data too. In Proceedings of DARPA, Princeton, New Jersey.

  • Byrd, Roy, Nicoletta Calzolari, Martin Chodorow, Judith Klavans, Mary Neff, and Omneya Rizk. 1987. Tools and methods for computational lexicology. Computational Linguistics, 13(3):219–240.

  • Calzolari, Nicoletta and Remo Bindi. 1990. Acquisition of lexical information from a large textual Italian corpus. In Proceedings of the Thirteenth International Conference on Computational Linguistics, Helsinki, Finland.

  • Carter, Richard. 1988. On movement (written in 1984). In Beth Levin and Carol Tenny, editors, On Linking: Papers by Richard Carter, volume 25 of Lexicon Project Working Papers. MIT Press, pages 231–252.

  • Catizone, Robert, Graham Russell, and Susan Warwick. 1989. Deriving Translation Data from Bilingual Text. ISSCO, Geneva, Switzerland, unpublished manuscript.

  • Chodorow, Martin S., Roy J. Byrd, and George E. Heidorn. 1985. Extracting semantic hierarchies from a large on-line dictionary. In Proceedings of the Twenty-third Annual Meeting of the Association for Computational Linguistics, pages 299–304.

  • Church, Kenneth W. 1989. A stochastic parts program noun phrase parser for unrestricted text. In IEEE Proceedings of the ICASSP, pages 695–698, Glasgow.

  • Church, Kenneth W. 1993. Char_align: A program for aligning parallel texts at the character level. In Proceedings of the Thirty-first Annual Meeting of the Association for Computational Linguistics, pages 1–8.

  • Church, Kenneth W. and Patrick Hanks. 1990. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1):22–29.

  • Church, Kenneth W., Patrick Hanks, D. Hindle, and W. Gale. 1991. Using statistics in lexical analysis. In Uri Zernik, editor, Lexical Acquisition: Using on-line Resources to Build a Lexicon. Lawrence Erlbaum.

  • Cruse, D. Alan. 1986. Lexical Semantics. Cambridge University Press, Cambridge, England.

  • DeRose, Stephen. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1):31–39.

  • Dorr, Bonnie J. 1992. The use of lexical semantics in interlingual machine translation. Machine Translation, 7(3):135–193.

  • Dowty, David. 1979. Word Meaning and Montague Grammar. Reidel, Dordrecht.

  • Gove, Philip B., editor. 1963. Webster's Seventh New Collegiate Dictionary. G. & C. Merriam Company, Springfield, Mass.

  • Grishman, Ralph and Richard Kittredge, editors. 1986. Analyzing language in restricted domains: Sublanguage description and processing. Lawrence Erlbaum.

  • Gruber, J. S. 1965. Studies in Lexical Relations. Ph.D. thesis, The Massachusetts Institute of Technology, Department of Linguistics, Cambridge, Massachusetts. Published in 1976 as Lexical Structures in Syntax and Semantics, North-Holland, Amsterdam.

  • Hale, Kenneth and Jay Keyser. 1986. Some Transitivity Alternations in English. Center for Cognitive Science, The Massachusetts Institute of Technology.

  • Jackendoff, Ray S. 1987. The status of thematic relations in linguistic theory. Linguistic Inquiry, 18(3):369–411.

  • Jackendoff, Ray S. 1990. Semantic Structures. MIT Press, Cambridge, MA.

  • Kay, Martin and Martin Röscheisen. 1993. Text translation alignment. Computational Linguistics, 19(1):75–102.

  • Klavans, Judith L. 1988. Complex: A computational lexicon for natural language systems. In Proceedings of the Twelfth International Conference on Computational Linguistics, Budapest, Hungary.

  • Klavans, Judith L., Martin Chodorow, and Nina Wacholder. 1990. From dictionary to knowledge base via taxonomy. In Proceedings of the Sixth Conference of the University of Waterloo. Centre for the New Oxford English Dictionary and Text Research: Electronic Text Research.

  • Klavans, Judith L. and Philip Resnik, editors. 1996. The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, Mass.

  • Klavans, Judith L. and Evelyne Tzoukermann. 1989. Corpus-based lexical acquisition for translation systems. In Proceedings of the Sixth Israeli Conference of Artificial Intelligence and Computer Vision, Tel Aviv, Israel.

  • Klavans, Judith L. and Evelyne Tzoukermann. 1990a. Linking bilingual corpora and machine readable dictionaries with the BICORD system. In Proceedings of the Sixth Conference of the University of Waterloo. Centre for the New Oxford English Dictionary and Text Research: Electronic Text Research.

  • Klavans, Judith L. and Evelyne Tzoukermann. 1990b. The BICORD system: Combining lexical information from bilingual corpora and machine readable dictionaries. In Proceedings of the Thirteenth International Conference on Computational Linguistics, Helsinki, Finland.

  • Kupiec, Julian. 1989. Augmenting a hidden Markov model for phrase-dependent word tagging. In Proceedings of the 1989 DARPA Speech and Natural Language Workshop, pages 92–98, San Mateo, California. Morgan Kaufmann.

  • Leech, Geoffrey, Roger Garside, and Erik Atwell. 1983. Automatic grammatical tagging of the LOB corpus. ICAME News, 7:13–33.

  • Levin, Beth and Malka Rappaport. 1988. On the nature of unaccusativity. In Proceedings of New England Linguistic Society.

  • Merialdo, Bernard. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–172.

  • Neff, Mary and Bran Boguraev. 1989. Dictionaries, dictionary grammars and dictionary entry parsing. In Proceedings of the Twenty-seventh Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada.

  • Neff, Mary S., Roy J. Byrd, and Omneya A. Rizk. 1988. Creating and querying hierarchical lexical data bases. In Proceedings of the Second Applied Association for Computational Linguistics Conference, pages 84–92, Austin, Texas.

  • Pustejovsky, James, Sabine Bergler, and Peter Anick. 1993. Lexical semantic techniques for corpus analysis. Computational Linguistics, 19(2):331–358.

  • Rizk, Omneya. 1989. Sense Disambiguation of Word Translation in Bilingual Dictionaries: Trying to Solve the Mapping Problem Automatically. Unpublished M.A. thesis. Courant Institute of Mathematical Sciences, New York University, New York.

  • Sadler, Victor. 1989. The Bilingual Knowledge Bank: A New conceptual basis for MT. BSO/Research, unpublished manuscript, Utrecht.

  • Smadja, Frank, Kathleen McKeown, and Vasileios Hatzivassiloglou. In press. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics.

  • Talmy, Leonard. 1975. Semantics and syntax of motion. In J.P. Kimball, editor, Syntax and Semantics, volume 4. Academic Press, New York, NY, pages 181–238.

  • Talmy, Leonard. 1985. Lexicalization patterns: Semantic structure in lexical forms. In T. Shopen, editor, Language Typology and Syntactic Description: Grammatical categories and the Lexicon. Cambridge University Press, Cambridge UK.

  • Tenny, Carol L. 1992. How Motion Verbs are Special. University of Pittsburgh, Department of Linguistics, unpublished manuscript.

  • Tenny, Carol L. 1994. Aspectual Roles and the Syntax-Semantics Interface. Kluwer Academic Publishers, Dordrecht.

  • Tzoukermann, Evelyne and Bernard Merialdo. 1989. Some Statistical Approaches for Tagging Unrestricted Text. IBM, T. J. Watson Research Center, Yorktown Heights, New York, unpublished manuscript.


Cite this article

Klavans, J., Tzoukermann, E. Combining corpus and machine-readable dictionary data for building bilingual lexicons. Mach Translat 10, 185–218 (1995). https://doi.org/10.1007/BF00981486


  • DOI: https://doi.org/10.1007/BF00981486
