DOI: 10.1145/1460027.1460029
research-article

Efficient multi-word expressions extractor using suffix arrays and related structures

Published: 30 October 2008

ABSTRACT

For Information Retrieval purposes, predictably dynamic and potentially huge corpora must be processed regularly to extract contiguous Multi-Word Expressions (MWEs), and this must be done in a computationally tractable way. In this paper we mainly explore the use of Suffix Arrays, together with the SCP association measure and the Smoothed LocalMaxs algorithm. The choice of Suffix Arrays, combined with the construction of auxiliary structures, clearly reduced the time needed to extract multi-word expressions; limiting the number of words per expression yields linear complexity. Although the methodology is essentially statistical, we show how to handle hybrid extraction mechanisms.
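As a rough illustration of the ingredients named in the abstract, the sketch below builds a word-level suffix array, uses it to count n-gram occurrences by binary search, and scores a candidate expression with SCP. It is not the authors' implementation: the abstract gives no formulas, so the SCP definition used here (squared n-gram probability divided by the average product of the probabilities of its left and right parts) is the one usually given in the LocalMaxs literature and should be read as an assumption. The function names, the naive sort-based construction, and the use of the corpus length as a common normaliser are also illustrative choices. The Smoothed LocalMaxs selection step and the auxiliary structures are not shown.

```python
def build_suffix_array(tokens):
    """Naive word-level suffix array: start positions sorted by their suffix.

    Illustrative only; this sort is far from the efficient constructions
    a large corpus would need.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    """Binary search for the first suffix whose n-word prefix is >= ngram
    (or > ngram when upper is True)."""
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def ngram_count(tokens, sa, ngram):
    """Number of occurrences of ngram, read off as a suffix-array interval."""
    ngram = list(ngram)
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)


def scp(tokens, sa, ngram):
    """SCP score of an n-gram (n >= 2), under the assumed definition above."""
    ngram = list(ngram)
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP needs at least two words")
    total = len(tokens)  # assumption: one normaliser for every n-gram size

    def prob(gram):
        return ngram_count(tokens, sa, gram) / total

    whole = prob(ngram)
    # Average over every split point of the n-gram into a left and right part.
    avg = sum(prob(ngram[:i]) * prob(ngram[i:]) for i in range(1, n)) / (n - 1)
    return whole * whole / avg if avg else 0.0


tokens = "new york is big and new york is far".split()
sa = build_suffix_array(tokens)
print(ngram_count(tokens, sa, ["new", "york"]))
print(scp(tokens, sa, ["new", "york"]), scp(tokens, sa, ["is", "big"]))
```

In this toy corpus "new york" scores higher than "is big" because "new" and "york" never occur apart, which is the kind of cohesion signal a LocalMaxs-style selection would then compare against neighbouring n-grams.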

    • Published in

      iNEWS '08: Proceedings of the 2nd ACM workshop on Improving non english web searching
      October 2008
      112 pages
      ISBN: 9781605584164
      DOI: 10.1145/1460027

      Copyright © 2008 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States
