ABSTRACT
For Information Retrieval purposes, potentially huge and predictably dynamic corpora must be processed regularly to extract contiguous Multi-Word Expressions (MWEs), and this must be done in a computationally tractable way. In this paper we mainly explore the use of Suffix Arrays, together with the SCP association measure and the Smoothed LocalMaxs algorithm. The choice of Suffix Arrays and the construction of auxiliary structures clearly reduced the time needed to extract multi-word expressions, with linear complexity obtained by limiting the number of words. Although the methodology is essentially statistical, we show how to handle hybrid extraction mechanisms.
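As a rough illustration of the ingredients named above, the following sketch builds a word-level suffix array over a toy corpus, reads n-gram frequencies off adjacent entries, and scores a candidate with an SCP-style measure. It is not the paper's implementation. The function names, the naive sort-based construction, the reading of the word limit as a maximum n-gram length (`max_words`), and the averaged form of SCP over all split points are all assumptions made for illustration; the Smoothed LocalMaxs selection step is not shown.

```python
from itertools import groupby


def ngram_counts(tokens, max_words):
    """Count every contiguous n-gram (1 <= n <= max_words) via a word-level suffix array.

    Assumption: suffixes are compared on their first max_words tokens only,
    and the array is built with a plain sort rather than a specialised
    suffix-sorting algorithm.
    """
    # Suffix array: start positions ordered by the (truncated) suffix they begin.
    sa = sorted(range(len(tokens)), key=lambda i: tokens[i:i + max_words])
    counts = {}
    for n in range(1, max_words + 1):
        # Suffixes sharing the same first n words are adjacent in the array,
        # so each frequency is the length of a run of equal prefixes.
        prefixes = (tuple(tokens[i:i + n]) for i in sa if i + n <= len(tokens))
        for gram, run in groupby(prefixes):
            counts[gram] = sum(1 for _ in run)
    return counts


def scp(gram, counts, total):
    """SCP-style cohesion score for an n-gram with n >= 2 (assumed form).

    Squared probability of the n-gram divided by the average, over all
    split points, of the product of the probabilities of its two parts.
    Probabilities are estimated as count / total tokens (a simplification).
    """
    def p(g):
        return counts[g] / total

    n = len(gram)
    avg = sum(p(gram[:i]) * p(gram[i:]) for i in range(1, n)) / (n - 1)
    return p(gram) ** 2 / avg


tokens = "new york is big and new york is far".split()
counts = ngram_counts(tokens, max_words=3)
print(counts[("new", "york")])                      # frequency of the bigram
print(scp(("new", "york"), counts, len(tokens)))    # cohesion of "new york"
print(scp(("is", "big"), counts, len(tokens)))      # cohesion of "is big"
```

In this toy corpus "new" and "york" always occur together, so the bigram scores higher than "is big", whose first word also appears in another context. Truncating the comparison to `max_words` tokens is what keeps the per-suffix work bounded in this sketch.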
Index Terms
- Efficient multi-word expressions extractor using suffix arrays and related structures