research-article

Efficient multi-word expressions extractor using suffix arrays and related structures

Authors:
José Aires, Universidade Nova de Lisboa, Lisboa, Portugal
Gabriel Lopes, Universidade Nova de Lisboa, Lisboa, Portugal
Joaquim Ferreira Silva, Universidade Nova de Lisboa, Lisboa, Portugal
iNEWS '08: Proceedings of the 2nd ACM workshop on Improving non english web searching, October 2008, Pages 1–8
https://doi.org/10.1145/1460027.1460029

Published: 30 October 2008

ABSTRACT

Information Retrieval requires regularly processing predictably dynamic and potentially huge corpora to extract contiguous Multi-Word Expressions (MWEs) in a computationally tractable way. In this paper we mainly explore the use of Suffix Arrays, together with the SCP association measure and the Smoothed LocalMaxs algorithm. The choice of Suffix Arrays and the construction of auxiliary structures clearly reduced the time needed to extract multi-word expressions, and limiting the number of words per expression yields linear complexity. Although the methodology is essentially statistical, we show how to handle hybrid extraction mechanisms.
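
To make the counting step concrete, here is a minimal sketch of how a word-level suffix array can supply n-gram frequencies to an SCP-style cohesion score. It is not the authors' implementation. The function names (build_suffix_array, ngram_count, scp_f) are hypothetical, the suffix array is built by naive sorting rather than an efficient construction, and the SCP formula is the n-gram generalisation usually given in the LocalMaxs literature, which this page does not state. The Smoothed LocalMaxs selection step and the auxiliary structures mentioned in the abstract are left out.

```python
def build_suffix_array(tokens):
    """Naive word-level suffix array: start positions ordered by the suffix
    that begins there. Fine for a sketch, too slow for a huge corpus."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    """Binary search over the suffix array, comparing only the first
    len(ngram) words of each suffix. Returns the lower bound of the matching
    block, or the upper bound when upper=True."""
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def ngram_count(tokens, sa, ngram):
    """Occurrences of ngram: all suffixes starting with it are adjacent in
    the suffix array, so the count is the width of that block."""
    ngram = list(ngram)
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)


def scp_f(tokens, sa, ngram):
    """Assumed definition: p(w1..wn)^2 divided by the average, over every
    split point, of p(left part) * p(right part)."""
    ngram = list(ngram)
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP needs an n-gram of at least two words")
    total = len(tokens)

    def p(gram):
        # Relative frequency, normalised by corpus size (a simplification).
        return ngram_count(tokens, sa, gram) / total

    avp = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / avp if avp > 0 else 0.0


if __name__ == "__main__":
    corpus = "new york is big and new york is old".split()
    sa = build_suffix_array(corpus)
    print(ngram_count(corpus, sa, ["new", "york"]))
    print(scp_f(corpus, sa, ["new", "york"]))
    print(scp_f(corpus, sa, ["is", "big"]))
```

Each count costs one pair of binary searches once the array exists. Capping the n-gram length, as the abstract describes, bounds how many candidate n-grams start at each corpus position.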

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Published in
iNEWS '08: Proceedings of the 2nd ACM workshop on Improving non english web searching
October 2008
112 pages
ISBN:9781605584164
DOI:10.1145/1460027
Program Chairs:
Fotis Lazarinis, Technological Educational Institute of Mesolonghi, Greece
Efthimis N. Efthimiadis, University of Washington, USA
Jesus Vilares, University of A Coruna, Spain
John I. Tait, Information Retrieval Facility, Austria
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
extraction
language independent
large corpus
multi-word expressions
suffix arrays