ABSTRACT
Cross-lingual name matching is an important problem in machine translation and data mining. Though well studied, it lacks a generic solution, largely because of issues such as language-specific nuances and resource scarcity. Most proposed unsupervised approaches focus on a small subset of languages, mostly English and its derivatives, and employ handcrafted rules that do not port well to other languages. In this paper, we propose a generic multilingual solution that instead adds simple probabilistic extensions to existing string similarity methods. Our solution depends only on freely available open-source resources, and we demonstrate the superiority of our approach on 60 language pairs drawn from across language families.