DOI: 10.1145/2838706.2838716
short-paper

MESS: A Multilingual Error based String Similarity measure for transliterated name variants

Published: 04 December 2015

ABSTRACT

Cross-lingual name matching is an important problem in machine translation and data mining. Though well studied, it lacks a generic solution, largely because of language-specific nuances, resource scarcity, and similar issues. Most proposed unsupervised approaches focus on a small subset of languages, mostly English and its derivatives, and employ handcrafted rules that do not port well to other languages. In this paper, we propose a generic multilingual solution that instead adds simple probabilistic extensions to existing string similarity methods. Our solution depends only on freely available open-source resources, and we demonstrate the superiority of our approach on 60 language pairs drawn from across language families.
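
The abstract does not spell out the probabilistic extensions, so the following is only an illustrative sketch of the general idea and not the MESS measure itself. It assumes a standard Levenshtein edit distance in which a substitution is discounted by a character confusion probability. The function names, the `sub_prob` table, and the example probability are all hypothetical and do not come from the paper.

```python
def weighted_edit_distance(source, target, sub_prob=None):
    """Levenshtein distance with probability-discounted substitutions.

    sub_prob is a hypothetical table mapping a character pair (a, b) to the
    probability that a and b are interchangeable spellings of one sound.
    A substitution costs 1 - p; insertions and deletions cost 1.
    """
    sub_prob = sub_prob or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete every character of the source prefix
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert every character of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            if a == b:
                sub_cost = 0.0
            else:
                # Treat the table as symmetric: look the pair up both ways.
                p = max(sub_prob.get((a, b), 0.0), sub_prob.get((b, a), 0.0))
                sub_cost = 1.0 - p
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,           # deletion
                dist[i][j - 1] + 1.0,           # insertion
                dist[i - 1][j - 1] + sub_cost,  # match or substitution
            )
    return dist[-1][-1]


def similarity(source, target, sub_prob=None):
    """Map the distance to a score in [0, 1] by dividing by the longer length."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(source, target, sub_prob) / longest


if __name__ == "__main__":
    # Assumed, made-up confusion probability for illustration only.
    probs = {("k", "c"): 0.8}
    print(similarity("katrina", "catrina"))         # plain edit distance
    print(similarity("katrina", "catrina", probs))  # discounted substitution
```

With an empty table the function reduces to ordinary Levenshtein distance, so the probabilities act purely as an extension of an existing measure. How such probabilities would be estimated is outside what the abstract describes.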

  • Published in

    ACM Other conferences
    FIRE '15: Proceedings of the 7th Annual Meeting of the Forum for Information Retrieval Evaluation
    December 2015
    57 pages
    ISBN:9781450340045
    DOI:10.1145/2838706

    Copyright © 2015 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 4 December 2015

    Qualifiers

    • short-paper
    • Research
    • Refereed limited

    Acceptance Rates

    FIRE '15 paper acceptance rate: 12 of 42 submissions, 29%
    Overall acceptance rate: 19 of 64 submissions, 30%