A Hybrid Technique for English-Chinese Cross Language Information Retrieval

Published: 01 April 2008

Abstract

In this article we describe a hybrid technique for dictionary-based query translation suitable for English-Chinese cross language information retrieval. This technique marries a graph-based model for the resolution of candidate term ambiguity with a pattern-based method for the translation of out-of-vocabulary (OOV) terms. We evaluate the performance of this hybrid technique in an experiment using several NTCIR test collections. Experimental results indicate a substantial increase in retrieval effectiveness over various baseline systems incorporating machine- and dictionary-based translation.

        • Published in

          ACM Transactions on Asian Language Information Processing, Volume 7, Issue 2
          June 2008
          86 pages
          ISSN: 1530-0226
          EISSN: 1558-3430
          DOI: 10.1145/1362782

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 April 2008
          • Accepted: 1 March 2008
          • Revised: 1 February 2008
          • Received: 1 December 2007

          Qualifiers

          • research-article
          • Research
          • Refereed
