research-article

A Hybrid Technique for English-Chinese Cross Language Information Retrieval

Authors:
Dong Zhou, University of Nottingham
Mark Truran, University of Teesside
Tim Brailsford, University of Nottingham
Helen Ashman, University of South Australia

ACM Transactions on Asian Language Information Processing, Volume 7, Issue 2, Article No. 5, pp. 1–35
https://doi.org/10.1145/1362782.1362784

Published: 01 April 2008

Abstract

In this article we describe a hybrid technique for dictionary-based query translation suitable for English-Chinese cross language information retrieval. This technique marries a graph-based model for the resolution of candidate term ambiguity with a pattern-based method for the translation of out-of-vocabulary (OOV) terms. We evaluate the performance of this hybrid technique in an experiment using several NTCIR test collections. Experimental results indicate a substantial increase in retrieval effectiveness over various baseline systems incorporating machine- and dictionary-based translation.
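
To make the idea of graph-based disambiguation concrete, here is a minimal sketch in Python. It is not the authors' model, which the abstract does not specify. It assumes a PageRank-style random walk over candidate translations, with edge weights taken from a hypothetical co-occurrence table. The function name `disambiguate`, its parameters, and the toy weights are all illustrative assumptions.

```python
from itertools import combinations


def disambiguate(candidates, cooccurrence, damping=0.85, iterations=50):
    """Pick one translation per source term with a PageRank-style walk.

    candidates:   dict mapping source term -> list of candidate translations
    cooccurrence: dict mapping frozenset({a, b}) -> association weight (assumed
                  to come from some target-language corpus; hypothetical here)
    """
    # Each node is a (source term, candidate translation) pair.
    nodes = [(s, t) for s, ts in candidates.items() for t in ts]

    # Link candidates that belong to different source terms and co-occur.
    weights = {n: {} for n in nodes}
    for a, b in combinations(nodes, 2):
        if a[0] == b[0]:
            continue  # rival translations of the same term are not linked
        w = cooccurrence.get(frozenset((a[1], b[1])), 0.0)
        if w > 0:
            weights[a][b] = w
            weights[b][a] = w

    # Weighted PageRank iteration starting from a uniform distribution.
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            incoming = sum(
                score[m] * weights[m][n] / sum(weights[m].values())
                for m in weights[n]  # edges are symmetric, so these are n's neighbours
            )
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        score = updated

    # Keep the highest-scoring candidate for each source term.
    return {s: max(ts, key=lambda t: score[(s, t)]) for s, ts in candidates.items()}


# Toy query "bank interest" with made-up co-occurrence weights.
toy_candidates = {"bank": ["银行", "河岸"], "interest": ["利息", "兴趣"]}
toy_cooccurrence = {
    frozenset(("银行", "利息")): 10.0,
    frozenset(("银行", "兴趣")): 2.0,
    frozenset(("河岸", "兴趣")): 1.0,
}
print(disambiguate(toy_candidates, toy_cooccurrence))
```

In this sketch a candidate gains score when it is strongly associated with well-supported candidates of the other query terms, so mutually coherent translations reinforce one another. Out-of-vocabulary terms would need a separate mechanism, such as the pattern-based method the abstract mentions, and are not covered here.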


Published in

ACM Transactions on Asian Language Information Processing Volume 7, Issue 2
June 2008
86 pages
ISSN:1530-0226
EISSN:1558-3430
DOI:10.1145/1362782

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 April 2008
- Accepted: 1 March 2008
- Revised: 1 February 2008
- Received: 1 December 2007
Author Tags
cross language information retrieval
disambiguation
graph-based analysis
patterns
unknown term translation
Qualifiers
- research-article
- Research
- Refereed
