Adaptation of machine translation for multilingual information retrieval in the medical domain
Introduction
The development of health information search and retrieval techniques is an important research topic. Indeed, it has been found that almost 70% of search engine users in the US have conducted a web search for information about a specific disease or health problem [1]. Given that much medical content is written in English, research to date in the medical space has predominantly focused on monolingual English retrieval. However, given the large number of non-English-speaking users of the Internet and the lack of content in their native languages, support for them to search and utilize these English sources is required if the value of the information available on the Internet is to be fully realized [2]. In a recent study, Lopes and Ribeiro [3] assessed the effect of translating health queries for users with different levels of English language proficiency. Their results confirmed that users with even basic competence in English can benefit from a system which automatically retrieves English content based on a non-English query, or at least suggests English translations of the non-English queries.
Support for search of English language content by non-native English speakers is one of the major goals of the large integrated EU-funded Khresmoi project. Among other goals, including joint text and image retrieval of radiodiagnostic records, the Khresmoi project aims to develop technology for transparent cross-lingual search of medical sources, for both professionals and laypeople, with the emphasis primarily on publicly available web sources. While a sophisticated search interface is being developed for the needs of medical professionals, the final application for the general public should be as simple as possible to operate and similar to the well-known interfaces of web search engines in use today with the addition of cross-lingual functionality.
The languages supported by the Khresmoi project are English (EN), Czech (CS), German (DE), and French (FR). Queries come from Czech, German, and French and are machine-translated to English. This reflects the real availability of data, which is predominantly available in English, and query translation needs of non-native speakers of English. Our focus in this paper is on the machine translation (MT) part of the cross-lingual search and retrieval task, while using a standard information retrieval (IR) technique for the search and retrieval part, in order to pinpoint contributions and problems with using MT for query translation from the three languages selected (Czech, German, and French) into English and its influence on the resulting quality of retrieved sets of documents.
Our MT system is based on Moses [4], a state-of-the-art statistical MT system. The IR experiments are performed using the Lucene search engine on the CLEF eHealth 2013 dataset for the languages specified above, directed towards retrieving English documents only. Since MT is only an intermediate component of the whole system pipeline, we proceed in two steps. We first independently tune MT to produce the best possible translations of queries (Section 2) and then use various techniques to modify and expand the translated queries for improved IR performance (Section 3). The methods applied in Section 2 include: in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, exploiting synonyms to construct translation variants, and decompounding (splitting) of complex German words on the source language side, which normally appear as unknown words. For evaluation of translation quality itself, we use BLEU, the de facto standard automatic evaluation metric [5], which compares MT output against a manual reference translation and accounts for both adequacy and fluency (word order) of the machine translation. We also report inverse position-independent word error rate [6], called PER, another automatic evaluation metric which compares words in the MT output and the reference translation without taking word order into account and thus might be better suited to the application of MT in IR, where word order is often ignored. In selected experiments, the automatic evaluation is supplemented by manual assessment of the results performed by medical professionals.
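To make the position-independent comparison concrete, the following is a minimal sketch of an inverse PER score for a single query. It is not the evaluation script used in the experiments: the function name `inverse_per`, the lowercasing, the whitespace tokenization, and the particular bag-of-words error formula (unmatched words divided by reference length) are all assumptions made for illustration.

```python
from collections import Counter


def inverse_per(hypothesis: str, reference: str) -> float:
    """Return 1 - PER for one hypothesis/reference pair (illustrative only).

    Assumed formulation: errors = max(|hyp|, |ref|) - matched words,
    normalized by the reference length. Word order is ignored.
    """
    # Assumption: lowercase and split on whitespace; real tokenization may differ.
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # Multiset intersection counts words shared by both sides,
    # regardless of where they occur in the sentence.
    matches = sum((Counter(hyp) & Counter(ref)).values())

    # Missing reference words and surplus hypothesis words both count as errors.
    errors = max(len(hyp), len(ref)) - matches
    return 1.0 - errors / len(ref)


if __name__ == "__main__":
    # Same words in a different order: no position-independent errors.
    print(inverse_per("pain chest acute", "acute chest pain"))
    # One reference word missing from the hypothesis.
    print(inverse_per("chest pain", "acute chest pain"))
```

Under this formulation a reordered but otherwise correct translation receives the maximum score, which is the property that makes such a metric attractive when the translated query is fed to a bag-of-words retrieval model.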
The results of our MT experiments on queries show that we are able to outperform Google Translate, the best freely available MT service on the web. We also find that using synonyms to enrich training data with translation variants does not improve MT performance; however, decompounding of complex German words slightly improves the translation, at least according to BLEU. In Section 3, we evaluate query translation in a cross-lingual IR setting using standard methods on the CLEF eHealth 2013 Task 3 test collection. Here, despite achieving superior performance on the query MT task, as described in Section 2, we do not outperform the retrieval results obtained by using queries translated by Google Translate. In the last section, we perform a summary analysis of the overall results, the results of the individual techniques for improving MT performance and their integration into an IR system, and give suggestions for further work.
Machine translation for medical queries
In this section, we describe the application of phrase-based statistical machine translation (SMT) to the translation of medical queries with the goal of producing accurate and fluent translations. This task differs from typical MT applications in two aspects: the domain and the genre of the input text. The domain, which reflects what the text is about, is very specific, characterized by a large and specialized vocabulary which does not occur in general texts. The genre, which indicates the
Optimizing query translation for cross-lingual information retrieval
In a standard MT scenario, the MT system is optimized to produce output intended to be read by a human. However, in a cross-lingual IR (CLIR) system, the consumer of the MT output is a computer system performing IR. Such systems usually do not require the input to be linguistically fluent or grammatically correct. The ordering of words can be loose, and the accuracy of function words and other words deemed to be IR-irrelevant (traditionally called stopwords) does not matter. On the other
Conclusions
In this work, we explored cross-lingual IR in the domain of medicine and focused on machine translation as a key component introducing the possibility to search in a multilingual environment. We translate queries in Czech, German, and French to English and perform search on a collection of English documents from CLEF eHealth 2013 Task 3. Such a task is especially challenging when applied to a specific domain, such as medicine, because traditional MT systems are not generally tuned to translate
Acknowledgments
This work was supported by the EU FP7 project Khresmoi (contract no. 257528), the Czech Science Foundation (grant no. P103/12/G084), the Science Foundation Ireland (grant no. 07/CE/I1142) as part of the Centre for Next Generation Localisation at Dublin City University, and by the ESF project ELIAS.
The work described herein uses language resources hosted by the LINDAT/CLARIN repository, funded by the project LM2010013 of the MEYS of the Czech Republic.
References (135)
- DBpedia – a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web (2009)
- Health Topics: 80% of internet users look for health information online. Technical Report (2011)
- Consumer health information seeking on the internet: the state of the art. Health Education Research (2001)
- Measuring the value of health query translation: an analysis by user language proficiency. Journal of the American Society for Information Science and Technology (2013)
- Moses: open source toolkit for statistical machine translation
- BLEU: a method for automatic evaluation of machine translation
- Accelerated DP based search for statistical translation
- Statistical methods for speech recognition (1997)
- A systematic comparison of various statistical alignment models. Computational Linguistics (2003)
- Minimum error rate training in statistical machine translation
- Improved minimum error rate training in Moses. Prague Bulletin of Mathematical Linguistics
- Europarl: a parallel corpus for statistical machine translation
- Hansard corpus of parallel English and French
- The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages
- Findings of the 2012 workshop on statistical machine translation
- Domain adaptation of statistical machine translation using web-crawled resources: A case study
- Improving a general-purpose statistical translation engine by terminological lexicons
- Log-linear weight optimisation via bayesian adaptation in statistical machine translation
- Fill-up versus interpolation methods for phrase-based SMT adaptation
- Improving English–Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing tokenization and recasing
- Experiments in domain adaptation for statistical machine translation
- Improving domain-specific word alignment with a general bilingual corpus
- Domain adaptation in machine translation: Final report
- Language model adaptation for statistical machine translation based on information retrieval
- Intelligent selection of language model training data
- Adaptation of the translation model for statistical machine translation based on information retrieval
- Domain adaptation via pseudo in-domain data selection
- Combining translation and language model scoring for domain-specific data filtering
- Automatic recognition of spontaneous speech for access to multilingual oral history archives. IEEE Transactions on Speech and Audio Processing
- Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics
- Domain adaptation for machine translation by mining unseen words
- Domain adaptation for statistical machine translation with monolingual resources
- Towards using web-crawled data for domain adaptation in statistical machine translation
- Experiments on domain adaptation for patent machine translation in the PLuTO project
- Findings of the 2011 Workshop on Statistical Machine Translation
- Domain adaptation in SMT of user-generated forum content guided by OOV word reduction: Normalization and/or supplementary data?
- Cutting the long tail: hybrid language models for translation style adaptation
- From subtitles to parallel corpora
- Adaptation of statistical machine translation model for cross-lingual information retrieval in a service context
- Improving statistical machine translation in the medical domain using the Unified Medical Language System
- UMLS reference manual
- Statistical machine translation for biomedical text: are we there yet? AMIA Annual Symposium Proceedings
- Machine translation in medicine. A quality analysis of statistical machine translation in the medical domain
- Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text. BMC Bioinformatics
- Cross-language retrieval experiments at CLEF 2002
- Empirical methods for compound splitting
- Statistical machine translation of German compound words
- Improving SMT quality with morpho-syntactic analysis
- Decompounding query keywords from compounding languages
- Optimizing synonym extraction using monolingual and bilingual resources