INTRODUCTION

For 25 million individuals with limited English proficiency (LEP) in the USA, language barriers limit equitable access to healthcare, resulting in worse clinical outcomes and decreased therapeutic engagement.1,2,3,4,5 Clinical communication extends beyond the transfer of information or instruction; it builds rapport and interpersonal relationships between patients and clinicians. While certified medical interpretation remains an indispensable tool for communicating with language-discordant patients, these resources are often impractical or infeasible in certain clinical settings and are therefore underutilized.6 In the perioperative setting, the busy workflow, the sterile environment, varying levels of patient consciousness, and sporadic, brief conversational exchanges make it challenging to use formal medical interpretation services. Consequently, clinicians may forgo using medical interpreters and instead rely on nonverbal communication, which poses a significant challenge to safe and high-quality care.7

Machine translation (MT) has the potential to fill the gaps in communicating in language-discordant clinical situations. MT refers to automated software with the capacity for two-way translation (text) and interpretation (speech) between languages. MT products are widely available on mobile devices at minimal infrastructure cost, making them a tempting, pragmatic resource for clinicians. However, the evaluation of MT for healthcare remains limited, and MT use in clinical settings has raised safety concerns.8,9 MT has been evaluated for translating patient portal messages, discharge instructions, and public health information with mixed results depending on the language translated,10,11,12 but only a few studies have evaluated the use of MT for interpretation.13 Previous studies have shown that MT interpretation is accurate in limited settings.9,14

Machine interpretation is inherently more complex than machine translation: accurate speech recognition, transcription (converting speech into written form), and speech synthesis (generating spoken output) are all required for MT to function as a two-way interpreter. To determine whether MT interpretation is useful for brief, low-stakes two-way communication encounters, we designed a non-inferiority study to compare the accuracy and safety of three commercially available MT applications against professional interpreters between English and Spanish, as well as between English and Mandarin Chinese, the two most common non-English languages in the United States.15

METHODS

Study Design

We designed a non-inferiority study to evaluate the quality of MT interpretation for two-way communication between patients and clinicians. Professional medical interpreter services served as the gold standard. Three MT applications, Google Translate (GT), Apple iTranslate (AT), and Microsoft Translator (MS), were selected based on their availability without cost to users across multiple devices and operating systems. All three applications utilize machine learning algorithms based on artificial neural networks that can improve as more data are aggregated.16,17,18

Recognizing that the perioperative setting is one where professional interpretation is often not used, we formulated study phrases that simulate conversation between English-speaking clinicians and patients with LEP, using input from anesthesiologists and perioperative nurses. Each study phrase consisted of one to three sentences in standard language, devoid of slang or excessive colloquialism, such as “Can you please point to where it hurts the most?” Additional examples of the study phrases are available in Appendix A. Based on the conventional, predetermined 15% non-inferiority margin, we developed 105 provider-to-patient and 105 patient-to-provider phrases.

To assess MT interpretation (speech to speech), study phrases were first audio recorded; provider-to-patient phrases were recorded in English, and patient-to-provider phrases were recorded in Spanish and Mandarin by native bilingual speakers. These recordings were played into each MT application, and the resulting interpretations were captured as audio files. Professional medical interpreters were provided with the same audio recordings of the study phrases, and their interpretations were also captured as audio files. Transcriptions of the study phrases were not provided, in order to simulate live two-way interpretation (Fig. 1).

Figure 1

Diagram of the study workflow. Study phrases simulated two-way communication between English-speaking providers and patients with limited English proficiency. Provider communications (in green) were recorded in English, and patient communications (in pink) were recorded in either Mandarin or Spanish. These recordings were then played into the three MT applications, and the resulting interpretations were captured as audio files. A professional medical interpreter also provided interpretations, serving as the gold standard. The interpretations were then evaluated by bilingual assessors in four categories (Fluency, Accuracy, Meaning, and Clinical Risk) using a 5-point Likert scale. The English–Mandarin interpretation workflow is shown; the same steps were taken to evaluate English–Spanish interpretations

Each audio recording was reviewed for sound clarity, and volume was adjusted to comparable decibel levels using WavePad Audio Editor (Version 11.33, Canberra, Australia). We downloaded the MT applications from the Apple App Store onto an iPhone running iOS 14.3 for consistency across device hardware and software versions (GT: 6.16.x, AT: 14.1.x, MS: 4.049x). All machine interpretations and audio recordings occurred between February 5 and 7, 2021. Data were collected in a quiet room. A desktop computer with dedicated speakers and a high-fidelity microphone was used to record and capture MT interpretations.
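For readers who wish to reproduce the loudness-matching step programmatically, the following is a minimal sketch of an equivalent adjustment, assuming WAV recordings and the pydub library; the study itself used WavePad rather than this code, and the target level shown is purely illustrative.

```python
# Minimal sketch of loudness matching across recordings. Assumptions: recordings
# are WAV files in ./recordings and pydub is installed; the study used WavePad,
# not this code, and the target level is illustrative only.
from pathlib import Path
from pydub import AudioSegment

TARGET_DBFS = -20.0  # hypothetical target loudness, chosen for illustration

def normalize_clip(path: Path, out_dir: Path) -> Path:
    """Adjust a recording's gain so its average loudness matches TARGET_DBFS."""
    clip = AudioSegment.from_file(path)
    gain = TARGET_DBFS - clip.dBFS      # positive gain boosts, negative attenuates
    adjusted = clip.apply_gain(gain)
    out_path = out_dir / path.name
    adjusted.export(out_path, format="wav")
    return out_path

if __name__ == "__main__":
    out_dir = Path("normalized")
    out_dir.mkdir(exist_ok=True)
    for wav in sorted(Path("recordings").glob("*.wav")):
        normalize_clip(wav, out_dir)
```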

Evaluation Metrics and Outcome Measures

For each language, two bilingual assessors evaluated the quality of MT interpretation, with a third bilingual assessor adjudicating any difference in scores if necessary. The six assessors (three per language) were a mix of clinician (4) and non-clinician (2) volunteers. Assessors were instructed to listen to the interpretation audio files and score one interpretation at a time. The order in which the four interpretations (human, GT, AT, and MS) were presented was randomized for each phrase to mitigate habituation bias. Assessors were instructed to take frequent breaks to minimize fatigue bias. Assessors were also instructed to describe the types of errors encountered in their evaluation process. Errors were classified as omission, abbreviation (inability to accurately identify an abbreviation), syntactic (word order and/or sentence structure), lexical (related to vocabulary), nonsense interpretation, and phonemic (difficulty distinguishing one word from another, such as pad, pat, bad, and bat).
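As an illustration of the per-phrase randomization of presentation order, a minimal sketch follows; the seed, source labels, and phrase count are assumptions for demonstration, not the study's actual implementation.

```python
# Sketch of randomizing the presentation order of the four interpretations
# (human, GT, AT, MS) for each study phrase, to mitigate habituation bias.
# Seed and data layout are illustrative assumptions.
import random

SOURCES = ["human", "GT", "AT", "MS"]
N_PHRASES = 105  # phrases per direction, as described in the study

random.seed(2021)  # fixed seed so the same playlist can be regenerated

playlists = {}
for phrase_id in range(1, N_PHRASES + 1):
    order = SOURCES[:]        # copy so each phrase gets its own shuffle
    random.shuffle(order)
    playlists[phrase_id] = order

# e.g., playlists[1] might be ["MS", "human", "AT", "GT"]
```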

Due to a lack of consensus on evaluation metrics for MT interpretation, we adapted four assessment categories commonly used for evaluating MT translation.19,20 “Accuracy” evaluated loss of information (omission), “Fluency” assessed grammar, “Meaning” assessed unnecessary additions or changes that impacted meaning, and “Clinical Risk” assessed whether a change in meaning could lead to a poor patient outcome.21 Each category was scored on a 5-point Likert scale; Clinical Risk was inversely coded such that a higher score indicated lower (or no) risk. Only the clinicians scored the Clinical Risk category.

The outcome was the acceptability of MT interpretation based on a composite score of the four assessment categories. We defined an interpretation as acceptable if it scored 16 or higher out of 20 possible points (four 5-point Likert categories). We also examined each category separately, defining acceptability as a score of 4 or greater on the 5-point Likert scale.
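The scoring rules above can be summarized in a short sketch; the column names and example ratings below are illustrative assumptions, not study data.

```python
# Sketch of the acceptability rules described above. Column names are
# illustrative assumptions; each score is a 5-point Likert rating per category.
import pandas as pd

CATEGORIES = ["fluency", "accuracy", "meaning", "clinical_risk"]

def score_phrases(ratings: pd.DataFrame) -> pd.DataFrame:
    """Add composite scores and acceptability flags to a table of per-phrase ratings."""
    out = ratings.copy()
    out["composite"] = out[CATEGORIES].sum(axis=1)        # 4 to 20 points
    out["acceptable_overall"] = out["composite"] >= 16     # composite rule
    for cat in CATEGORIES:
        out[f"acceptable_{cat}"] = out[cat] >= 4            # per-category rule
    return out

# Example with two hypothetical phrases rated for one MT application:
example = pd.DataFrame(
    {"fluency": [5, 3], "accuracy": [4, 4], "meaning": [4, 3], "clinical_risk": [5, 4]}
)
print(score_phrases(example)[["composite", "acceptable_overall"]])
```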

Statistical Analysis

Descriptive statistics (proportions with 95% confidence interval [CI]) were used to characterize the proportion of phrases with acceptable interpretations. Paired t-tests were used to compare each MT application to the human interpreter. MT applications were not compared with each other. A p-value of less than 0.05 was considered statistically significant for all analyses. Cronbach’s alpha was used to measure inter-assessor agreement.
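A minimal sketch of these analyses follows, assuming per-phrase composite scores and a per-assessor rating matrix are available as arrays; the data layout and the normal-approximation confidence interval are assumptions, not the study's exact implementation.

```python
# Minimal sketch of the analyses described above. The data layout (one composite
# score per phrase, one column per assessor) and the normal-approximation CI are
# illustrative assumptions.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

def acceptable_proportion(composites, threshold=16):
    """Proportion of phrases meeting the acceptability criterion, with a 95% CI."""
    composites = np.asarray(composites)
    k = int((composites >= threshold).sum())
    n = composites.size
    low, high = proportion_confint(k, n, alpha=0.05)  # normal-approximation CI
    return k / n, (low, high)

def compare_to_human(mt_composites, human_composites):
    """Paired t-test of MT vs. human composite scores on the same phrases."""
    return stats.ttest_rel(mt_composites, human_composites)

def cronbach_alpha(ratings):
    """Cronbach's alpha for inter-assessor agreement (rows = phrases, cols = assessors)."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)
```

Under one common formulation (not necessarily the one applied in this study), non-inferiority within the 15% margin would require the confidence bound for the difference in acceptability between an MT application and the human interpreter to stay within 15 percentage points.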

RESULTS

Six assessors evaluated 105 phrases from English to Spanish/Mandarin and 105 phrases from Spanish/Mandarin to English. Inter-assessor reliability was high for both Spanish (alpha: 0.80) and Mandarin (alpha: 0.86). Figure 2 presents the proportion of interpretations that met the acceptability criteria by language and direction of interpretation. For English to Spanish, the proportion of MT-interpreted phrases scored as acceptable ranged from 0.68 to 0.84; only the GT application came close to the non-inferiority criterion (0.84, 95% CI: 0.77–0.91). For English to Mandarin, the proportion of MT-interpreted phrases scored as acceptable ranged from 0.62 to 0.76; no MT interpretation met the non-inferiority threshold (Table 1). Both Spanish-to-English and Mandarin-to-English interpretations had lower composite scores (median range 13.0 to 14.0 out of 20) and a low proportion of MT-interpreted phrases scored as acceptable (range 0.36–0.41). Every interpretation by professional medical interpreters, both to and from English, was rated highly and scored as acceptable.

Figure 2

Proportion of interpreted phrases deemed acceptable based on the composite scores of 4 assessment categories

Table 1 Composite scores and proportions of acceptable interpretations. Medians and interquartile ranges (IQR) of composite scores are presented, along with the proportion of interpretations that met the acceptability criterion (composite score of 16 or higher) and its 95% confidence interval (CI)

Figure 3 shows the proportions of interpreted phrases scored as acceptable by individual assessment category. For English to Spanish, MT applications scored higher in the accuracy (range 0.83 to 0.96) and clinical risk (0.82 to 0.90) categories than in fluency (0.60 to 0.81) and meaning (0.75 to 0.85). For Spanish-to-English interpretations, accuracy scored 0.70 to 0.76, but the other three categories scored lower (0.40 to 0.51). For English to Mandarin, MT applications scored better in the accuracy category (0.88 to 0.91) than in the other three categories (0.68 to 0.86). For Mandarin to English, all four categories scored low (0.36 to 0.59).

Figure 3

Proportions of interpreted phrases deemed acceptable (defined as a score of 4 or greater on the 5-point Likert scale) by individual assessment category. a English–Spanish interpretation, b English–Mandarin interpretation

Assessors described the types of errors they encountered during their evaluation of MT interpretations; Table 2 presents examples. Errors in syntactic parsing (i.e., word order and/or sentence structure) and in differentiating statements from questions were common. Commonly used abbreviations sometimes posed challenges; while two MT applications correctly recognized “I.V.” as “intravenous,” one understood it as “ivy,” resulting in a significant error in the interpretation of the overall phrase.

Table 2 Examples and types of interpretation errors

DISCUSSION

In this study of three widely available MT applications, we found the overall quality of MT interpretation to be poor for two-way clinical conversations, even in low-stakes settings. In general, MT applications performed significantly better at interpreting from English into Mandarin/Spanish than vice versa. All MT applications were inferior to professional human interpretation, and only English-to-Spanish interpretation using GT came close to meeting the non-inferiority threshold.

Previous studies have reported fewer MT inaccuracies for Spanish than for Chinese translations.10,12 However, this study found similar quality for Spanish and Mandarin interpretations. Because machine interpretation requires appropriate transcription and speech synthesis in addition to translation, challenges in either domain may have impacted the accuracy and quality of Spanish interpretation seen in this study. This may also explain the lower quality of MT interpretation from Spanish/Mandarin to English than from English into those languages, as current machine algorithms may be better adapted to handling English transcription than other languages, each of which poses distinct inherent challenges, such as tonality in Mandarin.22

All three MT applications performed poorly when interpreting phrases containing medical abbreviations, regardless of the direction of interpretation. This may be due to language ambiguity when using abbreviations, medical jargon, or uncommon phrases. Language ambiguity can influence pronunciation and connotation, thereby increasing the risk of improper interpretation.23 In this study, MT had difficulty differentiating between “¿por qué?” (why) and “porque” (because). Intonation and context allow a human interpreter to distinguish between the two but may pose challenges for machines.

Disfluency (such as fillers, stutters, or pauses) may also impact MT interpretation. Examples of fillers include “um,” “well,” and “you know,” which professional interpreters would ignore but which MT applications may either incorporate into the interpretation or stop interpreting before the statement is complete.14 Anxiety is common among hospitalized patients, and communicative anxiety may generate a higher prevalence of language disfluencies.24

The results of this study should be interpreted in the context of its limitations. Although the order in which the human and the three MT interpretations were presented was randomized, the human voice clearly differs from MT audio output. The absence of established criteria for evaluating MT interpretation led us to adapt metrics created by the Advanced Research Projects Agency (ARPA) for evaluating MT translation.19 In addition, we did not test whether MT interpretations were comprehensible to patients. Comprehensibility, defined as the extent to which an interpretation is understandable, takes into consideration the fact that recipients may be able to infer the original content even if the interpretation is deficient in lexical, grammatical, or stylistic accuracy, or in fluency. Performing a specific action following the interpretation of an instruction could serve as a reasonable test of MT comprehension.25 Finally, in the real world, a person using an MT application would notice issues with the interpretation (e.g., if the application stopped transcribing mid-sentence) and could repeat the statement using the visual cues provided by the application.

The critical role of professional interpretation in healthcare is well documented. Executive Order 13166 mandates that federally funded healthcare institutions provide access to professional medical interpretation for patients with limited English proficiency.1 Professional interpreters (compared with no interpretation) improve patient satisfaction, quality of care, clinical outcomes, and patient safety.2 Hospital systems, several of which have undergone litigation related to patient safety or quality-of-care events, also promote the use of professional interpretation.3,4,5,6 Although this study compared MT interpretation to professional medical interpretation, we are aware that the most common alternative in low-stakes communication is, unfortunately, no interpretation at all. Nevertheless, our findings do not currently support a recommendation for the use of MT interpretation in clinical settings. Instead, we encourage clinicians to use professional interpretation and advocate for supporting hardware (speakerphones and video interpretation) in all settings, at least until MT improves significantly for two-way communication.26

In conclusion, three common MT applications demonstrated inferior quality when interpreting two-way verbal communication between English and Spanish and between English and Mandarin, even in simple, brief encounters, compared with a professional medical interpreter. Until the quality of MT interpretation improves significantly, clinicians must ensure safe, effective, and equitable care by working with professional medical interpreters whenever possible.