
Medical Speech Recognition: Reaching Parity with Humans

Conference paper in: Speech and Computer (SPECOM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Abstract

We present a speech recognition system for the medical domain whose architecture is based on a state-of-the-art stack trained on over 270 h of medical speech data and 30 million tokens of text from clinical episodes. Despite the acoustic challenges and linguistic complexity of the domain, we were able to reduce the system’s word error rate to below 16% in a realistic clinical use case. To further benchmark our system, we determined the human word error rate on a corpus covering a wide variety of speakers, working with multiple medical transcriptionists, and found that our speech recognition system performs on a par with humans.
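As context for the abstract's headline metric: word error rate (WER) is the word-level edit distance between a reference transcript and a system hypothesis, normalized by the reference length, so the reported "below 16%" means fewer than 16 substitution, insertion, or deletion errors per 100 reference words. A minimal sketch of the standard computation (not code from the paper; the example sentences are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word tokens, via dynamic programming:
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution against a four-word reference: WER = 1/4.
print(wer("patient denies chest pain", "patient denies chest pains"))  # 0.25
```

The same alignment is what standard scoring tools (e.g. NIST sclite or Kaldi's `compute-wer`) perform, with additional text normalization beyond the simple whitespace split used here.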



Author information

Corresponding author: David Suendermann-Oeft


Copyright information

© 2017 Springer International Publishing AG


Cite this paper

Edwards, E. et al. (2017). Medical Speech Recognition: Reaching Parity with Humans. In: Karpov, A., Potapova, R., Mporas, I. (eds.) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_51

  • DOI: https://doi.org/10.1007/978-3-319-66429-3_51

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3
