Abstract
The addressee detection problem arises in real spoken dialogue systems (SDSs), which must distinguish speech addressed to them from speech addressed to other humans. In this work, several modalities were analyzed, and acoustic data was chosen as the main modality because it can be used most flexibly in modern SDSs. To address the problem, deep learning methods, namely fully-connected neural networks and Long Short-Term Memory (LSTM) networks, were applied. The developed models were improved by using different optimization methods, activation functions and a learning rate optimization method. The models were further optimized by using a recursive feature elimination method and multiple initialization to increase the training speed. A fully-connected neural network reaches an average recall of 0.78, while an LSTM network shows an average recall of 0.65. Advantages and disadvantages of both architectures for this particular task are discussed.
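As an illustration of the reported metric, the following minimal sketch computes an average recall as the unweighted mean of per-class recalls. This reading of "average recall" is an assumption, since the abstract does not define it. The function name, the label encoding (1 for machine-directed, 0 for human-directed speech) and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally regardless of size."""
    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # number of correctly predicted examples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (assumption): 1 = machine-directed, 0 = human-directed
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Class 1 recall is 3/4, class 0 recall is 1/2, so the mean is 0.625
uar = unweighted_average_recall(y_true, y_pred)
print(uar)
```

Averaging per-class recalls rather than pooling all utterances keeps a majority class from dominating the score, which matters when one addressee class is much more frequent than the other.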
Acknowledgments
This work is partially supported by the grant of the President of Russia (No. MD-254.2017.8) and by the RFBR (project No. 16-37-60100).
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Pugachev, A., Akhtiamov, O., Karpov, A., Minker, W. (2018). Deep Learning for Acoustic Addressee Detection in Spoken Dialogue Systems. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science, vol 789. Springer, Cham. https://doi.org/10.1007/978-3-319-71746-3_4
DOI: https://doi.org/10.1007/978-3-319-71746-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71745-6
Online ISBN: 978-3-319-71746-3
eBook Packages: Computer Science, Computer Science (R0)