Spoken intent detection has become a popular way to interact with smart devices. However, such systems are limited to a preset list of intents (terms or commands), which prevents users from quickly customizing personal devices with new intents. This paper presents a few-shot spoken intent classification approach that uses task-agnostic representations learned via a meta-learning paradigm. Specifically, we leverage popular representation-based meta-learning methods to build a task-agnostic representation of utterances, which a linear classifier then uses for prediction. We evaluate three such approaches with a novel experimental protocol developed on two popular spoken intent classification datasets: Google Commands and Fluent Speech Commands. For 5-shot (1-shot) classification of novel classes, the proposed framework achieves an average classification accuracy of 88.6% (76.3%) on the Google Commands dataset and 78.5% (64.2%) on the Fluent Speech Commands dataset. This performance is comparable to that of traditional supervised classification models trained with abundant samples.
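The evaluation setting described above, a linear classifier fit on a handful of labeled embeddings per novel class, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the embeddings are synthetic stand-ins for the output of a frozen, meta-learned encoder, closed-form ridge regression is swapped in as one simple choice of linear classifier, and all function names, the 5-way setting, and the hyperparameters (`dim`, `lam`, `n_query`) are hypothetical.

```python
import numpy as np


def sample_episode(rng, n_way=5, k_shot=5, n_query=15, dim=16):
    """Draw one N-way K-shot episode of synthetic embeddings.

    Assumption: each novel class forms a Gaussian cluster in embedding
    space. A real system would embed utterances with a trained encoder.
    """
    centers = rng.normal(scale=5.0, size=(n_way, dim))

    def draw(n_per_class):
        x = np.repeat(centers, n_per_class, axis=0)
        x = x + rng.normal(size=x.shape)
        y = np.repeat(np.arange(n_way), n_per_class)
        return x, y

    return draw(k_shot), draw(n_query)


def fit_ridge_classifier(x, y, n_way, lam=1.0):
    """Fit a linear classifier by ridge regression onto one-hot labels.

    Uses the dual form W = X^T (X X^T + lam * I)^{-1} Y, which is cheap
    when the support set has few samples.
    """
    targets = np.eye(n_way)[y]
    gram = x @ x.T + lam * np.eye(len(x))
    return x.T @ np.linalg.solve(gram, targets)


def episode_accuracy(rng, n_way=5, k_shot=5):
    """Fit on the support set and score on the query set of one episode."""
    (x_support, y_support), (x_query, y_query) = sample_episode(
        rng, n_way=n_way, k_shot=k_shot
    )
    weights = fit_ridge_classifier(x_support, y_support, n_way)
    predictions = (x_query @ weights).argmax(axis=1)
    return float((predictions == y_query).mean())


rng = np.random.default_rng(0)
# Average over many episodes, as is usual when reporting few-shot accuracy.
acc_5shot = float(np.mean([episode_accuracy(rng, k_shot=5) for _ in range(100)]))
acc_1shot = float(np.mean([episode_accuracy(rng, k_shot=1) for _ in range(100)]))
print(f"5-shot: {acc_5shot:.3f}  1-shot: {acc_1shot:.3f}")
```

The accuracies printed here reflect only the synthetic clusters and say nothing about the reported results; the point is the episode structure, in which the encoder stays fixed and only a small linear classifier is fit per task.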
Cite as: Mittal, A., Bharadwaj, S., Khare, S., Chemmengath, S., Sankaranarayanan, K., Kingsbury, B. (2020) Representation Based Meta-Learning for Few-Shot Spoken Intent Recognition. Proc. Interspeech 2020, 4283-4287, doi: 10.21437/Interspeech.2020-3208
@inproceedings{mittal20_interspeech,
  author={Ashish Mittal and Samarth Bharadwaj and Shreya Khare and Saneem Chemmengath and Karthik Sankaranarayanan and Brian Kingsbury},
  title={{Representation Based Meta-Learning for Few-Shot Spoken Intent Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4283--4287},
  doi={10.21437/Interspeech.2020-3208}
}