ABSTRACT
Real-world intelligent services employing deep learning typically take a two-tier system architecture: a dumb front-end device and smart back-end cloud servers. The front-end device simply forwards a human query, while the back-end servers run a complex deep model to resolve the query and respond to the front-end device. While simple and effective, this architecture not only increases the load on the servers but also risks harming user privacy. In this paper, we present knowledge caching, which exploits the front-end device as a smart cache of a generalized deep model. The cache locally resolves a subset of popular or privacy-sensitive queries and forwards the rest to the back-end cloud servers. We discuss the feasibility of knowledge caching as well as the technical challenges around deep model specialization and compression. We present a prototype two-stage inference system that populates a front-end cache with 10 of 35 voice commands. We demonstrate that our specialization and compression techniques reduce the cached model size by 17.4x relative to the original model while improving inference accuracy by 1.8x.
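The two-stage routing the abstract describes can be sketched as a confidence-gated dispatcher: the device runs a small specialized model over the cached commands, answers locally when its prediction is confident and in-cache, and otherwise forwards the query to the generalized cloud model. The sketch below is illustrative only; the confidence threshold, the "other" bucket, and the stand-in models are assumptions, not the paper's implementation.

```python
import math
import random

CACHED_CLASSES = 10   # voice commands served by the on-device cache
ALL_CLASSES = 35      # voice commands known to the cloud model
THRESHOLD = 0.8       # confidence needed to answer locally (assumed value)

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def device_model(query):
    # Stand-in for the specialized, compressed front-end model: a
    # distribution over the 10 cached commands plus one "other" bucket
    # for inputs it was not trained to resolve.
    return softmax([random.gauss(0, 1) for _ in range(CACHED_CLASSES + 1)])

def cloud_model(query):
    # Stand-in for the generalized back-end model over all 35 commands.
    return softmax([random.gauss(0, 1) for _ in range(ALL_CLASSES)])

def two_stage_infer(query):
    probs = device_model(query)
    pred = max(range(len(probs)), key=probs.__getitem__)
    # Cache hit: a confident prediction for one of the cached commands.
    if pred < CACHED_CLASSES and probs[pred] >= THRESHOLD:
        return pred, "device"
    # Cache miss: forward the query to the back-end servers.
    probs = cloud_model(query)
    return max(range(len(probs)), key=probs.__getitem__), "cloud"

label, tier = two_stage_infer("turn on the lights")
```

In practice the gating signal matters: raw softmax confidence is often miscalibrated, which is why the paper points to calibration and out-of-distribution detection as open challenges for deciding when the cache should answer.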
A Case for Two-stage Inference with Knowledge Caching