ABSTRACT
Real-world intelligent services employing deep learning typically take a two-tier system architecture: a dumb front-end device and smart back-end cloud servers. The front-end device simply forwards a human query, while the back-end servers run a complex deep model to resolve the query and respond to the front-end device. While simple and effective, this architecture not only increases the load on the servers but also risks harming user privacy. In this paper, we present knowledge caching, which exploits the front-end device as a smart cache of a generalized deep model. The cache locally resolves a subset of popular or privacy-sensitive queries and forwards the rest to the back-end cloud servers. We discuss the feasibility of knowledge caching as well as the technical challenges around deep model specialization and compression. We present a prototype two-stage inference system that populates a front-end cache with 10 of 35 voice commands. We demonstrate that our specialization and compression techniques reduce the cached model size by 17.4x relative to the original model while improving inference accuracy by 1.8x.
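The two-stage routing the abstract describes can be sketched as a confidence-gated dispatcher: the device runs a small specialized model over the cached commands, answers locally when its prediction is confident and in-cache, and otherwise forwards the query to the generalized cloud model. The sketch below is illustrative only; the confidence threshold, the "other" bucket, and the stand-in models are assumptions, not the paper's implementation.

```python
import math
import random

CACHED_CLASSES = 10   # voice commands served by the on-device cache
ALL_CLASSES = 35      # voice commands known to the cloud model
THRESHOLD = 0.8       # confidence needed to answer locally (assumed value)

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def device_model(query):
    # Stand-in for the specialized, compressed front-end model: a
    # distribution over the 10 cached commands plus one "other" bucket
    # for inputs it was not trained to resolve.
    return softmax([random.gauss(0, 1) for _ in range(CACHED_CLASSES + 1)])

def cloud_model(query):
    # Stand-in for the generalized back-end model over all 35 commands.
    return softmax([random.gauss(0, 1) for _ in range(ALL_CLASSES)])

def two_stage_infer(query):
    probs = device_model(query)
    pred = max(range(len(probs)), key=probs.__getitem__)
    # Cache hit: a confident prediction for one of the cached commands.
    if pred < CACHED_CLASSES and probs[pred] >= THRESHOLD:
        return pred, "device"
    # Cache miss: forward the query to the back-end servers.
    probs = cloud_model(query)
    return max(range(len(probs)), key=probs.__getitem__), "cloud"

label, tier = two_stage_infer("turn on the lights")
```

In practice the gating signal matters: raw softmax confidence is often miscalibrated, which is why the paper points to calibration and out-of-distribution detection as open challenges for deciding when the cache should answer.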
A Case for Two-stage Inference with Knowledge Caching