ABSTRACT
Advances in deep learning have enabled a new class of video analytics systems and applications. Existing systems research on real-time video event detection does not consider matching based on natural language; instead, it focuses on domain-specific languages that define spatio-temporal operators over video streams for efficient matching. Conversely, research in the multimodal AI community on joint understanding of video and language focuses on applications such as language-based video retrieval, where videos can be processed offline. In this work, we propose AlertMe, a multimodal live video trigger system that matches incoming video streams against a set of user-defined natural language triggers. We dynamically select the optimal sliding-window size for extracting feature vectors from different modalities in near real time. We also describe how we achieve on-device deployment by introducing a profiler that selects runtime-efficient feature extractors. Lastly, we show that limiting the number of trigger candidates can significantly improve event detection performance in applications such as task following on AR glasses.
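The core matching step described above can be sketched as follows. This is a minimal illustration, not AlertMe's actual implementation: the function names, embedding dimensionality, and similarity threshold are all assumptions. The idea is that each sliding window of the stream is fused into a feature vector, which is compared against the precomputed text embeddings of the user-defined triggers; a trigger fires when the similarity clears a threshold.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors (pure-Python for clarity).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def match_triggers(window_features, trigger_embeddings, threshold=0.7):
    # Return the indices of triggers whose (hypothetical) text embedding is
    # close enough to the fused feature vector of the current sliding window.
    return [i for i, t in enumerate(trigger_embeddings)
            if cosine_sim(window_features, t) >= threshold]

# Toy example with 4-dim embeddings: the first trigger is nearly parallel
# to the window embedding and fires; the second is orthogonal and does not.
window = [1.0, 0.0, 0.0, 0.0]
triggers = [[0.9, 0.1, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0]]
print(match_triggers(window, triggers))  # [0]
```

In a real pipeline the window features would come from per-modality extractors (video, audio) projected into a joint video-language embedding space, and restricting the candidate trigger set, as the abstract notes, shrinks the comparison loop and reduces false matches.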
AlertMe: Towards Natural Language-Based Live Video Trigger Systems at the Edge