ABSTRACT
Advances in deep learning have enabled a new class of video analytics systems and applications. Existing systems research on real-time video event detection does not consider matching based on natural language; instead, it focuses on domain-specific languages that define spatio-temporal operators over video streams for efficient matching. Conversely, research in the multimodal AI community on joint understanding of video and language focuses on applications such as language-based video retrieval, where videos can be processed offline. In this work, we propose AlertMe, a multimodal live video trigger system that matches incoming video streams against a set of user-defined natural language triggers. We dynamically select the optimal sliding-window size for extracting feature vectors from different modalities in near real time. We also describe how we achieve on-device deployment by introducing a profiler that selects runtime-efficient feature extractors. Lastly, we show that limiting the number of trigger candidates can significantly improve event detection performance in applications such as task following on AR glasses.
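The core matching step described above can be sketched as follows. This is a minimal illustration, not AlertMe's actual implementation: the function names, embedding dimensionality, and similarity threshold are all assumptions. The idea is that each sliding window of the stream is fused into a feature vector, which is compared against the precomputed text embeddings of the user-defined triggers; a trigger fires when the similarity clears a threshold.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors (pure-Python for clarity).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def match_triggers(window_features, trigger_embeddings, threshold=0.7):
    # Return the indices of triggers whose (hypothetical) text embedding is
    # close enough to the fused feature vector of the current sliding window.
    return [i for i, t in enumerate(trigger_embeddings)
            if cosine_sim(window_features, t) >= threshold]

# Toy example with 4-dim embeddings: the first trigger is nearly parallel
# to the window embedding and fires; the second is orthogonal and does not.
window = [1.0, 0.0, 0.0, 0.0]
triggers = [[0.9, 0.1, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0]]
print(match_triggers(window, triggers))  # [0]
```

In a real pipeline the window features would come from per-modality extractors (video, audio) projected into a joint video-language embedding space, and restricting the candidate trigger set, as the abstract notes, shrinks the comparison loop and reduces false matches.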
AlertMe: Towards Natural Language-Based Live Video Trigger Systems at the Edge