DOI: 10.1145/3434770.3459740
EuroSys Conference Proceedings · Research Article

AlertMe: Towards Natural Language-Based Live Video Trigger Systems at the Edge

Published: 26 April 2021

ABSTRACT

Advances in deep learning have enabled brand new video analytics systems and applications. Existing systems research on real-time video event detection does not consider matching based on natural language; rather, it focuses on using Domain Specific Languages that define spatio-temporal operators on video streams for efficient matching. Alternatively, research in the multimodal AI community on joint understanding of video and language focuses on applications such as language-based video retrieval, where videos may have been processed offline. In this work, we propose AlertMe, a multimodal-based live video trigger system that matches incoming video streams to a set of user-defined natural language triggers. We dynamically select the optimal sliding window size to extract feature vectors from different modalities in near real time. We also describe our approach to achieve on-device deployment by introducing a profiler to select runtime-efficient feature extractors. Lastly, we show that limiting the number of trigger candidates can significantly increase event detection performance in applications such as task following in AR glasses.
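The core matching step described above — comparing the joint embedding of a sliding video window against the embeddings of user-defined natural language triggers — can be sketched as follows. This is an illustrative toy only: the function names, the cosine-similarity metric, and the fixed threshold are assumptions for exposition, not the paper's actual implementation, and the hard-coded vectors stand in for real feature-extractor outputs.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def match_triggers(window_emb, trigger_embs, threshold=0.8):
    # Return indices of triggers whose text embedding is close enough
    # to the embedding of the current video window.
    return [i for i, t in enumerate(trigger_embs)
            if cosine_sim(window_emb, t) >= threshold]

# Toy 4-d joint-embedding vectors; a real system would obtain these from
# video/audio feature extractors and a text encoder trained into a shared space.
triggers = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
window = [0.05, 0.98, 0.10, 0.0]  # closest to trigger 1

print(match_triggers(window, triggers))  # → [1]
```

Limiting the trigger candidate set, as the abstract notes, shrinks `trigger_embs` and thereby reduces both the chance of a spurious high-similarity match and the per-window matching cost.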


Published in

EdgeSys '21: Proceedings of the 4th International Workshop on Edge Systems, Analytics and Networking
April 2021, 84 pages
ISBN: 9781450382915
DOI: 10.1145/3434770

Copyright © 2021 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



Qualifiers

• Research article (refereed limited)

Acceptance Rates

Overall acceptance rate: 10 of 23 submissions, 43%
