research-article

Learnable Optimal Sequential Grouping for Video Scene Detection

Authors:
Daniel Rotman (IBM Research, Haifa, Israel)
Yevgeny Yaroker (IBM Research, Haifa, Israel)
Elad Amrani (IBM Research, Haifa, Israel)
Udi Barzelay (IBM Research, Haifa, Israel)
Rami Ben-Ari (IBM Research, Haifa, Israel)

MM '20: Proceedings of the 28th ACM International Conference on Multimedia, October 2020, Pages 1958–1966. https://doi.org/10.1145/3394171.3413612

Published: 12 October 2020

ABSTRACT

Video scene detection is the task of dividing videos into temporal semantic chapters. This is an important preliminary step before attempting to analyze heterogeneous video content. Recently, Optimal Sequential Grouping (OSG) was proposed as a powerful unsupervised solution to a formulation of the video scene detection problem. In this work, we extend the capabilities of OSG to the learning regime. By combining the ability to learn from examples with a robust optimization formulation, we can boost performance and enhance the versatility of the technology. We present a comprehensive analysis of incorporating OSG into deep neural networks under various configurations. These configurations include learning an embedding in a straightforward manner, a tailored loss designed to guide the solution of OSG, and an integrated model where the learning is performed through the OSG pipeline. With thorough evaluation and analysis, we assess the benefits and behavior of the various configurations, and show that our learnable OSG approach exhibits desirable behavior and enhanced performance compared to the state of the art.
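
The abstract does not state the OSG formulation itself. As a rough illustration only, the sketch below assumes one additive-cost form of sequential grouping: given a pairwise distance matrix over temporally ordered units (for example, shots), it chooses K contiguous groups that minimize the total within-group distance, using dynamic programming. The function name `sequential_grouping`, the additive cost, and the toy 1-D features are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def sequential_grouping(D, K):
    """Split N ordered items into K contiguous groups.

    Assumed objective (illustrative): minimize the sum, over groups,
    of all pairwise distances D[i, j] with i and j in the same group.
    Returns (bounds, cost), where bounds holds the exclusive end index
    of each group, so the last entry is always N.
    """
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    if D.shape != (N, N) or not 1 <= K <= N:
        raise ValueError("D must be square and 1 <= K <= N")

    # 2-D prefix sums: P[i, j] = sum of D[:i, :j], so a block sum is O(1).
    P = np.zeros((N + 1, N + 1))
    P[1:, 1:] = D.cumsum(axis=0).cumsum(axis=1)

    def block_cost(a, b):
        # Sum of D[a:b, a:b], the within-group cost of items a..b-1.
        return P[b, b] - P[a, b] - P[b, a] + P[a, a]

    # C[k, n] = minimal cost of splitting the first n items into k groups.
    C = np.full((K + 1, N + 1), np.inf)
    back = np.zeros((K + 1, N + 1), dtype=int)
    C[0, 0] = 0.0
    for k in range(1, K + 1):
        for n in range(k, N + 1):
            for a in range(k - 1, n):  # a = start index of the last group
                cost = C[k - 1, a] + block_cost(a, n)
                if cost < C[k, n]:
                    C[k, n] = cost
                    back[k, n] = a

    # Walk the back-pointers from (K, N) to recover the group ends.
    bounds, n = [], N
    for k in range(K, 0, -1):
        bounds.append(int(n))
        n = back[k, n]
    return sorted(bounds), float(C[K, N])


if __name__ == "__main__":
    # Toy 1-D "shot features" with three visually obvious clusters.
    x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 9.0, 9.1, 9.2])
    D = np.abs(x[:, None] - x[None, :])
    print(sequential_grouping(D, 3))
```

The triple loop costs O(K·N²) time, which is acceptable for the few hundred to few thousand shots of a typical video. In a learnable setting such as the one the abstract describes, the distance matrix would presumably be computed from a trained embedding rather than from fixed features.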

Supplemental Material

3394171.3413612.mp4 (mp4, 13.7 MB)

mmfp1263aux.zip (zip, 5.2 MB): the file Learnable_OSG_ACMMM_Supplementary_final.pdf contains the appendices for the original publication.

Published in
MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
General Chairs: Chang Wen Chen (Chinese University of Hong Kong, Shenzhen, China), Rita Cucchiara (UNIMORE, Italy), Xian-Sheng Hua (Alibaba Group, China)
Program Chairs: Guo-Jun Qi (Futurewei Technologies, USA), Elisa Ricci (UNITN & Fondazione Bruno Kessler, Italy), Zhengyou Zhang (Tencent, China), Roger Zimmermann (National University of Singapore, Singapore)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
deep learning
dynamic programming
optimization
temporal segmentation
video analysis
video scene detection
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%