ABSTRACT
Video scene detection is the task of dividing a video into temporal semantic chapters. It is an important preliminary step before analyzing heterogeneous video content. Recently, Optimal Sequential Grouping (OSG) was proposed as a powerful unsupervised solution to a formulation of the video scene detection problem. In this work, we extend OSG to the learning regime. By combining the ability to learn from examples with a robust optimization formulation, we boost performance and enhance the versatility of the technology. We present a comprehensive analysis of incorporating OSG into deep neural networks under various configurations: learning an embedding in a straightforward manner, a tailored loss designed to guide the solution of OSG, and an integrated model in which learning is performed through the OSG pipeline. With thorough evaluation and analysis, we assess the benefits and behavior of each configuration, and show that our learnable OSG approach exhibits desirable behavior and enhanced performance compared to the state of the art.
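To make the idea of sequential grouping concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the cost function, so the sketch assumes an additive cost (the sum of pairwise Euclidean distances inside each group of consecutive shots) minimized by dynamic programming for a fixed number of scenes. The names `block_cost_table` and `sequential_grouping` are hypothetical.

```python
import numpy as np


def block_cost_table(dist):
    """cost[i, j] = sum of dist[i:j+1, i:j+1], computed with 2D prefix sums."""
    n = dist.shape[0]
    prefix = np.zeros((n + 1, n + 1))
    prefix[1:, 1:] = dist.cumsum(axis=0).cumsum(axis=1)
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            cost[i, j] = (prefix[j + 1, j + 1] - prefix[i, j + 1]
                          - prefix[j + 1, i] + prefix[i, i])
    return cost


def sequential_grouping(features, num_scenes):
    """Split consecutive shots into num_scenes groups.

    Assumption: the objective is the additive intra-group distance.
    Returns the sorted start indices of scenes 2..num_scenes.
    """
    features = np.asarray(features, dtype=float)
    n = len(features)
    # Pairwise Euclidean distances between shot feature vectors.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    cost = block_cost_table(dist)

    # best[k, j]: minimal cost of splitting shots 0..j into k + 1 groups.
    best = np.full((num_scenes, n), np.inf)
    back = np.zeros((num_scenes, n), dtype=int)
    best[0] = cost[0]
    for k in range(1, num_scenes):
        for j in range(k, n):
            # The last group starts at s, for s in k..j.
            candidates = best[k - 1, k - 1:j] + cost[k:j + 1, j]
            idx = int(np.argmin(candidates))
            best[k, j] = candidates[idx]
            back[k, j] = idx + k

    # Walk the back-pointers from the last shot to recover the boundaries.
    boundaries = []
    j = n - 1
    for k in range(num_scenes - 1, 0, -1):
        start = int(back[k, j])
        boundaries.append(start)
        j = start - 1
    return sorted(boundaries)


# Toy example: eight one-dimensional "shot features" forming three clusters.
toy = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0], [9.2], [9.1]]
print(sequential_grouping(toy, 3))
```

In a learnable setting such as the one the abstract describes, the feature vectors passed to a routine like this would come from a trained embedding rather than from fixed descriptors. How the loss interacts with the grouping step is not detailed in the abstract and is not modeled here.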
Supplemental Material
Please find in file Learnable_OSG_ACMMM_Supplementary_final.pdf the appendices for the original publication.