DOI: 10.1145/3394171.3413612
research-article

Learnable Optimal Sequential Grouping for Video Scene Detection

Published: 12 October 2020

ABSTRACT

Video scene detection is the task of dividing a video into temporal semantic chapters. It is an important preliminary step before analyzing heterogeneous video content. Recently, Optimal Sequential Grouping (OSG) was proposed as a powerful unsupervised solution to a formulation of the video scene detection problem. In this work, we extend OSG to the learning regime. By combining the ability to learn from examples with a robust optimization formulation, we can boost performance and enhance the versatility of the technology. We present a comprehensive analysis of incorporating OSG into deep neural networks under various configurations: learning an embedding in a straightforward manner, a tailored loss designed to guide the solution of OSG, and an integrated model in which learning is performed through the OSG pipeline. Through thorough evaluation and analysis, we assess the benefits and behavior of each configuration, and show that our learnable OSG approach exhibits desirable behavior and improved performance compared to the state of the art.
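
To make the idea of grouping a sequence by optimization concrete, the following is a minimal sketch of a dynamic program that splits a sequence of feature vectors into a fixed number of contiguous groups. The abstract does not state the OSG objective, so the cost used here (the sum of pairwise Euclidean distances inside each group) is an assumption, and the function name `sequential_grouping` and its parameters are hypothetical. It illustrates the general technique and is not the authors' implementation.

```python
import math


def sequential_grouping(features, num_groups):
    """Split a sequence into `num_groups` contiguous groups.

    Assumed cost (not given in the abstract): the sum of pairwise
    Euclidean distances between items that share a group.
    Returns the index of the last item of each group, in order.
    """
    n = len(features)
    if not 1 <= num_groups <= n:
        raise ValueError("num_groups must be between 1 and len(features)")

    # cost[i][j]: sum of pairwise distances among items i..j (inclusive).
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Extending the block i..j-1 with item j adds the distances
            # from item j to every item already in the block.
            added = sum(math.dist(features[t], features[j]) for t in range(i, j))
            cost[i][j] = cost[i][j - 1] + added

    # best[k][i]: minimal cost of splitting items i..n-1 into k groups.
    best = [[math.inf] * (n + 1) for _ in range(num_groups + 1)]
    choice = [[0] * (n + 1) for _ in range(num_groups + 1)]
    best[0][n] = 0.0  # zero groups cover an empty suffix at no cost
    for k in range(1, num_groups + 1):
        for i in range(n):
            for j in range(i, n):  # the first group covers items i..j
                total = cost[i][j] + best[k - 1][j + 1]
                if total < best[k][i]:
                    best[k][i] = total
                    choice[k][i] = j

    # Walk the stored choices from the start to recover the group ends.
    ends, start = [], 0
    for k in range(num_groups, 0, -1):
        end = choice[k][start]
        ends.append(end)
        start = end + 1
    return ends


# Toy one-dimensional "shot features" forming three obvious clusters.
shots = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0], [9.2], [9.1]]
print(sequential_grouping(shots, 3))  # [2, 4, 7]
```

In a learnable setting such as the one the abstract describes, the feature vectors would come from a trained embedding rather than being fixed in advance. This sketch covers only the grouping step under the assumed cost.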

Supplemental Material

3394171.3413612.mp4 (MP4, 13.7 MB)

Published in

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
