ABSTRACT
This paper proposes a cycle architecture based on policy gradients for unsupervised video summarization. Specifically, a modified DSNet and a DSN-attention net form a cycle and promote each other during training, achieving higher performance than unsupervised methods that formulate video summarization as a sequential decision-making process. In the training stage, the DSN-attention net is trained by policy gradient combined with an additional MSE loss between the outputs of the modified DSNet and the DSN-attention net. The output of the DSN-attention net is then used to generate pseudo-labels for training the modified DSNet, closing the cycle. At test time, the final video summary is produced by averaging the outputs of the modified DSNet and the DSN-attention net. Extensive experiments and analysis on two benchmark datasets demonstrate the effectiveness of our method and its superior performance compared with state-of-the-art unsupervised methods.
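The interaction between the two branches described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the two score functions, the weight vectors, and the 0.5 pseudo-label threshold are all hypothetical stand-ins for the actual deep networks and labeling scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-ins for the two branches (hypothetical linear models; the
# paper's branches are deep networks trained with policy gradient).
def dsnet_scores(features, w):
    # Frame-level importance scores in (0, 1).
    return sigmoid(features @ w)

def attention_scores(features, w):
    return sigmoid(features @ w)

features = rng.normal(size=(10, 4))   # 10 frames, 4-dim features
w_dsnet = rng.normal(size=4)
w_attn = rng.normal(size=4)

s_dsnet = dsnet_scores(features, w_dsnet)
s_attn = attention_scores(features, w_attn)

# Additional MSE loss between the two branches' outputs, added to the
# policy-gradient objective when training the DSN-attention net.
mse_loss = np.mean((s_dsnet - s_attn) ** 2)

# Pseudo-labels for training the modified DSNet, generated from the
# DSN-attention net's output (thresholding is an assumed labeling rule).
pseudo_labels = (s_attn > 0.5).astype(float)

# Test stage: average fusion of the two branches' scores.
fused_scores = 0.5 * (s_dsnet + s_attn)
```

The fused scores would then feed the usual knapsack-style shot selection to produce the final summary.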
Index Terms
- A Cycle Architecture Based on Policy Gradient for Unsupervised Video Summarization