Abstract
Shots are key narrative elements of many kinds of video, e.g. movies, TV series, and the user-generated videos thriving on the Internet. The type of a shot greatly influences how the underlying ideas, emotions, and messages are expressed, so the ability to analyze shot types is important to video understanding and is in increasing demand in real-world applications. Classifying shot type is challenging because it requires cues beyond the raw video content, such as the spatial composition of a frame and the camera movement. To address these issues, we propose a learning framework, Subject Guidance Network (SGNet), for shot type recognition. SGNet separates the subject and background of a shot into two streams, which serve as separate guidance maps for scale and movement type classification, respectively. To facilitate shot type analysis and model evaluation, we build MovieShots, a large-scale dataset containing 46K shots from 7K movie trailers annotated with their scale and movement types. Experiments show that our framework recognizes both attributes of a shot accurately, outperforming all previous methods.
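As a rough intuition for why a subject map can guide both attributes, consider this toy numpy sketch (a heuristic illustration only, not the paper's SGNet: all names and thresholds here are hypothetical). The fraction of the frame covered by the subject correlates with shot scale, and the displacement of the subject centroid across frames hints at movement:

```python
import numpy as np

def subject_area_ratio(mask: np.ndarray) -> float:
    """Fraction of the frame occupied by the subject mask."""
    return float(mask.sum()) / mask.size

def centroid(mask: np.ndarray) -> np.ndarray:
    """Center of mass (row, col) of a binary subject mask."""
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

# Toy 8x8 binary subject masks for two consecutive frames.
m0 = np.zeros((8, 8), dtype=np.uint8); m0[2:4, 2:4] = 1
m1 = np.zeros((8, 8), dtype=np.uint8); m1[2:4, 4:6] = 1

ratio = subject_area_ratio(m0)       # 4/64 = 0.0625: small subject, long-shot-like
shift = centroid(m1) - centroid(m0)  # [0, 2]: subject drifts 2 px rightward
```

SGNet learns such guidance end-to-end from separated subject/background streams rather than using hand-crafted statistics like these.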
Notes
1. The background image is the whole image minus the subject part.
2. The method of [39] is adopted here to cut shots from the film.
3. More results and their corresponding videos are shown in the supplementary videos.
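Note 1's complement relation can be made concrete with a minimal numpy sketch (an illustration under the assumption of a binary subject mask; the paper's actual subject maps may be soft):

```python
import numpy as np

# Build a toy 4x4 frame and a binary subject mask covering its center.
frame = np.arange(16, dtype=np.float32).reshape(4, 4)
subject_mask = np.zeros((4, 4), dtype=bool)
subject_mask[1:3, 1:3] = True

# Subject keeps masked pixels; background keeps everything else.
subject = np.where(subject_mask, frame, 0.0)
background = np.where(subject_mask, 0.0, frame)

# The two parts recompose the original frame exactly.
assert np.array_equal(subject + background, frame)
```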
References
Bagheri-Khaligh, A., Raziperchikolaei, R., Moghaddam, M.E.: A new method for shot classification in soccer sports video based on SVM classifier. In: 2012 IEEE Southwest Symposium on Image Analysis and Interpretation, pp. 109–112. IEEE (2012)
Belagiannis, V., Farshad, A., Galasso, F.: Adversarial network compression. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11132, pp. 431–449. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11018-5_37
Benini, S., Canini, L., Leonardi, R.: Estimating cinematographic scene depth in movie shots. In: 2010 IEEE International Conference on Multimedia and Expo, pp. 855–860. IEEE (2010)
Bhattacharya, S., Mehran, R., Sukthankar, R., Shah, M.: Classification of cinematographic shots using lie algebra and its application to complex event recognition. IEEE Trans. Multimed. 16(3), 686–696 (2014)
Caelles, S., Pont-Tuset, J., Perazzi, F., Montes, A., Maninis, K.K., Van Gool, L.: The 2019 DAVIS challenge on VOS: unsupervised multi-object segmentation. arXiv:1905.00737 (2019)
Canini, L., Benini, S., Leonardi, R.: Classifying cinematographic shot types. Multimed. Tools Appl. 62(1), 51–73 (2013)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the faster R-CNN architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139 (2018)
Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M.: Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 569–582 (2014)
Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017)
Christoph, R., Pinz, F.A.: Spatiotemporal residual networks for video action recognition. In: Advances in Neural Information Processing Systems, pp. 3468–3476 (2016)
Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H., Mei, T.: Relation distillation networks for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 7023–7032 (2019)
Deng, Z., et al.: R3Net: recurrent residual refinement network for saliency detection. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 684–690. AAAI Press (2018)
Duan, L.Y., Xu, M., Tian, Q., Xu, C.S., Jin, J.S.: A unified framework for semantic shot classification in sports video. IEEE Trans. Multimed. 7(6), 1066–1083 (2005)
Ekin, A., Tekalp, A.M.: Shot type classification by dominant color for sports video segmentation and summarization. In: Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003, ICASSP 2003, vol. 3, pp. III-173. IEEE (2003)
Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970 (2015)
Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982 (2018)
Giannetti, L.D., Leach, J.: Understanding Movies, vol. 1. Prentice Hall, Upper Saddle River (1999)
Goldblum, M., Fowl, L., Feizi, S., Goldstein, T.: Adversarially robust distillation. In: Thirty-Fourth AAAI Conference on Artificial Intelligence (2020)
Guo, C., et al.: Progressive sparse local attention for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3909–3918 (2019)
Hasan, M.A., Xu, M., He, X., Xu, C.: CAMHID: camera motion histogram descriptor and its application to cinematographic shot classification. IEEE Trans. Circuits Syst. Video Technol. 24(10), 1682–1695 (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NIPS Deep Learning and Representation Learning Workshop (2015)
Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.: Deeply supervised salient object detection with short connections. IEEE TPAMI 41(4), 815–828 (2019)
Huang, Q., Xiong, Y., Rao, A., Wang, J., Lin, D.: MovieNet: a holistic dataset for movie understanding. In: The European Conference on Computer Vision (ECCV). Springer, Cham (2020)
Jiang, H., Zhang, M.: Tennis video shot classification based on support vector machine. In: 2011 IEEE International Conference on Computer Science and Automation Engineering, vol. 2, pp. 757–761. IEEE (2011)
Kowdle, A., Chen, T.: Learning to segment a video to clips based on scene and camera motion. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 272–286. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_20
Li, L., Zhang, X., Hu, W., Li, W., Zhu, P.: Soccer video shot classification based on color characterization using dominant sets clustering. In: Muneesawang, P., Wu, F., Kumazawa, I., Roeksabutr, A., Liao, M., Tang, X. (eds.) PCM 2009. LNCS, vol. 5879, pp. 923–929. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-10467-1_83
Li, S., He, F., Du, B., Zhang, L., Xu, Y., Tao, D.: Fast spatio-temporal residual network for video super-resolution. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Li, X., et al.: DeepSaliency: multi-task deep neural network model for salient object detection. IEEE Trans. Image Process. 25(8), 3919–3930 (2016)
Monfort, M., et al.: Moments in time dataset: one million videos for event understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42, 502–508 (2019)
Prasertsakul, P., Kondo, T., Iida, H.: Video shot classification using 2D motion histogram. In: 2017 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 202–205. IEEE (2017)
Rao, A., et al.: A local-to-global approach to multi-modal movie scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10146–10155 (2020)
Recasens, A., Kellnhofer, P., Stent, S., Matusik, W., Torralba, A.: Learning to zoom: a saliency-based sampling layer for neural networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 52–67. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_4
Roth, J., et al.: AVA-ActiveSpeaker: an audio-visual dataset for active speaker detection. arXiv preprint arXiv:1901.01342 (2019)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
Savardi, M., Signoroni, A., Migliorati, P., Benini, S.: Shot scale analysis in movies by convolutional neural networks. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2620–2624. IEEE (2018)
Shou, Z., et al.: DMC-Net: generating discriminative motion cues for fast compressed video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1268–1277 (2019)
Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H., Bugalho, M., Trancoso, I.: Temporal video segmentation to scenes using high-level audiovisual features. IEEE Trans. Circuits Syst. Video Technol. 21(8), 1163–1177 (2011)
Wang, H.L., Cheong, L.F.: Taxonomy of directing semantics for film shot classification. IEEE Trans. Circuits Syst. Video Technol. 19(10), 1529–1542 (2009)
Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
Wang, X., Zhang, R., Sun, Y., Qi, J.: KDGAN: knowledge distillation with generative adversarial networks. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31, pp. 775–786. Curran Associates, Inc. (2018)
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Wang, Y., Xu, C., Xu, C., Tao, D.: Adversarial learning of portable student networks. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Wikipedia: As seen through a telescope. https://en.wikipedia.org/. Accessed 18 Feb 2020
Xia, J., Rao, A., Huang, Q., Xu, L., Wen, J., Lin, D.: Online multi-modal person search in videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 174–190. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_11
Tong, X.-F., Liu, Q.-S., Lu, H.-Q., Jin, H.-L.: Shot classification in sports video. In: Proceedings 7th International Conference on Signal Processing, ICSP 2004, vol. 2, pp. 1364–1367 (2004)
Xiong, Y., Huang, Q., Guo, L., Zhou, H., Zhou, B., Lin, D.: A graph-based framework to bridge movies and synopses. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
Xu, G., Liu, Z., Li, X., Loy, C.C.: Knowledge distillation meets self-supervision. In: European Conference on Computer Vision (ECCV). Springer, Cham (2020)
Xu, H., Das, A., Saenko, K.: R-C3D: region convolutional 3D network for temporal activity detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5783–5792 (2017)
Xu, K., Wen, L., Li, G., Bo, L., Huang, Q.: Spatiotemporal CNN for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1379–1388 (2019)
Xu, M., et al.: Using context saliency for movie shot classification. In: 2011 18th IEEE International Conference on Image Processing, pp. 3653–3656. IEEE (2011)
Xu, R., Li, X., Zhou, B., Loy, C.C.: Deep flow-guided video inpainting. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Yang, J., Zheng, W.S., Yang, Q., Chen, Y.C., Tian, Q.: Spatial-temporal graph convolutional network for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3289–3299 (2020)
Laradji, I.H., Rostamzadeh, N., Pinheiro, P.O., Vazquez, D., Schmidt, M.: Where are the blobs: counting by localization with point supervision. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 560–576. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_34
Yuan, L., et al.: Central similarity quantization for efficient image and video retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3083–3092 (2020)
Zeng, X., Liao, R., Gu, L., Xiong, Y., Fidler, S., Urtasun, R.: DMM-Net: differentiable mask-matching network for video object segmentation. arXiv preprint arXiv:1909.12471 (2019)
Zhang, H., Liu, D., Xiong, Z.: Two-stream oriented video super-resolution for action recognition. arXiv preprint arXiv:1903.05577 (2019)
Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2914–2923 (2017)
Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2814–2821 (2014)
Acknowledgement
This work is partially supported by the SenseTime Collaborative Grant on Large-scale Multi-modality Analysis (CUHK Agreement No. TS1610626 & No. TS1712093), the General Research Fund (GRF) of Hong Kong (No. 14203518 & No. 14205719), and Innovation and Technology Support Program (ITSP) Tier 2, ITS/431/18F.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Rao, A., et al. (2020). A Unified Framework for Shot Type Classification Based on Subject Centric Lens. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol 12356. Springer, Cham. https://doi.org/10.1007/978-3-030-58621-8_2
DOI: https://doi.org/10.1007/978-3-030-58621-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58620-1
Online ISBN: 978-3-030-58621-8
eBook Packages: Computer Science, Computer Science (R0)