Abstract
In recent years, dynamic scene understanding has attracted growing research attention because of its wide range of applications. The key to understanding dynamic scenes lies in jointly representing appearance and motion features to obtain an informative description. Numerous methods have been proposed for dynamic scene recognition; nevertheless, a few concerns remain to be investigated. In this article, we introduce a novel multi-modal network for dynamic scene understanding from video data that captures both spatial appearance and temporal dynamics. Furthermore, two-level joint tuning layers are proposed to integrate the global and local spatial features as well as the deep features of the spatial and temporal streams. To extract temporal information, we present a novel dynamic descriptor, namely Volume Symmetric Gradient Local Graph Structure (VSGLGS), which generates temporal feature maps similar to optical flow maps while overcoming the issues of optical flow maps. Additionally, a handcrafted spatiotemporal feature descriptor based on the Volume Local Directional Transition Pattern (VLDTP) is introduced, which extracts directional information by exploiting edge responses. Lastly, a stacked Bidirectional Long Short-Term Memory (Bi-LSTM) network with a temporal mixed pooling scheme is designed to capture dynamic information without noise interference. Extensive experiments show that the proposed multi-modal network outperforms most state-of-the-art approaches for dynamic scene understanding.
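The abstract names a temporal mixed pooling scheme but does not define it. As a rough illustration only, the sketch below assumes one common formulation of mixed pooling: a weighted blend of max pooling and average pooling applied along the time axis of per-frame feature vectors. The function name `temporal_mixed_pooling` and the mixing weight `alpha` are hypothetical and may not match the article's actual design.

```python
from typing import List, Sequence


def temporal_mixed_pooling(
    frames: Sequence[Sequence[float]], alpha: float = 0.5
) -> List[float]:
    """Blend max and average pooling over the time axis.

    frames: per-frame feature vectors, all of the same length.
    alpha:  hypothetical mixing weight in [0, 1]; 1.0 gives pure max
            pooling, 0.0 gives pure average pooling (an assumption,
            not a parameter taken from the article).
    """
    if not frames:
        raise ValueError("need at least one frame")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    pooled = []
    # zip(*frames) walks the feature dimensions, yielding the values
    # that one dimension takes across all time steps.
    for values in zip(*frames):
        max_part = max(values)
        avg_part = sum(values) / len(values)
        pooled.append(alpha * max_part + (1.0 - alpha) * avg_part)
    return pooled
```

In this formulation, a larger `alpha` favors the strongest response at any time step, while a smaller one favors the average response, which is less sensitive to single-frame outliers.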
Index Terms
- A Novel Multi-Modal Network-Based Dynamic Scene Understanding