research-article

A Novel Multi-Modal Network-Based Dynamic Scene Understanding

Authors:
Md Azher Uddin
Department of Artificial Intelligence, Ajou University, Suwon-si, Gyeonggi-do, Korea

Joolekha Bibi Joolee
Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do, Korea

Young-Koo Lee
Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do, Korea

Kyung-Ah Sohn
Department of Software and Computer Engineering, and Department of Artificial Intelligence, Ajou University, Suwon-si, Gyeonggi-do, Korea

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 1, Article No. 7, pp. 1–19. https://doi.org/10.1145/3462218

Published: 27 January 2022

Abstract

In recent years, dynamic scene understanding has attracted growing attention from researchers because of its widespread applications. The key to understanding dynamic scenes successfully is to represent appearance and motion features jointly, so that the resulting description is informative. Numerous methods have been introduced to solve the dynamic scene recognition problem; nevertheless, a few concerns still need to be investigated. In this article, we introduce a novel multi-modal network for dynamic scene understanding from video data, which captures both spatial appearance and temporal dynamics effectively. Furthermore, two-level joint tuning layers are proposed to integrate the global and local spatial features, as well as the deep features of the spatial and temporal streams. To extract temporal information, we present a novel dynamic descriptor, namely Volume Symmetric Gradient Local Graph Structure (VSGLGS), which generates temporal feature maps similar to optical flow maps while overcoming the issues of optical flow maps. Additionally, we introduce a handcrafted spatiotemporal feature descriptor based on the Volume Local Directional Transition Pattern (VLDTP), which extracts directional information by exploiting edge responses. Lastly, a stacked Bidirectional Long Short-Term Memory (Bi-LSTM) network with a temporal mixed pooling scheme is designed to capture dynamic information without noise interference. Extensive experiments show that the proposed multi-modal network outperforms most state-of-the-art approaches for dynamic scene understanding.
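The abstract names a temporal mixed pooling scheme but does not define it. As a minimal sketch, assuming that "mixed" means a convex combination of max pooling and average pooling along the time axis (a common formulation of mixed pooling, not confirmed by this abstract to be the authors' method), the idea can be illustrated as follows. The function name `temporal_mixed_pooling`, the weight `lam`, and the toy input are all hypothetical.

```python
from typing import List, Sequence


def temporal_mixed_pooling(frames: Sequence[Sequence[float]],
                           lam: float = 0.5) -> List[float]:
    """Pool a (time x feature) sequence into a single feature vector.

    Assumption: pooled = lam * max-pool + (1 - lam) * average-pool, taken
    over the time axis. The abstract does not state the actual formula.
    """
    if not frames:
        raise ValueError("need at least one time step")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    steps = len(frames)
    pooled = []
    # zip(*frames) yields one tuple per feature, holding its value at every time step.
    for values in zip(*frames):
        max_part = max(values)
        avg_part = sum(values) / steps
        pooled.append(lam * max_part + (1.0 - lam) * avg_part)
    return pooled


# Hypothetical per-frame outputs: 3 time steps, 2 features each.
sequence = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
video_descriptor = temporal_mixed_pooling(sequence, lam=0.5)
```

Under this assumption, `lam = 1.0` reduces to pure max pooling and `lam = 0.0` to pure average pooling, so the weight trades off sensitivity to peak responses against smoothing over the whole sequence.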

Index Terms

A Novel Multi-Modal Network-Based Dynamic Scene Understanding
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
2. Networks
  1. Network properties
    1. Network reliability

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 18, Issue 1
January 2022
517 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505205
Editor:
Alberto Del Bimbo
University of Firenze, Italy
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 January 2022
- Accepted: 1 April 2021
- Revised: 1 March 2021
- Received: 1 October 2020
Author Tags
Multi-modal network
volume symmetric gradient local graph structure
volume local directional transition pattern
temporal mixed pooling
stacked Bi-LSTM network
Qualifiers
- research-article
- Refereed