research-article
A Novel Multi-Modal Network-Based Dynamic Scene Understanding

Published: 27 January 2022

Abstract

In recent years, dynamic scene understanding has attracted growing attention from researchers because of its widespread applications. The key to successfully understanding dynamic scenes lies in jointly representing appearance and motion features to obtain an informative description. Numerous methods have been introduced to solve the dynamic scene recognition problem; nevertheless, several concerns still need to be investigated. In this article, we introduce a novel multi-modal network for dynamic scene understanding from video data that captures both spatial appearance and temporal dynamics effectively. Furthermore, two-level joint tuning layers are proposed to integrate global and local spatial features as well as spatial- and temporal-stream deep features. To extract temporal information, we present a novel dynamic descriptor, the Volume Symmetric Gradient Local Graph Structure (VSGLGS), which generates temporal feature maps similar to optical flow maps while avoiding their drawbacks. Additionally, a Volume Local Directional Transition Pattern (VLDTP) based handcrafted spatiotemporal feature descriptor is introduced, which extracts directional information by exploiting edge responses. Lastly, a stacked Bidirectional Long Short-Term Memory (Bi-LSTM) network together with a temporal mixed pooling scheme is designed to capture dynamic information without noise interference. Extensive experimental investigation shows that the proposed multi-modal network outperforms most state-of-the-art approaches to dynamic scene understanding.
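The abstract does not detail the temporal mixed pooling scheme. A common interpretation, sketched below in NumPy, blends temporal max pooling (which keeps the strongest per-dimension response) with temporal average pooling (which suppresses frame-level noise); the `alpha` mixing weight and the `(T, D)` feature layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def temporal_mixed_pooling(features, alpha=0.5):
    """Blend temporal max pooling and average pooling over a clip.

    features: array of shape (T, D) -- one D-dim feature vector per frame,
              e.g. per-timestep Bi-LSTM outputs.
    alpha:    hypothetical mixing weight; the paper's exact scheme is not
              specified in the abstract.
    """
    max_pool = features.max(axis=0)   # strongest response per dimension
    avg_pool = features.mean(axis=0)  # averages out transient noise
    return alpha * max_pool + (1.0 - alpha) * avg_pool

# Example: pool 8 frames of 4-dimensional features into one clip descriptor
rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 4))
pooled = temporal_mixed_pooling(clip, alpha=0.6)
print(pooled.shape)  # (4,)
```

Setting `alpha=1.0` recovers pure max pooling and `alpha=0.0` pure average pooling, so the weight interpolates between emphasizing salient motion and smoothing noise.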


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 1 (January 2022), 517 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505205


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 October 2020
• Revised: 1 March 2021
• Accepted: 1 April 2021
• Published: 27 January 2022
