DOI: 10.1145/3577190.3614225

Bridging Multimedia Modalities: Enhanced Multimodal AI Understanding and Intelligent Agents

Published: 09 October 2023

ABSTRACT

With the increasing availability of multimodal data, especially in the sports and medical domains, there is growing interest in developing Artificial Intelligence (AI) models capable of comprehending the world in a more holistic manner. Nevertheless, multimodal understanding poses several challenges, including the integration of multiple modalities and the resolution of semantic gaps between them. The proposed research aims to leverage multiple input modalities to improve the multimodal understanding of AI models, enhancing their reasoning, generation, and intelligent behavior. The research objectives focus on developing novel methods for multimodal AI and integrating them into conversational agents optimized for domain-specific requirements. The research methodology encompasses literature review, data curation, model development and implementation, evaluation and performance analysis, domain-specific applications, and documentation and reporting. Ethical considerations will be addressed throughout, and a comprehensive research plan is outlined to provide guidance. The research contributes to the field of multimodal AI understanding and to the advancement of sophisticated AI systems by experimenting with multimodal data to enhance the performance of state-of-the-art neural networks.
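To make the modality-integration challenge concrete, the following minimal Python (PyTorch) sketch illustrates feature-level fusion: precomputed video and audio embeddings are projected into a shared space and concatenated before classification. This is an illustrative example only, not the authors' implementation; the embedding dimensions, the class names, and the soccer-highlight task are assumptions introduced for illustration.

# Minimal feature-level fusion sketch (illustrative only; not the paper's method).
# Assumes precomputed per-clip video and audio embeddings; dimensions are made up.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, video_dim=768, audio_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        # Project each modality into a shared space to narrow the semantic gap.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Classify from the concatenated (fused) representation.
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden_dim, num_classes))

    def forward(self, video_emb, audio_emb):
        # Fuse by concatenating the per-modality projections along the feature axis.
        fused = torch.cat([self.video_proj(video_emb), self.audio_proj(audio_emb)], dim=-1)
        return self.classifier(fused)

# Usage with random stand-ins for real encoder outputs (batch of 4 clips):
model = FusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128))  # -> shape (4, 2)

Early (input-level) and late (decision-level) fusion are common alternatives; the concatenation strategy above is simply the shortest to demonstrate.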

            Published in

              ICMI '23: Proceedings of the 25th International Conference on Multimodal Interaction
              October 2023
              858 pages
              ISBN: 9798400700552
              DOI: 10.1145/3577190

              Copyright © 2023 Owner/Author

              Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 9 October 2023


              Qualifiers

              • extended-abstract
              • Research
              • Refereed limited
