ABSTRACT
With the increasing availability of multimodal data, especially in the sports and medical domains, there is growing interest in developing Artificial Intelligence (AI) models that comprehend the world more holistically. Multimodal understanding nevertheless faces several challenges, including integrating multiple modalities and resolving the semantic gaps between them. The proposed research aims to leverage multiple input modalities to improve the multimodal understanding of AI models, enhancing their reasoning, generation, and intelligent behavior. The research objectives focus on developing novel methods for multimodal AI and integrating them into conversational agents optimized for domain-specific requirements. The methodology encompasses a literature review, data curation, model development and implementation, evaluation and performance analysis, domain-specific applications, and documentation and reporting. Ethical considerations will be addressed throughout, and a comprehensive research plan is outlined for guidance. The research contributes to the field of multimodal AI understanding and to the advancement of sophisticated AI systems by experimenting with multimodal data to improve the performance of state-of-the-art neural networks.
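The integration challenge mentioned above is often framed as a choice of fusion strategy. As a minimal, purely illustrative sketch (the function and variable names below are hypothetical and not part of the proposed system), the two most common strategies can be contrasted as follows:

```python
# Illustrative sketch of two common multimodal fusion strategies.
# Plain Python lists stand in for modality embeddings; all names
# here are hypothetical examples, not the paper's actual methods.

def early_fusion(text_vec, audio_vec):
    """Feature-level fusion: concatenate modality embeddings into
    one joint representation before any downstream prediction."""
    return text_vec + audio_vec  # list concatenation

def late_fusion(text_score, audio_score, w_text=0.6, w_audio=0.4):
    """Decision-level fusion: score each modality separately, then
    combine the per-modality predictions with fixed weights."""
    return w_text * text_score + w_audio * audio_score

text_emb = [0.2, 0.7, 0.1]   # hypothetical text embedding
audio_emb = [0.9, 0.3]       # hypothetical audio embedding

joint = early_fusion(text_emb, audio_emb)   # 5-dimensional joint vector
fused = late_fusion(0.8, 0.5)               # weighted decision score
```

In practice, learned fusion (e.g., cross-modal attention in transformers) replaces these fixed schemes, but the early-versus-late distinction still frames how semantic gaps between modalities are bridged.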
Index Terms
- Bridging Multimedia Modalities: Enhanced Multimodal AI Understanding and Intelligent Agents