ABSTRACT
With the increasing availability of multimodal data, especially in the sports and medical domains, there is growing interest in developing Artificial Intelligence (AI) models that comprehend the world more holistically. Multimodal understanding nevertheless faces several challenges, including integrating multiple modalities and resolving the semantic gaps between them. The proposed research aims to leverage multiple input modalities to improve the multimodal understanding of AI models, enhancing their reasoning, generation, and intelligent behavior. The research objectives focus on developing novel methods for multimodal AI and integrating them into conversational agents optimized for domain-specific requirements. The methodology encompasses a literature review, data curation, model development and implementation, evaluation and performance analysis, domain-specific applications, and documentation and reporting. Ethical considerations will be addressed throughout, and a comprehensive research plan is outlined for guidance. The research contributes to the field of multimodal AI understanding and to the advancement of sophisticated AI systems by experimenting with multimodal data to improve the performance of state-of-the-art neural networks.
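The integration challenge mentioned above is often framed as a choice of fusion strategy. As a minimal, purely illustrative sketch (the function and variable names below are hypothetical and not part of the proposed system), the two most common strategies can be contrasted as follows:

```python
# Illustrative sketch of two common multimodal fusion strategies.
# Plain Python lists stand in for modality embeddings; all names
# here are hypothetical examples, not the paper's actual methods.

def early_fusion(text_vec, audio_vec):
    """Feature-level fusion: concatenate modality embeddings into
    one joint representation before any downstream prediction."""
    return text_vec + audio_vec  # list concatenation

def late_fusion(text_score, audio_score, w_text=0.6, w_audio=0.4):
    """Decision-level fusion: score each modality separately, then
    combine the per-modality predictions with fixed weights."""
    return w_text * text_score + w_audio * audio_score

text_emb = [0.2, 0.7, 0.1]   # hypothetical text embedding
audio_emb = [0.9, 0.3]       # hypothetical audio embedding

joint = early_fusion(text_emb, audio_emb)   # 5-dimensional joint vector
fused = late_fusion(0.8, 0.5)               # weighted decision score
```

In practice, learned fusion (e.g., cross-modal attention in transformers) replaces these fixed schemes, but the early-versus-late distinction still frames how semantic gaps between modalities are bridged.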
Index Terms
- Bridging Multimedia Modalities: Enhanced Multimodal AI Understanding and Intelligent Agents