Label distribution for multimodal machine learning

Ren, Yi; Xu, Ning; Ling, Miaogen; Geng, Xin

doi:10.1007/s11704-021-0611-6

Research Article
Published: 11 September 2021

Volume 16, article number 161306 (2022)

Frontiers of Computer Science


Abstract

Multimodal machine learning (MML) aims to understand the world from multiple related modalities. It has attracted much attention as multimodal data has become increasingly available in real-world applications. MML has been shown to perform better than single-modal machine learning, since multiple modalities contain more information and can complement each other. However, fusing the modalities remains a key challenge in MML. Unlike previous work, we further consider side-information, which reflects the situation and influences the fusion of the modalities. By leveraging the side-information, we recover the multimodal label distribution (MLD), which represents the degree to which each modality contributes to describing the instance. Accordingly, a novel framework named multimodal label distribution learning (MLDL) is proposed to recover the MLD and to fuse the modalities under its guidance, so as to learn an in-depth understanding of the joint feature representation. Moreover, two versions of MLDL are proposed to deal with sequential data. Experiments on multimodal sentiment analysis and disease prediction show that the proposed approaches perform favorably against state-of-the-art methods.
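
The abstract does not give the MLDL architecture, so the following is only an illustrative sketch of the general idea of fusing modalities under the guidance of a distribution over them. It assumes, purely for illustration, that the distribution is a softmax over linear scores of the side-information and that fusion is a weighted sum of per-modality feature vectors. The function names, the linear scoring, and the toy values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recover_distribution(side_info, weights):
    """Hypothetical stand-in for recovering a distribution over modalities.

    side_info: list of floats describing the situation.
    weights:   dict mapping modality name -> list of linear coefficients
               (assumed here; the paper's actual model is not shown).
    """
    names = sorted(weights)
    scores = [
        sum(w * s for w, s in zip(weights[name], side_info)) for name in names
    ]
    return dict(zip(names, softmax(scores)))


def fuse(features, distribution):
    """Weighted sum of per-modality feature vectors of equal length."""
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for name, vector in features.items():
        for i, value in enumerate(vector):
            fused[i] += distribution[name] * value
    return fused


# Toy example with three modalities and two-dimensional features.
features = {"audio": [0.0, 1.0], "text": [1.0, 0.0], "visual": [1.0, 1.0]}
side_info = [1.0, 0.0]
weights = {"audio": [0.5, 0.0], "text": [2.0, 0.0], "visual": [-1.0, 0.0]}

distribution = recover_distribution(side_info, weights)
fused = fuse(features, distribution)
```

In this toy setup the modality with the largest side-information score receives the largest weight, so its features dominate the fused vector. With all-zero coefficients the distribution is uniform and the fusion reduces to plain averaging.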

Acknowledgements

This research was supported by the National Key Research and Development Plan of China (2018AAA0100104), the National Natural Science Foundation of China (Grant No. 62076063), and the Fundamental Research Funds for the Central Universities (2242021k30056).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Southeast University, Nanjing, 211189, China
Yi Ren, Ning Xu, Miaogen Ling & Xin Geng

Corresponding author

Correspondence to Xin Geng.

Additional information

Yi Ren received the BS degree in computer science from Southeast University, China. She is now pursuing the PhD degree with the Department of Computer Science and Engineering, Southeast University, China. Her research interests include machine learning and its application to multimodal machine learning.

Ning Xu received the BSc and MSc degrees from the University of Science and Technology of China and the Chinese Academy of Sciences, China, respectively, and the PhD degree from Southeast University, China. He is now an assistant professor in the School of Computer Science and Engineering at Southeast University, China. His research interests mainly include pattern recognition and machine learning.

Miaogen Ling is currently a lecturer at Nanjing University of Information Science & Technology, China. He received the BS (2010) degree in mathematical science from Soochow University, China, and the MS degree and the PhD (2019) degree in computer science from Southeast University, China. His research interests include machine learning and its application to computer vision and multimedia analysis.

Xin Geng is currently a professor and the dean of the School of Computer Science and Engineering at Southeast University, China. He received the BSc (2001) and MSc (2004) degrees in computer science from Nanjing University, China, and the PhD (2008) degree in computer science from Deakin University, Australia. His research interests include machine learning, pattern recognition, and computer vision. He has published over 80 refereed papers in these areas, including those published in prestigious journals and top international conferences. He has been an Associate Editor of IEEE T-MM, Electronics, FCS and MFC, a Steering Committee Member of PRICAI, a Program Committee Chair for conferences such as PRICAI’18, VALSE’13, etc., an Area Chair for conferences such as CVPR’21, ACMMM’18, ICPR’21, WACV’21, and a Senior Program Committee Member for conferences such as IJCAI, AAAI, ECAI, etc. He is a Distinguished Fellow of IETI and a Member of IEEE.
