ABSTRACT
Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches to this task is to learn, via deep neural networks, a cross-modal embedding space that connects short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content limits the ability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether this limitation can be mitigated with self-supervised contrastive learning: as a pre-training step, we expose a network to a large amount of real music data by contrasting randomly augmented views of snippets of both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct further experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We conclude by arguing for the potential of self-supervised contrastive learning to alleviate the scarcity of annotated data in multi-modal music retrieval models. Code and trained models are accessible at https://github.com/luisfvc/ucasr.
Index Terms
- Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems