Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems

Authors:
Luis Carvalho, Johannes Kepler University, Linz, Austria (https://orcid.org/0000-0002-1344-3463)
Tobias Washüttl, Johannes Kepler University, Linz, Austria (https://orcid.org/0009-0001-9947-6783)
Gerhard Widmer, Johannes Kepler University, Linz, Austria (https://orcid.org/0000-0003-3531-1282)

MMSys '23: Proceedings of the 14th ACM Multimedia Systems Conference, June 2023, Pages 239–248
https://doi.org/10.1145/3587819.3590968

Published: 08 June 2023

ABSTRACT

Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches toward this task is to learn, via deep neural networks, a cross-modal embedding space that connects short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content limits the ability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning: as a pre-training step, we expose a network to a large amount of real music data and contrast randomly augmented views of snippets of both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models are able to retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct further experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We conclude by arguing for the potential of self-supervised contrastive learning to alleviate the scarcity of annotated data in multi-modal music retrieval models. Code and trained models are accessible at https://github.com/luisfvc/ucasr.
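
To make the idea of contrasting randomly augmented views concrete, the following is a minimal sketch of a normalized temperature-scaled cross-entropy (NT-Xent) loss of the kind popularized by SimCLR. It is an assumption that a loss of this family applies here; the abstract does not name the loss. The function name `nt_xent_loss`, the temperature value, and the random arrays standing in for snippet embeddings are all hypothetical and are not taken from the authors' code.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for two batches of embeddings (hypothetical helper).

    z_a[i] and z_b[i] are assumed to be embeddings of two augmented views
    of the same snippet; every other embedding in the batch is a negative.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)            # shape (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit length -> cosine similarity
    sim = z @ z.T / temperature                       # scaled pairwise similarities
    np.fill_diagonal(sim, -np.inf)                    # a view is never its own negative

    # Row i in the first half has its positive at i + n, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm
    return -log_prob[np.arange(2 * n), pos].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views_a = rng.normal(size=(8, 16))                    # stand-in embeddings, view 1
    views_b = views_a + 0.01 * rng.normal(size=(8, 16))   # near-identical view 2
    print("loss for matching views:", nt_xent_loss(views_a, views_b))
```

In a pre-training setup of this kind, the loss would be minimized separately over audio snippets and sheet-image snippets, so that each encoder learns to map different augmentations of the same snippet close together before any annotated audio-sheet pairs are used.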

Index Terms

Computing methodologies
  Machine learning
Information systems
  Information retrieval
    Specialized information retrieval
      Multimedia and multimodal retrieval
        Music retrieval

Published in
MMSys '23: Proceedings of the 14th ACM Multimedia Systems Conference
June 2023
495 pages
ISBN:9798400701481
DOI:10.1145/3587819
Co-chairs: Shervin Shirmohammadi, Mohamed Hefeeda
Program Co-chairs: Roger Zimmermann, Carsten Griwodz, Mea Wang
Copyright © 2023 Owner/Author(s)
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Publisher: Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 8 June 2023
Author Tags
multi-modal embedding spaces
audio-sheet music retrieval
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 176 of 530 submissions, 33%