DOI: 10.1145/3587819.3590968 · MMSys Conference Proceedings
Research Article · Open Access

Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems

Published: 08 June 2023

ABSTRACT

Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. A fundamental approach to this task is to learn, via deep neural networks, a cross-modal embedding space that connects short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content limits the ability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether this limitation can be mitigated with self-supervised contrastive learning: as a pre-training step, we expose a network to a large amount of real music data by contrasting randomly augmented views of snippets from both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models retrieve snippets with better precision across all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct further experiments on several retrieval configurations. In this task, we observe that retrieval quality improves by between 30% and 100% when real music data is present. We conclude by arguing that self-supervised contrastive learning has the potential to alleviate the scarcity of annotated data in multi-modal music retrieval models. Code and trained models are accessible at https://github.com/luisfvc/ucasr.
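The pre-training step described above contrasts randomly augmented views of snippets so that matching views attract and non-matching views repel in the embedding space. The paper does not spell out the loss in the abstract; the following is a minimal NumPy sketch of a standard NT-Xent (normalized temperature-scaled cross-entropy) contrastive objective of the kind popularized by SimCLR, which the abstract's description is consistent with. The function name, the temperature value, and the assumption that embeddings come in aligned pairs (one view per modality or augmentation) are illustrative, not taken from the paper.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, tau=0.5):
    """NT-Xent contrastive loss over a batch of paired embeddings.

    z_a, z_b: (N, d) arrays holding two views of the same N snippets
    (e.g. two augmentations, or an audio view and a sheet-image view).
    Row i of z_a and row i of z_b form a positive pair; all other
    combinations in the batch act as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    z = np.concatenate([z_a, z_b], axis=0)   # (2N, d)
    sim = z @ z.T / tau                      # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)           # exclude self-similarity

    # The positive for sample i is its counterpart i+N (mod 2N).
    n = z_a.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])

    # Cross-entropy: -log softmax of the positive among all candidates.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))
```

With this objective, a batch where each snippet's two views already coincide yields a lower loss than one where the pairing is scrambled, which is exactly the pressure that pulls matching audio and sheet snippets together during pre-training.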

