Music Style Transfer Using Constant-Q Transform Spectrograms

  • Conference paper
  • First Online:
Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART 2022)

Abstract

Previous work on music generation and transformation has commonly targeted single-instrument or single-melody music. Here, in contrast, five music genres are used, with the goal of achieving selective remixing by applying domain transfer methods to spectrogram images of music. A pipeline architecture comprising two independent generative adversarial network models was created. The first applies features from one of the genres to constant-Q transform spectrogram images to perform style transfer. The second network turns a spectrogram into a real-valued tensor representation, which is approximately reconstructed back into audio. The system was evaluated experimentally and through a survey. Owing to the complexity of processing high-sample-rate music with homophonic or polyphonic audio textures, the system’s audio output was judged to be of low quality, but the style transfer produced noticeable selective remixing on most of the music tracks evaluated.
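
To make the pipeline concrete, the following is a minimal sketch, assuming the librosa library, of the two audio endpoints such a system needs: computing a constant-Q transform (CQT) spectrogram from a music clip, and approximately reconstructing audio from its magnitude via Griffin-Lim phase estimation. The GAN style-transfer stage is omitted, and the file names and parameters are illustrative, not the authors' settings.

    # Minimal sketch of the spectrogram <-> audio endpoints of a CQT-based
    # style-transfer pipeline. Assumes librosa and soundfile are installed;
    # the GAN stage itself is not shown. File names are illustrative.
    import numpy as np
    import librosa
    import soundfile as sf

    # Load a music clip (librosa's default sample rate, not necessarily
    # what the paper used).
    y, sr = librosa.load("track.wav", sr=22050, mono=True)

    # Forward constant-Q transform: complex-valued, 84 bins spanning
    # seven octaves at 12 bins per octave.
    C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)

    # A network operating on spectrogram images sees only the magnitude,
    # so phase is discarded here and must be re-estimated on the way back.
    magnitude = np.abs(C)

    # Approximate reconstruction via Griffin-Lim phase estimation for the CQT.
    y_hat = librosa.griffinlim_cqt(magnitude, sr=sr, hop_length=512,
                                   bins_per_octave=12, n_iter=32)

    sf.write("track_reconstructed.wav", y_hat, sr)

Because only the magnitude spectrogram survives this round trip, the phase must be re-estimated during inversion, which is one reason spectrogram-image pipelines of this kind tend to lose audio quality.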

Notes

  1. www.ableton.com/en; www.apple.com/mac/garageband.

  2. www.youtube.com; soundcloud.com.

  3. https://tinyurl.com/hiphop2vaporwave.

  4. www.reddit.com/r/takemysurvey; www.reddit.com/r/SampleSize.

  5. https://mcallistertyler.github.io/music-comparison.html.

  6. https://tinyurl.com/genreidentification.

Author information

Corresponding author

Correspondence to Björn Gambäck.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

McAllister, T., Gambäck, B. (2022). Music Style Transfer Using Constant-Q Transform Spectrograms. In: Martins, T., Rodríguez-Fernández, N., Rebelo, S.M. (eds) Artificial Intelligence in Music, Sound, Art and Design. EvoMUSART 2022. Lecture Notes in Computer Science, vol 13221. Springer, Cham. https://doi.org/10.1007/978-3-031-03789-4_13

  • DOI: https://doi.org/10.1007/978-3-031-03789-4_13

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-03788-7

  • Online ISBN: 978-3-031-03789-4

  • eBook Packages: Computer Science, Computer Science (R0)
