Music Style Transfer Using Constant-Q Transform Spectrograms

  • Conference paper
  • First Online:
Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART 2022)

Abstract

Previous work on music generation and transformation has commonly targeted single-instrument or single-melody music. Here, in contrast, five music genres are used, with the goal of achieving selective remixing by applying domain transfer methods to spectrogram images of music. A pipeline architecture comprising two independent generative adversarial network models was created. The first applies features from one of the genres to constant-Q transform spectrogram images to perform style transfer. The second network turns a spectrogram into a real-valued tensor representation, which is approximately reconstructed back into audio. The system was evaluated experimentally and through a survey. Owing to the complexity of processing high-sample-rate music with homophonic or polyphonic audio textures, the system’s audio output was judged to be of low quality, but the style transfer produced noticeable selective remixing on most of the music tracks evaluated.
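
To make the pipeline concrete, the following is a minimal sketch, assuming the librosa library, of the two audio endpoints such a system needs: computing a constant-Q transform (CQT) spectrogram from a music clip, and approximately reconstructing audio from its magnitude via Griffin-Lim phase estimation. The GAN style-transfer stage is omitted, and the file names and parameters are illustrative, not the authors' settings.

    # Minimal sketch of the spectrogram <-> audio endpoints of a CQT-based
    # style-transfer pipeline. Assumes librosa and soundfile are installed;
    # the GAN stage itself is not shown. File names are illustrative.
    import numpy as np
    import librosa
    import soundfile as sf

    # Load a music clip (librosa's default sample rate, not necessarily
    # what the paper used).
    y, sr = librosa.load("track.wav", sr=22050, mono=True)

    # Forward constant-Q transform: complex-valued, 84 bins spanning
    # seven octaves at 12 bins per octave.
    C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)

    # A network operating on spectrogram images sees only the magnitude,
    # so phase is discarded here and must be re-estimated on the way back.
    magnitude = np.abs(C)

    # Approximate reconstruction via Griffin-Lim phase estimation for the CQT.
    y_hat = librosa.griffinlim_cqt(magnitude, sr=sr, hop_length=512,
                                   bins_per_octave=12, n_iter=32)

    sf.write("track_reconstructed.wav", y_hat, sr)

Because only the magnitude spectrogram survives this round trip, the phase must be re-estimated during inversion, which is one reason spectrogram-image pipelines of this kind tend to lose audio quality.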

Notes

  1. www.ableton.com/en; www.apple.com/mac/garageband.

  2. www.youtube.com; soundcloud.com.

  3. https://tinyurl.com/hiphop2vaporwave.

  4. www.reddit.com/r/takemysurvey; www.reddit.com/r/SampleSize.

  5. https://mcallistertyler.github.io/music-comparison.html.

  6. https://tinyurl.com/genreidentification.

Author information

Corresponding author

Correspondence to Björn Gambäck.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

McAllister, T., Gambäck, B. (2022). Music Style Transfer Using Constant-Q Transform Spectrograms. In: Martins, T., Rodríguez-Fernández, N., Rebelo, S.M. (eds) Artificial Intelligence in Music, Sound, Art and Design. EvoMUSART 2022. Lecture Notes in Computer Science, vol 13221. Springer, Cham. https://doi.org/10.1007/978-3-031-03789-4_13

  • DOI: https://doi.org/10.1007/978-3-031-03789-4_13

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-03788-7

  • Online ISBN: 978-3-031-03789-4

  • eBook Packages: Computer Science, Computer Science (R0)
