Abstract
Previous work on music generation and transformation has commonly targeted single-instrument or single-melody music. Here, in contrast, five music genres are used, with the goal of achieving selective remixing by applying domain transfer methods to spectrogram images of music. A pipeline architecture comprising two independent generative adversarial network models was created. The first applies features from one of the genres to constant-Q transform spectrogram images to perform style transfer. The second network turns a spectrogram into a real-valued tensor representation, which is then approximately reconstructed into audio. The system was evaluated experimentally and through a survey. Due to the complexity involved in processing high-sample-rate music with homophonic or polyphonic audio textures, the system's audio output was judged to be of low quality, but the style transfer produced noticeable selective remixing on most of the music tracks evaluated.
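The pipeline's front and back ends rest on standard signal-processing steps: a constant-Q transform (CQT) turns audio into the spectrogram images the GANs operate on, and Griffin-Lim-style phase estimation approximately inverts a magnitude spectrogram back into a waveform. The following is a minimal sketch of that round trip using librosa, which the paper employs; the file path and transform parameters (hop length, bin counts, iteration count) are illustrative assumptions, not the authors' actual settings.

```python
import numpy as np
import librosa
import soundfile as sf

# Placeholder input; librosa resamples to 22050 Hz by default.
# The paper works with higher sample rates, which is harder.
y, sr = librosa.load('example.wav', sr=22050)

# Forward constant-Q transform: a complex-valued, log-frequency
# spectrogram. Parameters here are illustrative, not the paper's.
hop_length = 512
bins_per_octave = 12
C = librosa.cqt(y, sr=sr, hop_length=hop_length,
                n_bins=84, bins_per_octave=bins_per_octave)

# A GAN stage would operate on a real-valued image such as the
# log-magnitude; phase information is discarded at this point.
S = librosa.amplitude_to_db(np.abs(C), ref=np.max)

# Approximate inversion: Griffin-Lim phase estimation for CQT
# magnitudes (lossy, hence "approximately reconstructed").
y_hat = librosa.griffinlim_cqt(np.abs(C), sr=sr, hop_length=hop_length,
                               bins_per_octave=bins_per_octave, n_iter=32)

sf.write('reconstruction.wav', y_hat, sr)
```

Because the phase is estimated rather than preserved, the reconstruction is inherently approximate, which is consistent with the low perceived audio quality the abstract reports.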
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
McAllister, T., Gambäck, B. (2022). Music Style Transfer Using Constant-Q Transform Spectrograms. In: Martins, T., Rodríguez-Fernández, N., Rebelo, S.M. (eds) Artificial Intelligence in Music, Sound, Art and Design. EvoMUSART 2022. Lecture Notes in Computer Science, vol 13221. Springer, Cham. https://doi.org/10.1007/978-3-031-03789-4_13
DOI: https://doi.org/10.1007/978-3-031-03789-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-03788-7
Online ISBN: 978-3-031-03789-4
eBook Packages: Computer Science, Computer Science (R0)