SGDiff: A Style Guided Diffusion Model for Fashion Synthesis

ABSTRACT
This paper reports the development of SGDiff, a novel style-guided diffusion model that overcomes certain weaknesses inherent in existing models for image synthesis. The proposed SGDiff combines the image modality with a pretrained text-to-image diffusion model to facilitate creative fashion image synthesis. It addresses the limitations of text-to-image diffusion models by incorporating supplementary style guidance, substantially reducing training costs, and overcoming the difficulty of controlling synthesized styles with text-only inputs. This paper also introduces SG-Fashion, a new dataset designed for fashion image synthesis applications that offers high-resolution images and an extensive range of garment categories. Through a comprehensive ablation study, we examine the application of classifier-free guidance to a variety of conditions and validate the effectiveness of the proposed model in generating fashion images of the desired categories, product attributes, and styles. The contributions of this paper include a novel classifier-free guidance method for multi-modal feature fusion, a comprehensive dataset for fashion image synthesis applications, a thorough investigation of conditioned text-to-image synthesis, and valuable insights for future research in the text-to-image synthesis domain. The code and dataset are available at https://github.com/taited/SGDiff.
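To make the "classifier-free guidance applied to a variety of conditions" idea concrete, the sketch below shows one common way to compose guidance over two conditions (a text prompt and a style image). This is an illustrative NumPy sketch of nested two-condition classifier-free guidance, not necessarily the paper's exact formulation; the weights `w_text` and `w_style` and the composition order are assumptions for illustration.

```python
import numpy as np

def cfg_multi(eps_uncond, eps_text, eps_style, w_text=7.5, w_style=3.0):
    """Compose a guided noise prediction from three denoiser outputs.

    eps_uncond: denoiser output with all conditions dropped
    eps_text:   denoiser output conditioned on the text prompt only
    eps_style:  denoiser output conditioned on text + style image

    The nested form pushes the prediction first toward the text
    condition, then further toward the style condition. (Illustrative
    only; the exact weighting scheme in SGDiff may differ.)
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_style * (eps_style - eps_text))
```

Note that with both weights set to 1 the composition collapses to the fully conditioned prediction, and with `w_style = 0` it reduces to standard single-condition classifier-free guidance on the text prompt.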