Text2Light: Zero-Shot Text-Driven HDR Panorama Generation

Authors: Zhaoxi Chen, Guangcong Wang, and Ziwei Liu (all Nanyang Technological University, Singapore)

ACM Transactions on Graphics, Volume 41, Issue 6, Article No. 195, pp. 1–16. https://doi.org/10.1145/3550454.3555447

Published: 30 November 2022

Abstract

High-quality HDRIs (High Dynamic Range Images), typically HDR panoramas, are one of the most popular ways to create photorealistic lighting and 360-degree reflections of 3D scenes in graphics. Given the difficulty of capturing HDRIs, a versatile and controllable generative model is highly desired, where layman users can intuitively control the generation process. However, existing state-of-the-art methods still struggle to synthesize high-quality panoramas for complex scenes. In this work, we propose a zero-shot text-driven framework, Text2Light, to generate 4K+ resolution HDRIs without paired training data. Given a free-form text as the description of the scene, we synthesize the corresponding HDRI with two dedicated steps: 1) text-driven panorama generation in low dynamic range (LDR) and low resolution (LR), and 2) super-resolution inverse tone mapping to scale up the LDR panorama both in resolution and dynamic range. Specifically, to achieve zero-shot text-driven panorama generation, we first build dual codebooks as the discrete representation for diverse environmental textures. Then, driven by the pre-trained Contrastive Language-Image Pre-training (CLIP) model, a text-conditioned global sampler learns to sample holistic semantics from the global codebook according to the input text. Furthermore, a structure-aware local sampler learns to synthesize LDR panoramas patch-by-patch, guided by holistic semantics. To achieve super-resolution inverse tone mapping, we derive a continuous representation of 360-degree imaging from the LDR panorama as a set of structured latent codes anchored to the sphere. This continuous representation enables a versatile module to upscale the resolution and dynamic range simultaneously. Extensive experiments demonstrate the superior capability of Text2Light in generating high-quality HDR panoramas. In addition, we show the feasibility of our work in realistic rendering and immersive VR.
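
As a rough illustration of what anchoring a representation "to the sphere" can involve, the sketch below maps pixels of a panorama to unit directions on the sphere and back. It assumes an equirectangular layout with the y axis pointing up and +z at the image centre; the abstract does not state the authors' parameterization, and the function names are hypothetical.

```python
import math


def equirect_pixel_to_direction(u, v, width, height):
    """Map pixel (u, v) of an equirectangular image to a unit direction.

    Assumed convention (not taken from the paper): longitude runs from
    -pi at the left edge to +pi at the right edge, latitude runs from
    +pi/2 at the top to -pi/2 at the bottom, y is up, and the image
    centre looks along +z. Pixel centres sit at half-integer offsets.
    """
    phi = ((u + 0.5) / width) * 2.0 * math.pi - math.pi        # longitude
    theta = math.pi / 2.0 - ((v + 0.5) / height) * math.pi     # latitude
    x = math.cos(theta) * math.sin(phi)
    y = math.sin(theta)
    z = math.cos(theta) * math.cos(phi)
    return (x, y, z)


def direction_to_equirect(x, y, z, width, height):
    """Inverse mapping: unit direction back to continuous pixel coordinates."""
    phi = math.atan2(x, z)
    theta = math.asin(max(-1.0, min(1.0, y)))  # clamp against rounding error
    u = (phi + math.pi) / (2.0 * math.pi) * width - 0.5
    v = (math.pi / 2.0 - theta) / math.pi * height - 0.5
    return (u, v)
```

Because the inverse mapping returns continuous coordinates, any direction on the sphere can be queried, not only the original pixel grid. That is the property a continuous representation needs in order to resample a panorama at a higher resolution.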

Supplemental Material

3550454.3555447.mp4 (MP4, 877.7 MB)

Index Terms

Text2Light: Zero-Shot Text-Driven HDR Panorama Generation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published in

ACM Transactions on Graphics Volume 41, Issue 6
December 2022
1428 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3550454

Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
high dynamic range imaging
image generation
panorama generation
text-driven generation
Qualifiers
- research-article
Article Metrics
- Total Citations: 22
- Total Downloads: 442
- Downloads (Last 12 months): 227
- Downloads (Last 6 weeks): 25