Recognition: unknown
Linear Image Generation by Synthesizing Exposure Brackets
Pith reviewed 2026-05-10 00:10 UTC · model grok-4.3
The pith
Linear images are synthesized from text by generating a sequence of exposure brackets, one per portion of the dynamic range, with a DiT-based flow-matching architecture, then merging the brackets into the final linear image.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We address the task of text-to-linear-image generation by representing a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. The brackets are combined to form the final linear image. This enables downstream uses such as text-guided editing of the linear output and structure-conditioned synthesis through ControlNet.
What carries the argument
A sequence of exposure brackets that together record the full irradiance range of a scene, generated by a DiT-based flow-matching network conditioned on text.
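To make the machinery concrete, here is a minimal NumPy sketch of the bracket representation and the merge back to a linear image. The stop values, gamma curve, and hat weighting are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

# Hypothetical stop values; the paper's bracket spacing may differ.
STOPS = (-2.0, 0.0, 2.0)

def to_brackets(linear, stops=STOPS, gamma=2.2):
    """Simulate gamma-encoded LDR exposure brackets from a scene-referred
    linear image: each bracket scales irradiance by 2**stop and clips,
    so each one captures a different slice of the dynamic range."""
    return [np.clip(linear * 2.0 ** s, 0.0, 1.0) ** (1.0 / gamma) for s in stops]

def merge_brackets(brackets, stops=STOPS, gamma=2.2, eps=1e-6):
    """Debevec-style weighted merge back to a linear estimate: undo the
    gamma and exposure gain per bracket, then average with a hat weight
    that downweights clipped (over- or underexposed) pixels."""
    num = np.zeros_like(brackets[0])
    den = np.zeros_like(brackets[0])
    for b, s in zip(brackets, stops):
        lin = b ** gamma / 2.0 ** s          # linearize, remove exposure gain
        w = 1.0 - np.abs(2.0 * b - 1.0)      # hat weight peaks at mid-tones
        num += w * lin
        den += w
    return num / (den + eps)
```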
If this is right
- Generated images remain scene-referred and invariant to sensor-specific factors, supporting professional editing workflows.
- Text prompts can directly control linear outputs for applications like guided editing.
- Structure-conditioned variants become feasible by attaching ControlNet to the bracket generator.
- The full dynamic range is available for downstream tone mapping and adjustments.
Where Pith is reading between the lines
- The bracket decomposition could generalize to other high-dynamic-range synthesis tasks such as video or 3D radiance fields.
- It might reduce reliance on custom loss terms by handling range compression in separate generation passes.
- Adaptive bracket counts based on scene contrast could be explored as a follow-on refinement.
Load-bearing premise
That pre-trained VAEs inherently fail to preserve extreme highlights and shadows in linear images, and that generating brackets and then recombining them avoids introducing new artifacts without added constraints.
What would settle it
A side-by-side comparison on high-contrast test scenes measuring whether recombined brackets retain more detail in clipped highlights and deep shadows than a single-pass latent diffusion model.
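One way such a comparison could be scored, as a sketch: linear-space PSNR restricted to highlight and shadow masks derived from the reference irradiance. The quantile thresholds are assumptions for illustration.

```python
import numpy as np

def masked_psnr(ref, est, mask, peak=None):
    """Linear-space PSNR restricted to a boolean region mask
    (e.g. clipped highlights or deep shadows)."""
    peak = ref.max() if peak is None else peak
    mse = np.mean((ref[mask] - est[mask]) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def highlight_shadow_scores(ref, est, hi_q=0.99, lo_q=0.01):
    """Score detail retention in the extremes: the masks are the top and
    bottom quantiles of the reference irradiance (illustrative thresholds)."""
    hi = ref >= np.quantile(ref, hi_q)
    lo = ref <= np.quantile(ref, lo_q)
    return masked_psnr(ref, est, hi), masked_psnr(ref, est, lo)
```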
Original abstract
The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that text-to-linear-image generation can be achieved by decomposing linear images into sequences of exposure brackets (each capturing a portion of the dynamic range) and training a DiT-based flow-matching model to generate these brackets from text prompts; the brackets are then merged to recover a high-dynamic-range scene-referred linear image. This is motivated by the limitations of pre-trained VAEs in standard latent diffusion models when handling extreme highlights and shadows, and the work further shows applications in text-guided editing and ControlNet-based structure-conditioned generation.
Significance. If the central claim holds with supporting evidence, the approach could meaningfully advance generative modeling for professional imaging workflows by producing editable, scene-referred linear images rather than stylized display-referred outputs. The bracket-synthesis strategy offers a potential route around VAE dynamic-range bottlenecks and may generalize to other high-bit-depth generation tasks.
Major comments (2)
- [Abstract] The central claim that bracket synthesis avoids VAE-induced artifacts in highlights and shadows is presented without quantitative results, ablation studies, or error analysis, leaving the empirical validity of the method unverified.
- [Method / Architecture] No explicit inter-bracket consistency mechanism (shared latent conditioning, consistency loss, or alignment step) is described. This is load-bearing: independently generated brackets can disagree in luminance or geometry, producing seams or ghosting when merged by a standard HDR pipeline (see the sketch after this list for the kind of term the comment asks about).
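For concreteness, the kind of consistency term this comment asks about might look like the following PyTorch sketch. The paper describes no such loss; the linearization and validity masking here are illustrative assumptions.

```python
import torch

def bracket_consistency_loss(brackets, stops, gamma=2.2):
    """Illustrative inter-bracket consistency term: adjacent brackets,
    once linearized and exposure-normalized, should agree wherever
    neither is clipped. brackets: (B, K, C, H, W) in [0, 1]; stops: K floats.
    """
    loss = brackets.new_zeros(())
    for k in range(brackets.shape[1] - 1):
        a = brackets[:, k] ** gamma / 2.0 ** stops[k]        # linearize bracket k
        b = brackets[:, k + 1] ** gamma / 2.0 ** stops[k + 1]
        # Trust only pixels unclipped in both brackets.
        valid = ((brackets[:, k] < 0.99) & (brackets[:, k + 1] > 0.01)).float()
        loss = loss + ((a - b) ** 2 * valid).sum() / valid.sum().clamp(min=1.0)
    return loss
```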
Minor comments (1)
- [Method] The manuscript would benefit from a clear diagram or pseudocode showing the exact bracket sequence representation, merging procedure, and conditioning flow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, providing clarifications based on the manuscript content and indicating where revisions will strengthen the presentation.
Point-by-point responses
- Referee: [Abstract] The central claim that bracket synthesis avoids VAE-induced artifacts in highlights and shadows is presented without quantitative results, ablation studies, or error analysis.
  Authors: The abstract is intended as a high-level summary and therefore omits numerical results. The manuscript body (Section 4) contains quantitative comparisons against VAE-based latent diffusion baselines, using metrics such as linear-space PSNR, HDR-VDP-2, and highlight/shadow error histograms, plus ablations on bracket count. We will revise the abstract to briefly reference these supporting results. Revision: partial.
- Referee: [Method / Architecture] No explicit inter-bracket consistency mechanism is described, which is load-bearing because independent generations can introduce luminance or geometric inconsistencies that produce seams or ghosting upon HDR merging.
  Authors: The DiT processes the full bracket sequence jointly in a single flow-matching trajectory, with shared text conditioning and per-bracket exposure embeddings that couple the generations. This joint modeling is described in Section 3.2 and empirically yields consistent merged outputs without visible seams. We will expand the architecture description to make the joint sequence generation explicit and add a supplementary consistency visualization. Revision: yes.
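A minimal sketch of what one joint flow-matching step over a bracket sequence could look like, assuming a rectified-flow velocity target and a hypothetical `model(xt, t, text_emb, exposure_emb)` interface; the paper's actual conditioning and schedule may differ.

```python
import torch

def flow_matching_step(model, x1, text_emb, stops):
    """One text-conditioned flow-matching training step over a bracket
    sequence. x1: (B, K, C, H, W) clean bracket latents. Uses a
    rectified-flow velocity target v = x1 - x0 along the straight path
    xt = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)                              # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                           # point on the path
    exposure_emb = torch.as_tensor(stops, device=x1.device).float()
    v_pred = model(xt, t.flatten(), text_emb, exposure_emb)
    return ((v_pred - (x1 - x0)) ** 2).mean()              # velocity regression
```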
Circularity Check
No circularity: architecture proposal is self-contained
Full rationale
The paper introduces a DiT-based flow-matching model to generate text-conditioned exposure brackets that are later merged into linear images. No equations, fitted parameters, or derivations are shown that reduce the claimed output to the inputs by construction. The method is presented as a new generative architecture rather than a re-expression of prior fitted quantities or self-cited uniqueness results. The central claim rests on the design choice and its empirical application, which remains independent of any self-referential loop.
Reference graph
Works this paper leans on
- [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [2] Mojtaba Bemana, Thomas Leimkühler, Karol Myszkowski, Hans-Peter Seidel, and Tobias Ritschel. Bracket Diffusion: HDR image generation by consistent LDR denoising. In Computer Graphics Forum, 2025.
- [3] Tim Brooks, Ben Mildenhall, Tianfan Xue, Jiawen Chen, Dillon Sharlet, and Jonathan T. Barron. Unprocessing images for learned raw denoising. In CVPR, 2019.
- [4] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR, 2011.
- [5] Haoming Cai, Tsung-Wei Huang, Shiv Gehlot, Brandon Y. Feng, Sachin Shah, Guan-Ming Su, and Christopher Metzler. Parametric shadow control for portrait generation in text-to-image diffusion models. In ICCV, 2025.
- [6] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In ECCV, 2024.
- [7] Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Text2Light: Zero-shot text-driven HDR panorama generation. ACM TOG, 41(6):1–16, 2022.
- [8] Marcos V. Conde, Radu Timofte, Yibin Huang, Jingyang Peng, Chang Chen, Cheng Li, Eduardo Pérez-Pellitero, Fenglong Song, Furui Bai, Shuai Liu, et al. Reversed image signal processing and raw reconstruction: AIM 2022 challenge report. In ECCVW, 2022.
- [9] Duc-Tien Dang-Nguyen, Cecilia Pasquini, Valentina Conotter, and Giulia Boato. RAISE: A raw images dataset for digital image forensics. In ACM Multimedia Systems, 2015.
- [10] Gabriel Eilertsen, Joel Kronander, Gyorgy Denes, Rafał K. Mantiuk, and Jonas Unger. HDR image reconstruction from a single exposure using deep CNNs. ACM TOG, 36(6):1–15, 2017.
- [11] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
- [12] Yuanshen Guan, Ruikang Xu, Yinuo Liao, Mingde Yao, Lizhi Wang, and Zhiwei Xiong. HDR image generation via gain map decomposed diffusion. In ICCV, 2025.
- [13] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. In ICLR, 2024.
- [14] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for video diffusion models. In ICLR, 2025.
- [15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
- [16] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 2022.
- [17] Haofeng Huang, Wenhan Yang, Yueyu Hu, Jiaying Liu, and Ling-Yu Duan. Towards low light enhancement with raw images. IEEE TIP, 31:1391–1405, 2022.
- [18] Eric Kee, Adam Pikielny, Kevin Blackburn-Matzen, and Marc Levoy. Removing reflections from raw photos. In CVPR, 2025.
- [19] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. FlowEdit: Inversion-free text-based editing using pre-trained flow models. In ICCV, 2025.
- [20] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.
- [21] Yu-Lun Liu, Wei-Sheng Lai, Yu-Sheng Chen, Yi-Lung Kao, Ming-Hsuan Yang, Yung-Yu Chuang, and Jia-Bin Huang. Single-image HDR reconstruction by learning to reverse the camera pipeline. In CVPR, 2020.
- [22] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022.
- [23] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
- [24] Seonghyeon Nam, Abhijith Punnappurath, Marcus A. Brubaker, and Michael S. Brown. Learning sRGB-to-raw-RGB de-rendering with content-aware metadata. In CVPR, 2022.
- [25] Rang M. H. Nguyen and Michael S. Brown. RAW image reconstruction using a self-contained sRGB-JPEG image with only 64 KB overhead. In CVPR, 2016.
- [26] Junji Otsuka, Masakazu Yoshimura, and Takeshi Ohashi. Self-supervised reversed image signal processing via reference-guided dynamic parameter selection. CoRR, 2023.
- [27] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
- [28] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
- [29] Abhijith Punnappurath and Michael S. Brown. Spatially aware metadata for raw reconstruction. In WACV, 2021.
- [30] Christoph Reinders, Radu Berdan, Beril Besbinar, Junji Otsuka, and Daisuke Iso. RAW-Diffusion: RGB-guided diffusion models for high-fidelity RAW image generation. In WACV, 2025.
- [31] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- [32] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
- [33] Qwen Team. Qwen2.5-VL, 2025.
- [34] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [35] Chao Wang, Ana Serrano, Xingang Pan, Bin Chen, Karol Myszkowski, Hans-Peter Seidel, Christian Theobalt, and Thomas Leimkühler. GlowGAN: Unsupervised learning of HDR images from LDR images in the wild. In ICCV, 2023.
- [36] Chao Wang, Zhihao Xia, Thomas Leimkühler, Karol Myszkowski, and Xuaner Zhang. LEDiff: Latent exposure diffusion for HDR generation. In CVPR, 2025.
- [37] Guangcong Wang, Yinuo Yang, Chen Change Loy, and Ziwei Liu. StyleLight: HDR panorama generation for lighting estimation and editing. In ECCV, 2022.
- [38] Tianfu Wang, Mingyang Xie, Haoming Cai, Sachin Shah, and Christopher A. Metzler. Flash-Split: 2D reflection removal with flash cues and latent diffusion separation. In CVPR, 2025.
- [39] Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, and Bihan Wen. Raw image reconstruction with learned compact metadata. In CVPR, 2023.
- [40] Yazhou Xing, Zian Qian, and Qifeng Chen. Invertible image signal processing. In CVPR, 2021.
- [41] Lu Yuan and Jian Sun. High quality image reconstruction from raw and JPEG image pair. In ICCV, 2011.
- [42] Yu Yuan, Xijun Wang, Yichen Sheng, Prateek Chennuri, Xingguang Zhang, and Stanley Chan. Generative photography: Scene-consistent camera control for realistic text-to-image synthesis. In CVPR, 2025.
- [43] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. CycleISP: Real image restoration via improved data synthesis. In CVPR, 2020.
- [44] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.