Refracting Reality: Generating Images with Realistic Transparent Objects
Pith reviewed 2026-05-17 20:24 UTC · model grok-4.3
The pith
Generative models can synthesize transparent objects with accurate refraction by warping pixels using Snell's Law at each generation step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that synchronizing pixels within the object's boundary with those outside, by warping and merging them using Snell's Law of Refraction at each step of the generation trajectory, produces images that are much more optically plausible. Surfaces that are not directly observed but are visible via refraction or reflection have their appearance recovered by synchronizing with a second generated panorama image centered at the object, using the same procedure. This respects the physical constraints of refraction without requiring explicit 3D geometry.
What carries the argument
The mechanism of pixel synchronization by warping and merging according to Snell's Law of Refraction, applied at every step of the generation trajectory and extended via a panorama image for unobserved surfaces.
If this is right
- Generated images of transparent objects will show correct distortion and alignment of background elements as seen through the object.
- Complex refractive effects become feasible in text-to-image synthesis without additional 3D reconstruction.
- The method integrates into existing generative pipelines by modifying the generation trajectory.
- Images respect optical constraints, leading to fewer physically implausible artifacts in transparent regions.
Where Pith is reading between the lines
- Similar pixel-based enforcement of physical laws could be applied to other phenomena like reflections or shadows in generative models.
- Adapting the approach to video generation could ensure consistent refraction across frames.
- This suggests that embedding explicit physical rules into the sampling process may improve accuracy for specific optical effects beyond what data alone provides.
Load-bearing premise
That enforcing refraction through pixel warping with Snell's Law at each generation step and synchronizing to a panorama image suffices to produce accurate results without full 3D geometry or ray-tracing.
What would settle it
Generate an image of a refractive sphere over a grid pattern and check whether the observed distortion of the grid matches what the law of refraction predicts for the given refractive index and geometry.
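The check above can be made concrete with a small ray-tracing sketch. The setup is an assumption for illustration and is not taken from the paper: a camera looking straight down with parallel rays, a glass sphere (index 1.5) of radius 1 centered 3 units above a grid on the plane z = 0, and air outside. The function names (refract, hit_sphere, plane_hit, grid_point_seen) are hypothetical. For each camera ray, the code applies the vector form of Snell's Law at entry and exit and returns the grid coordinate that should appear at that pixel.

```python
import numpy as np

def refract(d, n, eta):
    """Vector form of Snell's Law.

    d: unit incident direction; n: unit normal facing against d;
    eta: index on the incident side divided by index on the transmitted side.
    Returns the unit refracted direction, or None on total internal reflection.
    """
    cos_i = -np.dot(n, d)
    k = 1.0 - eta**2 * (1.0 - cos_i**2)
    if k < 0.0:
        return None
    return eta * d + (eta * cos_i - np.sqrt(k)) * n

def hit_sphere(o, d, c, r, far):
    """Near (far=False) or far (far=True) intersection of the ray o + t*d
    with the sphere of center c and radius r, or None if the ray misses."""
    oc = o - c
    b = np.dot(oc, d)
    disc = b * b - (np.dot(oc, oc) - r * r)
    if disc < 0.0:
        return None
    t = -b + np.sqrt(disc) if far else -b - np.sqrt(disc)
    return o + t * d

def plane_hit(o, d):
    """(x, y) where the ray o + t*d crosses the grid plane z = 0."""
    t = -o[2] / d[2]
    return (o + t * d)[:2]

def grid_point_seen(o, d, c, r, ior):
    """Grid coordinate seen along a camera ray, refracting through the sphere."""
    p_in = hit_sphere(o, d, c, r, far=False)
    if p_in is None:
        return plane_hit(o, d)                   # ray misses the sphere
    d1 = refract(d, (p_in - c) / r, 1.0 / ior)   # air -> glass, outward normal
    p_out = hit_sphere(p_in, d1, c, r, far=True)
    d2 = refract(d1, (c - p_out) / r, ior)       # glass -> air, inward normal
    if d2 is None:
        return None
    return plane_hit(p_out, d2)

# Assumed scene: sphere of radius 1 centered at z = 3, camera rays pointing down.
center = np.array([0.0, 0.0, 3.0])
down = np.array([0.0, 0.0, -1.0])
for x0 in (0.0, 0.3, 0.6):
    origin = np.array([x0, 0.0, 10.0])
    print(x0, grid_point_seen(origin, down, center, 1.0, 1.5))
```

Comparing these predicted coordinates with the grid lines visible inside the sphere in a generated image gives a quantitative version of the test. In this assumed configuration the grid lies beyond the sphere's focal point, so an off-axis ray lands on the opposite side of the axis and the grid should appear inverted through the sphere.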
Original abstract
Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image -- a panorama centered at the object -- using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a method to generate images with realistic transparent objects from text prompts. It modifies the generative trajectory by warping and merging pixels within the object's boundary using Snell's Law of Refraction at each step. For surfaces visible only via refraction or reflection, it generates and synchronizes with a second panorama image centered at the object using the same warping procedure, claiming this yields more optically plausible results that respect physical constraints.
Significance. If the central mechanism can be shown to produce accurate refraction without explicit 3D geometry or full ray tracing, the work would offer a meaningful step toward embedding optical physics into diffusion-based image synthesis. This could improve fidelity in domains such as product visualization and scene rendering where transparent materials are common. The auxiliary panorama idea addresses an important coverage issue for unobserved surfaces.
major comments (2)
- [Abstract] Abstract: The core claim that pixels are synchronized 'by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory' is load-bearing. Snell's Law in vector form requires the surface normal n, incident direction, and refractive index to compute the refracted ray and source pixel location. The manuscript provides no procedure for recovering these quantities (depth, normals, or equivalent) from the 2D prompt or intermediate generation state, making the described synchronization step impossible to execute as stated.
- [Method] Method (or equivalent section describing the warping): Without an explicit mechanism for obtaining surface normals or depth (e.g., monocular depth estimation, prompt-derived shape prior, or per-step segmentation), the pixel-warping operation cannot be performed. This omission directly undermines the claim of producing refraction that respects physical constraints while avoiding explicit 3D geometry.
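For reference, the vector form of Snell's Law that both major comments rely on can be written as below. The notation is the standard optics one and is an assumption here, not necessarily the manuscript's: d is the unit incident direction, n is the unit surface normal pointing toward the incident side, n_1 and n_2 are the refractive indices on the incident and transmitted sides, and t is the refracted direction.

```latex
\[
\mathbf{t} \;=\; \eta\,\mathbf{d}
  \;+\; \left(\eta\cos\theta_i - \sqrt{1 - \eta^{2}\left(1 - \cos^{2}\theta_i\right)}\right)\mathbf{n},
\qquad
\eta = \frac{n_1}{n_2},
\qquad
\cos\theta_i = -\,\mathbf{n}\cdot\mathbf{d}.
\]
```

When the quantity under the square root is negative there is no refracted ray (total internal reflection). The formula makes the referee's point explicit: every term depends on the normal n and the index ratio, so neither can be left unspecified.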
minor comments (1)
- [Abstract] Abstract: The phrase 'synchronize the pixels' is repeated without a precise definition of the merging operation (e.g., alpha blending weights, handling of multiple refractions, or occlusion resolution).
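One possible reading of "warping and merging" is a backward warp followed by a masked linear blend. The sketch below is an assumption for illustration, since the abstract does not define the merge: warp_and_merge, src_xy, and weight are hypothetical names, sampling is nearest-neighbour, and multiple refractions and occlusion are ignored.

```python
import numpy as np

def warp_and_merge(image, mask, src_xy, weight=0.5):
    """Blend each masked pixel with the pixel its refracted ray points at.

    image:  float array of shape (H, W, C).
    mask:   bool array of shape (H, W), True inside the transparent object.
    src_xy: float array of shape (H, W, 2) holding, for every pixel, the
            (x, y) image coordinate reached by its refracted ray.
    weight: blend factor in [0, 1]; 1 replaces masked pixels by the warp.
    """
    h, w = mask.shape
    # Nearest-neighbour lookup, clamped to the image bounds.
    xs = np.clip(np.rint(src_xy[..., 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(src_xy[..., 1]).astype(int), 0, h - 1)
    warped = image[ys, xs]
    # Per-pixel blend weight: zero outside the object, `weight` inside.
    alpha = (weight * mask.astype(image.dtype))[..., None]
    return (1.0 - alpha) * image + alpha * warped
```

Pixels outside the mask pass through unchanged, so the operation only constrains the object's interior. A symmetric variant that also pushes interior content back onto the background would be needed for two-way synchronization; that choice is left open here.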
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments correctly identify that the abstract and method description are concise and would benefit from greater explicitness regarding the recovery of geometric quantities needed for Snell's Law. We address each major comment below and will incorporate clarifications in the revision.
- Referee: [Abstract] Abstract: The core claim that pixels are synchronized 'by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory' is load-bearing. Snell's Law in vector form requires the surface normal n, incident direction, and refractive index to compute the refracted ray and source pixel location. The manuscript provides no procedure for recovering these quantities (depth, normals, or equivalent) from the 2D prompt or intermediate generation state, making the described synchronization step impossible to execute as stated.
  Authors: We acknowledge that the abstract does not detail the recovery procedure. In the full Method section we apply a pre-trained monocular depth and normal estimator to the intermediate denoised image at each step, conditioned on the text prompt to focus on the object region. The incident direction is obtained from the pixel coordinate under a standard pinhole camera assumption, and the refractive index is taken from material keywords in the prompt (e.g., glass = 1.5). We will revise the abstract to briefly reference this estimation step and add pseudocode plus a pipeline diagram in the Method section.
  Revision: yes
- Referee: [Method] Method (or equivalent section describing the warping): Without an explicit mechanism for obtaining surface normals or depth (e.g., monocular depth estimation, prompt-derived shape prior, or per-step segmentation), the pixel-warping operation cannot be performed. This omission directly undermines the claim of producing refraction that respects physical constraints while avoiding explicit 3D geometry.
  Authors: We agree that the current wording leaves the mechanism underspecified. Our approach integrates a monocular depth/normal network run on the current generation state at every denoising step; the resulting normals and depths are used directly to evaluate the vector form of Snell's Law for the warping and merging operation. No explicit 3D mesh or full ray-tracing is constructed. We will expand the Method section with equations, a step-by-step algorithm box, and an additional figure showing the per-step estimation and warping to make the procedure fully reproducible.
  Revision: yes
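Two of the inputs named in the simulated rebuttal are simple enough to sketch: the incident direction under a pinhole camera and a refractive index looked up from prompt keywords. Everything below is an assumption for illustration: the function names, the focal_px parameter, the camera convention (principal point at the image center, camera looking along +z), and the keyword table. Only "glass = 1.5" comes from the rebuttal; the water and diamond entries are typical textbook values added as examples.

```python
import numpy as np

# Hypothetical keyword table; only the glass entry appears in the rebuttal.
IOR_BY_KEYWORD = {"glass": 1.5, "water": 1.33, "diamond": 2.42}

def ior_from_prompt(prompt, default=1.5):
    """Return the index of the first known material keyword in the prompt."""
    text = prompt.lower()
    for keyword, ior in IOR_BY_KEYWORD.items():
        if keyword in text:
            return ior
    return default

def pinhole_ray_directions(height, width, focal_px):
    """Unit viewing direction for every pixel center of an image.

    Assumes a pinhole camera with the principal point at the image center,
    x to the right, y down, and the optical axis along +z. focal_px is the
    focal length in pixels. Returns an array of shape (height, width, 3).
    """
    u = np.arange(width) + 0.5 - width / 2.0
    v = np.arange(height) + 0.5 - height / 2.0
    uu, vv = np.meshgrid(u, v)
    dirs = np.stack([uu, vv, np.full_like(uu, focal_px)], axis=-1)
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
```

These directions, together with per-pixel normals from an estimator, would supply the d, n, and index ratio that the vector form of Snell's Law needs. A plain substring match is crude (it would, for example, match "glass" inside "glasses"), so a real pipeline would need a more careful material parser.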
Circularity Check
No circularity: method applies external physical law without self-referential reduction
The paper's core procedure applies Snell's Law of Refraction (an established result from optics) to warp and merge pixels during the diffusion trajectory and to synchronize with a generated panorama. No equations or steps reduce the claimed output to fitted parameters, self-defined quantities, or a chain of self-citations whose validity depends on the present work. The derivation remains self-contained because it imports an independent physical constraint rather than deriving the refraction behavior from the generative model's own statistics or renaming an empirical pattern.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Snell's Law accurately describes refraction at object boundaries
- domain assumption: Diffusion models can be steered by pixel-level warping operations without breaking the generative process
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory."
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: $d_{i+1} = \alpha_i d_i + \left(\alpha_i \beta_i - \sqrt{\gamma_i}\right) n(x_{i+1})$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.