pith. machine review for the scientific record.

arxiv: 2604.07966 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: unknown

Lighting-grounded Video Generation with Renderer-based Agent Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video generation · diffusion models · 3D scene control · controllable synthesis · lighting control · camera trajectory · scene agent · photorealism

The pith

LiVER conditions video diffusion models on renderer outputs from unified 3D scenes to deliver disentangled control over layout, lighting, and camera.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix the entanglement of key scene elements in current video diffusion models by grounding generation in explicit 3D renders. It builds a new dataset with dense layout, lighting, and camera annotations, then renders control signals from a single 3D representation. A lightweight conditioning module and progressive training strategy integrate those signals into a base video model. A scene agent converts high-level instructions into the required 3D parameters. If the approach holds, users gain precise editing of individual factors in image-to-video and video-to-video tasks without sacrificing visual quality or motion coherence.

Core claim

LiVER is a diffusion-based framework that renders explicit 3D scene properties from a unified representation and feeds them as conditioning signals into a foundational video diffusion model through a lightweight module and progressive training. This produces videos with state-of-the-art photorealism and temporal consistency while allowing independent editing of object layout, lighting, and camera trajectory. The method is supported by a new large-scale dataset of annotated 3D scenes and includes a scene agent that translates natural language instructions into the 3D control signals needed for synthesis.

What carries the argument

Renderer outputs from a unified 3D scene representation that supply disentangled control signals for layout, lighting, and camera to the video diffusion model via a lightweight conditioning module.
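
The page above does not specify how the rendered signals enter the backbone, so the module below is only a minimal, hypothetical sketch under explicit assumptions: the control maps (layout, lighting, camera encodings) are stacked per frame as extra channels, passed through a small zero-initialized convolutional encoder, and added to the backbone's latent features. None of these choices are documented LiVER details.

```python
# Hypothetical sketch of a "lightweight conditioning module" (not LiVER's actual design).
# Assumption: rendered control maps are resized to the latent resolution and injected
# additively into a frozen video diffusion backbone's features.
import torch
import torch.nn as nn

class ControlAdapter(nn.Module):
    def __init__(self, control_channels: int, latent_channels: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(control_channels, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, padding=1),
        )
        # Zero-init the last layer so conditioning starts as a no-op and is learned gradually.
        nn.init.zeros_(self.encoder[-1].weight)
        nn.init.zeros_(self.encoder[-1].bias)

    def forward(self, latents: torch.Tensor, controls: torch.Tensor) -> torch.Tensor:
        # latents:  (B, C_latent, T, H, W) backbone features
        # controls: (B, C_control, T, H, W) rendered layout/lighting/camera maps
        return latents + self.encoder(controls)

adapter = ControlAdapter(control_channels=9, latent_channels=16)
out = adapter(torch.randn(1, 16, 8, 32, 32), torch.randn(1, 9, 8, 32, 32))
print(out.shape)  # torch.Size([1, 16, 8, 32, 32])
```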

If this is right

  • Image-to-video and video-to-video synthesis become fully editable at the level of individual scene factors.
  • High-level user instructions can be automatically converted into precise 3D control signals by the scene agent (a schematic sketch of such an interface follows this list).
  • Generated videos maintain higher photorealism and frame-to-frame consistency than prior controllable diffusion approaches.
  • Filmmaking and virtual production workflows gain direct access to layout, lighting, and camera adjustments inside the generation process.
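
To make the scene-agent bullet concrete, here is a purely hypothetical schema for the structured parameters such an agent might emit before rendering; the field names, units, and example values are invented for illustration and are not taken from the paper.

```python
# Hypothetical output schema for a scene agent (illustrative only; not LiVER's interface).
from dataclasses import dataclass, field

@dataclass
class ObjectPlacement:
    category: str                          # e.g. "sofa"
    position: tuple                        # (x, y, z) in scene units (assumed)
    scale: float = 1.0

@dataclass
class SceneSpec:
    objects: list = field(default_factory=list)
    hdr_environment: str = "studio.hdr"    # assumed identifier for an HDR environment map
    camera_trajectory: list = field(default_factory=list)  # per-frame camera poses (assumed)

# "a sofa on the left, slow dolly-in, warm evening light"  ->  structured controls
spec = SceneSpec(
    objects=[ObjectPlacement("sofa", (-1.0, 0.0, 2.5))],
    hdr_environment="evening_warm.hdr",
    camera_trajectory=[(0.0, 0.0, -0.1 * i, 0.0, 0.0, 0.0) for i in range(3)],
)
print(spec)
```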

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same renderer-grounding pattern could be tested on longer video sequences to check whether 3D coherence reduces drift over time.
  • Extending the unified 3D representation to include material properties might allow joint control of appearance and geometry.
  • The scene agent could be evaluated on tasks outside video, such as generating editable 3D scenes from text for simulation environments.

Load-bearing premise

Rendered 3D control signals can be added to a video diffusion model through a lightweight module and progressive training without creating new entanglements or reducing image quality.
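
The progressive training strategy is named but not detailed in the text above. Below is a toy sketch of one plausible staged schedule, in which control signals are enabled one at a time so the frozen backbone adapts gradually; the stage boundaries and ordering are assumptions, not the authors' schedule.

```python
# Toy progressive-conditioning schedule (assumed stages; not the paper's actual strategy).
def active_controls(step: int) -> list:
    stages = [
        (0,      ["layout"]),                         # stage 1: layout proxy only
        (10_000, ["layout", "camera"]),               # stage 2: add camera trajectory
        (20_000, ["layout", "camera", "lighting"]),   # stage 3: full lighting-grounded conditioning
    ]
    current = stages[0][1]
    for start, controls in stages:
        if step >= start:
            current = controls
    return current

for step in (0, 12_000, 25_000):
    print(step, active_controls(step))
```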

What would settle it

Demonstrations where altering only the lighting parameter visibly changes object positions or shapes, or where LiVER videos score lower on photorealism or temporal-consistency metrics than the unconditioned base diffusion model, would refute the central claim.
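
One crude way to operationalize the first half of that test: generate paired clips that differ only in the lighting condition and check whether foreground geometry stays put. The sketch below assumes object masks are supplied by an external tracker or segmenter and uses centroid drift as a stand-in for a proper disentanglement metric.

```python
# Rough sketch of a lighting-edit disentanglement check (assumes precomputed object masks).
import numpy as np

def centroid_drift(masks_a, masks_b):
    """masks_*: (T, H, W) boolean masks of the same object in two clips differing only in lighting."""
    drifts = []
    for ma, mb in zip(masks_a, masks_b):
        ya, xa = np.argwhere(ma).mean(axis=0)   # (row, col) centroid in clip A
        yb, xb = np.argwhere(mb).mean(axis=0)   # (row, col) centroid in clip B
        drifts.append(np.hypot(ya - yb, xa - xb))
    return float(np.mean(drifts))               # ~0 means the lighting edit left layout intact

# Toy example with identical masks: drift is 0, i.e. geometry unchanged.
m = np.zeros((4, 64, 64), dtype=bool)
m[:, 20:40, 20:40] = True
print(centroid_drift(m, m))  # 0.0
```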

Figures

Figures reproduced from arXiv: 2604.07966 by Boxin Shi, Han Jiang, Shuchen Weng, Si Li, Taoyu Yang, Zheng Chang, Ziqi Cai.

Figure 1. Overall framework. (1) A renderer-based agent produces a coarse geometric layout, camera trajectory, and a High Dynamic …
Figure 2. Our data annotation pipeline for LiVER-Real. We process each video to reconstruct its 3D geometry and estimate its HDR environment map. These are then used to render three pixel-aligned lighting representations (Diffuse, Glossy GGX, Rough GGX), which are concatenated to form the final conditioning input. …
Figure 3. Pipeline of LiVER. Given a text prompt T, our Scene Agent parses object categories, spatial relations, and coarse geometry to construct an initial 3D scene. The Camera Agent infers a camera trajectory consistent with the described viewpoint and scene semantics, producing the camera condition C. The 3D scene is then rendered through a physically-based renderer to obtain the lighting-grounded scene proxy …
Figure 4. Qualitative comparison with state-of-the-art controllable video generation models. In each block, each row corresponds to one …
Figure 5. By manipulating the HDR environment map, our model produces continuous and physically consistent lighting variations. …
Figure 6. Qualitative results of our ablation study.
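
Figure 2's caption implies the conditioning input is assembled by concatenating three pixel-aligned lighting renders along the channel axis. A minimal sketch of that assembly step, with assumed shapes and channel order, is:

```python
# Assembling a lighting-grounded conditioning tensor (shapes and channel order are assumptions).
import numpy as np

T, H, W = 16, 256, 256
diffuse    = np.random.rand(T, H, W, 3).astype(np.float32)   # placeholder Diffuse render
glossy_ggx = np.random.rand(T, H, W, 3).astype(np.float32)   # placeholder Glossy GGX render
rough_ggx  = np.random.rand(T, H, W, 3).astype(np.float32)   # placeholder Rough GGX render

conditioning = np.concatenate([diffuse, glossy_ggx, rough_ggx], axis=-1)
print(conditioning.shape)  # (16, 256, 256, 9) per-frame, pixel-aligned conditioning input
```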
Original abstract

Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LiVER, a diffusion-based framework for scene-controllable video generation. It conditions synthesis on explicit 3D scene properties (layout, lighting, camera trajectory) rendered from a unified 3D representation, supported by a new large-scale annotated dataset. The method uses a lightweight conditioning module and progressive training strategy to integrate signals into a foundational video diffusion model, plus a scene agent that translates high-level instructions into 3D controls. It claims SOTA photorealism and temporal consistency with precise, disentangled control, enabling editable image-to-video and video-to-video synthesis.

Significance. If validated, the renderer-grounded conditioning approach combined with the scene agent would represent a meaningful advance in controllable video generation, addressing entanglement issues in diffusion models for practical domains like virtual production. The new dataset with dense 3D annotations is a concrete contribution that could support future work.

major comments (2)
  1. [Method] Method section: the claim that the lightweight conditioning module and progressive training strategy achieve 'precise, disentangled control' without entanglement or fidelity loss lacks any architectural specification of signal injection (cross-attention, concatenation, or adapter), auxiliary losses for factor independence, or quantitative disentanglement metrics such as control accuracy when one factor is varied while others are fixed.
  2. [Experiments] Experiments section: the assertion of state-of-the-art photorealism and temporal consistency is stated without any reported quantitative metrics (FID, FVD, etc.), baselines, ablation studies on the conditioning module or training strategy, or error analysis, making it impossible to evaluate whether the data support the central claims.
minor comments (1)
  1. The abstract and method description would benefit from a diagram illustrating the signal flow from 3D renderer through the conditioning module to the diffusion model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight important areas where additional detail and quantitative support will strengthen the manuscript. We address each major comment below and commit to incorporating the requested elements in the revised version.

Point-by-point responses
  1. Referee: [Method] Method section: the claim that the lightweight conditioning module and progressive training strategy achieve 'precise, disentangled control' without entanglement or fidelity loss lacks any architectural specification of signal injection (cross-attention, concatenation, or adapter), auxiliary losses for factor independence, or quantitative disentanglement metrics such as control accuracy when one factor is varied while others are fixed.

    Authors: We agree that the current description of the lightweight conditioning module and progressive training strategy is at a high level and does not include the requested low-level specifications or quantitative disentanglement metrics. In the revised manuscript, we will expand the Method section with: (1) a detailed architectural specification of the signal injection mechanism (including whether cross-attention, concatenation, or an adapter is employed), (2) any auxiliary losses used to promote independence across factors, and (3) quantitative disentanglement metrics, such as control accuracy measured while varying one factor (e.g., lighting) while holding others fixed. These additions will directly substantiate the claims of precise, disentangled control. revision: yes

  2. Referee: [Experiments] Experiments section: the assertion of state-of-the-art photorealism and temporal consistency is stated without any reported quantitative metrics (FID, FVD, etc.), baselines, ablation studies on the conditioning module or training strategy, or error analysis, making it impossible to evaluate whether the data support the central claims.

    Authors: We acknowledge that quantitative metrics are essential for rigorously supporting the claims of state-of-the-art photorealism and temporal consistency. The current manuscript relies primarily on qualitative visual results and comparisons, but does not report FID, FVD, baselines, ablations, or error analysis. In the revised version, we will add a quantitative evaluation subsection to the Experiments section that includes FID and FVD scores, comparisons against relevant baselines, ablation studies on the conditioning module and progressive training strategy, and error analysis to provide a complete empirical validation of the central claims. revision: yes
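
For concreteness, FID and FVD both reduce to the Fréchet distance between Gaussian fits of real and generated feature statistics; the sketch below shows only that final computation and assumes features have already been extracted with a pretrained network (Inception features for FID, an I3D-style video model for FVD), which is a standard choice rather than anything specified by the authors.

```python
# Fréchet distance between two feature sets (the core of FID/FVD; feature extractor not shown).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):                 # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Toy check with random features; real use needs thousands of clips and a pretrained extractor.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(512, 64)), rng.normal(size=(512, 64))))
```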

Circularity Check

0 steps flagged

No circularity; framework and claims are self-contained

Full rationale

The paper introduces a new framework (LiVER), dataset with dense 3D annotations, lightweight conditioning module, progressive training, and scene agent. These are presented as novel constructions grounded in external 3D rendering rather than derived from or equivalent to the model's own outputs or fitted parameters. No equations, self-definitional loops, or load-bearing self-citations appear in the provided text; the central claims of disentangled control and SOTA performance rest on experimental validation and the explicit rendering step, which is independent of the diffusion model itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that 3D-rendered signals can be integrated into diffusion models for disentangled control; no free parameters or invented physical entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Diffusion models can be conditioned on explicit rendered 3D signals to achieve disentangled control over layout, lighting, and camera without loss of photorealism or temporal consistency.
    Invoked as the basis for the lightweight conditioning module and progressive training strategy.
invented entities (1)
  • scene agent: no independent evidence
    purpose: Automatically translates high-level user instructions into required 3D control signals.
    New component added for usability; no independent evidence provided beyond the framework description.

pith-pipeline@v0.9.0 · 5534 in / 1272 out tokens · 80843 ms · 2026-05-10T17:28:38.846814+00:00 · methodology

discussion (0)

