Holo-World: Unified Camera, Object and Weather Control for Video World Model
Pith reviewed 2026-06-26 18:03 UTC · model grok-4.3
The pith
A video world model follows explicit camera, object and weather controls from a single starting image while preserving scene structure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The model jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition.
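The idea of guiding scene and weather residuals separately can be illustrated with a small sketch. The paper's exact formulation is not given here, so the code assumes a common compositional form of classifier-free guidance: three noise predictions (unconditional, scene-only, and scene plus weather) combined with two independent scales. The function names, the three-prediction setup, and the scale names `w_scene` and `w_weather` are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of a scene/weather decomposed guidance rule.
# Assumption: the model is evaluated three times per step, yielding
#   eps_uncond : prediction with no conditioning
#   eps_scene  : prediction with scene controls only
#   eps_full   : prediction with scene controls plus the weather instruction
# Predictions are plain lists of floats to keep the example dependency-free.

def standard_cfg(eps_uncond, eps_full, w):
    """Single-scale CFG: amplifies the whole condition by one weight."""
    return [u + w * (f - u) for u, f in zip(eps_uncond, eps_full)]


def decomposed_cfg(eps_uncond, eps_scene, eps_full, w_scene, w_weather):
    """Two-scale CFG: scene and weather residuals get separate weights."""
    guided = []
    for u, s, f in zip(eps_uncond, eps_scene, eps_full):
        scene_residual = s - u      # what the scene controls contribute
        weather_residual = f - s    # what the weather instruction adds on top
        guided.append(u + w_scene * scene_residual + w_weather * weather_residual)
    return guided


if __name__ == "__main__":
    eps_uncond = [0.0, 1.0]
    eps_scene = [1.0, 2.0]
    eps_full = [3.0, 2.5]
    # Stronger weather guidance without raising the scene weight.
    print(decomposed_cfg(eps_uncond, eps_scene, eps_full, w_scene=1.0, w_weather=3.0))
```

Under this assumed form, setting both scales to the same value recovers single-scale guidance, so the decomposition only changes behavior when the weather scale is raised independently of the scene scale.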
What carries the argument
The Unified Scene Adapter, which factorizes world preservation and weather transfer into distinct parameter subspaces using rendered background, geometry buffers, and object controls.
Load-bearing premise
The adapter can factorize world preservation and weather transfer into distinct parameter subspaces without introducing structural artifacts or losing control precision.
What would settle it
Observe whether videos generated with changed weather still follow the exact input camera trajectories and object positions without geometry distortions.
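One way to make this check concrete is to compare the commanded camera path with a path recovered from the generated video. The sketch below assumes the recovered positions come from some external pose estimator and are already aligned to the same coordinate frame and scale as the commanded ones; neither the estimator nor the alignment step is described in the source. The function name `mean_position_error` is hypothetical.

```python
import math

# Hypothetical trajectory-adherence check.
# Assumption: `commanded` and `recovered` are per-frame camera positions
# (x, y, z) in a shared, already-aligned coordinate frame.

def mean_position_error(commanded, recovered):
    """Average Euclidean distance between commanded and recovered positions."""
    if len(commanded) != len(recovered):
        raise ValueError("trajectories must have the same number of frames")
    if not commanded:
        raise ValueError("trajectories must not be empty")
    total = sum(math.dist(c, r) for c, r in zip(commanded, recovered))
    return total / len(commanded)


if __name__ == "__main__":
    commanded = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    recovered = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(mean_position_error(commanded, recovered))
```

Comparing this error between weather-preserving and weather-transfer generations from the same controls would indicate whether changing the weather degrades camera adherence. Rotation error and object-position error would need analogous measures.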
Original abstract
Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object controls with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Holo-World, a video world model that unifies camera, object, and weather control from a single first-frame image and explicit control inputs. It constructs the HoloStateData dataset and proposes a Unified Scene Adapter that factorizes world preservation and weather transfer into distinct parameter subspaces using rendered background, geometry buffers, and object controls, along with Scene-Weather Decomposed CFG, which guides scene and weather residuals separately. The authors report quantitative and qualitative experiments in which the model maintains precise camera and object control with consistent scene structure while transferring scenes to target weather states, outperforming video-to-video weather editing baselines.
Significance. If the central claims hold after addressing the evaluation protocol, this would be a notable contribution to controllable video generation by unifying multiple control modalities in a first-frame-anchored setting. The creation of HoloStateData provides a new resource for training models with joint camera, object, and weather supervision, which is a strength. The factorization approach could inspire similar decompositions in other generative models.
major comments (1)
- [Abstract] The claim of outperforming video-to-video weather editing baselines on weather-state generation (Abstract) is load-bearing for the central contribution. However, Holo-World operates in a single first-frame image plus explicit controls regime, while the baselines receive the full source video. The manuscript provides no indication that baselines were adapted to this setting or that the additional video input to baselines does not provide an unfair advantage in structure preservation. This must be addressed in the Experiments section to validate the outperformance.
minor comments (1)
- [Abstract] The abstract references 'quantitative and qualitative experiments' demonstrating outperformance but does not include any specific metrics, error bars, or dataset statistics, which would help readers assess the strength of the results immediately.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential contribution of Holo-World and the HoloStateData dataset. We address the single major comment below and will revise the manuscript to strengthen the evaluation protocol.
- Referee: [Abstract] The claim of outperforming video-to-video weather editing baselines on weather-state generation (Abstract) is load-bearing for the central contribution. However, Holo-World operates in a single first-frame image plus explicit controls regime, while the baselines receive the full source video. The manuscript provides no indication that baselines were adapted to this setting or that the additional video input to baselines does not provide an unfair advantage in structure preservation. This must be addressed in the Experiments section to validate the outperformance.
Authors: We agree that this comparison requires clarification to ensure fairness. The video-to-video baselines receive the full source video, which supplies additional structural information not available to Holo-World. In the revised manuscript, we will adapt the baselines to the single first-frame setting by providing only the initial frame plus the same explicit camera, object, and weather controls. We will update the Experiments section with these revised quantitative results (including metrics on weather transfer and structure preservation) and will explicitly describe the adaptation procedure so that the outperformance claim is validated under equivalent input conditions.
revision: yes
Circularity Check
No significant circularity in derivation or claims
The paper introduces a new dataset (HoloStateData) and model architecture (Holo-World with Unified Scene Adapter and Scene-Weather Decomposed CFG) to address a first-frame-anchored controllable video generation task. No equations, fitted parameters, or self-citations are presented that reduce the claimed performance or factorization to a tautology or prior result by construction. The central claims rest on new components and experimental comparisons rather than re-deriving inputs, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (3)
- HoloStateData: no independent evidence
- Unified Scene Adapter: no independent evidence
- Scene-Weather Decomposed CFG: no independent evidence