Holo-World: Unified Camera, Object and Weather Control for Video World Model

Chunfeng Wang; Dachun Kai; Jiahui Yuan; Wei Li; Wenzhang Sun; Xiangchen Yin; Xiaoyan Sun; Yinda Chen; Zijie Liu

arxiv: 2606.20083 · v3 · pith:CSPGQAWFnew · submitted 2026-06-18 · 💻 cs.CV


Pith reviewed 2026-06-26 18:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video world model · camera control · object control · weather transfer · scene consistency · controllable generation · unified adapter · video generation


The pith

A video world model follows explicit camera, object and weather controls from a single starting image while preserving scene structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that generates a video from a single image while following user-specified camera movements, object positions over time, and a chosen weather condition such as rain or fog. The model uses an adapter to keep the underlying scene fixed while changing only weather-related appearance and effects. If it works, one scene could be viewed from moving cameras, with moving objects, under different weather, without rebuilding the scene each time. The controls are separated so that changing the weather does not disturb geometry or motion paths.

Core claim

The model jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition.
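One way to picture the decomposed guidance is a two-weight variant of classifier-free guidance in which the scene residual and the weather residual are scaled independently. The sketch below is an assumption, not the paper's formula: the function name `sw_cfg`, the three-prediction layout (unconditional, scene-only, scene plus weather), and the weights `w_scene` and `w_weather` are all hypothetical.

```python
from typing import List, Sequence


def sw_cfg(
    eps_uncond: Sequence[float],
    eps_scene: Sequence[float],
    eps_full: Sequence[float],
    w_scene: float,
    w_weather: float,
) -> List[float]:
    """Combine three denoiser predictions with separate guidance weights.

    Hypothetical layout (not taken from the paper):
      eps_uncond: prediction with no conditions
      eps_scene:  prediction with scene controls only
      eps_full:   prediction with scene controls plus the weather instruction
    """
    if not (len(eps_uncond) == len(eps_scene) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    guided = []
    for u, s, f in zip(eps_uncond, eps_scene, eps_full):
        scene_residual = s - u    # what the scene controls add
        weather_residual = f - s  # what the weather instruction adds on top
        guided.append(u + w_scene * scene_residual + w_weather * weather_residual)
    return guided


def vanilla_cfg(
    eps_uncond: Sequence[float], eps_full: Sequence[float], w: float
) -> List[float]:
    """Standard CFG: a single weight scales the whole conditional residual."""
    return [u + w * (f - u) for u, f in zip(eps_uncond, eps_full)]
```

Under this reading, setting `w_scene == w_weather` collapses to vanilla CFG, while raising only `w_weather` amplifies the weather residual and leaves the scene residual at its own weight. That is one plausible interpretation of "strengthening target weather effects without over-amplifying the full condition".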

What carries the argument

Unified Scene Adapter that factorizes world preservation and weather transfer into distinct parameter subspaces using rendered background, geometry buffers, and object controls

Load-bearing premise

The adapter can factorize world preservation and weather transfer into distinct parameter subspaces without introducing structural artifacts or losing control precision.
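To make the premise concrete, here is a toy two-path residual adapter in which scene controls and the weather instruction each have their own weights. It is a sketch under stated assumptions only: the class name `TwoPathAdapter`, the linear paths, and the rule that the weather path is skipped when no weather embedding is given are all hypothetical and are not the paper's architecture.

```python
from typing import List, Optional, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Sequence[float]) -> Vector:
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class TwoPathAdapter:
    """Toy adapter with disjoint parameters for scene and weather (hypothetical)."""

    def __init__(self, scene_weights: Matrix, weather_weights: Matrix) -> None:
        self.scene_weights = scene_weights      # used by every sample
        self.weather_weights = weather_weights  # used only when weather is requested

    def forward(
        self,
        hidden: Sequence[float],
        scene_controls: Sequence[float],
        weather_embedding: Optional[Sequence[float]] = None,
    ) -> Vector:
        # World-preservation path: always active.
        scene_residual = matvec(self.scene_weights, scene_controls)
        out = [h + r for h, r in zip(hidden, scene_residual)]
        # Weather-transfer path: skipped for preservation-only samples,
        # so weather parameters cannot affect them.
        if weather_embedding is not None:
            weather_residual = matvec(self.weather_weights, weather_embedding)
            out = [o + r for o, r in zip(out, weather_residual)]
        return out
```

In this toy, the premise becomes a checkable property: with no weather embedding, the output is identical whatever the weather weights are. Whether the real adapter achieves comparable isolation is exactly what the premise assumes.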

What would settle it

Observe whether videos generated with changed weather still follow the exact input camera trajectories and object positions without geometry distortions.
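A check of this kind could compare camera poses estimated from the generated video against the input trajectory. The sketch below uses the standard geodesic angle between rotation matrices and the Euclidean distance between translations, averaged over frames. The pose format (a list of `(R, t)` pairs) and the function names are assumptions, and the RotErr and TransErr definitions used in the paper may differ, for example by summing over frames or aligning scale first.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Pose = Tuple[Matrix, Sequence[float]]  # (3x3 rotation, 3-vector translation)


def rotation_error_deg(r_gt: Matrix, r_pred: Matrix) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_gt^T R_pred) equals the sum of elementwise products.
    trace = sum(r_gt[i][j] * r_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_gt: Sequence[float], t_pred: Sequence[float]) -> float:
    """Euclidean distance between two camera positions."""
    return math.dist(t_gt, t_pred)


def trajectory_errors(
    gt_poses: Sequence[Pose], pred_poses: Sequence[Pose]
) -> Tuple[float, float]:
    """Mean per-frame rotation error (degrees) and translation error."""
    if not gt_poses or len(gt_poses) != len(pred_poses):
        raise ValueError("trajectories must be non-empty and equally long")
    rot = [rotation_error_deg(rg, rp) for (rg, _), (rp, _) in zip(gt_poses, pred_poses)]
    trans = [translation_error(tg, tp) for (_, tg), (_, tp) in zip(gt_poses, pred_poses)]
    return sum(rot) / len(rot), sum(trans) / len(trans)
```

Running the same comparison on a preservation output and a weather-transfer output for the same controls would show whether changing weather shifts the recovered trajectory.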

Figures

Figures reproduced from arXiv: 2606.20083 by Chunfeng Wang, Dachun Kai, Jiahui Yuan, Wei Li, Wenzhang Sun, Xiangchen Yin, Xiaoyan Sun, Yinda Chen, Zijie Liu.

**Figure 1.** Unified state control in Holo-World. Holo-World jointly controls camera motion, object dynamics, and weather state within the same observed world.

**Figure 2.** HoloStateData construction pipeline. These annotations are converted into source-side rendered controls and object controls, while paired target-weather videos provide supervision for weather-state transfer.

… weather text, object masks, camera motion, and dense geometry. Scene construction renders source-side background RGB, depth, and normal controls, converts object masks into object controls, and associa…

**Figure 3.** Overview of Holo-World. Given a first frame and factorized source-to-state controls, Holo-World decomposes controllable video generation into world-preservation and weather-transfer residual paths.

Metrics. On the Real subset, we report VBench-I2V (Huang et al., 2025b) for video quality, rotation error (RotErr), translation error (TransErr) (He et al., 2024) and ObjMC (Wang et al., 2024) for camera and ob…

**Figure 4.** Main qualitative comparison. Real and Weather rows show the source-to-state requirement of preserving the source-controlled world while synthesizing the requested weather state.

Results of Weather Transfer. We next evaluate the Weather subset, where Holo-World must change the weather from a single image under camera and object control. This setting is stricter than other video-to-video editing models beca…

**Figure 5.** Core decoupling ablation. The comparison visualizes whether UniSA reduces interference between Real-subset preservation and Weather-subset editing. Extracted panel labels: SW-CFG (weather=2), w/o CFG, with CFG, SW-CFG (weather=4); rows: Weather Sample, Real Sample.

**Figure 6.** Guidance decoupling ablation. The comparison visualizes how no CFG, vanilla CFG, and Scene-Weather Decomposed CFG balance weather strength and source-world preservation.

… and raise Weather Alignment and VLM Evaluation on the Weather subset, indicating that explicit geometry anchors help produce weather effects and preserve the controlled scene. UniSA further improves all three background metrics and VLM E…

**Figure 7.** Multi-weather state control. With the same source world and camera/object controls, changing only the target weather prompt produces different weather videos.

**Figure 8.** Additional visualization of Holo-World. Real examples visualize background-consistent generation under rendered controls, while Weather examples visualize target weather transfer under the same source-side control interface.

**Figure 9.** Weather-family distribution in HoloStateData. The pie chart shows the target weather-family distribution of the Weather training subset used for state-transfer supervision.

VLM text annotation. HoloStateData uses Qwen3-VL to produce factorized text conditions with low-randomness decoding (Bai et al., 2025). The annotation pipeline uses two VLM prompts rather than one generic caption. The scene prompt is ge…

**Figure 10.** HoloStateData example. The visualization shows how one video record is converted into the source…

Original abstract

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object controls with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/
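The first-frame-anchored source-to-state setting can be pictured as a single record that bundles the first frame, the explicit controls, and an optional weather instruction. The sketch below is illustrative only: `ControlSample`, its field names, the flat pose encoding, and the use of per-frame boxes for object controls are hypothetical and are not the HoloStateData schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x0, y0, x1, y1) per frame


@dataclass
class ControlSample:
    """One generation request: a first frame plus explicit controls."""

    first_frame_path: str
    camera_poses: List[List[float]]       # one pose per output frame
    object_tracks: List[List[Box]]        # per object, one box per output frame
    weather_prompt: Optional[str] = None  # None means "preserve the source world"

    def mode(self) -> str:
        """Return which of the two behaviors the sample asks for."""
        if self.weather_prompt is None or not self.weather_prompt.strip():
            return "preserve"
        return "weather_transfer"

    def validate(self) -> None:
        """Check that every control covers the same number of frames."""
        num_frames = len(self.camera_poses)
        if num_frames == 0:
            raise ValueError("camera_poses must contain at least one frame")
        for index, track in enumerate(self.object_tracks):
            if len(track) != num_frames:
                raise ValueError(
                    f"object track {index} has {len(track)} frames, "
                    f"expected {num_frames}"
                )
```

The optional `weather_prompt` mirrors the abstract's two outcomes: without it the request preserves the source world, and with it the request transfers the same controlled world to the target weather state.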

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The claimed outperformance over baselines is likely not fair because those baselines get the full source video while Holo-World gets only the first frame plus controls.


The main point to know is that this paper's evaluation setup undercuts its own headline result. Holo-World starts from a single image and explicit camera/object/weather signals, but the video-to-video baselines it compares against receive the entire source video. That extra input gives the baselines a structural advantage on preservation, so the two sides are not solving the same task, and any reported win on weather transfer could simply reflect the mismatched input regimes rather than better factorization.

What is actually new is the HoloStateData construction pipeline that converts ordinary videos into unified control samples, the Unified Scene Adapter that routes background renders, geometry buffers, and object signals into separate subspaces for scene versus weather, and the Scene-Weather Decomposed CFG that applies guidance to residuals independently. These pieces address the joint-control problem from a first-frame anchor in a way that prior isolated-control work did not.

The adapter design itself looks like a reasonable engineering step for keeping camera and object trajectories stable while changing appearance and particles. The decomposed guidance is a clean way to avoid over-strengthening the whole condition.

The soft spots are the missing numbers, error bars, and dataset statistics, plus a baseline comparison that the abstract never shows was adapted to the single-image setting. Without those, the central claim rests on an unreviewed dataset and model whose real performance is hard to judge.

This is for people already working on controllable video world models in computer vision. A reader chasing ideas for joint camera-object-weather control might pick up the adapter and CFG tricks, but the work does not yet supply the evidence needed to treat the results as reliable.

I would not send it to peer review in its current form. The authors would need to either rerun the baselines under the same single-frame-plus-controls protocol or provide a clear argument why the extra video input does not explain the gap.

Referee Report

1 major / 1 minor

Summary. The paper claims to introduce Holo-World, a video world model for unified control of camera, object, and weather from a single first-frame image using explicit controls. It constructs the HoloStateData dataset and proposes a Unified Scene Adapter that factorizes world preservation and weather transfer into distinct parameter subspaces using rendered background, geometry buffers, and object controls, along with Scene-Weather Decomposed CFG to guide the scene and weather residuals separately. Quantitative and qualitative experiments are said to show that it maintains precise camera and object control with consistent scene structure while transferring to target weather states, outperforming video-to-video weather editing baselines.

Significance. If the central claims hold after addressing the evaluation protocol, this would be a notable contribution to controllable video generation by unifying multiple control modalities in a first-frame-anchored setting. The creation of HoloStateData provides a new resource for training models with joint camera, object, and weather supervision, which is a strength. The factorization approach could inspire similar decompositions in other generative models.

major comments (1)

[Abstract] The claim of outperforming video-to-video weather editing baselines on weather-state generation (Abstract) is load-bearing for the central contribution. However, Holo-World operates in a single first-frame image plus explicit controls regime, while the baselines receive the full source video. The manuscript provides no indication that baselines were adapted to this setting or that the additional video input to baselines does not provide an unfair advantage in structure preservation. This must be addressed in the Experiments section to validate the outperformance.

minor comments (1)

[Abstract] The abstract references 'quantitative and qualitative experiments' demonstrating outperformance but does not include any specific metrics, error bars, or dataset statistics, which would help readers assess the strength of the results immediately.
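As an illustration of the kind of error bar requested here, per-sample metric scores could be summarized with a mean, a standard error, and a percentile bootstrap interval. This is a generic sketch using only the Python standard library. The function names and defaults are assumptions, and nothing here reflects how the paper's metrics were actually aggregated.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean."""
    if len(values) < 2:
        raise ValueError("need at least two samples for a standard error")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Reporting each headline metric as a mean with either the standard error or such an interval, per method and per subset, would let readers judge whether the gaps between methods exceed sampling noise.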

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential contribution of Holo-World and the HoloStateData dataset. We address the single major comment below and will revise the manuscript to strengthen the evaluation protocol.


Referee: [Abstract] The claim of outperforming video-to-video weather editing baselines on weather-state generation (Abstract) is load-bearing for the central contribution. However, Holo-World operates in a single first-frame image plus explicit controls regime, while the baselines receive the full source video. The manuscript provides no indication that baselines were adapted to this setting or that the additional video input to baselines does not provide an unfair advantage in structure preservation. This must be addressed in the Experiments section to validate the outperformance.

Authors: We agree that this comparison requires clarification to ensure fairness. The video-to-video baselines receive the full source video, which supplies additional structural information not available to Holo-World. In the revised manuscript, we will adapt the baselines to the single first-frame setting by providing only the initial frame plus the same explicit camera, object, and weather controls. We will update the Experiments section with these revised quantitative results (including metrics on weather transfer and structure preservation) and will explicitly describe the adaptation procedure so that the outperformance claim is validated under equivalent input conditions.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or claims


The paper introduces a new dataset (HoloStateData) and model architecture (Holo-World with Unified Scene Adapter and Scene-Weather Decomposed CFG) to address a first-frame-anchored controllable video generation task. No equations, fitted parameters, or self-citations are presented that reduce the claimed performance or factorization to a tautology or prior result by construction. The central claims rest on new components and experimental comparisons rather than re-deriving inputs, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The work introduces new model components and a dataset whose validity is not independently verified in the provided abstract; no free parameters or external axioms are explicitly listed.

invented entities (3)

HoloStateData · no independent evidence
purpose: Unified state video dataset providing camera, object, and weather supervision
New dataset constructed from diverse videos; no external validation mentioned

Unified Scene Adapter · no independent evidence
purpose: Factorizes world preservation and weather transfer into distinct parameter subspaces
New architectural module introduced to handle the joint control task

Scene-Weather Decomposed CFG · no independent evidence
purpose: Guides scene and weather residuals separately during generation
New guidance technique to strengthen weather effects without over-amplifying conditions

pith-pipeline@v0.9.1-grok · 5814 in / 1319 out tokens · 21721 ms · 2026-06-26T18:03:40.985029+00:00 · methodology
