
arxiv: 2606.29020 · v1 · pith:4XJOQURJnew · submitted 2026-06-27 · 💻 cs.CV · cs.AI · cs.ET · cs.MM

Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis

Pith reviewed 2026-06-30 09:16 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.ET · cs.MM
keywords weather video synthesis · physics-informed simulation · semantic-aware anchoring · geometry-grounded synthesis · particle dynamics · adverse weather effects · autonomous driving · semantic segmentation

The pith

Factoring weather video synthesis into semantic, dynamic, and geometric signals enables diverse and physically realistic weather effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to add controllable weather effects such as rain or snow to existing videos while keeping the original scene identity, structure, and motion intact. Existing approaches rely on text prompts that lack detail and on general video editors that tend to suppress dense particle effects, limiting both variety and physical accuracy. The proposed method supplies three separate conditional signals to an off-the-shelf editor: semantics to set target appearance, a physics simulation of Gaussian particles under gravity, wind, and turbulence to control motion over time, and geometry to place the effects correctly in the scene. If the signals work as intended, the resulting videos can serve as training data that measurably raises the accuracy of semantic segmentation models for autonomous driving in bad weather.

Core claim

By factorizing synthesis into semantic-aware appearance anchoring from scene semantics and user input, physics-informed dynamic simulation of a Gaussian-represented particle field under gravity, wind, and turbulence, and geometry-grounded video synthesis that aligns particles with target scene geometry, an off-the-shelf video editor can be steered to produce diverse global appearances and detailed particle dynamics that are both physically and visually realistic while preserving scene identity, structure, and motion.

What carries the argument

Factorization into three conditional signals—semantics for appearance, physics-informed particle simulation for temporal evolution, and geometry for spatial placement—that together steer an off-the-shelf video editor.
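
As a rough illustration of the dynamics signal, the sketch below advances a small particle set under gravity, wind, and random turbulence. The helper name `step_particles`, the semi-implicit Euler integrator, the drag-toward-wind term, the Gaussian-noise turbulence model, and all parameter values are assumptions made for illustration; the source does not specify the paper's integrator or force model.

```python
import random

def step_particles(positions, velocities, dt, gravity=(0.0, -9.81, 0.0),
                   wind=(1.5, 0.0, 0.0), turbulence_std=0.3, drag=0.5, rng=None):
    """Advance particles by one semi-implicit Euler step (hypothetical helper)."""
    rng = rng or random  # fall back to the module-level generator
    new_pos, new_vel = [], []
    for p, v in zip(positions, velocities):
        # Per-axis acceleration: gravity, drag pulling the velocity toward
        # the wind velocity, and zero-mean Gaussian noise as turbulence.
        a = [g + drag * (w - vi) + rng.gauss(0.0, turbulence_std)
             for g, w, vi in zip(gravity, wind, v)]
        # Update velocity first, then move with the updated velocity.
        v_next = [vi + ai * dt for vi, ai in zip(v, a)]
        p_next = [pi + vi * dt for pi, vi in zip(p, v_next)]
        new_vel.append(v_next)
        new_pos.append(p_next)
    return new_pos, new_vel

# Usage: two particles released at rest, simulated for 10 frames at 25 fps.
positions = [[0.0, 10.0, 5.0], [1.0, 12.0, 8.0]]
velocities = [[0.0, 0.0, 0.0] for _ in positions]
rng = random.Random(42)
for _ in range(10):
    positions, velocities = step_particles(positions, velocities, dt=0.04, rng=rng)
```

In a sketch like this, stronger wind or larger `turbulence_std` changes only the force terms, while the integration loop stays the same.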

If this is right

  • The method generates diverse, physically and visually realistic weather effects with explicit control over temporal evolution and particle motion.
  • Synthesized videos preserve original scene identity, structure, and motion.
  • Training on the synthesized adverse-weather videos significantly improves the robustness of autonomous driving semantic segmentation models under real adverse conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-signal factorization could be adapted to other particle phenomena such as smoke or dust by changing only the force terms in the simulation.
  • Synthetic weather data produced this way may reduce the volume of real adverse-weather recordings needed to train robust perception systems.
  • Explicit physics and geometry conditioning may help other generative video models overcome suppression of complex dynamic effects.

Load-bearing premise

An off-the-shelf general-purpose video editor will reliably generate dense particle effects when supplied with the three additional conditional signals instead of suppressing them.

What would settle it

Run the method on a set of clear-weather driving videos and measure whether the outputs contain dense, temporally coherent particles whose motion matches the simulated gravity-wind-turbulence field; separately, train a semantic segmentation model on the generated adverse-weather videos and check whether accuracy on real adverse-weather test sets fails to rise above a baseline trained only on clear data.
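
Segmentation accuracy in a check like this is commonly reported as mean intersection-over-union (mIoU). The sketch below computes mIoU from flattened label lists so that a baseline model and an augmented model could be compared on the same test set. The function names and the `ignore_index=255` convention are assumptions for illustration, not details taken from the paper.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count (ground truth, prediction) pairs, skipping ignored pixels."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        cm[t][p] += 1  # rows: ground truth, columns: prediction
    return cm

def mean_iou(cm):
    """Average per-class IoU over classes that appear in pred or target."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Toy usage: six pixels, three classes, one ignored pixel.
target = [0, 0, 1, 1, 2, 255]
pred = [0, 1, 1, 1, 2, 0]
miou = mean_iou(confusion_matrix(pred, target, num_classes=3))
```

Running the same function on predictions from a clear-only baseline and from a model trained with the synthesized videos would give the two numbers the test above asks to compare.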

Figures

Figures reproduced from arXiv: 2606.29020 by Chenghao Qian, Gustav Markkula, Jiaqi Ma, Lingdong Kong, Luc Van Gool, Nedko Savov, Rui Song, Wenjing Li, Yeying Jin, Zhun Zhong.

Figure 1: Weather Synthesis Examples. Given an original input video, our method can synthesize diverse weather effects, including the type of appearance (snowy or rainy), the event condition (static or dynamic), the duration (how long the weather lasts), the direction and strength of wind influence on particles, and the overall severity. This enables precise and interpretable control over atmospheric appearance and …
Figure 2: Comparison between text prompt-only and particle-conditioned editing. Text prompts alone often yield mild precipitation with weak particle details. In contrast, particle-conditioned editing enables dense snowflakes and rain streaks synthesis, indicating it effectively activates latent weather priors in pretrained video editors. Our key insight is that a well-designed conditioning space is essential: struct…
Figure 3: Overview of our framework. Given an input video, we construct structured conditioning signals from three aspects: semantics (semantic-aware appearance anchoring), dynamics (physics-informed dynamic simulation), and geometry (geometry-grounded video synthesis). These signals are combined to guide a video diffusion model to synthesize realistic weather effects. 3 Methodology Our framework consists of three s…
Figure 4: Comparisons of editing results between the simple prompt and our prompt. With the simple prompt, the editing model struggles to produce light snow with clearly visible effects and heavy rain with accurate, scene-consistent lighting. In contrast, our generated prompt incorporates richer scene semantics and finer-grained details, leading to outputs that more reliably match the intended conditions.
Figure 5: Semantic-aware appearance anchoring. Given the initial video frame, a VLM parses scene semantics and an LLM reasons about weather-specific effects conditioned on user intent. The resulting instruction is used to edit the initial frame to anchor target appearance, while also being combined with scene semantics to provide a refined text description for subsequent video generation. scene under the target weat…
Figure 6: Physics-informed dynamic simulation. (a) Weather particles are initialized as grid-sampled anisotropic Gaussians. (b) The particle field is evolved under gravity, wind, and turbulence, producing physically plausible motion that serves as explicit motion cues for video synthesis. Physics-informed simulation. To simulate particle dynamics, we augment each particle’s state with a depth coordinate, i.e., xi = …
Figure 7: Geometry-grounded video synthesis. (a) The simulated particle field is aligned with the scene by estimating gravity direction from 3D geometry. (b) The aligned particles are then projected into edited initial frame and each depth frame using camera intrinsics and extrinsics to produce particle maps. (c) Finally, we condition video generation on the particle-augmented inputs together with scene geometry and…
Figure 8: Qualitative comparison under different weather conditions. Top: snowy scenes. Bottom: rainy scenes. Particle details are highlighted in colored boxes.
Figure 9: Ablation on appearance anchoring. (a) Without semantic parsing and effect reasoning to generate a detailed prompt, results either change background content (snowy) or produce limited effects on the subject (rainy). With our prompt, the desired effects are achieved while retaining scene structure. (b) Without detailed prompts derived from scene captioning, results often exhibit hallucinations. With captioni…
Figure 10: Ablation on geometry grounding. Representative particles are highlighted in white. (Left) The blue solid arrow denotes the true gravity direction in 3D, and the dotted arrows show falling trends in 2D. (Right) The green solid boxes indicate the reference object scale, while the dashed boxes represent the particle-group scale. 4.5 Ablation Study We validate our key designs in (i) appearance anchoring (sema…
Figure 11: Examples of controllable weather effects. (Left) User inputs specify the weather duration category (short/medium/long) to control overall semantic appearance. (Right) Simulation parameters adjust particle shape and motion to reflect wind strength (weak to strong), enabling fine-grained particle control. Method ACDC MUSES Snow Rain Snow Rain DAFormer 50.9 50.8 41.4 33.8 HRDA 50.0 51.6 40.4 35.4 DAFormer+ou…
Figure 12: Qualitative segmentation results comparison under adverse weather: w/ vs. w/o our augmentation. Improved regions are highlighted in white circles. 4.6 Applications By leveraging the structured conditioning, our design enables flexible and interpretable control over both semantic appearance and particle effects, as shown in …
Original abstract

Weather synthesis aims to add weather effects to input videos while preserving scene identity, structure, and motion. The key limitation of existing methods is the lack of diversity in weather appearance and effective control over weather dynamics (e.g., temporal evolution and particle motion). Most approaches rely on text prompts, which are inherently underspecified and often fail to produce detailed weather characteristics. Additionally, general-purpose video editors optimized for clean and aesthetic outputs tend to suppress heavy weather phenomena, making dense particle effects difficult to generate. To address these, we propose a Semantic-Aware, Physics-Informed, and Geometry-Grounded framework that steers an off-the-shelf video editor to synthesize diverse global appearances and detailed particle dynamics. We factorize the synthesis into three conditional signals, so that each provides a distinct and stable source of guidance: semantics specifies what the weather should look like, dynamics governs how it evolves over time, and geometry determines where it should appear in the scene. Specifically, we introduce (1) semantic-aware appearance anchoring to establish the target appearance from scene semantics and user input; (2) physics-informed dynamic simulation to generate particle effects by simulating a Gaussian-represented particle field under gravity, wind, and turbulence; and (3) geometry-grounded video synthesis to align the simulated particles with target scene geometry and synthesize the final video. Experiments demonstrate that our method produces diverse, physically and visually realistic weather effects. Furthermore, we show that our synthesized data significantly improves the robustness of autonomous driving semantic segmentation under adverse weather conditions. Project page: https://jumponthemoon.github.io/w-crafter/.
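
The geometry step can be pictured with a standard pinhole camera model: the Figure 7 caption describes projecting aligned particles into frames using camera intrinsics and extrinsics. The sketch below is one minimal way to do that, under stated assumptions: the helper name `project_points`, the world-to-camera convention `Xc = R X + t`, zero skew, and the example camera values are all illustrative and not taken from the paper.

```python
def project_points(points_world, K, R, t, width, height):
    """Project 3D world points to pixel coordinates (hypothetical helper).

    K: 3x3 intrinsics, R: 3x3 world-to-camera rotation, t: translation (length 3).
    Returns (u, v, depth) for points in front of the camera and inside the image.
    """
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    pixels = []
    for X in points_world:
        # World -> camera coordinates: Xc = R @ X + t
        xc, yc, zc = (sum(R[i][j] * X[j] for j in range(3)) + t[i]
                      for i in range(3))
        if zc <= 0:
            continue  # behind the camera, not visible
        # Perspective divide, then shift by the principal point.
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v, zc))
    return pixels

# Usage with an identity pose and a 640x480 camera (illustrative values).
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
particles = [[0.0, 0.0, 5.0], [1.0, 0.0, 5.0], [0.0, 0.0, -1.0], [10.0, 0.0, 1.0]]
visible = project_points(particles, K, R, t, width=640, height=480)
```

Keeping the depth alongside each pixel lets a later step compare it with the scene depth map, so particles hidden behind scene geometry could be dropped.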

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a Semantic-Aware, Physics-Informed, Geometry-Grounded framework that steers an off-the-shelf video editor via three conditional signals—semantic-aware appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment—to synthesize weather effects in videos while preserving scene identity. It claims this overcomes limitations of text-prompt methods and editor suppression of heavy weather, producing diverse, physically and visually realistic effects, and that the synthesized data significantly improves robustness of autonomous driving semantic segmentation under adverse weather.

Significance. If the central claims hold, the work would offer a controllable approach to weather data synthesis that could aid training of robust perception models, particularly for autonomous driving in adverse conditions. The explicit factorization into distinct semantic, dynamic, and geometric signals is a reasonable response to underspecification in prior methods.

major comments (2)
  1. [Abstract] The claim that experiments 'demonstrate that our method produces diverse, physically and visually realistic weather effects' and that synthesized data 'significantly improves the robustness' of semantic segmentation rests on asserted experimental support, yet the provided text contains no quantitative metrics, ablation details, dataset descriptions, or error analysis. This is load-bearing for both the realism and downstream claims.
  2. [Abstract] The method asserts that the three conditional signals will steer the off-the-shelf editor to produce dense particle effects despite the documented tendency of such editors to 'suppress heavy weather phenomena,' but no explicit mechanism, loss term, or enforcement strategy is described for overriding the editor's training objective. This premise is central to the particle-dynamics claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below, clarifying the role of the abstract and the conditioning strategy.

Point-by-point responses
  1. Referee: [Abstract] The claim that experiments 'demonstrate that our method produces diverse, physically and visually realistic weather effects' and that synthesized data 'significantly improves the robustness' of semantic segmentation rests on asserted experimental support, yet the provided text contains no quantitative metrics, ablation details, dataset descriptions, or error analysis. This is load-bearing for both the realism and downstream claims.

    Authors: The abstract follows standard conventions by summarizing outcomes at a high level without embedding full metrics or ablations. The complete manuscript presents these details in Section 4 (Experiments), including quantitative metrics (FID, LPIPS for realism; mIoU gains on Cityscapes/ACDC for downstream robustness), ablation studies in 4.3, dataset specifications, and error analysis. We will revise the abstract to include one or two key quantitative results for better self-containment.
    Revision: partial

  2. Referee: [Abstract] The method asserts that the three conditional signals will steer the off-the-shelf editor to produce dense particle effects despite the documented tendency of such editors to 'suppress heavy weather phenomena,' but no explicit mechanism, loss term, or enforcement strategy is described for overriding the editor's training objective. This premise is central to the particle-dynamics claim.

    Authors: The steering occurs via dense, structured conditioning inputs derived from the three signals: semantic appearance maps, time-varying particle position fields from the Gaussian physics simulation, and geometry-aligned depth/normal maps. These are injected through the editor's native conditioning pathways (detailed in Section 3), providing explicit per-frame guidance that overrides suppression without modifying the editor or adding custom losses. We will expand the method description with a dedicated paragraph on the conditioning injection process.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: method is a compositional framework with no self-referential derivations or fitted predictions.

Full rationale

The provided abstract and description present the approach as a factorization into three external conditional signals (semantics, dynamics via Gaussian particle simulation, geometry) that steer an off-the-shelf video editor. No equations, parameter fitting, or derivation chain are described that reduce outputs to inputs by construction. Claims rest on empirical experiments and downstream task improvement rather than mathematical self-reference. This matches the default expectation of a non-circular engineering composition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that an off-the-shelf video editor can be successfully steered by the three new signals; no free parameters or invented physical entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: general-purpose video editors can be steered with additional conditional signals to produce dense particle weather effects instead of suppressing them.
    The paper explicitly identifies suppression of heavy weather by existing editors as the key limitation it overcomes via the three-signal approach.



Reference graph

Works this paper leans on

69 extracted references · 25 canonical work pages · 7 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    arXiv preprint (2025), arXiv:2503.14492 2

    Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-Transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025

  3. [3]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062, 2025

  4. [4]

    Vd3d: Taming large video diffusion transformers for 3d camera control.arXiv preprint arXiv:2407.12781, 2024

    Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. VD3D: Taming large video diffusion transformers for 3D camera control.arXiv preprint arXiv:2407.12781, 2024

  5. [5]

    ReCamMaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. ReCamMaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

  6. [6]

    MUSES: The multi-sensor semantic perception dataset for driving under uncertainty

    Tim Brödermann, David Bruggemann, Christos Sakaridis, Kevin Ta, Odysseas Liagouris, Jason Corkill, and Luc Van Gool. MUSES: The multi-sensor semantic perception dataset for driving under uncertainty. InEuropean Conference on Computer Vision, pages 21–38. Springer, 2024

  7. [7]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020

  8. [8]

    Control-A-Video: controllable text-to-video generation with diffusion models.arXiv preprint arXiv:2305.13840, 2023

    Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video diffusion models with motion prior and reward feedback learning.arXiv preprint arXiv:2305.13840, 2023

  9. [9]

    ContextFlow: Training-free video object editing via adaptive context enrichment.arXiv preprint arXiv:2509.17818, 2025

    Yiyang Chen, Xuanhua He, Xiujun Ma, and Yue Ma. ContextFlow: Training-free video object editing via adaptive context enrichment.arXiv preprint arXiv:2509.17818, 2025

  10. [10]

    EditMGT: Unleashing potentials of masked generative transformers in image editing

    Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, and Songhua Liu. EditMGT: Unleashing potentials of masked generative transformers in image editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 38038–38048, 2026

  11. [11]

    Flatten: optical flow- guided attention for consistent text-to-video editing.arXiv preprint arXiv:2310.05922, 2023

    Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan- Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: optical flow-guided attention for consistent text-to-video editing.arXiv preprint arXiv:2310.05922, 2023. 12

  12. [12]

    RainyGS: Efficient rain synthesis with physically-based gaussian splatting

    Qiyu Dai, Xingyu Ni, Qianfan Shen, Wenzheng Chen, Baoquan Chen, and Mengyu Chu. RainyGS: Efficient rain synthesis with physically-based gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16153–16162, 2025

  13. [13]

    Unreal Engine.Retrieved from Unreal Engine: https://www

    Unreal Engine. Unreal Engine.Retrieved from Unreal Engine: https://www. unrealengine. com/en-US/what-is-unreal-engine-4, 2018

  14. [14]

    CCEdit: Creative and controllable video editing via diffusion models

    Ruoyu Feng, Wenming Weng, Yanhui Wang, Yuhui Yuan, Jianmin Bao, Chong Luo, Zhibo Chen, and Baining Guo. CCEdit: Creative and controllable video editing via diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6712–6722, 2024

  15. [15]

    Let it snow! animating 3D gaussian scenes with dynamic weather effects via physics-guided score distillation.arXiv preprint arXiv:2504.05296, 2025

    Gal Fiebelman, Hadar Averbuch-Elor, and Sagie Benaim. Let it snow! animating 3D gaussian scenes with dynamic weather effects via physics-guided score distillation.arXiv preprint arXiv:2504.05296, 2025

  16. [16]

    StyleGAN-NADA: CLIP-guided domain adaptation of image generators.ACM Transactions on Graphics, 41(4):1–13, 2022

    Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-guided domain adaptation of image generators.ACM Transactions on Graphics, 41(4):1–13, 2022

  17. [17]

    Humandit: Pose-guided diffusion transformer for long-form human motion video generation.arXiv preprint arXiv:2502.04847,

    Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. HumanDiT: Pose-guided diffusion transformer for long-form human motion video generation.arXiv preprint arXiv:2502.04847, 2025

  18. [18]

    PISCO: Precise video instance insertion with sparse control.arXiv preprint arXiv:2602.08277, 2026

    Xiangbo Gao, Renjie Li, Xinghao Chen, Yuheng Wu, Suofei Feng, Qing Yin, and Zhengzhong Tu. PISCO: Precise video instance insertion with sparse control.arXiv preprint arXiv:2602.08277, 2026

  19. [19]

    An implicit compressible SPH solver for snow simulation.ACM Transactions on Graphics, 39(4): 36–1, 2020

    Christoph Gissler, Andreas Henne, Stefan Band, Andreas Peer, and Matthias Teschner. An implicit compressible SPH solver for snow simulation.ACM Transactions on Graphics, 39(4): 36–1, 2020

  20. [20]

    SparseCtrl: Adding sparse controls to text-to-video diffusion models

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. SparseCtrl: Adding sparse controls to text-to-video diffusion models. InEuropean Conference on Computer Vision, pages 330–348. Springer, 2024

  21. [21]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  22. [22]

    Semantic understanding of foggy scenes with purely synthetic data

    Martin Hahner, Dengxin Dai, Christos Sakaridis, Jan-Nico Zaech, and Luc Van Gool. Semantic understanding of foggy scenes with purely synthetic data. InIEEE Intelligent Transportation Systems Conference, pages 3675–3681, 2019

  23. [23]

    Is your HD map constructor reliable under sensor corruptions? InAdvances in Neural Information Processing Systems, volume 37, pages 22441–22482, 2024

    Xiaoshuai Hao et al. Is your HD map constructor reliable under sensor corruptions? InAdvances in Neural Information Processing Systems, volume 37, pages 22441–22482, 2024

  24. [24]

    SafeMap: Robust HD map construction from incomplete observations

    Xiaoshuai Hao et al. SafeMap: Robust HD map construction from incomplete observations. In International Conference on Machine Learning, pages 22091–22102. PMLR, 2025

  25. [25]

    DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation

    Lukas Hoyer, Dengxin Dai, and Luc Van Gool. DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9924–9935, 2022

  26. [26]

    HRDA: Context-aware high-resolution domain- adaptive semantic segmentation

    Lukas Hoyer, Dengxin Dai, and Luc Van Gool. HRDA: Context-aware high-resolution domain- adaptive semantic segmentation. InEuropean conference on computer vision, pages 372–391. Springer, 2022

  27. [27]

    V ACE: All-in- one video creation and editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. V ACE: All-in- one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025. 13

  28. [28]

    Tuning-free visual effect transfer across videos.arXiv preprint arXiv:2601.07833, 2026

    Maxwell Jones, Rameen Abdal, Or Patashnik, Ruslan Salakhutdinov, Sergey Tulyakov, Jun-Yan Zhu, and Kuan-Chieh Jackson Wang. Tuning-free visual effect transfer across videos.arXiv preprint arXiv:2601.07833, 2026

  29. [29]

    3D Gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4):139–1, 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3D Gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4):139–1, 2023

  30. [30]

    Robo3D: Towards robust and reliable 3D perception against corruptions

    Lingdong Kong, Youquan Liu, Xin Li, Runnan Chen, Wenwei Zhang, Jiawei Ren, Liang Pan, Kai Chen, and Ziwei Liu. Robo3D: Towards robust and reliable 3D perception against corruptions. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19994–20006, 2023

  31. [31]

    Cottereau, and Wei Tsang Ooi

    Lingdong Kong, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, Benoit R. Cottereau, and Wei Tsang Ooi. RoboDepth: Robust out-of-distribution depth estimation under corruptions. InAdvances in Neural Information Processing Systems, volume 36, pages 21298–21342, 2023

  32. [32]

    Lingdong Kong, Shaoyuan Xie, Hanjiang Hu, Yaru Niu, Wei Tsang Ooi, Benoit R. Cottereau, Lai Xing Ng, Yuexin Ma, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Weichao Qiu, Wei Zhang, Xu Cao, Hao Lu, Ying-Cong Chen, Caixin Kang, Xinning Zhou, Chengyang Ying, Wentao Shang, Xingxing Wei, Yinpeng Dong, Bo Yang, Shengyin Jiang, Zeliang Ma, Dengyi Ji, Haiwen Li,...

  33. [33]

    Multi-modal data-efficient 3D scene understanding for autonomous driving

    Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, and Ziwei Liu. Multi-modal data-efficient 3D scene understanding for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748–3765, 2025

  34. [34]

    Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, and Ziwei Liu. 3D and 4D world modeling: A survey.arXiv preprint arXiv:2509.07996, 2025

  35. [35]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  36. [36]

    Anyv2v: A tuning-free framework for any video-to- video editing tasks.arXiv preprint arXiv:2403.14468, 2024

    Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. AnyV2V: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

  37. [37]

    Collaborative video diffusion: Consistent multi-video generation with camera control.Advances in Neural Information Processing Systems, 37:16240–16271, 2024

    Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control.Advances in Neural Information Processing Systems, 37:16240–16271, 2024

  38. [38]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX. 1 Kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

  39. [39]

    Weather GAN: Multi-domain weather translation using generative adversarial networks.arXiv preprint arXiv:2103.05422, 2021

    Xuelong Li, Kai Kou, and Bin Zhao. Weather GAN: Multi-domain weather translation using generative adversarial networks.arXiv preprint arXiv:2103.05422, 2021

  40. [40]

    ClimateNeRF: Ex- treme weather synthesis in neural radiance field

    Yuan Li, Zhi-Hao Lin, David Forsyth, Jia-Bin Huang, and Shenlong Wang. ClimateNeRF: Ex- treme weather synthesis in neural radiance field. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3227–3238, 2023. 14

  41. [41]

    Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, and Ziwei Liu

    Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, and Ziwei Liu. WorldLens: Full-spectrum evaluations of driving world models in real wor...

  42. [42]

    Controllable weather synthesis and removal with video diffusion models

    Chih-Hao Lin, Zian Wang, Ruofan Liang, Yuxuan Zhang, Sanja Fidler, Shenlong Wang, and Zan Gojcic. Controllable weather synthesis and removal with video diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13580–13591, 2025

  43. [43]

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  44. [44]

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  45. [45]

    Chang Liu, Rui Li, Kaidong Zhang, Yunwei Lan, and Dong Liu. StableV2V: Stabilizing shape consistency in video-to-video editing. IEEE Transactions on Circuits and Systems for Video Technology, 2025

  46. [46]

    Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. PhysGen: Rigid-body physics-grounded image-to-video generation. In European Conference on Computer Vision, pages 360–378. Springer, 2024

  47. [47]

    Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, et al. InfiniCube: Unbounded and controllable dynamic 3D driving scene generation with world-guided video models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27272–27283, 2025

  48. [48]

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4117–4125, 2024

  49. [49]

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  50. [50]

    Shree K Nayar and Srinivasa G Narasimhan. Vision in bad weather. In Proceedings of the IEEE/CVF International Conference on Computer Vision, volume 2, pages 820–827, 1999

  51. [51]

    Yatian Pang, Bin Zhu, Bin Lin, Mingzhe Zheng, Francis EH Tay, Ser-Nam Lim, Harry Yang, and Li Yuan. DreamDance: Animating human images by enriching 3D geometry cues from 2D poses. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14039–14050, 2025

  52. [52]

    Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024

  53. [53]

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016

  54. [54]

    Chenghao Qian, Yuhu Guo, Yuhong Mo, and Wenjing Li. WeatherDG: LLM-assisted procedural weather generation for domain-generalized semantic segmentation. IEEE Robotics and Automation Letters, 10:5919–5926, 2025

  55. [55]

    Chenghao Qian, Wenjing Li, Yuhu Guo, and Gustav Markkula. WeatherEdit: Controllable weather editing with 4D gaussian field. arXiv preprint arXiv:2505.20471, 2025

  56. [56]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  57. [57]

    Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10765–10775, 2021

  58. [58]

    Chen Sang, Yeqiang Qian, Jiale Zhang, Chunxiang Wang, and Ming Yang. Weather-Magician: Reconstruction and rendering framework for 4D weather synthesis in real time. arXiv preprint arXiv:2505.19919, 2025

  59. [59]

    Victor Schmidt, Alexandra Sasha Luccioni, Mélisande Teng, Tianyu Zhang, Alexia Reynaud, Sunand Raghupathi, Gautier Cosne, Adrien Juraver, Vahe Vardanyan, Alex Hernandez-Garcia, et al. ClimateGAN: Raising climate change awareness by generating images of floods. arXiv preprint arXiv:2110.02871, 2021

  60. [60]

    Alexey Stomakhin, Craig Schroeder, Lawrence Chai, Joseph Teran, and Andrew Selle. A material point method for snow simulation. ACM Transactions on Graphics, 32(4):1–10, 2013

  61. [61]

    Maxime Tremblay, Shirsendu Sukanta Halder, Raoul De Charette, and Jean-François Lalonde. Rain rendering for evaluating and improving robustness to bad weather. International Journal of Computer Vision, 129(2):341–360, 2021

  62. [62]

    Yukun Wang, Longguang Wang, Zhiyuan Ma, Qibin Hu, Kai Xu, and Yulan Guo. VideoDirector: Precise video editing via text-to-video models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2589–2598, 2025

  63. [63]

    Wenhua Wu, Huai Guan, Zhe Liu, and Hesheng Wang. Weathercity: Urban scene reconstruction with controllable multi-weather transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 40949–40958, 2026

  64. [64]

    Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. PandaSet: Advanced sensor suite dataset for autonomous driving. In IEEE International Intelligent Transportation Systems Conference, pages 3095–3101, 2021

  65. [65]

    Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are VLMs ready for autonomous driving? An empirical study from the reliability, data, and metric perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6585–6597, 2025

  66. [66]

    Shaoyuan Xie, Lingdong Kong, Wenwei Zhang, Jiawei Ren, Liang Pan, Kai Chen, and Ziwei Liu. Benchmarking and improving bird’s eye view perception robustness in autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3878–3894, 2025

  67. [67]

    Yuezhen Xie, Meiying Zhang, and Qi Hao. ClimateGS: Real-time climate simulation with 3D gaussian style transfer. arXiv preprint arXiv:2503.14845, 2025

  68. [68]

    Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics, 31(2):1526–1541, 2024

  69. [69]

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024