
arxiv: 2605.11871 · v2 · pith:V66BHI7Vnew · submitted 2026-05-12 · 💻 cs.CV

h-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

Pith reviewed 2026-05-20 22:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera control · flow matching · video generation · Gibbs sampling · training-free · latent refinement · conditional sampling · partial observation

The pith

h-control augments guidance steps with block-conditional Gibbs refinement to reconcile partial camera trajectories with pretrained video priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames training-free camera control for flow-matching video generators as a partial-observation inverse problem in which a depth-warped guidance video supplies noisy evidence only on a subset of latent sites. It proposes h-control as a structural sampler change that augments each hard-replacement guidance step with an inner loop of block-conditional pseudo-Gibbs refinement performed on the unobserved complement at fixed noise level. The refinement is proven to converge to the target partial-observation conditional data law. Conditional locality of video latents is exploited by partitioning the complement into 3D patches whose individual convergence is tracked by a custom mixing indicator that freezes patches once they stabilize. On RealEstate10K and DAVIS the resulting method records the best Fréchet Video Distance against seven training-free and training-based baselines while improving every other reported metric over all training-free competitors.
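As a minimal sketch of what Gibbs refinement on an unobserved complement means, the toy below clamps one observed site of a three-site Gaussian chain and repeatedly resamples the two unobserved sites from their exact conditionals. Everything in it is an illustrative assumption (the precision matrix, the name `gibbs_refine`, the sweep count). The paper's sampler is a pseudo-Gibbs scheme driven by a pretrained flow-matching model on video latents, not this closed-form Gaussian.

```python
import random

# Toy zero-mean Gaussian chain with precision matrix
#   Q = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]   (assumed for illustration)
# Site 0 is observed and clamped; sites 1 and 2 form the unobserved complement.

def gibbs_refine(x_obs, n_sweeps, rng):
    """Run Gibbs sweeps over the unobserved sites while x_obs stays fixed."""
    x1, x2 = 0.0, 0.0
    sd = 0.5 ** 0.5  # each conditional variance is 1 / Q_ii = 1/2
    samples = []
    for _ in range(n_sweeps):
        # x1 | x0, x2 ~ N((x0 + x2) / 2, 1/2)
        x1 = rng.gauss((x_obs + x2) / 2.0, sd)
        # x2 | x1 ~ N(x1 / 2, 1/2)
        x2 = rng.gauss(x1 / 2.0, sd)
        samples.append((x1, x2))
    return samples

rng = random.Random(0)
samples = gibbs_refine(1.0, 20000, rng)
kept = samples[1000:]  # discard an initial burn-in segment
mean_x1 = sum(s[0] for s in kept) / len(kept)
mean_x2 = sum(s[1] for s in kept) / len(kept)
```

For this toy, the exact conditional means given x0 = 1 are 2/3 and 1/3, and the empirical means approach them as the number of sweeps grows. That is the sense in which an inner loop can target the conditional law while the observed site stays fixed.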

Core claim

h-control resolves the adherence-quality trade-off by augmenting each outer hard-replacement guidance step with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved latent complement at the same noise level, yielding provable convergence to the partial-observation conditional data law while exploiting conditional locality through 3D patch partitioning and adaptive mixing indicators.

What carries the argument

block-conditional pseudo-Gibbs refinement performed on adaptively frozen 3D patches of the unobserved latent complement

If this is right

  • h-control records the best FVD on both RealEstate10K and DAVIS against all seven training-free and training-based competitors.
  • It improves every reported metric over every training-free baseline.
  • The inner-loop refinement converges to the exact partial-observation conditional distribution.
  • Adaptive freezing of converged patches reduces compute on high-dimensional video latents without quality loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same locality-driven patch refinement could be ported to other partial-observation tasks such as inpainting or novel-view synthesis in the same generators.
  • Because the method requires no fine-tuning, it may allow rapid testing of new control signals on successive model releases.
  • If the mixing indicator generalizes, analogous adaptive stopping rules could accelerate sampling in other high-dimensional conditional generation settings.

Load-bearing premise

The unobserved latent sites exhibit enough conditional locality that they can be safely split into independent 3D patches whose separate refinement does not destroy global video consistency.
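The premise can be made concrete with a small sketch of a 3D patch partition plus a freezing rule. The partition below is straightforward; the freezing criterion (a patch is frozen once a summary statistic moves by less than `tol` between inner iterations) is a hypothetical stand-in, since this page does not describe the paper's actual mixing indicator. The names `partition_3d` and `update_frozen` are likewise assumptions.

```python
from itertools import product

def partition_3d(shape, patch):
    """Group every (l, h, w) site by the 3D patch that contains it."""
    patches = {}
    for l, h, w in product(*(range(n) for n in shape)):
        key = (l // patch[0], h // patch[1], w // patch[2])
        patches.setdefault(key, []).append((l, h, w))
    return patches

def update_frozen(prev_stats, new_stats, frozen, tol=1e-2):
    """Freeze patches whose statistic changed by less than tol (hypothetical rule)."""
    for key, new in new_stats.items():
        if key not in frozen and abs(new - prev_stats[key]) < tol:
            frozen.add(key)
    return frozen

# A (4, 6, 6) grid split into (2, 3, 3) patches gives 2 * 2 * 2 = 8 patches.
patches = partition_3d((4, 6, 6), (2, 3, 3))
frozen = update_frozen(
    {(0, 0, 0): 1.0, (0, 0, 1): 1.0},
    {(0, 0, 0): 1.001, (0, 0, 1): 1.5},
    set(),
)
```

Only the first patch is frozen here, because its statistic barely moved. The risk named in the premise is that such a per-patch test says nothing about agreement between neighbouring patches.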

What would settle it

A video sequence in which patch-wise refinement produces visible motion seams or trajectory drift across patch boundaries despite the mixing indicator having declared convergence.
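One way such a failure could be quantified is a boundary-versus-interior discontinuity ratio. The sketch below does this along a single axis of a 1D signal; the function name `seam_ratio` and the choice of mean absolute difference are assumptions made for illustration, not a metric taken from the paper.

```python
def seam_ratio(signal, patch_len):
    """Mean jump across patch boundaries divided by mean jump inside patches.

    Assumes len(signal) > patch_len, so at least one boundary exists.
    """
    boundary, interior = [], []
    for i in range(len(signal) - 1):
        jump = abs(signal[i + 1] - signal[i])
        # The pair (i, i + 1) straddles a boundary when i + 1 starts a new patch.
        if (i + 1) % patch_len == 0:
            boundary.append(jump)
        else:
            interior.append(jump)
    mean_boundary = sum(boundary) / len(boundary)
    mean_interior = sum(interior) / len(interior)
    return mean_boundary / mean_interior if mean_interior > 0 else float("inf")

smooth = seam_ratio([0, 1, 2, 3, 4, 5, 6, 7], 4)       # no seam: ratio 1.0
seamed = seam_ratio([0, 1, 2, 3, 10, 11, 12, 13], 4)   # jump at the boundary: ratio 7.0
```

A ratio well above 1 on patches that the indicator has already frozen would be evidence of the kind described above. Applying this to real latents would require choosing an axis and a per-site scalar, which this sketch leaves open.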

Figures

Figures reproduced from arXiv: 2605.11871 by Duo Su, Jun Zhu, Xi Ye, Yangyang Xu, Yuzhu Wang.

Figure 1: 2D checkerboard toy example. (a) Sample clouds at y_obs ≈ 0.5 for ground truth, DPS [11], TFG-UGD [12], and h-control (left to right). (b) Posterior-hit rate vs. total NFE for h-control (varying J_max) and TFG-UGD (varying N_recur). (c) |Δ_W^(j)| vs. inner iteration j, binned by noise band.
By Polyak and Juditsky [25] on iterate averaging, this trades the per-sample posterior variance for an O(τ_int/J_max) Monte…
Figure 2: Top canonical partial correlation ρ_1(R̂_βγ) along the H, W, L axes of the Wan 2.2 latent (N = 200 encoded videos). Off-diagonal mass concentrates within |β − γ| ≤ 2.
3.2 From Toy to Video: Locality and Block-Conditional Gibbs. At video scale (the Wan 2.2 latent has shape (C, L, H, W) with C = 48 and ∼10^5 sites), a generalized DAE chain on this full sub-state mixes too slowly (per-probe sampling variance scal…
Figure 3: Qualitative results on RealEstate10K. Compared with the baselines, our method generates…
Figure 4: Qualitative results on DAVIS. On dynamic scenes, …
Figure 5: Stability mask S_g evolution at the initial outer guidance step (noise level σ_{t_s}). The stable region grows with j to cover the unobserved support.
…weighted-h-transform of Wang et al. [16] extends DPS with a global-scalar confidence weight, and Zhu et al. [46] pursues the same formal object via fine-tuning rather than inference-time guidance. Inference-time conditional samplers such as the Twisted Diffusion Sa…
Figure 6: Top canonical partial correlation ρ_1(R̂_βγ) on the model's clean prediction ẑ_0(z_t, t, c) along the H, W, L axes (left to right) at five noise levels. The diagonal band stays sharp at every σ_t, with off-diagonal mass concentrated within |β − γ| ≤ 2, confirming that the locality structure of…
Figure 7: Failure modes. Severe depth-estimation error produces incorrect warping, which leads to poor camera control.
Figure 8: Additional RealEstate10K qualitative comparisons.
Figure 9: More comparisons with state-of-the-art methods on DAVIS.

Original abstract

Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance the trade-off between trajectory adherence and visual quality, and the heuristic guidance-strength tuning lacks robustness. We propose h-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, h-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces h-control, a training-free camera control method for pretrained flow-matching video generators. It frames the task as a partial-observation inverse problem and augments hard-replacement guidance with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved latent complement. Conditional locality is exploited by partitioning into 3D patches tracked by adaptive mixing indicators that freeze converged patches. The method claims provable convergence to the partial-observation conditional data law and reports the best FVD on RealEstate10K and DAVIS against seven training-free and training-based competitors, outperforming all training-free baselines on every metric.

Significance. If the convergence guarantee holds and patch-wise refinement preserves global consistency under camera-induced dependencies, the approach would provide a principled, training-free alternative to heuristic guidance tuning, improving the trade-off between trajectory adherence and visual quality. Credit is due for building directly on existing pretrained priors and standard sampling theory without introducing new free parameters or fitted quantities. The empirical claim of best-in-class FVD is potentially impactful for video generation benchmarks if the experimental protocol is fully specified.

major comments (2)
  1. Abstract: The central claim of 'provable convergence to the partial-observation conditional data law' is asserted without any theorem statement, derivation, mixing-time bound, or error analysis. This is load-bearing for the contribution, as the block-conditional pseudo-Gibbs sampler is the structural change proposed to resolve the guidance-quality trade-off.
  2. Abstract: The partitioning of unobserved latents into independent 3D patches with per-patch mixing indicators assumes sufficient conditional locality. No bound or argument is supplied showing that patch-wise updates preserve the joint conditional law when camera motion induces long-range correlations (e.g., consistent parallax and occlusion boundaries). This directly engages the stress-test concern and risks global coherence violations even if local indicators report convergence.
minor comments (1)
  1. Abstract: The seven competitors are referenced but neither named nor described (e.g., which are training-free vs. training-based), and no quantitative FVD values or baseline details are supplied, limiting verification of the 'best-in-class' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify two important aspects of the presentation that warrant clarification and strengthening. We address each below and indicate the revisions we will incorporate.

  1. Referee: Abstract: The central claim of 'provable convergence to the partial-observation conditional data law' is asserted without any theorem statement, derivation, mixing-time bound, or error analysis. This is load-bearing for the contribution, as the block-conditional pseudo-Gibbs sampler is the structural change proposed to resolve the guidance-quality trade-off.

    Authors: We agree that the abstract would benefit from a more explicit pointer to the supporting argument. Section 3.2 of the manuscript derives the convergence result by showing that the block-conditional pseudo-Gibbs updates are exact conditional samples from the pretrained flow-matching model at fixed noise level and that the overall procedure is a valid MCMC kernel for the partial-observation conditional. We will revise the abstract to reference this derivation and add a concise theorem statement (with a one-paragraph proof sketch) in the main text to make the claim self-contained. revision: yes

  2. Referee: Abstract: The partitioning of unobserved latents into independent 3D patches with per-patch mixing indicators assumes sufficient conditional locality. No bound or argument is supplied showing that patch-wise updates preserve the joint conditional law when camera motion induces long-range correlations (e.g., consistent parallax and occlusion boundaries). This directly engages the stress-test concern and risks global coherence violations even if local indicators report convergence.

    Authors: This is a substantive concern. The method exploits the strong conditional locality observed in the flow-matching latent space for video data, with adaptive freezing intended to preserve consistency. While we do not supply a theoretical error bound quantifying the effect of camera-induced long-range dependencies, the empirical results on RealEstate10K and DAVIS show that global metrics (FVD) and visual coherence remain superior to baselines. We will add a dedicated paragraph in Section 4 discussing the locality assumption, its empirical validation, and the potential for coherence violations under extreme motion, together with additional qualitative examples. revision: partial
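The first response appeals to the block updates forming a valid MCMC kernel. The standard identity behind that kind of claim is worked out below for an idealized sampler whose block conditionals are exact. The notation (target \pi, unobserved set U, observed set O, block B) is assumed here for illustration and is not taken from the manuscript. The identity shows invariance only; it gives no mixing-time bound and does not cover approximation error from a learned model.

```latex
% Target: \pi(x_U \mid x_O), with observed sites O held fixed.
% Block update: resample x_B \sim \pi(\cdot \mid x_{U \setminus B}, x_O), keep x_{U \setminus B}.
\begin{align*}
\int \pi(x_B, x_{U \setminus B} \mid x_O)\,
     \pi(x'_B \mid x_{U \setminus B}, x_O)\, \mathrm{d}x_B
  &= \pi(x_{U \setminus B} \mid x_O)\,
     \pi(x'_B \mid x_{U \setminus B}, x_O) \\
  &= \pi(x'_B, x_{U \setminus B} \mid x_O).
\end{align*}
```

Each exact block update therefore leaves the target conditional invariant, and so does any composition of such updates. Whether the pseudo-Gibbs updates in the paper satisfy the exactness condition is the point the referee asks the authors to establish.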

Circularity Check

0 steps flagged

No circularity: derivation builds on standard Gibbs sampling and pretrained priors without reducing claims to fitted inputs or self-referential definitions.


The paper presents h-control as a structural modification to the sampler (outer hard-replacement guidance augmented by inner block-conditional pseudo-Gibbs on the unobserved complement) that claims convergence to the partial-observation conditional. This rests on the standard theory of Gibbs sampling plus an explicit modeling assumption of conditional locality for patch partitioning, rather than any equation or parameter that is defined in terms of the target performance metric. No fitted quantity is renamed as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and the reported FVD improvements are empirical outcomes on RealEstate10K and DAVIS rather than algebraic identities. The derivation chain therefore remains self-contained against external benchmarks and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the existence of a well-defined partial-observation conditional data law for flow-matching latents and on the practical validity of conditional locality for patch-wise refinement.

axioms (1)
  • domain assumption Video latents possess conditional locality that permits safe partitioning into 3D patches whose individual convergence can be monitored independently.
    Invoked to justify the acceleration technique and adaptive freezing of converged patches.
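Conditional locality has a simple scalar analogue: in a Gaussian model, the partial correlation between two sites given all others is read directly off the precision matrix, and it vanishes for sites that are not neighbours. The sketch below uses an assumed tridiagonal precision matrix. It illustrates the concept only and is not the canonical block-level partial correlation reported in the paper's figures.

```python
import math

def partial_correlation(Q, i, j):
    """Partial correlation of sites i and j given the rest, from precision matrix Q."""
    return -Q[i][j] / math.sqrt(Q[i][i] * Q[j][j])

# Assumed toy precision matrix of a three-site chain (tridiagonal, positive definite).
Q = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]

adjacent = partial_correlation(Q, 0, 1)  # neighbouring sites
distant = partial_correlation(Q, 0, 2)   # sites two steps apart
```

Neighbouring sites have partial correlation 0.5 while the two end sites have 0, even though their marginal correlation is nonzero. A banded structure of this kind is what would make patch-wise refinement a reasonable approximation.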

pith-pipeline@v0.9.0 · 5731 in / 1243 out tokens · 69955 ms · 2026-05-20T22:49:11.778132+00:00 · methodology



Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 9 internal anchors

  1. [1]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

  2. [2]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

  3. [3]

    Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models

    Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13416–13426, 2025

  4. [4]

    Vd3d: Taming large video diffusion transformers for 3d camera control

    Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781, 2024

  5. [5]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22875–22889, 2025

  6. [6]

    Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning

    David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2050–2062, 2025

  7. [7]

    Training-free camera control for video generation

    Chen Hou and Zhibo Chen. Training-free camera control for video generation. arXiv preprint arXiv:2406.10126, 2024

  8. [8]

    Latent-reframe: Enabling camera control for video diffusion models without training

    Zhenghong Zhou, Jie An, and Jiebo Luo. Latent-reframe: Enabling camera control for video diffusion models without training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12779–12789, 2025

  9. [9]

    Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning

    Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In International Conference on Machine Learning, pages 22825–22855. PMLR, 2023

  10. [10]

    Manifold preserving guided diffusion

    Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J Zico Kolter, Ruslan Salakhutdinov, et al. Manifold preserving guided diffusion. arXiv preprint arXiv:2311.16424, 2023

  11. [11]

    Diffusion Posterior Sampling for General Noisy Inverse Problems

    Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022

  12. [12]

    Tfg: Unified training-free guidance for diffusion models

    Haotian Ye, Haowei Lin, Jiaqi Han, Minkai Xu, Sheng Liu, Yitao Liang, Jianzhu Ma, James Zou, and Stefano Ermon. Tfg: Unified training-free guidance for diffusion models. Advances in Neural Information Processing Systems, 37:22370–22417, 2024

  13. [13]

    Repaint: Inpainting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022

  14. [14]

    Time-to-move: Training-free motion controlled video generation via dual-clock denoising

    Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, and Or Litany. Time-to-move: Training-free motion controlled video generation via dual-clock denoising. arXiv preprint arXiv:2511.08633, 2025

  15. [15]

    Worldforge: Unlocking emergent 3d/4d generation in video diffusion model via training-free guidance

    Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, and Chi Zhang. Taming video models for 3d and 4d generation via zero-shot camera control. arXiv preprint arXiv:2509.15130, 2025

  16. [16]

    Coarse-guided visual generation via weighted h-transform sampling

    Yanghao Wang, Ziqi Jiang, Zhen Wang, and Long Chen. Coarse-guided visual generation via weighted h-transform sampling. arXiv preprint arXiv:2603.12057, 2026

  17. [17]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models

    Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems, 37:122458–122483, 2024

  18. [18]

    Generalized denoising auto-encoders as generative models

    Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. Advances in neural information processing systems, 26, 2013

  19. [19]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations, 2021

  20. [20]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  21. [21]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  22. [22]

    Diffusions, Markov processes, and martingales, volume 2

    L Chris G Rogers and David Williams. Diffusions, Markov processes, and martingales, volume 2. Cambridge university press, 2000

  23. [23]

    Deft: Efficient fine-tuning of diffusion models by learning the generalised h-transform

    Alexander Denker, Francisco Vargas, Shreyas Padhy, Kieran Didi, Simon Mathis, Vincent Dutordoir, Riccardo Barbano, Emile Mathieu, Urszula J Komorowska, and Pietro Lio. Deft: Efficient fine-tuning of diffusion models by learning the generalised h-transform. Advances in Neural Information Processing Systems, 37:19636–19682, 2024

  24. [24]

    Practical and asymptotically exact conditional sampling in diffusion models

    Luhuan Wu, Brian Trippe, Christian Naesseth, David Blei, and John P Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. Advances in Neural Information Processing Systems, 36:31372–31403, 2023

  25. [25]

    Acceleration of stochastic approximation by averaging

    Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992

  26. [26]

    Self-Refining Video Sampling

    Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, and Sung Ju Hwang. Self-refining video sampling. arXiv preprint arXiv:2601.18577, 2026

  27. [27]

    Note on a method for calculating corrected sums of squares and products

    Barry Payne Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, 1962

  28. [28]

    Covariance structure of the gibbs sampler with applications to the comparisons of estimators and augmentation schemes

    Jun S Liu, Wing Hung Wong, and Augustine Kong. Covariance structure of the gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, pages 27–40, 1994

  29. [29]

    Gareth O Roberts and Sujit K Sahu. Updating schemes, correlation structure, blocking and parameterization for the gibbs sampler. Journal of the Royal Statistical Society Series B: Statistical Methodology, 59(2):291–317, 1997

  30. [30]

    Understanding the effective receptive field in deep convolutional neural networks

    Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. Advances in neural information processing systems, 29, 2016

  31. [31]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  32. [32]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  33. [33]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  34. [34]

    Trajectory attention for fine-grained video motion control

    Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, and Xingang Pan. Trajectory attention for fine-grained video motion control. arXiv preprint arXiv:2411.19324, 2024

  35. [35]

    Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 100–111, 2025

  36. [36]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  37. [37]

    Collaborative video diffusion: Consistent multi-video generation with camera control

    Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems, 37:16240–16271, 2024

  38. [38]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  39. [39]

    Megasam: Accurate, fast and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10486–10496, 2025

  40. [40]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  41. [41]

    Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints

    Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints. arXiv preprint arXiv:2412.07760, 2024

  42. [42]

    Generative camera dolly: Extreme monocular dynamic novel view synthesis

    Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024

  43. [43]

    Zero4d: Training-free 4d video generation from single video using off-the-shelf video diffusion

    Jangho Park, Taesung Kwon, and Jong Chul Ye. Zero4d: Training-free 4d video generation from single video using off-the-shelf video diffusion. arXiv preprint arXiv:2503.22622, 2025

  44. [44]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024

  45. [45]

    Vid-camedit: Video camera trajectory editing with generative rendering from estimated geometry

    Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, et al. Vid-camedit: Video camera trajectory editing with generative rendering from estimated geometry. arXiv preprint arXiv:2506.13697, 2025

  46. [46]

    Training-free adaptation of diffusion models via doob’s h-transform

    Qijie Zhu, Zeqi Ye, Han Liu, Zhaoran Wang, and Minshuo Chen. Training-free adaptation of diffusion models via doob’s h-transform. arXiv preprint arXiv:2602.16198, 2026

  47. [47]

    A general framework for inference-time scaling and steering of diffusion models

    Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. arXiv preprint arXiv:2501.06848, 2025

  48. [48]

    Plug-and-play priors for bright field electron tomography and sparse interpolation

    Suhas Sreehari, S Venkat Venkatakrishnan, Brendt Wohlberg, Gregery T Buzzard, Lawrence F Drummy, Jeffrey P Simmons, and Charles A Bouman. Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Transactions on Computational Imaging, 2(4):408–423, 2016

  49. [49]

    The little engine that could: Regularization by denoising (red)

    Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM journal on imaging sciences, 10(4):1804–1844, 2017

  50. [50]

    A restoration network as an implicit prior

    Yuyang Hu, Mauricio Delbracio, Peyman Milanfar, and Ulugbek Kamilov. A restoration network as an implicit prior. In The Twelfth International Conference on Learning Representations, 2023

  51. [51]

    Fire: Fixed-points of restoration priors for solving inverse problems

    Matthieu Terris, Ulugbek S Kamilov, and Thomas Moreau. Fire: Fixed-points of restoration priors for solving inverse problems. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23185–23194, 2025

  52. [52]

    Reverse-time diffusion equation models

    Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982

  53. [53]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011

  54. [54]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

  55. [55]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  56. [56]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

    A Extended Related Work and Positioning. This appendix expands Section 5 with a technical positioning of h-control against the four research lines it sits...

  57. [57]

    provides a generative-modeling perspective on the same problem. The common requirement across this family is that at least one component (backbone, adapter, or refinement head) is fine-tuned to internalize the trajectory-to-video correspondence. Training-free controllers. TTM [14] and WorldForge [15] construct a warped guidance video by lifting the sour...

  58. [58]

    denoiser as implicit prior

    fine-tunes the base model to internalize the same drift. Position of h-control: we extend the global scalar λσt to a spatially non-uniform mask M and pair it with a novel inner refinement on the unobserved support, while keeping the conditioning entirely at inference time, with no extra network and no fine-tuning. Sequential Monte Carlo and Feynman–Kac. Twisted Di...