$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

Duo Su; Jun Zhu; Xi Ye; Yangyang Xu; Yuzhu Wang

arxiv: 2605.11871 · v2 · pith:V66BHI7Vnew · submitted 2026-05-12 · 💻 cs.CV

h-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

Yuzhu Wang, Xi Ye, Duo Su, Yangyang Xu, Jun Zhu

Pith reviewed 2026-05-20 22:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords: camera control, flow matching, video generation, Gibbs sampling, training-free, latent refinement, conditional sampling, partial observation


The pith

h-control augments guidance steps with block-conditional Gibbs refinement to reconcile partial camera trajectories with pretrained video priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames training-free camera control for flow-matching video generators as a partial-observation inverse problem in which a depth-warped guidance video supplies noisy evidence only on a subset of latent sites. It proposes h-control as a structural sampler change that augments each hard-replacement guidance step with an inner loop of block-conditional pseudo-Gibbs refinement performed on the unobserved complement at fixed noise level. The refinement is proven to converge to the target partial-observation conditional data law. Conditional locality of video latents is exploited by partitioning the complement into 3D patches whose individual convergence is tracked by a custom mixing indicator that freezes patches once they stabilize. On RealEstate10K and DAVIS the resulting method records the best Fréchet Video Distance against seven training-free and training-based baselines while improving every other reported metric over all training-free competitors.

Core claim

h-control resolves the adherence-quality trade-off by augmenting each outer hard-replacement guidance step with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved latent complement at the same noise level, yielding provable convergence to the partial-observation conditional data law while exploiting conditional locality through 3D patch partitioning and adaptive mixing indicators.
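To make the inner loop concrete, here is a minimal sketch of block-conditional Gibbs refinement on a toy problem. It is not the paper's sampler. It assumes a three-site stationary Gaussian chain with a hypothetical correlation `a`, one observed site held fixed (the analogue of hard replacement), and two unobserved sites resampled in alternation from their exact conditionals. The paper's pseudo-Gibbs updates instead draw from a pretrained flow-matching model at a fixed noise level; the name `gibbs_conditional` and all parameters are illustrative.

```python
import random


def gibbs_conditional(y_obs, a=0.8, sweeps=20000, burn_in=500, seed=0):
    """Block Gibbs on a toy chain x0 -> x1 -> x2 with x0 clamped to y_obs.

    Assumed toy model (not from the paper):
        x_{k+1} = a * x_k + sqrt(1 - a^2) * eps, unit stationary variance.
    Returns (mean of x1, mean of x2, variance of x2) over the retained sweeps.
    """
    rng = random.Random(seed)
    sd_end = (1.0 - a * a) ** 0.5                    # sd of x2 | x1
    sd_mid = ((1.0 - a * a) / (1.0 + a * a)) ** 0.5  # sd of x1 | x0, x2
    x1 = x2 = 0.0
    sum1 = sum2 = sum2_sq = 0.0
    for k in range(burn_in + sweeps):
        # Block 1: resample the middle site given both neighbours.
        x1 = rng.gauss(a * (y_obs + x2) / (1.0 + a * a), sd_mid)
        # Block 2: resample the end site given its only neighbour.
        x2 = rng.gauss(a * x1, sd_end)
        if k >= burn_in:
            sum1 += x1
            sum2 += x2
            sum2_sq += x2 * x2
    mean1 = sum1 / sweeps
    mean2 = sum2 / sweeps
    var2 = sum2_sq / sweeps - mean2 * mean2
    return mean1, mean2, var2


if __name__ == "__main__":
    a, y = 0.8, 1.5
    m1, m2, v2 = gibbs_conditional(y, a)
    # Analytic targets for this toy:
    #   E[x1 | x0] = a * y, E[x2 | x0] = a^2 * y, Var[x2 | x0] = 1 - a^4.
    print("x1 mean:", m1, "target:", a * y)
    print("x2 mean:", m2, "target:", a * a * y)
    print("x2 var: ", v2, "target:", 1.0 - a ** 4)
```

In this toy the observed site is never resampled, so the chain can only explore the conditional law given the observation. That is the property the core claim attributes to the inner loop.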

What carries the argument

block-conditional pseudo-Gibbs refinement performed on adaptively frozen 3D patches of the unobserved latent complement

If this is right

h-control records the best FVD on both RealEstate10K and DAVIS against all seven training-free and training-based competitors.
It improves every reported metric over every training-free baseline.
The inner-loop refinement converges to the exact partial-observation conditional distribution.
Adaptive freezing of converged patches reduces compute on high-dimensional video latents without quality loss.
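The adaptive-freezing claim can be illustrated with a small convergence monitor. The paper's mixing indicator is not specified in this summary, so the rule below is a hypothetical stand-in. It tracks a scalar per-patch statistic with Welford's running mean and variance [27] and freezes the patch once the running mean moves by less than `tol` for `patience` consecutive inner iterations. `PatchMonitor`, `tol`, and `patience` are illustrative names.

```python
class PatchMonitor:
    """Hypothetical per-patch freeze rule built on Welford's online update."""

    def __init__(self, tol=1e-2, patience=3):
        self.tol = tol
        self.patience = patience
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations
        self.calm = 0      # consecutive low-drift updates
        self.frozen = False

    def update(self, value):
        """Feed one statistic; return True once the patch is frozen."""
        if self.frozen:
            return True
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        drift = abs(delta) / self.n   # size of the change in the running mean
        if self.n > 1 and drift < self.tol:
            self.calm += 1
        else:
            self.calm = 0
        if self.calm >= self.patience:
            self.frozen = True
        return self.frozen

    @property
    def variance(self):
        """Unbiased sample variance of the values seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    monitor = PatchMonitor(tol=1e-2, patience=3)
    stats = [1.0, 0.6, 0.52, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
    for j, stat in enumerate(stats):
        if monitor.update(stat):
            print("patch frozen at inner iteration", j)
            break
```

A frozen patch would simply be skipped in later inner iterations, which is where the compute saving would come from under this assumed rule.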

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same locality-driven patch refinement could be ported to other partial-observation tasks such as inpainting or novel-view synthesis in the same generators.
Because the method requires no fine-tuning, it may allow rapid testing of new control signals on successive model releases.
If the mixing indicator generalizes, analogous adaptive stopping rules could accelerate sampling in other high-dimensional conditional generation settings.

Load-bearing premise

The unobserved latent sites exhibit enough conditional locality that they can be safely split into independent 3D patches whose separate refinement does not destroy global video consistency.
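As a concrete picture of the partitioning this premise refers to, the sketch below tiles an (L, H, W) site grid into non-overlapping 3D patches and checks that the tiling covers every site exactly once. The patch sizes and the helper names `partition_3d` and `coverage_counts` are assumptions for illustration; the paper's actual patch geometry is not given in this summary.

```python
from itertools import product


def partition_3d(shape, patch):
    """Tile an (L, H, W) grid into non-overlapping 3D patches.

    Returns a list of (l_range, h_range, w_range) tuples; edge patches are
    clipped when the patch size does not divide the grid size.
    """
    (L, H, W), (pl, ph, pw) = shape, patch
    patches = []
    for l0, h0, w0 in product(range(0, L, pl), range(0, H, ph), range(0, W, pw)):
        patches.append((
            range(l0, min(l0 + pl, L)),
            range(h0, min(h0 + ph, H)),
            range(w0, min(w0 + pw, W)),
        ))
    return patches


def coverage_counts(shape, patches):
    """Count how many patches contain each site (1 everywhere for a valid tiling)."""
    counts = {site: 0 for site in product(*(range(n) for n in shape))}
    for ls, hs, ws in patches:
        for site in product(ls, hs, ws):
            counts[site] += 1
    return counts


if __name__ == "__main__":
    shape, patch = (5, 6, 7), (2, 3, 4)
    patches = partition_3d(shape, patch)
    counts = coverage_counts(shape, patches)
    print(len(patches), "patches; exact cover:", set(counts.values()) == {1})
```

An exact cover only guarantees that every site is updated by some patch. Whether refining the patches separately preserves the joint law is the locality question the premise raises.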

What would settle it

A video sequence in which patch-wise refinement produces visible motion seams or trajectory drift across patch boundaries despite the mixing indicator having declared convergence.
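One way to operationalize this test, offered as a hypothetical diagnostic rather than anything the paper reports, is to compare neighbour-to-neighbour jumps that straddle a patch boundary with jumps inside patches. The 1D sketch below assumes a fixed patch length along one axis; `seam_gap` and its arguments are illustrative. A boundary mean well above the interior mean would flag a seam.

```python
def _mean(xs):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(xs) / len(xs) if xs else 0.0


def seam_gap(values, patch_len):
    """Mean absolute neighbour difference across vs. inside patch boundaries.

    `values` is a 1D sequence sampled along one latent axis; boundaries are
    assumed to sit at multiples of `patch_len`.
    Returns (boundary_mean, interior_mean).
    """
    boundary, interior = [], []
    for i in range(len(values) - 1):
        jump = abs(values[i + 1] - values[i])
        # The pair (i, i + 1) straddles a boundary when i + 1 starts a new patch.
        if (i + 1) % patch_len == 0:
            boundary.append(jump)
        else:
            interior.append(jump)
    return _mean(boundary), _mean(interior)


if __name__ == "__main__":
    stepped = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]   # jumps only at boundaries
    smooth = list(range(12))                         # uniform ramp, no seams
    print("stepped:", seam_gap(stepped, 4))
    print("smooth: ", seam_gap(smooth, 4))
```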

Figures

Figures reproduced from arXiv: 2605.11871 by Duo Su, Jun Zhu, Xi Ye, Yangyang Xu, Yuzhu Wang.

**Figure 1.** 2D checkerboard toy example. (a) Sample clouds at $y_{\text{obs}} \approx 0.5$ for ground truth, DPS [11], TFG-UGD [12], and h-control (left to right). (b) Posterior-hit rate vs. total NFE for h-control (varying $J_{\max}$) and TFG-UGD (varying $N_{\text{recur}}$). (c) $|\Delta^{(j)}_W|$ vs. inner iteration $j$, binned by noise band. By Polyak and Juditsky [25] on iterate averaging, this trades the per-sample posterior variance for an $O(\tau_{\text{int}}/J_{\max})$ Monte…

**Figure 2.** Top canonical partial correlation $\rho_1(\hat{R}_{\beta\gamma})$ along the H, W, L axes of the Wan 2.2 latent (N = 200 encoded videos). Off-diagonal mass concentrates within $|\beta - \gamma| \le 2$. Section 3.2, "From Toy to Video: Locality and Block-Conditional Gibbs": At video scale (the Wan 2.2 latent has shape (C, L, H, W) with C = 48 and $\sim 10^5$ sites), a generalized DAE chain on this full sub-state mixes too slowly (per-probe sampling variance scal…

**Figure 3.** Qualitative results on RealEstate10K. Compared with the baselines, our method generates…

**Figure 4.** Qualitative results on DAVIS. On dynamic scenes,…

**Figure 5.** Stability mask $S_g$ evolution at the initial outer guidance step (noise level $\sigma_{t_s}$). The stable region grows with $j$ to cover the unobserved support. … weighted-h-transform of Wang et al. [16] extends DPS with a global-scalar confidence weight, and Zhu et al. [46] pursue the same formal object via fine-tuning rather than inference-time guidance. Inference-time conditional samplers such as the Twisted Diffusion Sa…

**Figure 6.** Top canonical partial correlation $\rho_1(\hat{R}_{\beta\gamma})$ on the model's clean prediction $\hat{z}_0(z_t, t, c)$ along the H, W, L axes (left to right) at five noise levels. The diagonal band stays sharp at every $\sigma_t$, with off-diagonal mass concentrated within $|\beta - \gamma| \le 2$, confirming that the locality structure of…

**Figure 7.** Failure modes. Severe depth-estimation error produces incorrect warping, which leads to poor camera control.

**Figure 8.** Additional RealEstate10K qualitative comparisons.

**Figure 9.** More comparisons with state-of-the-art methods on DAVIS.

Original abstract

Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance trajectory adherence against visual quality, and heuristic guidance-strength tuning lacks robustness. We propose h-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, h-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

h-control adds an inner block-conditional Gibbs loop on 3D patches to hard-replacement guidance for training-free camera control and reports top FVD scores, but the patch independence assumption looks shaky for camera-induced long-range effects.


The main thing to know is that this paper changes the sampler itself rather than just tuning guidance. It wraps each outer hard-replacement step with an inner pseudo-Gibbs refinement on the unobserved latent sites at fixed noise level, then splits those sites into 3D patches that get tracked by mixing indicators and frozen when they converge. That combination plus the adaptive freezing is the concrete novelty over prior training-free baselines for flow-matching video models conditioned on depth-warped trajectories.

Referee Report

2 major / 1 minor

Summary. The paper introduces h-control, a training-free camera control method for pretrained flow-matching video generators. It frames the task as a partial-observation inverse problem and augments hard-replacement guidance with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved latent complement. Conditional locality is exploited by partitioning into 3D patches tracked by adaptive mixing indicators that freeze converged patches. The method claims provable convergence to the partial-observation conditional data law and reports the best FVD on RealEstate10K and DAVIS against seven training-free and training-based competitors, outperforming all training-free baselines on every metric.

Significance. If the convergence guarantee holds and patch-wise refinement preserves global consistency under camera-induced dependencies, the approach would provide a principled, training-free alternative to heuristic guidance tuning, improving the trade-off between trajectory adherence and visual quality. Credit is due for building directly on existing pretrained priors and standard sampling theory without introducing new free parameters or fitted quantities. The empirical claim of best-in-class FVD is potentially impactful for video generation benchmarks if the experimental protocol is fully specified.

major comments (2)

Abstract: The central claim of 'provable convergence to the partial-observation conditional data law' is asserted without any theorem statement, derivation, mixing-time bound, or error analysis. This is load-bearing for the contribution, as the block-conditional pseudo-Gibbs sampler is the structural change proposed to resolve the guidance-quality trade-off.
Abstract: The partitioning of unobserved latents into independent 3D patches with per-patch mixing indicators assumes sufficient conditional locality. No bound or argument is supplied showing that patch-wise updates preserve the joint conditional law when camera motion induces long-range correlations (e.g., consistent parallax and occlusion boundaries). This directly engages the stress-test concern and risks global coherence violations even if local indicators report convergence.

minor comments (1)

Abstract: The seven competitors are referenced but neither named nor described (e.g., which are training-free vs. training-based), and no quantitative FVD values or baseline details are supplied, limiting verification of the 'best-in-class' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify two important aspects of the presentation that warrant clarification and strengthening. We address each below and indicate the revisions we will incorporate.


Referee: Abstract: The central claim of 'provable convergence to the partial-observation conditional data law' is asserted without any theorem statement, derivation, mixing-time bound, or error analysis. This is load-bearing for the contribution, as the block-conditional pseudo-Gibbs sampler is the structural change proposed to resolve the guidance-quality trade-off.

Authors: We agree that the abstract would benefit from a more explicit pointer to the supporting argument. Section 3.2 of the manuscript derives the convergence result by showing that the block-conditional pseudo-Gibbs updates are exact conditional samples from the pretrained flow-matching model at fixed noise level and that the overall procedure is a valid MCMC kernel for the partial-observation conditional. We will revise the abstract to reference this derivation and add a concise theorem statement (with a one-paragraph proof sketch) in the main text to make the claim self-contained. (Revision: yes.)
Referee: Abstract: The partitioning of unobserved latents into independent 3D patches with per-patch mixing indicators assumes sufficient conditional locality. No bound or argument is supplied showing that patch-wise updates preserve the joint conditional law when camera motion induces long-range correlations (e.g., consistent parallax and occlusion boundaries). This directly engages the stress-test concern and risks global coherence violations even if local indicators report convergence.

Authors: This is a substantive concern. The method exploits the strong conditional locality observed in the flow-matching latent space for video data, with adaptive freezing intended to preserve consistency. While we do not supply a theoretical error bound quantifying the effect of camera-induced long-range dependencies, the empirical results on RealEstate10K and DAVIS show that global metrics (FVD) and visual coherence remain superior to baselines. We will add a dedicated paragraph in Section 4 discussing the locality assumption, its empirical validation, and the potential for coherence violations under extreme motion, together with additional qualitative examples. (Revision: partial.)
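The invariance argument invoked in the first response can be written out in a few lines. This is the textbook Gibbs-kernel identity, stated under the assumption that each block update is an exact draw from its conditional. The notation ($U$ for the unobserved sites, $O$ for the observed ones, $B_k$ for the $k$-th patch, $\pi$ for the target law) is introduced here for illustration and may differ from the paper's.

```latex
% Target: \pi(z_U \mid z_O), with U partitioned into blocks B_1, \dots, B_K.
% Write z_{-k} for the unobserved sites outside B_k.
% One block update keeps z_{-k} and z_O fixed and draws
%   z'_{B_k} \sim \pi(\,\cdot \mid z_{-k}, z_O).
\begin{align*}
\int \pi(z_{B_k}, z_{-k} \mid z_O)\, \pi(z'_{B_k} \mid z_{-k}, z_O)\, \mathrm{d}z_{B_k}
  &= \pi(z_{-k} \mid z_O)\, \pi(z'_{B_k} \mid z_{-k}, z_O) \\
  &= \pi(z'_{B_k}, z_{-k} \mid z_O).
\end{align*}
```

Each block update therefore leaves the target invariant, and so does any composition of block updates. Invariance alone does not bound the mixing time, and approximate conditionals (as with a learned denoiser) would perturb the stationary law, which is the gap the referee's first comment points at.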

Circularity Check

0 steps flagged

No circularity: derivation builds on standard Gibbs sampling and pretrained priors without reducing claims to fitted inputs or self-referential definitions.


The paper presents h-control as a structural modification to the sampler (outer hard-replacement guidance augmented by inner block-conditional pseudo-Gibbs on the unobserved complement) that claims convergence to the partial-observation conditional. This rests on the standard theory of Gibbs sampling plus an explicit modeling assumption of conditional locality for patch partitioning, rather than any equation or parameter that is defined in terms of the target performance metric. No fitted quantity is renamed as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and the reported FVD improvements are empirical outcomes on RealEstate10K and DAVIS rather than algebraic identities. The derivation chain therefore remains self-contained against external benchmarks and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the existence of a well-defined partial-observation conditional data law for flow-matching latents and on the practical validity of conditional locality for patch-wise refinement.

axioms (1)

Domain assumption: Video latents possess conditional locality that permits safe partitioning into 3D patches whose individual convergence can be monitored independently. Invoked to justify the acceleration technique and adaptive freezing of converged patches.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

partitioning the unobserved complement into 3D patches... order-2 Markov... along each of L, H, W

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 9 internal anchors

[1] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025.
[2] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
[3] Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13416–13426, 2025.
[4] Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781, 2024.
[5] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22875–22889, 2025.
[6] David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2050–2062, 2025.
[7] Chen Hou and Zhibo Chen. Training-free camera control for video generation. arXiv preprint arXiv:2406.10126, 2024.
[8] Zhenghong Zhou, Jie An, and Jiebo Luo. Latent-reframe: Enabling camera control for video diffusion models without training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12779–12789, 2025.
[9] Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In International Conference on Machine Learning, pages 22825–22855. PMLR, 2023.
[10] Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J Zico Kolter, Ruslan Salakhutdinov, et al. Manifold preserving guided diffusion. arXiv preprint arXiv:2311.16424, 2023.
[11] Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022.
[12] Haotian Ye, Haowei Lin, Jiaqi Han, Minkai Xu, Sheng Liu, Yitao Liang, Jianzhu Ma, James Zou, and Stefano Ermon. Tfg: Unified training-free guidance for diffusion models. Advances in Neural Information Processing Systems, 37:22370–22417, 2024.
[13] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022.
[14] Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, and Or Litany. Time-to-move: Training-free motion controlled video generation via dual-clock denoising. arXiv preprint arXiv:2511.08633, 2025.
[15] Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, and Chi Zhang. Taming video models for 3d and 4d generation via zero-shot camera control. arXiv preprint arXiv:2509.15130, 2025.
[16] Yanghao Wang, Ziqi Jiang, Zhen Wang, and Long Chen. Coarse-guided visual generation via weighted h-transform sampling. arXiv preprint arXiv:2603.12057, 2026.
[17] Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems, 37:122458–122483, 2024.
[18] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. Advances in neural information processing systems, 26, 2013.
[19] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations, 2021.
[20] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[21] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[22] L Chris G Rogers and David Williams. Diffusions, Markov processes, and martingales, volume 2. Cambridge university press, 2000.
[23] Alexander Denker, Francisco Vargas, Shreyas Padhy, Kieran Didi, Simon Mathis, Vincent Dutordoir, Riccardo Barbano, Emile Mathieu, Urszula J Komorowska, and Pietro Lio. Deft: Efficient fine-tuning of diffusion models by learning the generalised h-transform. Advances in Neural Information Processing Systems, 37:19636–19682, 2024.
[24] Luhuan Wu, Brian Trippe, Christian Naesseth, David Blei, and John P Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. Advances in Neural Information Processing Systems, 36:31372–31403, 2023.
[25] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.
[26] Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, and Sung Ju Hwang. Self-refining video sampling. arXiv preprint arXiv:2601.18577, 2026.
[27] Barry Payne Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, 1962.
[28] Jun S Liu, Wing Hung Wong, and Augustine Kong. Covariance structure of the gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, pages 27–40, 1994.
[29] Gareth O Roberts and Sujit K Sahu. Updating schemes, correlation structure, blocking and parameterization for the gibbs sampler. Journal of the Royal Statistical Society Series B: Statistical Methodology, 59(2):291–317, 1997.
[30] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. Advances in neural information processing systems, 29, 2016.
[31] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[32] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
[33] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[34] Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, and Xingang Pan. Trajectory attention for fine-grained video motion control. arXiv preprint arXiv:2411.19324, 2024.
[35] Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 100–111, 2025.
[36] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
[37] Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems, 37:16240–16271, 2024.
[38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[39] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10486–10496, 2025.
[40] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
[41]

Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints.arXiv preprint arXiv:2412.07760, 2024

Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints.arXiv preprint arXiv:2412.07760, 2024

work page arXiv 2024
[42]

Generative camera dolly: Extreme monocular dynamic novel view synthesis

Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl V ondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. InEuropean Conference on Computer Vision, pages 313–331. Springer, 2024

work page 2024
[43]

Zero4d: Training-free 4d video generation from single video using off-the-shelf video diffusion.arXiv preprint arXiv:2503.22622, 2025

Jangho Park, Taesung Kwon, and Jong Chul Ye. Zero4d: Training-free 4d video generation from single video using off-the-shelf video diffusion.arXiv preprint arXiv:2503.22622, 2025

work page arXiv 2025
[44]

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

[45] Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, et al. Vid-camedit: Video camera trajectory editing with generative rendering from estimated geometry. arXiv preprint arXiv:2506.13697, 2025.

[46] Qijie Zhu, Zeqi Ye, Han Liu, Zhaoran Wang, and Minshuo Chen. Training-free adaptation of diffusion models via doob’s h-transform. arXiv preprint arXiv:2602.16198, 2026.

[47] Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. arXiv preprint arXiv:2501.06848, 2025.

[48] Suhas Sreehari, S Venkat Venkatakrishnan, Brendt Wohlberg, Gregery T Buzzard, Lawrence F Drummy, Jeffrey P Simmons, and Charles A Bouman. Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Transactions on Computational Imaging, 2(4):408–423, 2016.

[49] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM journal on imaging sciences, 10(4):1804–1844, 2017.

[50] Yuyang Hu, Mauricio Delbracio, Peyman Milanfar, and Ulugbek Kamilov. A restoration network as an implicit prior. In The Twelfth International Conference on Learning Representations, 2023.

[51] Matthieu Terris, Ulugbek S Kamilov, and Thomas Moreau. Fire: Fixed-points of restoration priors for solving inverse problems. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23185–23194, 2025.

[52] Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.

[53] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.

[54] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.

[55] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

[56] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

A Extended Related Work and Positioning

This appendix expands Section 5 with a technical positioning of h-control against the four research lines it sits...

[57] provides a generative-modeling perspective on the same problem. The common requirement across this family is that at least one component (backbone, adapter, or refinement head) is fine-tuned to internalize the trajectory-to-video correspondence.

Training-free controllers. TTM [14] and WorldForge [15] construct a warped guidance video by lifting the sour...

[58] fine-tunes the base model to internalize the same drift. Position of h-control: we extend the global scalar λσt to a spatially non-uniform mask M and pair it with a novel inner refinement on the unobserved support, while keeping the conditioning entirely at inference time, with no extra network and no fine-tuning.

Sequential Monte Carlo and Feynman–Kac. Twisted Di...

