arXiv: 2601.18993 · v2 · pith:NMXOFFFRnew · submitted 2026-01-26 · 💻 cs.CV · cs.AI · cs.GR

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

Pith reviewed 2026-05-21 13:59 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR
keywords camera redirection · 4D reconstruction · foreground completion · video diffusion · monocular video · training-free · geometric scaffold

The pith

A complete foreground 4D proxy built from monocular video supplies geometric scaffolds that steer consistent camera redirection in generated videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem of replaying a dynamic scene from new camera paths when only one narrow monocular video exists. It separates the fixed background from the moving foreground, then uses an object-focused diffusion model to imagine the unseen sides of the foreground and build a full point-cloud model in a canonical space. This model is aligned back into the global scene coordinates through dense 3D correspondences, creating a 4D proxy. Projecting the proxy into desired viewpoints gives structural guidance to a conditional video diffusion model. If the approach holds, large-angle redirections become geometrically grounded rather than relying solely on learned priors that often fail when visual information is missing.

Core claim

FreeOrbit4D recovers a foreground-complete 4D proxy by first unprojecting the input video into a static background point cloud and a partial foreground point cloud inside a shared global space. An object-centric multi-view diffusion model then synthesizes additional images of the foreground to permit complete point-cloud reconstruction in canonical object space. Dense pixel-synchronized 3D-3D correspondences align the canonical foreground back into global coordinates. The resulting aligned 4D proxy is projected onto target camera rays to supply explicit geometric scaffolds that condition a video diffusion model for the final redirected output.
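
The unproject-then-project geometry in this paragraph can be illustrated with a minimal pinhole-camera sketch. It is an assumed illustration, not the authors' code: the function names, the intrinsics `fx, fy, cx, cy`, and the camera-to-world pose convention `X_world = R @ X_cam + t` are all hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]  # row-major 3x3 rotation matrix


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with depth along the optical axis into camera space."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def cam_to_world(p: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Apply an assumed camera-to-world pose: X_world = R @ X_cam + t."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def world_to_cam(p: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Invert the pose above: X_cam = R^T @ (X_world - t)."""
    d = [p[j] - t[j] for j in range(3)]
    x, y, z = (sum(R[j][i] * d[j] for j in range(3)) for i in range(3))
    return (x, y, z)


def project(p: Vec3, fx: float, fy: float,
            cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project a camera-space point to pixels; None if behind the camera."""
    x, y, z = p
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


if __name__ == "__main__":
    identity: Mat3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    intrinsics = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy
    # Source camera at the origin sees the image centre at depth 2.
    point_world = cam_to_world(unproject(320.0, 240.0, 2.0, *intrinsics),
                               identity, (0.0, 0.0, 0.0))
    # Target camera shifted one unit along +x, same orientation.
    point_target = world_to_cam(point_world, identity, (1.0, 0.0, 0.0))
    print(project(point_target, *intrinsics))
```

Rendering a scaffold for a whole target trajectory would repeat the projection for every proxy point and every target pose, with a depth test to resolve occlusions; that part is omitted here.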

What carries the argument

foreground-complete 4D proxy: a completed foreground point cloud reconstructed in canonical space, aligned to the global scene via 3D correspondences, and projected as geometric guidance for the video generator.

If this is right

  • Redirected videos remain geometrically faithful and temporally coherent even when the new camera path deviates substantially from the input trajectory.
  • The framework operates without any training on camera-redirection data.
  • The same 4D proxy supports downstream tasks such as propagating edits through the video and synthesizing additional 4D training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alignment step could be tested on videos with fast-moving or deforming objects to see whether the 3D correspondences remain reliable.
  • If the proxy is sufficiently accurate, the same scaffolding idea might transfer to other dynamic-scene tasks such as inserting new objects or changing lighting while preserving motion consistency.

Load-bearing premise

An object-centric multi-view diffusion model can synthesize accurate additional views of the foreground object from the narrow observations present in the original monocular video.

What would settle it

Take a monocular video containing a foreground object with clearly asymmetric parts hidden from the original camera; apply a large redirection trajectory and inspect whether the generated frames correctly reveal those hidden parts or instead produce geometric distortions or hallucinations.

Figures

Figures reproduced from arXiv: 2601.18993 by Fengrui Tian, Hao Zhang, Ning Yu, Shenlong Wang, Wei Cao, Yaoyao Liu, Yingying Li, Yulun Wu.

Figure 1: FreeOrbit4D enables training-free camera redirection from a single monocular video to arbitrary target camera trajectories. Given a source video (left) and a target trajectory (middle), our method produces a redirected video (right) with faithful appearance and strong temporal coherence under large-angle redirected camera motions, even including bullet-time orbits, demonstrated on diverse scenes and subjec…
Figure 2: Comparison of video camera redirection paradigms. We compare 3 representative approaches for camera redirection from monocular video. (A) Implicit Control: Camera motion is specified via learned embeddings. Such implicit representations provide only soft controllability: text cannot precisely describe complex trajectories, and learned conditions often fail to follow the intended path (e.g., the “turn to ba…
Figure 3: Overview of FreeOrbit4D. Our framework redirects a monocular video to a target camera trajectory via a geometry-complete 4D proxy. This proxy is constructed through two branches: Global Scene Reconstruction recovers background and partial foreground in global space, while Canonical Object Completion reconstructs complete foreground geometry via multi-view synthesis. After alignment, we render view-dependen…
Figure 4: Decoupled 4D reconstruction and alignment pipeline. Left (Sec. 3.1): A dynamic-aware feed-forward network lifts the source video $V^{\mathrm{src}}$ into global scene space, producing the static background $P^{\mathrm{bg}}$ and geometry-incomplete foreground $\tilde{P}^{\mathrm{fg}}_t$ (orange). Middle (Sec. 3.1): The foreground sequence $\{I^{\mathrm{fg}}_t\}_{t=1}^{T}$ is fed into an object-centric video diffusion model to synthesize multi-view images, from which VGGT …
Figure 5: Qualitative comparison on challenging dynamic sequences: “Swing” (top) and “Bmx” (bottom). For each sequence, the top-left inset visualizes our reconstructed 4D geometry-complete proxy and the target camera trajectory. These scenarios feature rapid foreground motion, thin structures (e.g., swing ropes), and significant perspective changes. Existing methods exhibit distinct failure modes: ReCamMaster [3] an…
Figure 6: Applications enabled by FreeOrbit4D. Our explicit 4D representation enables various downstream applications. Top: Appearance editing. Given a single edited reference frame (e.g., zebra pattern or anime style), our geometry-complete proxy propagates the edit consistently across all novel viewpoints. Bottom: Geometry editing. By directly manipulating the point cloud (scaling or compositing objects from diffe…
Figure 7: Ablation: Simple Inference vs. Ours. Directly combining multi-view images with source video in a feed-forward network leads to temporal correspondence collapse and ghosting artifacts (left). Our decoupled strategy produces a coherent 4D reconstruction (right). Simple Inference. A straightforward baseline is to directly feed the multi-view images together with the source video into a dynamic-aware feed-…
Figure 8: Additional qualitative comparisons on various sequences. Compared to baseline methods, which suffer from geometric distortions, motion blur, and semantic drift under large viewpoint changes, our method produces consistently sharp textures and stable geometry. These results demonstrate the effectiveness of our geometry-complete 4D proxy in handling fast motion and complex backgrounds.
Figure 9: Multi-trajectory video synthesis results. We show that our method generates temporally and geometrically consistent videos along diverse target camera trajectories. By leveraging the geometry-complete 4D proxy, our approach effectively handles articulated motion, complex lighting, and thin structures across all sequences.
Figure 10: User study interface. We present the participants with two anonymized videos (ours vs. baseline) and ask them to select the one with superior temporal stability and geometric fidelity.
Original abstract

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FreeOrbit4D, a training-free framework for arbitrary camera redirection in monocular videos of dynamic scenes. It decouples foreground and background by unprojecting the input video into a static background and partial foreground point clouds in global space, employs an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct a complete foreground point cloud in canonical object space, aligns the canonical foreground to global space via dense pixel-synchronized 3D-3D correspondences, and projects the resulting foreground-complete 4D proxy onto target camera viewpoints to supply geometric scaffolds that condition a video diffusion model for the redirected output. The work claims improved faithfulness and temporal coherence under large-angle trajectories compared to prior diffusion-based methods, plus utility for edit propagation and 4D data generation.

Significance. If the central claim holds, the result would be significant for computer vision and graphics: it provides an explicit, training-free geometric proxy to mitigate ambiguity in large-viewpoint video synthesis, which could improve reliability in applications such as novel-view video replay, virtual cinematography, and dynamic scene editing without dataset-specific fine-tuning.

major comments (2)
  1. The pipeline's load-bearing step is the object-centric multi-view diffusion model's ability to synthesize geometrically consistent multi-view images of the foreground from the narrow viewpoint range and motion present in a monocular video (described in the abstract and method overview). The manuscript provides no quantitative evaluation, error analysis, or ablation of reconstruction accuracy or viewpoint consistency for this completion step; if the synthesized views contain inconsistencies or hallucinations, the canonical point cloud, its alignment via 3D-3D correspondences, and the projected scaffolds will propagate errors directly into the final redirected video.
  2. Abstract: the claim of 'more faithful and temporally coherent redirected videos under challenging large-angle trajectories' is presented without reference to specific metrics, baseline comparisons, or failure-case analysis in the provided text. This makes it impossible to assess whether the 4D proxy actually resolves the geometric ambiguity that prior diffusion methods encounter.
minor comments (2)
  1. Clarify the exact form of the dense pixel-synchronized 3D-3D correspondences used for alignment (e.g., whether they are computed from optical flow, depth, or learned features) and how they handle dynamic foreground motion.
  2. The abstract mentions enabling 'edit propagation and 4D data generation' as applications of the proxy; a brief qualitative example or description of these uses would strengthen the contribution without requiring new experiments.
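
As background for minor comment 1: the text available here does not say which estimator turns the 3D-3D correspondences into an alignment. One standard possibility, shown only as an assumed illustration and not as the authors' method, is a least-squares similarity transform in the style of Umeyama between corresponding canonical points $p_i$ and global points $q_i$. The symbols $s$, $R$, $t$, $p_i$, $q_i$ are notation introduced for this sketch.

```latex
\begin{aligned}
(s^{*}, R^{*}, t^{*}) &= \arg\min_{s>0,\; R\in SO(3),\; t\in\mathbb{R}^{3}}
    \sum_{i=1}^{n} \bigl\lVert s R p_i + t - q_i \bigr\rVert^{2} \\
\mu_p &= \tfrac{1}{n}\sum_{i=1}^{n} p_i, \qquad
\mu_q = \tfrac{1}{n}\sum_{i=1}^{n} q_i, \qquad
\sigma_p^{2} = \tfrac{1}{n}\sum_{i=1}^{n} \lVert p_i - \mu_p \rVert^{2} \\
\Sigma &= \tfrac{1}{n}\sum_{i=1}^{n} (q_i - \mu_q)(p_i - \mu_p)^{\top} = U D V^{\top} \\
S &= \operatorname{diag}\bigl(1,\, 1,\, \det(U V^{\top})\bigr) \\
R^{*} &= U S V^{\top}, \qquad
s^{*} = \frac{\operatorname{tr}(D S)}{\sigma_p^{2}}, \qquad
t^{*} = \mu_q - s^{*} R^{*} \mu_p
\end{aligned}
```

Under this assumption, a moving foreground would need the solve repeated per frame, and a deforming one would violate the single-transform model, which is one reason the referee's request for the exact form matters.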

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and describe the revisions we will incorporate to strengthen the presentation and evaluation.

Point-by-point responses
  1. Referee: The pipeline's load-bearing step is the object-centric multi-view diffusion model's ability to synthesize geometrically consistent multi-view images of the foreground from the narrow viewpoint range and motion present in a monocular video (described in the abstract and method overview). The manuscript provides no quantitative evaluation, error analysis, or ablation of reconstruction accuracy or viewpoint consistency for this completion step; if the synthesized views contain inconsistencies or hallucinations, the canonical point cloud, its alignment via 3D-3D correspondences, and the projected scaffolds will propagate errors directly into the final redirected video.

    Authors: We agree that isolating the reconstruction accuracy of the object-centric multi-view diffusion step with dedicated quantitative metrics and ablations would provide clearer validation of this load-bearing component. Our current evaluation emphasizes end-to-end video redirection quality and comparisons against prior methods, which indirectly reflect the proxy's effectiveness. In the revised version we will add a targeted analysis section, including error metrics on synthetic data with available ground truth (e.g., Chamfer distance for point clouds and multi-view consistency scores) and an ablation on the impact of synthesis quality. revision: yes

  2. Referee: Abstract: the claim of 'more faithful and temporally coherent redirected videos under challenging large-angle trajectories' is presented without reference to specific metrics, baseline comparisons, or failure-case analysis in the provided text. This makes it impossible to assess whether the 4D proxy actually resolves the geometric ambiguity that prior diffusion methods encounter.

    Authors: The abstract serves as a concise overview; the full manuscript details the quantitative metrics, baseline comparisons, and analysis in the Experiments section. To improve clarity we will revise the abstract to explicitly cite the evaluation metrics (such as those measuring faithfulness and temporal coherence) and reference the relevant figures and tables that demonstrate improvements over prior diffusion-based approaches under large-angle trajectories. revision: yes
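
The Chamfer distance named in the first response can be sketched in a few lines. This is a minimal brute-force version under an assumed convention (mean squared nearest-neighbor distance, summed over both directions); papers differ on squared versus unsquared distances and on averaging, and the rebuttal does not say which variant would be used. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _nearest_sq_dist(p: Point, cloud: Sequence[Point]) -> float:
    """Squared Euclidean distance from p to its nearest neighbor in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor cost a->b plus b->a."""
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(_nearest_sq_dist(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest_sq_dist(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


if __name__ == "__main__":
    reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(reconstructed, ground_truth))
```

The brute-force search is quadratic in the number of points, so a real evaluation on dense point clouds would normally use a spatial index such as a k-d tree.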

Circularity Check

0 steps flagged

No significant circularity; pipeline is self-contained

full rationale

The paper describes a training-free pipeline that unprojects the monocular video into unified global space for background and partial foreground, invokes an external object-centric multi-view diffusion model to synthesize views and complete the foreground point cloud in canonical space, aligns via dense pixel-synchronized 3D-3D correspondences, and projects the resulting 4D proxy as scaffolds for a separate conditional video diffusion model. No equation or step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise depends on a self-citation chain. The central claim remains an independent structural composition of existing components, consistent with the reader's assessment of score 2 and qualifying as a non-finding under the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach relies on standard assumptions about diffusion model capabilities for multi-view synthesis and point cloud reconstruction accuracy; no explicit free parameters or new invented entities are stated in the abstract.

axioms (2)
  • domain assumption: Object-centric multi-view diffusion models can produce sufficiently accurate novel views of foreground objects from monocular observations.
    Invoked when using the diffusion model to complete the foreground point cloud.
  • domain assumption: Dense pixel-synchronized 3D-3D correspondences can be reliably established between canonical and global spaces.
    Required for the alignment step that merges foreground and background.
invented entities (1)
  • foreground-complete 4D proxy (no independent evidence)
    purpose: Structural grounding to reduce geometric ambiguity in large-angle redirection.
    Constructed via the described pipeline rather than postulated as a new physical entity.

pith-pipeline@v0.9.0 · 5860 in / 1393 out tokens · 45160 ms · 2026-05-21T13:59:16.692883+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D–3D correspondences

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  2. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  [1] Ronald T Azuma. A survey of augmented reality. Presence: teleoperators & virtual environments, 6(4):355–385, 1997.

  [2] Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. VD3D: Taming large video diffusion transformers for 3D camera control. In ICLR, 2025.

  [3] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025.

  [4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  [5] Yuanhao Cai, Zihao Xiao, Yixun Liang, Minghan Qin, Yulun Zhang, Xiaokang Yang, Yaoyao Liu, and Alan L Yuille. Hdr-gs: Efficient high dynamic range novel view synthesis at 1000x speed via gaussian splatting. In NeurIPS, 2024.

  [6] Wei Cao, Chang Luo, Biao Zhang, Matthias Nießner, and Jiapeng Tang. Motion2vecsets: 4D latent vector set diffusion for non-rigid shape reconstruction and tracking. In CVPR,

  [7] Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. In CoRL, 2025.

  [8] Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, and Ranjay Krishna. Omniview: An all-seeing diffusion model for 3d and 4d view synthesis. arXiv preprint arXiv:2512.10940, 2025.

  [9] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In ICCV, 2021.

  [10] Google DeepMind. Veo. https://deepmind.google/models/veo/, 2024. Accessed: 2024-11-05.

  [11] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. In ICLR, 2025.

  [12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.

  [13] Chen Hou and Zhibo Chen. Training-free camera control for video generation. In ICLR, 2025.

  [14] Tao Hu, Haoyang Peng, Xiao Liu, and Yuewen Ma. Ex-4d: Extreme viewpoint 4d video synthesis via depth watertight mesh. arXiv preprint arXiv:2506.05554, 2025.

  [15] Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson WH Lau, Wangmeng Zuo, and Chunchao Guo. Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. arXiv preprint arXiv:2506.04225, 2025.

  [16] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.

  [17] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In ICCV, 2025.

  [18] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D gaussian splatting for real-time radiance field rendering. TOG, 2023.

  [19] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  [20] Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. In NeurIPS, 2025.

  [21] Teng Li, Guangcong Zheng, Rui Jiang, Shuigen Zhan, Tao Wu, Yehao Lu, Yining Lin, and Xi Li. Realcam-i2v: Real-world image-to-video generation with interactive complex camera control. In ICCV, 2025.

  [22] Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang, Chongjie Ye, Bohan Li, Zhiguo Cao, Wei Li, Hao Zhao, et al. Light-x: Generative 4d video rendering with camera and illumination control. arXiv preprint arXiv:2512.05115, 2025.

  [23] Yawen Luo, Jianhong Bai, Xiaoyu Shi, Menghan Xia, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Tianfan Xue. Camclonemaster: Enabling reference-based camera control for video generation. arXiv preprint arXiv:2506.03140, 2025.

  [24] Wufei Ma, Qihao Liu, Jiahao Wang, Angtian Wang, Xiaoding Yuan, Yi Zhang, Zihao Xiao, Guofeng Zhang, Beijia Lu, Ruxiao Duan, Yongrui Qi, Adam Kortylewski, Yaoyao Liu, and Alan Yuille. Generating images with 3d annotations using diffusion models. In ICLR, 2024.

  [25] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis, 2020.

  [26] OpenAI. Sora: Creating video from text. https://openai.com/index/sora/, 2025. Accessed: 2024-11-05.

  [27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  [28] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.

  [29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  [30] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. In ICLR, 2025.

  31. [31]

    Gen3c: 3d-informed world-consistent video generation with precise camera con- trol

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas M ¨uller, Alexan- der Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera con- trol. InCVPR, 2025. 2, 3, 6, 7

  32. [32]

    Worldforge: Unlocking emergent 3d/4d generation in video diffusion model via training-free guidance.arXiv preprint arXiv:2509.15130, 2025

    Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, and Chi Zhang. Worldforge: Unlocking emergent 3d/4d generation in video diffusion model via training-free guidance.arXiv preprint arXiv:2509.15130, 2025. 3

  33. [33]

    TCVM: Temporal contrasting video montage framework for self-supervised video representation learning

    Fengrui Tian, Jiawei Fan, Xie Yu, Shaoyi Du, Meina Song, and Yu Zhao. TCVM: Temporal contrasting video montage framework for self-supervised video representation learning. InACCV, 2022. 3

  34. [34]

    MonoN- eRF: Learning a generalizable dynamic radiance field from monocular videos

    Fengrui Tian, Shaoyi Du, and Yueqi Duan. MonoN- eRF: Learning a generalizable dynamic radiance field from monocular videos. InICCV, 2023. 2, 3

  35. [35]

    Semantic Flow: Learning semantic fields of dy- namic scenes from monocular videos

    Fengrui Tian, Yueqi Duan, Angtian Wang, Jianfei Guo, and Shaoyi Du. Semantic Flow: Learning semantic fields of dy- namic scenes from monocular videos. InICLR, 2024. 3

  36. [36]

    V oyaging into perpetual dynamic scenes from a single view

    Fengrui Tian, Tianjiao Ding, Jinqi Luo, Hancheng Min, and Ren´e Vidal. V oyaging into perpetual dynamic scenes from a single view. InICCV, 2025. 3

  37. [37]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Rapha¨el Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. InICLR, 2019. 7

  38. [38]

    Generative camera dolly: Ex- treme monocular dynamic novel view synthesis

    Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sar- gent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl V ondrick. Generative camera dolly: Ex- treme monocular dynamic novel view synthesis. InECCV,

  39. [39]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

  40. [40]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025. 3, 4, 7

  41. [41]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In SIGGRAPH, 2024. 3

  42. [42]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025. 8

  43. [43]

    Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency

    Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency. In ICLR, 2025. 3

  44. [44]

    CogVideoX: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2024. 3

  45. [45]

    SV4D2.0: Enhancing spatio-temporal consistency in multi-view video diffusion for high-quality 4d generation

    Chun-Han Yao, Yiming Xie, Vikram Voleti, Huaizu Jiang, and Varun Jampani. SV4D2.0: Enhancing spatio-temporal consistency in multi-view video diffusion for high-quality 4d generation. In ICCV, 2025. 4, 7

  46. [46]

    Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In ICCV, 2025. 2, 3, 6, 7

  47. [47]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024. 2, 3

  48. [48]

    Stable part diffusion 4d: Multi-view rgb and kinematic parts video generation

    Hao Zhang, Chun-Han Yao, Simon Donné, Narendra Ahuja, and Varun Jampani. Stable part diffusion 4d: Multi-view rgb and kinematic parts video generation. In NeurIPS, 2025. 2

  49. [49]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 3

  50. [50]

    Spatialcrafter: Unleashing the imagination of video diffusion models for scene reconstruction from limited observations

    Songchun Zhang, Huiyao Xu, Sitong Guo, Zhongwei Xie, Pengwei Liu, Hujun Bao, Weiwei Xu, and Changqing Zou. Spatialcrafter: Unleashing the imagination of video diffusion models for scene reconstruction from limited observations. In ICCV, 2025. 3

  51. [51]

    Joint 3d geometry reconstruction and motion generation for 4d synthesis from a single image

    Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Joint 3d geometry reconstruction and motion generation for 4d synthesis from a single image. arXiv preprint arXiv:2512.05044, 2025. 3

  52. [52]

    What matters to enhance traffic rule compliance of imitation learning for end-to-end autonomous driving

    Hongkuan Zhou, Wei Cao, Aifen Sui, and Zhenshan Bing. What matters to enhance traffic rule compliance of imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:2309.07808, 2023. 2

  53. [53]

    Predicting the road ahead: A knowledge graph based foundation model for scene understanding in autonomous driving

    Hongkuan Zhou, Stefan Schimid, Yicong Li, Lavdim Halilaj, Xiangtong Yao, and Wei Cao. Predicting the road ahead: A knowledge graph based foundation model for scene understanding in autonomous driving. In European Semantic Web Conference, 2025. 2

  54. [54]

    PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

    Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, and Mengyu Wang. Page-4d: Disentangled pose and geometry estimation for 4d perception. arXiv preprint arXiv:2510.17568, 2025. 3, 4, 7

Figure (a): Qualitative comparison on the “Camel” sequence (EX4D, ReCamMaster, TrajectoryCrafter, Ours, GEN3C).