FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

Fengrui Tian; Hao Zhang; Ning Yu; Shenlong Wang; Wei Cao; Yaoyao Liu; Yingying Li; Yulun Wu

arXiv: 2601.18993 · v2 · pith:NMXOFFFRnew · submitted 2026-01-26 · cs.CV · cs.AI · cs.GR

Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, Yaoyao Liu

Pith reviewed 2026-05-21 13:59 UTC · model grok-4.3

classification: cs.CV, cs.AI, cs.GR

keywords: camera redirection, 4D reconstruction, foreground completion, video diffusion, monocular video, training-free, geometric scaffold

The pith

A complete foreground 4D proxy built from monocular video supplies geometric scaffolds that steer consistent camera redirection in generated videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem of replaying a dynamic scene from new camera paths when only one narrow monocular video exists. It separates the fixed background from the moving foreground, then uses an object-focused diffusion model to imagine the unseen sides of the foreground and build a full point-cloud model in a canonical space. This model is aligned back into the global scene coordinates through dense 3D correspondences, creating a 4D proxy. Projecting the proxy into desired viewpoints gives structural guidance to a conditional video diffusion model. If the approach holds, large-angle redirections become geometrically grounded rather than relying solely on learned priors that often fail when visual information is missing.

Core claim

FreeOrbit4D recovers a foreground-complete 4D proxy by first unprojecting the input video into a static background point cloud and a partial foreground point cloud inside a shared global space. An object-centric multi-view diffusion model then synthesizes additional images of the foreground to permit complete point-cloud reconstruction in canonical object space. Dense pixel-synchronized 3D-3D correspondences align the canonical foreground back into global coordinates. The resulting aligned 4D proxy is projected onto target camera rays to supply explicit geometric scaffolds that condition a video diffusion model for the final redirected output.
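
As an illustration of the first step only, the NumPy sketch below lifts one frame's depth map into world-space points and splits them with a foreground mask. It assumes a pinhole camera with intrinsics `K`, a 4x4 camera-to-world pose, and per-pixel depth and mask that are already available. The function name `unproject_depth` and all argument names are hypothetical; the text above does not say which network or conventions the paper uses to obtain these inputs.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world, fg_mask):
    """Lift a depth map to world-space points, split by a boolean foreground mask.

    depth: (H, W) metric depth, K: (3, 3) pinhole intrinsics,
    cam_to_world: (4, 4) pose, fg_mask: (H, W) bool. All names are illustrative.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project each pixel to a ray with z = 1, then scale by its depth.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Move from camera coordinates to the shared global space.
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    valid = depth.reshape(-1) > 0
    fg = fg_mask.reshape(-1) & valid
    bg = ~fg_mask.reshape(-1) & valid
    return pts_world[bg], pts_world[fg]
```

Running this per frame gives one partial foreground cloud per time step. A static background cloud would additionally require merging the background points across frames, which this sketch leaves out.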

What carries the argument

foreground-complete 4D proxy: a completed foreground point cloud reconstructed in canonical space, aligned to the global scene via 3D correspondences, and projected as geometric guidance for the video generator.
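
To make "projected as geometric guidance" concrete, here is a minimal sketch of splatting a colored point cloud into a target view with a depth buffer. It assumes a pinhole camera, a 4x4 world-to-camera matrix, and one-pixel splats. `render_scaffold` and its arguments are hypothetical names, and the paper's actual renderer and conditioning format are not described in this text.

```python
import numpy as np

def render_scaffold(points, colors, K, world_to_cam, h, w):
    """Project colored 3D points into a target camera, keeping the nearest per pixel.

    Returns an (h, w, 3) image and an (h, w) bool mask of covered pixels.
    """
    pts_cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = pts_cam[:, 2]
    front = z > 1e-6  # drop points behind the camera
    pts_cam, z, colors = pts_cam[front], z[front], colors[front]
    uvw = pts_cam @ K.T
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]
    # Depth buffer: per-pixel minimum depth over all points that land there.
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (v, u), z)
    # Only points that own the minimum depth of their pixel are drawn.
    nearest = z == zbuf[v, u]
    image = np.zeros((h, w, 3), dtype=colors.dtype)
    image[v[nearest], u[nearest]] = colors[nearest]
    return image, np.isfinite(zbuf)
```

The mask marks pixels where the proxy gives evidence; uncovered pixels are the holes a conditional generator would have to fill.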

If this is right

Redirected videos remain geometrically faithful and temporally coherent even when the new camera path deviates substantially from the input trajectory.
The framework operates without any training on camera-redirection data.
The same 4D proxy supports downstream tasks such as propagating edits through the video and synthesizing additional 4D training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The alignment step could be tested on videos with fast-moving or deforming objects to see whether the 3D correspondences remain reliable.
If the proxy is sufficiently accurate, the same scaffolding idea might transfer to other dynamic-scene tasks such as inserting new objects or changing lighting while preserving motion consistency.

Load-bearing premise

An object-centric multi-view diffusion model can synthesize accurate additional views of the foreground object from the narrow observations present in the original monocular video.

What would settle it

Take a monocular video containing a foreground object with clearly asymmetric parts hidden from the original camera; apply a large redirection trajectory and inspect whether the generated frames correctly reveal those hidden parts or instead produce geometric distortions or hallucinations.

Figures

Figures reproduced from arXiv: 2601.18993 by Fengrui Tian, Hao Zhang, Ning Yu, Shenlong Wang, Wei Cao, Yaoyao Liu, Yingying Li, Yulun Wu.

**Figure 1.** FreeOrbit4D enables training-free camera redirection from a single monocular video to arbitrary target camera trajectories. Given a source video (left) and a target trajectory (middle), our method produces a redirected video (right) with faithful appearance and strong temporal coherence under large-angle redirected camera motions, even including bullet-time orbits, demonstrated on diverse scenes and subjec…

**Figure 2.** Comparison of video camera redirection paradigms. We compare 3 representative approaches for camera redirection from monocular video. (A) Implicit Control: Camera motion is specified via learned embeddings. Such implicit representations provide only soft controllability: text cannot precisely describe complex trajectories, and learned conditions often fail to follow the intended path (e.g., the “turn to ba…

**Figure 3.** Overview of FreeOrbit4D. Our framework redirects a monocular video to a target camera trajectory via a geometry-complete 4D proxy. This proxy is constructed through two branches: Global Scene Reconstruction recovers background and partial foreground in global space, while Canonical Object Completion reconstructs complete foreground geometry via multi-view synthesis. After alignment, we render view-dependen…

**Figure 4.** Decoupled 4D reconstruction and alignment pipeline. Left (Sec. 3.1): A dynamic-aware feed-forward network lifts the source video $V^{\mathrm{src}}$ into global scene space, producing the static background $P^{\mathrm{bg}}$ and geometry-incomplete foreground $\tilde{P}^{\mathrm{fg}}_t$ (orange). Middle (Sec. 3.1): The foreground sequence $\{I^{\mathrm{fg}}_t\}_{t=1}^{T}$ is fed into an object-centric video diffusion model to synthesize multi-view images, from which VGGT …

**Figure 5.** Qualitative comparison on challenging dynamic sequences: “Swing” (top) and “Bmx” (bottom). For each sequence, the top-left inset visualizes our reconstructed 4D geometry-complete proxy and the target camera trajectory. These scenarios feature rapid foreground motion, thin structures (e.g., swing ropes), and significant perspective changes. Existing methods exhibit distinct failure modes: ReCamMaster [3] an…

**Figure 6.** Applications enabled by FreeOrbit4D. Our explicit 4D representation enables various downstream applications. Top: Appearance editing. Given a single edited reference frame (e.g., zebra pattern or anime style), our geometry-complete proxy propagates the edit consistently across all novel viewpoints. Bottom: Geometry editing. By directly manipulating the point cloud (scaling or compositing objects from diffe…

**Figure 7.** Ablation: Simple Inference vs. Ours. Directly combining multi-view images with source video in a feed-forward network leads to temporal correspondence collapse and ghosting artifacts (left). Our decoupled strategy produces a coherent 4D reconstruction (right). Simple Inference. A straightforward baseline is to directly feed the multi-view images together with the source video into a dynamic-aware feed-…

**Figure 8.** Additional qualitative comparisons on various sequences. Compared to baseline methods, which suffer from geometric distortions, motion blur, and semantic drift under large viewpoint changes, our method produces consistently sharp textures and stable geometry. These results demonstrate the effectiveness of our geometry-complete 4D proxy in handling fast motion and complex backgrounds.

**Figure 9.** Multi-trajectory video synthesis results. We show that our method generates temporally and geometrically consistent videos along diverse target camera trajectories. By leveraging the geometry-complete 4D proxy, our approach effectively handles articulated motion, complex lighting, and thin structures across all sequences.

**Figure 10.** User study interface. We present the participants with two anonymized videos (ours vs. baseline) and ask them to select the one with superior temporal stability and geometric fidelity.

Abstract

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a foreground-complete 4D proxy via object-centric multi-view diffusion and alignment to condition redirection. This is a direct structural response to large-angle coherence issues, but it rests on validation that the abstract does not show.

The paper's main move is to recover a foreground-complete 4D proxy from a monocular video by first unprojecting to partial clouds, then using an object-centric multi-view diffusion model to fill in the missing views and reconstruct a complete canonical point cloud for the foreground. They align that back to the global space with pixel-synchronized correspondences and project it to guide the final conditional video diffusion.

This approach stands out for explicitly handling the foreground separately from the background, which makes sense for dynamic scenes where objects move independently. The alignment step using dense 3D-3D matches is a concrete way to keep everything consistent. It does well at identifying the core issue with large-angle changes and proposing this proxy as geometric scaffolding instead of pure generation.

The soft spot is that everything hinges on the multi-view diffusion producing accurate and consistent images from the limited monocular observations. If that step hallucinates geometry or fails on self-occlusions, the proxy will be off and the redirection will suffer. The abstract claims better results but doesn't show the numbers or failure cases, so the evidence is thin so far.

This is the kind of paper for folks working on 4D reconstruction and video diffusion models. A reader interested in practical camera control tools would get something out of the method description. It deserves a serious referee because the pipeline is laid out clearly and the problem is real, even if the current writeup needs more validation to land. I would recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FreeOrbit4D, a training-free framework for arbitrary camera redirection in monocular videos of dynamic scenes. It decouples foreground and background by unprojecting the input video into a static background and partial foreground point clouds in global space, employs an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct a complete foreground point cloud in canonical object space, aligns the canonical foreground to global space via dense pixel-synchronized 3D-3D correspondences, and projects the resulting foreground-complete 4D proxy onto target camera viewpoints to supply geometric scaffolds that condition a video diffusion model for the redirected output. The work claims improved faithfulness and temporal coherence under large-angle trajectories compared to prior diffusion-based methods, plus utility for edit propagation and 4D data generation.

Significance. If the central claim holds, the result would be significant for computer vision and graphics: it provides an explicit, training-free geometric proxy to mitigate ambiguity in large-viewpoint video synthesis, which could improve reliability in applications such as novel-view video replay, virtual cinematography, and dynamic scene editing without dataset-specific fine-tuning.

major comments (2)

The pipeline's load-bearing step is the object-centric multi-view diffusion model's ability to synthesize geometrically consistent multi-view images of the foreground from the narrow viewpoint range and motion present in a monocular video (described in the abstract and method overview). The manuscript provides no quantitative evaluation, error analysis, or ablation of reconstruction accuracy or viewpoint consistency for this completion step; if the synthesized views contain inconsistencies or hallucinations, the canonical point cloud, its alignment via 3D-3D correspondences, and the projected scaffolds will propagate errors directly into the final redirected video.
Abstract: the claim of 'more faithful and temporally coherent redirected videos under challenging large-angle trajectories' is presented without reference to specific metrics, baseline comparisons, or failure-case analysis in the provided text. This makes it impossible to assess whether the 4D proxy actually resolves the geometric ambiguity that prior diffusion methods encounter.

minor comments (2)

Clarify the exact form of the dense pixel-synchronized 3D-3D correspondences used for alignment (e.g., whether they are computed from optical flow, depth, or learned features) and how they handle dynamic foreground motion.
The abstract mentions enabling 'edit propagation and 4D data generation' as applications of the proxy; a brief qualitative example or description of these uses would strengthen the contribution without requiring new experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and describe the revisions we will incorporate to strengthen the presentation and evaluation.

Referee: The pipeline's load-bearing step is the object-centric multi-view diffusion model's ability to synthesize geometrically consistent multi-view images of the foreground from the narrow viewpoint range and motion present in a monocular video (described in the abstract and method overview). The manuscript provides no quantitative evaluation, error analysis, or ablation of reconstruction accuracy or viewpoint consistency for this completion step; if the synthesized views contain inconsistencies or hallucinations, the canonical point cloud, its alignment via 3D-3D correspondences, and the projected scaffolds will propagate errors directly into the final redirected video.

Authors: We agree that isolating the reconstruction accuracy of the object-centric multi-view diffusion step with dedicated quantitative metrics and ablations would provide clearer validation of this load-bearing component. Our current evaluation emphasizes end-to-end video redirection quality and comparisons against prior methods, which indirectly reflect the proxy's effectiveness. In the revised version we will add a targeted analysis section, including error metrics on synthetic data with available ground truth (e.g., Chamfer distance for point clouds and multi-view consistency scores) and an ablation on the impact of synthesis quality. revision: yes
Referee: Abstract: the claim of 'more faithful and temporally coherent redirected videos under challenging large-angle trajectories' is presented without reference to specific metrics, baseline comparisons, or failure-case analysis in the provided text. This makes it impossible to assess whether the 4D proxy actually resolves the geometric ambiguity that prior diffusion methods encounter.

Authors: The abstract serves as a concise overview; the full manuscript details the quantitative metrics, baseline comparisons, and analysis in the Experiments section. To improve clarity we will revise the abstract to explicitly cite the evaluation metrics (such as those measuring faithfulness and temporal coherence) and reference the relevant figures and tables that demonstrate improvements over prior diffusion-based approaches under large-angle trajectories. revision: yes

Circularity Check

0 steps flagged

No significant circularity; pipeline is self-contained

The paper describes a training-free pipeline that unprojects the monocular video into unified global space for background and partial foreground, invokes an external object-centric multi-view diffusion model to synthesize views and complete the foreground point cloud in canonical space, aligns via dense pixel-synchronized 3D-3D correspondences, and projects the resulting 4D proxy as scaffolds for a separate conditional video diffusion model. No equation or step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise depends on a self-citation chain. The central claim remains an independent structural composition of existing components, consistent with the reader's assessment of score 2 and qualifying as a non-finding under the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach relies on standard assumptions about diffusion model capabilities for multi-view synthesis and point cloud reconstruction accuracy; no explicit free parameters or new invented entities are stated in the abstract.

axioms (2)

domain assumption: Object-centric multi-view diffusion models can produce sufficiently accurate novel views of foreground objects from monocular observations.
Invoked when using the diffusion model to complete the foreground point cloud.
domain assumption: Dense pixel-synchronized 3D-3D correspondences can be reliably established between canonical and global spaces.
Required for the alignment step that merges foreground and background.
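
Given such correspondences, one standard way to recover an alignment is a closed-form similarity fit (the Umeyama method), sketched below. This is an assumption for illustration: the text here does not state how the paper estimates the transform, whether it includes scale, or how non-rigid foreground motion is handled. `fit_similarity`, `src`, and `dst` are hypothetical names, with `src` holding canonical-space points and `dst` the matching global-space points for one frame.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (row i matches row i).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    # Cross-covariance between the centered target and source points.
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The fitted transform would then be applied to the full canonical cloud as `s * canonical_points @ R.T + t`. A single similarity transform is rigid up to scale, so an articulated or deforming object would need a per-frame or per-part fit; the residual after fitting is a direct check on whether the correspondences are reliable.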

invented entities (1)

foreground-complete 4D proxy (no independent evidence)
purpose: Structural grounding to reduce geometric ambiguity in large-angle redirection
Constructed via the described pipeline rather than postulated as a new physical entity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D–3D correspondences

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 7.0

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 6.0

UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1] Ronald T Azuma. A survey of augmented reality. Presence: Teleoperators & Virtual Environments, 6(4):355–385, 1997.
[2] Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. VD3D: Taming large video diffusion transformers for 3D camera control. In ICLR, 2025.
[3] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025.
[4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[5] Yuanhao Cai, Zihao Xiao, Yixun Liang, Minghan Qin, Yulun Zhang, Xiaokang Yang, Yaoyao Liu, and Alan L Yuille. Hdr-gs: Efficient high dynamic range novel view synthesis at 1000x speed via gaussian splatting. In NeurIPS, 2024.
[6] Wei Cao, Chang Luo, Biao Zhang, Matthias Nießner, and Jiapeng Tang. Motion2vecsets: 4D latent vector set diffusion for non-rigid shape reconstruction and tracking. In CVPR,
[7] Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. In CoRL, 2025.
[8] Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, and Ranjay Krishna. Omniview: An all-seeing diffusion model for 3d and 4d view synthesis. arXiv preprint arXiv:2512.10940, 2025.
[9] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In ICCV, 2021.
[10] Google DeepMind. Veo. https://deepmind.google/models/veo/, 2024. Accessed: 2024-11-05.
[11] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. In ICLR, 2025.
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
[13] Chen Hou and Zhibo Chen. Training-free camera control for video generation. In ICLR, 2025.
[14] Tao Hu, Haoyang Peng, Xiao Liu, and Yuewen Ma. Ex-4d: Extreme viewpoint 4d video synthesis via depth watertight mesh. arXiv preprint arXiv:2506.05554, 2025.
[15] Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson WH Lau, Wangmeng Zuo, and Chunchao Guo. Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. arXiv preprint arXiv:2506.04225, 2025.
[16] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.
[17] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In ICCV, 2025.
[18] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D gaussian splatting for real-time radiance field rendering. TOG, 2023.
[19] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[20] Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. In NeurIPS, 2025.
[21] Teng Li, Guangcong Zheng, Rui Jiang, Shuigen Zhan, Tao Wu, Yehao Lu, Yining Lin, and Xi Li. Realcam-i2v: Real-world image-to-video generation with interactive complex camera control. In ICCV, 2025.
[22] Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang, Chongjie Ye, Bohan Li, Zhiguo Cao, Wei Li, Hao Zhao, et al. Light-x: Generative 4d video rendering with camera and illumination control. arXiv preprint arXiv:2512.05115, 2025.
[23] Yawen Luo, Jianhong Bai, Xiaoyu Shi, Menghan Xia, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Tianfan Xue. Camclonemaster: Enabling reference-based camera control for video generation. arXiv preprint arXiv:2506.03140, 2025.
[24] Wufei Ma, Qihao Liu, Jiahao Wang, Angtian Wang, Xiaoding Yuan, Yi Zhang, Zihao Xiao, Guofeng Zhang, Beijia Lu, Ruxiao Duan, Yongrui Qi, Adam Kortylewski, Yaoyao Liu, and Alan Yuille. Generating images with 3d annotations using diffusion models. In ICLR, 2024.
[25] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis, 2020.
[26] OpenAI. Sora: Creating video from text. https://openai.com/index/sora/, 2025. Accessed: 2024-11-05.
[27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[28] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[30] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. In ICLR, 2025.
[31] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025.
[32]

Worldforge: Unlocking emergent 3d/4d generation in video diffusion model via training-free guidance.arXiv preprint arXiv:2509.15130, 2025

Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, and Chi Zhang. Worldforge: Unlocking emergent 3d/4d generation in video diffusion model via training-free guidance.arXiv preprint arXiv:2509.15130, 2025. 3

[33]

Fengrui Tian, Jiawei Fan, Xie Yu, Shaoyi Du, Meina Song, and Yu Zhao. TCVM: Temporal contrasting video montage framework for self-supervised video representation learning. In ACCV, 2022. 3

[34]

Fengrui Tian, Shaoyi Du, and Yueqi Duan. MonoNeRF: Learning a generalizable dynamic radiance field from monocular videos. In ICCV, 2023. 2, 3

[35]

Fengrui Tian, Yueqi Duan, Angtian Wang, Jianfei Guo, and Shaoyi Du. Semantic Flow: Learning semantic fields of dynamic scenes from monocular videos. In ICLR, 2024. 3

[36]

Fengrui Tian, Tianjiao Ding, Jinqi Luo, Hancheng Min, and René Vidal. Voyaging into perpetual dynamic scenes from a single view. In ICCV, 2025. 3

[37]

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. In ICLR, 2019. 7

[38]

Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In ECCV,

[39]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

[40]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025. 3, 4, 7

[41]

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In SIGGRAPH, 2024. 3

[42]

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025. 8

[43]

Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency. In ICLR, 2025. 3

[44]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2024. 3

[45]

Chun-Han Yao, Yiming Xie, Vikram Voleti, Huaizu Jiang, and Varun Jampani. SV4D2.0: Enhancing spatio-temporal consistency in multi-view video diffusion for high-quality 4d generation. In ICCV, 2025. 4, 7

[46]

Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In ICCV, 2025. 2, 3, 6, 7

[47]

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024. 2, 3

[48]

Hao Zhang, Chun-Han Yao, Simon Donné, Narendra Ahuja, and Varun Jampani. Stable part diffusion 4d: Multi-view rgb and kinematic parts video generation. In NeurIPS, 2025. 2

[49]

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 3

[50]

Songchun Zhang, Huiyao Xu, Sitong Guo, Zhongwei Xie, Pengwei Liu, Hujun Bao, Weiwei Xu, and Changqing Zou. Spatialcrafter: Unleashing the imagination of video diffusion models for scene reconstruction from limited observations. In ICCV, 2025. 3

[51]

Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Joint 3d geometry reconstruction and motion generation for 4d synthesis from a single image. arXiv preprint arXiv:2512.05044, 2025. 3

[52]

Hongkuan Zhou, Wei Cao, Aifen Sui, and Zhenshan Bing. What matters to enhance traffic rule compliance of imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:2309.07808, 2023. 2

[53]

Hongkuan Zhou, Stefan Schimid, Yicong Li, Lavdim Halilaj, Xiangtong Yao, and Wei Cao. Predicting the road ahead: A knowledge graph based foundation model for scene understanding in autonomous driving. In European Semantic Web Conference, 2025. 2

[54]

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, and Mengyu Wang. Page-4d: Disentangled pose and geometry estimation for 4d perception. arXiv preprint arXiv:2510.17568, 2025. 3, 4, 7

[Figure: (a) Qualitative comparison on the “Camel” sequence. Methods shown: EX4D, ReCamMaster, TrajectoryCrafter, Ours, GEN3C.]
