FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction
Pith reviewed 2026-05-21 13:59 UTC · model grok-4.3
The pith
A complete foreground 4D proxy built from monocular video supplies geometric scaffolds that steer consistent camera redirection in generated videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FreeOrbit4D recovers a foreground-complete 4D proxy by first unprojecting the input video into a static background point cloud and a partial foreground point cloud inside a shared global space. An object-centric multi-view diffusion model then synthesizes additional images of the foreground to permit complete point-cloud reconstruction in canonical object space. Dense pixel-synchronized 3D-3D correspondences align the canonical foreground back into global coordinates. The resulting aligned 4D proxy is projected onto target camera rays to supply explicit geometric scaffolds that condition a video diffusion model for the final redirected output.
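The unprojection step described above can be illustrated with a minimal sketch. It assumes a pinhole camera, a per-frame z-depth map, known intrinsics `K`, a camera-to-world pose, and a boolean foreground mask; none of these input conventions are stated in the text reviewed here, and the function names `unproject_depth` and `split_by_mask` are hypothetical.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H*W, 3).

    Assumed conventions (not from the paper): depth is z-depth along the
    optical axis, K is a 3x3 pinhole intrinsics matrix, cam_to_world is 4x4.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project each pixel: X_cam = depth * K^{-1} [u, v, 1]^T.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Move the points into the shared global frame.
    return pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]

def split_by_mask(points, fg_mask):
    """Split unprojected points into (foreground, background) with an (H, W) mask."""
    m = fg_mask.reshape(-1)
    return points[m], points[~m]
```

Running this per frame and accumulating the background points would give a static background cloud, while the foreground points stay indexed by frame; how the paper actually obtains depth, poses, and masks is outside what this page records.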
What carries the argument
foreground-complete 4D proxy: a completed foreground point cloud reconstructed in canonical space, aligned to the global scene via 3D correspondences, and projected as geometric guidance for the video generator.
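To make the alignment idea concrete, the sketch below fits a similarity transform (scale, rotation, translation) from paired 3D points using the standard Umeyama least-squares solution. This is a swapped-in textbook estimator, not the authors' procedure: the text reviewed here does not say which solver the paper uses, whether scale is estimated, or whether the fit is done per frame. The name `fit_similarity` is hypothetical.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares s, R, t with dst ~= s * R @ src + t (Umeyama, 1991).

    src, dst: (N, 3) arrays of corresponding points (N >= 3, not collinear).
    Here src would be canonical-space points and dst their global-space matches.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    # Cross-covariance between the centred point sets.
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = (S * d).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

For a moving foreground, one plausible use is to call this once per frame on that frame's correspondences, which yields a time-varying placement of the completed object in the global scene.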
If this is right
- Redirected videos remain geometrically faithful and temporally coherent even when the new camera path deviates substantially from the input trajectory.
- The framework operates without any training on camera-redirection data.
- The same 4D proxy supports downstream tasks such as propagating edits through the video and synthesizing additional 4D training data.
Where Pith is reading between the lines
- The alignment step could be tested on videos with fast-moving or deforming objects to see whether the 3D correspondences remain reliable.
- If the proxy is sufficiently accurate, the same scaffolding idea might transfer to other dynamic-scene tasks such as inserting new objects or changing lighting while preserving motion consistency.
Load-bearing premise
An object-centric multi-view diffusion model can synthesize accurate additional views of the foreground object from the narrow observations present in the original monocular video.
What would settle it
Take a monocular video containing a foreground object with clearly asymmetric parts hidden from the original camera; apply a large redirection trajectory and inspect whether the generated frames correctly reveal those hidden parts or instead produce geometric distortions or hallucinations.
Original abstract
Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/
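The final step in the abstract, projecting the proxy onto a target viewpoint, can be sketched as a simple point splat with a depth test. This is an illustration under assumptions: a pinhole camera, one pixel per point, and nearest-point-wins occlusion. The abstract does not describe the actual renderer, and `render_scaffold` is a hypothetical name.

```python
import numpy as np

def render_scaffold(points, colors, K, world_to_cam, h, w):
    """Project world-space points to an (h, w, 3) image plus a coverage mask."""
    pts_cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    front = pts_cam[:, 2] > 1e-6          # drop points behind the camera
    pts_cam, cols = pts_cam[front], colors[front]
    uvw = pts_cam @ K.T                    # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    pix, z, cols = (v * w + u)[inside], pts_cam[inside, 2], cols[inside]
    # Z-buffer: sort by pixel, then depth, and keep the nearest point per pixel.
    order = np.lexsort((z, pix))
    first = np.ones(len(order), dtype=bool)
    first[1:] = pix[order][1:] != pix[order][:-1]
    keep = order[first]
    image = np.zeros((h * w, 3), dtype=np.float64)
    mask = np.zeros(h * w, dtype=bool)
    image[pix[keep]] = cols[keep]
    mask[pix[keep]] = True
    return image.reshape(h, w, 3), mask.reshape(h, w)
```

The mask marks which target pixels have geometric support; uncovered pixels are the regions a conditional video generator would have to fill in.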
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FreeOrbit4D, a training-free framework for arbitrary camera redirection in monocular videos of dynamic scenes. It decouples foreground and background by unprojecting the input video into a static background and partial foreground point clouds in global space, employs an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct a complete foreground point cloud in canonical object space, aligns the canonical foreground to global space via dense pixel-synchronized 3D-3D correspondences, and projects the resulting foreground-complete 4D proxy onto target camera viewpoints to supply geometric scaffolds that condition a video diffusion model for the redirected output. The work claims improved faithfulness and temporal coherence under large-angle trajectories compared to prior diffusion-based methods, plus utility for edit propagation and 4D data generation.
Significance. If the central claim holds, the result would be significant for computer vision and graphics: it provides an explicit, training-free geometric proxy to mitigate ambiguity in large-viewpoint video synthesis, which could improve reliability in applications such as novel-view video replay, virtual cinematography, and dynamic scene editing without dataset-specific fine-tuning.
major comments (2)
- The pipeline's load-bearing step is the object-centric multi-view diffusion model's ability to synthesize geometrically consistent multi-view images of the foreground from the narrow viewpoint range and motion present in a monocular video (described in the abstract and method overview). The manuscript provides no quantitative evaluation, error analysis, or ablation of reconstruction accuracy or viewpoint consistency for this completion step; if the synthesized views contain inconsistencies or hallucinations, the canonical point cloud, its alignment via 3D-3D correspondences, and the projected scaffolds will propagate errors directly into the final redirected video.
- Abstract: the claim of 'more faithful and temporally coherent redirected videos under challenging large-angle trajectories' is presented without reference to specific metrics, baseline comparisons, or failure-case analysis in the provided text. This makes it impossible to assess whether the 4D proxy actually resolves the geometric ambiguity that prior diffusion methods encounter.
minor comments (2)
- Clarify the exact form of the dense pixel-synchronized 3D-3D correspondences used for alignment (e.g., whether they are computed from optical flow, depth, or learned features) and how they handle dynamic foreground motion.
- The abstract mentions enabling 'edit propagation and 4D data generation' as applications of the proxy; a brief qualitative example or description of these uses would strengthen the contribution without requiring new experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and describe the revisions we will incorporate to strengthen the presentation and evaluation.
- Referee: The pipeline's load-bearing step is the object-centric multi-view diffusion model's ability to synthesize geometrically consistent multi-view images of the foreground from the narrow viewpoint range and motion present in a monocular video (described in the abstract and method overview). The manuscript provides no quantitative evaluation, error analysis, or ablation of reconstruction accuracy or viewpoint consistency for this completion step; if the synthesized views contain inconsistencies or hallucinations, the canonical point cloud, its alignment via 3D-3D correspondences, and the projected scaffolds will propagate errors directly into the final redirected video.
  Authors: We agree that isolating the reconstruction accuracy of the object-centric multi-view diffusion step with dedicated quantitative metrics and ablations would provide clearer validation of this load-bearing component. Our current evaluation emphasizes end-to-end video redirection quality and comparisons against prior methods, which indirectly reflect the proxy's effectiveness. In the revised version we will add a targeted analysis section, including error metrics on synthetic data with available ground truth (e.g., Chamfer distance for point clouds and multi-view consistency scores) and an ablation on the impact of synthesis quality.
  Revision: yes
- Referee: Abstract: the claim of 'more faithful and temporally coherent redirected videos under challenging large-angle trajectories' is presented without reference to specific metrics, baseline comparisons, or failure-case analysis in the provided text. This makes it impossible to assess whether the 4D proxy actually resolves the geometric ambiguity that prior diffusion methods encounter.
  Authors: The abstract serves as a concise overview; the full manuscript details the quantitative metrics, baseline comparisons, and analysis in the Experiments section. To improve clarity we will revise the abstract to explicitly cite the evaluation metrics (such as those measuring faithfulness and temporal coherence) and reference the relevant figures and tables that demonstrate improvements over prior diffusion-based approaches under large-angle trajectories.
  Revision: yes
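The first response proposes Chamfer distance as an error metric for the completed point clouds. A minimal sketch of one common definition follows; the function name and the choice of unsquared Euclidean distances averaged in both directions are assumptions, since the rebuttal does not say which variant would be reported.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Sums the mean nearest-neighbour Euclidean distance in each direction.
    Some works use squared distances instead, so the convention should be
    stated alongside any reported number.
    """
    # Dense pairwise distances: O(N*M) memory, fine for a few thousand points.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

For larger clouds, a spatial index such as a k-d tree would replace the dense distance matrix, but the quantity being computed is the same.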
Circularity Check
No significant circularity; pipeline is self-contained
The paper describes a training-free pipeline that unprojects the monocular video into unified global space for background and partial foreground, invokes an external object-centric multi-view diffusion model to synthesize views and complete the foreground point cloud in canonical space, aligns via dense pixel-synchronized 3D-3D correspondences, and projects the resulting 4D proxy as scaffolds for a separate conditional video diffusion model. No equation or step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise depends on a self-citation chain. The central claim remains an independent structural composition of existing components, consistent with the reader's assessment of score 2 and qualifying as a non-finding under the guidelines.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Object-centric multi-view diffusion models can produce sufficiently accurate novel views of foreground objects from monocular observations.
- domain assumption: Dense pixel-synchronized 3D-3D correspondences can be reliably established between canonical and global spaces.
invented entities (1)
- foreground-complete 4D proxy: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D–3D correspondences
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
  UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
  UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.