GimbalDiffusion: Gravity-Aware Camera Control for Video Generation
Recognition: 1 theorem link · Lean theorem
Pith reviewed 2026-05-16 23:36 UTC · model grok-4.3
The pith
GimbalDiffusion grounds video camera control in gravity-based absolute coordinates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing accurate, interpretable control over camera parameters. Using panoramic 360-degree videos for training, we cover the full sphere of possible viewpoints, including combinations of extreme pitch and roll that are out-of-distribution of conventional video data. To improve camera guidance, we introduce null-pitch conditioning, a strategy that prevents the model from overriding camera specifications in the presence of conflicting prompt content (e.g., generating grass while the camera points toward the sky).
What carries the argument
Gravity-referenced absolute coordinate system for camera trajectories, trained via panoramic 360 videos and enforced with null-pitch conditioning.
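The excerpt does not give the paper's exact parameterization, but a gravity-referenced absolute orientation can be sketched as yaw/pitch/roll angles measured against a fixed world frame whose z-axis opposes gravity. The axis conventions below are assumptions for illustration, not the paper's:

```python
import numpy as np

def gravity_aligned_rotation(yaw, pitch, roll):
    """Camera orientation in a gravity-referenced world frame (a sketch).

    Convention assumed here: world +z is up (opposing gravity); at zero
    angles the optical axis is world +y (horizontal). yaw rotates about
    world up, pitch tilts the optical axis toward world up, roll spins
    about the optical axis. All angles in radians.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])  # yaw about world up
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])  # pitch toward world up
    Ry = np.array([[cr, 0.0, sr], [0.0, 1.0, 0.0], [-sr, 0.0, cr]])  # roll about optical axis
    return Rz @ Rx @ Ry
```

Because the angles are absolute, a trajectory like "pitch from 0° to 90°" means the same thing regardless of earlier frames, which is the interpretability the pith points at.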
Load-bearing premise
That training exclusively on panoramic 360-degree videos plus the null-pitch strategy will generalize to the distribution of conventional video prompts without introducing new artifacts or requiring additional fine-tuning on real-world footage.
What would settle it
Observe the output when the camera is conditioned to point straight up but the prompt describes ground-level objects; the model should generate sky rather than ground content if the conditioning works.
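The proposed probe can be made concrete as a small trajectory builder. The frame format, parameter names, and prompt below are hypothetical, not the paper's API:

```python
def straight_up_probe(num_frames=16, yaw_sweep_deg=0.0):
    """A hypothetical disentanglement probe (not the paper's code):
    hold pitch at +90 deg so the optical axis points along gravity-up,
    optionally sweeping yaw, and pair the trajectory with a ground-level
    prompt. A disentangled model should render sky, not grass."""
    frames = []
    for i in range(num_frames):
        frames.append({
            "pitch_deg": 90.0,  # look straight up for the whole clip
            "yaw_deg": yaw_sweep_deg * i / max(1, num_frames - 1),
            "roll_deg": 0.0,
        })
    return {"prompt": "a field of grass", "trajectory": frames}
```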
Original abstract
Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over camera motion and orientation remains elusive, especially with extreme trajectories (e.g., a 180-degree turnaround, or looking directly up or down). Existing approaches typically encode camera trajectories using relative or ambiguous representations, limiting precise geometric control and offering limited support for large rotations. We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing accurate, interpretable control over camera parameters. Using panoramic 360-degree videos for training, we cover the full sphere of possible viewpoints, including combinations of extreme pitch and roll that are out-of-distribution of conventional video data. To improve camera guidance, we introduce null-pitch conditioning, a strategy that prevents the model from overriding camera specifications in the presence of conflicting prompt content (e.g., generating grass while the camera points toward the sky). Finally, we propose new benchmarks to evaluate gravity-aware camera-controlled video generation, assessing models' ability to generate extreme camera angles and quantify their input prompt entanglement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GimbalDiffusion, a framework for text-to-video generation that achieves gravity-aware camera control by representing trajectories in an absolute world coordinate system with gravity as the global reference, rather than relative frame-to-frame motions. It trains exclusively on panoramic 360-degree videos to cover extreme pitch/roll combinations, introduces a null-pitch conditioning strategy to mitigate prompt-camera conflicts, and proposes new benchmarks to measure fidelity on extreme angles and prompt disentanglement.
Significance. If the claimed improvements in extreme-trajectory accuracy and reduced prompt entanglement prove robust, the work would advance controllable video synthesis by supplying an interpretable, physically grounded alternative to relative camera encodings, potentially benefiting applications that require precise geometric control.
major comments (3)
- [§4] Experiments and Benchmarks: The evaluation is confined to held-out panoramic 360° test clips; no quantitative results, ablations, or error analysis are reported on standard (non-360°) text-to-video prompts, leaving the central generalization claim unsupported.
- [§3] Method, null-pitch conditioning: The mechanism by which null-pitch conditioning is injected into the diffusion process is described only at a high level; without the precise conditioning formulation or loss term, it is impossible to verify how it prevents the model from overriding camera specifications.
- [§4.2] Proposed benchmarks: The metrics used to quantify prompt entanglement and extreme-angle fidelity are not defined, nor are baseline comparisons provided, so the reported gains cannot be assessed for statistical significance or robustness.
minor comments (2)
- [Abstract] The abstract would be clearer if it briefly stated the scale of the training dataset and the specific quantitative metrics used in the new benchmarks.
- [§3] Notation for the absolute coordinate frame (e.g., how the gravity vector and camera intrinsics are encoded) should be introduced earlier and used consistently.
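For concreteness, one plausible per-frame encoding the notation comment asks for (the feature layout is an assumption, not taken from the paper): express the gravity direction in the camera frame and concatenate it with intrinsics normalized by image size:

```python
import numpy as np

def camera_condition_features(gravity_cam, fx, fy, cx, cy, w, h):
    """Hypothetical per-frame conditioning vector (not the paper's):
    unit gravity direction in camera coordinates (3 values) followed by
    resolution-normalized intrinsics (fx/w, fy/h, cx/w, cy/h)."""
    g = np.asarray(gravity_cam, dtype=np.float64)
    g = g / np.linalg.norm(g)  # only the direction matters, not |g|
    intr = np.array([fx / w, fy / h, cx / w, cy / h])
    return np.concatenate([g, intr])
```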
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
Referee: [§4] The evaluation is confined to held-out panoramic 360° test clips; no quantitative results, ablations, or error analysis are reported on standard (non-360) text-to-video prompts, leaving the central generalization claim unsupported.
Authors: We agree that additional evidence on standard (non-360) prompts would strengthen the generalization discussion. While the core contribution targets extreme trajectories enabled by full-sphere panoramic training data, we will add a new subsection with qualitative results and limited quantitative metrics on conventional text-to-video prompts (e.g., from standard datasets) to demonstrate that the gravity-aware control does not degrade performance on typical cases. Full ablations on non-360 data will be included where feasible. revision: partial
Referee: [§3] The mechanism by which null-pitch conditioning is injected into the diffusion process is described only at a high level; without the precise conditioning formulation or loss term, it is impossible to verify how it prevents the model from overriding camera specifications.
Authors: We acknowledge that the description in §3 is high-level. In the revised manuscript we will add the exact conditioning formulation, including the mathematical definition of the null-pitch embedding, its injection point in the diffusion U-Net, and the modified loss term that encourages adherence to camera parameters even under conflicting text prompts. revision: yes
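By analogy with classifier-free guidance, null-pitch conditioning could plausibly be implemented as training-time dropout of the pitch embedding to a learned null token, with guidance at sampling time pushing the output toward the specified pitch. This sketch is a guess at the mechanism, not the paper's formulation:

```python
import numpy as np

def apply_null_pitch(pitch_emb, null_emb, drop_prob=0.1, seed=0):
    """Training-time null-pitch dropout, sketched by analogy with
    classifier-free guidance (an assumption, not the paper's method).

    With probability drop_prob, each sample's pitch embedding is replaced
    by a learned null token, so the model also learns 'no pitch given'.
    At sampling time one could then guide:
        eps = eps_null + s * (eps_pitch - eps_null)
    """
    rng = np.random.default_rng(seed)
    B = pitch_emb.shape[0]
    mask = rng.random(B) < drop_prob   # which samples get the null token
    out = pitch_emb.copy()             # do not mutate the input batch
    out[mask] = null_emb               # broadcast null token into masked rows
    return out, mask
```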
Referee: [§4.2] The metrics used to quantify prompt entanglement and extreme-angle fidelity are not defined, nor are baseline comparisons provided, so the reported gains cannot be assessed for statistical significance or robustness.
Authors: We thank the referee for this observation. Section 4.2 will be expanded with precise mathematical definitions of the prompt-entanglement and extreme-angle fidelity metrics. We will also add baseline comparisons against prior camera-control methods and report statistical significance (e.g., standard deviations over multiple seeds) to allow proper assessment of the gains. revision: yes
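Pending the authors' definitions, one hypothetical instantiation of a prompt-entanglement metric: score each generated frame's similarity to the conflicting prompt content and to the camera-implied content (e.g., via CLIP scores), and report how often the prompt wins:

```python
def prompt_entanglement(sim_prompt, sim_camera):
    """Hypothetical entanglement metric (not the paper's): given per-frame
    similarity of the generated frame to the conflicting prompt content
    vs. to the camera-implied content, return the fraction of frames
    where the prompt 'wins'. 0.0 = fully disentangled camera control,
    1.0 = camera specification fully overridden by the prompt."""
    assert len(sim_prompt) == len(sim_camera)
    wins = sum(p > c for p, c in zip(sim_prompt, sim_camera))
    return wins / len(sim_prompt)
```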
Circularity Check
No circularity; empirical training recipe with no self-referential derivations
Full rationale
The paper introduces GimbalDiffusion as a training-based framework using 360° panoramic videos and null-pitch conditioning to achieve gravity-aware camera control. No equations, closed-form derivations, or parameter-fitting steps are described that reduce the claimed control accuracy or generalization to quantities defined or fitted inside the same work. The method is presented as an empirical recipe (data choice + conditioning strategy + new benchmarks) rather than a mathematical chain. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs in the provided derivation. Generalization from panoramic to conventional video distributions is an empirical question, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Diffusion-based video generators can be conditioned on explicit camera parameters when trained on sufficiently diverse viewpoint data.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tagged unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear. Matched passage: "We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. ... absolute coordinate system"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- CalibAnyView: Beyond Single-View Camera Calibration in the Wild. A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
Reference graph
Works this paper leans on
- [1] Dinesh Acharya, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Towards high resolution video generation with progressive growing of sliced Wasserstein GANs. CoRR, 2018.
- [2] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, and Sergey Tulyakov. AC3D: Analyzing and improving 3D camera control in video diffusion transformers. IEEE/CVF Conf. Comput. Vis. Pattern Recog.
- [3] Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. Lindell, and Sergey Tulyakov. VD3D: Taming large video diffusion transformers for 3D camera control. Int. Conf. Learn. Represent., 2025.
- [4] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. ReCamMaster: Camera-controlled generative rendering from a single video. IEEE/CVF Int. Conf. Comput. Vis., 2025.
- [5] Edurne Bernal-Berdun, Ana Serrano, Belen Masia, Matheus Gadelha, Yannick Hold-Geoffroy, Xin Sun, and Diego Gutierrez. PreciseCam: Precise camera control for text-to-image generation. IEEE/CVF Conf. Comput. Vis. Pattern Recog.
- [6] Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mohammad Soleymani. MagicPose: Realistic human poses and facial expressions retargeting with identity-aware diffusion. Int. Conf. Mach. Learn., 2024.
- [7] Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-A-Video: Controllable text-to-video generation with diffusion models. CoRR, 2023.
- [8] Tien Do, Khiem Vuong, and Hyun Soo Park. Egocentric scene understanding via multimodal spatial rectifier. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2022.
- [9] Google Research. RealEstate10K: A large-scale dataset of camera poses. https://google.github.io/realestate10k/, 2018. Camera trajectories from approximately 80,000 video clips (from 10,000 YouTube videos), totaling about 10 million frames; poses generated via SLAM and bundle adjustment.
- [10] Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, and Yuan Liu. Diffusion as Shader: 3D-aware video diffusion for versatile video generation control. ACM SIGGRAPH Conf., 2025.
- [11] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
- [12] Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. CameraCtrl II: Dynamic scene exploration via camera-controlled video diffusion models. arXiv preprint arXiv:2503.10592, 2025.
- [13] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. Conf. Emp. Metho. Nat. Lang. Proc.
- [14] Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate Anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023.
- [15] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. ViPE: Video pose engine for 3D geometric perception. NVIDIA Research Whitepapers, 2025.
- [16] Fang Jiang et al. MegaSaM: Scaling up camera pose estimation with a foundation model for structure-from-motion. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2025.
- [17] Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Blackburn-Matzen, Matthew Sticha, and David F. Fouhey. Perspective fields for single image camera calibration. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2023.
- [18] Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, and David F. Fouhey. Perspective fields for single image camera calibration. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2023.
- [19] Yash Kant, Aliaksandr Siarohin, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, and Igor Gilitschenski. SPAD: Spatially aware multi-view diffusers. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.
- [20] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.
- [21] Florian Kluger, Hanno Ackermann, Michael Ying Yang, and Bodo Rosenhahn. Temporally consistent horizon lines. Int. Conf. Robot. Autom., 2020.
- [22] Peter Kocsis, Julien Philip, Kalyan Sunkavalli, Matthias Nießner, and Yannick Hold-Geoffroy. LightIt: Illumination modeling and control for diffusion models. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.
- [23] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. Eur. Conf. Comput. Vis., 2024.
- [24] Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. Adv. Neural Inform. Process. Syst., 2025.
- [25] Yitong Li, Martin Min, Dinghan Shen, David Carlson, and Lawrence Carin. Video generation from text. Assoc. Adv. of Art. Int., 2018.
- [26] Nadav Magar, Amir Hertz, Eric Tabellion, Yael Pritch, Alex Rav-Acha, Ariel Shamir, and Yedid Hoshen. LightLab: Controlling light sources in images with diffusion models. ACM SIGGRAPH Conf., 2025.
- [27] Pierre Moulon, Pascal Monasse, Romuald Perrot, and Renaud Marlet. OpenMVG: Open multiple view geometry. Int. Work. Reproduc. Res. Patt. Recog., 2016.
- [28] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2016.
- [29] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. IEEE/CVF Conf. Comput. Vis. Pattern Recog.
- [30] Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Single-image calibration with geometric optimization. Eur. Conf. Comput. Vis., 2024.
- [31] Team Wan, Ang Wang, Baole Ai, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint, 2025.
- [32] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2025.
- [33] Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, and Yao Yao. SpatialVID: A large-scale video dataset with spatial annotations, 2025.
- [34] Qian Wang, Weiqi Li, Chong Mou, Xinhua Cheng, and Jian Zhang. 360DVD: Controllable panorama video generation with 360-degree video diffusion model. arXiv preprint arXiv:2401.06578, 2024.
- [35] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.
- [36] Lilian Weng. Diffusion models for video generation. Lil'Log (blog), 2024. https://lilianweng.github.io/posts/2024-04-12-diffusion-video/.
- [37] Scott Workman, Menghua Zhai, and Nathan Jacobs. Horizon lines in the wild. Brit. Mach. Vis. Conf., 2016.
- [38] Changchang Wu et al. VisualSFM: A visual structure from motion system, 2011.
- [39] Wenqi Xian, Zhengqi Li, Matthew Fisher, Jonathan Eisenmann, Eli Shechtman, and Noah Snavely. UprightNet: Geometry-aware camera orientation estimation from single images. IEEE/CVF Int. Conf. Comput. Vis., 2019.
- [40] Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, and Feng Liu. MotionCanvas: Cinematic shot design with controllable image-to-video generation. ACM SIGGRAPH Conf., pages 1–11, 2025.
- [41] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [42] Jiraphon Yenphraphai, Xichen Pan, Sainan Liu, Daniele Panozzo, and Saining Xie. Image Sculpting: Precise object editing with 3D geometry control. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.
- [43] Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. TrajectoryCrafter: Redirecting camera trajectory for monocular videos via diffusion models. IEEE/CVF Int. Conf. Comput. Vis., 2025.
- [44] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
discussion (0)