pith. machine review for the scientific record.

arxiv: 2512.09112 · v2 · submitted 2025-12-09 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

GimbalDiffusion: Gravity-Aware Camera Control for Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera control · text-to-video · diffusion models · 360 video · gravity reference · camera trajectory · video generation

The pith

GimbalDiffusion grounds video camera control in gravity-based absolute coordinates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GimbalDiffusion to achieve fine-grained control over camera motion in text-to-video generation by using gravity as a fixed global reference. This replaces relative descriptions between frames with trajectories defined in an absolute physical coordinate system. Training on 360-degree panoramic videos ensures coverage of all possible viewpoints including extreme orientations. A null-pitch conditioning strategy is added to keep the model from ignoring the camera instructions when they conflict with the text prompt. New benchmarks are proposed to assess performance on extreme angles and prompt entanglement.
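To make the representation concrete, here is a minimal sketch (not the authors' code) of recovering gravity-aligned pitch and roll from a camera's world-to-camera rotation, assuming a world frame whose +Z axis opposes gravity; the paper's exact parameterization and axis conventions may differ.

```python
# Minimal sketch, assuming a world frame with +Z opposing gravity and a camera
# frame with x right, y down, z forward. Not the authors' implementation.
import numpy as np

def gravity_aligned_angles(R_world_to_cam: np.ndarray):
    """Return (pitch, roll) in radians for a world-to-camera rotation matrix.

    pitch: elevation of the viewing direction above the horizon.
    roll:  rotation of the image plane about the viewing direction,
           measured against the gravity-defined "up".
    """
    # Camera viewing direction (+z of the camera frame) expressed in world coords.
    forward_world = R_world_to_cam.T @ np.array([0.0, 0.0, 1.0])
    # Image-up direction (-y of the camera frame) expressed in world coords.
    up_world = R_world_to_cam.T @ np.array([0.0, -1.0, 0.0])

    # Pitch: angle between the viewing direction and the horizontal plane.
    pitch = np.arcsin(np.clip(forward_world[2], -1.0, 1.0))

    # Roll: angle between image-up and the projection of world-up onto the
    # image plane (degenerate when looking straight up or down).
    world_up = np.array([0.0, 0.0, 1.0])
    up_ref = world_up - np.dot(world_up, forward_world) * forward_world
    norm = np.linalg.norm(up_ref)
    if norm < 1e-6:
        return pitch, 0.0  # roll undefined at the poles; report 0 by convention
    up_ref /= norm
    cos_roll = np.clip(np.dot(up_world, up_ref), -1.0, 1.0)
    sign = np.sign(np.dot(np.cross(up_ref, up_world), forward_world))
    return pitch, sign * np.arccos(cos_roll)
```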

Core claim

We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing accurate, interpretable control over camera parameters. Using panoramic 360-degree videos for training, we cover the full sphere of possible viewpoints, including combinations of extreme pitch and roll that are out-of-distribution of conventional video data. To improve camera guidance, we introduce null-pitch conditioning, a strategy that prevents the model from overriding camera specifications in the presence of conflicting prompt content (e.g., generating grass while the camera points toward the sky).

What carries the argument

Gravity-referenced absolute coordinate system for camera trajectories, trained via panoramic 360 videos and enforced with null-pitch conditioning.

Load-bearing premise

That training exclusively on panoramic 360-degree videos plus the null-pitch strategy will generalize to the distribution of conventional video prompts without introducing new artifacts or requiring additional fine-tuning on real-world footage.

What would settle it

Observe the output when the camera is conditioned to point straight up but the prompt describes ground-level objects; the model should generate sky rather than ground content if the conditioning works.
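A hedged sketch of that probe, assuming a hypothetical `generate_video(prompt, pitch_deg)` interface for the camera-conditioned model; the CLIP scoring below uses the standard Hugging Face transformers API, not anything from the paper.

```python
# Sketch of a sky-vs-ground probe for camera/prompt conflicts.
import torch
from transformers import CLIPModel, CLIPProcessor

def sky_vs_ground_score(frames, device="cpu"):
    """Mean CLIP preference of generated frames for 'sky' vs 'ground' probes.

    frames: list of PIL.Image frames from a generated clip.
    Returns (sky_score, ground_score); sky_score >> ground_score suggests the
    camera conditioning overrode the conflicting prompt content.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    texts = ["a photo of the sky", "a photo of the ground"]
    inputs = processor(text=texts, images=frames,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # (num_frames, 2)
    probs = logits.softmax(dim=-1).mean(dim=0)      # average over frames
    return probs[0].item(), probs[1].item()

# Hypothetical usage: condition the camera to look straight up while the
# prompt describes ground-level content, then check which probe wins.
# frames = generate_video(prompt="a field of grass", pitch_deg=+90.0)
# sky, ground = sky_vs_ground_score(frames)
```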

Figures

Figures reproduced from arXiv: 2512.09112 by Frédéric Fortier-Chouinard, Jean-François Lalonde, Matheus Gadelha, Valentin Deschaintre, Yannick Hold-Geoffroy.

Figure 1: We propose GimbalDiffusion, a framework for absolute camera control in text-to-video generation. Our approach adapts foundational video generation models to accept absolute camera controls, conditioning the entire video on camera parameters expressed in a gravity-aligned global coordinate system. This enables the generation of videos with challenging viewpoints, such as low pitch (top) or high roll (bottom) …
Figure 2: Reproducing real scenes using our representation. From …
Figure 3: Training data pipeline. (a) We extract the camera poses from a dataset of 360 …
Figure 4: Training data samples from our data augmentation pipeline, capturing a highly diverse set of rotation trajectories from 360 …
Figure 5: Comparison of parameter distribution between our sampling and a typical video dataset for camera control (RealEstate10K [9]) …
Figure 6: Entanglement between prompt and camera pitch. Without …
Figure 7: Qualitative results on the SpatialVID-extreme benchmark. The input absolute camera angle is shown as a dark overlay on the …
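Figures 3 and 4 describe a data-augmentation pipeline that samples rotation trajectories from 360° footage. Below is a minimal sketch, under assumed axis conventions, of extracting one perspective crop at an absolute pitch/roll/yaw from an equirectangular frame; the authors' pipeline may differ in conventions, interpolation, and parameter sampling.

```python
# Sketch: perspective view at an absolute orientation from a 360 frame.
# World frame here uses y pointing down (along gravity); illustrative only.
import numpy as np

def perspective_from_equirect(pano: np.ndarray, yaw, pitch, roll,
                              fov_deg=90.0, out_hw=(256, 256)) -> np.ndarray:
    """pano: (H, W, 3) equirectangular image; angles in radians."""
    H, W, _ = pano.shape
    h, w = out_hw
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)

    # Rays through each output pixel in the camera frame (x right, y down, z forward).
    xs, ys = np.meshgrid(np.arange(w) - w / 2 + 0.5, np.arange(h) - h / 2 + 0.5)
    rays = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    def rot(axis, a):
        c, s = np.cos(a), np.sin(a)
        if axis == "z":  # roll about the viewing direction
            return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
        if axis == "x":  # pitch
            return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # yaw about vertical

    # Apply roll, then pitch, then yaw to carry rays into the gravity-aligned frame.
    R = rot("y", yaw) @ rot("x", pitch) @ rot("z", roll)
    d = rays @ R.T

    # World ray -> (longitude, latitude) -> nearest pano pixel.
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(-d[..., 1], -1.0, 1.0))
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((0.5 - lat / np.pi) * H).astype(int), 0, H - 1)
    return pano[v, u]
```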
read the original abstract

Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over camera motion and orientation remains elusive, especially with extreme trajectories (e.g., a 180-degree turnaround, or looking directly up or down). Existing approaches typically encode camera trajectories using relative or ambiguous representations, limiting precise geometric control and offering limited support for large rotations. We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing accurate, interpretable control over camera parameters. Using panoramic 360-degree videos for training, we cover the full sphere of possible viewpoints, including combinations of extreme pitch and roll that are out-of-distribution of conventional video data. To improve camera guidance, we introduce null-pitch conditioning, a strategy that prevents the model from overriding camera specifications in the presence of conflicting prompt content (e.g., generating grass while the camera points toward the sky). Finally, we propose new benchmarks to evaluate gravity-aware camera-controlled video generation, assessing models' ability to generate extreme camera angles and quantify their input prompt entanglement.
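The abstract names null-pitch conditioning but not its formulation. One plausible reading, sketched below as an assumption rather than the paper's method, is a classifier-free-guidance-style null token that replaces the pitch embedding during a fraction of training steps, so the camera branch can be reweighted against the unconditional branch at sampling time; every module and name here is illustrative.

```python
# Hypothetical sketch of null-pitch conditioning; not the authors' formulation.
import torch
import torch.nn as nn

class PitchConditioner(nn.Module):
    def __init__(self, dim: int = 256, p_null: float = 0.1):
        super().__init__()
        self.p_null = p_null
        self.null_token = nn.Parameter(torch.zeros(dim))  # learned "no pitch" embedding
        self.proj = nn.Sequential(nn.Linear(2, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, pitch: torch.Tensor) -> torch.Tensor:
        """pitch: (batch,) absolute pitch in radians -> (batch, dim) embedding."""
        feats = torch.stack([torch.sin(pitch), torch.cos(pitch)], dim=-1)
        emb = self.proj(feats)
        if self.training:
            # Randomly drop the pitch signal so the model also learns a null branch.
            drop = torch.rand(pitch.shape[0], device=pitch.device) < self.p_null
            emb = torch.where(drop[:, None], self.null_token.expand_as(emb), emb)
        return emb  # added to the diffusion model's conditioning stream

# At sampling time, a camera guidance weight w could blend the two branches so
# the specified camera wins over conflicting prompt content:
#   eps_guided = eps(x, null) + w * (eps(x, pitch) - eps(x, null))
```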

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces GimbalDiffusion, a framework for text-to-video generation that achieves gravity-aware camera control by representing trajectories in an absolute world coordinate system with gravity as the global reference, rather than relative frame-to-frame motions. It trains exclusively on panoramic 360-degree videos to cover extreme pitch/roll combinations, introduces a null-pitch conditioning strategy to mitigate prompt-camera conflicts, and proposes new benchmarks to measure fidelity on extreme angles and prompt disentanglement.

Significance. If the claimed improvements in extreme-trajectory accuracy and reduced prompt entanglement prove robust, the work would advance controllable video synthesis by supplying an interpretable, physically grounded alternative to relative camera encodings, potentially benefiting applications that require precise geometric control.

major comments (3)
  1. [§4] §4 (Experiments and Benchmarks): The evaluation is confined to held-out panoramic 360° test clips; no quantitative results, ablations, or error analysis are reported on standard (non-360) text-to-video prompts, leaving the central generalization claim unsupported.
  2. [§3] §3 (Method, null-pitch conditioning): The mechanism by which null-pitch conditioning is injected into the diffusion process is described only at a high level; without the precise conditioning formulation or loss term, it is impossible to verify how it prevents the model from overriding camera specifications.
  3. [§4.2] §4.2 (Proposed benchmarks): The metrics used to quantify prompt entanglement and extreme-angle fidelity are not defined, nor are baseline comparisons provided, so the reported gains cannot be assessed for statistical significance or robustness.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it briefly stated the scale of the training dataset and the specific quantitative metrics used in the new benchmarks.
  2. [§3] Notation for the absolute coordinate frame (e.g., how gravity vector and camera intrinsics are encoded) should be introduced earlier and used consistently.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] The evaluation is confined to held-out panoramic 360° test clips; no quantitative results, ablations, or error analysis are reported on standard (non-360) text-to-video prompts, leaving the central generalization claim unsupported.

    Authors: We agree that additional evidence on standard (non-360) prompts would strengthen the generalization discussion. While the core contribution targets extreme trajectories enabled by full-sphere panoramic training data, we will add a new subsection with qualitative results and limited quantitative metrics on conventional text-to-video prompts (e.g., from standard datasets) to demonstrate that the gravity-aware control does not degrade performance on typical cases. Full ablations on non-360 data will be included where feasible. revision: partial

  2. Referee: [§3] The mechanism by which null-pitch conditioning is injected into the diffusion process is described only at a high level; without the precise conditioning formulation or loss term, it is impossible to verify how it prevents the model from overriding camera specifications.

    Authors: We acknowledge that the description in §3 is high-level. In the revised manuscript we will add the exact conditioning formulation, including the mathematical definition of the null-pitch embedding, its injection point in the diffusion backbone, and the modified loss term that encourages adherence to camera parameters even under conflicting text prompts. revision: yes

  3. Referee: [§4.2] The metrics used to quantify prompt entanglement and extreme-angle fidelity are not defined, nor are baseline comparisons provided, so the reported gains cannot be assessed for statistical significance or robustness.

    Authors: We thank the referee for this observation. Section 4.2 will be expanded with precise mathematical definitions of the prompt-entanglement and extreme-angle fidelity metrics. We will also add baseline comparisons against prior camera-control methods and report statistical significance (e.g., standard deviations over multiple seeds) to allow proper assessment of the gains. revision: yes
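To ground the metrics discussion, here is a hedged sketch of one way an extreme-angle fidelity score could be computed: the mean error between the requested absolute pitch and the pitch recovered from each generated frame by a single-image calibration model (e.g., GeoCalib or Perspective Fields from the reference list). `estimate_pitch` is a hypothetical stand-in, and the paper's actual metric may be defined differently.

```python
# Hypothetical extreme-angle fidelity metric; not the paper's definition.
def pitch_error_deg(frames, requested_pitch_deg, estimate_pitch):
    """Mean |requested - estimated| camera pitch over a clip, in degrees.

    estimate_pitch: callable mapping one frame to an estimated pitch in degrees,
    e.g. a wrapper around a single-image calibration model.
    """
    errors = [abs(requested_pitch_deg - estimate_pitch(frame)) for frame in frames]
    return sum(errors) / max(len(errors), 1)

# A prompt-entanglement score could then compare this error between prompts that
# agree with the camera (e.g. "clouds" at +80 deg pitch) and prompts that
# conflict with it (e.g. "grass" at +80 deg pitch); a large gap would indicate
# the text prompt is overriding the camera conditioning.
```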

Circularity Check

0 steps flagged

No circularity; empirical training recipe with no self-referential derivations

full rationale

The paper introduces GimbalDiffusion as a training-based framework using 360° panoramic videos and null-pitch conditioning to achieve gravity-aware camera control. No equations, closed-form derivations, or parameter-fitting steps are described that reduce the claimed control accuracy or generalization to quantities defined or fitted inside the same work. The method is presented as an empirical recipe (data choice + conditioning strategy + new benchmarks) rather than a mathematical chain. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs in the provided derivation. Generalization from panoramic to conventional video distributions is an empirical question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that panoramic video data plus a simple conditioning trick suffices to learn a general camera controller; no new physical constants or particles are introduced.

axioms (1)
  • domain assumption: Diffusion-based video generators can be conditioned on explicit camera parameters when trained on sufficiently diverse viewpoint data.
    Invoked implicitly when stating that absolute coordinates plus panoramic training will yield accurate control.

pith-pipeline@v0.9.0 · 5532 in / 1259 out tokens · 36613 ms · 2026-05-16T23:36:53.300196+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CalibAnyView: Beyond Single-View Camera Calibration in the Wild

    cs.CV · 2026-05 · conditional novelty 8.0

    A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Towards high resolution video generation with progressive growing of sliced wasserstein gans

    Dinesh Acharya, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Towards high resolution video generation with progressive growing of sliced wasserstein gans. CoRR, 2018.

  2. [2]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In IEEE/CVF Conf. Comput. Vis. Pattern Recog.

  3. [3]

    VD3D: Taming large video diffusion transformers for 3d camera control

    Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. Lindell, and Sergey Tulyakov. VD3D: Taming large video diffusion transformers for 3d camera control. In Int. Conf. Learn. Represent., 2025.

  4. [4]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In IEEE/CVF Int. Conf. Comput. Vis., 2025.

  5. [5]

    PreciseCam: Precise camera control for text-to-image generation

    Edurne Bernal-Berdun, Ana Serrano, Belen Masia, Matheus Gadelha, Yannick Hold-Geoffroy, Xin Sun, and Diego Gutierrez. PreciseCam: Precise camera control for text-to-image generation. In IEEE/CVF Conf. Comput. Vis. Pattern Recog.

  6. [6]

    MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion

    Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mohammad Soleymani. MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion. In Int. Conf. Mach. Learn., 2024.

  7. [7]

    Control-a-video: Controllable text-to-video generation with diffusion models

    Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. CoRR, 2023.

  8. [8]

    Egocentric scene understanding via multimodal spatial rectifier

    Tien Do, Khiem Vuong, and Hyun Soo Park. Egocentric scene understanding via multimodal spatial rectifier. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2022.

  9. [9]

    RealEstate10K: A large-scale dataset of camera poses

    Google Research. RealEstate10K: A large-scale dataset of camera poses. https://google.github.io/realestate10k/, 2018. Camera trajectories from approximately 80,000 video clips (from 10,000 YouTube videos), totaling about 10 million frames; poses generated via SLAM and bundle-adjustment.

  10. [10]

    Diffusion as shader: 3d-aware video diffusion for versatile video generation control

    Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, and Yuan Liu. Diffusion as shader: 3d-aware video diffusion for versatile video generation control. In ACM SIGGRAPH Conf., 2025.

  11. [11]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.

  12. [12]

    CameraCtrl II: Dynamic scene exploration via camera-controlled video diffusion models

    Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. CameraCtrl II: Dynamic scene exploration via camera-controlled video diffusion models. arXiv preprint arXiv:2503.10592, 2025.

  13. [13]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Conf. Emp. Metho. Nat. Lang. Proc.

  14. [14]

    Animate anyone: Consistent and controllable image-to-video synthesis for character animation

    Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023.

  15. [15]

    Vipe: Video pose engine for 3d geometric perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. Vipe: Video pose engine for 3d geometric perception. In NVIDIA Research Whitepapers, 2025.

  16. [16]

    Megasam: Scaling up camera pose estimation with a foundation model for structure-from-motion

    Fang Jiang et al. Megasam: Scaling up camera pose estimation with a foundation model for structure-from-motion. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2025.

  17. [17]

    Perspective fields for single image camera calibration

    Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Blackburn-Matzen, Matthew Sticha, and David F Fouhey. Perspective fields for single image camera calibration. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2023.

  18. [18]

    Perspective fields for single image camera calibration

    Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, and David F. Fouhey. Perspective fields for single image camera calibration. IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2023.

  19. [19]

    Spad: Spatially aware multi-view diffusers

    Yash Kant, Aliaksandr Siarohin, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, and Igor Gilitschenski. Spad: Spatially aware multi-view diffusers. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.

  20. [20]

    Analyzing and improving the training dynamics of diffusion models

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.

  21. [21]

    Temporally consistent horizon lines

    Florian Kluger, Hanno Ackermann, Michael Ying Yang, and Bodo Rosenhahn. Temporally consistent horizon lines. In Int. Conf. Robot. Autom., 2020.

  22. [22]

    LightIt: Illumination modeling and control for diffusion models

    Peter Kocsis, Julien Philip, Kalyan Sunkavalli, Matthias Nießner, and Yannick Hold-Geoffroy. LightIt: Illumination modeling and control for diffusion models. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.

  23. [23]

    Grounding image matching in 3d with Mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with Mast3r. In Eur. Conf. Comput. Vis., 2024.

  24. [24]

    Cameras as relative positional encoding

    Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. In Adv. Neural Inform. Process. Syst., 2025.

  25. [25]

    Video generation from text

    Yitong Li, Martin Min, Dinghan Shen, David Carlson, and Lawrence Carin. Video generation from text. In Assoc. Adv. of Art. Int., 2018.

  26. [26]

    LightLab: Controlling light sources in images with diffusion models

    Nadav Magar, Amir Hertz, Eric Tabellion, Yael Pritch, Alex Rav-Acha, Ariel Shamir, and Yedid Hoshen. LightLab: Controlling light sources in images with diffusion models. In ACM SIGGRAPH Conf., 2025.

  27. [27]

    Openmvg: Open multiple view geometry

    Pierre Moulon, Pascal Monasse, Romuald Perrot, and Renaud Marlet. Openmvg: Open multiple view geometry. In Int. Work. Reproduc. Res. Patt. Recog., 2016.

  28. [28]

    Structure-from-motion revisited

    Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2016.

  29. [29]

    Mocogan: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In IEEE/CVF Conf. Comput. Vis. Pattern Recog.

  30. [30]

    GeoCalib: Single-image calibration with geometric optimization

    Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Single-image calibration with geometric optimization. In Eur. Conf. Comput. Vis., 2024.

  31. [31]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

  32. [32]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2025.

  33. [33]

    SpatialVID: A large-scale video dataset with spatial annotations

    Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, and Yao Yao. SpatialVID: A large-scale video dataset with spatial annotations, 2025.

  34. [34]

    360dvd: Controllable panorama video generation with 360-degree video diffusion model

    Qian Wang, Weiqi Li, Chong Mou, Xinhua Cheng, and Jian Zhang. 360dvd: Controllable panorama video generation with 360-degree video diffusion model. arXiv preprint arXiv:2401.06578, 2024.

  35. [35]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.

  36. [36]

    Diffusion models for video generation

    Lilian Weng. Diffusion models for video generation. Lil’Log (blog), 2024. https://lilianweng.github.io/posts/2024-04-12-diffusion-video/.

  37. [37]

    Horizon lines in the wild

    Scott Workman, Menghua Zhai, and Nathan Jacobs. Horizon lines in the wild. In Brit. Mach. Vis. Conf., 2016.

  38. [38]

    Visualsfm: A visual structure from motion system

    Changchang Wu et al. Visualsfm: A visual structure from motion system, 2011.

  39. [39]

    Uprightnet: geometry-aware camera orientation estimation from single images

    Wenqi Xian, Zhengqi Li, Matthew Fisher, Jonathan Eisenmann, Eli Shechtman, and Noah Snavely. Uprightnet: geometry-aware camera orientation estimation from single images. In IEEE/CVF Int. Conf. Comput. Vis., 2019.

  40. [40]

    Motioncanvas: Cinematic shot design with controllable image-to-video generation

    Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, and Feng Liu. Motioncanvas: Cinematic shot design with controllable image-to-video generation. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025.

  41. [41]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

  42. [42]

    Image sculpting: Precise object editing with 3D geometry control

    Jiraphon Yenphraphai, Xichen Pan, Sainan Liu, Daniele Panozzo, and Saining Xie. Image sculpting: Precise object editing with 3D geometry control. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., 2024.

  43. [43]

    TrajectoryCrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. TrajectoryCrafter: Redirecting camera trajectory for monocular videos via diffusion models. In IEEE/CVF Int. Conf. Comput. Vis., 2025.

  44. [44]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.