arxiv: 2606.27345 · v2 · pith:2P52PR4Gnew · submitted 2026-06-25 · 💻 cs.CV

RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

Pith reviewed 2026-06-29 04:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords positional encoding · video diffusion · Plucker coordinates · 3D consistency · ray space · camera control · attention mechanism · diffusion transformers

The pith

RayPE injects 6D Plucker ray coordinates into video diffusion transformer attention to capture scene geometry that standard RoPE misses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard video diffusion transformers encode token positions with RoPE on the image and time axes (u, v, t), which describes the camera's sampling grid but carries no information about the underlying 3D scene. The paper observes that the geometric relation between any two camera rays is captured by their Plucker reciprocal product, a bilinear operation with the same algebraic form as the dot product inside attention. RayPE therefore injects the 6D Plucker coordinates of each ray additively into the query and key vectors, using a query/key flip arrangement under which the symmetric identity configuration reproduces the reciprocal product exactly. The resulting attention score splits into a content term, a geometry term, and two cross terms; the paper's experiments report that each is individually necessary. The module is zero-initialized, adds less than 0.1 percent parameters, and is stabilized across heterogeneous camera-translation scales by decoupling ray direction from moment magnitude, gating on log-magnitude, and aligning with the content branch via RMSNorm.
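
To make the bilinear analogy concrete, here is a minimal NumPy sketch; it is not the paper's implementation. It assumes the common convention that a ray with origin o and unit direction d has Plucker coordinates (d, m) with moment m = o × d, and the helper names `plucker`, `flip`, and `reciprocal_product` are hypothetical. It illustrates that the reciprocal product becomes an ordinary dot product once one ray has its direction and moment halves swapped, and that it vanishes for two rays that meet at a point.

```python
import numpy as np

def plucker(origin, direction):
    """Return 6D Plucker coordinates (d, m) of a ray, with m = o x d."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    m = np.cross(np.asarray(origin, dtype=float), d)
    return np.concatenate([d, m])

def flip(ray):
    """Swap the direction and moment halves: (d, m) -> (m, d)."""
    return np.concatenate([ray[3:], ray[:3]])

def reciprocal_product(a, b):
    """Plucker reciprocal product d_a . m_b + d_b . m_a."""
    return a[:3] @ b[3:] + b[:3] @ a[3:]

# Two rays from different camera centres that pass through the same 3D point.
point = np.array([1.0, 2.0, 3.0])
ray_a = plucker(origin=[0.0, 0.0, 0.0], direction=point)
ray_b = plucker(origin=[4.0, 0.0, 0.0], direction=point - np.array([4.0, 0.0, 0.0]))

# A third ray that is skew to ray_a (the two lines never meet).
ray_c = plucker(origin=[0.0, 0.0, 1.0], direction=[1.0, 0.0, 0.0])

print("intersecting rays:", reciprocal_product(ray_a, ray_b))
print("skew rays:        ", reciprocal_product(ray_a, ray_c))
print("as a dot product: ", ray_a @ flip(ray_c))
```

The last two printed values agree, which is the property the query/key flip relies on: a plain dot product between one ray and the flipped other ray is the reciprocal product.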

Core claim

RayPE injects per-token 6D Plucker coordinates additively into the queries and keys of self-attention, with a query/key flip arrangement under which the symmetric identity configuration coincides exactly with the reciprocal product. The attention score therefore decomposes into a content term, a geometry term, and two content-geometry cross-terms, all of which the paper's experiments find individually necessary. Direction is decoupled from moment magnitude, the geometry contribution is gated by a learned function of log-magnitude, and RMSNorm aligns it with the content branch.
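
The four-term decomposition follows from expanding a dot product of sums, which the sketch below checks numerically. It is an illustration under stated assumptions, not the authors' code: the vectors are random stand-ins (not valid Plucker lines), and the "identity configuration" is read here as copying the six ray channels into the first six head channels via a hypothetical `embed` matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim = 16

# Content branch: stand-ins for the pretrained query and key of one token pair.
q_content = rng.normal(size=head_dim)
k_content = rng.normal(size=head_dim)

# Geometry branch: 6D ray coordinates (d, m) for the query and key tokens.
ray_q = rng.normal(size=6)
ray_k = rng.normal(size=6)
ray_k_flipped = np.concatenate([ray_k[3:], ray_k[:3]])  # (m, d)

# Assumed identity configuration: the 6 ray channels occupy the first
# 6 head channels and the remaining channels stay at zero.
embed = np.eye(6, head_dim)
q_geom = ray_q @ embed
k_geom = ray_k_flipped @ embed

# Additive injection, then the usual dot-product score.
score = (q_content + q_geom) @ (k_content + k_geom)

terms = {
    "content": q_content @ k_content,
    "geometry": q_geom @ k_geom,
    "content_x_geometry": q_content @ k_geom,
    "geometry_x_content": q_geom @ k_content,
}

# Reciprocal product written out directly from the two rays.
reciprocal = ray_q[:3] @ ray_k[3:] + ray_k[:3] @ ray_q[3:]

print(np.isclose(score, sum(terms.values())))
print(np.isclose(terms["geometry"], reciprocal))
```

Under this reading, the geometry term alone equals the reciprocal product, while the two cross terms couple each token's content with the other token's ray.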

What carries the argument

RayPE: additive injection of 6D Plucker coordinates into attention queries and keys so the reciprocal-product bilinear form appears directly in the attention score.

If this is right

  • The full module adds less than 0.1 percent parameters to a pretrained video DiT.
  • Zero initialization allows training to begin from the original pretrained weights.
  • Camera controllability improves on generated videos.
  • Cross-frame 3D consistency improves on the training mixture.
  • Overall video quality improves on the same four-dataset mixture.
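
The zero-initialization point can be checked with a small sketch. It assumes the geometry branch is a linear projection of the 6D ray added to the content query and key; the names `E_q` and `E_k` follow the Eq/Ek labels in the Figure 3 caption, while the shapes and the omission of the gate and RMSNorm are simplifications made here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, head_dim = 4, 8

# Stand-ins for pretrained content queries and keys.
q_content = rng.normal(size=(num_tokens, head_dim))
k_content = rng.normal(size=(num_tokens, head_dim))

# Per-token 6D ray coordinates (random placeholders).
rays = rng.normal(size=(num_tokens, 6))

# Zero-initialized geometry projections.
E_q = np.zeros((6, head_dim))
E_k = np.zeros((6, head_dim))

baseline_scores = q_content @ k_content.T
raype_scores = (q_content + rays @ E_q) @ (k_content + rays @ E_k).T

# At initialization the geometry branch contributes nothing.
print(np.allclose(baseline_scores, raype_scores))
```

With both projections at zero, the attention logits match the pretrained ones, so fine-tuning starts from the original model's behaviour.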

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The clean separation of geometry and content terms suggests the same bilinear injection could be tried on other attention-based spatial models without redesigning the backbone.
  • Decoupling ray direction from moment magnitude may let the same encoding handle both normalized and metric camera data in future video or image generators.
  • Because the change is additive and zero-initialized, it offers a low-risk route for adding geometric awareness to any existing DiT checkpoint.

Load-bearing premise

The bilinear reciprocal-product term, when added to content-based attention and gated by a learned function of log-magnitude, produces measurable gains in 3D consistency rather than being absorbed into the existing content pathway.

What would settle it

Ablating only the geometry term while keeping the rest of RayPE and measuring whether cross-frame 3D consistency metrics on the four-dataset mixture drop back to the RoPE baseline level.
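
A term-level ablation of this kind could be set up along the following lines. This is a hypothetical sketch at the level of attention logits, with made-up names (`attention_logits`, `drop`); it says nothing about how the paper's own ablations are implemented.

```python
import numpy as np

def attention_logits(q_c, k_c, q_g, k_g, drop=()):
    """Sum the four score terms, skipping any whose name appears in `drop`."""
    terms = {
        "content": q_c @ k_c.T,
        "geometry": q_g @ k_g.T,
        "content_x_geometry": q_c @ k_g.T,
        "geometry_x_content": q_g @ k_c.T,
    }
    return sum(value for name, value in terms.items() if name not in drop)

rng = np.random.default_rng(0)
q_c, k_c, q_g, k_g = (rng.normal(size=(4, 8)) for _ in range(4))

full = attention_logits(q_c, k_c, q_g, k_g)
no_geometry = attention_logits(q_c, k_c, q_g, k_g, drop=("geometry",))

# Dropping only the geometry term removes exactly q_g @ k_g.T from the logits.
print(np.allclose(full - no_geometry, q_g @ k_g.T))
```

Computing the terms separately, rather than adding the branches before the dot product, is what makes it possible to remove one term while leaving the cross terms in place.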

Figures

Figures reproduced from arXiv: 2606.27345 by Jiahao Lu, Kai Han, Minghao Yin, Shan Ying, Wang Zhao, Wenbo Hu.

Figure 1: RayPE enables precise relative camera control for pretrained video diffusion models. Given a target camera trajectory, our method …
Figure 2: Why attention needs ray geometry. Two cameras (blue, …
Figure 3: Applying RayPE to a self-attention layer. The content path (pink) preserves the pretrained Wq/Wk → QKNorm → RoPE pipeline unchanged. The geometry path (orange/blue) computes per-token Plücker coordinates from camera parameters, decomposes them into scale-invariant direction and log-magnitude, projects via zero-initialized Eq/Ek, normalizes (RMSNorm), and gates by a learned function of the log-magnitude…
Figure 4: Qualitative comparison on RealEstate10K. Each row shows frames from the same clip generated under the same target camera …
Figure 5: Gallery of RayPE on diverse scenes. Each row is a different first frame driven by a distinct target camera trajectory; frames go …
Figure 6: Out-of-distribution qualitative comparison. Inputs are artistic paintings (rather than natural photographs), conditioned on simple …
Figure 7: Generalization to stylized inputs. Each row is a different hand-painted concept-art first frame driven by a distinct target camera …
Figure 8: Multi-trajectory gallery on cinematic out-of-distribution inputs. We fix two movie-still first frames and drive each through …

Original abstract

Modern video diffusion transformers position their tokens through RoPE on the (u,v,t) axes -- a description of the camera's sampling grid that says nothing about the 3D structure of the scene. We observe that the geometric relation between two camera rays is captured by the Plucker reciprocal product, which is bilinear in the two rays -- the same algebraic form as the dot product in Transformer attention. Building on this analogy, we propose RayPE, a positional-encoding extension that injects per-token 6D Plucker coordinates additively into the queries and keys of self-attention, with a query/key flip arrangement under which the symmetric identity configuration coincides exactly with the reciprocal product. Because the injection is additive, the resulting attention score decomposes into a content term, a geometry term, and two content-geometry cross-terms -- all of which our experiments find individually necessary. To make the encoding stable across video data with heterogeneous camera-translation scales (SfM, deep SLAM, metric), we further decouple ray direction from moment magnitude, gate the encoding by a learned function of the log-magnitude, and apply RMSNorm to align it with the QKNorm-normalized content branch. The full module adds less than 0.1% parameters to a pretrained video DiT, is zero-initialized to start from the pretrained weights, and improves camera controllability, cross-frame 3D consistency, and overall video quality on a four-dataset training mixture.
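
The stabilization steps in the abstract (decouple direction from moment magnitude, project, RMSNorm, gate on log-magnitude) can be sketched as follows. Everything specific here is an assumption made for illustration: the moment convention m = o × d, the unit-normalized moment, the sigmoid form of `gate`, and all function names. The paper only states that the gate is a learned function of the log-magnitude.

```python
import numpy as np

def plucker(origin, direction):
    """6D Plucker coordinates (d, m) with unit d and m = o x d."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return np.concatenate([d, np.cross(np.asarray(origin, dtype=float), d)])

def decompose(ray, eps=1e-8):
    """Split (d, m) into scale-invariant directions and a log-magnitude."""
    d, m = ray[:3], ray[3:]
    magnitude = np.linalg.norm(m)
    return np.concatenate([d, m / (magnitude + eps)]), np.log(magnitude + eps)

def rms_norm(x, eps=1e-6):
    """Parameter-free RMSNorm over the channel axis."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

def gate(log_magnitude, weight, bias):
    """Hypothetical scalar gate: sigmoid of an affine map of log|m|."""
    return 1.0 / (1.0 + np.exp(-(weight * log_magnitude + bias)))

def geometry_embedding(ray, projection, weight, bias):
    """Project the direction features, normalize, then gate by log|m|."""
    features, log_magnitude = decompose(ray)
    return gate(log_magnitude, weight, bias) * rms_norm(features @ projection)

# The same ray expressed in two translation scales (e.g. SfM units vs metres).
origin = np.array([0.5, -1.0, 2.0])
direction = np.array([0.0, 0.0, 1.0])
scale = 10.0
features_a, log_a = decompose(plucker(origin, direction))
features_b, log_b = decompose(plucker(scale * origin, direction))

print(np.allclose(features_a, features_b))       # direction features unchanged
print(np.isclose(log_b - log_a, np.log(scale)))  # scale moves only log|m|
```

Rescaling the camera translation changes only the log-magnitude, so the scale dependence is confined to the scalar that feeds the gate.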

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RayPE, an extension to positional encodings in video diffusion transformers (DiTs). It injects per-token 6D Plucker coordinates of camera rays additively into the queries and keys of self-attention via a query/key flip arrangement so that the attention score includes the Plucker reciprocal product as its geometry term. The resulting score decomposes into content, geometry, and two cross terms; a learned gate on log-magnitude plus RMSNorm stabilizes the encoding across heterogeneous camera scales. The module adds <0.1% parameters, is zero-initialized from a pretrained video DiT, and is claimed to improve camera controllability, cross-frame 3D consistency, and video quality on a four-dataset mixture, with experiments asserting that all four terms are individually necessary.

Significance. If the experimental claims are substantiated, the work supplies a lightweight, algebraically motivated way to embed ray-space geometry directly into the attention mechanism of video generators. This could improve controllability and consistency without retraining from scratch or adding substantial capacity, and the bilinear decomposition plus gating strategy may generalize to other geometric priors in diffusion models.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'our experiments find [the four terms] individually necessary' and that the module improves camera controllability, 3D consistency, and video quality rests on an assertion with no accompanying quantitative tables, ablation numbers, error bars, or controls (e.g., random-vector replacement or magnitude-agnostic baseline). Without these data the necessity of the specific Plucker reciprocal-product term versus generic additive capacity cannot be assessed.
  2. [Method (RayPE formulation)] The geometric term is derived from the external Plucker identity rather than optimized against the video-quality metric; the learned gate is a small additional parameter. The manuscript must demonstrate that this algebraic form (rather than any low-parameter additive branch) is responsible for the reported gains, e.g., via an ablation that replaces the reciprocal product with an unstructured modulation of identical dimensionality.
minor comments (2)
  1. [Implementation details] Clarify the exact initialization and training schedule for the learned gate function of log-magnitude so that readers can reproduce the zero-initialized starting point.
  2. [Experiments] The four-dataset mixture is mentioned but its composition, camera-scale statistics, and per-dataset metric breakdowns are not summarized; a table would help evaluate robustness across SfM, deep SLAM, and metric regimes.
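
One possible form of the random-vector control mentioned in major comment 1 is sketched below. It is a hypothetical reading of the referee's suggestion, not something described in the paper: each token's 6D ray is replaced by Gaussian noise rescaled to the same norm, so the additive branch keeps its capacity and input scale but loses the ray geometry.

```python
import numpy as np

def random_vector_control(rays, rng):
    """Replace each 6D ray with Gaussian noise rescaled to the same norm."""
    noise = rng.normal(size=rays.shape)
    target_norm = np.linalg.norm(rays, axis=-1, keepdims=True)
    noise_norm = np.linalg.norm(noise, axis=-1, keepdims=True)
    return noise * target_norm / noise_norm

rng = np.random.default_rng(0)
rays = rng.normal(size=(5, 6))  # placeholder per-token ray coordinates
control = random_vector_control(rays, rng)

# Same shape and per-token norm as the real input, but unrelated content.
print(control.shape)
print(np.allclose(np.linalg.norm(control, axis=-1),
                  np.linalg.norm(rays, axis=-1)))
```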

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The comments highlight the need for stronger quantitative substantiation of our experimental claims, which we will address through targeted revisions and additional ablations. Below we respond point by point.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'our experiments find [the four terms] individually necessary' and that the module improves camera controllability, 3D consistency, and video quality rests on an assertion with no accompanying quantitative tables, ablation numbers, error bars, or controls (e.g., random-vector replacement or magnitude-agnostic baseline). Without these data the necessity of the specific Plucker reciprocal-product term versus generic additive capacity cannot be assessed.

    Authors: We agree that the abstract's summary of the experimental findings would be strengthened by explicit reference to quantitative results. The current manuscript presents ablation studies on the four terms, but we acknowledge the absence of consolidated tables with error bars and controls such as random-vector replacements in the provided sections. In the revision we will add these tables to the main text (and update the abstract to reference them), enabling direct assessment of whether the Plucker term outperforms generic additive capacity. revision: yes

  2. Referee: [Method (RayPE formulation)] The geometric term is derived from the external Plucker identity rather than optimized against the video-quality metric; the learned gate is a small additional parameter. The manuscript must demonstrate that this algebraic form (rather than any low-parameter additive branch) is responsible for the reported gains, e.g., via an ablation that replaces the reciprocal product with an unstructured modulation of identical dimensionality.

    Authors: We accept that an ablation isolating the algebraic form of the reciprocal product is necessary to rule out generic low-parameter effects. While the Plucker identity supplies the bilinear geometry term that aligns with attention, we will implement the suggested control—replacing the reciprocal product with an unstructured modulation of matching dimensionality—and report the comparative results (including effects on controllability and consistency metrics) in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: geometric term derived from external Plucker identity; attention decomposition follows directly from additive injection

full rationale

The paper's core construction starts from the external algebraic fact that the Plucker reciprocal product is bilinear in ray coordinates (identical in form to the dot product), then defines an additive Q/K injection with a flip symmetry so the geometry term appears explicitly in the decomposed attention score. This decomposition is a direct algebraic consequence of the chosen injection and does not presuppose the target video-quality metric or any fitted result. The log-magnitude gate and RMSNorm are additional learned components whose necessity is asserted via experiment rather than by definition. No self-citation is load-bearing for the central claim, no parameter is fitted to the evaluation metric and then relabeled as a prediction, and no uniqueness theorem or ansatz is smuggled from prior author work. The reported gains therefore rest on empirical verification rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the algebraic identity between the reciprocal product and the dot-product form of attention, plus the assumption that a learned scalar gate on log-magnitude will stabilize the encoding across heterogeneous camera scales without introducing new free parameters that dominate the result.

free parameters (1)
  • learned gate function of log-magnitude
    A small neural function that modulates the geometry term based on ray moment magnitude; its parameters are fitted during fine-tuning.
axioms (2)
  • standard math: The Plucker reciprocal product is bilinear in the two rays and therefore compatible with the algebraic form of scaled dot-product attention.
    Invoked in the second sentence of the abstract to justify the additive injection.
  • domain assumption: Heterogeneous camera-translation scales in SfM, deep SLAM, and metric data require explicit decoupling of direction from moment magnitude.
    Stated as motivation for the gating and RMSNorm steps.
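
For reference, the first axiom can be written out as a formula. The notation is introduced here rather than taken from the paper: each ray is a 6-vector ℓ_i = (d_i, m_i) with direction d_i and moment m_i, and q, k are a query and key of width d_k.

```latex
\[
\langle \ell_1, \ell_2 \rangle
  = d_1^{\top} m_2 + d_2^{\top} m_1
  = \ell_1^{\top}
    \begin{pmatrix} 0 & I_3 \\ I_3 & 0 \end{pmatrix}
    \ell_2,
\qquad
\ell_i = \begin{pmatrix} d_i \\ m_i \end{pmatrix} \in \mathbb{R}^{6},
\]
\[
\text{compared with the attention logit}\quad
s(q, k) = \frac{q^{\top} k}{\sqrt{d_k}} .
\]
```

Both expressions are bilinear in their two arguments; the block-swap matrix in the first is the operation that a query/key flip can absorb into one side.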


