
arxiv: 2605.15391 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

Pith reviewed 2026-05-19 15:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords panoramic video generation · geometric consistency · depth consistency loss · trajectory consistency · 360 degree video · world modeling · spherical geometry · embodied AI

The pith

PanoWorld improves geometric consistency in panoramic videos by enforcing depth and trajectory constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing panoramic video methods optimize for visual realism but produce outputs with inconsistent depth, broken correspondences, and implausible motion across the spherical surface. PanoWorld addresses this by framing the task as geometry- and dynamics-consistent latent state modeling on top of a pre-trained perspective video model. It adds a depth consistency loss against pseudo ground-truth panoramic depth and a trajectory consistency loss that supervises 3D world-frame positions of tracked points, plus spherical-geometry-aware adaptation to conditioning and positional encoding. A new dataset called PanoGeo supplies consistent depth, trajectory, and prompt annotations across real and synthetic sources for training and evaluation. Experiments indicate improved geometric consistency while keeping visual realism competitive, supporting the need for geometric modeling in applications like embodied AI.

Core claim

By framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem and introducing depth consistency loss against pseudo ground-truth panoramic depth plus trajectory consistency loss on 3D world-frame positions, along with spherical-geometry-aware adaptation, PanoWorld generates 360° videos from a single image and caption that exhibit better geometric consistency than prior methods while maintaining competitive visual realism.

What carries the argument

Depth consistency loss against pseudo ground-truth panoramic depth combined with trajectory consistency loss supervising 3D world-frame positions of tracked points, plus spherical-geometry-aware adaptation to conditioning and positional encoding.
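
A minimal NumPy sketch of how these two regularizers could be computed on equirectangular (ERP) frames. The axis convention (y up, z forward at zero longitude), the mean-absolute and mean-Euclidean reductions, and all function names are illustrative assumptions, not the paper's implementation; the page only states that depth is compared against pseudo ground-truth and that tracked points are supervised in the 3D world frame.

```python
import numpy as np

def erp_rays(height, width):
    """Unit view directions for every pixel center of an ERP image."""
    lon = ((np.arange(width) + 0.5) / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Assumed convention: y up, z forward at lon = 0.
    return np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

def depth_consistency_loss(d_gen, d_pseudo):
    """Mean absolute depth difference (the choice of norm is an assumption)."""
    return float(np.mean(np.abs(d_gen - d_pseudo)))

def lift_tracks(tracks_px, depth, cam_to_world):
    """Lift integer ERP tracks (N, 2) given as (col, row) to world points (N, 3)."""
    rays = erp_rays(*depth.shape)
    cols, rows = tracks_px[:, 0].astype(int), tracks_px[:, 1].astype(int)
    pts_cam = rays[rows, cols] * depth[rows, cols][:, None]
    rotation, translation = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ rotation.T + translation

def trajectory_consistency_loss(pts_gen, pts_ref):
    """Mean Euclidean distance between generated and reference world points."""
    return float(np.mean(np.linalg.norm(pts_gen - pts_ref, axis=-1)))
```

Because `lift_tracks` multiplies each ray by the sampled depth, any error in the depth map flows directly into the lifted 3D positions, which is the shared dependency the referee report raises.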

If this is right

  • Panoramic video outputs show consistent depth and unbroken correspondences across the spherical surface.
  • Geometric consistency improves over prior methods that treat generation as pure visual synthesis.
  • Visual realism remains competitive with existing panoramic generation approaches.
  • Panoramic video generation must be treated as a geometric modeling problem to support holistic spatial understanding in embodied AI.
  • The PanoGeo dataset enables training and stratified evaluation with unified geometry annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same depth and trajectory regularizers could be adapted to other immersive or wide-field video generation settings.
  • Consistent 3D modeling may improve reliability in downstream tasks such as navigation or object interaction in virtual environments.
  • Joint optimization with learned depth estimation could reduce reliance on pseudo ground-truth depth in future extensions.

Load-bearing premise

The pseudo ground-truth panoramic depth maps used for the depth consistency loss are accurate enough to enforce genuine 3D consistency without introducing systematic errors or artifacts.

What would settle it

If independent 3D reconstruction or multi-view evaluation of the generated videos reveals persistent depth errors, broken point trajectories, or implausible motion across frames, the geometric consistency improvement would not hold.

Figures

Figures reproduced from arXiv: 2605.15391 by Bishoy Galoaa, Caleb James Lee, Edmund Yeh, Jennifer Dy, Le Jiang, Sarah Ostadabbas, Shayda Moezzi, Tooba Imtiaz, Xiangyu Bai, Yanzhi Wang.

Figure 1: PanoWorld generates geometry-consistent 360° panoramic video from a single perspective image and text prompt. Unlike prior panoramic video methods that optimize for visual realism alone, PanoWorld enforces depth and trajectory consistency in the latent world state, enabling downstream 3D applications that appearance-only methods cannot support: (a) input perspective image, (b) generated 360° panoramic vide…
Figure 2: Training pipeline overview. A ground-truth panoramic video from PanoGeo (blue, left) is VAE-encoded into a clean latent z_0 and mixed with noise ϵ to form z_noise (green, top). Three conditioning signals (orange, center) are derived from a randomly sampled perspective crop: (1) spatial: the crop is projected onto the equirectangular canvas (P2E) and VAE-encoded into reference latent z_ref and mask m_spatial, c…
Figure 3: Qualitative comparison on PanoGeo under Stage 1 input (single perspective image + caption). One representative ERP frame per method on six clips spanning the three evaluation regimes: PanoGeo (rows 1–2), Argus Testset (rows 3–4), and Habitat-Sim (rows 5–6). The red dashed box on each GT frame marks the perspective crop fed to all methods; the rest of the sphere is hallucinated. Compared methods: OmniRoam […
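
The Figure 2 caption mentions projecting a perspective crop onto the equirectangular canvas (P2E) together with a spatial mask. The sketch below shows one way such a coverage mask could be computed; it is an illustration under assumed conventions (pinhole camera, y up, z forward, angles in radians) with a hypothetical function name `p2e_mask`, and it does not reproduce the paper's projection or the VAE encoding step.

```python
import numpy as np

def p2e_mask(height, width, yaw, pitch, hfov, vfov):
    """Boolean ERP mask of the pixels covered by a pinhole view (angles in radians)."""
    lon = ((np.arange(width) + 0.5) / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    lon, lat = np.meshgrid(lon, lat)
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    # Camera axes expressed in world coordinates.
    forward = np.array([cp * sy, sp, cp * cy])
    right = np.array([cy, 0.0, -sy])
    up = np.cross(forward, right)
    x, y, z = dirs @ right, dirs @ up, dirs @ forward
    # A direction is visible if it lies in front of the camera and inside the frustum.
    return ((z > 0)
            & (np.abs(x) <= z * np.tan(hfov / 2))
            & (np.abs(y) <= z * np.tan(vfov / 2)))
```

For a 90° by 90° view centered on the horizon, the mask spans a quarter of the ERP width at the equator; everything outside the mask is the region a generator would have to hallucinate.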
Original abstract

We present PanoWorld, a panoramic video world model that generates geometry-consistent 360° video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents PanoWorld, a method for generating geometry-consistent 360° panoramic videos from a single image and caption. It builds on a pre-trained perspective video world model by adding a depth consistency loss against pseudo ground-truth panoramic depth maps (Eq. 4 in §3.2), a trajectory consistency loss on 3D world-frame positions of tracked points, spherical-geometry-aware adaptations to conditioning and positional encodings, and the new PanoGeo dataset with consistent depth, trajectory, and prompt annotations. Experiments claim improved geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, with code released publicly.

Significance. If the reported geometric consistency gains hold under scrutiny, the work advances panoramic video generation by reframing it as latent 3D state modeling rather than pure 2D synthesis, which aligns with needs in embodied AI for holistic spatial understanding. Strengths include the introduction of the PanoGeo dataset with unified annotations across real and synthetic sources and the public code release, both of which support reproducibility and future extensions.

major comments (3)
  1. [§3.2, Eq. (4)] The depth consistency loss is defined as L_depth = ||D_gen - D_pseudo||, where D_pseudo is produced by an off-the-shelf panoramic depth estimator. Systematic biases in equirectangular depth estimation (e.g., near poles or stitching seams) would cause the generator to optimize toward incorrect 3D states; no ablation or validation of D_pseudo accuracy on the target distribution is provided, so it is unclear whether measured consistency improvements reflect genuine geometry or artifacts of the pseudo-label distribution.
  2. [§3.3] The trajectory consistency loss lifts 2D tracks to 3D world-frame positions using the same depth estimates that feed the depth loss. This creates a dependency in which depth errors affect both regularizers; without an experiment that replaces D_pseudo with ground-truth depth or measures sensitivity to depth noise, the claim that the combined losses enforce independent 3D consistency remains unverified.
  3. [Evaluation section and associated tables] The paper reports improvements in geometric consistency metrics, yet provides insufficient detail on baseline implementations, exact metric definitions, and per-component ablations (depth loss vs. trajectory loss vs. spherical adaptation). This makes it difficult to isolate the contribution of the proposed regularizers to the central claim.
minor comments (3)
  1. The abstract states experimental improvements but does not include any numerical values or baseline names; moving a concise quantitative summary to the abstract would improve readability.
  2. Notation for the spherical positional encoding adaptation is introduced without an explicit equation or diagram showing how it differs from standard sinusoidal encodings; adding this would clarify the geometric-awareness claim.
  3. Figure captions for qualitative results should explicitly state the input image, caption, and which method produced each row to facilitate direct visual comparison.
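
The first major comment asks whether D_pseudo is biased near the poles. One hedged way to check this on sequences that have ground-truth depth is to break the error down by latitude band and to compare it with a solid-angle-weighted average. The function names, the absolute-relative-error metric, and the cos(latitude) weighting below are assumptions for illustration; the paper's own metric definitions are not given on this page.

```python
import numpy as np

def abs_rel_by_latitude(d_pseudo, d_gt, n_bands=3):
    """Mean absolute relative depth error per latitude band, top of the ERP first."""
    abs_rel = np.abs(d_pseudo - d_gt) / d_gt
    bands = np.array_split(np.arange(d_gt.shape[0]), n_bands)
    return [float(abs_rel[rows].mean()) for rows in bands]

def sphere_weighted_abs_rel(d_pseudo, d_gt):
    """Absolute relative error weighted by cos(latitude), i.e. by ERP pixel solid angle."""
    height = d_gt.shape[0]
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    weights = np.cos(lat)[:, None] * np.ones_like(d_gt)
    abs_rel = np.abs(d_pseudo - d_gt) / d_gt
    return float((weights * abs_rel).sum() / weights.sum())
```

A plain per-pixel mean over the ERP grid over-counts polar rows, so a pole-specific bias can look larger in an unweighted average than it is on the sphere; reporting both views makes that distinction visible.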

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our paper. We will revise the manuscript to address the concerns regarding the validation of pseudo-depth labels, the dependency between losses, and the details in the evaluation section.

Point-by-point responses
  1. Referee: [§3.2, Eq. (4)] The depth consistency loss is defined as L_depth = ||D_gen - D_pseudo||, where D_pseudo is produced by an off-the-shelf panoramic depth estimator. Systematic biases in equirectangular depth estimation (e.g., near poles or stitching seams) would cause the generator to optimize toward incorrect 3D states; no ablation or validation of D_pseudo accuracy on the target distribution is provided, so it is unclear whether measured consistency improvements reflect genuine geometry or artifacts of the pseudo-label distribution.

    Authors: We agree that the accuracy of the pseudo ground-truth depth maps is crucial for the validity of the depth consistency loss. Since PanoGeo includes synthetic data with ground-truth depth annotations, we can perform the necessary validation. In the revised manuscript, we will add an analysis of the pseudo-depth estimator's accuracy on the synthetic subset and include an ablation study using ground-truth depth for the loss computation on those sequences. This will help confirm that the improvements are due to genuine geometric consistency rather than estimator-specific artifacts. revision: yes

  2. Referee: [§3.3] The trajectory consistency loss lifts 2D tracks to 3D world-frame positions using the same depth estimates that feed the depth loss. This creates a dependency in which depth errors affect both regularizers; without an experiment that replaces D_pseudo with ground-truth depth or measures sensitivity to depth noise, the claim that the combined losses enforce independent 3D consistency remains unverified.

    Authors: We acknowledge the shared dependency on depth estimates. To verify the independent contributions, we will leverage the ground-truth depth available in the synthetic part of PanoGeo. The revised paper will include experiments replacing D_pseudo with ground-truth depth for both the depth and trajectory losses, as well as a sensitivity analysis where we introduce controlled noise to the depth maps and evaluate the resulting geometric consistency. These additions will strengthen the claim that the losses enforce 3D consistency. revision: yes

  3. Referee: Evaluation section (and associated tables): The paper reports improvements in geometric consistency metrics, yet provides insufficient detail on baseline implementations, exact metric definitions, and per-component ablations (depth loss vs. trajectory loss vs. spherical adaptation). This makes it difficult to isolate the contribution of the proposed regularizers to the central claim.

    Authors: We appreciate the suggestion to improve clarity in the evaluation. We will revise the evaluation section to provide exact definitions of the geometric consistency metrics, detailed descriptions of baseline adaptations and implementations, and comprehensive ablations that isolate the effects of the depth consistency loss, trajectory consistency loss, and spherical-geometry adaptations. This will allow readers to better assess the contribution of each proposed component. revision: yes
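
The sensitivity analysis promised in the second response can be illustrated with a small simulation. The sketch below assumes a static camera at the origin, multiplicative Gaussian depth noise, and a hypothetical helper name `depth_noise_sensitivity`; none of these come from the paper.

```python
import numpy as np

def lift(rays, depth):
    """3D points from unit rays (N, 3) and per-ray depth (N,)."""
    return rays * depth[:, None]

def depth_noise_sensitivity(rays, depth, rel_sigmas, rng):
    """For each relative noise level, return (sigma, depth error, 3D position error)."""
    rows = []
    for sigma in rel_sigmas:
        noisy = depth * (1.0 + sigma * rng.standard_normal(depth.shape))
        depth_err = float(np.mean(np.abs(noisy - depth)))
        pos_err = float(np.mean(np.linalg.norm(lift(rays, noisy) - lift(rays, depth), axis=-1)))
        rows.append((sigma, depth_err, pos_err))
    return rows
```

With unit-length rays, the lifted position error equals the absolute depth error point by point, so under these assumptions a perturbation of D_pseudo shows up identically in both regularizers. This is the coupling the referee describes, and it is why an experiment with ground-truth depth is informative.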

Circularity Check

0 steps flagged

No significant circularity; derivation uses external pseudo-labels and independent losses

full rationale

The paper builds on a pre-trained perspective model (external) and adds a depth consistency loss against the output of an off-the-shelf panoramic depth estimator, a trajectory loss on 3D positions, spherical adaptation, and a new PanoGeo dataset for training and evaluation. The depth loss L_depth = ||D_gen - D_pseudo|| enforces matching to an independent pseudo-GT source rather than defining the target by the model's own outputs. Reported geometric improvements are evaluated on the new dataset with metrics that do not reduce, by construction, to the fitted parameters or to self-citations. No self-definitional, fitted-input-renamed-as-prediction, or load-bearing self-citation steps are present.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the reliability of pseudo ground-truth depth and on the transferability of a pre-trained perspective model to spherical geometry; no free parameters or invented physical entities are visible in the abstract.

axioms (2)
  • domain assumption: Pseudo ground-truth panoramic depth maps are sufficiently accurate to serve as supervision for geometric consistency.
    Invoked when the depth consistency loss is introduced.
  • domain assumption: A pre-trained perspective video world model can be adapted to panoramic output via lightweight regularizers and spherical positional encoding changes.
    Central to the proposed adaptation strategy.

pith-pipeline@v0.9.0 · 5800 in / 1420 out tokens · 47400 ms · 2026-05-19T15:37:19.171881+00:00 · methodology



Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 12 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575,

  2. [2]

    Videophy: Evaluating physical commonsense for video generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520,

  3. [3]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11,

  4. [4]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471,

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127,

  6. [6]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669,

  7. [7]

    Reconviagen: Towards accurate multi-view 3d object reconstruction via generation

    Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi-view 3d object reconstruction via generation. arXiv preprint arXiv:2510.23306,

  8. [8]

    Infinite-canvas: Higher-resolution video outpainting with extensive content generation

    Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, and Wei Liu. Infinite-canvas: Higher-resolution video outpainting with extensive content generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2150–2158, 2025a. Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Ch...

  9. [9]

    Introduction to latent variable energy-based models: a path toward autonomous machine intelligence

    Anna Dawid and Yann LeCun. Introduction to latent variable energy-based models: a path toward autonomous machine intelligence. Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104011,

  10. [10]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a. Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on mac...

  11. [11]

    Vid2world: Crafting video diffusion models to interactive world models

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025a. Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Terra: Explorable native 3d world model with point latents, 2025b. URL https://a...

  12. [12]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125,

  13. [13]

    Omninwm: Omniscient driving navigation world models

    Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025a. Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin, Matt Foutter, Panwang Pan, Chenyu You, Yue Wang, Zhangyang Wang, et al. Mart...

  14. [14]

    Depth any panoramas: A foundation model for panoramic depth estimation

    Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth any panoramas: A foundation model for panoramic depth estimation. arXiv preprint arXiv:2512.16913, 2025.

  15. [15]

    Omniroam: World wandering via long-horizon panoramic video generation

    Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold-Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, et al. Omniroam: World wandering via long-horizon panoramic video generation. arXiv preprint arXiv:2603.30045, 2026.

  16. [16]

    Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

    Panwang Pan, Chenguo Lin, Jingjing Zhao, Chenxin Li, Yuchen Lin, Haopeng Li, Honglei Yan, Kairun Wen, Yunlong Lin, Yixuan Yuan, et al. Diff4splat: Controllable 4d scene generation with latent dynamic reconstruction models. arXiv preprint arXiv:2511.00503,

  17. [17]

    Habitat 3.0: A co-habitat for humans, avatars and robots

    Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023.

  18. [18]

    Vipra: Video prediction for robot actions

    Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, and Deepak Pathak. Vipra: Video prediction for robot actions. arXiv preprint arXiv:2511.07732,

  19. [19]

    GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523,

  20. [20]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786,

  21. [21]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. In ICLR 2019 Workshop on Debugging Machine Learning Models,

  22. [22]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

  23. [23]

    Novel view synthesis with diffusion models

    Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628,

  24. [24]

    Centerface: joint face detection and alignment using face as point

    Yuanyuan Xu, Wan Yan, Genke Yang, Jiliang Luo, Tao Li, and Jianan He. Centerface: joint face detection and alignment using face as point. Scientific Programming, 2020(1):7845384,

  25. [25]

    FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685,

  26. [26]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404,

  27. [27]

    Aether: Geometric-aware unified world modeling

    Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535–8546, 2025a. Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, ...

  28. [28]

    Table 3 reports the same Stage 1 metrics as Table 1 broken down by source (50 clips each)

    hide source-level behaviour that is informative for a panoramic generator’s failure modes: the PanoGeo held-out slice measures within-distribution polish, the Argus testset stresses appearance priors against unseen real-world capture conditions, and Habitat-Sim stresses geometric priors against synthetic but precisely controlled scenes. Table 3 reports t...

  29. [29]

    does the generator produce a self-consistent 3D scene

    and per source over PanoGeo held-out / Argus testset / Habitat-Sim (Table 3). B.2 Metric Definitions. We split metrics into two complementary axes. The correspondence-free axis is well-defined under any camera path (both stages); the correspondence-required axis assumes frame-aligned predictions against GT and is directly meaningful only when frame 0 is anchored...

  30. [30]

    The geometry loss weights are λ_d = 0.3 and λ_τ = 0.06 in L = L_visual + λ_d L_depth + λ_τ L_track

    on T = 93-frame clips at 16 fps. The geometry loss weights are λ_d = 0.3 and λ_τ = 0.06 in L = L_visual + λ_d L_depth + λ_τ L_track. The noise-adaptive confidence factor c(σ) uses σ_max = 3.0, which covers ∼86% of the EDM log-normal noise levels we sample from. The augmented track-state coefficients are α = 0.5, β = 0.25. A linear warm-up of 1,000 iterations ramps both auxil...
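
The numbers in this excerpt can be sanity-checked with a few lines of Python. The sketch makes two labeled assumptions that the truncated text does not confirm: that the warm-up scales the two geometry terms linearly from zero, and that the log-normal noise distribution has ln σ ~ N(0, 1) (parameters chosen here because they reproduce the quoted ∼86% figure). The form of c(σ) is not given, so it is left out.

```python
import math

LAMBDA_D = 0.3        # depth loss weight, from the excerpt
LAMBDA_TAU = 0.06     # trajectory loss weight, from the excerpt
WARMUP_ITERS = 1000   # linear warm-up length, from the excerpt

def warmup_factor(step, warmup_iters=WARMUP_ITERS):
    """Assumed schedule: ramp linearly from 0 to 1, then stay at 1."""
    return min(1.0, step / warmup_iters)

def total_loss(l_visual, l_depth, l_track, step):
    """L = L_visual + w(step) * (lambda_d * L_depth + lambda_tau * L_track)."""
    w = warmup_factor(step)
    return l_visual + w * (LAMBDA_D * l_depth + LAMBDA_TAU * l_track)

def lognormal_coverage(sigma_max, p_mean=0.0, p_std=1.0):
    """P(sigma <= sigma_max) when ln(sigma) ~ N(p_mean, p_std^2); parameters assumed."""
    z = (math.log(sigma_max) - p_mean) / p_std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

CLIP_SECONDS = 93 / 16  # duration of a 93-frame clip at 16 fps
```

Under these assumptions `lognormal_coverage(3.0)` evaluates to roughly 0.864, in line with the quoted ∼86%, and a 93-frame clip at 16 fps lasts 5.8125 seconds.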