PanoWorld: Geometry-Consistent Panoramic Video World Modeling

arxiv: 2605.15391 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

Le Jiang , Xiangyu Bai , Bishoy Galoaa , Shayda Moezzi , Caleb James Lee , Tooba Imtiaz , Edmund Yeh , Jennifer Dy

show 2 more authors

Yanzhi Wang Sarah Ostadabbas

This is my paper

Pith reviewed 2026-05-19 15:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords panoramic video generationgeometric consistencydepth consistency losstrajectory consistency360 degree videoworld modelingspherical geometryembodied AI

0 comments p. Extension

The pith

PanoWorld improves geometric consistency in panoramic videos by enforcing depth and trajectory constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing panoramic video methods optimize for visual realism but produce outputs with inconsistent depth, broken correspondences, and implausible motion across the spherical surface. PanoWorld addresses this by framing the task as geometry- and dynamics-consistent latent state modeling on top of a pre-trained perspective video model. It adds a depth consistency loss against pseudo ground-truth panoramic depth and a trajectory consistency loss that supervises 3D world-frame positions of tracked points, plus spherical-geometry-aware adaptation to conditioning and positional encoding. A new dataset called PanoGeo supplies consistent depth, trajectory, and prompt annotations across real and synthetic sources for training and evaluation. Experiments indicate improved geometric consistency while keeping visual realism competitive, supporting the need for geometric modeling in applications like embodied AI.

Core claim

By framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem and introducing depth consistency loss against pseudo ground-truth panoramic depth plus trajectory consistency loss on 3D world-frame positions, along with spherical-geometry-aware adaptation, PanoWorld generates 360° videos from a single image and caption that exhibit better geometric consistency than prior methods while maintaining competitive visual realism.

What carries the argument

Depth consistency loss against pseudo ground-truth panoramic depth combined with trajectory consistency loss supervising 3D world-frame positions of tracked points, plus spherical-geometry-aware adaptation to conditioning and positional encoding.

If this is right

Panoramic video outputs show consistent depth and unbroken correspondences across the spherical surface.
Geometric consistency improves over prior methods that treat generation as pure visual synthesis.
Visual realism remains competitive with existing panoramic generation approaches.
Panoramic video generation must be treated as a geometric modeling problem to support holistic spatial understanding in embodied AI.
The PanoGeo dataset enables training and stratified evaluation with unified geometry annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same depth and trajectory regularizers could be adapted to other immersive or wide-field video generation settings.
Consistent 3D modeling may improve reliability in downstream tasks such as navigation or object interaction in virtual environments.
Joint optimization with learned depth estimation could reduce reliance on pseudo ground-truth depth in future extensions.

Load-bearing premise

The pseudo ground-truth panoramic depth maps used for the depth consistency loss are accurate enough to enforce genuine 3D consistency without introducing systematic errors or artifacts.

What would settle it

If independent 3D reconstruction or multi-view evaluation of the generated videos reveals persistent depth errors, broken point trajectories, or implausible motion across frames, the geometric consistency improvement would not hold.

Figures

Figures reproduced from arXiv: 2605.15391 by Bishoy Galoaa, Caleb James Lee, Edmund Yeh, Jennifer Dy, Le Jiang, Sarah Ostadabbas, Shayda Moezzi, Tooba Imtiaz, Xiangyu Bai, Yanzhi Wang.

**Figure 1.** Figure 1: PanoWorld generates geometry-consistent 360◦ panoramic video from a single perspective image and text prompt. Unlike prior panoramic video methods that optimize for visual realism alone, PanoWorld enforces depth and trajectory consistency in the latent world state, enabling downstream 3D applications that appearance-only methods cannot support: (a) input perspective image, (b) generated 360◦ panoramic vide… view at source ↗

**Figure 2.** Figure 2: Training pipeline overview. A ground-truth panoramic video from PanoGeo (blue, left) is VAE-encoded into a clean latent z0 and mixed with noise ϵ to form znoise (green, top). Three conditioning signals (orange, center) are derived from a randomly sampled perspective crop: (1) spatial: the crop is projected onto the equirectangular canvas (P2E) and VAE-encoded into reference latent zref and mask mspatial, c… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison on PanoGeo under Stage 1 input (single perspective image + caption). One representative ERP frame per method on six clips spanning the three evaluation regimes: PanoGeo (rows 1–2), Argus Testset (rows 3–4), and Habitat-Sim (rows 5–6). The red dashed box on each GT frame marks the perspective crop fed to all methods; the rest of the sphere is hallucinated. Compared methods: OmniRoam [… view at source ↗

read the original abstract

We present PanoWorld, a panoramic video world model that generates geometry-consistent 360$\degree$ video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PanoWorld shows that layering depth and trajectory regularizers on a pre-trained perspective model can improve geometric consistency in 360 video, but the gains rest on pseudo-depth quality that the paper does not fully stress-test.

read the letter

The main thing to know is that this paper reframes panoramic video generation as explicit 3D state modeling instead of just making plausible frames. They start from an existing perspective world model, add a depth consistency loss against pseudo panoramic depth maps, a trajectory loss on lifted 3D world positions, and a spherical adaptation to the positional encoding. They also release PanoGeo, a dataset that bundles consistent depth, trajectories, and prompts across real and synthetic 360 sources. That combination and the dataset are the concrete new pieces relative to prior panoramic generation work. The approach is practical because the regularizers are lightweight and do not require retraining the base model from scratch. It directly targets the problem that current 360 video outputs often look fine visually but produce broken depth and motion when you try to use them for spatial reasoning in VR or robotics. The experiments apparently demonstrate better geometric metrics than earlier methods while holding visual quality steady, which is the right direction for embodied AI use cases. The soft spot is the depth consistency term. It pulls the generator toward pseudo ground-truth depth produced by an off-the-shelf panoramic estimator. If that estimator carries systematic bias on equirectangular projections, especially near the poles or at seams, the loss can reinforce incorrect 3D structure rather than correct it. The trajectory loss only supervises sparse tracked points and still depends on the same depth for 3D lifting, so it does not fully compensate. The paper does not appear to include an ablation that swaps depth estimators or quantifies how much pseudo-label error propagates into the final consistency scores. That leaves the central claim somewhat dependent on the quality of the pseudo labels. This work is aimed at researchers building world models or video generators for 360-degree settings who need spatial coherence for downstream tasks. Anyone working on video synthesis for robotics or VR will find the method and the new dataset useful to examine. The paper has a clear gap it is trying to close, ships code, and supplies enough concrete additions to justify sending it to peer review rather than desk rejecting it. I would recommend putting it through review; the core idea is worth testing and the dataset is a net positive even if the depth-label sensitivity needs more scrutiny in revision.

Referee Report

3 major / 3 minor

Summary. The paper presents PanoWorld, a method for generating geometry-consistent 360° panoramic videos from a single image and caption. It builds on a pre-trained perspective video world model by adding a depth consistency loss against pseudo ground-truth panoramic depth maps (Eq. 4 in §3.2), a trajectory consistency loss on 3D world-frame positions of tracked points, spherical-geometry-aware adaptations to conditioning and positional encodings, and the new PanoGeo dataset with consistent depth, trajectory, and prompt annotations. Experiments claim improved geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, with code released publicly.

Significance. If the reported geometric consistency gains hold under scrutiny, the work advances panoramic video generation by reframing it as latent 3D state modeling rather than pure 2D synthesis, which aligns with needs in embodied AI for holistic spatial understanding. Strengths include the introduction of the PanoGeo dataset with unified annotations across real and synthetic sources and the public code release, both of which support reproducibility and future extensions.

major comments (3)

[§3.2, Eq. (4)] §3.2, Eq. (4): The depth consistency loss is defined as L_depth = ||D_gen - D_pseudo||, where D_pseudo is produced by an off-the-shelf panoramic depth estimator. Systematic biases in equirectangular depth estimation (e.g., near poles or stitching seams) would cause the generator to optimize toward incorrect 3D states; no ablation or validation of D_pseudo accuracy on the target distribution is provided, so it is unclear whether measured consistency improvements reflect genuine geometry or artifacts of the pseudo-label distribution.
[§3.3] §3.3: The trajectory consistency loss lifts 2D tracks to 3D world-frame positions using the same depth estimates that feed the depth loss. This creates a dependency in which depth errors affect both regularizers; without an experiment that replaces D_pseudo with ground-truth depth or measures sensitivity to depth noise, the claim that the combined losses enforce independent 3D consistency remains unverified.
[Evaluation section] Evaluation section (and associated tables): The paper reports improvements in geometric consistency metrics, yet provides insufficient detail on baseline implementations, exact metric definitions, and per-component ablations (depth loss vs. trajectory loss vs. spherical adaptation). This makes it difficult to isolate the contribution of the proposed regularizers to the central claim.

minor comments (3)

The abstract states experimental improvements but does not include any numerical values or baseline names; moving a concise quantitative summary to the abstract would improve readability.
Notation for the spherical positional encoding adaptation is introduced without an explicit equation or diagram showing how it differs from standard sinusoidal encodings; adding this would clarify the geometric-awareness claim.
Figure captions for qualitative results should explicitly state the input image, caption, and which method produced each row to facilitate direct visual comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our paper. We will revise the manuscript to address the concerns regarding the validation of pseudo-depth labels, the dependency between losses, and the details in the evaluation section.

read point-by-point responses

Referee: [§3.2, Eq. (4)] §3.2, Eq. (4): The depth consistency loss is defined as L_depth = ||D_gen - D_pseudo||, where D_pseudo is produced by an off-the-shelf panoramic depth estimator. Systematic biases in equirectangular depth estimation (e.g., near poles or stitching seams) would cause the generator to optimize toward incorrect 3D states; no ablation or validation of D_pseudo accuracy on the target distribution is provided, so it is unclear whether measured consistency improvements reflect genuine geometry or artifacts of the pseudo-label distribution.

Authors: We agree that the accuracy of the pseudo ground-truth depth maps is crucial for the validity of the depth consistency loss. Since PanoGeo includes synthetic data with ground-truth depth annotations, we can perform the necessary validation. In the revised manuscript, we will add an analysis of the pseudo-depth estimator's accuracy on the synthetic subset and include an ablation study using ground-truth depth for the loss computation on those sequences. This will help confirm that the improvements are due to genuine geometric consistency rather than estimator-specific artifacts. revision: yes
Referee: [§3.3] §3.3: The trajectory consistency loss lifts 2D tracks to 3D world-frame positions using the same depth estimates that feed the depth loss. This creates a dependency in which depth errors affect both regularizers; without an experiment that replaces D_pseudo with ground-truth depth or measures sensitivity to depth noise, the claim that the combined losses enforce independent 3D consistency remains unverified.

Authors: We acknowledge the shared dependency on depth estimates. To verify the independent contributions, we will leverage the ground-truth depth available in the synthetic part of PanoGeo. The revised paper will include experiments replacing D_pseudo with ground-truth depth for both the depth and trajectory losses, as well as a sensitivity analysis where we introduce controlled noise to the depth maps and evaluate the resulting geometric consistency. These additions will strengthen the claim that the losses enforce 3D consistency. revision: yes
Referee: Evaluation section (and associated tables): The paper reports improvements in geometric consistency metrics, yet provides insufficient detail on baseline implementations, exact metric definitions, and per-component ablations (depth loss vs. trajectory loss vs. spherical adaptation). This makes it difficult to isolate the contribution of the proposed regularizers to the central claim.

Authors: We appreciate the suggestion to improve clarity in the evaluation. We will revise the evaluation section to provide exact definitions of the geometric consistency metrics, detailed descriptions of baseline adaptations and implementations, and comprehensive ablations that isolate the effects of the depth consistency loss, trajectory consistency loss, and spherical-geometry adaptations. This will allow readers to better assess the contribution of each proposed component. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses external pseudo-labels and independent losses

full rationale

The paper builds on a pre-trained perspective model (external), adds depth consistency loss against off-the-shelf panoramic depth estimator and trajectory loss on 3D positions, plus spherical adaptation and a new PanoGeo dataset for training/evaluation. The depth loss L_depth = ||D_gen - D_pseudo|| enforces matching to an independent pseudo-GT source rather than defining the target by the model's own outputs. Reported geometric improvements are evaluated on the new dataset with metrics that do not reduce to the fitted parameters or self-citations by construction. No self-definitional, fitted-input-renamed-as-prediction, or load-bearing self-citation steps are present.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the reliability of pseudo ground-truth depth and on the transferability of a pre-trained perspective model to spherical geometry; no free parameters or invented physical entities are visible in the abstract.

axioms (2)

domain assumption Pseudo ground-truth panoramic depth maps are sufficiently accurate to serve as supervision for geometric consistency.
Invoked when the depth consistency loss is introduced.
domain assumption A pre-trained perspective video world model can be adapted to panoramic output via lightweight regularizers and spherical positional encoding changes.
Central to the proposed adaptation strategy.

pith-pipeline@v0.9.0 · 5800 in / 1420 out tokens · 47400 ms · 2026-05-19T15:37:19.171881+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Ldepth = c(σ) [1/|Mq| Σ wh |D̂t,h,w − Dgt t,h,w| + gradient term] (Eq. 4); Ltrack on ξp,t = [X; αΔX; βΔ²X] lifted via πS2 (Eq. 6)
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Latitude-aware RoPE posh ∝ sin(ϕ(h)) + 1; spherical area weighting wh = cos ϕ(h)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 12 internal anchors

[1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat- topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520,

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520,

work page arXiv
[3]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11,

work page 2024
[4]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Reconviagen: Towards accurate multi-view 3d object recon- struction via generation.arXiv preprint arXiv:2510.23306,

10 Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi-view 3d object recon- struction via generation.arXiv preprint arXiv:2510.23306,

work page arXiv
[8]

Infinite-canvas: Higher-resolution video outpainting with extensive content generation

Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, and Wei Liu. Infinite-canvas: Higher-resolution video outpainting with extensive content generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2150–2158, 2025a. Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Ch...

work page arXiv
[9]

Introduction to latent variable energy-based models: a path toward autonomous machine intelligence.Journal of Statistical Mechanics: Theory and Experiment, 2024 (10):104011,

Anna Dawid and Yann LeCun. Introduction to latent variable energy-based models: a path toward autonomous machine intelligence.Journal of Statistical Mechanics: Theory and Experiment, 2024 (10):104011,

work page 2024
[10]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019a. Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on mac...

work page internal anchor Pith review Pith/arXiv arXiv 1912
[11]

Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025a

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025a. Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Terra: Explorable native 3d world model with point latents, 2025b. URL https://a...

work page arXiv
[12]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Omninwm: Omniscient driving navigation world models

Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025a. Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin, Matt Foutter, Panwang Pan, Chenyu You, Yue Wang, Zhangyang Wang, et al. Mart...

work page arXiv
[14]

Depth any panoramas: A foundation model for panoramic depth estimation.arXiv preprint arXiv:2512.16913, 2025

Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth any panoramas: A foundation model for panoramic depth estimation. arXiv preprint arXiv:2512.16913,

work page arXiv
[15]

Omniroam: World wandering via long- horizon panoramic video generation.arXiv preprint arXiv:2603.30045, 2026

Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold- Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, et al. Omniroam: World wandering via long-horizon panoramic video generation.arXiv preprint arXiv:2603.30045,

work page arXiv
[16]

Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan, Chenguo Lin, Jingjing Zhao, Chenxin Li, Yuchen Lin, Haopeng Li, Honglei Yan, Kairun Wen, Yunlong Lin, Yixuan Yuan, et al. Diff4splat: Controllable 4d scene generation with latent dynamic reconstruction models.arXiv preprint arXiv:2511.00503,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724, 2023

Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724,

work page arXiv
[18]

Vipra: Video prediction for robot actions.arXiv preprint arXiv:2511.07732,

12 Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, and Deepak Pathak. Vipra: Video prediction for robot actions.arXiv preprint arXiv:2511.07732,

work page arXiv
[19]

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdul- mohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Fvd: A new metric for video generation

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. InICLR 2019 Workshop on Debugging Machine Learning Models,

work page 2019
[22]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Novel view synthesis with diffusion models.arXiv preprint arXiv:2210.04628,

Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mo- hammad Norouzi. Novel view synthesis with diffusion models.arXiv preprint arXiv:2210.04628,

work page arXiv
[24]

Centerface: joint face detection and alignment using face as point.Scientific Programming, 2020(1):7845384,

Yuanyuan Xu, Wan Yan, Genke Yang, Jiliang Luo, Tao Li, and Jianan He. Centerface: joint face detection and alignment using face as point.Scientific Programming, 2020(1):7845384,

work page 2020
[25]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Aether: Geometric-aware unified world modeling

Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535–8546, 2025a. Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, ...

work page arXiv 1920
[28]

Table 3 reports the same Stage 1 metrics as Table 1 broken down by source (50 clips each)

hide source-level behaviour that is informative for a panoramic generator’s failure modes: the PanoGeo held-out slice measures within-distribution polish, the Argus testset stresses appearance priors against unseen real- world capture conditions, and Habitat-Sim stresses geometric priors against synthetic but precisely controlled scenes. Table 3 reports t...

work page 2025
[29]

does the generator produce a self-consistent 3D scene

and per source over PanoGeo held-out / Argus testset / Habitat-Sim (Table 3). B.2 Metric Definitions We split metrics into two complementary axes. Thecorrespondence-freeaxis is well-defined under any camera path (both stages); thecorrespondence-requiredaxis assumes frame-aligned predictions against GT and is directly meaningful only when frame0is anchored...

work page 2019
[30]

The geometry loss weights are λd=0.3 and λτ=0.06 in L=L visual +λ dLdepth +λ τ Ltrack

on T=93 -frame clips at 16 fps. The geometry loss weights are λd=0.3 and λτ=0.06 in L=L visual +λ dLdepth +λ τ Ltrack. The noise-adaptive confidence factor c(σ) uses σmax=3.0, which covers ∼86% of the EDM log-normal noise levels we sample from. The augmented track-state coefficients are α=0.5, β=0.25 . A linear warm-up of 1,000 iterations ramps both auxil...

work page 2020

[1] [1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat- topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520,

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520,

work page arXiv

[3] [3]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11,

work page 2024

[4] [4]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Reconviagen: Towards accurate multi-view 3d object recon- struction via generation.arXiv preprint arXiv:2510.23306,

10 Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi-view 3d object recon- struction via generation.arXiv preprint arXiv:2510.23306,

work page arXiv

[8] [8]

Infinite-canvas: Higher-resolution video outpainting with extensive content generation

Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, and Wei Liu. Infinite-canvas: Higher-resolution video outpainting with extensive content generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2150–2158, 2025a. Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Ch...

work page arXiv

[9] [9]

Introduction to latent variable energy-based models: a path toward autonomous machine intelligence.Journal of Statistical Mechanics: Theory and Experiment, 2024 (10):104011,

Anna Dawid and Yann LeCun. Introduction to latent variable energy-based models: a path toward autonomous machine intelligence.Journal of Statistical Mechanics: Theory and Experiment, 2024 (10):104011,

work page 2024

[10] [10]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019a. Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on mac...

work page internal anchor Pith review Pith/arXiv arXiv 1912

[11] [11]

Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025a

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025a. Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Terra: Explorable native 3d world model with point latents, 2025b. URL https://a...

work page arXiv

[12] [12]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Omninwm: Omniscient driving navigation world models

Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025a. Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin, Matt Foutter, Panwang Pan, Chenyu You, Yue Wang, Zhangyang Wang, et al. Mart...

work page arXiv

[14] [14]

Depth any panoramas: A foundation model for panoramic depth estimation.arXiv preprint arXiv:2512.16913, 2025

Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth any panoramas: A foundation model for panoramic depth estimation. arXiv preprint arXiv:2512.16913,

work page arXiv

[15] [15]

Omniroam: World wandering via long- horizon panoramic video generation.arXiv preprint arXiv:2603.30045, 2026

Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold- Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, et al. Omniroam: World wandering via long-horizon panoramic video generation.arXiv preprint arXiv:2603.30045,

work page arXiv

[16] [16]

Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan, Chenguo Lin, Jingjing Zhao, Chenxin Li, Yuchen Lin, Haopeng Li, Honglei Yan, Kairun Wen, Yunlong Lin, Yixuan Yuan, et al. Diff4splat: Controllable 4d scene generation with latent dynamic reconstruction models.arXiv preprint arXiv:2511.00503,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724, 2023

Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724,

work page arXiv

[18] [18]

Vipra: Video prediction for robot actions.arXiv preprint arXiv:2511.07732,

12 Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, and Deepak Pathak. Vipra: Video prediction for robot actions.arXiv preprint arXiv:2511.07732,

work page arXiv

[19] [19]

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdul- mohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Fvd: A new metric for video generation

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. InICLR 2019 Workshop on Debugging Machine Learning Models,

work page 2019

[22] [22]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Novel view synthesis with diffusion models.arXiv preprint arXiv:2210.04628,

Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mo- hammad Norouzi. Novel view synthesis with diffusion models.arXiv preprint arXiv:2210.04628,

work page arXiv

[24] [24]

Centerface: joint face detection and alignment using face as point.Scientific Programming, 2020(1):7845384,

Yuanyuan Xu, Wan Yan, Genke Yang, Jiliang Luo, Tao Li, and Jianan He. Centerface: joint face detection and alignment using face as point.Scientific Programming, 2020(1):7845384,

work page 2020

[25] [25]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Aether: Geometric-aware unified world modeling

Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535–8546, 2025a. Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, ...

work page arXiv 1920

[28] [28]

Table 3 reports the same Stage 1 metrics as Table 1 broken down by source (50 clips each)

hide source-level behaviour that is informative for a panoramic generator’s failure modes: the PanoGeo held-out slice measures within-distribution polish, the Argus testset stresses appearance priors against unseen real- world capture conditions, and Habitat-Sim stresses geometric priors against synthetic but precisely controlled scenes. Table 3 reports t...

work page 2025

[29] [29]

does the generator produce a self-consistent 3D scene

and per source over PanoGeo held-out / Argus testset / Habitat-Sim (Table 3). B.2 Metric Definitions We split metrics into two complementary axes. Thecorrespondence-freeaxis is well-defined under any camera path (both stages); thecorrespondence-requiredaxis assumes frame-aligned predictions against GT and is directly meaningful only when frame0is anchored...

work page 2019

[30] [30]

The geometry loss weights are λd=0.3 and λτ=0.06 in L=L visual +λ dLdepth +λ τ Ltrack

on T=93 -frame clips at 16 fps. The geometry loss weights are λd=0.3 and λτ=0.06 in L=L visual +λ dLdepth +λ τ Ltrack. The noise-adaptive confidence factor c(σ) uses σmax=3.0, which covers ∼86% of the EDM log-normal noise levels we sample from. The augmented track-state coefficients are α=0.5, β=0.25 . A linear warm-up of 1,000 iterations ramps both auxil...

work page 2020