arxiv: 2602.23058 · v2 · pith:ZIJLOJXCnew · submitted 2026-02-26 · 💻 cs.CV · cs.RO

GeoWorld: Geometric World Models

Pith reviewed 2026-05-21 11:35 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords geometric world models · hyperbolic geometry · Hyperbolic JEPA · multi-step planning · energy-based models · visual planning · Geometric Reinforcement Learning

The pith

GeoWorld maps latent representations to hyperbolic space to preserve geometric and hierarchical structures for stable multi-step planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that energy-based world models suffer when their latent spaces stay in Euclidean geometry because they lose the underlying geometric and hierarchical relations among states. It proposes mapping those representations onto hyperbolic manifolds with a Hyperbolic JEPA so that the energy landscape supports more reliable planning over several steps. The authors add Geometric Reinforcement Learning to optimize directly in this space. A sympathetic reader would care because better preservation of structure could reduce the rapid performance drop that currently limits long-horizon visual planning.

Core claim

GeoWorld preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds, and introduces Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space with measured gains on standard benchmarks.

What carries the argument

Hyperbolic JEPA that maps Euclidean latent representations onto hyperbolic manifolds to preserve geometric structure and hierarchical relations.
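
The Euclidean-to-hyperbolic mapping described above can be illustrated with a minimal sketch. It assumes a Poincaré ball of curvature -c and the standard exponential map at the origin, which matches the `tanh` form quoted in the Figure 3 caption; the function names (`exp_map_origin`, `mobius_add`, `poincare_distance`) and the example vector are hypothetical and not taken from the paper's code.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean (tangent) vector v onto the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def mobius_add(x, y, c=1.0):
    """Mobius addition, the ball's analogue of vector addition."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / denom
            for a, b in zip(x, y)]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points of the Poincare ball."""
    diff = mobius_add([-a for a in x], y, c)
    norm = math.sqrt(sum(d * d for d in diff))
    # Clamp to stay inside the domain of atanh despite rounding.
    arg = min(math.sqrt(c) * norm, 1 - 1e-12)
    return 2 / math.sqrt(c) * math.atanh(arg)

latent = [0.3, 0.4]            # hypothetical Euclidean latent state
z = exp_map_origin(latent)     # its image inside the unit ball
print(z, poincare_distance([0.0, 0.0], z))
```

Under this convention the geodesic distance from the origin to the mapped point is twice the Euclidean norm of the input, so the map is radially faithful at the origin; distortion between pairs of points away from the origin is the open question the load-bearing premise below turns on.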

If this is right

  • Around 3% success-rate improvement in 3-step planning tasks compared with V-JEPA 2.
  • Around 2% success-rate improvement in 4-step planning tasks compared with V-JEPA 2.
  • Reduced degradation across extended rollouts in visual planning.
  • Demonstrated effectiveness on the CrossTask and COIN datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hyperbolic latent spaces may help energy-based models handle other naturally hierarchical data such as graphs or tree-structured tasks.
  • The same mapping technique could be tested on longer planning horizons or in robotic control settings that rely on visual sequences.
  • If the energy landscape remains well-behaved, the approach might combine with other predictive architectures beyond JEPA variants.

Load-bearing premise

Mapping Euclidean latent representations onto hyperbolic manifolds reliably preserves the underlying geometric and hierarchical structure among states without introducing instabilities or new distortions that undermine the energy landscape for planning.

What would settle it

Training an otherwise identical model with Euclidean latents instead of hyperbolic ones and measuring whether success rates in 3-step and 4-step planning on CrossTask or COIN remain the same or drop.
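
A minimal sketch of how such a comparison could be scored. It assumes the usual procedure-planning convention that a plan counts as a success only when every predicted step matches the reference in order; the function name `success_rate` and the toy action labels are hypothetical placeholders, not data from CrossTask or COIN.

```python
def success_rate(predicted_plans, reference_plans):
    """Fraction of plans whose full action sequence matches the reference exactly."""
    if len(predicted_plans) != len(reference_plans):
        raise ValueError("need one prediction per reference plan")
    if not reference_plans:
        raise ValueError("need at least one plan")
    hits = sum(1 for pred, ref in zip(predicted_plans, reference_plans)
               if list(pred) == list(ref))
    return hits / len(reference_plans)

# Toy 3-step plans (hypothetical labels, for illustration only).
reference = [["open", "swap", "close"], ["pour", "stir", "serve"]]
euclidean_variant = [["open", "swap", "close"], ["pour", "serve", "stir"]]
hyperbolic_variant = [["open", "swap", "close"], ["pour", "stir", "serve"]]

sr_euclidean = success_rate(euclidean_variant, reference)
sr_hyperbolic = success_rate(hyperbolic_variant, reference)
print(sr_euclidean, sr_hyperbolic)
```

Because the metric is all-or-nothing per plan, a single wrong step zeroes the plan, which is why it is sensitive to error accumulation over longer horizons.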

Figures

Figures reproduced from arXiv: 2602.23058 by Danning Li, Ian Reid, Richard Hartley, Zeyu Zhang.

Figure 1: Energy-based planning by GeoWorld. The diagram shows a Replace Memory Chip task from the COIN dataset [71], where GeoWorld plans actions by following geodesics over a hyperbolic energy landscape rather than generating pixels.
Figure 2: Energy landscape comparison for V-JEPA 2 [3] and GeoWorld. We visualize the energy by sweeping two orthonormal tangent-space directions (∆x, ∆y) around a reference latent state. GeoWorld yields a structured, curvature-aware energy landscape that better reflects geometric structure and hierarchical relations among latent states and improves energy-based planning. For more details see Appendix 4.
Figure 3: Overview of GeoWorld. Our geometric world model integrates Hyperbolic JEPA for geometry-preserving latent dynamics and Geometric Reinforcement Learning for geodesic-consistent multi-step refinement. Together with energy-based planning using CEM, GeoWorld enables stable and geometry-aware long-horizon visual planning.
Formally, the hyperbolic latent state is obtained as $s^x_{t,\mathbb{H}} = \exp_0(s^x_t) = \tanh\sqrt{c}\,\lVert s^x$…
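
The caption above mentions energy-based planning with CEM (the cross-entropy method). As a minimal sketch of the general technique: sample candidate action sequences from a Gaussian, score them with an energy function, refit the Gaussian to the lowest-energy elites, and repeat. The quadratic `toy_energy`, its target, and all hyperparameters are hypothetical stand-ins; in GeoWorld the energy would come from the learned world model, which is not reproduced here.

```python
import random
import statistics

def cem_plan(energy, horizon, n_samples=100, n_elite=10, n_iters=20, seed=0):
    """Cross-entropy method: iteratively refit a Gaussian to low-energy samples."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(n_iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        samples.sort(key=energy)          # lowest energy first
        elites = samples[:n_elite]
        for t in range(horizon):
            column = [e[t] for e in elites]
            mean[t] = statistics.fmean(column)
            std[t] = statistics.pstdev(column) + 1e-6  # avoid a zero-width Gaussian
    return mean

TARGET = [0.5, -0.2, 0.1]  # hypothetical optimum, for illustration only

def toy_energy(actions):
    """Stand-in energy: squared distance of the action sequence to TARGET."""
    return sum((a - g) ** 2 for a, g in zip(actions, TARGET))

plan = cem_plan(toy_energy, horizon=3)
print(plan)
```

CEM needs only energy evaluations, not gradients, which is why the shape of the energy landscape matters so much for the quality of the resulting plan.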
Figure 1: Gromov δ-hyperbolicity on CrossTask [88].
Such curvature-aware energy landscapes promote more stable long-horizon planning: CEM naturally follows the hyperbolic geodesics shaped by GeoWorld, resulting in more accurate multi-step trajectory optimization.
5. Ablation Study. Curvature. As discussed in Section 4.3, the curvature K = −c is learned in the logarithmic space by optimizing log(c), which is initialize…
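
Gromov δ-hyperbolicity, named in the caption above, can be computed for a finite point set with the four-point condition: for every quadruple, form the three pairwise distance sums and take half the gap between the two largest. The sketch below is a brute-force version under that definition; the function name `gromov_delta` and both example spaces are illustrative, and the paper may report a normalized variant (for example, relative to the diameter) that is not shown here.

```python
import math
from itertools import combinations

def gromov_delta(points, dist):
    """Four-point Gromov delta of a finite metric space (0 for tree metrics)."""
    delta = 0.0
    for x, y, z, w in combinations(points, 4):
        sums = sorted([dist(x, y) + dist(z, w),
                       dist(x, z) + dist(y, w),
                       dist(x, w) + dist(y, z)])
        # Half the gap between the largest and second-largest sum.
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta

# A Euclidean unit square: not tree-like, so delta is positive.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
delta_square = gromov_delta(square, math.dist)

# A star graph (hub plus three leaves) with shortest-path distances: a tree.
def star_dist(a, b):
    if a == b:
        return 0.0
    return 1.0 if "hub" in (a, b) else 2.0

delta_star = gromov_delta(["hub", "a", "b", "c"], star_dist)
print(delta_square, delta_star)
```

The enumeration is quartic in the number of points, so for real latent sets it would normally be run on a random subsample.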
Figure 2: Geometric effects and curvature dynamics: (a) Poincaré disk geodesics connecting x and y under different curvatures K. As the curvature K becomes less negative (i.e., closer to 0), the hyperbolic distance between x and y increases, and the geodesic paths bend less and shift closer toward the origin. (b) Geodesic patterns induced by different boundary anchor points. Varying the anchor location produces a ch…
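
The excerpts above state that the curvature K = −c is learned by optimizing log(c). A minimal sketch of why that parameterization is convenient: updating log(c) keeps c strictly positive (and K strictly negative) no matter how large a gradient step is. The class name `LogCurvature`, the initial value, the learning rate, and the constant gradient are all hypothetical; the paper's initialization is truncated in the source and is not guessed here.

```python
import math

class LogCurvature:
    """Curvature K = -c with c stored as log(c), so c > 0 by construction."""

    def __init__(self, c_init=1.0):  # c_init is a placeholder, not the paper's value
        self.log_c = math.log(c_init)

    @property
    def c(self):
        return math.exp(self.log_c)

    @property
    def K(self):
        return -self.c

    def step(self, grad_wrt_c, lr=0.1):
        # Chain rule: dL/dlog(c) = dL/dc * c.
        self.log_c -= lr * grad_wrt_c * self.c

curvature = LogCurvature(c_init=1.0)
for _ in range(50):
    # A large constant gradient that would push a raw parameter c below zero.
    curvature.step(grad_wrt_c=5.0, lr=0.5)
print(curvature.c, curvature.K)
```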
Original abstract

Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GeoWorld, a geometric world model for multi-step visual planning in energy-based predictive models. It identifies two challenges with prior work: Euclidean latent spaces that neglect geometric and hierarchical state structure, and rapid degradation in long-horizon rollouts. The proposed solution maps Euclidean latents to hyperbolic manifolds via a Hyperbolic JEPA to preserve structure and hierarchy, combined with Geometric Reinforcement Learning for energy-based optimization in hyperbolic space. Experiments on CrossTask and COIN report ~3% SR gains in 3-step planning and ~2% SR gains in 4-step planning relative to V-JEPA 2.

Significance. If the hyperbolic mapping demonstrably preserves hierarchical relations and stabilizes the planning energy landscape without new distortions, the work would offer a concrete architectural route to incorporating non-Euclidean geometry into world models. The emphasis on an architectural change rather than additional free parameters is a positive feature. The modest reported gains, however, require stronger mechanistic evidence before the approach can be viewed as a clear advance over Euclidean baselines.

major comments (2)
  1. [§3] §3 (Hyperbolic JEPA description): The central claim that mapping Euclidean latents onto hyperbolic manifolds reliably preserves geometric and hierarchical structure is unsupported by any explicit fidelity metric, distance-preservation test, or hierarchy-recovery analysis. This is load-bearing for the multi-step planning stability argument; without such checks, curvature-induced distortions remain a plausible risk to the energy landscape.
  2. [§4] §4 (Experiments and results): The reported 3% and 2% SR improvements on CrossTask/COIN lack ablations that isolate the Hyperbolic JEPA component from the Geometric RL optimizer or other implementation details. Attribution of gains specifically to geometric preservation is therefore not established.
minor comments (2)
  1. [Abstract] Abstract and §4: The success-rate improvements are stated as 'around 3%' and 'around 2%' without reported standard deviations, number of runs, or statistical significance; adding these would strengthen the empirical claims.
  2. [§3] Notation: The definition of the hyperbolic manifold and the precise form of the JEPA loss in hyperbolic space would benefit from an explicit equation reference to avoid ambiguity in the mapping procedure.
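
The statistics requested in the first minor comment are cheap to report. A minimal sketch, assuming one success-rate value per training seed; the numbers in `runs` are placeholder values chosen for illustration and are not results from the paper.

```python
import statistics

def summarize_runs(success_rates):
    """Return (mean, sample standard deviation, number of runs)."""
    if len(success_rates) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return (statistics.mean(success_rates),
            statistics.stdev(success_rates),
            len(success_rates))

runs = [0.40, 0.42, 0.44]  # placeholder per-seed success rates, not paper data
mean_sr, std_sr, n_runs = summarize_runs(runs)
print(f"SR = {mean_sr:.3f} ± {std_sr:.3f} over {n_runs} runs")
```

With a spread of this kind reported for both GeoWorld and the baseline, a reader can judge whether a gap of a few points exceeds seed-to-seed variation.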

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional evidence would strengthen the claims regarding structure preservation and the source of performance gains. We address each major comment below and will incorporate revisions to provide the requested analyses and controls.

Point-by-point responses
  1. Referee: [§3] §3 (Hyperbolic JEPA description): The central claim that mapping Euclidean latents onto hyperbolic manifolds reliably preserves geometric and hierarchical structure is unsupported by any explicit fidelity metric, distance-preservation test, or hierarchy-recovery analysis. This is load-bearing for the multi-step planning stability argument; without such checks, curvature-induced distortions remain a plausible risk to the energy landscape.

    Authors: We acknowledge that the current manuscript does not report explicit quantitative fidelity metrics, distance-preservation tests, or hierarchy-recovery analyses to verify structure preservation after the Euclidean-to-hyperbolic mapping. The Hyperbolic JEPA component is motivated by the established theoretical properties of hyperbolic geometry for embedding hierarchical relations with reduced distortion compared to Euclidean space. To directly address the concern about potential curvature-induced distortions, we will add a dedicated subsection to §3 that includes (i) pairwise distance preservation metrics (relative error between original Euclidean distances and hyperbolic geodesic distances) and (ii) a hierarchy-recovery evaluation using measures such as dendrogram purity on sampled state trajectories. These additions will provide empirical support for the stability argument. revision: yes

  2. Referee: [§4] §4 (Experiments and results): The reported 3% and 2% SR improvements on CrossTask/COIN lack ablations that isolate the Hyperbolic JEPA component from the Geometric RL optimizer or other implementation details. Attribution of gains specifically to geometric preservation is therefore not established.

    Authors: We agree that the reported gains are from the integrated system and that isolating the Hyperbolic JEPA contribution from the Geometric RL optimizer is necessary for clear attribution. The current evaluation compares the full GeoWorld model against V-JEPA 2 but does not include component-wise controls. In the revision we will add ablations that (i) apply Geometric RL on top of the original Euclidean V-JEPA 2 latents and (ii) use Hyperbolic JEPA with standard (non-geometric) energy-based optimization. These controls will help attribute improvements specifically to the geometric mapping while acknowledging that the two components are designed to work together. revision: yes
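
The pairwise distance-preservation check promised in the first response could look like the sketch below. It assumes curvature −1, the exponential map at the origin, and the closed-form Poincaré distance; the `scale=0.5` factor is an assumption introduced here to cancel the factor of 2 that this distance convention has near the origin, and the function names and sample vectors are hypothetical.

```python
import math
from itertools import combinations

def exp_map_origin(v):
    """Exponential map at the origin of the unit Poincare ball (curvature -1)."""
    norm = math.hypot(*v)
    if norm == 0.0:
        return tuple(v)
    return tuple(math.tanh(norm) * x / norm for x in v)

def poincare_distance(x, y):
    """Closed-form geodesic distance in the unit Poincare ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    return math.acosh(1 + 2 * diff2 / ((1 - x2) * (1 - y2)))

def mean_relative_distortion(latents, scale=0.5):
    """Mean of |scale * d_hyp - d_euc| / d_euc over all pairs of latents."""
    errors = []
    for u, v in combinations(latents, 2):
        d_euc = math.dist(u, v)
        d_hyp = poincare_distance(exp_map_origin(u), exp_map_origin(v))
        errors.append(abs(scale * d_hyp - d_euc) / d_euc)
    return statistics_mean(errors)

def statistics_mean(values):
    return sum(values) / len(values)

near_origin = [(0.01, 0.0), (0.0, 0.01), (-0.01, 0.0)]
far_from_origin = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0)]
err_near = mean_relative_distortion(near_origin)
err_far = mean_relative_distortion(far_from_origin)
print(err_near, err_far)
```

Latents close to the origin are nearly undistorted, while latents with large norm are pushed toward the boundary, where hyperbolic distances grow much faster than Euclidean ones. A real analysis would report this error as a function of latent norm rather than as a single number.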

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained via architectural proposal and empirical evaluation

Full rationale

The paper's core derivation consists of proposing a Hyperbolic JEPA to map Euclidean latents onto hyperbolic manifolds for preserving geometric and hierarchical structure, followed by Geometric Reinforcement Learning for energy-based multi-step planning. These are presented as design choices, with performance gains (approximately 3% and 2% SR improvements on 3-step and 4-step planning) reported via direct comparison to the external baseline V-JEPA 2 on CrossTask and COIN. No equations, fitted parameters, or predictions in the abstract reduce to inputs by construction, and there are no load-bearing self-citations or uniqueness theorems invoked from the authors' prior work. The chain is independent: model architecture, training procedure, and benchmark evaluation stand on their own without self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The core modeling choice of hyperbolic embedding is treated as a domain assumption whose justification is not detailed here.

pith-pipeline@v0.9.0 · 5701 in / 986 out tokens · 28970 ms · 2026-05-21T11:35:11.415186+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

    cs.CV 2026-05 unverdicted novelty 6.0

    TrajPilot predicts candidate future trajectories from egocentric context and uses them to condition action prediction in an embedding space, outperforming VLM and planner baselines on Ego-Exo4D, Ego4D, and other datas...

  2. Recovering Physical Dynamics from Discrete Observations via Intrinsic Differential Consistency

    cs.LG 2026-05 unverdicted novelty 6.0

    Enforcing semi-group consistency on a time-conditioned secant velocity field via Symmetry Rupture improves rollout accuracy and efficiency when learning physical dynamics from discrete observations.

  3. HSG: Hyperbolic Scene Graph

    cs.CV 2026-04 unverdicted novelty 6.0

    Hyperbolic Scene Graph (HSG) learns embeddings in hyperbolic space for better hierarchical structure in scene graphs, achieving graph IoU of 33.51 versus 25.37 for the best Euclidean baseline.

  4. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 4 Pith papers · 17 internal anchors

  1. [1]

    Uncertainty-aware anticipation of activities

    Yazan Abu Farha and Juergen Gall. Uncertainty-aware anticipation of activities. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019. 6, 7, 10

  2. [2]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 2, 3, 1, 4

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025. 1, 2, 3, 6, 7, 8, 4, 10, 12, 13

  4. [4]

    VICReg: Variance-invariance-covariance regularization for self-supervised learning

    Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR 2022 - International Conference on Learning Representations, 2022. 2

  5. [5]

    Revisiting feature prediction for learning visual representations from video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. Transactions on Machine Learning Research, 2024. 2, 3, 1, 4

  6. [6]

    Dynamic programming

    Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966. 8

  7. [7]

    Procedure planning in instructional videos via contextual modeling and model-based policy learning

    Jing Bi, Jiebo Luo, and Chenliang Xu. Procedure planning in instructional videos via contextual modeling and model-based policy learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15611–15620, 2021. 3, 6, 7, 9

  8. [8]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision–Language–Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164, 2024. 1

  9. [9]

    Regularizing model-based planning with energy-based models

    Rinu Boney, Juho Kannala, and Alexander Ilin. Regularizing model-based planning with energy-based models. In Conference on Robot Learning, pages 182–191. PMLR, 2020. 1

  10. [10]

    Inverse dynamics pretraining learns good representations for multitask imitation

    David Brandfonbrener, Ofir Nachum, and Joan Bruna. Inverse dynamics pretraining learns good representations for multitask imitation. Advances in Neural Information Processing Systems, 36:66953–66978, 2023. 4

  11. [11]

    Metric spaces of non-positive curvature

    Martin R Bridson and André Haefliger. Metric spaces of non-positive curvature. Springer Science & Business Media,

  12. [12]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015. 2

  13. [13]

    Hyperbolic deep reinforcement learning

    Edoardo Cetin, Benjamin Paul Chamberlain, Michael M Bronstein, and Jonathan J Hunt. Hyperbolic deep reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023. 5, 7

  14. [14]

    Hyperbolic graph convolutional neural networks

    Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. Advances in neural information processing systems, 32,

  15. [15]

    Procedure planning in instructional videos

    Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. In European Conference on Computer Vision, pages 334–350. Springer, 2020. 3, 6, 7, 8, 9

  16. [16]

    Planning with reasoning using vision language world model

    Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722, 2025. 3

  17. [17]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025. 3

  18. [18]

    Why animals don’t have language

    Dorothy L Cheney and Robert M Seyfarth. Why animals don’t have language. Tanner Lectures on Human Values, 19:173–210, 1998. 2

  19. [19]

    Path-value functions for which Dijkstra’s algorithm returns optimal mapping

    Krzysztof Chris Ciesielski, Alexandre Xavier Falcão, and Paulo A V Miranda. Path-value functions for which Dijkstra’s algorithm returns optimal mapping. Journal of Mathematical Imaging and Vision, 60(7):1025–1036, 2018. 8

  20. [20]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 1, 3, 6, 8, 9

  21. [21]

    Value function based reinforcement learning in changing Markovian environments

    Balázs Csanád Csáji and László Monostori. Value function based reinforcement learning in changing Markovian environments. Journal of Machine Learning Research, 9(8),

  22. [22]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-Forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025. 3

  23. [23]

    A tutorial on the cross-entropy method

    Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005. 3, 6, 7, 4

  24. [24]

    Autoregressive Video Generation without Vector Quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169,

  25. [25]

    Hyperbolic image-text representations

    Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shanmukha Ramakrishna Vedantam. Hyperbolic image-text representations. In International Conference on Machine Learning, pages 7694–7731. PMLR, 2023. 5, 11

  26. [26]

    Learning iterative reasoning through energy minimization

    Yilun Du, Shuang Li, Joshua Tenenbaum, and Igor Mordatch. Learning iterative reasoning through energy minimization. In International Conference on Machine Learning, pages 5570–5582. PMLR, 2022. 1

  27. [27]

    Who let the dogs out? modeling dog behavior from visual data

    Kiana Ehsani, Hessam Bagherinezhad, Joseph Redmon, Roozbeh Mottaghi, and Ali Farhadi. Who let the dogs out? Modeling dog behavior from visual data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4051–4060, 2018. 6, 7, 10

  28. [28]

    Hyperbolic neural networks

    Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. Advances in Neural Information Processing Systems, 31, 2018. 5, 7

  29. [29]

    Learning and leveraging world models in visual representation learning

    Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, and Yann LeCun. Learning and leveraging world models in visual representation learning. arXiv preprint arXiv:2403.00504, 2024. 2, 3, 4

  30. [30]

    Hyperbolic contrastive learning for visual representations beyond objects

    Songwei Ge, Shlok Mishra, Simon Kornblith, Chun-Liang Li, and David Jacobs. Hyperbolic contrastive learning for visual representations beyond objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6840–6849, 2023. 5

  31. [31]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842...

  32. [32]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1

  33. [33]

    3D-R1: Enhancing reasoning in 3D VLMs for unified scene understanding

    Ting Huang, Zeyu Zhang, and Hao Tang. 3D-R1: Enhancing reasoning in 3D VLMs for unified scene understanding. arXiv preprint arXiv:2507.23478, 2025. 1

  34. [34]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025. 3

  35. [35]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A Vision–Language–Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054, 2025. 1

  36. [36]

    Propose, assess, search: Harnessing llms for goal-oriented planning in instructional videos

    Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Fu-Jen Chu, Kris Kitani, Gedas Bertasius, and Xitong Yang. Propose, assess, search: Harnessing LLMs for goal-oriented planning in instructional videos. In European Conference on Computer Vision, pages 436–452. Springer, 2024. 3, 6, 7, 9

  37. [37]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950,

  38. [38]

    Infant visual perception

    Philip J Kellman, ME Arterberry, W Damon, RM Lerner, D Kuhn, RS Siegler, et al. Infant visual perception. 2006. 1

  39. [39]

    Droid: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In RSS 2024 Workshop: Data Generation for Robotics. 3

  40. [40]

    Hmdb: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011. 2

  41. [41]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62,

  42. [42]

    A language-first approach for procedure planning

    Jiateng Liu, Sha Li, Zhenhailong Wang, Manling Li, and Heng Ji. A language-first approach for procedure planning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1941–1954, 2023. 3, 6, 7, 9

  43. [43]

    InfinityStar: Unified spacetime autoregressive modeling for visual generation

    Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, and Zehuan Yuan. InfinityStar: Unified spacetime autoregressive modeling for visual generation. arXiv preprint arXiv:2511.04675, 2025. 3

  44. [44]

    Nav-r1: Reasoning and navigation in embodied scenes

    Qingxiang Liu, Ting Huang, Zeyu Zhang, and Hao Tang. Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884, 2025. 1

  45. [45]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 7

  46. [46]

    Structured world models from human videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. 2023. 2, 3, 4

  47. [47]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019. 2

  48. [48]

    The debate over understanding in AI’s large language models

    Melanie Mitchell and David C Krakauer. The debate over understanding in AI’s large language models. Proceedings of the National Academy of Sciences, 120(13):e2215907120,

  49. [49]

    Triangle inequality for inverse optimal control

    Sho Mitsuhashi and Shin Ishii. Triangle inequality for inverse optimal control. IEEE Access, 11:119187–119199,

  50. [50]

    Why not use your textbook? Knowledge-enhanced procedure planning of instructional videos

    Kumaranage Ravindu Yasas Nagasinghe, Honglu Zhou, Malitha Gunawardhana, Martin Renqiang Min, Daniel Harari, and Muhammad Haris Khan. Why not use your textbook? Knowledge-enhanced procedure planning of instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18816–18826, 2024. 3, 6, 7, 8, 9

  51. [51]

    Poincaré embeddings for learning hierarchical representations

    Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems, 30, 2017. 2, 5

  52. [52]

    SCHEMA: State changes matter for procedure planning in instructional videos

    Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, and Shih-Fu Chang. SCHEMA: State changes matter for procedure planning in instructional videos. arXiv preprint arXiv:2403.01599, 2024. 3, 6, 7, 8, 9

  53. [53]

    GPT-5 system card, version 1.0, 2025-08-13

    OpenAI. GPT-5 system card, version 1.0, 2025-08-13. 2025. https://cdn.openai.com/gpt-5-system-card.pdf. 1, 3, 6, 8, 9

  54. [54]

    Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. 2

  55. [55]

    Compositional entailment learning for hyperbolic vision-language models. arXiv preprint arXiv:2410.06912,

    Avik Pal, Max van Spengler, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. arXiv preprint arXiv:2410.06912,

  56. [56]

    Pretrained language models as visual planners for human assistance

    Dhruvesh Patel, Hamid Eghbalzadeh, Nitin Kamra, Michael Louis Iuzzolino, Unnat Jain, and Ruta Desai. Pretrained language models as visual planners for human assistance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15302–15314, 2023. 3

  57. [57]

    An inductive bias for distances: Neural nets that respect the triangle inequality

    Silviu Pitis, Harris Chan, Kiarash Jamali, and Jimmy Ba. An inductive bias for distances: Neural nets that respect the triangle inequality. In International Conference on Learning Representations, 2020. 8

  58. [58]

    Autoregressive video generation beyond next frames prediction. arXiv preprint arXiv:2509.24081, 2025

    Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu, Alan Yuille, Yinfei Yang, and Jiasen Lu. Autoregressive video generation beyond next frames prediction. arXiv preprint arXiv:2509.24081, 2025. 3

  59. [59]

    Videoworld: Exploring knowledge learning from unlabeled videos

    Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, and Xiaojie Jin. Videoworld: Exploring knowledge learning from unlabeled videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29029–29039, 2025. 2, 3, 6, 8, 4, 9

  60. [60]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 2

  61. [61]

    Actiondiffusion: An action-aware diffusion model for procedure planning in instructional videos

    Lei Shi, Paul Bürkner, and Andreas Bulling. Actiondiffusion: An action-aware diffusion model for procedure planning in instructional videos. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 8816–8825. IEEE, 2025. 3, 6, 7, 9

  62. [62]

    Brain mechanisms of serial and parallel processing during dual-task performance. Journal of Neuroscience, 28(30):7585–7598, 2008

    Mariano Sigman and Stanislas Dehaene. Brain mechanisms of serial and parallel processing during dual-task performance. Journal of Neuroscience, 28(30):7585–7598, 2008. 1

  63. [63]

    Hand-object interaction pretraining from videos

    Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, and Jitendra Malik. Hand-object interaction pretraining from videos. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3352–3360. IEEE, 2025. 2, 3, 4

  64. [64]

    Graph-level representation learning with joint-embedding predictive architectures. Transactions on Machine Learning Research, 2025

    Geri Skenderi, Hang Li, Jiliang Tang, and Marco Cristani. Graph-level representation learning with joint-embedding predictive architectures. Transactions on Machine Learning Research, 2025. 9

  65. [65]

    Maniplvm-r1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models. arXiv preprint arXiv:2505.16517, 2025

    Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, et al. Maniplvm-r1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models. arXiv preprint arXiv:2505.16517, 2025. 1

  66. [66]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 2

  67. [67]

    On adaptive inverse dynamics control of rigid robots. IEEE Transactions on Automatic Control, 35(1):92–95, 2002

    Mark W Spong and Romeo Ortega. On adaptive inverse dynamics control of rigid robots. IEEE Transactions on Automatic Control, 35(1):92–95, 2002. 2, 3, 4

  68. [68]

    Universal planning networks: Learning generalizable representations for visuomotor control

    Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In International conference on machine learning, pages 4732–. PMLR, 2018. 6, 7, 10

  70. [70]

    Plate: Visually-grounded planning with transformers in procedural tasks. IEEE Robotics and Automation Letters, 7(2):4924–4930, 2022

    Jiankai Sun, De-An Huang, Bo Lu, Yun-Hui Liu, Bolei Zhou, and Animesh Garg. Plate: Visually-grounded planning with transformers in procedural tasks. IEEE Robotics and Automation Letters, 7(2):4924–4930, 2022. 3, 6, 7, 10

  71. [71]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press Cambridge, 1998. 8

  72. [72]

    Coin: A large-scale dataset for comprehensive instructional video analysis

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216, 2019. 1, 2, 6, 7, 8, 10, 13

  73. [73]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025. 3

  74. [74]

    Predictive inverse dynamics models are scalable learners for robotic manipulation

    Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. In The Thirteenth International Conference on Learning Representations, 2025. 4

  75. [75]

    Event-guided procedure planning from instructional videos with text supervision

    An-Lan Wang, Kun-Yu Lin, Jia-Run Du, Jingke Meng, and Wei-Shi Zheng. Event-guided procedure planning from instructional videos with text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13565–13575, 2023. 3, 6, 7, 8, 10

  76. [76]

    Pdpp: Projected diffusion for procedure planning in instructional videos

    Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. Pdpp: Projected diffusion for procedure planning in instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14836–14845, 2023. 3, 6, 7, 8, 9

  77. [77]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 3, 6, 8, 9

  78. [78]

    Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024

    Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024. 3

  79. [79]

    Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 1

  80. [80]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 1, 3, 6, 8, 9

Showing first 80 references.