pith · machine review for the scientific record

arxiv: 2503.15558 · v3 · submitted 2025-03-18 · 💻 cs.AI · cs.CV · cs.LG · cs.RO

Recognition: 2 theorem links

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:43 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG · cs.RO
keywords physical AI · embodied reasoning · chain of thought · multimodal LLM · physical common sense · reinforcement learning · AI for robotics · ontology-based training

The pith

Cosmos-Reason1 models understand the physical world and generate embodied decisions through long chain-of-thought reasoning in natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cosmos-Reason1 models that aim to equip AI systems with physical common sense and embodied reasoning abilities. It defines physical common sense through a hierarchical ontology covering space, time, and physics, and embodied reasoning through a two-dimensional ontology that generalizes across different physical embodiments. Two multimodal models, at 7B and 56B parameters, are trained in two stages: supervised fine-tuning on physical AI data, followed by reinforcement learning. Custom benchmarks built from the same ontologies measure performance, and results indicate gains from both training stages. The models output next-step physical actions expressed in natural language after extended reasoning.
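
The two-stage recipe is the procedural core of this summary. Below is a minimal sketch of what Physical AI SFT followed by Physical AI RL could look like, assuming a generic text-rendered interface to the model; `apply_gradient_step`, `sample_completion`, and `rule_based_reward` are hypothetical stand-ins, not the authors' training code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SFTExample:
    prompt: str   # physical-AI context (video content rendered as text captions in this sketch)
    target: str   # curated long chain-of-thought ending in a next-step action

def physical_ai_sft(apply_gradient_step: Callable[[str, str], None],
                    data: List[SFTExample]) -> None:
    """Stage 1 (illustrative): imitate curated reasoning traces."""
    for ex in data:
        apply_gradient_step(ex.prompt, ex.target)

def physical_ai_rl(sample_completion: Callable[[str], str],
                   apply_gradient_step: Callable[[str, str], None],
                   prompts: List[str],
                   rule_based_reward: Callable[[str, str], float]) -> None:
    """Stage 2 (illustrative): reinforce sampled reasoning that earns a verifiable reward,
    e.g. 1.0 when the final multiple-choice answer matches the key."""
    for prompt in prompts:
        completion = sample_completion(prompt)
        if rule_based_reward(prompt, completion) > 0.5:
            apply_gradient_step(prompt, completion)  # crude positive-only update, for illustration
```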

Core claim

Cosmos-Reason1 models demonstrate that multimodal large language models, when trained on data structured by ontologies of physical common sense and embodiment, can perform long chain-of-thought reasoning to interpret physical situations and produce appropriate next actions in natural language.

What carries the argument

The hierarchical ontology for physical common sense combined with the two-dimensional ontology for embodied reasoning, which structures data curation, training, and benchmark creation.
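
Since the ontologies drive curation, training, and benchmark construction, a compact way to picture them is as a fixed taxonomy that every curated example and benchmark question is tagged against. A minimal sketch with illustrative leaf categories (the paper's exact subcategories under space, time, and physics differ); the bucketing helper is hypothetical.

```python
# Illustrative hierarchy only; the paper's actual leaves under each branch differ.
PHYSICAL_COMMON_SENSE_ONTOLOGY = {
    "space": ["spatial relationship", "environment", "plausibility"],
    "time": ["order of actions", "causality", "planning"],
    "physics": ["object permanence", "mechanics", "attributes"],
}

def curation_buckets(ontology):
    """Flatten the hierarchy into (branch, leaf) buckets used both to balance
    curated training data and to generate one benchmark section per leaf."""
    return [(branch, leaf)
            for branch, leaves in ontology.items()
            for leaf in leaves]

for branch, leaf in curation_buckets(PHYSICAL_COMMON_SENSE_ONTOLOGY):
    print(f"curate examples and benchmark questions for: {branch} / {leaf}")
```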

If this is right

  • Supervised fine-tuning followed by reinforcement learning on physical data produces clear gains on benchmarks for common sense and embodied reasoning.
  • The two-dimensional ontology enables the models to output decisions that apply across varied physical embodiments.
  • Natural language chain-of-thought outputs make the reasoning steps inspectable and usable for planning sequences of actions.
  • Open release of the models and code supports wider experimentation in physical AI tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robots using these models could handle unexpected physical changes by generating reasoned plans instead of fixed scripts.
  • Pairing the models with live camera input could support ongoing adaptation in changing environments.
  • The same ontology structure might extend to domains such as chemical reactions or biological systems for broader reasoning applications.
  • Hardware tests on actual robots would show how well the language-based decisions convert into physical control signals.

Load-bearing premise

The chosen ontologies for physical common sense and embodied reasoning contain enough knowledge to let the models generalize to new physical tasks and different robot bodies.

What would settle it

A set of test cases involving novel physical interactions, such as object manipulations under unrepresented constraints like variable gravity or friction, where the models produce incorrect action predictions despite long reasoning chains.
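
One concrete way to run such a falsification test is a small probe suite of counterfactual-physics scenarios scored against physically acceptable next actions. The scenarios, answer keys, and `query_model` stub below are illustrative assumptions, not items from the paper's benchmarks.

```python
# Hypothetical counterfactual-physics probes; scenarios and keys are invented for illustration.
PROBES = [
    {
        "scenario": ("A robot must place a cup on a table inside a spacecraft in free fall "
                     "(effectively zero gravity). What is the correct next action?"),
        "acceptable": ["secure the cup with a restraint", "clamp the cup to the table"],
    },
    {
        "scenario": ("A crate sits on wet ice with near-zero friction. The robot must slide it "
                     "one meter and stop it at a mark. What is the correct next action?"),
        "acceptable": ["push gently and brake early", "apply small impulses and let it coast"],
    },
]

def query_model(prompt: str) -> str:
    """Stand-in for inference against a released checkpoint; replace with a real call."""
    return "push gently and brake early so the crate does not overshoot"

def counterfactual_pass_rate(probes) -> float:
    hits = 0
    for probe in probes:
        answer = query_model(probe["scenario"]).lower()
        hits += any(key in answer for key in probe["acceptable"])
    return hits / len(probes)

print(f"counterfactual-physics pass rate: {counterfactual_pass_rate(PROBES):.0%}")
```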

read the original abstract

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Cosmos-Reason1-7B and Cosmos-Reason1-56B multimodal LLMs designed to understand the physical world and generate embodied decisions via long chain-of-thought reasoning. It defines a hierarchical ontology capturing space, time, and physics for physical common sense, and a two-dimensional ontology for embodied reasoning that generalizes across embodiments. The models are trained in two stages—Physical AI supervised fine-tuning (SFT) followed by reinforcement learning (RL)—on curated data, with new benchmarks constructed directly from these ontologies. The central claim is that SFT and RL yield significant improvements on these benchmarks, with code and models released under the NVIDIA Open Model License.

Significance. If the performance gains prove robust and the ontologies enable genuine cross-task and cross-embodiment generalization, the work could provide a useful structured framework for Physical AI reasoning. The open release of models and code supports reproducibility. However, the current manuscript supplies no numerical results, baselines, or external validation, so the assessed significance remains provisional pending stronger empirical grounding.

major comments (2)
  1. [Abstract] The claim that 'Physical AI SFT and RL bring significant improvements' is unsupported: the abstract (and the provided text) contains no quantitative metrics, error bars, ablation tables, or baseline comparisons, leaving the central empirical claim without evidence.
  2. [Evaluation] The new benchmarks are built directly from the hierarchical ontology (space/time/physics) and the two-dimensional embodied ontology introduced earlier in the paper. This construction creates a risk of circular validation, where measured gains may reflect improved adherence to the authors' taxonomy rather than transferable physical reasoning. No results on established external benchmarks (e.g., Physion, CLEVRER, or standard robotics suites) are reported.
minor comments (2)
  1. [Ontologies] The paper would benefit from an explicit statement of the exact dimensions of the two-dimensional embodied ontology and how they are operationalized in the data curation rules.
  2. [Methods] Notation for the model scales (7B and 56B) and training stages should be used consistently in tables and figures once introduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comments below and will revise the manuscript to better support the empirical claims while clarifying the benchmark design rationale.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'Physical AI SFT and RL bring significant improvements' is unsupported: the abstract (and the provided text) contains no quantitative metrics, error bars, ablation tables, or baseline comparisons, leaving the central empirical claim without evidence.

    Authors: We agree that the abstract should include concrete quantitative support for the central claim. The full manuscript's evaluation section reports detailed results, including performance deltas from the SFT and RL stages, baseline comparisons, and ablation studies on the physical common sense and embodied reasoning benchmarks. We will revise the abstract to incorporate key metrics (e.g., accuracy improvements and error bars) with pointers to the relevant tables and figures. revision: yes

  2. Referee: [Evaluation] The new benchmarks are built directly from the hierarchical ontology (space/time/physics) and the two-dimensional embodied ontology introduced earlier in the paper. This construction creates a risk of circular validation, where measured gains may reflect improved adherence to the authors' taxonomy rather than transferable physical reasoning. No results on established external benchmarks (e.g., Physion, CLEVRER, or standard robotics suites) are reported.

    Authors: The ontologies serve as a systematic framework to define and ensure coverage of core capabilities rather than a narrow taxonomy to be memorized; the benchmarks test long chain-of-thought reasoning and next-step action generation on held-out scenarios that require generalization across space, time, physics, and embodiments. Ablation studies in the paper separate the contribution of the training stages from simple ontology matching. We acknowledge the value of external validation and will add results on Physion and CLEVRER (plus a brief discussion of robotics suites) in the revised manuscript to demonstrate transfer, while retaining the new benchmarks as the primary evaluation for the defined Physical AI capabilities. revision: partial
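
One way to operationalize the rebuttal's held-out-scenario defense, and to blunt the circularity worry directly, is to hold out entire ontology leaves at curation time and report gains only on those unseen leaves. A minimal sketch under that assumption; the leaf names and split logic are illustrative, not the paper's protocol.

```python
import random

# Illustrative leaf categories; the paper's taxonomy differs in detail.
LEAVES = ["spatial relationship", "order of actions", "causality",
          "object permanence", "mechanics", "gripper affordance"]

def leaf_holdout_split(leaves, holdout_frac=0.3, seed=0):
    """Hold out whole ontology leaves: if SFT/RL data never covers them, gains measured
    on those leaves cannot be explained by memorizing the taxonomy alone."""
    rng = random.Random(seed)
    shuffled = list(leaves)
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * (1 - holdout_frac)))
    return shuffled[:cut], shuffled[cut:]

train_leaves, transfer_leaves = leaf_holdout_split(LEAVES)
print("curate SFT/RL data only from:", train_leaves)
print("report transfer results on:  ", transfer_leaves)
```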

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

full rationale

The paper presents an empirical pipeline: ontologies are introduced as definitional inputs for physical common sense (hierarchical space/time/physics) and embodied reasoning (two-dimensional), data is curated accordingly, models are trained via SFT then RL, and benchmarks are constructed from the same ontologies for evaluation. No mathematical derivations, equations, or predictions are claimed that reduce by construction to fitted parameters or prior outputs. The central results are reported empirical improvements on the purpose-built benchmarks, which does not constitute a self-referential reduction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. This is a standard task-definition and benchmarking approach for new capabilities rather than circular reasoning.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claims rest on the assumption that the defined ontologies are adequate representations of physical knowledge, plus standard supervised and reinforcement learning assumptions about data quality and reward signals.

free parameters (2)
  • 7B and 56B model scales
    Chosen sizes for the two released models; no justification given in abstract for these exact parameter counts.
  • Data curation rules for SFT and RL stages
    Specific filtering and labeling choices that determine what counts as physical common sense examples.
axioms (2)
  • domain assumption Hierarchical ontology captures fundamental knowledge about space, time, and physics
    Invoked to structure data and benchmarks; no independent validation provided in abstract.
  • domain assumption Two-dimensional ontology generalizes across physical embodiments
    Used to define embodied reasoning capability; treated as given.
invented entities (2)
  • hierarchical ontology for physical common sense · no independent evidence
    purpose: To represent and curate knowledge about space, time, and physics for training and evaluation
    New structure introduced in the paper; no external falsifiable test mentioned in abstract.
  • two-dimensional ontology for embodied reasoning · no independent evidence
    purpose: To generalize action reasoning across different robot bodies
    New structure introduced in the paper; no external falsifiable test mentioned in abstract.

pith-pipeline@v0.9.0 · 5753 in / 1447 out tokens · 38638 ms · 2026-05-16T12:43:31.071002+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 7.0

    SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.

  2. SCP: Spatial Causal Prediction in Video

    cs.CV 2026-03 unverdicted novelty 7.0

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  3. Beyond Thinking: Imagining in 360° for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  4. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  5. Seeing Fast and Slow: Learning the Flow of Time in Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

  6. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  7. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  8. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  9. Re²MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

    cs.CV 2026-04 unverdicted novelty 6.0

    Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.

  10. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  11. PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

    cs.RO 2026-04 unverdicted novelty 5.0

    PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.

  12. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  13. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

  14. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  15. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

  16. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

  17. Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

    cs.RO 2026-04 unverdicted novelty 4.0

    A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.

  18. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 17 Pith papers · 19 internal anchors

  1. [1]

    Agibot world colosseum. https://github.com/OpenDriveLab/AgiBot-World, 2024

    AgiBot. Agibot world colosseum. https://github.com/OpenDriveLab/AgiBot-World, 2024. 13

  2. [2]

    Do as i can, not as i say: Grounding language in robotic affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. In CoRL, 2022. 22

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. 6, 24

  4. [4]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    H Arai, K Miwa, K Sasaki, Y Yamaguchi, et al. Covla: Comprehensive vision-language-action dataset for autonomous driving. In ICRA, 2024. 24

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 6, 8, 24

  6. [6]

    RT-H: action hierarchies using language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024. 17

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023. 22

  8. [8]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 24

  9. [9]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 6, 24

  10. [10]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 6, 24

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 24

  12. [12]

    NVLM: Open frontier-class multimodal LLMs

    Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NVLM: Open frontier-class multimodal LLMs. arXiv preprint arXiv:2409.11402, 2024. 6, 24

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1, 2, 6, 10, 12, 15, 24

  14. [14]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, et al. Palm-e: An embodied multimodal language model. In ICML, 2023. 22

  15. [15]

    Semantic anomaly detection with large language models. Autonomous Robots, 2023

    Amine Elhafsi, Rohan Sinha, Christopher Agia, Edward Schmerling, Issa AD Nesnas, and Marco Pavone. Semantic anomaly detection with large language models. Autonomous Robots, 2023. 24

  16. [16]

    Gemini robotics: Bringing ai into the physical world, 2025

    Google DeepMind Gemini Robotics Team. Gemini robotics: Bringing ai into the physical world, 2025. URL https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/. 22

  17. [17]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6, 24

  18. [18]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In COLM, 2024.

  19. [19]

    V1: Toward multimodal reasoning by designing auxiliary task,

    Haonan Wang, Chao Du, and Tianyu Pang. V1: Toward multimodal reasoning by designing auxiliary task,

  20. [20]

    URL https://faint-basin-c34.notion.site/1b2db15ae55b800da077e70aafe40212. 24

  21. [21]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143, 2024. 8

  22. [22]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In ICML, 2022. 22

  23. [23]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025. 24

  24. [24]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  25. [25]

    Thinking, Fast and Slow. Macmillan, 2011

    Daniel Kahneman. Thinking, Fast and Slow. Macmillan, 2011. 3

  26. [26]

    OpenVLA: An Open-Source Vision-Language-Action Model

    MJ Kim, K Pertsch, S Karamcheti, T Xiao, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 22

  27. [27]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022. 18

  28. [28]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 2022.

  29. [29]

    Pre-trained language models for interactive decision-making

    Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. In NeurIPS, 2022. 22

  30. [30]

    Hamster: Hierarchical action models for open-world robot manipulation

    Y Li, Y Deng, J Zhang, J Jang, M Memme, and R Yu. Hamster: Hierarchical action models for open-world robot manipulation. In ICLR, 2025. 22

  31. [31]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In ICRA, 2022. 22

  32. [32]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023. 15

  33. [33]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 6, 24

  34. [34]

    Reflect: Summarizing robot experiences for failure explanation and correction

    Z Liu, A Bahety, and S Song. Reflect: Summarizing robot experiences for failure explanation and correction. In CoRL, 2023. 18, 24

  35. [35]

    Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint arXiv:2412.15084, 2024

    Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint arXiv:2412.15084, 2024. 24

  36. [36]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025. 24

  37. [37]

    Position: Levels of agi for operationalizing progress on the path to agi

    Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Position: Levels of agi for operationalizing progress on the path to agi. In ICML, 2024. 3

  38. [38]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. 14

  39. [39]

    Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models. arXiv preprint arXiv:2504.03624, 2025

    NVIDIA. Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models. arXiv preprint arXiv:2504.03624, 2025. 6, 8

  40. [40]

    Nemotron-4 340b technical report

    Nvidia, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704, 2024. 6

  41. [41]

    Learning to reason with llms, 2024

    OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/learning-to-reason-with-llms/. 1, 24

  42. [42]

    Qvq: To see the world with wisdom, 2024

    Qwen-Team. Qvq: To see the world with wisdom, 2024. URL https://qwenlm.github.io/blog/qvq-72b-preview/. 24

  43. [43]

    Intphys 2019: A benchmark for visual intuitive physics understanding. TPAMI, 2021

    Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. Intphys 2019: A benchmark for visual intuitive physics understanding. TPAMI, 2021. 3

  44. [44]

    Robovqa: Multimodal long-horizon reasoning for robotics

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In ICRA, 2024. 13

  45. [45]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 8

  46. [46]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.

  47. [47]

    Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025. 22

  48. [48]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016. 6

  49. [49]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. 8

  50. [50]

    Llm-planner: Few-shot grounded planning for embodied agents with large language models

    Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In ICCV, 2023. 22

  51. [51]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 6

  52. [52]

    An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887, 2024

    Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887, 2024. 6, 7

  53. [53]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In CoRL, 2023. 12

  54. [54]

    Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In ICCV, 2023. 13

  55. [55]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 24

  56. [56]

    Magma: A foundation model for multimodal ai agents

    J Yang, R Tan, Q Wu, R Zheng, B Peng, and Y Liang. Magma: A foundation model for multimodal ai agents. In CVPR, 2025. 22

  57. [57]

    Robotic control via embodied chain-of-thought reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In CoRL, 2024. 13, 24

  58. [58]

    R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv e-prints, pages arXiv–2503, 2025

    Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv e-prints, pages arXiv–2503, 2025. 24

  59. [59]

    CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025. 24

  60. [60]

    R1-Zero's "aha moment" in visual reasoning on a 2B non-SFT model

    Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-zero's "aha moment" in visual reasoning on a 2b non-sft model. arXiv preprint arXiv:2503.05132, 2025. 24