pith · machine review for the scientific record

arxiv: 2503.15558 · v3 · submitted 2025-03-18 · 💻 cs.AI · cs.CV · cs.LG · cs.RO

Recognition: 2 theorem links

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:43 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG · cs.RO
keywords physical AI · embodied reasoning · chain of thought · multimodal LLM · physical common sense · reinforcement learning · AI for robotics · ontology-based training

The pith

Cosmos-Reason1 models understand the physical world and generate embodied decisions through long chain-of-thought reasoning in natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cosmos-Reason1 models that aim to equip AI systems with physical common sense and embodied reasoning abilities. It defines physical common sense through a hierarchical ontology covering space, time, and physics, and embodied reasoning through a two-dimensional ontology that generalizes across different physical embodiments. Two multimodal models, at 7B and 56B parameters, are trained in two stages: supervised fine-tuning on physical AI data, followed by reinforcement learning. Custom benchmarks built from the same ontologies measure performance, and results indicate gains from both training stages. The models output next-step physical actions expressed in natural language after extended reasoning.
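
The two-stage recipe is the procedural core of this summary. Below is a minimal sketch of what Physical AI SFT followed by Physical AI RL could look like, assuming a generic text-rendered interface to the model; `apply_gradient_step`, `sample_completion`, and `rule_based_reward` are hypothetical stand-ins, not the authors' training code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SFTExample:
    prompt: str   # physical-AI context (video content rendered as text captions in this sketch)
    target: str   # curated long chain-of-thought ending in a next-step action

def physical_ai_sft(apply_gradient_step: Callable[[str, str], None],
                    data: List[SFTExample]) -> None:
    """Stage 1 (illustrative): imitate curated reasoning traces."""
    for ex in data:
        apply_gradient_step(ex.prompt, ex.target)

def physical_ai_rl(sample_completion: Callable[[str], str],
                   apply_gradient_step: Callable[[str, str], None],
                   prompts: List[str],
                   rule_based_reward: Callable[[str, str], float]) -> None:
    """Stage 2 (illustrative): reinforce sampled reasoning that earns a verifiable reward,
    e.g. 1.0 when the final multiple-choice answer matches the key."""
    for prompt in prompts:
        completion = sample_completion(prompt)
        if rule_based_reward(prompt, completion) > 0.5:
            apply_gradient_step(prompt, completion)  # crude positive-only update, for illustration
```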

Core claim

Cosmos-Reason1 models demonstrate that multimodal large language models, when trained on data structured by ontologies of physical common sense and embodiment, can perform long chain-of-thought reasoning to interpret physical situations and produce appropriate next actions in natural language.

What carries the argument

The hierarchical ontology for physical common sense combined with the two-dimensional ontology for embodied reasoning, which structures data curation, training, and benchmark creation.
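
Since the ontologies drive curation, training, and benchmark construction, a compact way to picture them is as a fixed taxonomy that every curated example and benchmark question is tagged against. A minimal sketch with illustrative leaf categories (the paper's exact subcategories under space, time, and physics differ); the bucketing helper is hypothetical.

```python
# Illustrative hierarchy only; the paper's actual leaves under each branch differ.
PHYSICAL_COMMON_SENSE_ONTOLOGY = {
    "space": ["spatial relationship", "environment", "plausibility"],
    "time": ["order of actions", "causality", "planning"],
    "physics": ["object permanence", "mechanics", "attributes"],
}

def curation_buckets(ontology):
    """Flatten the hierarchy into (branch, leaf) buckets used both to balance
    curated training data and to generate one benchmark section per leaf."""
    return [(branch, leaf)
            for branch, leaves in ontology.items()
            for leaf in leaves]

for branch, leaf in curation_buckets(PHYSICAL_COMMON_SENSE_ONTOLOGY):
    print(f"curate examples and benchmark questions for: {branch} / {leaf}")
```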

If this is right

  • Supervised fine-tuning followed by reinforcement learning on physical data produces clear gains on benchmarks for common sense and embodied reasoning.
  • The two-dimensional ontology enables the models to output decisions that apply across varied physical embodiments.
  • Natural language chain-of-thought outputs make the reasoning steps inspectable and usable for planning sequences of actions.
  • Open release of the models and code supports wider experimentation in physical AI tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robots using these models could handle unexpected physical changes by generating reasoned plans instead of fixed scripts.
  • Pairing the models with live camera input could support ongoing adaptation in changing environments.
  • The same ontology structure might extend to domains such as chemical reactions or biological systems for broader reasoning applications.
  • Hardware tests on actual robots would show how well the language-based decisions convert into physical control signals.

Load-bearing premise

The chosen ontologies for physical common sense and embodied reasoning contain enough knowledge to let the models generalize to new physical tasks and different robot bodies.

What would settle it

A set of test cases involving novel physical interactions, such as object manipulations under unrepresented constraints like variable gravity or friction, where the models produce incorrect action predictions despite long reasoning chains.
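
One concrete way to run such a falsification test is a small probe suite of counterfactual-physics scenarios scored against physically acceptable next actions. The scenarios, answer keys, and `query_model` stub below are illustrative assumptions, not items from the paper's benchmarks.

```python
# Hypothetical counterfactual-physics probes; scenarios and keys are invented for illustration.
PROBES = [
    {
        "scenario": ("A robot must place a cup on a table inside a spacecraft in free fall "
                     "(effectively zero gravity). What is the correct next action?"),
        "acceptable": ["secure the cup with a restraint", "clamp the cup to the table"],
    },
    {
        "scenario": ("A crate sits on wet ice with near-zero friction. The robot must slide it "
                     "one meter and stop it at a mark. What is the correct next action?"),
        "acceptable": ["push gently and brake early", "apply small impulses and let it coast"],
    },
]

def query_model(prompt: str) -> str:
    """Stand-in for inference against a released checkpoint; replace with a real call."""
    return "push gently and brake early so the crate does not overshoot"

def counterfactual_pass_rate(probes) -> float:
    hits = 0
    for probe in probes:
        answer = query_model(probe["scenario"]).lower()
        hits += any(key in answer for key in probe["acceptable"])
    return hits / len(probes)

print(f"counterfactual-physics pass rate: {counterfactual_pass_rate(PROBES):.0%}")
```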

read the original abstract

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Cosmos-Reason1-7B and Cosmos-Reason1-56B multimodal LLMs designed to understand the physical world and generate embodied decisions via long chain-of-thought reasoning. It defines a hierarchical ontology capturing space, time, and physics for physical common sense, and a two-dimensional ontology for embodied reasoning that generalizes across embodiments. The models are trained in two stages—Physical AI supervised fine-tuning (SFT) followed by reinforcement learning (RL)—on curated data, with new benchmarks constructed directly from these ontologies. The central claim is that SFT and RL yield significant improvements on these benchmarks, with code and models released under the NVIDIA Open Model License.

Significance. If the performance gains prove robust and the ontologies enable genuine cross-task and cross-embodiment generalization, the work could provide a useful structured framework for Physical AI reasoning. The open release of models and code supports reproducibility. However, the current manuscript supplies no numerical results, baselines, or external validation, so the assessed significance remains provisional pending stronger empirical grounding.

major comments (2)
  1. [Abstract] The claim that 'Physical AI SFT and RL bring significant improvements' is unsupported: the abstract (and the provided text) contains no quantitative metrics, error bars, ablation tables, or baseline comparisons, leaving the central empirical claim without evidence.
  2. [Evaluation] The new benchmarks are built directly from the hierarchical ontology (space/time/physics) and the two-dimensional embodied ontology introduced earlier in the paper. This construction creates a risk of circular validation, where measured gains may reflect improved adherence to the authors' taxonomy rather than transferable physical reasoning. No results on established external benchmarks (e.g., Physion, CLEVRER, or standard robotics suites) are reported.
minor comments (2)
  1. [Ontologies] The paper would benefit from an explicit statement of the exact dimensions of the two-dimensional embodied ontology and how they are operationalized in the data curation rules.
  2. [Methods] Notation for the model scales (7B and 56B) and training stages should be used consistently in tables and figures once introduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comments below and will revise the manuscript to better support the empirical claims while clarifying the benchmark design rationale.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'Physical AI SFT and RL bring significant improvements' is unsupported: the abstract (and the provided text) contains no quantitative metrics, error bars, ablation tables, or baseline comparisons, leaving the central empirical claim without evidence.

    Authors: We agree that the abstract should include concrete quantitative support for the central claim. The full manuscript's evaluation section reports detailed results, including performance deltas from the SFT and RL stages, baseline comparisons, and ablation studies on the physical common sense and embodied reasoning benchmarks. We will revise the abstract to incorporate key metrics (e.g., accuracy improvements and error bars) with pointers to the relevant tables and figures. revision: yes

  2. Referee: [Evaluation] The new benchmarks are built directly from the hierarchical ontology (space/time/physics) and the two-dimensional embodied ontology introduced earlier in the paper. This construction creates a risk of circular validation, where measured gains may reflect improved adherence to the authors' taxonomy rather than transferable physical reasoning. No results on established external benchmarks (e.g., Physion, CLEVRER, or standard robotics suites) are reported.

    Authors: The ontologies serve as a systematic framework to define and ensure coverage of core capabilities rather than a narrow taxonomy to be memorized; the benchmarks test long chain-of-thought reasoning and next-step action generation on held-out scenarios that require generalization across space, time, physics, and embodiments. Ablation studies in the paper separate the contribution of the training stages from simple ontology matching. We acknowledge the value of external validation and will add results on Physion and CLEVRER (plus a brief discussion of robotics suites) in the revised manuscript to demonstrate transfer, while retaining the new benchmarks as the primary evaluation for the defined Physical AI capabilities. revision: partial
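
One way to operationalize the rebuttal's held-out-scenario defense, and to blunt the circularity worry directly, is to hold out entire ontology leaves at curation time and report gains only on those unseen leaves. A minimal sketch under that assumption; the leaf names and split logic are illustrative, not the paper's protocol.

```python
import random

# Illustrative leaf categories; the paper's taxonomy differs in detail.
LEAVES = ["spatial relationship", "order of actions", "causality",
          "object permanence", "mechanics", "gripper affordance"]

def leaf_holdout_split(leaves, holdout_frac=0.3, seed=0):
    """Hold out whole ontology leaves: if SFT/RL data never covers them, gains measured
    on those leaves cannot be explained by memorizing the taxonomy alone."""
    rng = random.Random(seed)
    shuffled = list(leaves)
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * (1 - holdout_frac)))
    return shuffled[:cut], shuffled[cut:]

train_leaves, transfer_leaves = leaf_holdout_split(LEAVES)
print("curate SFT/RL data only from:", train_leaves)
print("report transfer results on:  ", transfer_leaves)
```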

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

full rationale

The paper presents an empirical pipeline: ontologies are introduced as definitional inputs for physical common sense (hierarchical space/time/physics) and embodied reasoning (two-dimensional), data is curated accordingly, models are trained via SFT then RL, and benchmarks are constructed from the same ontologies for evaluation. No mathematical derivations, equations, or predictions are claimed that reduce by construction to fitted parameters or prior outputs. The central results are reported empirical improvements on the purpose-built benchmarks, which does not constitute a self-referential reduction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. This is a standard task-definition and benchmarking approach for new capabilities rather than circular reasoning.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claims rest on the assumption that the defined ontologies are adequate representations of physical knowledge, plus standard supervised and reinforcement learning assumptions about data quality and reward signals.

free parameters (2)
  • 7B and 56B model scales
    Chosen sizes for the two released models; no justification given in abstract for these exact parameter counts.
  • Data curation rules for SFT and RL stages
    Specific filtering and labeling choices that determine what counts as physical common sense examples.
axioms (2)
  • domain assumption Hierarchical ontology captures fundamental knowledge about space, time, and physics
    Invoked to structure data and benchmarks; no independent validation provided in abstract.
  • domain assumption Two-dimensional ontology generalizes across physical embodiments
    Used to define embodied reasoning capability; treated as given.
invented entities (2)
  • hierarchical ontology for physical common sense · no independent evidence
    purpose: To represent and curate knowledge about space, time, and physics for training and evaluation
    New structure introduced in the paper; no external falsifiable test mentioned in abstract.
  • two-dimensional ontology for embodied reasoning · no independent evidence
    purpose: To generalize action reasoning across different robot bodies
    New structure introduced in the paper; no external falsifiable test mentioned in abstract.

pith-pipeline@v0.9.0 · 5753 in / 1447 out tokens · 38638 ms · 2026-05-16T12:43:31.071002+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 7.0

    SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.

  2. SCP: Spatial Causal Prediction in Video

    cs.CV 2026-03 unverdicted novelty 7.0

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  3. Beyond Thinking: Imagining in 360° for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  4. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  5. Seeing Fast and Slow: Learning the Flow of Time in Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

  6. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  7. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  8. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  9. Re²MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

    cs.CV 2026-04 unverdicted novelty 6.0

    Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.

  10. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  11. PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

    cs.RO 2026-04 unverdicted novelty 5.0

    PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.

  12. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  13. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

  14. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  15. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

  16. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

  17. Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

    cs.RO 2026-04 unverdicted novelty 4.0

    A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.

  18. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 17 Pith papers · 19 internal anchors

  1. [1]

    Agibot world colosseum. https://github.com/OpenDriveLab/AgiBot-World, 2024

    AgiBot. Agibot world colosseum. https://github.com/OpenDriveLab/AgiBot-World, 2024. 13

  2. [2]

    Do as i can, not as i say: Grounding language in robotic affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. In CoRL, 2022. 22

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. 6, 24

  4. [4]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    H Arai, K Miwa, K Sasaki, Y Yamaguchi, et al. Covla: Comprehensive vision-language-action dataset for autonomous driving. In ICRA, 2024. 24

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 6, 8, 24

  6. [6]

    RT-H: action hierarchies using language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024. 17

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023. 22

  8. [8]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 24

  9. [9]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 6, 24

  10. [10]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 6, 24

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 24

  12. [12]

    NVLM: Open frontier-class multimodal LLMs

    Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NVLM: Open frontier-class multimodal LLMs. arXiv preprint arXiv:2409.11402, 2024. 6, 24

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1, 2, 6, 10, 12, 15, 24

  14. [14]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, et al. Palm-e: An embodied multimodal language model. In ICML, 2023. 22

  15. [15]

    Semantic anomaly detection with large language models. Autonomous Robots, 2023

    Amine Elhafsi, Rohan Sinha, Christopher Agia, Edward Schmerling, Issa AD Nesnas, and Marco Pavone. Semantic anomaly detection with large language models. Autonomous Robots, 2023. 24

  16. [16]

    Gemini robotics: Bringing ai into the physical world, 2025

    Google DeepMind Gemini Robotics Team. Gemini robotics: Bringing ai into the physical world, 2025. URL https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/. 22

  17. [17]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6, 24

  18. [18]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In COLM, 2024.

  19. [19]

    V1: Toward multimodal reasoning by designing auxiliary task,

    Haonan Wang, Chao Du, and Tianyu Pang. V1: Toward multimodal reasoning by designing auxiliary task,

  20. [20]

    URL https://faint-basin-c34.notion.site/1b2db15ae55b800da077e70aafe40212. 24

  21. [21]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143, 2024. 8

  22. [22]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In ICML, 2022. 22

  23. [23]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025. 24

  24. [24]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  25. [25]

    Thinking, Fast and Slow. Macmillan, 2011

    Daniel Kahneman. Thinking, Fast and Slow. Macmillan, 2011. 3

  26. [26]

    OpenVLA: An Open-Source Vision-Language-Action Model

    MJ Kim, K Pertsch, S Karamcheti, T Xiao, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 22

  27. [27]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022. 18

  28. [28]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 2022.

  29. [29]

    Pre-trained language models for interactive decision-making

    Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. In NeurIPS, 2022. 22

  30. [30]

    Hamster: Hierarchical action models for open-world robot manipulation

    Y Li, Y Deng, J Zhang, J Jang, M Memme, and R Yu. Hamster: Hierarchical action models for open-world robot manipulation. In ICLR, 2025. 22

  31. [31]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In ICRA, 2022. 22

  32. [32]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023. 15

  33. [33]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 6, 24

  34. [34]

    Reflect: Summarizing robot experiences for failure explanation and correction

    Z Liu, A Bahety, and S Song. Reflect: Summarizing robot experiences for failure explanation and correction. In CoRL, 2023. 18, 24

  35. [35]

    Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint arXiv:2412.15084, 2024

    Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint arXiv:2412.15084, 2024. 24

  36. [36]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025. 24

  37. [37]

    Position: Levels of agi for operationalizing progress on the path to agi

    Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Position: Levels of agi for operationalizing progress on the path to agi. In ICML, 2024. 3

  38. [38]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. 14

  39. [39]

    Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models. arXiv preprint arXiv:2504.03624, 2025

    NVIDIA. Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models. arXiv preprint arXiv:2504.03624, 2025. 6, 8

  40. [40]

    Nemotron-4 340b technical report

    Nvidia, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704, 2024. 6

  41. [41]

    Learning to reason with llms, 2024

    OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/learning-to-reason-with-llms/. 1, 24

  42. [42]

    Qvq: To see the world with wisdom, 2024

    Qwen-Team. Qvq: To see the world with wisdom, 2024. URL https://qwenlm.github.io/blog/qvq-72b-preview/. 24

  43. [43]

    Intphys 2019: A benchmark for visual intuitive physics understanding. TPAMI, 2021

    Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. Intphys 2019: A benchmark for visual intuitive physics understanding. TPAMI, 2021. 3

  44. [44]

    Robovqa: Multimodal long-horizon reasoning for robotics

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In ICRA, 2024. 13

  45. [45]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 8

  46. [46]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.

  47. [47]

    Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025. 22

  48. [48]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016. 6

  49. [49]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. 8

  50. [50]

    Llm-planner: Few-shot grounded planning for embodied agents with large language models

    Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In ICCV, 2023. 22

  51. [51]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 6

  52. [52]

    An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887, 2024

    Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887, 2024. 6, 7

  53. [53]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In CoRL, 2023. 12

  54. [54]

    Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In ICCV, 2023. 13

  55. [55]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 24

  56. [56]

    Magma: A foundation model for multimodal ai agents

    J Yang, R Tan, Q Wu, R Zheng, B Peng, and Y Liang. Magma: A foundation model for multimodal ai agents. In CVPR, 2025. 22

  57. [57]

    Robotic control via embodied chain-of-thought reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In CoRL, 2024. 13, 24

  58. [58]

    R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv e-prints, pages arXiv–2503, 2025

    Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv e-prints, pages arXiv–2503, 2025. 24

  59. [59]

    CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025. 24

  60. [60]

    R1-Zero's "aha moment" in visual reasoning on a 2B non-SFT model

    Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-zero's "aha moment" in visual reasoning on a 2b non-sft model. arXiv preprint arXiv:2503.05132, 2025. 24