Recognition: 2 theorem links
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Pith reviewed 2026-05-16 12:43 UTC · model grok-4.3
The pith
Cosmos-Reason1 models understand the physical world and generate embodied decisions through long chain-of-thought reasoning in natural language.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cosmos-Reason1 models demonstrate that multimodal large language models, when trained on data structured by ontologies of physical common sense and embodiment, can perform long chain-of-thought reasoning to interpret physical situations and produce appropriate next actions in natural language.
What carries the argument
The hierarchical ontology for physical common sense combined with the two-dimensional ontology for embodied reasoning, which structures data curation, training, and benchmark creation.
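The curation role of the ontologies can be pictured as a tagging pass over candidate samples. A minimal sketch, assuming a toy schema: the category names and `tag_sample` helper below are illustrative placeholders, not the paper's actual ontology or pipeline.

```python
# Illustrative ontology-driven tagging step. The categories and matching
# rule are invented stand-ins, not the paper's actual schema.

PHYSICAL_COMMON_SENSE = {            # hierarchical ontology (toy)
    "space": ["relationship", "environment"],
    "time": ["actions", "order"],
    "physics": ["object permanence", "gravity"],
}

def leaf_categories(ontology: dict) -> list[str]:
    """Flatten the hierarchy into 'parent/leaf' tags used for curation."""
    return [f"{parent}/{leaf}"
            for parent, leaves in ontology.items()
            for leaf in leaves]

def tag_sample(sample: dict, tags: list[str]) -> dict:
    """Attach matching ontology tags to a curated training sample."""
    matched = [t for t in tags if t.split("/")[1] in sample["caption"]]
    return {**sample, "tags": matched}

tags = leaf_categories(PHYSICAL_COMMON_SENSE)
example = tag_sample({"caption": "a ball falls under gravity"}, tags)
# example["tags"] == ["physics/gravity"]
```

The same flattened tag set can then index both the training data and the benchmark items, which is what makes the ontology load-bearing for the whole pipeline.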
If this is right
- Supervised fine-tuning followed by reinforcement learning on physical data produces clear gains on benchmarks for common sense and embodied reasoning.
- The two-dimensional ontology enables the models to output decisions that apply across varied physical embodiments.
- Natural language chain-of-thought outputs make the reasoning steps inspectable and usable for planning sequences of actions.
- Open release of the models and code supports wider experimentation in physical AI tasks.
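The inspectability point can be made concrete: a natural-language reasoning trace can be parsed into discrete plan steps. A minimal sketch, assuming a numbered "Step N:" output format that is our invention, not the paper's actual output schema:

```python
# Sketch of turning an inspectable chain-of-thought into an action plan.
# The "Step N:" trace format is an assumed convention for illustration.
import re

def parse_plan(cot: str) -> list[str]:
    """Extract 'Step N: action' lines from a reasoning trace."""
    return re.findall(r"Step \d+:\s*(.+)", cot)

trace = """The cup is on the shelf, so the arm must reach first.
Step 1: move gripper above the cup
Step 2: close gripper
Step 3: lift the cup"""

plan = parse_plan(trace)
# plan == ['move gripper above the cup', 'close gripper', 'lift the cup']
```

Because the intermediate reasoning is plain text, each extracted step can be checked or vetoed before execution, which is the practical payoff the bullet describes.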
Where Pith is reading between the lines
- Robots using these models could handle unexpected physical changes by generating reasoned plans instead of fixed scripts.
- Pairing the models with live camera input could support ongoing adaptation in changing environments.
- The same ontology structure might extend to domains such as chemical reactions or biological systems for broader reasoning applications.
- Hardware tests on actual robots would show how well the language-based decisions convert into physical control signals.
Load-bearing premise
The chosen ontologies for physical common sense and embodied reasoning contain enough knowledge to let the models generalize to new physical tasks and different robot bodies.
What would settle it
A set of test cases involving novel physical interactions, such as object manipulations under unrepresented constraints like variable gravity or friction, where the models produce incorrect action predictions despite long reasoning chains.
Original abstract
Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
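The two-stage recipe in the abstract (Physical AI SFT, then Physical AI RL) can be caricatured in a few lines. This is a hedged sketch with toy stand-ins for the model, data, and reward; it shows only the shape of the staging, not NVIDIA's actual training code.

```python
# Toy two-stage training loop: supervised stage, then reward-driven stage.
# Model = a dict of action scores; data and reward are placeholders.

def sft_step(weights, example, lr=0.1):
    """Supervised stage: nudge scores toward the labeled next action."""
    target = example["action"]
    weights[target] = weights.get(target, 0.0) + lr
    return weights

def rl_step(weights, example, reward, lr=0.1):
    """RL stage: reinforce the sampled action in proportion to reward."""
    action = max(weights, key=weights.get)      # greedy "rollout"
    weights[action] = weights[action] + lr * reward(example, action)
    return weights

reward = lambda ex, a: 1.0 if a == ex["action"] else -1.0

w = {}
for ex in [{"action": "grasp"}, {"action": "grasp"}, {"action": "lift"}]:
    w = sft_step(w, ex)                         # stage 1: Physical AI SFT
for ex in [{"action": "grasp"}]:
    w = rl_step(w, ex, reward)                  # stage 2: Physical AI RL
```

The design choice the abstract implies is the standard one: SFT gives the model the format and coverage of the ontology-curated data, and RL then sharpens the behavior with a task-level reward.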
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Cosmos-Reason1-7B and Cosmos-Reason1-56B multimodal LLMs designed to understand the physical world and generate embodied decisions via long chain-of-thought reasoning. It defines a hierarchical ontology capturing space, time, and physics for physical common sense, and a two-dimensional ontology for embodied reasoning that generalizes across embodiments. The models are trained in two stages—Physical AI supervised fine-tuning (SFT) followed by reinforcement learning (RL)—on curated data, with new benchmarks constructed directly from these ontologies. The central claim is that SFT and RL yield significant improvements on these benchmarks, with code and models released under the NVIDIA Open Model License.
Significance. If the performance gains prove robust and the ontologies enable genuine cross-task and cross-embodiment generalization, the work could provide a useful structured framework for Physical AI reasoning. The open release of models and code supports reproducibility. However, the current manuscript supplies no numerical results, baselines, or external validation, so the assessed significance remains provisional pending stronger empirical grounding.
Major comments (2)
- [Abstract] The claim that 'Physical AI SFT and RL bring significant improvements' is unsupported: the abstract (and the provided text) contains no quantitative metrics, error bars, ablation tables, or baseline comparisons, leaving the central empirical claim without evidence.
- [Evaluation] The new benchmarks are built directly from the hierarchical ontology (space/time/physics) and the two-dimensional embodied ontology introduced earlier in the paper. This construction risks circular validation: measured gains may reflect improved adherence to the authors' taxonomy rather than transferable physical reasoning. No results on established external benchmarks (e.g., Physion, CLEVRER, or standard robotics suites) are reported.
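The circularity concern above reduces to a measurable quantity: the gap between accuracy on the ontology-built benchmark and accuracy on an external one. A minimal sketch, with a hypothetical model and made-up benchmark items:

```python
# Sketch of the external-validation check the comment asks for. The model
# and benchmark data are hypothetical placeholders for illustration.

def accuracy(model, benchmark):
    """Fraction of (question, answer) pairs the model gets right."""
    correct = sum(model(q) == a for q, a in benchmark)
    return correct / len(benchmark)

model = lambda q: "falls"  # stand-in predictor with one fixed answer

internal = [("dropped ball?", "falls"), ("released cup?", "falls")]
external = [("helium balloon released?", "rises"), ("dropped ball?", "falls")]

gap = accuracy(model, internal) - accuracy(model, external)
# A large positive gap would suggest benchmark-specific adaptation
# rather than transferable physical reasoning.
```

Reporting this gap against established suites such as Physion or CLEVRER would directly address the objection.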
Minor comments (2)
- [Ontologies] The paper would benefit from an explicit statement of the exact dimensions of the two-dimensional embodied ontology and how they are operationalized in the data curation rules.
- [Methods] Notation for the model scales (7B and 56B) and training stages should be used consistently in tables and figures once introduced.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the major comments below and will revise the manuscript to better support the empirical claims while clarifying the benchmark design rationale.
Point-by-point responses
Referee: [Abstract] The claim that 'Physical AI SFT and RL bring significant improvements' is unsupported: the abstract (and the provided text) contains no quantitative metrics, error bars, ablation tables, or baseline comparisons, leaving the central empirical claim without evidence.
Authors: We agree that the abstract should include concrete quantitative support for the central claim. The full manuscript's evaluation section reports detailed results, including performance deltas from the SFT and RL stages, baseline comparisons, and ablation studies on the physical common sense and embodied reasoning benchmarks. We will revise the abstract to incorporate key metrics (e.g., accuracy improvements and error bars) with pointers to the relevant tables and figures. Revision: yes.
Referee: [Evaluation] The new benchmarks are built directly from the hierarchical ontology (space/time/physics) and the two-dimensional embodied ontology introduced earlier in the paper. This construction risks circular validation: measured gains may reflect improved adherence to the authors' taxonomy rather than transferable physical reasoning. No results on established external benchmarks (e.g., Physion, CLEVRER, or standard robotics suites) are reported.
Authors: The ontologies serve as a systematic framework to define and ensure coverage of core capabilities rather than a narrow taxonomy to be memorized; the benchmarks test long chain-of-thought reasoning and next-step action generation on held-out scenarios that require generalization across space, time, physics, and embodiments. Ablation studies in the paper separate the contribution of the training stages from simple ontology matching. We acknowledge the value of external validation and will add results on Physion and CLEVRER (plus a brief discussion of robotics suites) in the revised manuscript to demonstrate transfer, while retaining the new benchmarks as the primary evaluation for the defined Physical AI capabilities. Revision: partial.
Circularity Check
No circularity in derivation or evaluation chain
Full rationale
The paper presents an empirical pipeline: ontologies are introduced as definitional inputs for physical common sense (hierarchical space/time/physics) and embodied reasoning (two-dimensional), data is curated accordingly, models are trained via SFT then RL, and benchmarks are constructed from the same ontologies for evaluation. No mathematical derivations, equations, or predictions are claimed that reduce by construction to fitted parameters or prior outputs. The central results are reported empirical improvements on the purpose-built benchmarks, which does not constitute a self-referential reduction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. This is a standard task-definition and benchmarking approach for new capabilities rather than circular reasoning.
Axiom & Free-Parameter Ledger
Free parameters (2)
- 7B and 56B model scales
- Data curation rules for SFT and RL stages
Axioms (2)
- Domain assumption: the hierarchical ontology captures fundamental knowledge about space, time, and physics.
- Domain assumption: the two-dimensional ontology generalizes across physical embodiments.
Invented entities (2)
- hierarchical ontology for physical common sense (no independent evidence)
- two-dimensional ontology for embodied reasoning (no independent evidence)
Forward citations
Cited by 18 Pith papers
- SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation. SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
- SCP: Spatial Causal Prediction in Video. SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
- Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search. Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
- PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations. PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
- Seeing Fast and Slow: Learning the Flow of Time in Videos. Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
- One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models. Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation. OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation. OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
- Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement. Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.
- Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation. Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
- PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance. PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.
- StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement. StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
- RoboAgent: Chaining Basic Capabilities for Embodied Task Planning. RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy. JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments. XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
- Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap. A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.
- World Simulation with Video Foundation Models for Physical AI. Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.