hub Canonical reference

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA: Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen · 2025 · cs.AI · arXiv 2503.15558

Canonical reference. 77% of citing Pith papers cite this work as background.

49 Pith papers citing it

Background 77% of classified citations

open full Pith review browse 49 citing papers arXiv PDF

abstract

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 dataset 2

citation-polarity summary

background 10 use dataset 2 unclear 1

representative citing papers

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

Act2Answer protocol reveals VLA models retain simple concepts but show larger gaps on complex semantics than source VLMs, with VQA co-training linked to better retention and knowledge signals peaking in middle layers.

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

cs.CV · 2026-06-02 · accept · novelty 7.0

OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

cs.CV · 2026-05-30 · unverdicted · novelty 7.0 · 2 refs

Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

cs.RO · 2026-05-10 · unverdicted · novelty 7.0

SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

Do-Undo Bench: Reversibility for Action Understanding in Image Generation

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.

Action-guided generation of 3D functionality segmentation data

cs.CV · 2025-11-28 · unverdicted · novelty 7.0

SynthFun3D generates synthetic 3D functionality segmentation data from action descriptions via object retrieval and scene arrangement, yielding consistent gains of +2.2 mAP, +6.3 mAR, and +5.7 mIoU when augmenting real data for VLM training.

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

cs.RO · 2026-06-07 · unverdicted · novelty 6.0

Dream-Tac unifies visual and tactile signals in a world action model using contact-gated fusion and attention bias, reporting 31.7% average action accuracy gains on six manipulation tasks.

Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

cs.DB · 2026-06-04 · unverdicted · novelty 6.0

Introduces CausalPhys benchmark with causal graphs and CRFT fine-tuning to improve VLMs' causal physical reasoning accuracy and interpretability.

Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

Introduces a new diagnostic benchmark and million-scale reasoning corpus showing that training on explicit causal traces improves next-state prediction in embodied planning, with reported gains from data scaling.

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

cs.RO · 2026-05-16 · unverdicted · novelty 6.0

DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.

Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

Seeing Fast and Slow: Learning the Flow of Time in Videos

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

cs.RO · 2026-02-09 · unverdicted · novelty 6.0

R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

citing papers explorer

Showing 49 of 49 citing papers.

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models cs.LG · 2026-06-17 · unverdicted · none · ref 46 · internal anchor
Act2Answer protocol reveals VLA models retain simple concepts but show larger gaps on complex semantics than source VLMs, with VQA co-training linked to better retention and knowledge signals peaking in middle layers.
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs cs.CV · 2026-06-02 · accept · none · ref 1 · internal anchor
OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.
Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion cs.CV · 2026-05-30 · unverdicted · none · ref 11 · 2 links · internal anchor
Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks cs.CV · 2026-05-21 · unverdicted · none · ref 3 · internal anchor
MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.
SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation cs.RO · 2026-05-10 · unverdicted · none · ref 18 · internal anchor
SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
SCP: Spatial Causal Prediction in Video cs.CV · 2026-03-04 · unverdicted · none · ref 4 · internal anchor
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
Do-Undo Bench: Reversibility for Action Understanding in Image Generation cs.CV · 2025-12-15 · unverdicted · none · ref 2 · internal anchor
Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
Action-guided generation of 3D functionality segmentation data cs.CV · 2025-11-28 · unverdicted · none · ref 3 · internal anchor
SynthFun3D generates synthetic 3D functionality segmentation data from action descriptions via object retrieval and scene arrangement, yielding consistent gains of +2.2 mAP, +6.3 mAR, and +5.7 mIoU when augmenting real data for VLM training.
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models cs.CV · 2026-06-16 · unverdicted · none · ref 111 · internal anchor
SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 41 · internal anchor
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation cs.RO · 2026-06-07 · unverdicted · none · ref 7 · internal anchor
Dream-Tac unifies visual and tactile signals in a world action model using contact-gated fusion and attention bias, reporting 31.7% average action accuracy gains on six manipulation tasks.
Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs cs.DB · 2026-06-04 · unverdicted · none · ref 6 · internal anchor
Introduces CausalPhys benchmark with causal graphs and CRFT fine-tuning to improve VLMs' causal physical reasoning accuracy and interpretability.
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 36 · internal anchor
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners cs.AI · 2026-06-01 · unverdicted · none · ref 3 · internal anchor
Introduces a new diagnostic benchmark and million-scale reasoning corpus showing that training on explicit causal traces improves next-state prediction in embodied planning, with reported gains from data scaling.
nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving cs.CV · 2026-05-29 · unverdicted · none · ref 87 · internal anchor
nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.
How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning cs.RO · 2026-05-16 · unverdicted · none · ref 29 · internal anchor
DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search cs.CV · 2026-05-09 · unverdicted · none · ref 61 · internal anchor
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations cs.AI · 2026-04-30 · unverdicted · none · ref 23 · internal anchor
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Seeing Fast and Slow: Learning the Flow of Time in Videos cs.CV · 2026-04-23 · unverdicted · none · ref 1 · internal anchor
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models cs.LG · 2026-04-20 · unverdicted · none · ref 148 · internal anchor
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation cs.CV · 2026-04-20 · unverdicted · none · ref 3 · 2 links · internal anchor
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement cs.CV · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning cs.RO · 2026-02-09 · unverdicted · none · ref 15 · internal anchor
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
MiMo-Embodied: X-Embodied Foundation Model Technical Report cs.RO · 2025-11-20 · unverdicted · none · ref 3 · internal anchor
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail cs.RO · 2025-10-30 · conditional · none · ref 67 · internal anchor
Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in driving tasks.
Grounded Reinforcement Learning for Visual Reasoning cs.CV · 2025-05-29 · unverdicted · none · ref 43 · internal anchor
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
OmniView-Space: Reinforcing Spatial Reasoning via Multi-Perspective Spatial Mapping cs.CV · 2026-07-01 · unverdicted · none · ref 70 · internal anchor
OmniView-Space framework with MPSM, tool-guided reasoning, and distillation achieves SOTA on spatial reasoning benchmarks for MLLMs while reducing external geometry dependencies.
Multi-scale Mixture of World Models for Embodied Agents in Evolving Environments cs.AI · 2026-07-01 · unverdicted · none · ref 2 · internal anchor
MuSix introduces scale-aware world model mixtures with experiential-distance routing and adaptive forgetting to improve multi-scale reasoning and dynamic adaptation in embodied agents.
Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model cs.CV · 2026-06-25 · unverdicted · none · ref 57 · internal anchor
DexAC-WM improves FID, FVD, and PCK in high-DoF action-conditioned video prediction via structured action modeling and semantic grounding on EgoDex and EgoVerse.
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 54 · internal anchor
PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models cs.RO · 2026-06-09 · unverdicted · none · ref 1 · internal anchor
Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.
Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning cs.RO · 2026-06-04 · unverdicted · none · ref 2 · internal anchor
Discrete-WAM unifies world modeling and policy learning for autonomous driving by representing observations, states, decisions, and actions as tokens in one space and using hierarchical token editing for planning.
Wall-OSS-0.5 Technical Report cs.RO · 2026-05-29 · unverdicted · none · ref 59 · internal anchor
Wall-OSS-0.5 is a 4B VLA model pretrained across many embodiments that achieves zero-shot real-robot performance on a 17-task suite and outperforms π_0.5 after fine-tuning.
GEM: Generative Supervision Helps Embodied Intelligence cs.CV · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.
Extending Embodied Question Answering from Perception to Decision cs.RO · 2026-05-25 · unverdicted · none · ref 3 · internal anchor
Introduces EQA-Decision dataset with 4M+ QA pairs across four embodied reasoning dimensions and RoboDecision baseline for joint perception-reasoning-decision evaluation.
Rethinking VLM Representation for VLA Initialization cs.CV · 2026-05-25 · unverdicted · none · ref 25 · internal anchor
Experiments indicate original VLM representations are crucial for VLA performance, LoRA outperforms full finetuning, and staged robot-data pretraining yields the strongest initialization.
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation cs.AI · 2026-05-09 · unverdicted · none · ref 32 · internal anchor
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance cs.RO · 2026-04-22 · unverdicted · none · ref 45 · internal anchor
PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement cs.RO · 2026-04-20 · unverdicted · none · ref 2 · internal anchor
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning cs.RO · 2026-04-09 · unverdicted · none · ref 4 · internal anchor
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations cs.AI · 2026-03-18 · unverdicted · none · ref 2 · internal anchor
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning cs.CV · 2025-07-22 · unverdicted · none · ref 31 · internal anchor
ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 280 · internal anchor
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 27 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy cs.RO · 2026-04-22 · unverdicted · none · ref 1 · internal anchor
JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV · 2026-04-20 · unverdicted · none · ref 2 · internal anchor
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap cs.RO · 2026-04-15 · unverdicted · none · ref 26 · internal anchor
A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.
World Simulation with Video Foundation Models for Physical AI cs.CV · 2025-10-28 · unverdicted · none · ref 55 · internal anchor
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
VRAG: Learning World Models for Interactive Video Generation cs.CV · 2025-05-28 · unreviewed · ref 49 · internal anchor

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer