Authors create ReasonMatch-Bench and DCRL training to boost MLLM performance on wide-baseline matching, reporting gains over baselines while preserving general capabilities.
Robobrain: A unified brain model for robotic manipulation from abstract to concrete
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
DR-MV3D decomposes MV3D-VQA into global map construction, question-conditioned view planning, and egocentric grounding, supervised by global consistency and local trajectory rewards optimized via GRPO.
Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
MapNav uses annotated semantic maps as memory for VLN agents, claiming SOTA results in simulation and real-world tests while promising code and data release.
GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.
citing papers explorer
-
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Authors create ReasonMatch-Bench and DCRL training to boost MLLM performance on wide-baseline matching, reporting gains over baselines while preserving general capabilities.
-
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
-
Exploring Spatial Intelligence from a Generative Perspective
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
-
Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views
DR-MV3D decomposes MV3D-VQA into global map construction, question-conditioned view planning, and egocentric grounding, supervised by global consistency and local trajectory rewards optimized via GRPO.
-
Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement
Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.
-
MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation
MapNav uses annotated semantic maps as memory for VLN agents, claiming SOTA results in simulation and real-world tests while promising code and data release.
-
GEM: Generative Supervision Helps Embodied Intelligence
GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.
-
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
-
Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning
Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.