
RoboBrain 2.0 Technical Report

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, et al. · 2025 · arXiv:2507.02029

Baseline reference. 60% of the classified citations from Pith papers use this work as a benchmark or comparison.

27 Pith papers citing it

Baseline: 60% of classified citations


citation-role summary

baseline: 3 · background: 2

citation-polarity summary

baseline: 3 · background: 1 · unclear: 1
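
The 60% baseline figure appears to follow from the role counts: 3 baseline citations out of 5 classified citations. The page does not state its formula, so the following is only a minimal sketch that assumes each share is a role's count divided by the total number of classified citations. The function name `role_shares` is hypothetical and is not part of any Pith tooling.

```python
def role_shares(counts):
    """Return each role's percentage of all classified citations.

    Assumption: the share is count / total classified, as the
    3-of-5 baseline count suggests; the page does not confirm this.
    """
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing classified, so no shares to report
    return {role: 100.0 * n / total for role, n in counts.items()}


# Role counts as listed in the citation-role summary.
roles = {"baseline": 3, "background": 2}
shares = role_shares(roles)
print(shares)
```

Under this assumption the remaining 22 of the 27 citing papers are unclassified and do not enter the percentage.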

representative citing papers

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Astra couples an RL-trained VLM policy with a view-consistent Bagel-based world simulator to enable agentic imagination during spatial reasoning, yielding benchmark gains on MMSI-Bench and MindCube.

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

cs.RO · 2025-11-21 · accept · novelty 7.0

RoboCOIN is a large multi-embodiment bimanual manipulation dataset with hierarchical annotations and an open processing pipeline that improves model performance across robotic platforms.

RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

cs.AI · 2026-06-14 · unverdicted · novelty 6.0

Introduces PinCoT paradigm with visual reasoning anchors, builds PIN-170K dataset via automated pipeline, and trains 4B RoboPIN model via three-stage post-training to outperform 7B baselines by 12% on embodied reasoning benchmarks.

RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 6.0

RoboProcessBench is a new benchmark decomposing process-aware understanding into static monitoring and dynamic reasoning across 12 question families, with evaluations showing VLM limitations but post-training gains on the provided data.

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

Introduces a new diagnostic benchmark and million-scale reasoning corpus showing that training on explicit causal traces improves next-state prediction in embodied planning, with reported gains from data scaling.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

Assistance Without Interruption: A Benchmark and LLM-based Framework for Non-Intrusive Human-Robot Assistance

cs.RO · 2026-05-02 · unverdicted · novelty 6.0

The work creates NIABench and an LLM-plus-scoring-model framework that enables robots to deliver proactive assistance during human multi-step activities while avoiding interruptions and reducing human effort.

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

cs.RO · 2026-04-23 · unverdicted · novelty 6.0

LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes

cs.CV · 2025-12-15 · unverdicted · novelty 6.0

GAPL learns a compact set of canonical forgery prototypes and applies two-stage LoRA training to build a low-variance feature space that improves generalization across GAN and diffusion generators.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

cs.LG · 2025-10-31 · unverdicted · novelty 6.0

DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

cs.RO · 2025-10-15 · unverdicted · novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

OmniView-Space: Reinforcing Spatial Reasoning via Multi-Perspective Spatial Mapping

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

OmniView-Space framework with MPSM, tool-guided reasoning, and distillation achieves SOTA on spatial reasoning benchmarks for MLLMs while reducing external geometry dependencies.

FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 5.0

FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

cs.RO · 2026-06-11 · unverdicted · novelty 5.0

SPARC generates reliable spatial annotations for robot demonstrations by leveraging spatio-temporal task structure, outperforming detection baselines on localization accuracy while retaining more samples and enabling competitive model performance without manual annotations.

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

cs.RO · 2026-06-09 · unverdicted · novelty 5.0

Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

cs.RO · 2026-06-07 · unverdicted · novelty 5.0

Introduces embodied trajectory-coupled data and a three-stage training recipe to bridge VLMs to generalizable VLAs without steep degradation of pre-trained representations.

Grounded 3D-Aware Spatial Vision-Language Modeling

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

GR3D is a VLM that combines explicit 2D, implicit 2D, and monocular 3D grounding mechanisms to improve performance on spatial understanding benchmarks.

Extending Embodied Question Answering from Perception to Decision

cs.RO · 2026-05-25 · unverdicted · novelty 5.0

Introduces EQA-Decision dataset with 4M+ QA pairs across four embodied reasoning dimensions and RoboDecision baseline for joint perception-reasoning-decision evaluation.

Rethinking VLM Representation for VLA Initialization

cs.CV · 2026-05-25 · unverdicted · novelty 5.0

Experiments indicate original VLM representations are crucial for VLA performance, LoRA outperforms full finetuning, and staged robot-data pretraining yields the strongest initialization.

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.

citing papers explorer


AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly

cs.RO · 2026-04-10 · unreviewed · ref 40
