hub Baseline reference

RoboBrain 2.0 Technical Report

Baseline reference. 60% of classified citations from Pith papers (3 of 5) use this work as a benchmark or comparison.

25 Pith papers citing it
Baseline 60% of classified citations

citation-role summary

baseline: 3 · background: 2

years

2026: 20 · 2025: 5

representative citing papers

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

cs.AI · 2026-06-14 · unverdicted · novelty 6.0

Introduces the PinCoT paradigm with visual reasoning anchors, builds the PIN-170K dataset via an automated pipeline, and trains the 4B RoboPIN model via three-stage post-training to outperform 7B baselines by 12% on embodied reasoning benchmarks.

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

cs.RO · 2026-04-23 · unverdicted · novelty 6.0

LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

State-of-the-art vision-language-action models fail catastrophically at dynamic embodied reasoning because of lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

Grounded 3D-Aware Spatial Vision-Language Modeling

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

GR3D is a VLM that combines explicit 2D, implicit 2D, and monocular 3D grounding mechanisms to improve performance on spatial understanding benchmarks.

Rethinking VLM Representation for VLA Initialization

cs.CV · 2026-05-25 · unverdicted · novelty 5.0

Experiments indicate original VLM representations are crucial for VLA performance, LoRA outperforms full finetuning, and staged robot-data pretraining yields the strongest initialization.

Kwai Keye-VL-2.0 Technical Report

cs.CV · 2026-06-09 · unverdicted · novelty 4.0

Kwai Keye-VL-2.0-30B-A3B is a 30B MoE model with 3B active parameters using DSA adaptation and MOPD distillation that reports SOTA results on video understanding and agent benchmarks.
