Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
Embodied arena: A comprehensive, unified, and evolving evaluation platform for embodied ai
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 2
citation-polarity summary
years
2026 2verdicts
UNVERDICTED 2roles
background 2polarities
background 2representative citing papers
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
citing papers explorer
-
Token Warping Helps MLLMs Look from Nearby Viewpoints
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
-
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.