Towards Spatial Trace with Reasoning in Vision-Language Models for Robotics

Cheng Chi; Enshen Zhou; Huajie Tan; Jiawei He; Jiayuan Zhang; Jingkun An; Lu Sheng; Mengzhen Liu; Pengwei Wang; Shanghang Zhang

arxiv: 2512.13660 · v3 · pith:DV7YFKW5 · submitted 2025-12-15 · cs.RO · cs.CV


Enshen Zhou, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Mengzhen Liu, Yi Han, Yuheng Ji, Huajie Tan, Jiawei He, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Lu Sheng, Shanghang Zhang



keywords: spatial, robotracer, reasoning, referring, achieves, challenging, complex, fine-tuning



Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes. Please see the project page at https://zhoues.github.io/RoboTracer.
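The abstract does not say what form a spatial trace or the metric-sensitive rewards take. As a purely illustrative sketch, not RoboTracer's actual reward, the Python below assumes a trace is an ordered list of 3D waypoints in metres and scores a predicted trace by its mean point-wise distance to a reference trace. Both traces are first resampled evenly by arc length. The names `resample`, `trace_reward`, `scale_m`, and `n` are hypothetical, as is the exponential mapping from error to reward.

```python
import math
from typing import List, Tuple

# Assumption: a spatial trace is an ordered list of (x, y, z) waypoints in metres.
Point3D = Tuple[float, float, float]


def resample(trace: List[Point3D], n: int) -> List[Point3D]:
    """Resample a polyline to n points evenly spaced by arc length."""
    if n < 2 or len(trace) < 2:
        raise ValueError("need at least two waypoints and n >= 2")
    seg = [math.dist(a, b) for a, b in zip(trace, trace[1:])]
    total = sum(seg)
    if total == 0.0:
        # Degenerate trace: all waypoints coincide.
        return [trace[0]] * n
    out: List[Point3D] = []
    i, walked = 0, 0.0  # current segment index and length walked before it
    for k in range(n):
        target = total * k / (n - 1)
        # Advance to the segment that contains the target arc length.
        while i < len(seg) - 1 and walked + seg[i] < target:
            walked += seg[i]
            i += 1
        t = 0.0 if seg[i] == 0.0 else min(1.0, (target - walked) / seg[i])
        a, b = trace[i], trace[i + 1]
        out.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    return out


def trace_reward(pred: List[Point3D], ref: List[Point3D],
                 scale_m: float = 0.1, n: int = 16) -> float:
    """Hypothetical metric-sensitive score in (0, 1]: exp(-mean error / scale_m)."""
    p, r = resample(pred, n), resample(ref, n)
    mean_err = sum(math.dist(a, b) for a, b in zip(p, r)) / n
    return math.exp(-mean_err / scale_m)


if __name__ == "__main__":
    ref = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.2), (1.0, 0.0, 0.0)]
    pred = [(x, y, z + 0.1) for x, y, z in ref]  # same path, lifted by 10 cm
    print(trace_reward(ref, ref))   # identical traces give the maximum score
    print(trace_reward(pred, ref))  # a constant 10 cm offset lowers the score
```

Because the error is measured in metres rather than normalized image coordinates, a score of this kind penalizes scale mistakes directly. This is one plausible reading of "metric-sensitive"; the paper's own definition may differ.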



Forward citations

Cited by 1 Pith paper


3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance
cs.RO 2026-06 unverdicted novelty 6.0

3D HAMSTER augments a VLM with depth encoding and reconstruction to predict 3D waypoints that directly guide pointcloud policies, outperforming 2D baselines especially under distribution shifts.