Tiger: Tool-integrated geometric reasoning in vision-language models for robotics
4 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
  LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
- BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
  BOP-ASK supplies 150k images and 33M QA pairs across six tasks to improve VLMs on precise 3D object interaction reasoning and spatial planning.
- MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
  MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.
- AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly