BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Farshad Khorrami; Greg Heinrich; Jonathan Tremblay; Prashanth Krishnamurthy; Ramesh Karri; Stan Birchfield; Sungsu Kim; Valts Blukis; Vineet Bhat

arxiv: 2511.16857 · v3 · submitted 2025-11-20 · 💻 cs.CV · cs.RO

Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay

Pith reviewed 2026-05-17 19:54 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords vision-language models, object interaction reasoning, spatial reasoning, dataset, 6D pose estimation, grasp poses, trajectory planning, affordances

The pith

Training on the BOP-ASK dataset teaches vision-language models precise object grasp estimation and multi-step spatial planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BOP-ASK, a dataset of over 150,000 images and 33 million question-answer pairs built from 6D object poses in existing BOP collections. It supplies annotations for grasp poses, trajectories, relative spatial relations, depth, and object-to-object links across six tasks. Models trained on this data outperform baselines and acquire new skills in accurate pose estimation, path planning, and object-centric reasoning inside cluttered scenes. A sympathetic reader would care because standard vision-language benchmarks overlook the fine-grained physical and interaction details required for practical use.

Core claim

By deriving fine-grained annotations for grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships directly from 6D object poses in the BOP datasets, BOP-ASK creates a training and evaluation resource that lets vision-language models develop precise object localization, physical compatibility understanding, affordance recognition, and multi-step planning abilities.

What carries the argument

The BOP-ASK dataset, which automatically generates question-answer pairs for object-interaction tasks from 6D pose data to support training and benchmarking of vision-language models.
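As a rough illustration of this mechanism, the sketch below templates one depth-relation question from two 6D poses. It assumes each pose is a rotation matrix plus a translation in the camera frame, in millimetres with +z pointing away from the camera; the `ObjectPose` container, the `depth_relation_qa` function, the object names, and the question wording are all hypothetical and are not taken from the paper's pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectPose:
    """6D pose of one object in the camera frame (hypothetical container)."""
    name: str
    rotation: List[List[float]]              # 3x3 rotation matrix, model -> camera
    translation: Tuple[float, float, float]  # millimetres, +z away from camera (assumed)


def depth_relation_qa(a: ObjectPose, b: ObjectPose) -> Tuple[str, str]:
    """Template one question-answer pair about which object is closer."""
    question = f"Which object is closer to the camera, the {a.name} or the {b.name}?"
    # Under the assumed convention, depth is the z component of the translation.
    closer = a if a.translation[2] < b.translation[2] else b
    return question, f"The {closer.name}."


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mug = ObjectPose("mug", IDENTITY, (50.0, -20.0, 600.0))
drill = ObjectPose("drill", IDENTITY, (-120.0, 10.0, 850.0))

question, answer = depth_relation_qa(mug, drill)
print(question)
print(answer)
```

The answer comes from the pose annotation alone and no image is consulted, which is why the quality of such labels depends entirely on the accuracy of the underlying poses.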

If this is right

Models trained on BOP-ASK outperform baselines on the six object-interaction tasks.
Trained models exhibit precise object and grasp pose estimation.
Models demonstrate trajectory planning and fine-grained object-centric spatial reasoning in cluttered environments.
Performance generalizes to out-of-distribution images in the released BOP-ASK-lab benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The annotation generation pipeline could be applied to other 6D pose datasets to produce additional training resources for interaction reasoning.
Improved spatial and planning capabilities in vision-language models may support more reliable integration into robotic manipulation systems.
Human evaluations on the BOP-ASK-core test set point toward reduced reliance on manual annotation for spatial tasks.

Load-bearing premise

The automatically generated annotations from 6D object poses accurately reflect real-world physical compatibility, affordances, and multi-step planning needs.

What would settle it

A model trained on BOP-ASK fails to achieve higher grasp success rates or more accurate trajectory plans than baseline models when evaluated on physical robot experiments using the same objects and scenes.

Figures

Figures reproduced from arXiv: 2511.16857 by Farshad Khorrami, Greg Heinrich, Jonathan Tremblay, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Sungsu Kim, Valts Blukis, Vineet Bhat.

**Figure 1.** The BOP-Ask dataset facilitates object-interaction reasoning for robot manipulation. This illustration demonstrates how a model trained on BOP-Ask enables human and robot-aligned spatial understanding for different actions, supporting physical relationship, locating where to grasp objects, precise pose estimation, and motion planning between objects.

**Figure 2.** Overview of the BOP-Ask dataset. We automatically generate object-interaction and spatial reasoning annotations from 3D point clouds, images, object poses and 3D models with description. We create question/answer pairs covering 6 types of questions (from left to right, top to bottom), object pose estimation, grasp affordance, motion planning, physical interaction, object relationship, and depth relationshi…

**Figure 3.** Predictions from samples in BOP-Ask-core and BOP-Ask-lab (identified by (lab)), showing improvements gained from fine-tuning on BOP-Ask. Predictions from NVILA (shown in magenta) and NVILA SFT (shown in blue) are shown alongside the Ground Truth (in green). For the 'Rearrangement' task, the Ground Truth shape delineates the area of valid predictions. Absence of a colored prediction indicates none was made …

**Figure 4.** Our proposed data generation framework can trans…

**Figure 5.** Distribution of free form questions and task types in…

**Figure 6.** Real world robot experiments with a Franka arm and a ZED2 Stereo camera. VLMs fine-tuned on…

Original abstract

Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high level relationships ('left of,' 'behind', etc.) but ignore fine-grained spatial understanding needed for real world applications: precise 3D localization, physical compatibility between objects, object affordances and multi step spatial planning. In this work, we present BOP-ASK, a novel large scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets from which we derive fine grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open sourced VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BOP-ASK adds a practical dataset for fine-grained VLM interaction tasks derived from BOP poses, but the auto-generated labels need closer checks on physical accuracy.

BOP-ASK is mainly a new dataset that turns existing BOP 6D pose data into a large collection of QA pairs for object-interaction reasoning in VLMs. The pipeline derives grasp poses, trajectories, relative depths, and object-to-object relations automatically from the pose annotations, yielding over 150k images and 33M pairs across six tasks, four of them new. This targets a clear gap, since most current VLM spatial benchmarks stay at coarse relations and skip the precise localization and multi-step planning that matter for robotics. The release of BOP-ASK-core with human evaluation, plus the out-of-distribution BOP-ASK-lab set, is a straightforward plus for testing generalization. Experiments indicate that models trained on the data beat baselines and show some gains in pose estimation and planning.

The main soft spot is label quality. The abstract gives no sign of physics simulation, collision checking, or systematic validation on the training distribution itself, with human review limited to the core test split. If the derived affordances or paths carry noise or incorrect compatibility assumptions, the performance lifts and any emergent behaviors could partly trace to dataset artifacts rather than genuine interaction understanding.

This work suits researchers building or evaluating VLMs for manipulation and spatial planning. Anyone focused on dataset resources for object-centric reasoning will get direct value from the scale and task definitions. The contribution is grounded enough on the data side to merit a serious referee rather than a desk reject. I would send it for peer review and flag the need for more explicit validation of the generated annotations against real physical constraints.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BOP-ASK, a large-scale dataset derived from BOP 6D object poses containing over 150k images and 33M QA pairs across six tasks (four novel) focused on fine-grained object interaction reasoning. Annotations for grasp poses, trajectories, relative depth, and object-to-object compatibility are generated automatically from the poses. The authors evaluate proprietary and open-source VLMs, report that models trained on BOP-ASK outperform baselines with emergent capabilities in precise pose estimation, trajectory planning, and object-centric spatial reasoning, and release BOP-ASK-core (with human evaluation) and BOP-ASK-lab (OOD benchmark).

Significance. If the auto-generated labels are reliable, the dataset fills an important gap in training VLMs for embodied and robotic applications requiring physical compatibility and multi-step planning. The scale, task diversity, and provision of both in-distribution and out-of-distribution test sets are strengths that could support reproducible progress in this area.

major comments (2)

[§3 (Data Generation Pipeline)] The derivation of grasp poses, path planning trajectories, and object-to-object relationships from 6D poses is presented without collision checking, physics simulation, or human validation on the training distribution (only human evaluation is reported on the BOP-ASK-core test split). This is load-bearing for the central claim, as noisy or incorrect affordance labels could produce the reported performance gains and 'emergent' behaviors as artifacts rather than evidence of genuine interaction understanding.
[§5 (Experiments)] The protocol for baseline comparisons and the demonstration of outperformance lack explicit controls for potential selection effects in question generation or post-hoc filtering. Without these details, it is difficult to confirm that the gains on the six tasks (including the four novel ones) reflect improved reasoning rather than dataset-specific biases.

minor comments (2)

[Abstract] The abstract states that four tasks are novel but does not name them; this should be stated explicitly in the introduction or dataset section for clarity.
[Figures and Tables] Figure captions and table headers should consistently define abbreviations such as 'BOP-ASK-core' on first use to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version of the paper.

Referee: [§3 (Data Generation Pipeline)] The derivation of grasp poses, path planning trajectories, and object-to-object relationships from 6D poses is presented without collision checking, physics simulation, or human validation on the training distribution (only human evaluation is reported on the BOP-ASK-core test split). This is load-bearing for the central claim, as noisy or incorrect affordance labels could produce the reported performance gains and 'emergent' behaviors as artifacts rather than evidence of genuine interaction understanding.

Authors: The annotations are produced via deterministic geometric operations applied directly to the high-accuracy 6D poses supplied by the BOP benchmark. Grasp poses are derived from surface normals and principal axes, trajectories are constructed as sequences of relative SE(3) transforms, and object-to-object compatibility is computed from pairwise pose distances and orientation thresholds. Because the tasks target pose-based spatial reasoning rather than dynamic physical interaction, full physics simulation and collision checking were not required for label generation. Human validation was performed on the BOP-ASK-core test split, where label agreement exceeded 90%. We will revise §3 to document the precise geometric procedures, add quantitative consistency checks across the training distribution, and include an explicit limitations paragraph acknowledging the lack of physics-based validation. These additions will clarify that reported gains arise from improved spatial reasoning rather than label artifacts. revision: partial
Referee: [§5 (Experiments)] The protocol for baseline comparisons and the demonstration of outperformance lack explicit controls for potential selection effects in question generation or post-hoc filtering. Without these details, it is difficult to confirm that the gains on the six tasks (including the four novel ones) reflect improved reasoning rather than dataset-specific biases.

Authors: All question-answer pairs were generated exhaustively from the pose annotations for every image and task combination; no post-hoc filtering or model-dependent selection was applied. The identical question set was used for every baseline and fine-tuned model. We will expand §5 with a complete description of the generation algorithm, per-task question counts, and additional ablation results that subsample questions uniformly and re-evaluate performance. These controls will demonstrate that the observed improvements on both in-distribution and out-of-distribution sets are attributable to enhanced object-interaction reasoning. revision: yes
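
The geometric operations named in the first response (relative SE(3) transforms, pairwise pose distances, orientation thresholds) can be sketched as follows. This is only an illustration under stated assumptions: poses are given as a 3×3 rotation matrix and a translation in metres, and the function names and the 0.10 m / 15° thresholds are hypothetical, since the rebuttal gives neither the actual values nor any code.

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def relative_pose(r_a, t_a, r_b, t_b):
    """Pose of object b expressed in the frame of object a: T_a^{-1} T_b."""
    r_a_inv = transpose(r_a)  # the inverse of a rotation matrix is its transpose
    r_rel = matmul(r_a_inv, r_b)
    t_rel = matvec(r_a_inv, [t_b[i] - t_a[i] for i in range(3)])
    return r_rel, t_rel


def rotation_angle(r):
    """Angle in radians of the rotation encoded by a 3x3 rotation matrix."""
    cos_theta = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding


def is_compatible(r_a, t_a, r_b, t_b, max_dist=0.10, max_angle=math.radians(15)):
    """Hypothetical rule: the two poses are close and similarly oriented."""
    r_rel, t_rel = relative_pose(r_a, t_a, r_b, t_b)
    dist = math.sqrt(sum(c * c for c in t_rel))
    return dist <= max_dist and rotation_angle(r_rel) <= max_angle


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


r_a, t_a = rot_z(0.0), (0.0, 0.0, 0.5)
r_b, t_b = rot_z(math.radians(10)), (0.05, 0.0, 0.5)
print(is_compatible(r_a, t_a, r_b, t_b))
```

A rule of this kind is purely kinematic: nothing in it checks contact, stability, or collisions, which is the gap the referee's first comment points at.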

Circularity Check

0 steps flagged

No significant circularity; empirical dataset and evaluation paper with independent test benchmarks

The paper's core contribution is the construction of the BOP-ASK dataset by automatically deriving QA pairs and annotations (grasp poses, trajectories, spatial relations) from existing external BOP 6D pose data, followed by empirical training and evaluation of VLMs on held-out splits including human-validated BOP-ASK-core and an OOD BOP-ASK-lab benchmark. No equations, fitted parameters, or derivation chains are present that would reduce reported performance or emergent capabilities to quantities defined by construction inside the paper. Claims rest on experimental comparisons against baselines rather than self-referential definitions, self-citation load-bearing uniqueness theorems, or ansatz smuggling. This is a standard self-contained empirical contribution against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that 6D poses from BOP datasets can be reliably transformed into grasp poses, depth relations, and planning trajectories without additional physical simulation or human validation at scale.

axioms (1)

Domain assumption: Existing BOP 6D pose annotations are sufficiently accurate and complete to derive grasp poses, relative spatial relations, and multi-step trajectories.
Invoked in the data generation pipeline description.

pith-pipeline@v0.9.0 · 5586 in / 1259 out tokens · 22191 ms · 2026-05-17T19:54:06.928943+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Motion trajectories are synthesized using a Rapidly-exploring Random Tree (RRT) planner operating in 3D Cartesian space... Object grasps are computed... using the transformer-based parallel gripper model M2T2
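
For readers unfamiliar with the planner named in this passage, here is a minimal RRT sketch in 3D Cartesian space. It is a textbook-style version with goal biasing, not the paper's implementation; the function names, the unit-cube workspace, the single spherical obstacle, and all parameter values are assumptions chosen for illustration, and each edge is checked only at its midpoint and endpoint.

```python
import math
import random


def rrt_3d(start, goal, is_free, bounds, step=0.1, goal_tol=0.1,
           goal_bias=0.2, max_iters=5000, seed=0):
    """Grow a tree from start toward random samples; return a path or None."""
    rng = random.Random(seed)
    nodes = [tuple(start)]
    parent = {0: None}
    for _ in range(max_iters):
        # Occasionally sample the goal itself to pull the tree toward it.
        if rng.random() < goal_bias:
            sample = tuple(goal)
        else:
            sample = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        # Nearest existing node to the sample (linear scan for simplicity).
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0.0:
            continue
        # Steer at most one step from the nearest node toward the sample.
        scale = min(step, d) / d
        new = tuple(n + scale * (s - n) for n, s in zip(near, sample))
        # Reject the edge if its midpoint or endpoint is in collision.
        mid = tuple((n + m) / 2.0 for n, m in zip(near, new))
        if not (is_free(mid) and is_free(new)):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) <= goal_tol:
            # Walk parent links back to the root to recover the path.
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None


def outside_sphere(p, center=(0.5, 0.5, 0.5), radius=0.2):
    """Hypothetical obstacle model: free space is everything outside one sphere."""
    return math.dist(p, center) > radius


path = rrt_3d((0.1, 0.1, 0.1), (0.9, 0.9, 0.9), outside_sphere,
              bounds=[(0.0, 1.0)] * 3)
print(len(path) if path else "no path found")
```

The returned waypoints are jagged, which is typical of RRT output and is usually smoothed or simplified before use.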
IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear

We estimate the camera-to-world transformation... fit a plane to the corresponding 3D points via RANSAC
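
The RANSAC plane fit mentioned in this passage can be sketched as below. This is a generic version, assuming points are (x, y, z) tuples in metres; the function names, inlier threshold, iteration count, and synthetic data are hypothetical, and the paper's own procedure may differ (for example by refitting on the inliers).

```python
import math
import random


def plane_from_points(p, q, r):
    """Return (unit normal n, offset d) with n . x + d = 0, or None if degenerate."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = math.sqrt(sum(c * c for c in n))
    if norm < 1e-12:
        return None  # the three points are (nearly) collinear
    n = [c / norm for c in n]
    return n, -sum(n[i] * p[i] for i in range(3))


def ransac_plane(points, threshold=0.01, iters=200, seed=0):
    """Keep the plane through 3 random points that has the most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        # A point is an inlier if its distance to the plane is within threshold.
        inliers = [pt for pt in points
                   if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Synthetic data: 25 points on the plane z = 0 plus 5 off-plane outliers.
plane_pts = [(0.1 * i, 0.1 * j, 0.0) for i in range(5) for j in range(5)]
outliers = [(0.1, 0.2, 0.5), (0.3, 0.1, 0.7), (0.2, 0.4, 0.6),
            (0.0, 0.3, 0.9), (0.4, 0.0, 0.8)]
model, inliers = ransac_plane(plane_pts + outliers)
print(model, len(inliers))
```

The sign of the recovered normal is arbitrary, so downstream code that needs an "up" direction has to orient it explicitly.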

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 3 internal anchors

[1] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[3] Vineet Bhat, Naman Patel, Prashanth Krishnamurthy, Ramesh Karri, and Farshad Khorrami. Maplegrasp: Mask-guided feature pooling for language-driven efficient robotic grasping. arXiv preprint arXiv:2506.06535, 2025.
[4] Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025.
[5] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, 2024.
[6] Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R Sanketi, and Ken Goldberg. Robo2vlm: Improving visual question answering using large-scale robot manipulation data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
[8] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[9] Matt Deitke, Christopher Clark, Sangho Lee, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024.
[10] Abhay Deshpande, Yuquan Deng, Arijit Ray, Jordi Salvador, Winson Han, Jiafei Duan, Kuo-Hao Zeng, Yuke Zhu, Ranjay Krishna, and Rose Hendrix. Graspmolmo: Generalizable task-oriented grasping via large-scale synthetic data generation, 2025.
[11] David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, 1973.
[12] Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. EmbSpatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2024.
[13] Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, and Yijie Guo. AHA: A vision-language-model for detecting and reasoning over failures in robotic manipulation. In The Thirteenth International Conference on Learning Representations, 2025.
[14] Kuan Fang, Fangchen Liu, Pieter Abbeel, and Sergey Levine. Moka: Open-world robotic manipulation through mark-based visual prompting. Robotics: Science and Systems (RSS), 2024.
[15] Xingyu Fu, Yushi Hu, Bangzheng Li, et al. Blink: Multimodal large language models can see but not perceive. In Proceedings of the European Conference on Computer Vision (ECCV), pages 148–166. Springer, 2024.
[16] Andrew Guo, Bowen Wen, Jianhe Yuan, Jonathan Tremblay, Stephen Tyree, Jeffrey Smith, and Stan Birchfield. Handal: A dataset of real-world manipulable object categories with pose annotations, affordances, and reconstructions. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11428–11435, 2023.
[17] Yi Han, Cheng Chi, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Lu Sheng, and Shanghang Zhang. Tiger: Tool-integrated geometric reasoning in vision-language models for robotics. arXiv preprint arXiv:2510.07181, 2025.
[18] Stefan Hinterstoißer, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Kurt Konolige, Gary R. Bradski, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian Conference on Computer Vision, 2012.
[19] Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, and Yang Gao. Copa: General robotic manipulation through spatial constraints of parts with foundation models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9488–9495, 2024.
[20] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. In 8th Annual Conference on Robot Learning, 2024.
[21] Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6700–6709, 2019.
[22] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1724–1734, 2025.
[23] Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
[24] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[25] Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
[26] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, et al. Openvla: An open-source vision-language-action model. In Proceedings of the 8th Conference on Robot Learning (CoRL), pages 2679–2713. PMLR, 2025.
[27] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
[28] Steven M. LaValle and James J. Kuffner. Rapidly-exploring random trees: A new tool for path planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 1999.
[29] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025.
[30] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500,
[31] Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, and Siyuan Huang. Multi-modal situated reasoning in 3d scenes. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[32] Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023.
[33] Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. Nvila: Efficient frontier visual lan...
[34] Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors. Advances in neural information processing systems, 37:68803–68832, 2024.
[35] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. In International Conference on Learning Representations, 2023.
[36] Yunze Man, Liang-Yan Gui, and Yu-Xiong Wang. Situational awareness matters in 3d vision language reasoning. In CVPR, 2024.
[37] Soroush Nasiriany, Fei Xia, Wenhao Yu, et al. Pivot: iterative visual prompting elicits actionable knowledge for vlms. In Proceedings of the International Conference on Machine Learning (ICML). JMLR.org, 2024.
[38] Van Nguyen Nguyen, Stephen Tyree, Andrew Guo, Mederic Fourmy, Anas Gouda, Taeyeop Lee, Sungphill Moon, Hyeontae Son, Lukas Ranftl, Jonathan Tremblay, et al. Bop challenge 2024 on model-based and model-free 6d object pose estimation. arXiv preprint arXiv:2504.02812,
[39] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, et al. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.
[40] OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025. OpenAI Blog.
[41] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. Gpt-4 technical report, 2024.
[42] Navid Rajabi and Jana Kosecka. Towards grounded visual spatial reasoning in multi-modal vision language models,
[43]

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, and Tsung-Yu Lin. Learning to localize objects improves spatial reasoning in visual-llms. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12977–12987, 2024.

[44]

Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude training for multimodal language models, 2025.

[45]

Olinde Rodrigues. Des lois géométriques qui régissent les déplacements d'un système solide dans l'espace, et de la variation des coordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire. Journal de mathématiques pures et appliquées, 5:380–440, 1840.

[46]

Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, and Zeynep Akata. Clevr-x: A visual reasoning dataset for natural language explanations. In xxAI - Beyond explainable Artificial Intelligence, pages 85–104. Springer, 2022.

[47]

Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, et al. An empirical analysis on spatial reasoning capabilities of large multimodal models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

[48]

Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2023.

[49]

Chan Hee Song, Jihyung Kil, Tai-Yu Pan, Brian M. Sadler, Wei-Lun Chao, and Yu Su. One step at a time: Long-horizon vision-and-language navigation with milestones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15482–15491, 2022.

[50]

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025.

[51]

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, Florence, Italy, 2019. Association for Computational Linguistics.

[52]

Emilia Szymanska, Mihai Dusmanu, Jan-Willem Buurlage, Mahdi Rad, and Marc Pollefeys. Space3D-Bench: Spatial 3D Question Answering Benchmark. In European Conference on Computer Vision (ECCV) Workshops, 2024.

[53]

Gemini Robotics Team. Building the next generation of physical agents with gemini robotics-er 1.5. https://developers.googleblog.com/en/building-the-next-generation-of-physical-agents-with-gemini-robotics-er-15/, 2025. Google Developers Blog.

[54]

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024.

[55]

Stephen Tyree, Jonathan Tremblay, Thang To, Jia Cheng, Terry Mosier, Jeffrey Smith, and Stan Birchfield. 6-dof pose estimation of household objects for robotic manipulation: An accessible dataset and benchmark. In International Conference on Intelligent Robots and Systems (IROS), 2022.

[56]

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, and Katsushi Ikeuchi. Gpt-4v(ision) for robotics: Multimodal task planning from human demonstration. IEEE Robotics and Automation Letters, 9(11):10567–10574, 2024.

[57]

Tai Wang, Xiaohan Mao, Chenming Zhu, et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[58]

Youngsun Wi, Mark Van der Merwe, Pete Florence, Andy Zeng, and Nima Fazeli. Calamari: Contact-aware and language conditioned spatial action mapping for contact-rich manipulation. In 7th Annual Conference on Robot Learning, 2023.

[59]

Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. In Proceedings of Robotics: Science and Systems, 2017.

[60]

Houjian Yu, Mingen Li, Alireza Rezazadeh, Yang Yang, and Changhyun Choi. A parameter-efficient tuning framework for language-guided object grounding and robot grasping. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 14353–14360.

[61]

Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. M2t2: Multi-task masked transformer for object-centric pick and place. arXiv preprint arXiv:2311.00926, 2023.

[62]

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction in robotics. In 8th Annual Conference on Robot Learning, 2024.

[63]

Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, and Lifu Huang. SPARTUN3d: Situated spatial understanding of 3d world in large language model. In The Thirteenth International Conference on Learning Representations, 2025.

[64]

Enyu Zhao, Vedant Raval, Hejia Zhang, Jiageng Mao, Zeyu Shangguan, Stefanos Nikolaidis, Yue Wang, and Daniel Seita. Manipbench: Benchmarking vision-language models for low-level robot manipulation. arXiv preprint arXiv:2505.09698, 2025.

[65]

Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, and Shanghang Zhang. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics, 2025.

[66]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Proceedings of The 7th Conference on Robot Learning, pages 2165–2183. PMLR.

A. Data Generation Pipeline

The Benchmark for 6D Object Pose Estimation (BOP) family of datasets provides training data for 6D object pose estimation and comprises real and simulated images showcasing multiple objects and diverse setups. For example, HOPE, a BOP-based dataset, comprises 28 toy grocery objects captured in 50 scenes from 10 househo...
