arxiv: 2511.16857 · v3 · submitted 2025-11-20 · 💻 cs.CV · cs.RO

BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Pith reviewed 2026-05-17 19:54 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords: vision-language models, object interaction reasoning, spatial reasoning, dataset, 6D pose estimation, grasp poses, trajectory planning, affordances

The pith

BOP-ASK is a dataset for training vision-language models in precise object and grasp pose estimation and multi-step spatial planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BOP-ASK, a dataset of over 150,000 images and 33 million question-answer pairs built from 6D object poses in existing BOP collections. It supplies annotations for grasp poses, trajectories, relative spatial relations, depth, and object-to-object links across six tasks. Models trained on this data outperform baselines and acquire new skills in accurate pose estimation, path planning, and object-centric reasoning inside cluttered scenes. A sympathetic reader would care because standard vision-language benchmarks overlook the fine-grained physical and interaction detail required for practical use.

Core claim

By deriving fine-grained annotations for grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships directly from 6D object poses in the BOP datasets, BOP-ASK creates a training and evaluation resource that lets vision-language models develop precise object localization, physical compatibility understanding, affordance recognition, and multi-step planning abilities.

What carries the argument

The BOP-ASK dataset, which automatically generates question-answer pairs for object-interaction tasks from 6D pose data to support training and benchmarking of vision-language models.
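
As an illustration of how question-answer pairs could be generated from pose data, the sketch below derives depth and left/right questions from object translations in the camera frame. The object names, the question templates, the 20 mm ambiguity margin, and the axis convention (+x to the right, +z away from the camera) are all assumptions made for this example; it is not the paper's pipeline.

```python
from itertools import combinations

# Hypothetical scene: object name -> translation (x, y, z) in the camera frame,
# in millimetres. Assumed convention: +x points right, +z points away from the camera.
scene = {
    "mug": (-120.0, 30.0, 650.0),
    "drill": (80.0, 10.0, 900.0),
    "bowl": (10.0, 45.0, 720.0),
}


def depth_qa(name_a, pos_a, name_b, pos_b, margin_mm=20.0):
    """Return a (question, answer) pair, or None if the depths are too close to call."""
    dz = pos_a[2] - pos_b[2]
    if abs(dz) < margin_mm:
        return None  # ambiguous pairs are skipped in this sketch
    closer = name_a if dz < 0 else name_b
    question = f"Which object is closer to the camera, the {name_a} or the {name_b}?"
    return question, closer


def lateral_qa(name_a, pos_a, name_b, pos_b, margin_mm=20.0):
    """Return a yes/no left-of question, or None if the x offsets are too close."""
    dx = pos_a[0] - pos_b[0]
    if abs(dx) < margin_mm:
        return None
    question = f"Is the {name_a} to the left of the {name_b}?"
    return question, "yes" if dx < 0 else "no"


def generate_pairs(scene):
    """Apply every template to every unordered object pair, in name order."""
    pairs = []
    for (name_a, pos_a), (name_b, pos_b) in combinations(sorted(scene.items()), 2):
        for make in (depth_qa, lateral_qa):
            qa = make(name_a, pos_a, name_b, pos_b)
            if qa is not None:
                pairs.append(qa)
    return pairs


qa_pairs = generate_pairs(scene)
for question, answer in qa_pairs:
    print(question, "->", answer)
```

Because every template is applied to every object pair, the number of questions grows quadratically with the number of objects in a scene, which is one plausible way a modest image count can yield a very large question count.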

If this is right

  • Models trained on BOP-ASK outperform baselines on the six object-interaction tasks.
  • Trained models exhibit precise object and grasp pose estimation.
  • Models demonstrate trajectory planning and fine-grained object-centric spatial reasoning in cluttered environments.
  • Performance generalizes to out-of-distribution images in the released BOP-ASK-lab benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The annotation generation pipeline could be applied to other 6D pose datasets to produce additional training resources for interaction reasoning.
  • Improved spatial and planning capabilities in vision-language models may support more reliable integration into robotic manipulation systems.
  • Human evaluations on the BOP-ASK-core test set point toward reduced reliance on manual annotation for spatial tasks.

Load-bearing premise

The automatically generated annotations from 6D object poses accurately reflect real-world physical compatibility, affordances, and multi-step planning needs.

What would settle it

A model trained on BOP-ASK fails to achieve higher grasp success rates or more accurate trajectory plans than baseline models when evaluated on physical robot experiments using the same objects and scenes.

Figures

Figures reproduced from arXiv: 2511.16857 by Farshad Khorrami, Greg Heinrich, Jonathan Tremblay, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Sungsu Kim, Valts Blukis, Vineet Bhat.

Figure 1: The BOP-ASK dataset facilitates object-interaction reasoning for robot manipulation. This illustration demonstrates how a model trained on BOP-ASK enables human and robot-aligned spatial understanding for different actions, supporting physical relationship, locating where to grasp objects, precise pose estimation, and motion planning between objects. as scene description [14], or code generation for robo…
Figure 2: Overview of the BOP-ASK dataset. We automatically generate object-interaction and spatial reasoning annotations from 3D point clouds, images, object poses and 3D models with description. We create question/answer pairs covering 6 types of questions (from left to right, top to bottom): object pose estimation, grasp affordance, motion planning, physical interaction, object relationship, and depth relationshi…
Figure 3: Predictions from samples in BOP-ASK-core and BOP-ASK-lab (identified by (lab)), showing improvements gained from fine-tuning on BOP-ASK. Predictions from NVILA (shown in magenta) and NVILA SFT (shown in blue) are shown alongside the Ground Truth (in green). For the 'Rearrangement' task, the Ground Truth shape delineates the area of valid predictions. Absence of a colored prediction indicates none was made …
Figure 4: Our proposed data generation framework can trans…
Figure 5: Distribution of free form questions and task types in …
Figure 6: Real world robot experiments with a Franka arm and a ZED2 Stereo camera. VLMs fine-tuned on …
Original abstract

Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BOP-ASK, a large-scale dataset derived from BOP 6D object poses containing over 150k images and 33M QA pairs across six tasks (four novel) focused on fine-grained object interaction reasoning. Annotations for grasp poses, trajectories, relative depth, and object-to-object compatibility are generated automatically from the poses. The authors evaluate proprietary and open-source VLMs, report that models trained on BOP-ASK outperform baselines with emergent capabilities in precise pose estimation, trajectory planning, and object-centric spatial reasoning, and release BOP-ASK-core (with human evaluation) and BOP-ASK-lab (OOD benchmark).

Significance. If the auto-generated labels are reliable, the dataset fills an important gap in training VLMs for embodied and robotic applications requiring physical compatibility and multi-step planning. The scale, task diversity, and provision of both in-distribution and out-of-distribution test sets are strengths that could support reproducible progress in this area.

major comments (2)
  1. [§3 (Data Generation Pipeline)] The derivation of grasp poses, path planning trajectories, and object-to-object relationships from 6D poses is presented without collision checking, physics simulation, or human validation on the training distribution (only human evaluation is reported on the BOP-ASK-core test split). This is load-bearing for the central claim, as noisy or incorrect affordance labels could produce the reported performance gains and 'emergent' behaviors as artifacts rather than evidence of genuine interaction understanding.
  2. [§5 (Experiments)] The protocol for baseline comparisons and the demonstration of outperformance lack explicit controls for potential selection effects in question generation or post-hoc filtering. Without these details, it is difficult to confirm that the gains on the six tasks (including the four novel ones) reflect improved reasoning rather than dataset-specific biases.
minor comments (2)
  1. [Abstract] The abstract states that four tasks are novel but does not name them; this should be stated explicitly in the introduction or dataset section for clarity.
  2. [Figures and Tables] Figure captions and table headers should consistently define abbreviations such as 'BOP-ASK-core' on first use to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version of the paper.

Point-by-point responses
  1. Referee: [§3 (Data Generation Pipeline)] The derivation of grasp poses, path planning trajectories, and object-to-object relationships from 6D poses is presented without collision checking, physics simulation, or human validation on the training distribution (only human evaluation is reported on the BOP-ASK-core test split). This is load-bearing for the central claim, as noisy or incorrect affordance labels could produce the reported performance gains and 'emergent' behaviors as artifacts rather than evidence of genuine interaction understanding.

    Authors: The annotations are produced via deterministic geometric operations applied directly to the high-accuracy 6D poses supplied by the BOP benchmark. Grasp poses are derived from surface normals and principal axes, trajectories are constructed as sequences of relative SE(3) transforms, and object-to-object compatibility is computed from pairwise pose distances and orientation thresholds. Because the tasks target pose-based spatial reasoning rather than dynamic physical interaction, full physics simulation and collision checking were not required for label generation. Human validation was performed on the BOP-ASK-core test split, where label agreement exceeded 90%. We will revise §3 to document the precise geometric procedures, add quantitative consistency checks across the training distribution, and include an explicit limitations paragraph acknowledging the lack of physics-based validation. These additions will clarify that reported gains arise from improved spatial reasoning rather than label artifacts.

    Revision: partial

  2. Referee: [§5 (Experiments)] The protocol for baseline comparisons and the demonstration of outperformance lack explicit controls for potential selection effects in question generation or post-hoc filtering. Without these details, it is difficult to confirm that the gains on the six tasks (including the four novel ones) reflect improved reasoning rather than dataset-specific biases.

    Authors: All question-answer pairs were generated exhaustively from the pose annotations for every image and task combination; no post-hoc filtering or model-dependent selection was applied. The identical question set was used for every baseline and fine-tuned model. We will expand §5 with a complete description of the generation algorithm, per-task question counts, and additional ablation results that subsample questions uniformly and re-evaluate performance. These controls will demonstrate that the observed improvements on both in-distribution and out-of-distribution sets are attributable to enhanced object-interaction reasoning.

    Revision: yes
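
The subsampling ablation mentioned in the second response could look roughly like the sketch below. It reads "uniformly" as drawing the same number of questions from every task with a fixed seed, which is only one possible interpretation; the record layout (`task`, `correct`), the task names, and the sample sizes are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def subsample_uniform(records, per_task, seed=0):
    """Draw the same number of questions from every task, without replacement."""
    rng = random.Random(seed)  # fixed seed so every model sees the same subset
    by_task = defaultdict(list)
    for rec in records:
        by_task[rec["task"]].append(rec)
    sample = []
    for task in sorted(by_task):
        pool = by_task[task]
        sample.extend(rng.sample(pool, min(per_task, len(pool))))
    return sample


def accuracy_by_task(records):
    """Fraction of correctly answered questions for each task."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["task"]] += 1
        hits[rec["task"]] += int(rec["correct"])
    return {task: hits[task] / totals[task] for task in totals}


# Hypothetical evaluation log: 'correct' marks whether a model answered correctly.
records = (
    [{"task": "depth", "correct": i % 4 != 0} for i in range(1000)]
    + [{"task": "grasp", "correct": i % 2 == 0} for i in range(50)]
)

subset = subsample_uniform(records, per_task=50, seed=0)
scores = accuracy_by_task(subset)
print(scores)
```

Balancing the per-task counts keeps a task with many generated questions from dominating an aggregate score, which is the kind of dataset-specific bias the referee asks the authors to rule out.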

Circularity Check

0 steps flagged

No significant circularity; empirical dataset and evaluation paper with independent test benchmarks

full rationale

The paper's core contribution is the construction of the BOP-ASK dataset by automatically deriving QA pairs and annotations (grasp poses, trajectories, spatial relations) from existing external BOP 6D pose data, followed by empirical training and evaluation of VLMs on held-out splits including human-validated BOP-ASK-core and an OOD BOP-ASK-lab benchmark. No equations, fitted parameters, or derivation chains are present that would reduce reported performance or emergent capabilities to quantities defined by construction inside the paper. Claims rest on experimental comparisons against baselines rather than self-referential definitions, self-citation load-bearing uniqueness theorems, or ansatz smuggling. This is a standard self-contained empirical contribution against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that 6D poses from BOP datasets can be reliably transformed into grasp poses, depth relations, and planning trajectories without additional physical simulation or human validation at scale.

axioms (1)
  • Domain assumption: existing BOP 6D pose annotations are sufficiently accurate and complete to derive grasp poses, relative spatial relations, and multi-step trajectories.
    Invoked in the data generation pipeline description.
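
To make this assumption concrete, the sketch below shows one way a relation between two objects could be read off their 6D poses, following the simulated rebuttal's description of relative SE(3) transforms combined with distance and orientation thresholds. The helper names, the 300 mm and 30 degree thresholds, and the example poses are hypothetical; nothing here is confirmed as the authors' implementation.

```python
import math


def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def mat_vec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def relative_pose(R_a, t_a, R_b, t_b):
    """Pose of object b expressed in the frame of object a: inverse(T_a) composed with T_b."""
    R_a_inv = transpose(R_a)  # the inverse of a rotation matrix is its transpose
    R_rel = mat_mul(R_a_inv, R_b)
    t_rel = mat_vec(R_a_inv, [t_b[i] - t_a[i] for i in range(3)])
    return R_rel, t_rel


def rotation_angle_deg(R):
    """Rotation angle recovered from the trace, clamped against round-off."""
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def is_compatible(R_a, t_a, R_b, t_b, max_dist_mm=300.0, max_angle_deg=30.0):
    """Hypothetical rule: objects are close together and similarly oriented."""
    R_rel, t_rel = relative_pose(R_a, t_a, R_b, t_b)
    dist = math.sqrt(sum(c * c for c in t_rel))
    return dist <= max_dist_mm and rotation_angle_deg(R_rel) <= max_angle_deg


def rot_z(deg):
    """Rotation about the z axis by the given angle in degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Two example poses in the camera frame (translations in millimetres).
R_a, t_a = rot_z(0.0), (0.0, 0.0, 700.0)
R_b, t_b = rot_z(20.0), (100.0, 0.0, 700.0)

R_rel, t_rel = relative_pose(R_a, t_a, R_b, t_b)
angle = rotation_angle_deg(R_rel)
print(t_rel, angle, is_compatible(R_a, t_a, R_b, t_b))
```

A purely geometric rule like this never checks for collisions or physical stability, which is exactly the gap the referee's first major comment points at.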

pith-pipeline@v0.9.0 · 5586 in / 1259 out tokens · 22191 ms · 2026-05-17T19:54:06.928943+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 3 internal anchors

  [1] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  [2] Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  [3] Vineet Bhat, Naman Patel, Prashanth Krishnamurthy, Ramesh Karri, and Farshad Khorrami. Maplegrasp: Mask-guided feature pooling for language-driven efficient robotic grasping. arXiv preprint arXiv:2506.06535, 2025.

  [4] Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025.

  [5] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, 2024.

  [6] Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R Sanketi, and Ken Goldberg. Robo2vlm: Improving visual question answering using large-scale robot manipulation data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  [7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

  [8] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  [9] Matt Deitke, Christopher Clark, Sangho Lee, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024.

  [10] Abhay Deshpande, Yuquan Deng, Arijit Ray, Jordi Salvador, Winson Han, Jiafei Duan, Kuo-Hao Zeng, Yuke Zhu, Ranjay Krishna, and Rose Hendrix. Graspmolmo: Generalizable task-oriented grasping via large-scale synthetic data generation, 2025.

  [11] David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, 1973.

  [12] Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. EmbSpatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2024.

  [13] Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, and Yijie Guo. AHA: A vision-language-model for detecting and reasoning over failures in robotic manipulation. In The Thirteenth International Conference on Learning Representations, 2025.

  [14] Kuan Fang, Fangchen Liu, Pieter Abbeel, and Sergey Levine. Moka: Open-world robotic manipulation through mark-based visual prompting. Robotics: Science and Systems (RSS), 2024.

  [15] Xingyu Fu, Yushi Hu, Bangzheng Li, et al. Blink: Multimodal large language models can see but not perceive. In Proceedings of the European Conference on Computer Vision (ECCV), pages 148–166. Springer, 2024.

  [16] Andrew Guo, Bowen Wen, Jianhe Yuan, Jonathan Tremblay, Stephen Tyree, Jeffrey Smith, and Stan Birchfield. Handal: A dataset of real-world manipulable object categories with pose annotations, affordances, and reconstructions. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11428–11435, 2023.

  [17] Yi Han, Cheng Chi, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Lu Sheng, and Shanghang Zhang. Tiger: Tool-integrated geometric reasoning in vision-language models for robotics. arXiv preprint arXiv:2510.07181, 2025.

  [18] Stefan Hinterstoißer, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Kurt Konolige, Gary R. Bradski, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian Conference on Computer Vision, 2012.

  [19] Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, and Yang Gao. Copa: General robotic manipulation through spatial constraints of parts with foundation models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9488–9495, 2024.

  [20] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. In 8th Annual Conference on Robot Learning, 2024.

  [21] Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6700–6709, 2019.

  [22] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1724–1734, 2025.

  [23] Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

  [24] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  [25] Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

  [26] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, et al. Openvla: An open-source vision-language-action model. In Proceedings of the 8th Conference on Robot Learning (CoRL), pages 2679–2713. PMLR, 2025.

  [27] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.

  [28] Steven M. LaValle and James J. Kuffner. Rapidly-exploring random trees: A new tool for path planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 1999.

  [29] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025.

  [30] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500,

  [31] Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, and Siyuan Huang. Multi-modal situated reasoning in 3d scenes. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  [32] Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023.

  [33] Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. Nvila: Efficient frontier visual lan...

  [34] Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors. Advances in neural information processing systems, 37:68803–68832, 2024.

  [35] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. In International Conference on Learning Representations, 2023.

  [36] Yunze Man, Liang-Yan Gui, and Yu-Xiong Wang. Situational awareness matters in 3d vision language reasoning. In CVPR, 2024.

  [37] Soroush Nasiriany, Fei Xia, Wenhao Yu, et al. Pivot: iterative visual prompting elicits actionable knowledge for vlms. In Proceedings of the International Conference on Machine Learning (ICML). JMLR.org, 2024.

  [38] Van Nguyen Nguyen, Stephen Tyree, Andrew Guo, Mederic Fourmy, Anas Gouda, Taeyeop Lee, Sungphill Moon, Hyeontae Son, Lukas Ranftl, Jonathan Tremblay, et al. Bop challenge 2024 on model-based and model-free 6d object pose estimation. arXiv preprint arXiv:2504.02812,

  [39] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, et al. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.

  [40] OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025. OpenAI Blog.

  [41] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. Gpt-4 technical report, 2024.

  [42] Navid Rajabi and Jana Kosecka. Towards grounded visual spatial reasoning in multi-modal vision language models,

  [43] Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, and Tsung-Yu Lin. Learning to localize objects improves spatial reasoning in visual-llms. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12977–12987, 2024.

  [44] Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude training for multimodal language models, 2025.

  [45] Olinde Rodrigues. Des lois géométriques qui régissent les déplacements d’un système solide dans l’espace, et de la variation des coordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire. Journal de mathématiques pures et appliquées, 5:380–440, 1840.

  [46] Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, and Zeynep Akata. Clevr-x: A visual reasoning dataset for natural language explanations. In xxAI - Beyond explainable Artificial Intelligence, pages 85–104. Springer, 2022.

  [47] Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, et al. An empirical analysis on spatial reasoning capabilities of large multimodal models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics,

  [48] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2023.

  [49] Chan Hee Song, Jihyung Kil, Tai-Yu Pan, Brian M. Sadler, Wei-Lun Chao, and Yu Su. One step at a time: Long-horizon vision-and-language navigation with milestones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15482–15491, 2022.

  [50] Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025.

  51. [51]

    A corpus for reasoning about natural language grounded in photographs

    Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. InPro- ceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, Florence, Italy, 2019. Association for Computational Linguistics. 3

  52. [52]

    Space3D-Bench: Spatial 3D Question Answering Benchmark

    Emilia Szymanska, Mihai Dusmanu, Jan-Willem Buurlage, Mahdi Rad, and Marc Pollefeys. Space3D-Bench: Spatial 3D Question Answering Benchmark. In European Conference on Computer Vision (ECCV) Workshops, 2024. 3

  53. [53]

    Gemini Robotics Team. Building the next generation of physical agents with gemini robotics-er 1.5. https://developers.googleblog.com/en/building-the-next-generation-of-physical-agents-with-gemini-robotics-er-15/, 2025. Google Developers Blog. 2, 6

  54. [54]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024. 2, 3, 8

  55. [55]

    6-dof pose estimation of household objects for robotic manipulation: An accessible dataset and benchmark

    Stephen Tyree, Jonathan Tremblay, Thang To, Jia Cheng, Terry Mosier, Jeffrey Smith, and Stan Birchfield. 6-dof pose estimation of household objects for robotic manipulation: An accessible dataset and benchmark. In International Conference on Intelligent Robots and Systems (IROS), 2022. 5, 7, 8

  56. [56]

    Gpt-4v(ision) for robotics: Multimodal task planning from human demonstration

    Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, and Katsushi Ikeuchi. Gpt-4v(ision) for robotics: Multimodal task planning from human demonstration. IEEE Robotics and Automation Letters, 9(11):10567–10574, 2024. 3

  57. [57]

    Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai

    Tai Wang, Xiaohan Mao, Chenming Zhu, et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3

  58. [58]

    Calamari: Contact-aware and language conditioned spatial action mapping for contact-rich manipulation

    Youngsun Wi, Mark Van der Merwe, Pete Florence, Andy Zeng, and Nima Fazeli. Calamari: Contact-aware and language conditioned spatial action mapping for contact-rich manipulation. In 7th Annual Conference on Robot Learning, 2023. 3

  59. [59]

    Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes

    Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. In Proceedings of Robotics: Science and Systems, 2017. 5, 7, 8

  60. [60]

    A parameter-efficient tuning framework for language-guided object grounding and robot grasping

    Houjian Yu, Mingen Li, Alireza Rezazadeh, Yang Yang, and Changhyun Choi. A parameter-efficient tuning framework for language-guided object grounding and robot grasping. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 14353–14360,

  61. [61]

    M2t2: Multi-task masked transformer for object-centric pick and place

    Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. M2t2: Multi-task masked transformer for object-centric pick and place. arXiv preprint arXiv:2311.00926, 2023. 4

  62. [62]

    Robopoint: A vision-language model for spatial affordance prediction in robotics

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction in robotics. In 8th Annual Conference on Robot Learning, 2024. 1, 2

  63. [63]

    SPARTUN3d: Situated spatial understanding of 3d world in large language model

    Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, and Lifu Huang. SPARTUN3d: Situated spatial understanding of 3d world in large language model. In The Thirteenth International Conference on Learning Representations, 2025. 3

  64. [64]

    Manipbench: Benchmarking vision-language models for low-level robot manipulation

    Enyu Zhao, Vedant Raval, Hejia Zhang, Jiageng Mao, Zeyu Shangguan, Stefanos Nikolaidis, Yue Wang, and Daniel Seita. Manipbench: Benchmarking vision-language models for low-level robot manipulation. arXiv preprint arXiv:2505.09698, 2025. 3

  65. [65]

    Roborefer: Towards spatial referring with reasoning in vision-language models for robotics

    Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, and Shanghang Zhang. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics, 2025. 2, 3, 6

  66. [66]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Proceedings of The 7th Conference on Robot Learning, pages 2165–2183. PMLR,

  67. [67]

    Point to the flat surface where the objects are placed

A. Data Generation Pipeline

The Benchmark for 6D Object Pose Estimation (BOP) family of datasets provides training data for 6D object pose estimation and comprises real and simulated images showcasing multiple objects and diverse setups. For example, HOPE, a BOP-based dataset, comprises 28 toy grocery objects captured in 50 scenes from 10 househo...