BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
Pith reviewed 2026-05-17 19:54 UTC · model grok-4.3
The pith
BOP-ASK dataset trains vision-language models to perform precise object grasp estimation and multi-step spatial planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By deriving fine-grained annotations for grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships directly from 6D object poses in the BOP datasets, BOP-ASK creates a training and evaluation resource that lets vision-language models develop precise object localization, physical compatibility understanding, affordance recognition, and multi-step planning abilities.
What carries the argument
The BOP-ASK dataset, which automatically generates question-answer pairs for object-interaction tasks from 6D pose data to support training and benchmarking of vision-language models.
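As a rough illustration of what generating question-answer pairs from pose data can look like, the sketch below builds two spatial questions from object translations in the camera frame. It is not the authors' pipeline: the `ObjectPose` class, the question templates, the axis convention, and the numeric values are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ObjectPose:
    """Hypothetical container: an object name plus its translation in the camera frame."""
    name: str
    # Assumed convention: x points right, y down, z forward (away from the camera).
    t: Tuple[float, float, float]


def relative_depth_qa(a: ObjectPose, b: ObjectPose) -> Dict[str, str]:
    """Ask which of two objects is closer to the camera, answered from the z coordinate."""
    closer = a if a.t[2] < b.t[2] else b
    return {
        "question": f"Which object is closer to the camera, the {a.name} or the {b.name}?",
        "answer": closer.name,
    }


def left_right_qa(a: ObjectPose, b: ObjectPose) -> Dict[str, str]:
    """Ask whether object a is left or right of object b, answered from the x coordinate."""
    relation = "left" if a.t[0] < b.t[0] else "right"
    return {
        "question": f"Is the {a.name} to the left or right of the {b.name}?",
        "answer": relation,
    }


# Illustrative values only, not taken from any BOP scene.
mug = ObjectPose("mug", (-0.10, 0.02, 0.60))
drill = ObjectPose("drill", (0.15, 0.00, 0.85))
qa_depth = relative_depth_qa(mug, drill)
qa_side = left_right_qa(mug, drill)
```

Because the answers are read directly off the pose, label quality in a scheme like this is bounded by pose accuracy and by how well the template matches what a human would say about the image.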
If this is right
- Models trained on BOP-ASK outperform baselines on the six object-interaction tasks.
- Trained models exhibit precise object and grasp pose estimation.
- Models demonstrate trajectory planning and fine-grained object-centric spatial reasoning in cluttered environments.
- Performance generalizes to out-of-distribution images in the released BOP-ASK-lab benchmark.
Where Pith is reading between the lines
- The annotation generation pipeline could be applied to other 6D pose datasets to produce additional training resources for interaction reasoning.
- Improved spatial and planning capabilities in vision-language models may support more reliable integration into robotic manipulation systems.
- Human evaluations on the BOP-ASK-core test set point toward reduced reliance on manual annotation for spatial tasks.
Load-bearing premise
The automatically generated annotations from 6D object poses accurately reflect real-world physical compatibility, affordances, and multi-step planning needs.
What would settle it
A model trained on BOP-ASK fails to achieve higher grasp success rates or more accurate trajectory plans than baseline models when evaluated on physical robot experiments using the same objects and scenes.
Original abstract
Vision-Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning, intended for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BOP-ASK, a large-scale dataset derived from BOP 6D object poses containing over 150k images and 33M QA pairs across six tasks (four novel) focused on fine-grained object interaction reasoning. Annotations for grasp poses, trajectories, relative depth, and object-to-object compatibility are generated automatically from the poses. The authors evaluate proprietary and open-source VLMs, report that models trained on BOP-ASK outperform baselines with emergent capabilities in precise pose estimation, trajectory planning, and object-centric spatial reasoning, and release BOP-ASK-core (with human evaluation) and BOP-ASK-lab (OOD benchmark).
Significance. If the auto-generated labels are reliable, the dataset fills an important gap in training VLMs for embodied and robotic applications requiring physical compatibility and multi-step planning. The scale, task diversity, and provision of both in-distribution and out-of-distribution test sets are strengths that could support reproducible progress in this area.
major comments (2)
- [§3 (Data Generation Pipeline)] The derivation of grasp poses, path planning trajectories, and object-to-object relationships from 6D poses is presented without collision checking, physics simulation, or human validation on the training distribution (only human evaluation is reported on the BOP-ASK-core test split). This is load-bearing for the central claim, as noisy or incorrect affordance labels could produce the reported performance gains and 'emergent' behaviors as artifacts rather than evidence of genuine interaction understanding.
- [§5 (Experiments)] The protocol for baseline comparisons and the demonstration of outperformance lack explicit controls for potential selection effects in question generation or post-hoc filtering. Without these details, it is difficult to confirm that the gains on the six tasks (including the four novel ones) reflect improved reasoning rather than dataset-specific biases.
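To make the first comment concrete, here is a minimal sketch of the kind of collision check a label-validation pass could run over a generated trajectory. It assumes obstacles are approximated by bounding spheres and the trajectory is a piecewise-linear list of 3D waypoints; the function names and the sphere approximation are illustrative choices, not something the paper describes.

```python
import math


def segment_point_distance(p, q, c):
    """Shortest distance from point c to the 3D segment p-q."""
    d = [q[i] - p[i] for i in range(3)]
    len_sq = sum(x * x for x in d)
    if len_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.dist(p, c)
    # Project c onto the segment and clamp the parameter to [0, 1].
    s = sum((c[i] - p[i]) * d[i] for i in range(3)) / len_sq
    s = max(0.0, min(1.0, s))
    closest = [p[i] + s * d[i] for i in range(3)]
    return math.dist(closest, c)


def trajectory_collides(waypoints, obstacles, clearance=0.0):
    """Return True if any segment passes within radius + clearance of an obstacle centre."""
    for p, q in zip(waypoints, waypoints[1:]):
        for centre, radius in obstacles:
            if segment_point_distance(p, q, centre) < radius + clearance:
                return True
    return False


# Illustrative L-shaped path and two candidate obstacles (centre, radius).
path = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
blocking = [((0.5, 0.05, 0.0), 0.1)]
clear = [((0.5, 0.5, 0.0), 0.1)]
hits_blocking = trajectory_collides(path, blocking)
hits_clear = trajectory_collides(path, clear)
```

A check of this sort only catches geometric interpenetration of the path itself. It says nothing about gripper geometry, stability, or dynamics, which is why the comment also asks about physics simulation and human validation.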
minor comments (2)
- [Abstract] The abstract states that four tasks are novel but does not name them; this should be stated explicitly in the introduction or dataset section for clarity.
- [Figures and Tables] Figure captions and table headers should consistently define abbreviations such as 'BOP-ASK-core' on first use to aid readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version of the paper.
- Referee: [§3 (Data Generation Pipeline)] The derivation of grasp poses, path planning trajectories, and object-to-object relationships from 6D poses is presented without collision checking, physics simulation, or human validation on the training distribution (only human evaluation is reported on the BOP-ASK-core test split). This is load-bearing for the central claim, as noisy or incorrect affordance labels could produce the reported performance gains and 'emergent' behaviors as artifacts rather than evidence of genuine interaction understanding.
  Authors: The annotations are produced via deterministic geometric operations applied directly to the high-accuracy 6D poses supplied by the BOP benchmark. Grasp poses are derived from surface normals and principal axes, trajectories are constructed as sequences of relative SE(3) transforms, and object-to-object compatibility is computed from pairwise pose distances and orientation thresholds. Because the tasks target pose-based spatial reasoning rather than dynamic physical interaction, full physics simulation and collision checking were not required for label generation. Human validation was performed on the BOP-ASK-core test split, where label agreement exceeded 90%. We will revise §3 to document the precise geometric procedures, add quantitative consistency checks across the training distribution, and include an explicit limitations paragraph acknowledging the lack of physics-based validation. These additions will clarify that reported gains arise from improved spatial reasoning rather than label artifacts.
  Revision: partial
- Referee: [§5 (Experiments)] The protocol for baseline comparisons and the demonstration of outperformance lack explicit controls for potential selection effects in question generation or post-hoc filtering. Without these details, it is difficult to confirm that the gains on the six tasks (including the four novel ones) reflect improved reasoning rather than dataset-specific biases.
  Authors: All question-answer pairs were generated exhaustively from the pose annotations for every image and task combination; no post-hoc filtering or model-dependent selection was applied. The identical question set was used for every baseline and fine-tuned model. We will expand §5 with a complete description of the generation algorithm, per-task question counts, and additional ablation results that subsample questions uniformly and re-evaluate performance. These controls will demonstrate that the observed improvements on both in-distribution and out-of-distribution sets are attributable to enhanced object-interaction reasoning.
  Revision: yes
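The uniform-subsampling ablation mentioned in the second response could be set up along the following lines. This is a sketch under stated assumptions: each question-answer pair is taken to be a dict with a hypothetical `"task"` key, and "uniform" is read as drawing the same number of questions per task with a fixed seed so that every model is scored on the identical subset.

```python
import random
from collections import defaultdict


def subsample_uniform(qa_pairs, per_task, seed=0):
    """Draw up to `per_task` questions from each task, reproducibly."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    by_task = defaultdict(list)
    for qa in qa_pairs:
        by_task[qa["task"]].append(qa)

    sample = []
    # Iterate tasks in sorted order so the draw does not depend on input ordering of tasks.
    for task in sorted(by_task):
        items = by_task[task]
        k = min(per_task, len(items))  # small tasks contribute everything they have
        sample.extend(rng.sample(items, k))
    return sample


# Illustrative pool: ten grasp questions and three depth questions.
pairs = [{"task": "grasp", "id": i} for i in range(10)]
pairs += [{"task": "depth", "id": i} for i in range(3)]
subset = subsample_uniform(pairs, per_task=5, seed=0)
```

Re-running the evaluation over several seeds and reporting the spread would show whether the reported gains are sensitive to which questions happen to be drawn.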
Circularity Check
No significant circularity; empirical dataset and evaluation paper with independent test benchmarks
The paper's core contribution is the construction of the BOP-ASK dataset by automatically deriving QA pairs and annotations (grasp poses, trajectories, spatial relations) from existing external BOP 6D pose data, followed by empirical training and evaluation of VLMs on held-out splits including human-validated BOP-ASK-core and an OOD BOP-ASK-lab benchmark. No equations, fitted parameters, or derivation chains are present that would reduce reported performance or emergent capabilities to quantities defined by construction inside the paper. Claims rest on experimental comparisons against baselines rather than self-referential definitions, self-citation load-bearing uniqueness theorems, or ansatz smuggling. This is a standard self-contained empirical contribution against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing BOP 6D pose annotations are sufficiently accurate and complete to derive grasp poses, relative spatial relations, and multi-step trajectories.
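For readers unfamiliar with how relative spatial relations fall out of 6D poses, the sketch below computes the pose of one object in another's frame and applies a distance and orientation threshold, in the spirit of the pairwise checks described in the simulated rebuttal. Poses are assumed to be given as a 3x3 rotation matrix and a translation in a shared camera frame; the helper names and threshold values are hypothetical.

```python
import math


def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def relative_pose(R_a, t_a, R_b, t_b):
    """Pose of object b expressed in object a's frame, i.e. T_a^{-1} T_b."""
    R_a_inv = transpose(R_a)  # the inverse of a rotation matrix is its transpose
    R_rel = matmul(R_a_inv, R_b)
    t_rel = matvec(R_a_inv, [t_b[i] - t_a[i] for i in range(3)])
    return R_rel, t_rel


def rotation_angle(R):
    """Angle of rotation encoded by R, from trace(R) = 1 + 2 cos(angle)."""
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding error


def within_thresholds(R_a, t_a, R_b, t_b, max_dist, max_angle):
    """True if b lies within max_dist of a and is rotated by at most max_angle relative to a."""
    R_rel, t_rel = relative_pose(R_a, t_a, R_b, t_b)
    return math.hypot(*t_rel) <= max_dist and rotation_angle(R_rel) <= max_angle


# Two objects with the same orientation, one unit apart along the camera's y axis.
R_rel, t_rel = relative_pose(rot_z(math.pi / 2), [1.0, 0.0, 0.0],
                             rot_z(math.pi / 2), [1.0, 1.0, 0.0])
```

In this example the offset along the camera's y axis shows up along the x axis of object a, because a is itself rotated by 90 degrees. Any error in the input poses propagates directly into such relations, which is why the assumption above carries so much weight.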
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Motion trajectories are synthesized using a Rapidly-exploring Random Tree (RRT) planner operating in 3D Cartesian space... Object grasps are computed... using the transformer-based parallel gripper model M2T2"
- IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We estimate the camera-to-world transformation... fit a plane to the corresponding 3D points via RANSAC"
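The second quoted passage refers to fitting a plane to 3D points via RANSAC. A generic, standard-library version of that technique is sketched below to show what the step involves. It is not the paper's implementation: the threshold, iteration count, seed, and synthetic points are all assumptions.

```python
import math
import random


def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def plane_from_points(p1, p2, p3):
    """Return (unit normal n, offset d) with n.x + d = 0, or None if the points are collinear."""
    n = _cross(_sub(p2, p1), _sub(p3, p1))
    norm = math.sqrt(_dot(n, n))
    if norm < 1e-12:
        return None
    n = [x / norm for x in n]
    return n, -_dot(n, p1)


def ransac_plane(points, threshold=0.01, iterations=200, seed=0):
    """Keep the candidate plane supported by the most points within `threshold`."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue  # degenerate sample, draw again
        n, d = plane
        inliers = [p for p in points if abs(_dot(n, p) + d) < threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic scene: a 10 x 5 grid of table points at z = 0 plus five points well above it.
table = [(0.1 * x, 0.1 * y, 0.0) for x in range(10) for y in range(5)]
clutter = [(0.3, 0.2, 0.5 + 0.1 * i) for i in range(5)]
(normal, offset), inliers = ransac_plane(table + clutter)
```

A production pipeline would typically refine the winning plane with a least-squares fit over the inliers; that step is omitted here for brevity.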
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.