pith · machine review for the scientific record

arxiv: 2605.14704 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI · cs.RO

Recognition: no theorem link

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:00 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords invisible object localization · functional object reasoning · vision-language models · spatial reasoning benchmark · commonsense inference · occluded object detection · task-driven localization · SceneFunRI

The pith

Vision-language models cannot reliably locate invisible functional objects from task instructions and commonsense.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SceneFunRI, a benchmark of 855 instances drawn from real scenes that tests whether vision-language models can predict the 2D locations of functional objects hidden from the camera. It frames the problem as spatial reasoning that must combine visible context, task goals, and everyday knowledge rather than direct visual detection. Evaluations show the strongest model reaches only 15.20 percent accuracy at the strict CAcc@75 threshold along with low overlap and high distance errors. The work groups prompting techniques into instruction strengthening, explicit reasoning chains, and spatial elimination steps, yet finds all remain unstable. This establishes that current models lack dependable mechanisms for inferring occluded object positions in task-driven settings.

Core claim

SceneFunRI converts the task of localizing invisible functional objects into a 2D spatial reasoning problem with 855 semi-automatically generated instances. Given an image, a task instruction, and commonsense priors, a model must output a bounding box for an object that is not visible. Baseline results on Gemini 3 Flash and similar VLMs yield 15.20 CAcc@75, 0.74 mIoU, and 28.65 distance error, while prompting variants such as Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination produce only modest gains. The benchmark therefore demonstrates that invisible-region reasoning remains an unstable capability in existing vision-language models.
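
The reported numbers follow standard box-localization metrics, but the review does not spell out their exact definitions. The sketch below shows one conventional reading, assuming CAcc@75 is the percentage of predictions whose IoU with the ground-truth box exceeds 0.75, mIoU is the mean IoU, and Dist is the mean distance between box centers; the paper's actual normalization and units may differ.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center_distance(box_a, box_b):
    """Euclidean distance between box centers (pixel units assumed)."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return float(np.hypot(ax - bx, ay - by))

def evaluate(predictions, ground_truths, iou_threshold=0.75):
    """Aggregate the three benchmark-style scores over paired box lists."""
    ious = [iou(p, g) for p, g in zip(predictions, ground_truths)]
    dists = [center_distance(p, g) for p, g in zip(predictions, ground_truths)]
    return {
        "CAcc": 100.0 * float(np.mean([i >= iou_threshold for i in ious])),
        "mIoU": float(np.mean(ious)),
        "Dist": float(np.mean(dists)),
    }
```

Under this reading, a CAcc@75 of 15.20 would mean roughly 130 of the 855 instances receive a predicted box that overlaps the hidden target at IoU 0.75 or better.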

What carries the argument

The SceneFunRI benchmark, which generates task-driven queries for occluded functional objects that require commonsense spatial inference beyond visible pixels.
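
Concretely, each query can be pictured as a small record pairing an image, a natural-language task, and a ground-truth 2D box for a target that is not in view. The field names below are illustrative placeholders rather than the paper's released schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SceneFunRIInstance:
    """Hypothetical layout of one of the 855 benchmark instances."""
    image_path: str                        # RGB frame from the source scene
    task_instruction: str                  # e.g. "turn on the lamp behind the sofa"
    target_box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) of the hidden target
    target_visible: bool = False           # invisibility is the defining property

# The kind of query the benchmark poses: the model sees only the image and
# the instruction, yet must output a box for an object it cannot see.
example = SceneFunRIInstance(
    image_path="scene_0423/frame_012.jpg",
    task_instruction="adjust the thermostat mounted behind the open door",
    target_box=(312, 148, 380, 296),
)
```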

If this is right

  • Task intent must be tightly coupled with commonsense priors for effective spatial grounding in occluded scenes.
  • Uncertainty-aware search mechanisms become necessary when models cannot directly observe target objects.
  • Current prompting strategies improve results modestly but do not close the performance gap.
  • Models will need explicit integration of task-driven reasoning and spatial elimination to handle invisible functional objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic systems that follow natural-language instructions in everyday environments will remain limited until this capability improves.
  • Training data for VLMs may need more examples of functional objects in occluded configurations paired with task context.
  • The benchmark could serve as a testbed for hybrid architectures that combine vision-language models with explicit spatial simulators.

Load-bearing premise

The semi-automatic pipeline produces instances that genuinely require commonsense and spatial reasoning rather than being solvable from visible cues alone.
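
One way to probe this premise cheaply is a context-blind control: if a baseline that ignores the image and instruction (for instance, always predicting a fixed central box) lands near the reported VLM scores, the instances may not demand invisible reasoning at all. A minimal sketch, assuming each instance exposes an image size and a ground-truth box; the baseline and field names are illustrative, not part of the paper.

```python
def center_box_baseline(image_w, image_h, scale=0.2):
    """Context-blind control: always predict a fixed box around the image center."""
    w, h = image_w * scale, image_h * scale
    cx, cy = image_w / 2.0, image_h / 2.0
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def run_control(instances, metric_fn):
    """Score the blind baseline with whatever aggregator the VLMs are scored by.

    `instances` is an iterable of dicts with hypothetical keys "image_size"
    (a (width, height) pair) and "target_box"; `metric_fn` maps a list of
    predictions and a list of ground-truth boxes to summary scores.
    """
    preds, gts = [], []
    for inst in instances:
        w, h = inst["image_size"]
        preds.append(center_box_baseline(w, h))
        gts.append(inst["target_box"])
    return metric_fn(preds, gts)
```

A clear gap between this control and the model scores would support the premise; near-parity would suggest the low numbers partly reflect instance construction rather than missing reasoning.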

What would settle it

A model that reaches above 70 percent CAcc@75 on the full set of 855 instances using only the provided image and task instruction would indicate that invisible-region reasoning is no longer unstable.

Figures

Figures reproduced from arXiv: 2605.14704 by Gueter Josmy Faure, Hung-Ting Su, Posheng Chen, Powen Cheng, Winston H. Hsu.

Figure 1
Figure 1: Overview of SceneFunRI. SceneFunRI introduces a benchmark for the Reasoning the Invisible challenge, requiring models to localize task-relevant functional objects that are not directly visible. It evaluates whether models can infer hidden target locations from limited visual observations, task instructions, and commonsense reasoning. We further analyze Spatial Process of Elimination (SPoE), an iterative chain-… view at source ↗
Figure 2
Figure 2: SPoE analysis results. As the number of iterations increases, the Dist metric decreases progressively, indicating that eliminating low-probability visible regions effectively narrows the search space. The x-axis represents the iteration steps, and the y-axis denotes the Dist value. The dashed line corresponds to the baseline score reported in the paper. view at source ↗
Figure 3
Figure 3: Qualitative examples of SceneFunRI. The red bounding boxes indicate the model predictions, while the green bounding boxes represent the ground-truth locations of the target objects. …tially reduced or even eliminated. Despite this, the model remains unable to determine an appropriate stopping point on its own and continues to exclude regions. This suggests that VLMs still have limitations in reasoning abou… view at source ↗
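
Figure 2 describes SPoE as an iterative loop that rules out low-probability visible regions before re-predicting the hidden target, and the discussion around Figure 3 notes the model keeps eliminating regions because it cannot pick a stopping point on its own. A rough sketch of such a loop follows; the prompt wording, region representation, and fixed iteration budget are assumptions, not the paper's published procedure, and `vlm` stands in for any image-plus-text model call.

```python
def spatial_process_of_elimination(vlm, image, instruction, max_iters=5):
    """SPoE-style loop: alternately eliminate an unlikely region, then re-localize.

    `vlm(image, prompt)` is a hypothetical callable returning a text answer;
    parsing that answer into a box is omitted here.
    """
    eliminated = []   # textual descriptions of regions ruled out so far
    prediction = None
    for _ in range(max_iters):
        elim_prompt = (
            f"Task: {instruction}\n"
            f"Regions already eliminated: {eliminated or 'none'}\n"
            "Name one visible region where the hidden target is least likely to be."
        )
        eliminated.append(vlm(image, elim_prompt))

        locate_prompt = (
            f"Task: {instruction}\n"
            f"Ignore these regions: {eliminated}\n"
            "Output the bounding box (x1, y1, x2, y2) of the hidden target object."
        )
        prediction = vlm(image, locate_prompt)
        # A fixed iteration budget stands in for a stopping rule, since the
        # reviewed figures report the model cannot choose one itself.
    return prediction, eliminated
```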
original abstract

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves an CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SceneFunRI, a benchmark of 855 instances derived from SceneFun3D via a semi-automatic pipeline that converts 3D scenes into 2D task-driven localization problems requiring inference of invisible functional objects from instructions and commonsense. It evaluates VLMs on metrics including CAcc@75, mIoU, and Dist, reports that the strongest baseline (Gemini 3 Flash) reaches only 15.20/0.74/28.65, and analyzes three prompting categories (Strong Instruction, Reasoning-based, and SPoE) to argue that invisible-region reasoning remains unstable in current models.

Significance. If the benchmark instances are shown to genuinely require invisible commonsense reasoning, the work supplies a useful empirical stress test that quantifies a clear capability gap in VLMs and provides concrete prompting baselines, thereby motivating targeted improvements in task-intent integration and uncertainty-aware spatial search for robotics and scene-understanding applications.

major comments (3)
  1. [Benchmark Construction] Benchmark Construction section: the semi-automatic pipeline that produces the 855 instances supplies no quantitative validation (e.g., occlusion confirmation rates, human verification that targets are fully invisible, or checks that tasks cannot be solved from visible cues alone). Because the central claim—that low VLM scores demonstrate unstable invisible reasoning—rests on instance validity, this omission is load-bearing.
  2. [Experiments] Experiments and Results sections: no error analysis, confusion matrices, or discussion of labeling biases is reported. Consequently it remains unclear whether the reported Gemini 3 Flash scores (CAcc@75 = 15.20, mIoU = 0.74, Dist = 28.65) reflect model limits or pipeline artifacts such as partially visible targets or ambiguous instructions.
  3. [Prompting Analysis] Prompting Analysis section: the three prompting categories are presented without ablation studies, statistical significance tests, or controls for prompt length, making it difficult to determine whether observed differences are attributable to the reasoning strategy or to other factors.
minor comments (2)
  1. [Abstract] Abstract: the metric CAcc@75 is used without a one-sentence definition or pointer to its formal definition in the methods; adding this would aid readers.
  2. [Figures] Figure captions: several example visualizations would benefit from explicit arrows or masks indicating the invisible region and the expected reasoning path.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work introducing SceneFunRI. We address each major point below and will make revisions to improve the manuscript's clarity and rigor.

point-by-point responses
  1. Referee: Benchmark Construction section: the semi-automatic pipeline that produces the 855 instances supplies no quantitative validation (e.g., occlusion confirmation rates, human verification that targets are fully invisible, or checks that tasks cannot be solved from visible cues alone). Because the central claim—that low VLM scores demonstrate unstable invisible reasoning—rests on instance validity, this omission is load-bearing.

    Authors: We acknowledge this limitation in the current version. The pipeline leverages 3D annotations from SceneFun3D to identify invisible regions, but we did not report quantitative human validation metrics. In the revised manuscript, we will include results from a human study verifying a sample of instances for full invisibility and that tasks require invisible reasoning, along with occlusion rates. This will be added to Section 3. revision: yes

  2. Referee: Experiments and Results sections: no error analysis, confusion matrices, or discussion of labeling biases is reported. Consequently it remains unclear whether the reported Gemini 3 Flash scores (CAcc@75 = 15.20, mIoU = 0.74, Dist = 28.65) reflect model limits or pipeline artifacts such as partially visible targets or ambiguous instructions.

    Authors: We agree that an error analysis is valuable. We will add a dedicated subsection analyzing common error types, including cases potentially due to partial visibility or instruction ambiguity. We will also discuss potential labeling biases from the semi-automatic process. This will help substantiate that the low performance is due to challenges in invisible commonsense reasoning. revision: yes

  3. Referee: Prompting Analysis section: the three prompting categories are presented without ablation studies, statistical significance tests, or controls for prompt length, making it difficult to determine whether observed differences are attributable to the reasoning strategy or to other factors.

    Authors: The three categories were chosen to represent distinct approaches to prompting for this task. While we did not perform full ablations or statistical tests in the initial submission due to the focus on establishing the benchmark, we will add controls for prompt length and report basic comparisons in the revision. We believe the differences highlight the instability, but agree additional rigor is needed. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or fitted predictions

full rationale

The paper introduces SceneFunRI as a benchmark derived from the external SceneFun3D dataset through a described semi-automatic pipeline, then reports VLM evaluation metrics on the resulting 855 instances. No equations, parameter fittings, predictions, or uniqueness theorems appear anywhere in the manuscript. The work contains no self-definitional steps, fitted-input predictions, or load-bearing self-citations that reduce any claim to its own inputs by construction. All reported results (CAcc@75, mIoU, Dist) are direct empirical measurements on held-out model outputs, making the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the semi-automatic pipeline produces valid test cases requiring genuine commonsense reasoning and that low model scores reflect VLM limitations rather than pipeline artifacts.

axioms (1)
  • domain assumption: The semi-automatic pipeline correctly identifies invisible functional object locations from task instructions and commonsense knowledge.
    The benchmark construction and performance claims depend on this pipeline producing reliable ground truth.

pith-pipeline@v0.9.0 · 5522 in / 1260 out tokens · 47429 ms · 2026-05-15T05:00:12.913741+00:00 · methodology

discussion (0)

