pith · machine review for the scientific record

arxiv: 2605.14704 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI · cs.RO

Recognition: no theorem link

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:00 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords invisible object localization · functional object reasoning · vision-language models · spatial reasoning benchmark · commonsense inference · occluded object detection · task-driven localization · SceneFunRI

The pith

Vision-language models cannot reliably locate invisible functional objects from task instructions and commonsense.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SceneFunRI, a benchmark of 855 instances drawn from real scenes that tests whether vision-language models can predict the 2D locations of functional objects hidden from the camera. It frames the problem as spatial reasoning that must combine visible context, task goals, and everyday knowledge rather than direct visual detection. Evaluations show the strongest model reaches only 15.20 percent accuracy at the strict CAcc@75 threshold along with low overlap and high distance errors. The work groups prompting techniques into instruction strengthening, explicit reasoning chains, and spatial elimination steps, yet finds all remain unstable. This establishes that current models lack dependable mechanisms for inferring occluded object positions in task-driven settings.

Core claim

SceneFunRI converts the task of localizing invisible functional objects into a 2D spatial reasoning problem with 855 semi-automatically generated instances. Given an image, a task instruction, and commonsense priors, a model must output a bounding box for an object that is not visible. Baseline results on Gemini 3 Flash and similar VLMs yield 15.20 CAcc@75, 0.74 mIoU, and 28.65 distance error, while prompting variants such as Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination produce only modest gains. The benchmark therefore demonstrates that invisible-region reasoning remains an unstable capability in existing vision-language models.
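
The reported numbers follow standard box-localization metrics, but the review does not spell out their exact definitions. The sketch below shows one conventional reading, assuming CAcc@75 is the percentage of predictions whose IoU with the ground-truth box exceeds 0.75, mIoU is the mean IoU, and Dist is the mean distance between box centers; the paper's actual normalization and units may differ.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center_distance(box_a, box_b):
    """Euclidean distance between box centers (pixel units assumed)."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return float(np.hypot(ax - bx, ay - by))

def evaluate(predictions, ground_truths, iou_threshold=0.75):
    """Aggregate the three benchmark-style scores over paired box lists."""
    ious = [iou(p, g) for p, g in zip(predictions, ground_truths)]
    dists = [center_distance(p, g) for p, g in zip(predictions, ground_truths)]
    return {
        "CAcc": 100.0 * float(np.mean([i >= iou_threshold for i in ious])),
        "mIoU": float(np.mean(ious)),
        "Dist": float(np.mean(dists)),
    }
```

Under this reading, a CAcc@75 of 15.20 would mean roughly 130 of the 855 instances receive a predicted box that overlaps the hidden target at IoU 0.75 or better.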

What carries the argument

The SceneFunRI benchmark, which generates task-driven queries for occluded functional objects that require commonsense spatial inference beyond visible pixels.
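
Concretely, each query can be pictured as a small record pairing an image, a natural-language task, and a ground-truth 2D box for a target that is not in view. The field names below are illustrative placeholders rather than the paper's released schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SceneFunRIInstance:
    """Hypothetical layout of one of the 855 benchmark instances."""
    image_path: str                        # RGB frame from the source scene
    task_instruction: str                  # e.g. "turn on the lamp behind the sofa"
    target_box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) of the hidden target
    target_visible: bool = False           # invisibility is the defining property

# The kind of query the benchmark poses: the model sees only the image and
# the instruction, yet must output a box for an object it cannot see.
example = SceneFunRIInstance(
    image_path="scene_0423/frame_012.jpg",
    task_instruction="adjust the thermostat mounted behind the open door",
    target_box=(312, 148, 380, 296),
)
```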

If this is right

  • Task intent must be tightly coupled with commonsense priors for effective spatial grounding in occluded scenes.
  • Uncertainty-aware search mechanisms become necessary when models cannot directly observe target objects.
  • Current prompting strategies improve results modestly but do not close the performance gap.
  • Models will need explicit integration of task-driven reasoning and spatial elimination to handle invisible functional objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic systems that follow natural-language instructions in everyday environments will remain limited until this capability improves.
  • Training data for VLMs may need more examples of functional objects in occluded configurations paired with task context.
  • The benchmark could serve as a testbed for hybrid architectures that combine vision-language models with explicit spatial simulators.

Load-bearing premise

The semi-automatic pipeline produces instances that genuinely require commonsense and spatial reasoning rather than being solvable from visible cues alone.
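
One way to probe this premise cheaply is a context-blind control: if a baseline that ignores the image and instruction (for instance, always predicting a fixed central box) lands near the reported VLM scores, the instances may not demand invisible reasoning at all. A minimal sketch, assuming each instance exposes an image size and a ground-truth box; the baseline and field names are illustrative, not part of the paper.

```python
def center_box_baseline(image_w, image_h, scale=0.2):
    """Context-blind control: always predict a fixed box around the image center."""
    w, h = image_w * scale, image_h * scale
    cx, cy = image_w / 2.0, image_h / 2.0
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def run_control(instances, metric_fn):
    """Score the blind baseline with whatever aggregator the VLMs are scored by.

    `instances` is an iterable of dicts with hypothetical keys "image_size"
    (a (width, height) pair) and "target_box"; `metric_fn` maps a list of
    predictions and a list of ground-truth boxes to summary scores.
    """
    preds, gts = [], []
    for inst in instances:
        w, h = inst["image_size"]
        preds.append(center_box_baseline(w, h))
        gts.append(inst["target_box"])
    return metric_fn(preds, gts)
```

A clear gap between this control and the model scores would support the premise; near-parity would suggest the low numbers partly reflect instance construction rather than missing reasoning.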

What would settle it

A model that reaches above 70 percent CAcc@75 on the full set of 855 instances using only the provided image and task instruction would indicate that invisible-region reasoning is no longer unstable.

Figures

Figures reproduced from arXiv: 2605.14704 by Gueter Josmy Faure, Hung-Ting Su, Posheng Chen, Powen Cheng, Winston H. Hsu.

Figure 1
Figure 1: Overview of SceneFunRI. SceneFunRI introduces a benchmark for the Reasoning the Invisible challenge, requiring models to localize task-relevant functional objects that are not directly visible. It evaluates whether models can infer hidden target locations from limited visual observations, task instructions, and commonsense reasoning. We further analyze Spatial Process of Elimination (SPoE), an iterative chain-… view at source ↗
Figure 2
Figure 2: SPoE analysis results. As the number of iterations increases, the Dist metric decreases progressively, indicating that eliminating low-probability visible regions effectively narrows the search space. The x-axis represents the iteration steps, and the y-axis denotes the Dist value. The dashed line corresponds to the baseline score reported in the paper. view at source ↗
Figure 3
Figure 3: Qualitative examples of SceneFunRI. The red bounding boxes indicate the model predictions, while the green bounding boxes represent the ground-truth locations of the target objects. …tially reduced or even eliminated. Despite this, the model remains unable to determine an appropriate stopping point on its own and continues to exclude regions. This suggests that VLMs still have limitations in reasoning abou… view at source ↗
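
Figure 2 describes SPoE as an iterative loop that rules out low-probability visible regions before re-predicting the hidden target, and the discussion around Figure 3 notes the model keeps eliminating regions because it cannot pick a stopping point on its own. A rough sketch of such a loop follows; the prompt wording, region representation, and fixed iteration budget are assumptions, not the paper's published procedure, and `vlm` stands in for any image-plus-text model call.

```python
def spatial_process_of_elimination(vlm, image, instruction, max_iters=5):
    """SPoE-style loop: alternately eliminate an unlikely region, then re-localize.

    `vlm(image, prompt)` is a hypothetical callable returning a text answer;
    parsing that answer into a box is omitted here.
    """
    eliminated = []   # textual descriptions of regions ruled out so far
    prediction = None
    for _ in range(max_iters):
        elim_prompt = (
            f"Task: {instruction}\n"
            f"Regions already eliminated: {eliminated or 'none'}\n"
            "Name one visible region where the hidden target is least likely to be."
        )
        eliminated.append(vlm(image, elim_prompt))

        locate_prompt = (
            f"Task: {instruction}\n"
            f"Ignore these regions: {eliminated}\n"
            "Output the bounding box (x1, y1, x2, y2) of the hidden target object."
        )
        prediction = vlm(image, locate_prompt)
        # A fixed iteration budget stands in for a stopping rule, since the
        # reviewed figures report the model cannot choose one itself.
    return prediction, eliminated
```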
original abstract

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves an CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SceneFunRI, a benchmark of 855 instances derived from SceneFun3D via a semi-automatic pipeline that converts 3D scenes into 2D task-driven localization problems requiring inference of invisible functional objects from instructions and commonsense. It evaluates VLMs on metrics including CAcc@75, mIoU, and Dist, reports that the strongest baseline (Gemini 3 Flash) reaches only 15.20/0.74/28.65, and analyzes three prompting categories (Strong Instruction, Reasoning-based, and SPoE) to argue that invisible-region reasoning remains unstable in current models.

Significance. If the benchmark instances are shown to genuinely require invisible commonsense reasoning, the work supplies a useful empirical stress test that quantifies a clear capability gap in VLMs and provides concrete prompting baselines, thereby motivating targeted improvements in task-intent integration and uncertainty-aware spatial search for robotics and scene-understanding applications.

major comments (3)
  1. [Benchmark Construction] Benchmark Construction section: the semi-automatic pipeline that produces the 855 instances supplies no quantitative validation (e.g., occlusion confirmation rates, human verification that targets are fully invisible, or checks that tasks cannot be solved from visible cues alone). Because the central claim—that low VLM scores demonstrate unstable invisible reasoning—rests on instance validity, this omission is load-bearing.
  2. [Experiments] Experiments and Results sections: no error analysis, confusion matrices, or discussion of labeling biases is reported. Consequently it remains unclear whether the reported Gemini 3 Flash scores (CAcc@75 = 15.20, mIoU = 0.74, Dist = 28.65) reflect model limits or pipeline artifacts such as partially visible targets or ambiguous instructions.
  3. [Prompting Analysis] Prompting Analysis section: the three prompting categories are presented without ablation studies, statistical significance tests, or controls for prompt length, making it difficult to determine whether observed differences are attributable to the reasoning strategy or to other factors.
minor comments (2)
  1. [Abstract] Abstract: the metric CAcc@75 is used without a one-sentence definition or pointer to its formal definition in the methods; adding this would aid readers.
  2. [Figures] Figure captions: several example visualizations would benefit from explicit arrows or masks indicating the invisible region and the expected reasoning path.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work introducing SceneFunRI. We address each major point below and will make revisions to improve the manuscript's clarity and rigor.

point-by-point responses
  1. Referee: Benchmark Construction section: the semi-automatic pipeline that produces the 855 instances supplies no quantitative validation (e.g., occlusion confirmation rates, human verification that targets are fully invisible, or checks that tasks cannot be solved from visible cues alone). Because the central claim—that low VLM scores demonstrate unstable invisible reasoning—rests on instance validity, this omission is load-bearing.

    Authors: We acknowledge this limitation in the current version. The pipeline leverages 3D annotations from SceneFun3D to identify invisible regions, but we did not report quantitative human validation metrics. In the revised manuscript, we will include results from a human study verifying a sample of instances for full invisibility and that tasks require invisible reasoning, along with occlusion rates. This will be added to Section 3. revision: yes

  2. Referee: Experiments and Results sections: no error analysis, confusion matrices, or discussion of labeling biases is reported. Consequently it remains unclear whether the reported Gemini 3 Flash scores (CAcc@75 = 15.20, mIoU = 0.74, Dist = 28.65) reflect model limits or pipeline artifacts such as partially visible targets or ambiguous instructions.

    Authors: We agree that an error analysis is valuable. We will add a dedicated subsection analyzing common error types, including cases potentially due to partial visibility or instruction ambiguity. We will also discuss potential labeling biases from the semi-automatic process. This will help substantiate that the low performance is due to challenges in invisible commonsense reasoning. revision: yes

  3. Referee: Prompting Analysis section: the three prompting categories are presented without ablation studies, statistical significance tests, or controls for prompt length, making it difficult to determine whether observed differences are attributable to the reasoning strategy or to other factors.

    Authors: The three categories were chosen to represent distinct approaches to prompting for this task. While we did not perform full ablations or statistical tests in the initial submission due to the focus on establishing the benchmark, we will add controls for prompt length and report basic comparisons in the revision. We believe the differences highlight the instability, but agree additional rigor is needed. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or fitted predictions

full rationale

The paper introduces SceneFunRI as a benchmark derived from the external SceneFun3D dataset through a described semi-automatic pipeline, then reports VLM evaluation metrics on the resulting 855 instances. No equations, parameter fittings, predictions, or uniqueness theorems appear anywhere in the manuscript. The work contains no self-definitional steps, fitted-input predictions, or load-bearing self-citations that reduce any claim to its own inputs by construction. All reported results (CAcc@75, mIoU, Dist) are direct empirical measurements on held-out model outputs, making the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the semi-automatic pipeline produces valid test cases requiring genuine commonsense reasoning and that low model scores reflect VLM limitations rather than pipeline artifacts.

axioms (1)
  • domain assumption: The semi-automatic pipeline correctly identifies invisible functional object locations from task instructions and commonsense knowledge.
    The benchmark construction and performance claims depend on this pipeline producing reliable ground truth.

pith-pipeline@v0.9.0 · 5522 in / 1260 out tokens · 47429 ms · 2026-05-15T05:00:12.913741+00:00 · methodology

discussion (0)

