pith. sign in

arxiv: 2605.20448 · v1 · pith:CDCR6CHLnew · submitted 2026-05-19 · 💻 cs.CV · cs.LG

Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?

Pith reviewed 2026-05-21 06:51 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision-language models3D scene understandingspatial reasoningocclusionreflectionstoken compressionbenchmarks
0
0 comments X

The pith

Vision-language models rearrange visible objects accurately yet fail on occlusion and reflections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests if vision-language models build internal representations of 3D scene layouts or simply detect and name objects. It introduces a benchmark with tasks for depth-ordered occlusion using multiple counterfactual setups, inference of geometry from visible reflections, and planning object rearrangements in clear views. Six models score well on rearrangement planning with low collision violations but drop sharply on the occlusion and reflection probes. White-box inspection of one model shows spatial details survive the vision encoder yet become inaccessible after visual token compression.

Core claim

Models achieve 53-97 percent accuracy on volumetric rearrangement planning over visible layouts and rarely violate collision constraints, yet accuracy falls to 6-45 percent on occlusion probes and below 7 percent on reflection geometry, with the performance gap localized to the visual-token merger stage where recoverable spatial information is lost.

What carries the argument

A 3,034-sample human-curated benchmark using three independent counterfactual operationalisations for occlusion, optical-geometry probes for reflections, and rearrangement planning tasks, scored by trained annotators with white-box activation patching to trace failure points.

If this is right

  • Success on rearrangement planning does not indicate broad 3D scene understanding.
  • Spatial information is lost specifically during visual token compression rather than in the vision encoder itself.
  • Embodied reasoning models exhibit the same performance profile, pointing to a shared architectural limitation.
  • Patching post-merger activations can recover spatial capability on these probes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar gaps may appear in robotics tasks that require reasoning about partially observed environments.
  • Benchmarks limited to fully visible scenes likely overestimate current models' spatial competence.
  • Architectures that preserve 3D coordinates through the vision-language interface could close the observed gap.

Load-bearing premise

The benchmark tasks for occlusion, reflections, and rearrangement together measure genuine 3D spatial understanding rather than surface patterns or task-specific artifacts.

What would settle it

A controlled experiment that restores high accuracy on the occlusion and reflection tasks by patching clean post-merger visual activations into the language decoder without altering other components would falsify the localization of the failure.

Figures

Figures reproduced from arXiv: 2605.20448 by Animesh Maheshwari, Divyansh Sahu, Nishit Verma.

Figure 1
Figure 1. Figure 1: DGAR aggregate findings over 163 failure cases. Left: mean attention fraction allocated to each region per task family; dashed line marks uniform chance (1/3). Across all three tasks, the depth-correct region receives well below 5% of visual attention, while the irrelevant region absorbs roughly 80%. Right: failure-mode distribution. Attention Dispersed accounts for 100% of failures across all three tasks;… view at source ↗
Figure 2
Figure 2. Figure 2: DGAR is uniformly low across the LM decoder. Left: mean DGAR across the LM decoder layers; spatial attention to the depth-correct region remains consistently low across the entire decoder, drifting downward in deeper layers rather than rising. No layer shows a significant spike toward depth-correct attention. Right: layer×head DGAR heatmap; no individual head or layer emerges as a strong spatial specialist… view at source ↗
Figure 3
Figure 3. Figure 3: Causal tracing reveals where spatial information is processed and lost (163 failure cases). (a) Recovery curves for Corruption A (target-object patches): restoration is ineffective at the merger stage but fully recovers at L0. (b) Recovery curves for Corruption B (depth-correct patches): identical V-shape with sharp drop at the merger. (c) Groundedness classification per task and corruption type. Vertical … view at source ↗
Figure 4
Figure 4. Figure 4: Reflection task setup. Scene (left); prompt and model response (right). The model lists every object on the metallic countertop as having a visible reflection, including the blue water bottle and pen whose reflections are not resolved in the image. The reasoning chain reveals the failure: the model invokes the rule “the table is reflective, so any object placed on it must have a reflection,” applying a lea… view at source ↗
Figure 5
Figure 5. Figure 5: Vision encoder activation (V21–V27). Self-attention concentrates on the foreground objects (bottle, cloth, watch) rather than on the countertop region containing the actual reflections. Entropy values are reported above each block. cable, a pen, a watch, and a checkered cloth resting on a metallic countertop. A correct answer would list only objects whose reflections are actually present on the metal surfa… view at source ↗
Figure 6
Figure 6. Figure 6: Bridge-layer cluster map (L20). Left: factual attention heatmap. Centre: 20 attention clusters under the factual prompt. Right: 24 clusters under the counterfactual prompt. Both partitions are diffuse and lack the coherent object–reflection pair structure that a reflection-aware model would exhibit. I Scaling the Benchmark: A Semi-Automated Construction Pipeline The current release is built end-to-end by t… view at source ↗
read the original abstract

Vision--language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces a human-curated 3,034-sample benchmark to test whether vision-language models understand 3D scene layout beyond object recognition. It evaluates six frontier and open-weight VLMs on three components: depth-ordered occlusion (via three independent counterfactual operationalisations), optical-geometry inference from visible reflections, and volumetric rearrangement planning over visible layouts. Using trained human annotators to score 18,204 responses (no LLM judges), the work reports a sharp dissociation: models achieve 53-97% accuracy on visible rearrangement planning with low collision violations, but drop to 6-45% on occlusion probes and below 7% on reflections. An embodied-reasoning model shows the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger step, where spatial information recoverable in the vision encoder becomes inaccessible after compression.

Significance. If the reported dissociation holds, the work is significant for demonstrating that current VLMs lack robust 3D spatial understanding, succeeding on visible planning but failing on occlusion and geometric inference. Strengths include the human-curated dataset, three independent occlusion operationalisations, 18k human-scored responses, absence of fitted parameters or self-referential derivations, and white-box patching that identifies a concrete architectural bottleneck at the vision-token merger. These elements provide falsifiable, reproducible evidence against simple surface-pattern explanations.

major comments (2)
  1. [§4.3] §4.3 (occlusion probes): the three counterfactual operationalisations are presented as independent, but the manuscript does not report inter-probe correlation or a control showing that each isolates a distinct failure mode rather than a shared surface cue; this is load-bearing for the claim that the dissociation reflects genuine 3D understanding deficits.
  2. [Table 2] Table 2 (rearrangement results): while collision violations are reported as low, the paper does not include a baseline comparison against a model that has access only to 2D bounding boxes; without this, it remains possible that the high visible-layout accuracy reflects 2D heuristics rather than 3D layout representation.
minor comments (3)
  1. [Figure 4] Figure 4: the activation patching diagram would benefit from explicit labels indicating which layers correspond to the vision encoder, merger, and language decoder to improve readability of the localisation claim.
  2. [§3.2] §3.2: the description of the 3,034-sample curation process could specify the exact criteria used by the three independent annotators to resolve disagreements on occlusion ordering.
  3. [Related Work] Related work section: add a brief discussion of recent benchmarks on VLM spatial reasoning (e.g., those using synthetic 3D renders) to better situate the novelty of the reflection and multi-operationalisation probes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, making revisions where they strengthen the manuscript without altering our core claims or results.

read point-by-point responses
  1. Referee: [§4.3] §4.3 (occlusion probes): the three counterfactual operationalisations are presented as independent, but the manuscript does not report inter-probe correlation or a control showing that each isolates a distinct failure mode rather than a shared surface cue; this is load-bearing for the claim that the dissociation reflects genuine 3D understanding deficits.

    Authors: We agree that explicit quantitative support for the independence of the three operationalisations would reinforce the interpretation. The probes were constructed with distinct mechanisms—one using multi-view depth ordering, one using counterfactual object removal to test visibility, and one using direct visibility prediction—but we acknowledge the value of reporting correlations. In the revised manuscript we add a new analysis in §4.3 computing pairwise Pearson correlations across the three probes for all evaluated models; the resulting coefficients are low (0.12–0.31), consistent with distinct failure modes rather than a single surface cue. We also include a shuffled-label control within each probe that reduces accuracy to chance levels, further indicating that performance is not driven by shared low-level heuristics. These additions directly address the concern while preserving the original experimental design. revision: yes

  2. Referee: [Table 2] Table 2 (rearrangement results): while collision violations are reported as low, the paper does not include a baseline comparison against a model that has access only to 2D bounding boxes; without this, it remains possible that the high visible-layout accuracy reflects 2D heuristics rather than 3D layout representation.

    Authors: We appreciate the suggestion of an additional control. However, we maintain that a 2D-bounding-box-only baseline is not required to support our conclusions and would not be straightforward to implement within the scope of the present study. The volumetric rearrangement task explicitly requires models to reason about depth ordering, free space, and potential 3D collisions even when all objects are visible; 2D bounding boxes alone cannot encode these relations (for example, whether one object can be moved behind another without intersection in depth). More importantly, the same models that succeed on visible rearrangement fail sharply on the occlusion and reflection tasks, which cannot be solved by 2D heuristics. This dissociation itself argues against a purely 2D strategy. We have expanded the discussion section to articulate this reasoning and to note that constructing a non-VLM 2D baseline lies outside the paper’s focus on evaluating existing vision-language models. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper evaluates VLMs on a human-curated benchmark using direct model testing, three independent counterfactual operationalisations for occlusion, reflection geometry probes, and trained human annotators scoring 18k responses with no LLM-as-judge. White-box analysis localises failures to the vision-token merger step via patching experiments. No equations, fitted parameters, self-referential derivations, or load-bearing self-citations reduce the central dissociation claim to its inputs by construction. The design relies on external annotations and multiple independent measures, rendering the result self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the curated benchmark tasks validly isolate 3D spatial understanding and that the observed performance gap is caused by the token-merger stage rather than dataset artifacts or annotator bias.

axioms (1)
  • domain assumption Human annotators provide reliable ground-truth scoring for spatial-reasoning responses across the three task types.
    All 18,204 responses were scored by trained annotators; the dissociation claim depends on this scoring being accurate and unbiased.

pith-pipeline@v0.9.0 · 5713 in / 1387 out tokens · 47036 ms · 2026-05-21T06:51:36.925512+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 8 internal anchors

  1. [1]

    Qwen3.5-VL: Scaling vision-language models with mixture-of-experts

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen3.5-VL: Scaling vision-language models with mixture-of-experts. arXiv preprint, 2025

  2. [2]

    Qwen3-VL-30B-A3B-Thinking: A thinking-mode vision-language model

    Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen3-VL-30B-A3B-Thinking: A thinking-mode vision-language model. https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking , 2025

  3. [3]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024

  4. [4]

    RT-2: Vision-language- action models transfer web knowledge to robotic control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language- action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), 2023

  5. [5]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  6. [6]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021

  7. [7]

    SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024

  8. [8]

    CounterVQA: Evaluating and improving counterfactual reasoning in vision-language models for video understanding.arXiv preprint arXiv:2511.19923, 2025

    Yuefei Chen, Jiang Liu, Xiaodong Lin, and Ruixiang Tang. CounterVQA: Evaluating and improving counterfactual reasoning in vision-language models for video understanding.arXiv preprint arXiv:2511.19923, 2025. 9

  9. [9]

    Ego3D-Bench: Evaluating 3d scene understanding from egocentric multi-view videos.arXiv preprint, 2024

    Yuxuan Chen, Bowen Li, Weijie Zhang, et al. Ego3D-Bench: Evaluating 3d scene understanding from egocentric multi-view videos.arXiv preprint, 2024

  10. [10]

    arXiv preprint arXiv:2205.01089 (2022) 16 M

    Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, and Chuang Gan. ComPhy: Compositional physical reasoning of objects and events from videos.arXiv preprint arXiv:2205.01089, 2022

  11. [11]

    SpatialRGPT: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. arXiv preprint arXiv:2406.01584, 2024

  12. [12]

    Internspatial: A comprehensive dataset for spatial reasoning in vision-language models.arXiv preprint arXiv:2506.18385, 2025

    Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, and Wenhai Wang. InternSpa- tial: A comprehensive dataset for spatial reasoning in vision-language models.arXiv preprint arXiv:...

  13. [13]

    PaLM-E: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, and Igor Mordatch. PaLM-E: An embodied multimodal languag...

  14. [14]

    Gemini 3.1: A family of highly capable multimodal models

    Google DeepMind. Gemini 3.1: A family of highly capable multimodal models. https: //deepmind.google/technologies/gemini/, 2025

  15. [15]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018

  16. [16]

    Hashimoto

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction- following models.https://github.com/tatsu-lab/alpaca_eval, 2023

  17. [17]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  18. [18]

    Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  19. [19]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, et al. MMBench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2025

  20. [20]

    iVISPAR: An interactive visual-spatial reasoning benchmark for vlms.arXiv preprint arXiv:2502.03214, 2025

    Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, and Elia Bruni. iVISPAR: An interactive visual-spatial reasoning benchmark for vlms.arXiv preprint arXiv:2502.03214, 2025

  21. [21]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  22. [22]

    GPT-5.2 system card

    OpenAI. GPT-5.2 system card. https://openai.com/index/introducing-gpt-5/, 2025

  23. [23]

    NuScenes-Spatial QA: Spatial reasoning in autonomous driving with lidar-grounded question answering.arXiv preprint, 2024

    Rui Qian, Jiankai Huang, Yijun Wang, et al. NuScenes-Spatial QA: Spatial reasoning in autonomous driving with lidar-grounded question answering.arXiv preprint, 2024

  24. [24]

    Nano banana 2: Combining pro capabilities with lightning-fast speed

    Naina Raisinghani. Nano banana 2: Combining pro capabilities with lightning-fast speed. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/ , February 2026. Accessed: 2026-05-06

  25. [25]

    Mind the gap: Diagnosing spatial reasoning failures in vision-language models

    Kanchana Ranasinghe, Shao-Yu Tran, Xueyan Luo, et al. Mind the gap: Diagnosing spatial reasoning failures in vision-language models. InICLR, 2025

  26. [26]

    Object hallucination in image captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4035–4045, 2018. 10

  27. [27]

    Mind the gap: Benchmarking spatial reasoning in vision-language models.arXiv preprint arXiv:2503.19707, 2025

    Ilias Stogiannidis, Steven McDonagh, and Sotirios A. Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models.arXiv preprint arXiv:2503.19707, 2025

  28. [28]

    Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

    Gemini Robotics Team et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025

  29. [29]

    Qwen3-vl-8b-thinking

    Qwen Team. Qwen3-vl-8b-thinking. https://huggingface.co/Qwen/ Qwen3-VL-8B-Thinking, 2025

  30. [30]

    Vision language models are blind

    Shengbang Tong, Ellis Brown, Penghao Wu, et al. Eyes wide shut? exploring the visual shortcomings of multimodal LLMs.arXiv preprint arXiv:2407.06581, 2024

  31. [31]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.arXiv preprint arXiv:2311.07397, 2023

  32. [32]

    Spa- tial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025

    Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei- Fei, and Manling Li. Mindcube: Spatial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025

  33. [33]

    COVER: Counterfactual video reasoning for evaluating vision-language models.arXiv preprint, 2025

    Yuxin Wang, Jiaxin Chen, Zhuo Li, et al. COVER: Counterfactual video reasoning for evaluating vision-language models.arXiv preprint, 2025

  34. [34]

    VSI-Bench: Benchmarking visual spatial intelligence in vision-language models.arXiv preprint arXiv:2407.07890, 2024

    Lixin Yang, Kailin Chen, Songyou Peng, et al. VSI-Bench: Benchmarking visual spatial intelligence in vision-language models.arXiv preprint arXiv:2407.07890, 2024

  35. [35]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yao, Linjie Luo, et al. MM-Vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2024

  36. [36]

    GLM-4.6V: A bilingual multi-modal language model

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-4.6V: A bilingual multi-modal language model. https://github.com/THUDM/GLM-4, 2025

  37. [37]

    CausalVLBench: Benchmarking causal reasoning in vision-language models.arXiv preprint, 2025

    Yiming Zhang, Yifan Liu, Zhiwei Wang, et al. CausalVLBench: Benchmarking causal reasoning in vision-language models.arXiv preprint, 2025

  38. [38]

    if X is removed, what becomes visible?

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InNeurIPS, 2024. Appendix Contents 1 Introduction 1 2 Related Work 2 3 Benchmark Design 3 4 Experimental Setup 3 5 Result...

  39. [39]

    Reject scenes where any object label has confidence below a calibrated threshold

    Object inventory.Given a scene image, run an open-vocabulary detector and SAM 3 [ 5] to obtain per-object masks {Mi} and canonical noun-phrase labels {ℓi}. Reject scenes where any object label has confidence below a calibrated threshold

  40. [40]

    Assign each object a representative depthd i = median(D[Mi >0])

    Depth assignment.Run Depth-Anything-V3 [ 17] to obtain a per-pixel depth map D. Assign each object a representative depthd i = median(D[Mi >0])

  41. [41]

    The resulting directed graph Gocc encodes the depth-ordered occlusion structure of the scene

    Occlusion-graph construction.For each ordered pair (i, j) with di < d j and IoU(Mi,bbox(M j))> τ occ, declare i occludes j. The resulting directed graph Gocc encodes the depth-ordered occlusion structure of the scene

  42. [42]

    Per-task ground truth derivation.T1 (single-object removal):For target X, the ground truth is the set of j such that X is theuniquepredecessor of j in Gocc (i.e. removing X leaves j unoccluded).T2 (multi-object removal):Compute the minimum set cover over predecessors of X in Gocc via an exact small-set solver (the graphs are small enough that this is trac...

  43. [43]

    in front

    Validation gates.Reject samples where Gocc has fewer than three nodes, where the depth gradient across the occluding chain is below a calibrated minimum (to avoid degenerate cases where “in front” is ill-defined), or where multiple objects share a depth band within sensor noise. The output is a tuple (image,prompt,ground-truth set) per task that mirrors t...

  44. [44]

    Reject scenes with no high-confidence reflective surface

    Reflective-surface detection.Run a material-classification head over SAM 3 masks to identify candidate reflective surfaces (polished wood, granite, marble, quartz, glass). Reject scenes with no high-confidence reflective surface

  45. [45]

    Reflection-region detection.Within each reflective surface mask, detect candidate reflection patches via a separate detector trained on natural reflections, returning per-patch bounding boxes {Bk}

  46. [46]

    Concretely, mirror each above-surface object’s bounding box across the surface plane and rank candidates by geometric overlap with Bk

    3D-to-2D correspondence.For each detected reflection patch Bk, identify the 3D object whose silhouette and pose are consistent with the reflection geometry under a planar-mirror model of the surface. Concretely, mirror each above-surface object’s bounding box across the surface plane and rank candidates by geometric overlap with Bk. Establish (Bk ↔object ...

  47. [47]

    The ground-truth answer set is then {objecti : (B k, i)/∈S} , i.e

    Counterfactual variant generation.Sample a subset S of detected reflection patches to be removed; perform inpainting (a diffusion-based inpainter conditioned on the surrounding surface texture) overS k∈S Bk to remove those reflections while preserving the rest. The ground-truth answer set is then {objecti : (B k, i)/∈S} , i.e. the objects whose reflection...

  48. [48]

    Validation gates.Run a perceptual-quality model and a reflection re-detector over the inpainted image; reject samples where (a) inpainting artifacts are detectable above threshold, or (b) the re-detector finds reflections at any of the removed locations. The reflection pipeline is the most failure-prone of the three because it composes three imperfect mod...

  49. [49]

    Object volumes.Use SAM 3 masks plus depth to estimate each object’s frustum-projected volumeV i and itsx-coordinatex i (lateral position)

  50. [50]

    Target ordering generation.Sample a target left-to-right ordering by permuting a subset of {xi} such that achieving the target requires either (T5) a unique single pairwise swap or (T6) a multi-step sequence of swaps

  51. [51]

    For T5, exhaustively enumerate single swaps and retain scenes admitting exactly one feasible solution

    Collision-aware swap search.A pairwise swap (i, j) isfeasibleiff placing Vi at xj and Vj at xi produces no overlap with any other Vk. For T5, exhaustively enumerate single swaps and retain scenes admitting exactly one feasible solution. For T6, run BFS over swap sequences and retain the shortest valid sequence; tag scenes where no sequence exists asinfeas...

  52. [52]

    Prompt assembly.Generate the natural-language prompt by populating the T5/T6 template (Appendix E) with the sampled target ordering

  53. [53]

    I.2 Pilot-Batch Validation We do not attempt per-sample agreement against the existing 3,034-sample release

    Validation gates.Reject scenes where (a) object volumes overlap in the original image (a perception failure upstream), or (b) the search returns multiple equally-shortest solutions for T6 (ambiguous ground truth). I.2 Pilot-Batch Validation We do not attempt per-sample agreement against the existing 3,034-sample release. The scenes admit many valid config...