Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?

Animesh Maheshwari; Divyansh Sahu; Nishit Verma

arxiv: 2605.20448 · v1 · pith:CDCR6CHLnew · submitted 2026-05-19 · 💻 cs.CV · cs.LG

Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?

Animesh Maheshwari , Divyansh Sahu , Nishit Verma This is my paper

Pith reviewed 2026-05-21 06:51 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords vision-language models3D scene understandingspatial reasoningocclusionreflectionstoken compressionbenchmarks

0 comments

The pith

Vision-language models rearrange visible objects accurately yet fail on occlusion and reflections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests if vision-language models build internal representations of 3D scene layouts or simply detect and name objects. It introduces a benchmark with tasks for depth-ordered occlusion using multiple counterfactual setups, inference of geometry from visible reflections, and planning object rearrangements in clear views. Six models score well on rearrangement planning with low collision violations but drop sharply on the occlusion and reflection probes. White-box inspection of one model shows spatial details survive the vision encoder yet become inaccessible after visual token compression.

Core claim

Models achieve 53-97 percent accuracy on volumetric rearrangement planning over visible layouts and rarely violate collision constraints, yet accuracy falls to 6-45 percent on occlusion probes and below 7 percent on reflection geometry, with the performance gap localized to the visual-token merger stage where recoverable spatial information is lost.

What carries the argument

A 3,034-sample human-curated benchmark using three independent counterfactual operationalisations for occlusion, optical-geometry probes for reflections, and rearrangement planning tasks, scored by trained annotators with white-box activation patching to trace failure points.

If this is right

Success on rearrangement planning does not indicate broad 3D scene understanding.
Spatial information is lost specifically during visual token compression rather than in the vision encoder itself.
Embodied reasoning models exhibit the same performance profile, pointing to a shared architectural limitation.
Patching post-merger activations can recover spatial capability on these probes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gaps may appear in robotics tasks that require reasoning about partially observed environments.
Benchmarks limited to fully visible scenes likely overestimate current models' spatial competence.
Architectures that preserve 3D coordinates through the vision-language interface could close the observed gap.

Load-bearing premise

The benchmark tasks for occlusion, reflections, and rearrangement together measure genuine 3D spatial understanding rather than surface patterns or task-specific artifacts.

What would settle it

A controlled experiment that restores high accuracy on the occlusion and reflection tasks by patching clean post-merger visual activations into the language decoder without altering other components would falsify the localization of the failure.

Figures

Figures reproduced from arXiv: 2605.20448 by Animesh Maheshwari, Divyansh Sahu, Nishit Verma.

**Figure 1.** Figure 1: DGAR aggregate findings over 163 failure cases. Left: mean attention fraction allocated to each region per task family; dashed line marks uniform chance (1/3). Across all three tasks, the depth-correct region receives well below 5% of visual attention, while the irrelevant region absorbs roughly 80%. Right: failure-mode distribution. Attention Dispersed accounts for 100% of failures across all three tasks;… view at source ↗

**Figure 2.** Figure 2: DGAR is uniformly low across the LM decoder. Left: mean DGAR across the LM decoder layers; spatial attention to the depth-correct region remains consistently low across the entire decoder, drifting downward in deeper layers rather than rising. No layer shows a significant spike toward depth-correct attention. Right: layer×head DGAR heatmap; no individual head or layer emerges as a strong spatial specialist… view at source ↗

**Figure 3.** Figure 3: Causal tracing reveals where spatial information is processed and lost (163 failure cases). (a) Recovery curves for Corruption A (target-object patches): restoration is ineffective at the merger stage but fully recovers at L0. (b) Recovery curves for Corruption B (depth-correct patches): identical V-shape with sharp drop at the merger. (c) Groundedness classification per task and corruption type. Vertical … view at source ↗

**Figure 4.** Figure 4: Reflection task setup. Scene (left); prompt and model response (right). The model lists every object on the metallic countertop as having a visible reflection, including the blue water bottle and pen whose reflections are not resolved in the image. The reasoning chain reveals the failure: the model invokes the rule “the table is reflective, so any object placed on it must have a reflection,” applying a lea… view at source ↗

**Figure 5.** Figure 5: Vision encoder activation (V21–V27). Self-attention concentrates on the foreground objects (bottle, cloth, watch) rather than on the countertop region containing the actual reflections. Entropy values are reported above each block. cable, a pen, a watch, and a checkered cloth resting on a metallic countertop. A correct answer would list only objects whose reflections are actually present on the metal surfa… view at source ↗

**Figure 6.** Figure 6: Bridge-layer cluster map (L20). Left: factual attention heatmap. Centre: 20 attention clusters under the factual prompt. Right: 24 clusters under the counterfactual prompt. Both partitions are diffuse and lack the coherent object–reflection pair structure that a reflection-aware model would exhibit. I Scaling the Benchmark: A Semi-Automated Construction Pipeline The current release is built end-to-end by t… view at source ↗

read the original abstract

Vision--language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VLMs rearrange visible objects reliably but fail at occlusion and reflection inference, with the bottleneck at visual token merging.

read the letter

The key point from this paper is that vision-language models can plan rearrangements of visible objects with decent accuracy and few collisions, but they struggle badly when they have to reason about occluded parts or reflections in the scene. The authors trace this to the point where visual tokens get compressed and merged before feeding into the language decoder. They did a solid job putting together the benchmark. With 3,034 human-curated samples, they tested occlusion using three separate counterfactual setups, added reflection geometry probes, and included volumetric rearrangement planning as a control. Scoring 18,204 responses by trained human annotators, without any LLM judges, keeps the evaluation grounded. The white-box patching experiment on Qwen3-VL-8B-Thinking is a nice touch, showing that spatial information is available in the vision encoder but becomes inaccessible after the merger step. The results line up with the claim: rearrangement success between 53 and 97 percent, but occlusion down to 6-45 percent and reflections below 7 percent. The same pattern shows up in an embodied-reasoning model, which makes it less likely to be an artifact of one particular setup. On the soft side, the main question is how well these tasks isolate true 3D spatial understanding versus other factors like prompt sensitivity or dataset biases. The multiple operationalisations and human scoring reduce that worry, but it's still possible that the failures reflect something narrower than a general lack of 3D representation. The paper doesn't explore whether fine-tuning or different architectures could close the gap, which might be a natural next step but isn't required for this work. This kind of targeted diagnostic is useful for researchers building or evaluating VLMs for applications that need spatial awareness, like robotics or augmented reality. Readers interested in mechanistic interpretability of multimodal models will also get something out of the localization analysis. It deserves peer review. The evidence for the dissociation is direct and the internal analysis adds value beyond surface-level testing.

Referee Report

2 major / 3 minor

Summary. The paper introduces a human-curated 3,034-sample benchmark to test whether vision-language models understand 3D scene layout beyond object recognition. It evaluates six frontier and open-weight VLMs on three components: depth-ordered occlusion (via three independent counterfactual operationalisations), optical-geometry inference from visible reflections, and volumetric rearrangement planning over visible layouts. Using trained human annotators to score 18,204 responses (no LLM judges), the work reports a sharp dissociation: models achieve 53-97% accuracy on visible rearrangement planning with low collision violations, but drop to 6-45% on occlusion probes and below 7% on reflections. An embodied-reasoning model shows the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger step, where spatial information recoverable in the vision encoder becomes inaccessible after compression.

Significance. If the reported dissociation holds, the work is significant for demonstrating that current VLMs lack robust 3D spatial understanding, succeeding on visible planning but failing on occlusion and geometric inference. Strengths include the human-curated dataset, three independent occlusion operationalisations, 18k human-scored responses, absence of fitted parameters or self-referential derivations, and white-box patching that identifies a concrete architectural bottleneck at the vision-token merger. These elements provide falsifiable, reproducible evidence against simple surface-pattern explanations.

major comments (2)

[§4.3] §4.3 (occlusion probes): the three counterfactual operationalisations are presented as independent, but the manuscript does not report inter-probe correlation or a control showing that each isolates a distinct failure mode rather than a shared surface cue; this is load-bearing for the claim that the dissociation reflects genuine 3D understanding deficits.
[Table 2] Table 2 (rearrangement results): while collision violations are reported as low, the paper does not include a baseline comparison against a model that has access only to 2D bounding boxes; without this, it remains possible that the high visible-layout accuracy reflects 2D heuristics rather than 3D layout representation.

minor comments (3)

[Figure 4] Figure 4: the activation patching diagram would benefit from explicit labels indicating which layers correspond to the vision encoder, merger, and language decoder to improve readability of the localisation claim.
[§3.2] §3.2: the description of the 3,034-sample curation process could specify the exact criteria used by the three independent annotators to resolve disagreements on occlusion ordering.
[Related Work] Related work section: add a brief discussion of recent benchmarks on VLM spatial reasoning (e.g., those using synthetic 3D renders) to better situate the novelty of the reflection and multi-operationalisation probes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, making revisions where they strengthen the manuscript without altering our core claims or results.

read point-by-point responses

Referee: [§4.3] §4.3 (occlusion probes): the three counterfactual operationalisations are presented as independent, but the manuscript does not report inter-probe correlation or a control showing that each isolates a distinct failure mode rather than a shared surface cue; this is load-bearing for the claim that the dissociation reflects genuine 3D understanding deficits.

Authors: We agree that explicit quantitative support for the independence of the three operationalisations would reinforce the interpretation. The probes were constructed with distinct mechanisms—one using multi-view depth ordering, one using counterfactual object removal to test visibility, and one using direct visibility prediction—but we acknowledge the value of reporting correlations. In the revised manuscript we add a new analysis in §4.3 computing pairwise Pearson correlations across the three probes for all evaluated models; the resulting coefficients are low (0.12–0.31), consistent with distinct failure modes rather than a single surface cue. We also include a shuffled-label control within each probe that reduces accuracy to chance levels, further indicating that performance is not driven by shared low-level heuristics. These additions directly address the concern while preserving the original experimental design. revision: yes
Referee: [Table 2] Table 2 (rearrangement results): while collision violations are reported as low, the paper does not include a baseline comparison against a model that has access only to 2D bounding boxes; without this, it remains possible that the high visible-layout accuracy reflects 2D heuristics rather than 3D layout representation.

Authors: We appreciate the suggestion of an additional control. However, we maintain that a 2D-bounding-box-only baseline is not required to support our conclusions and would not be straightforward to implement within the scope of the present study. The volumetric rearrangement task explicitly requires models to reason about depth ordering, free space, and potential 3D collisions even when all objects are visible; 2D bounding boxes alone cannot encode these relations (for example, whether one object can be moved behind another without intersection in depth). More importantly, the same models that succeed on visible rearrangement fail sharply on the occlusion and reflection tasks, which cannot be solved by 2D heuristics. This dissociation itself argues against a purely 2D strategy. We have expanded the discussion section to articulate this reasoning and to note that constructing a non-VLM 2D baseline lies outside the paper’s focus on evaluating existing vision-language models. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper evaluates VLMs on a human-curated benchmark using direct model testing, three independent counterfactual operationalisations for occlusion, reflection geometry probes, and trained human annotators scoring 18k responses with no LLM-as-judge. White-box analysis localises failures to the vision-token merger step via patching experiments. No equations, fitted parameters, self-referential derivations, or load-bearing self-citations reduce the central dissociation claim to its inputs by construction. The design relies on external annotations and multiple independent measures, rendering the result self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the curated benchmark tasks validly isolate 3D spatial understanding and that the observed performance gap is caused by the token-merger stage rather than dataset artifacts or annotator bias.

axioms (1)

domain assumption Human annotators provide reliable ground-truth scoring for spatial-reasoning responses across the three task types.
All 18,204 responses were scored by trained annotators; the dissociation claim depends on this scoring being accurate and unbiased.

pith-pipeline@v0.9.0 · 5713 in / 1387 out tokens · 47036 ms · 2026-05-21T06:51:36.925512+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

sharp dissociation: models that plan rearrangements over visible layouts at 53–97% accuracy ... fall to 6–45% on occlusion and below 7% on reflections ... localises the failure to the visual-token merger

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 8 internal anchors

[1]

Qwen3.5-VL: Scaling vision-language models with mixture-of-experts

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen3.5-VL: Scaling vision-language models with mixture-of-experts. arXiv preprint, 2025

work page 2025
[2]

Qwen3-VL-30B-A3B-Thinking: A thinking-mode vision-language model

Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen3-VL-30B-A3B-Thinking: A thinking-mode vision-language model. https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking , 2025

work page 2025
[3]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

RT-2: Vision-language- action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language- action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), 2023

work page 2023
[5]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021

work page 2021
[7]

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024

work page 2024
[8]

CounterVQA: Evaluating and improving counterfactual reasoning in vision-language models for video understanding.arXiv preprint arXiv:2511.19923, 2025

Yuefei Chen, Jiang Liu, Xiaodong Lin, and Ruixiang Tang. CounterVQA: Evaluating and improving counterfactual reasoning in vision-language models for video understanding.arXiv preprint arXiv:2511.19923, 2025. 9

work page arXiv 2025
[9]

Ego3D-Bench: Evaluating 3d scene understanding from egocentric multi-view videos.arXiv preprint, 2024

Yuxuan Chen, Bowen Li, Weijie Zhang, et al. Ego3D-Bench: Evaluating 3d scene understanding from egocentric multi-view videos.arXiv preprint, 2024

work page 2024
[10]

arXiv preprint arXiv:2205.01089 (2022) 16 M

Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, and Chuang Gan. ComPhy: Compositional physical reasoning of objects and events from videos.arXiv preprint arXiv:2205.01089, 2022

work page arXiv 2022
[11]

SpatialRGPT: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. arXiv preprint arXiv:2406.01584, 2024

work page arXiv 2024
[12]

Internspatial: A comprehensive dataset for spatial reasoning in vision-language models.arXiv preprint arXiv:2506.18385, 2025

Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, and Wenhai Wang. InternSpa- tial: A comprehensive dataset for spatial reasoning in vision-language models.arXiv preprint arXiv:...

work page arXiv 2025
[13]

PaLM-E: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, and Igor Mordatch. PaLM-E: An embodied multimodal languag...

work page 2023
[14]

Gemini 3.1: A family of highly capable multimodal models

Google DeepMind. Gemini 3.1: A family of highly capable multimodal models. https: //deepmind.google/technologies/gemini/, 2025

work page 2025
[15]

World Models

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction- following models.https://github.com/tatsu-lab/alpaca_eval, 2023

work page 2023
[17]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

work page 2023
[19]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, et al. MMBench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

iVISPAR: An interactive visual-spatial reasoning benchmark for vlms.arXiv preprint arXiv:2502.03214, 2025

Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, and Elia Bruni. iVISPAR: An interactive visual-spatial reasoning benchmark for vlms.arXiv preprint arXiv:2502.03214, 2025

work page arXiv 2025
[21]

Locating and editing factual associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[22]

GPT-5.2 system card

OpenAI. GPT-5.2 system card. https://openai.com/index/introducing-gpt-5/, 2025

work page 2025
[23]

NuScenes-Spatial QA: Spatial reasoning in autonomous driving with lidar-grounded question answering.arXiv preprint, 2024

Rui Qian, Jiankai Huang, Yijun Wang, et al. NuScenes-Spatial QA: Spatial reasoning in autonomous driving with lidar-grounded question answering.arXiv preprint, 2024

work page 2024
[24]

Nano banana 2: Combining pro capabilities with lightning-fast speed

Naina Raisinghani. Nano banana 2: Combining pro capabilities with lightning-fast speed. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/ , February 2026. Accessed: 2026-05-06

work page 2026
[25]

Mind the gap: Diagnosing spatial reasoning failures in vision-language models

Kanchana Ranasinghe, Shao-Yu Tran, Xueyan Luo, et al. Mind the gap: Diagnosing spatial reasoning failures in vision-language models. InICLR, 2025

work page 2025
[26]

Object hallucination in image captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4035–4045, 2018. 10

work page 2018
[27]

Mind the gap: Benchmarking spatial reasoning in vision-language models.arXiv preprint arXiv:2503.19707, 2025

Ilias Stogiannidis, Steven McDonagh, and Sotirios A. Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models.arXiv preprint arXiv:2503.19707, 2025

work page arXiv 2025
[28]

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

Gemini Robotics Team et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Qwen3-vl-8b-thinking

Qwen Team. Qwen3-vl-8b-thinking. https://huggingface.co/Qwen/ Qwen3-VL-8B-Thinking, 2025

work page 2025
[30]

Vision language models are blind

Shengbang Tong, Ellis Brown, Penghao Wu, et al. Eyes wide shut? exploring the visual shortcomings of multimodal LLMs.arXiv preprint arXiv:2407.06581, 2024

work page arXiv 2024
[31]

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.arXiv preprint arXiv:2311.07397, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Spa- tial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei- Fei, and Manling Li. Mindcube: Spatial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025

work page arXiv 2025
[33]

COVER: Counterfactual video reasoning for evaluating vision-language models.arXiv preprint, 2025

Yuxin Wang, Jiaxin Chen, Zhuo Li, et al. COVER: Counterfactual video reasoning for evaluating vision-language models.arXiv preprint, 2025

work page 2025
[34]

VSI-Bench: Benchmarking visual spatial intelligence in vision-language models.arXiv preprint arXiv:2407.07890, 2024

Lixin Yang, Kailin Chen, Songyou Peng, et al. VSI-Bench: Benchmarking visual spatial intelligence in vision-language models.arXiv preprint arXiv:2407.07890, 2024

work page arXiv 2024
[35]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yao, Linjie Luo, et al. MM-Vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

GLM-4.6V: A bilingual multi-modal language model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-4.6V: A bilingual multi-modal language model. https://github.com/THUDM/GLM-4, 2025

work page 2025
[37]

CausalVLBench: Benchmarking causal reasoning in vision-language models.arXiv preprint, 2025

Yiming Zhang, Yifan Liu, Zhiwei Wang, et al. CausalVLBench: Benchmarking causal reasoning in vision-language models.arXiv preprint, 2025

work page 2025
[38]

if X is removed, what becomes visible?

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InNeurIPS, 2024. Appendix Contents 1 Introduction 1 2 Related Work 2 3 Benchmark Design 3 4 Experimental Setup 3 5 Result...

work page 2024
[39]

Reject scenes where any object label has confidence below a calibrated threshold

Object inventory.Given a scene image, run an open-vocabulary detector and SAM 3 [ 5] to obtain per-object masks {Mi} and canonical noun-phrase labels {ℓi}. Reject scenes where any object label has confidence below a calibrated threshold

work page
[40]

Assign each object a representative depthd i = median(D[Mi >0])

Depth assignment.Run Depth-Anything-V3 [ 17] to obtain a per-pixel depth map D. Assign each object a representative depthd i = median(D[Mi >0])

work page
[41]

The resulting directed graph Gocc encodes the depth-ordered occlusion structure of the scene

Occlusion-graph construction.For each ordered pair (i, j) with di < d j and IoU(Mi,bbox(M j))> τ occ, declare i occludes j. The resulting directed graph Gocc encodes the depth-ordered occlusion structure of the scene

work page
[42]

Per-task ground truth derivation.T1 (single-object removal):For target X, the ground truth is the set of j such that X is theuniquepredecessor of j in Gocc (i.e. removing X leaves j unoccluded).T2 (multi-object removal):Compute the minimum set cover over predecessors of X in Gocc via an exact small-set solver (the graphs are small enough that this is trac...

work page
[43]

in front

Validation gates.Reject samples where Gocc has fewer than three nodes, where the depth gradient across the occluding chain is below a calibrated minimum (to avoid degenerate cases where “in front” is ill-defined), or where multiple objects share a depth band within sensor noise. The output is a tuple (image,prompt,ground-truth set) per task that mirrors t...

work page
[44]

Reject scenes with no high-confidence reflective surface

Reflective-surface detection.Run a material-classification head over SAM 3 masks to identify candidate reflective surfaces (polished wood, granite, marble, quartz, glass). Reject scenes with no high-confidence reflective surface

work page
[45]

Reflection-region detection.Within each reflective surface mask, detect candidate reflection patches via a separate detector trained on natural reflections, returning per-patch bounding boxes {Bk}

work page
[46]

Concretely, mirror each above-surface object’s bounding box across the surface plane and rank candidates by geometric overlap with Bk

3D-to-2D correspondence.For each detected reflection patch Bk, identify the 3D object whose silhouette and pose are consistent with the reflection geometry under a planar-mirror model of the surface. Concretely, mirror each above-surface object’s bounding box across the surface plane and rank candidates by geometric overlap with Bk. Establish (Bk ↔object ...

work page
[47]

The ground-truth answer set is then {objecti : (B k, i)/∈S} , i.e

Counterfactual variant generation.Sample a subset S of detected reflection patches to be removed; perform inpainting (a diffusion-based inpainter conditioned on the surrounding surface texture) overS k∈S Bk to remove those reflections while preserving the rest. The ground-truth answer set is then {objecti : (B k, i)/∈S} , i.e. the objects whose reflection...

work page
[48]

Validation gates.Run a perceptual-quality model and a reflection re-detector over the inpainted image; reject samples where (a) inpainting artifacts are detectable above threshold, or (b) the re-detector finds reflections at any of the removed locations. The reflection pipeline is the most failure-prone of the three because it composes three imperfect mod...

work page
[49]

Object volumes.Use SAM 3 masks plus depth to estimate each object’s frustum-projected volumeV i and itsx-coordinatex i (lateral position)

work page
[50]

Target ordering generation.Sample a target left-to-right ordering by permuting a subset of {xi} such that achieving the target requires either (T5) a unique single pairwise swap or (T6) a multi-step sequence of swaps

work page
[51]

For T5, exhaustively enumerate single swaps and retain scenes admitting exactly one feasible solution

Collision-aware swap search.A pairwise swap (i, j) isfeasibleiff placing Vi at xj and Vj at xi produces no overlap with any other Vk. For T5, exhaustively enumerate single swaps and retain scenes admitting exactly one feasible solution. For T6, run BFS over swap sequences and retain the shortest valid sequence; tag scenes where no sequence exists asinfeas...

work page
[52]

Prompt assembly.Generate the natural-language prompt by populating the T5/T6 template (Appendix E) with the sampled target ordering

work page
[53]

I.2 Pilot-Batch Validation We do not attempt per-sample agreement against the existing 3,034-sample release

Validation gates.Reject scenes where (a) object volumes overlap in the original image (a perception failure upstream), or (b) the search returns multiple equally-shortest solutions for T6 (ambiguous ground truth). I.2 Pilot-Batch Validation We do not attempt per-sample agreement against the existing 3,034-sample release. The scenes admit many valid config...

work page

[1] [1]

Qwen3.5-VL: Scaling vision-language models with mixture-of-experts

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen3.5-VL: Scaling vision-language models with mixture-of-experts. arXiv preprint, 2025

work page 2025

[2] [2]

Qwen3-VL-30B-A3B-Thinking: A thinking-mode vision-language model

Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen3-VL-30B-A3B-Thinking: A thinking-mode vision-language model. https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking , 2025

work page 2025

[3] [3]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

RT-2: Vision-language- action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language- action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), 2023

work page 2023

[5] [5]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021

work page 2021

[7] [7]

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024

work page 2024

[8] [8]

CounterVQA: Evaluating and improving counterfactual reasoning in vision-language models for video understanding.arXiv preprint arXiv:2511.19923, 2025

Yuefei Chen, Jiang Liu, Xiaodong Lin, and Ruixiang Tang. CounterVQA: Evaluating and improving counterfactual reasoning in vision-language models for video understanding.arXiv preprint arXiv:2511.19923, 2025. 9

work page arXiv 2025

[9] [9]

Ego3D-Bench: Evaluating 3d scene understanding from egocentric multi-view videos.arXiv preprint, 2024

Yuxuan Chen, Bowen Li, Weijie Zhang, et al. Ego3D-Bench: Evaluating 3d scene understanding from egocentric multi-view videos.arXiv preprint, 2024

work page 2024

[10] [10]

arXiv preprint arXiv:2205.01089 (2022) 16 M

Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, and Chuang Gan. ComPhy: Compositional physical reasoning of objects and events from videos.arXiv preprint arXiv:2205.01089, 2022

work page arXiv 2022

[11] [11]

SpatialRGPT: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. arXiv preprint arXiv:2406.01584, 2024

work page arXiv 2024

[12] [12]

Internspatial: A comprehensive dataset for spatial reasoning in vision-language models.arXiv preprint arXiv:2506.18385, 2025

Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, and Wenhai Wang. InternSpa- tial: A comprehensive dataset for spatial reasoning in vision-language models.arXiv preprint arXiv:...

work page arXiv 2025

[13] [13]

PaLM-E: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, and Igor Mordatch. PaLM-E: An embodied multimodal languag...

work page 2023

[14] [14]

Gemini 3.1: A family of highly capable multimodal models

Google DeepMind. Gemini 3.1: A family of highly capable multimodal models. https: //deepmind.google/technologies/gemini/, 2025

work page 2025

[15] [15]

World Models

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction- following models.https://github.com/tatsu-lab/alpaca_eval, 2023

work page 2023

[17] [17]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

work page 2023

[19] [19]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, et al. MMBench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

iVISPAR: An interactive visual-spatial reasoning benchmark for vlms.arXiv preprint arXiv:2502.03214, 2025

Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, and Elia Bruni. iVISPAR: An interactive visual-spatial reasoning benchmark for vlms.arXiv preprint arXiv:2502.03214, 2025

work page arXiv 2025

[21] [21]

Locating and editing factual associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[22] [22]

GPT-5.2 system card

OpenAI. GPT-5.2 system card. https://openai.com/index/introducing-gpt-5/, 2025

work page 2025

[23] [23]

NuScenes-Spatial QA: Spatial reasoning in autonomous driving with lidar-grounded question answering.arXiv preprint, 2024

Rui Qian, Jiankai Huang, Yijun Wang, et al. NuScenes-Spatial QA: Spatial reasoning in autonomous driving with lidar-grounded question answering.arXiv preprint, 2024

work page 2024

[24] [24]

Nano banana 2: Combining pro capabilities with lightning-fast speed

Naina Raisinghani. Nano banana 2: Combining pro capabilities with lightning-fast speed. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/ , February 2026. Accessed: 2026-05-06

work page 2026

[25] [25]

Mind the gap: Diagnosing spatial reasoning failures in vision-language models

Kanchana Ranasinghe, Shao-Yu Tran, Xueyan Luo, et al. Mind the gap: Diagnosing spatial reasoning failures in vision-language models. InICLR, 2025

work page 2025

[26] [26]

Object hallucination in image captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4035–4045, 2018. 10

work page 2018

[27] [27]

Mind the gap: Benchmarking spatial reasoning in vision-language models.arXiv preprint arXiv:2503.19707, 2025

Ilias Stogiannidis, Steven McDonagh, and Sotirios A. Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models.arXiv preprint arXiv:2503.19707, 2025

work page arXiv 2025

[28] [28]

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

Gemini Robotics Team et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Qwen3-vl-8b-thinking

Qwen Team. Qwen3-vl-8b-thinking. https://huggingface.co/Qwen/ Qwen3-VL-8B-Thinking, 2025

work page 2025

[30] [30]

Vision language models are blind

Shengbang Tong, Ellis Brown, Penghao Wu, et al. Eyes wide shut? exploring the visual shortcomings of multimodal LLMs.arXiv preprint arXiv:2407.06581, 2024

work page arXiv 2024

[31] [31]

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.arXiv preprint arXiv:2311.07397, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Spa- tial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei- Fei, and Manling Li. Mindcube: Spatial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025

work page arXiv 2025

[33] [33]

COVER: Counterfactual video reasoning for evaluating vision-language models.arXiv preprint, 2025

Yuxin Wang, Jiaxin Chen, Zhuo Li, et al. COVER: Counterfactual video reasoning for evaluating vision-language models.arXiv preprint, 2025

work page 2025

[34] [34]

VSI-Bench: Benchmarking visual spatial intelligence in vision-language models.arXiv preprint arXiv:2407.07890, 2024

Lixin Yang, Kailin Chen, Songyou Peng, et al. VSI-Bench: Benchmarking visual spatial intelligence in vision-language models.arXiv preprint arXiv:2407.07890, 2024

work page arXiv 2024

[35] [35]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yao, Linjie Luo, et al. MM-Vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

GLM-4.6V: A bilingual multi-modal language model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-4.6V: A bilingual multi-modal language model. https://github.com/THUDM/GLM-4, 2025

work page 2025

[37] [37]

CausalVLBench: Benchmarking causal reasoning in vision-language models.arXiv preprint, 2025

Yiming Zhang, Yifan Liu, Zhiwei Wang, et al. CausalVLBench: Benchmarking causal reasoning in vision-language models.arXiv preprint, 2025

work page 2025

[38] [38]

if X is removed, what becomes visible?

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InNeurIPS, 2024. Appendix Contents 1 Introduction 1 2 Related Work 2 3 Benchmark Design 3 4 Experimental Setup 3 5 Result...

work page 2024

[39] [39]

Reject scenes where any object label has confidence below a calibrated threshold

Object inventory.Given a scene image, run an open-vocabulary detector and SAM 3 [ 5] to obtain per-object masks {Mi} and canonical noun-phrase labels {ℓi}. Reject scenes where any object label has confidence below a calibrated threshold

work page

[40] [40]

Assign each object a representative depthd i = median(D[Mi >0])

Depth assignment.Run Depth-Anything-V3 [ 17] to obtain a per-pixel depth map D. Assign each object a representative depthd i = median(D[Mi >0])

work page

[41] [41]

The resulting directed graph Gocc encodes the depth-ordered occlusion structure of the scene

Occlusion-graph construction.For each ordered pair (i, j) with di < d j and IoU(Mi,bbox(M j))> τ occ, declare i occludes j. The resulting directed graph Gocc encodes the depth-ordered occlusion structure of the scene

work page

[42] [42]

Per-task ground truth derivation.T1 (single-object removal):For target X, the ground truth is the set of j such that X is theuniquepredecessor of j in Gocc (i.e. removing X leaves j unoccluded).T2 (multi-object removal):Compute the minimum set cover over predecessors of X in Gocc via an exact small-set solver (the graphs are small enough that this is trac...

work page

[43] [43]

in front

Validation gates.Reject samples where Gocc has fewer than three nodes, where the depth gradient across the occluding chain is below a calibrated minimum (to avoid degenerate cases where “in front” is ill-defined), or where multiple objects share a depth band within sensor noise. The output is a tuple (image,prompt,ground-truth set) per task that mirrors t...

work page

[44] [44]

Reject scenes with no high-confidence reflective surface

Reflective-surface detection.Run a material-classification head over SAM 3 masks to identify candidate reflective surfaces (polished wood, granite, marble, quartz, glass). Reject scenes with no high-confidence reflective surface

work page

[45] [45]

Reflection-region detection.Within each reflective surface mask, detect candidate reflection patches via a separate detector trained on natural reflections, returning per-patch bounding boxes {Bk}

work page

[46] [46]

Concretely, mirror each above-surface object’s bounding box across the surface plane and rank candidates by geometric overlap with Bk

3D-to-2D correspondence.For each detected reflection patch Bk, identify the 3D object whose silhouette and pose are consistent with the reflection geometry under a planar-mirror model of the surface. Concretely, mirror each above-surface object’s bounding box across the surface plane and rank candidates by geometric overlap with Bk. Establish (Bk ↔object ...

work page

[47] [47]

The ground-truth answer set is then {objecti : (B k, i)/∈S} , i.e

Counterfactual variant generation.Sample a subset S of detected reflection patches to be removed; perform inpainting (a diffusion-based inpainter conditioned on the surrounding surface texture) overS k∈S Bk to remove those reflections while preserving the rest. The ground-truth answer set is then {objecti : (B k, i)/∈S} , i.e. the objects whose reflection...

work page

[48] [48]

Validation gates.Run a perceptual-quality model and a reflection re-detector over the inpainted image; reject samples where (a) inpainting artifacts are detectable above threshold, or (b) the re-detector finds reflections at any of the removed locations. The reflection pipeline is the most failure-prone of the three because it composes three imperfect mod...

work page

[49] [49]

Object volumes.Use SAM 3 masks plus depth to estimate each object’s frustum-projected volumeV i and itsx-coordinatex i (lateral position)

work page

[50] [50]

Target ordering generation.Sample a target left-to-right ordering by permuting a subset of {xi} such that achieving the target requires either (T5) a unique single pairwise swap or (T6) a multi-step sequence of swaps

work page

[51] [51]

For T5, exhaustively enumerate single swaps and retain scenes admitting exactly one feasible solution

Collision-aware swap search.A pairwise swap (i, j) isfeasibleiff placing Vi at xj and Vj at xi produces no overlap with any other Vk. For T5, exhaustively enumerate single swaps and retain scenes admitting exactly one feasible solution. For T6, run BFS over swap sequences and retain the shortest valid sequence; tag scenes where no sequence exists asinfeas...

work page

[52] [52]

Prompt assembly.Generate the natural-language prompt by populating the T5/T6 template (Appendix E) with the sampled target ordering

work page

[53] [53]

I.2 Pilot-Batch Validation We do not attempt per-sample agreement against the existing 3,034-sample release

Validation gates.Reject scenes where (a) object volumes overlap in the original image (a perception failure upstream), or (b) the search returns multiple equally-shortest solutions for T6 (ambiguous ground truth). I.2 Pilot-Batch Validation We do not attempt per-sample agreement against the existing 3,034-sample release. The scenes admit many valid config...

work page