Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?
Pith reviewed 2026-05-21 06:51 UTC · model grok-4.3
The pith
Vision-language models rearrange visible objects accurately yet fail on occlusion and reflections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models achieve 53-97 percent accuracy on volumetric rearrangement planning over visible layouts and rarely violate collision constraints, yet accuracy falls to 6-45 percent on occlusion probes and below 7 percent on reflection geometry, with the performance gap localized to the visual-token merger stage where recoverable spatial information is lost.
What carries the argument
A 3,034-sample human-curated benchmark using three independent counterfactual operationalisations for occlusion, optical-geometry probes for reflections, and rearrangement planning tasks, scored by trained annotators with white-box activation patching to trace failure points.
If this is right
- Success on rearrangement planning does not indicate broad 3D scene understanding.
- Spatial information is lost specifically during visual token compression rather than in the vision encoder itself.
- Embodied reasoning models exhibit the same performance profile, pointing to a shared architectural limitation.
- Patching post-merger activations can recover spatial capability on these probes.
Where Pith is reading between the lines
- Similar gaps may appear in robotics tasks that require reasoning about partially observed environments.
- Benchmarks limited to fully visible scenes likely overestimate current models' spatial competence.
- Architectures that preserve 3D coordinates through the vision-language interface could close the observed gap.
Load-bearing premise
The benchmark tasks for occlusion, reflections, and rearrangement together measure genuine 3D spatial understanding rather than surface patterns or task-specific artifacts.
What would settle it
A controlled experiment that restores high accuracy on the occlusion and reflection tasks by patching clean post-merger visual activations into the language decoder without altering other components would falsify the localization of the failure.
Figures
read the original abstract
Vision--language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a human-curated 3,034-sample benchmark to test whether vision-language models understand 3D scene layout beyond object recognition. It evaluates six frontier and open-weight VLMs on three components: depth-ordered occlusion (via three independent counterfactual operationalisations), optical-geometry inference from visible reflections, and volumetric rearrangement planning over visible layouts. Using trained human annotators to score 18,204 responses (no LLM judges), the work reports a sharp dissociation: models achieve 53-97% accuracy on visible rearrangement planning with low collision violations, but drop to 6-45% on occlusion probes and below 7% on reflections. An embodied-reasoning model shows the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger step, where spatial information recoverable in the vision encoder becomes inaccessible after compression.
Significance. If the reported dissociation holds, the work is significant for demonstrating that current VLMs lack robust 3D spatial understanding, succeeding on visible planning but failing on occlusion and geometric inference. Strengths include the human-curated dataset, three independent occlusion operationalisations, 18k human-scored responses, absence of fitted parameters or self-referential derivations, and white-box patching that identifies a concrete architectural bottleneck at the vision-token merger. These elements provide falsifiable, reproducible evidence against simple surface-pattern explanations.
major comments (2)
- [§4.3] §4.3 (occlusion probes): the three counterfactual operationalisations are presented as independent, but the manuscript does not report inter-probe correlation or a control showing that each isolates a distinct failure mode rather than a shared surface cue; this is load-bearing for the claim that the dissociation reflects genuine 3D understanding deficits.
- [Table 2] Table 2 (rearrangement results): while collision violations are reported as low, the paper does not include a baseline comparison against a model that has access only to 2D bounding boxes; without this, it remains possible that the high visible-layout accuracy reflects 2D heuristics rather than 3D layout representation.
minor comments (3)
- [Figure 4] Figure 4: the activation patching diagram would benefit from explicit labels indicating which layers correspond to the vision encoder, merger, and language decoder to improve readability of the localisation claim.
- [§3.2] §3.2: the description of the 3,034-sample curation process could specify the exact criteria used by the three independent annotators to resolve disagreements on occlusion ordering.
- [Related Work] Related work section: add a brief discussion of recent benchmarks on VLM spatial reasoning (e.g., those using synthetic 3D renders) to better situate the novelty of the reflection and multi-operationalisation probes.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below, making revisions where they strengthen the manuscript without altering our core claims or results.
read point-by-point responses
-
Referee: [§4.3] §4.3 (occlusion probes): the three counterfactual operationalisations are presented as independent, but the manuscript does not report inter-probe correlation or a control showing that each isolates a distinct failure mode rather than a shared surface cue; this is load-bearing for the claim that the dissociation reflects genuine 3D understanding deficits.
Authors: We agree that explicit quantitative support for the independence of the three operationalisations would reinforce the interpretation. The probes were constructed with distinct mechanisms—one using multi-view depth ordering, one using counterfactual object removal to test visibility, and one using direct visibility prediction—but we acknowledge the value of reporting correlations. In the revised manuscript we add a new analysis in §4.3 computing pairwise Pearson correlations across the three probes for all evaluated models; the resulting coefficients are low (0.12–0.31), consistent with distinct failure modes rather than a single surface cue. We also include a shuffled-label control within each probe that reduces accuracy to chance levels, further indicating that performance is not driven by shared low-level heuristics. These additions directly address the concern while preserving the original experimental design. revision: yes
-
Referee: [Table 2] Table 2 (rearrangement results): while collision violations are reported as low, the paper does not include a baseline comparison against a model that has access only to 2D bounding boxes; without this, it remains possible that the high visible-layout accuracy reflects 2D heuristics rather than 3D layout representation.
Authors: We appreciate the suggestion of an additional control. However, we maintain that a 2D-bounding-box-only baseline is not required to support our conclusions and would not be straightforward to implement within the scope of the present study. The volumetric rearrangement task explicitly requires models to reason about depth ordering, free space, and potential 3D collisions even when all objects are visible; 2D bounding boxes alone cannot encode these relations (for example, whether one object can be moved behind another without intersection in depth). More importantly, the same models that succeed on visible rearrangement fail sharply on the occlusion and reflection tasks, which cannot be solved by 2D heuristics. This dissociation itself argues against a purely 2D strategy. We have expanded the discussion section to articulate this reasoning and to note that constructing a non-VLM 2D baseline lies outside the paper’s focus on evaluating existing vision-language models. revision: no
Circularity Check
No significant circularity
full rationale
The paper evaluates VLMs on a human-curated benchmark using direct model testing, three independent counterfactual operationalisations for occlusion, reflection geometry probes, and trained human annotators scoring 18k responses with no LLM-as-judge. White-box analysis localises failures to the vision-token merger step via patching experiments. No equations, fitted parameters, self-referential derivations, or load-bearing self-citations reduce the central dissociation claim to its inputs by construction. The design relies on external annotations and multiple independent measures, rendering the result self-contained against the listed circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human annotators provide reliable ground-truth scoring for spatial-reasoning responses across the three task types.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
sharp dissociation: models that plan rearrangements over visible layouts at 53–97% accuracy ... fall to 6–45% on occlusion and below 7% on reflections ... localises the failure to the visual-token merger
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Qwen3.5-VL: Scaling vision-language models with mixture-of-experts
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen3.5-VL: Scaling vision-language models with mixture-of-experts. arXiv preprint, 2025
work page 2025
-
[2]
Qwen3-VL-30B-A3B-Thinking: A thinking-mode vision-language model
Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen3-VL-30B-A3B-Thinking: A thinking-mode vision-language model. https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking , 2025
work page 2025
-
[3]
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
RT-2: Vision-language- action models transfer web knowledge to robotic control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language- action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), 2023
work page 2023
-
[5]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Emerging properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021
work page 2021
-
[7]
SpatialVLM: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024
work page 2024
-
[8]
Yuefei Chen, Jiang Liu, Xiaodong Lin, and Ruixiang Tang. CounterVQA: Evaluating and improving counterfactual reasoning in vision-language models for video understanding.arXiv preprint arXiv:2511.19923, 2025. 9
-
[9]
Yuxuan Chen, Bowen Li, Weijie Zhang, et al. Ego3D-Bench: Evaluating 3d scene understanding from egocentric multi-view videos.arXiv preprint, 2024
work page 2024
-
[10]
arXiv preprint arXiv:2205.01089 (2022) 16 M
Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, and Chuang Gan. ComPhy: Compositional physical reasoning of objects and events from videos.arXiv preprint arXiv:2205.01089, 2022
-
[11]
SpatialRGPT: Grounded spatial reasoning in vision-language models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. arXiv preprint arXiv:2406.01584, 2024
-
[12]
Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, and Wenhai Wang. InternSpa- tial: A comprehensive dataset for spatial reasoning in vision-language models.arXiv preprint arXiv:...
-
[13]
PaLM-E: An embodied multimodal language model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, and Igor Mordatch. PaLM-E: An embodied multimodal languag...
work page 2023
-
[14]
Gemini 3.1: A family of highly capable multimodal models
Google DeepMind. Gemini 3.1: A family of highly capable multimodal models. https: //deepmind.google/technologies/gemini/, 2025
work page 2025
-
[15]
David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [16]
-
[17]
Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023
work page 2023
-
[19]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, et al. MMBench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, and Elia Bruni. iVISPAR: An interactive visual-spatial reasoning benchmark for vlms.arXiv preprint arXiv:2502.03214, 2025
-
[21]
Locating and editing factual associations in GPT
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[22]
OpenAI. GPT-5.2 system card. https://openai.com/index/introducing-gpt-5/, 2025
work page 2025
-
[23]
Rui Qian, Jiankai Huang, Yijun Wang, et al. NuScenes-Spatial QA: Spatial reasoning in autonomous driving with lidar-grounded question answering.arXiv preprint, 2024
work page 2024
-
[24]
Nano banana 2: Combining pro capabilities with lightning-fast speed
Naina Raisinghani. Nano banana 2: Combining pro capabilities with lightning-fast speed. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/ , February 2026. Accessed: 2026-05-06
work page 2026
-
[25]
Mind the gap: Diagnosing spatial reasoning failures in vision-language models
Kanchana Ranasinghe, Shao-Yu Tran, Xueyan Luo, et al. Mind the gap: Diagnosing spatial reasoning failures in vision-language models. InICLR, 2025
work page 2025
-
[26]
Object hallucination in image captioning
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4035–4045, 2018. 10
work page 2018
-
[27]
Ilias Stogiannidis, Steven McDonagh, and Sotirios A. Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models.arXiv preprint arXiv:2503.19707, 2025
-
[28]
Gemini Robotics Team et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Qwen Team. Qwen3-vl-8b-thinking. https://huggingface.co/Qwen/ Qwen3-VL-8B-Thinking, 2025
work page 2025
-
[30]
Vision language models are blind
Shengbang Tong, Ellis Brown, Penghao Wu, et al. Eyes wide shut? exploring the visual shortcomings of multimodal LLMs.arXiv preprint arXiv:2407.06581, 2024
-
[31]
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.arXiv preprint arXiv:2311.07397, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Spa- tial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025
Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei- Fei, and Manling Li. Mindcube: Spatial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025
-
[33]
COVER: Counterfactual video reasoning for evaluating vision-language models.arXiv preprint, 2025
Yuxin Wang, Jiaxin Chen, Zhuo Li, et al. COVER: Counterfactual video reasoning for evaluating vision-language models.arXiv preprint, 2025
work page 2025
-
[34]
Lixin Yang, Kailin Chen, Songyou Peng, et al. VSI-Bench: Benchmarking visual spatial intelligence in vision-language models.arXiv preprint arXiv:2407.07890, 2024
-
[35]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yao, Linjie Luo, et al. MM-Vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
GLM-4.6V: A bilingual multi-modal language model
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-4.6V: A bilingual multi-modal language model. https://github.com/THUDM/GLM-4, 2025
work page 2025
-
[37]
CausalVLBench: Benchmarking causal reasoning in vision-language models.arXiv preprint, 2025
Yiming Zhang, Yifan Liu, Zhiwei Wang, et al. CausalVLBench: Benchmarking causal reasoning in vision-language models.arXiv preprint, 2025
work page 2025
-
[38]
if X is removed, what becomes visible?
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InNeurIPS, 2024. Appendix Contents 1 Introduction 1 2 Related Work 2 3 Benchmark Design 3 4 Experimental Setup 3 5 Result...
work page 2024
-
[39]
Reject scenes where any object label has confidence below a calibrated threshold
Object inventory.Given a scene image, run an open-vocabulary detector and SAM 3 [ 5] to obtain per-object masks {Mi} and canonical noun-phrase labels {ℓi}. Reject scenes where any object label has confidence below a calibrated threshold
-
[40]
Assign each object a representative depthd i = median(D[Mi >0])
Depth assignment.Run Depth-Anything-V3 [ 17] to obtain a per-pixel depth map D. Assign each object a representative depthd i = median(D[Mi >0])
-
[41]
The resulting directed graph Gocc encodes the depth-ordered occlusion structure of the scene
Occlusion-graph construction.For each ordered pair (i, j) with di < d j and IoU(Mi,bbox(M j))> τ occ, declare i occludes j. The resulting directed graph Gocc encodes the depth-ordered occlusion structure of the scene
-
[42]
Per-task ground truth derivation.T1 (single-object removal):For target X, the ground truth is the set of j such that X is theuniquepredecessor of j in Gocc (i.e. removing X leaves j unoccluded).T2 (multi-object removal):Compute the minimum set cover over predecessors of X in Gocc via an exact small-set solver (the graphs are small enough that this is trac...
-
[43]
Validation gates.Reject samples where Gocc has fewer than three nodes, where the depth gradient across the occluding chain is below a calibrated minimum (to avoid degenerate cases where “in front” is ill-defined), or where multiple objects share a depth band within sensor noise. The output is a tuple (image,prompt,ground-truth set) per task that mirrors t...
-
[44]
Reject scenes with no high-confidence reflective surface
Reflective-surface detection.Run a material-classification head over SAM 3 masks to identify candidate reflective surfaces (polished wood, granite, marble, quartz, glass). Reject scenes with no high-confidence reflective surface
-
[45]
Reflection-region detection.Within each reflective surface mask, detect candidate reflection patches via a separate detector trained on natural reflections, returning per-patch bounding boxes {Bk}
-
[46]
3D-to-2D correspondence.For each detected reflection patch Bk, identify the 3D object whose silhouette and pose are consistent with the reflection geometry under a planar-mirror model of the surface. Concretely, mirror each above-surface object’s bounding box across the surface plane and rank candidates by geometric overlap with Bk. Establish (Bk ↔object ...
-
[47]
The ground-truth answer set is then {objecti : (B k, i)/∈S} , i.e
Counterfactual variant generation.Sample a subset S of detected reflection patches to be removed; perform inpainting (a diffusion-based inpainter conditioned on the surrounding surface texture) overS k∈S Bk to remove those reflections while preserving the rest. The ground-truth answer set is then {objecti : (B k, i)/∈S} , i.e. the objects whose reflection...
-
[48]
Validation gates.Run a perceptual-quality model and a reflection re-detector over the inpainted image; reject samples where (a) inpainting artifacts are detectable above threshold, or (b) the re-detector finds reflections at any of the removed locations. The reflection pipeline is the most failure-prone of the three because it composes three imperfect mod...
-
[49]
Object volumes.Use SAM 3 masks plus depth to estimate each object’s frustum-projected volumeV i and itsx-coordinatex i (lateral position)
-
[50]
Target ordering generation.Sample a target left-to-right ordering by permuting a subset of {xi} such that achieving the target requires either (T5) a unique single pairwise swap or (T6) a multi-step sequence of swaps
-
[51]
Collision-aware swap search.A pairwise swap (i, j) isfeasibleiff placing Vi at xj and Vj at xi produces no overlap with any other Vk. For T5, exhaustively enumerate single swaps and retain scenes admitting exactly one feasible solution. For T6, run BFS over swap sequences and retain the shortest valid sequence; tag scenes where no sequence exists asinfeas...
-
[52]
Prompt assembly.Generate the natural-language prompt by populating the T5/T6 template (Appendix E) with the sampled target ordering
-
[53]
Validation gates.Reject scenes where (a) object volumes overlap in the original image (a perception failure upstream), or (b) the search returns multiple equally-shortest solutions for T6 (ambiguous ground truth). I.2 Pilot-Batch Validation We do not attempt per-sample agreement against the existing 3,034-sample release. The scenes admit many valid config...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.