DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Anqi Li; Jianze Li; Kai Liu; Xinrui Shi; Yulun Zhang; Ziqing Zhang

arxiv: 2605.26038 · v1 · pith:NHHHA7W6new · submitted 2026-05-25 · 💻 cs.CV · cs.AI

Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang

Pith reviewed 2026-06-29 22:51 UTC · model grok-4.3

keywords: dense-scene reasoning · vision-language models · supervised fine-tuning · grounded reasoning · DRBench · lightweight models · multi-step inference · visual grounding

The pith

Four-stage supervision lets a 3B vision-language model outperform a frozen 32B model on dense-scene reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Lightweight vision-language models often produce fluent answers that ignore the actual objects and relations in cluttered images. The paper introduces DRBench, a new benchmark with over 14,000 questions that require grounding multiple entities across three layers of progressive reasoning. It then presents DRScaffold, which breaks supervision into four causally ordered stages so each reasoning step is tied explicitly to visual content. When applied to existing small models, this yields large gains on the new benchmark while keeping general performance intact. The standout result is that the trained 3B model beats the much larger 32B model on dense-scene tasks.

Core claim

We introduce DRBench, a benchmark of 14,573 questions across 2,943 images spanning five task categories and three reasoning layers, and DRScaffold, a supervised fine-tuning framework that decomposes the target into four causally ordered stages. These stages enforce explicit grounding between reasoning steps and visual entities without any architectural change. On three lightweight VLMs, DRScaffold produces substantial gains on DRBench while preserving or improving results on general benchmarks; notably, the 3B model trained this way surpasses the frozen 32B model on DRBench.

What carries the argument

DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages enforcing grounded reasoning.

If this is right

Structured supervision can close much of the performance gap between small and large models on dense-scene tasks.
The same lightweight backbone can be reused across general and dense-reasoning workloads without trade-offs.
No model architecture changes are required to obtain the reported gains.
The approach generalizes across at least three different lightweight vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the causal stages truly produce grounded chains, the same decomposition could be tested on non-visual reasoning benchmarks that also suffer from unanchored steps.
Success here suggests that targeted supervision schedules may reduce reliance on scale for other multi-step grounding problems in perception.
The benchmark categories could be used to diagnose which specific layer of reasoning (object, attribute, or relation) still limits current models.

Load-bearing premise

The four causally ordered supervision stages actually enforce explicit grounding between reasoning steps and visual entities rather than simply supplying extra training signal.

What would settle it

An ablation that removes the causal ordering of the four stages or randomizes their sequence while keeping total training data fixed, then measures whether DRBench scores remain unchanged.
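
The ordering ablation described above could be set up roughly as follows. This is a minimal sketch under stated assumptions: the stage labels (`stage_a` to `stage_d`), the `build_target` helper, and the tagged-line target format are hypothetical, since this summary neither names the four stages nor gives the paper's target format. The sketch only illustrates that shuffling the stage order leaves the supervision content, and therefore the whitespace-token count, unchanged.

```python
import random

# Hypothetical stage labels: the summary does not name the four stages.
CANONICAL_ORDER = ["stage_a", "stage_b", "stage_c", "stage_d"]


def build_target(stages, order):
    """Concatenate stage texts in the given order, one tagged line per stage."""
    return "\n".join(f"[{name}] {stages[name]}" for name in order)


def shuffled_order(rng):
    """Return a random permutation that differs from the canonical order."""
    order = CANONICAL_ORDER[:]
    while order == CANONICAL_ORDER:
        rng.shuffle(order)
    return order


def make_ablation_pair(stages, seed=0):
    """Build the ordered target and an order-randomized control for one sample."""
    rng = random.Random(seed)
    ordered = build_target(stages, CANONICAL_ORDER)
    control = build_target(stages, shuffled_order(rng))
    return ordered, control


# Placeholder stage texts, for illustration only.
example = {
    "stage_a": "placeholder text for the first stage",
    "stage_b": "placeholder text for the second stage",
    "stage_c": "placeholder text for the third stage",
    "stage_d": "placeholder text for the fourth stage",
}
ordered, control = make_ablation_pair(example, seed=0)
```

Because both targets contain exactly the same lines, any difference in DRBench scores between models trained on `ordered` and `control` targets would be attributable to ordering rather than to the amount of supervision.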

Figures

Figures reproduced from arXiv: 2605.26038 by Anqi Li, Jianze Li, Kai Liu, Xinrui Shi, Yulun Zhang, Ziqing Zhang.

**Figure 2.** Qwen2.5-VL-3B vs 72B. However, these advances share a common blind spot: dense-scene reasoning. In such settings, multiple objects, attributes, and relations are tightly coupled, requiring multi-step reasoning over visual dependencies, which are ubiquitous in autonomous driving, robotic manipulation, and surveillance. As shown in …

**Figure 3.** Example DRBench annotation, with an image, question, answer, and structured intermediate …

**Figure 4.** Entity and relation statistics in DRBench. In (a), warm colors indicate indoor Hypersim …

**Figure 5.** Question taxonomy and category distribution in DRBench. Each category is illustrated …

**Figure 6.** Construction and quality-control pipeline of DRBench. The pipeline includes two quality …

**Figure 7.** Overview of the DRScaffold schema design. The supervision target is decomposed into …

**Figure 8.** Qualitative comparison between base models and their DRScaffold-tuned counterparts on …

**Figure 9.** Relationship between average output length and accuracy across training methods.

**Figure 10.** Stage-wise diagnosis of DRScaffold training with Qwen2.5-VL-3B. In (b), annotated …

**Figure 11.** Visualization of outputs at each stage of the training process. Correct answers are …

**Figure 12.** Representative HyperSim indoor preview frames.

**Figure 13.** Representative Cityscapes outdoor preview frames.

**Figure 14.** HyperSim ai_001_001: rendered bathroom with bathtub, sink, shelf, and towel ring. Accepted | Perception | open-ended. Q: What color is the towel hanging on the ring to the right of the bathtub and below the shelf? Answer: White. Key objects: towel, ring, bathtub, shelf. Scene graph: towel→attached_to→ring; ring→right_of→bathtub; ring→below→shelf. Reasoning: [Perception] Detect a white towel on a silver ring o…

**Figure 15.** Cityscapes berlin_000272_000019: European urban street with parked vehicles and pedestrians. Accepted | Perception | single-choice. Q: What is the color of the vehicle parked directly behind the silver hatchback on the right side of the road? A. Black B. Blue C. Red D. White. Answer: A. Key objects: silver hatchback, bicycle, right road curb. Scene graph: silver hatchback→right_of→road; bicycle→behind→silver …

**Figure 16.** Left: ai_001_001 (bathroom). Right: berlin_000272 (street). Both yield questions rejected by Step 2. Rejected (Step 2). Q (indoor): Which surface receives the most sunlight during daytime? Answer: Floor. Reason: A static interior photo cannot reveal the sun’s trajectory or time-of-day lighting; the answer requires multi-temporal reasoning that no single image can support.

**Figure 17.** HyperSim ai_001_001. (a) RGB image. (b) Semantic GT (NYU-40 overlay). Rejected (Step 3). Q: Which item is located on the sink ledge and contains liquid? A. Blue bottle B. Red toothbrush C. Clear bottle with blue liquid D. White soap dish. Answer: C. Reason: HyperSim instance annotations show no bottle or liquid-containing object on the sink ledge in this scene. The answer is factually unsupported by GT label…

**Figure 18.** Cityscapes frankfurt_000001_032942 (val). (a) RGB image. (b) Semantic GT (Cityscapes color map). Rejected (Step 3). Q (street): What color is the bicycle parked on the sidewalk to the right of the tram? A. Red B. Black C. Blue D. Silver. Answer: B. Reason: The GT segmentation map shows the area to the right of the tram is classified as sidewalk and person – no bicycle instance is annotated there. The referri…
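
The scene-graph strings in the Figure 14 and 15 captions, and the Step 3 rejections in Figures 17 and 18, suggest a simple consistency check: every entity a question relies on should have a matching ground-truth label. The sketch below parses the arrow notation from the Figure 14 caption into triples and lists entities with no label. The `Triple` type, both helper functions, and the `labels` set are hypothetical illustrations; how the paper's pipeline actually performs the Step 3 check is not described in this summary.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def parse_scene_graph(text: str) -> List[Triple]:
    """Parse 'a→rel→b; c→rel→d' into a list of (subject, relation, object) triples."""
    triples = []
    for chunk in text.split(";"):
        parts = [p.strip() for p in chunk.split("→")]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"malformed triple: {chunk!r}")
        triples.append(Triple(*parts))
    return triples


def ungrounded_entities(triples, annotated_labels):
    """Return graph entities that have no matching ground-truth label."""
    entities = {t.subject for t in triples} | {t.obj for t in triples}
    return sorted(entities - set(annotated_labels))


# Scene graph copied from the Figure 14 caption.
graph = parse_scene_graph(
    "towel→attached_to→ring; ring→right_of→bathtub; ring→below→shelf"
)
# Hypothetical label set for illustration; not taken from the dataset files.
labels = {"towel", "ring", "bathtub", "shelf", "sink"}
missing = ungrounded_entities(graph, labels)
```

An empty `missing` list corresponds to an accepted question like the one in Figure 14; a question about a bottle on the sink ledge, as in Figure 17, would surface the unlabelled entity instead.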

Original abstract

Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs demonstrate substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, demonstrating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at https://github.com/irene-shi/DRScaffold.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 3B-over-32B result on the new DRBench is the headline claim, but the abstract supplies no ablations separating the four-stage causal ordering from simply adding more supervision signal.

The one or two things to know: DRScaffold gets a 3B VLM to beat a frozen 32B on their DRBench for dense reasoning, but the abstract gives no ablations showing that the four causally ordered stages are necessary rather than just more supervision.

What is new is the benchmark itself (14,573 questions over 2,943 images in five categories across three reasoning layers) and the DRScaffold framework, which splits supervision into four stages without changing the model architecture. The experiments show gains across three lightweight models on DRBench and no loss on general benchmarks. Releasing the code is a plus.

The main concern is that we cannot tell if the causal ordering enforces the claimed visual grounding or if the result comes from the volume and format of the extra training signal. The paper does not appear to include controls that hold total tokens constant while varying only the decomposition. That leaves the interpretation of scale substitution open.

This paper targets people working on lightweight vision-language models for applications like robotics where dense scenes matter. Readers looking for new benchmarks in multi-step visual reasoning or training methods for small models will find value here. The empirical focus and released artifacts make it worth a serious referee's time.

I would send it to peer review with requests for the missing ablations.

Referee Report

3 major / 1 minor

Summary. The paper introduces DRBench, a benchmark of 14,573 questions across 2,943 images spanning five task categories and three reasoning layers, and proposes DRScaffold, a supervised fine-tuning framework that decomposes targets into four causally ordered stages to enforce explicit grounding between reasoning steps and visual entities/relations in lightweight VLMs. Experiments on three lightweight models report substantial gains on DRBench, with Qwen2.5-VL-3B fine-tuned via DRScaffold surpassing the frozen Qwen2.5-VL-32B while preserving or improving general-purpose benchmark performance; code and models are released.

Significance. If the central empirical result holds under controlled conditions, the demonstration that structured supervision can substitute for a substantial fraction of model scale in dense-scene reasoning would be a meaningful contribution to efficient VLM development. The public release of code and models is a clear strength that supports reproducibility.

major comments (3)

[Experiments] The claim that the four causally ordered stages produce explicit step-to-entity grounding (rather than simply supplying additional training signal) is load-bearing for the substitution-for-scale interpretation, yet the experimental section provides no ablation that holds total supervision tokens, question distribution, and format fixed while varying only the causal decomposition. Without this control, the reported 3B > 32B result on DRBench remains underdetermined.
[Experiments] The comparison between the fine-tuned 3B model and the frozen 32B model does not report whether the 32B model was evaluated under identical prompting or whether any additional inference-time scaffolding was applied; this detail is required to interpret the scale-substitution result.
[DRScaffold framework] Implementation details of the four supervision stages (exact prompt templates, loss weighting across stages, and how causal ordering is enforced during data construction) are not supplied, preventing independent verification that the reported gains arise from the proposed mechanism.
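
The control requested in the first major comment can be phrased as a check on two training arms: they should share the question-category mix and the supervision token budget, and differ only in how the target is decomposed. The sketch below is one hedged way to express that check. The sample schema (`category`, `target`), the whitespace token count, and both function names are assumptions made for illustration; a real comparison would use the model's own tokenizer and the paper's data format.

```python
from collections import Counter


def arm_summary(samples):
    """Summarize one training arm: token budget and question-category mix."""
    # Whitespace splitting stands in for the model tokenizer (assumption).
    tokens = sum(len(s["target"].split()) for s in samples)
    categories = Counter(s["category"] for s in samples)
    return tokens, categories


def is_matched_control(arm_a, arm_b, tolerance=0.0):
    """True if both arms share the category mix and token budget within tolerance."""
    tokens_a, cats_a = arm_summary(arm_a)
    tokens_b, cats_b = arm_summary(arm_b)
    if cats_a != cats_b:
        return False
    return abs(tokens_a - tokens_b) <= tolerance * max(tokens_a, tokens_b, 1)


# Toy arms: same content, staged versus flattened into a single line.
staged_arm = [{"category": "Perception", "target": "s1 towel\ns2 white"}]
flat_arm = [{"category": "Perception", "target": "towel white s1 s2"}]
longer_arm = [{"category": "Perception", "target": "towel white s1 s2 extra words"}]

matched = is_matched_control(staged_arm, flat_arm)
unmatched = is_matched_control(staged_arm, longer_arm)
```

Only arm pairs that pass such a check would isolate the effect of the decomposition from the effect of extra supervision volume.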

minor comments (1)

[Abstract] The abstract states 'substantial gains' without quantifying effect sizes or reporting statistical significance; adding these numbers would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify key areas for strengthening experimental controls and reproducibility, which we address point by point below.

Referee: [Experiments] The claim that the four causally ordered stages produce explicit step-to-entity grounding (rather than simply supplying additional training signal) is load-bearing for the substitution-for-scale interpretation, yet the experimental section provides no ablation that holds total supervision tokens, question distribution, and format fixed while varying only the causal decomposition. Without this control, the reported 3B > 32B result on DRBench remains underdetermined.

Authors: We agree that an ablation holding total supervision tokens, question distribution, and format fixed while varying only the causal decomposition would more cleanly isolate whether gains arise from explicit grounding rather than additional training signal volume. Our current experiments compare DRScaffold to standard SFT baselines with matched data volume, but do not include this precise control. We will add the requested ablation in the revised manuscript.

revision: yes

Referee: [Experiments] The comparison between the fine-tuned 3B model and the frozen 32B model does not report whether the 32B model was evaluated under identical prompting or whether any additional inference-time scaffolding was applied; this detail is required to interpret the scale-substitution result.

Authors: The frozen 32B model was evaluated using identical zero-shot prompting with no additional inference-time scaffolding. We will explicitly state this evaluation protocol in the revised experimental setup section.

revision: yes

Referee: [DRScaffold framework] Implementation details of the four supervision stages (exact prompt templates, loss weighting across stages, and how causal ordering is enforced during data construction) are not supplied, preventing independent verification that the reported gains arise from the proposed mechanism.

Authors: We agree these details are necessary for independent verification. We will add the exact prompt templates, loss weighting scheme across stages, and data construction procedure for enforcing causal ordering to the revised manuscript (expanding Section 3 and the appendix).

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on direct model comparisons, not self-referential definitions or fits.

The paper introduces DRBench and the DRScaffold training framework as an empirical intervention, then reports performance numbers from fine-tuning lightweight VLMs and comparing them to baselines including a larger frozen model. No equations, parameter fits, or derivations are present in the provided text; the central claim (3B model surpassing 32B on DRBench) is a straightforward experimental outcome rather than a quantity defined in terms of itself or recovered from a self-citation chain. None of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that decomposing supervision into four causally ordered stages produces grounded reasoning; this is treated as a domain assumption about how training signals affect model behavior rather than a derived result.

axioms (1)

Domain assumption: Decomposing the supervision target into four causally ordered stages enforces grounded reasoning without architectural modification.
This premise is invoked to explain why the method improves performance on dense-scene tasks.

pith-pipeline@v0.9.1-grok · 5773 in / 1235 out tokens · 40199 ms · 2026-06-29T22:51:14.754053+00:00 · methodology

Prompt fragments extracted from the paper's appendix

Do NOT ask about objects that are too small, heavily occluded, or in extreme shadow.
Answers must be uniquely supportable by unambiguous visible evidence.
Prefer large, salient anchors (furniture, fixtures occupying enough pixels).

Question Design (5 Categories; at least 2 per category)

Perception – e.g. “Which objects near the sink are blue?”; format: multiple-choice or multi-select
Spatial Reasoning – e.g. “What is the direction of the plant relative to the stack of books?”; must involve spatial relation, may involve simple physics
Affordance Reasoning – e.g. “Where is the best place to put a towel for easy access after washing hands?”; must rely on object relationships
Anomaly Detection – e.g. “Is there any potentially unsafe or unreasonable object placement around the bathtub?”; open-ended but MUST be grounded in visible evidence
False Premise Rejection – inquire about an object or spatial relationship completely absent from the image as if it were present; prefer objects not visible but logically plausible in the scene (e.g. accessories on a visible object); use referring expressions tied to what is visible so the model must search the image before concluding the part/object is a...
Extracted example question: “What color is the luggage rack on the roof of the car?”

Scene / room-type mismatch: question presupposes an object a reasonable person would NOT expect in this kind of room
Unreasonable / non-commonsense reasoning: depends on physical or numerical assumptions real rooms generally do not support (e.g. inferring direction of water/ball flow from floor slope; inferring precise temperature, airflow, or sub-visible details from a normal photograph)
Internally inconsistent: question premise contradicts itself or the given options/answer.
If NONE of the above clearly applies, do NOT mark as bad; leave it for visual verification. Default: bad_question=false.
Output STRICT JSON: {“results”: [{“id”, “bad_question”: bool, “reason”: “...”}]}

For Cityscapes, the structure is the same as Prompt 3b with some ...

Ground everything in what you SEE or what the GT labels confirm. If the answer is not visually annotation-supported, fix it (b) or delete (c).
Delete if the question references objects/settings clearly absent from the image or GT labels, or relies on unsupported assumptions (precise slope/flow on a level floor; unreadable text; micro-physics).
Camera may be tilted; mentally re-level the scene first.
For multi-choice, keep answer as letters like “A” / “B” / “A,B”.
Only change the answer if the fixed answer is clearly correct and supported by image + GT.
You may only change question / options / answer. Do not output id / type / category / etc.

Cityscapes additions (same base structure; additional hard rules for street scenes):

For questions depending on a specific object/instance, DELETE or MODIFY when:
- the target is TOO SMALL or too far away to be reliably identified;
- the target’s SPATIAL LOCATION is ambiguous (e.g. “the car on the right” when several exist);
- the target’s COLOR / identity is not clearly distinguishable (harsh lighting, shadow, motion blur, JPEG).
If b...
Do NOT discuss camera tilt; treat the view as the driver would see it.
Output STRICT JSON only (all input ids must appear exactly once): {“results”: [{“id”, “bad_question”: bool, “is_modified”: bool, “question”, “answer”, “options”, “reason”}]}

A.5 Step 4: Structured Field Generation and Consistency Verification

After Steps 2–3 some questions have their a...

[1] Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-Mini technical report: Compact yet powerful multimodal language models via Mixture-of-LoRAs. arXiv preprint arXiv:2503.01743, 2025.

[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

[4] Kesheng Chen, Yamin Hu, Qi Zhou, Zhenqian Zhu, and Wenjian Luo. CDH-Bench: A commonsense-driven hallucination benchmark for evaluating visual fidelity in vision-language models. arXiv preprint arXiv:2603.27982, 2026.

[5] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? NeurIPS, 37, 2024.

[6] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

[7] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.

[8] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

[10] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. In ECCV, 2024.

[11] Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. In ICML, 2023.

[12] Akshay Gopalkrishnan, Ross Greer, and Mohan Trivedi. Multi-frame, lightweight & efficient vision-language models for question answering in autonomous driving. arXiv preprint arXiv:2403.19838, 2024.

[13] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.

[14] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024.

[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[16] Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. In ACL, 2023.

[17] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In ACL Findings, 2023.

[18] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.

[19] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.

[20] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[21] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024.

[22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2023.

[23] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In ECCV, 2024.

[24] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.

[25] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 35, 2022.

[26] Yijia Luo, Yulin Song, Xingyao Zhang, Jiaheng Liu, Weixun Wang, GengRu Chen, Wenbo Su, and Bo Zheng. Deconstructing long chain-of-thought: A structured reasoning optimization framework for long CoT distillation. arXiv preprint arXiv:2503.16385, 2025.

[27] Amir M Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, et al. A comprehensive survey on knowledge distillation. arXiv preprint arXiv:2503.12067, 2025.

[28] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707, 2023.

[29] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021.

[30] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.

[31] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

[32] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944, 2023.

[33] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.

[34] Wayve AI. LINGO-2: Driving with natural language. https://wayve.ai/thinking/lingo-2-driving-with-language, 2024.

[35] xAI. RealWorldQA. https://huggingface.co/datasets/xai-org/RealworldQA, 2024.

[36] Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step. In ICCV, 2025.

[37] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In CVPR, 2024.

[38] Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. NeurIPS, 36, 2023.

Table of Contents

1. Dataset Construction
(a) Source Im...

[39] Do NOT ask about objects that are too small, heavily occluded, or in extreme shadow

[40] Answers must be uniquely supportable by unambiguous visible evidence

[41] Prefer large, salient anchors (furniture, fixtures occupying enough pixels).

Question Design (5 Categories; at least 2 per category)

[42] Perception: e.g. “Which objects near the sink are blue?”; format: multiple-choice or multi-select

[43] Spatial Reasoning: e.g. “What is the direction of the plant relative to the stack of books?”; must involve spatial relation, may involve simple physics

[44] Affordance Reasoning: e.g. “Where is the best place to put a towel for easy access after washing hands?”; must rely on object relationships

[45] Anomaly Detection: e.g. “Is there any potentially unsafe or unreasonable object placement around the bathtub?”; open-ended but MUST be grounded in visible evidence

[46] False Premise Rejection: inquire about an object or spatial relationship completely absent from the image as if it were present; prefer objects not visible but logically plausible in the scene (e.g. accessories on a visible object); use referring expressions tied to what is visible so the model must search the image before concluding the part/object is a...

Question quoted in this entry: “What color is the luggage rack on the roof of the car?”
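The "at least 2 per category" requirement in the question-design rules can be checked mechanically once questions are generated. The following is a minimal sketch, assuming each generated question is a dictionary with a `category` field; that field name and the helper `category_shortfall` are hypothetical and do not come from the paper.

```python
from collections import Counter

# The five question categories named in the prompt above.
CATEGORIES = (
    "Perception",
    "Spatial Reasoning",
    "Affordance Reasoning",
    "Anomaly Detection",
    "False Premise Rejection",
)


def category_shortfall(questions, min_per_category=2):
    """Return {category: questions still missing} for under-filled categories."""
    counts = Counter(q["category"] for q in questions)
    return {
        name: min_per_category - counts[name]
        for name in CATEGORIES
        if counts[name] < min_per_category
    }


# Illustrative input: two questions per category, then one removed.
questions = [{"category": name} for name in CATEGORIES for _ in range(2)]
questions.pop()  # drops one False Premise Rejection question
print(category_shortfall(questions))  # {'False Premise Rejection': 1}
```

An empty result means every category meets the minimum, so a generation batch could be accepted or re-prompted based on this dictionary.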

[47] Scene / room-type mismatch: question presupposes an object a reasonable person would NOT expect in this kind of room

[48] Unreasonable / non-commonsense reasoning: depends on physical or numerical assumptions real rooms generally do not support (e.g. inferring direction of water/ball flow from floor slope; inferring precise temperature, airflow, or sub-visible details from a normal photograph)

[49] Internally inconsistent: question premise contradicts itself or the given options/answer.

If NONE of the above clearly applies, do NOT mark as bad; leave it for visual verification. Default: bad_question=false.

Output STRICT JSON: {“results”: [{“id”, “bad_question”: bool, “reason”: “...”}]}

For Cityscapes, the structure is the same as Prompt 3b with some ...
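The strict JSON reply requested by this filtering prompt lends itself to a small parser that rejects malformed output. This is a sketch under stated assumptions: the reply is a JSON string whose `results` entries each carry `id`, `bad_question`, and `reason`, as in the schema quoted above. The function name `parse_filter_output` and the sample reply are hypothetical, not the authors' code or data.

```python
import json


def parse_filter_output(raw):
    """Map each question id to its bad_question flag, rejecting malformed replies."""
    results = json.loads(raw)["results"]
    verdicts = {}
    for item in results:
        flag = item["bad_question"]
        # The prompt declares bad_question as a bool, so strings like "false" are rejected.
        if not isinstance(flag, bool):
            raise ValueError(f"bad_question must be a boolean, got {flag!r}")
        if not isinstance(item["reason"], str):
            raise ValueError("reason must be a string")
        verdicts[item["id"]] = flag
    return verdicts


# Made-up reply for illustration only.
reply = (
    '{"results": ['
    '{"id": "q1", "bad_question": false, "reason": "clear"}, '
    '{"id": "q2", "bad_question": true, "reason": "room-type mismatch"}]}'
)
flagged = [qid for qid, bad in parse_filter_output(reply).items() if bad]
print(flagged)  # ['q2']
```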

[50] Ground everything in what you SEE or what the GT labels confirm. If the answer is not visually annotation-supported, fix it (b) or delete (c)

[51] Delete if the question references objects/settings clearly absent from the image or GT labels, or relies on unsupported assumptions (precise slope/flow on a level floor; unreadable text; micro-physics)

[52] Camera may be tilted; mentally re-level the scene first

[53] For multi-choice, keep answer as letters like “A” / “B” / “A,B”
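The letter format for multi-choice answers can be enforced with a small normalizer. This is a sketch assuming options are labeled consecutively from "A"; the helper `normalize_choice_answer` is hypothetical, and sorting the letters is a choice made here for stable comparison, not something the prompt specifies.

```python
def normalize_choice_answer(answer, num_options):
    """Return an answer such as ' b, a' in canonical form 'A,B', or raise ValueError."""
    valid = [chr(ord("A") + i) for i in range(num_options)]
    letters = [part.strip().upper() for part in answer.split(",")]
    if any(letter not in valid for letter in letters):
        raise ValueError(f"answer {answer!r} must use only letters {valid}")
    if len(set(letters)) != len(letters):
        raise ValueError(f"answer {answer!r} repeats a letter")
    return ",".join(sorted(letters))


print(normalize_choice_answer(" b, a", 4))  # A,B
```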

[54] Only change the answer if the fixed answer is clearly correct and supported by image + GT

[55] You may only change question / options / answer. Do not output id / type / category / etc.

Cityscapes additions (same base structure; additional hard rules for street scenes):

[56] For questions depending on a specific object/instance, DELETE or MODIFY when:
- the target is TOO SMALL or too far away to be reliably identified;
- the target’s SPATIAL LOCATION is ambiguous (e.g. “the car on the right” when several exist);
- the target’s COLOR / identity is not clearly distinguishable (harsh lighting, shadow, motion blur, JPEG).

If b...

[57] Do NOT discuss camera tilt; treat the view as the driver would see it.

Output STRICT JSON only (all input ids must appear exactly once): {“results”: [{“id”, “bad_question”: bool, “is_modified”: bool, “question”, “answer”, “options”, “reason”}]}

A.5 Step 4: Structured Field Generation and Consistency Verification

After Steps 2–3 some questions have their a...
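The output contract of the verification prompt quoted above (strict JSON, every input id exactly once) can be checked before results are merged back into the dataset. This is a sketch assuming all seven listed keys are required in each result; the helper `check_verification_output` and the sample record are hypothetical and serve only as illustration.

```python
import json
from collections import Counter

# Keys listed in the prompt's output schema; treating all as required is an assumption.
REQUIRED_KEYS = ("id", "bad_question", "is_modified", "question", "answer", "options", "reason")


def check_verification_output(raw, input_ids):
    """Return a list of contract violations; an empty list means the reply is usable."""
    results = json.loads(raw)["results"]
    seen = Counter(item.get("id") for item in results)
    problems = []
    for qid in input_ids:
        if seen[qid] != 1:
            problems.append(f"id {qid!r} appears {seen[qid]} times")
    for qid in seen:
        if qid not in input_ids:
            problems.append(f"unexpected id {qid!r}")
    for item in results:
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append(f"id {item.get('id')!r} is missing {missing}")
    return problems


# Made-up record for illustration only.
reply = json.dumps({"results": [{
    "id": "q1", "bad_question": False, "is_modified": False,
    "question": "Which objects near the sink are blue?",
    "answer": "A,B", "options": ["A. cup", "B. towel", "C. soap"],
    "reason": "supported by image",
}]})
print(check_verification_output(reply, ["q1", "q2"]))  # ["id 'q2' appears 0 times"]
```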