Value-Guided Iterative Refinement and the DIQ-H Benchmark for Evaluating VLM Robustness

Hanwen Wan; Xiaoqiang Ji; Yixuan Deng; Zexin Lin

arxiv: 2512.03992 · v2 · submitted 2025-12-03 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-17 02:08 UTC · model grok-4.3

keywords: vision-language models · robustness evaluation · degraded image quality · hallucinations · iterative refinement · value alignment · error propagation · embodied AI

The pith

A benchmark of continuously degraded image sequences shows that vision-language models accumulate hallucinations and value errors over time, while a refinement framework lifts annotation accuracy from 72.2 to 83.3 percent, a relative gain of about 15 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that vision-language models suffer from error propagation and value misalignment when faced with ongoing visual degradations, which current static benchmarks do not capture. It creates the DIQ-H benchmark to simulate sequences of motion blur, sensor noise, and compression artifacts and to measure their long-term effects on model outputs. The work also introduces a value-guided iterative refinement process that uses lightweight models to improve the quality of ground-truth annotations, raising accuracy from 72.2 percent to 83.3 percent. Readers interested in safe embodied AI would care because these findings point to the need for evaluation methods that reflect real deployment conditions where small visual problems can lead to compounding failures.

Core claim

The paper claims that the Degraded Image Quality Leading to Hallucinations (DIQ-H) benchmark is the first to assess vision-language models under adversarial visual conditions in continuous sequences, by simulating real-world stressors and tracking both error propagation and long-term value consistency. It further claims that the Value-Guided Iterative Refinement framework automates the generation of high-quality, ethically aligned ground-truth annotations and raises annotation accuracy from 72.2 to 83.3 percent, a 15.3 percent relative improvement.
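
The headline number is a relative gain, not a gain in percentage points, and the two are easy to confuse. The following minimal sketch recomputes both from the two accuracies quoted in the abstract; the function names are illustrative and do not come from the paper.

```python
def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Gain expressed as a percentage of the starting accuracy."""
    return 100.0 * (after - before) / before


# Accuracies reported in the abstract, in percent.
before, after = 72.2, 83.3

print(f"absolute gain: {absolute_gain(before, after):.1f} points")
print(f"relative gain: {relative_gain(before, after):.2f} %")
```

From the rounded figures the relative gain comes out near 15.4 percent, slightly above the reported 15.3 percent. The gap is presumably due to rounding or truncation of the underlying values, though the abstract does not say.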

What carries the argument

The DIQ-H benchmark carries the argument: it applies sequences of simulated degradations to expose how visual corruptions drive persistent hallucinations and inconsistent reasoning in vision-language models.
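
To make the idea of a temporally degraded sequence concrete, here is a minimal sketch in pure Python. It is not the paper's generator. The three functions are simple stand-ins chosen for illustration (a horizontal box blur for motion blur, additive Gaussian noise for sensor noise, coarse quantization for compression artifacts), and every name and parameter value is an assumption.

```python
import random

Image = list[list[float]]  # grayscale image, pixel values in [0, 255]


def motion_blur(img: Image, length: int) -> Image:
    """Horizontal box blur: average each pixel with up to length - 1 pixels to its left."""
    out = []
    for row in img:
        new_row = []
        for x in range(len(row)):
            window = row[max(0, x - length + 1): x + 1]
            new_row.append(sum(window) / len(window))
        out.append(new_row)
    return out


def sensor_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Additive Gaussian noise, clipped to the valid pixel range."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def coarse_quantize(img: Image, levels: int) -> Image:
    """Crude stand-in for compression artifacts: snap pixels to `levels` gray values."""
    step = 255.0 / (levels - 1)
    return [[round(p / step) * step for p in row] for row in img]


def degrade_sequence(frames: list[Image], severities: list[float], seed: int = 0) -> list[Image]:
    """Apply all three degradations per frame; severity 0.0 leaves a frame untouched."""
    rng = random.Random(seed)
    out = []
    for frame, s in zip(frames, severities):
        if s <= 0.0:
            out.append([row[:] for row in frame])
            continue
        # Hypothetical severity-to-parameter mappings, chosen only for illustration.
        f = motion_blur(frame, length=1 + int(4 * s))
        f = sensor_noise(f, sigma=20.0 * s, rng=rng)
        f = coarse_quantize(f, levels=max(2, int(256 - 240 * s)))
        out.append(f)
    return out


# A 4x8 horizontal gradient repeated over five frames, with a transient
# degradation burst in the middle of the sequence.
clean = [[float(32 * x) for x in range(8)] for _ in range(4)]
frames = [clean] * 5
degraded = degrade_sequence(frames, severities=[0.0, 0.5, 1.0, 0.5, 0.0])
```

The severity schedule is the part that matters for a sequence benchmark: the first and last frames stay clean, so any error a model makes on the last frame can only come from what it saw earlier.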

If this is right

VLMs exhibit greater vulnerability to error buildup when visual inputs degrade continuously rather than remaining static or clean.
Value-guided refinement can scale the creation of reliable annotations for safety assessments without proportional increases in human effort.
Robustness evaluations for embodied systems must incorporate measures of temporal consistency and ethical alignment under realistic perturbations.
Improved annotation quality supports better training and assessment of models intended for robotics and autonomous applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the simulated conditions match real-world stressors, then VLM development should prioritize resilience to sequential degradations to prevent error accumulation in deployment.
The use of lightweight models for refinement suggests potential for on-the-fly correction mechanisms during actual operation to maintain value alignment.
Future extensions could test whether the observed improvements generalize across different VLM architectures or longer sequence lengths.

Load-bearing premise

Two assumptions carry the weight: that artificial degradations such as motion blur and sensor noise in image sequences adequately represent the continuous visual challenges encountered in real-world embodied AI applications, and that lightweight models can detect value misalignments reliably without introducing additional errors.

What would settle it

A direct test would be to check whether the rate of hallucinations and error propagation in DIQ-H matches the behavior of the same models on authentic video data collected from operating robots or vehicles facing natural environmental degradations.
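
Such a comparison needs the same sequence-level statistics on both the simulated and the real data. The sketch below shows two candidate statistics computed from per-frame judgments. The definitions are assumptions made for illustration and are not taken from the paper, which may define its hallucination and recovery metrics differently.

```python
from typing import Optional


def hallucination_rate(correct: list[bool]) -> float:
    """Fraction of frames whose answer was judged wrong."""
    return sum(not c for c in correct) / len(correct)


def recovery_lag(correct: list[bool], degraded: list[bool]) -> Optional[int]:
    """Number of clean frames answered wrongly after the last degraded frame.

    Returns 0 if the model is right on the first clean frame, and None if it
    never answers correctly again within the sequence.
    """
    if not any(degraded):
        raise ValueError("sequence contains no degraded frame")
    last_degraded = max(i for i, d in enumerate(degraded) if d)
    for lag, c in enumerate(correct[last_degraded + 1:]):
        if c:
            return lag
    return None


# Hypothetical judgments for an 8-frame sequence with a degradation burst
# on frames 2 and 3; the model stays wrong for two clean frames afterwards.
degraded_flags = [False, False, True, True, False, False, False, False]
correct_flags = [True, True, False, False, False, False, True, True]

rate = hallucination_rate(correct_flags)
lag = recovery_lag(correct_flags, degraded_flags)
```

If the simulated sequences are a faithful proxy, the distributions of these statistics over many sequences should look similar on simulated and real footage for the same model.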

Figures

Figures reproduced from arXiv: 2512.03992 by Hanwen Wan, Xiaoqiang Ji, Yixuan Deng, Zexin Lin.

**Figure 2.** Overview of the DIQ-H evaluation framework. The Multi-Agent Benchmark Generator (left) creates temporally degraded sequences through coordinated …

**Figure 3.** Illustration of temporal error propagation in VLMs. A transient degradation at frame …

**Figure 4.** Visualization of the three primary degradation types at varying severity levels. Each column shows the same scene under increasing degradation …

**Figure 5.** The Uncertainty-Guided Iterative Refinement (UIR) pipeline. Input …

**Figure 6.** Structure of the DIQ-H benchmark. Left: Hierarchical taxonomy of 12 degradation types organized into optical, sensor-induced, and compression …

**Figure 7.** Experimental results visualization. (a) Radar chart showing multi-dimensional performance profiles across Hallucination Rate (inverted), Recovery …

Original abstract

Vision-Language Models (VLMs) are essential for embodied AI and safety-critical applications, such as robotics and autonomous systems. However, existing benchmarks primarily focus on static or curated visual inputs, neglecting the challenges posed by adversarial conditions, value misalignment, and error propagation in continuous deployment. Current benchmarks either overlook the impact of real-world perturbations, or fail to account for the cumulative effect of inconsistent reasoning over time. To address these gaps, we introduce the Degraded Image Quality Leading to Hallucinations (DIQ-H) benchmark, the first to evaluate VLMs under adversarial visual conditions in continuous sequences. DIQ-H simulates real-world stressors including motion blur, sensor noise, and compression artifacts, and measures how these corruptions lead to persistent errors and misaligned outputs across time. The benchmark explicitly models error propagation and its long-term value consistency. To enhance scalability and reduce costs for safety-critical evaluation, we propose the Value-Guided Iterative Refinement (VIR) framework, which automates the generation of high-quality, ethically aligned ground truth annotations. VGIR leverages lightweight VLMs to detect and refine value misalignment, improving accuracy from 72.2% to 83.3%, representing a 15.3% relative improvement. The DIQ-H benchmark and VGIR framework provide a robust platform for embodied AI safety assessment, revealing vulnerabilities in error recovery, ethical consistency, and temporal value alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The DIQ-H benchmark and VIR refinement target continuous VLM robustness under degradations, but show no calibration against real sensor data.

The main thing here is a new benchmark called DIQ-H that tests VLMs on sequences of images hit with simulated degradations like motion blur, noise, and compression, plus the VIR loop that uses lightweight models to iteratively fix annotation misalignments and reports an accuracy rise from 72.2% to 83.3%.

The paper does a clear job identifying the gap in static or clean benchmarks and tries to capture error buildup over time plus value consistency. That focus on temporal propagation fits the needs of embodied AI better than single-shot tests. The automation angle in VIR is practical for scaling safety checks without full manual labeling.

The simulations are presented as stand-ins for real stressors, which is a fair starting move. The reported lift gives a concrete number to work with.

The soft spot is the missing link to actual deployment data. Nothing in the description shows quantitative matching of the simulated failure patterns or hallucination triggers against real robotic or autonomous footage, so the measured vulnerabilities could be tied to the specific degradation choices rather than general VLM behavior. The accuracy claim also needs the experimental setup details, sample counts, and variance to land solidly.

This is for people building or testing robust VLMs for robotics and safety-critical systems. Readers who want fresh ideas on continuous evaluation would get value from the benchmark design and refinement method.

It has enough structure and a defined contribution to deserve referee time. I would send it for peer review to get the simulation grounding and experimental details checked.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Degraded Image Quality Leading to Hallucinations (DIQ-H) benchmark, claimed as the first to evaluate Vision-Language Models (VLMs) under adversarial visual conditions in continuous sequences. DIQ-H simulates real-world stressors including motion blur, sensor noise, and compression artifacts to measure error propagation and long-term value consistency. It also proposes the Value-Guided Iterative Refinement (VIR) framework, which uses lightweight VLMs to automate high-quality, ethically aligned ground truth annotations, reporting an accuracy improvement from 72.2% to 83.3% (15.3% relative improvement). The work aims to provide a platform for embodied AI safety assessment, highlighting vulnerabilities in error recovery, ethical consistency, and temporal value alignment.

Significance. If the simulation fidelity and empirical results hold, the DIQ-H benchmark fills a gap in evaluating VLMs for continuous, degraded inputs relevant to robotics and autonomous systems, while VIR offers a scalable annotation method that could reduce costs in safety-critical evaluations. The reported accuracy lift and focus on value misalignment represent a potentially useful contribution to robustness testing, provided the degradations are shown to generalize beyond the chosen models.

major comments (2)

[Abstract] Abstract: The central claim that DIQ-H 'simulates real-world stressors' and 'models error propagation and its long-term value consistency' is load-bearing for the benchmark's validity, yet the abstract provides no quantitative calibration (e.g., distribution matching of hallucination triggers or temporal failure correlations) against real robotic or sensor footage. Without this, measured vulnerabilities in temporal value alignment risk being artifacts of the specific degradation model rather than general VLM properties.
[Abstract] Abstract: The accuracy improvement from 72.2% to 83.3% is presented as evidence for VIR, but the abstract supplies no details on experimental setup, number of samples, baselines, statistical significance, or error bars. This information is required to assess whether the 15.3% relative gain is robust and transferable.
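
A minimal sketch of why the sample count matters for the second comment: Wilson score intervals for the two reported accuracies at two hypothetical sample sizes. The sample sizes are assumptions, since the abstract states none, and the overlap check is only a rough heuristic rather than a formal significance test.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical sample sizes; the paper's actual count is not given in the abstract.
for n in (100, 1000):
    before = wilson_interval(0.722, n)
    after = wilson_interval(0.833, n)
    print(n, [round(v, 3) for v in before], [round(v, 3) for v in after],
          "overlap" if intervals_overlap(before, after) else "separated")
```

With 100 samples per condition the two intervals overlap, while with 1000 they separate clearly, so the same pair of percentages can be weak or strong evidence depending on a number the abstract leaves out. If both accuracies were measured on the same items, a paired test would be the more appropriate check.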

minor comments (2)

[Abstract] The abstract introduces the framework as 'VIR' but then refers to it as 'VGIR' in the following sentences; standardize the acronym for clarity.
The claim that DIQ-H is 'the first' benchmark of its kind would benefit from an explicit comparison table or literature review section to support the novelty assertion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped clarify the presentation of our contributions in the abstract. We address each major comment below and have revised the abstract to incorporate additional details on calibration and experimental setup.

Referee: [Abstract] Abstract: The central claim that DIQ-H 'simulates real-world stressors' and 'models error propagation and its long-term value consistency' is load-bearing for the benchmark's validity, yet the abstract provides no quantitative calibration (e.g., distribution matching of hallucination triggers or temporal failure correlations) against real robotic or sensor footage. Without this, measured vulnerabilities in temporal value alignment risk being artifacts of the specific degradation model rather than general VLM properties.

Authors: We appreciate the referee highlighting the need for explicit calibration evidence to support the benchmark's claims. The manuscript details the degradation parameterization and its grounding in real-world sensor characteristics in Sections 3 and 4. To address the concern in the abstract itself, we have added a brief statement noting that the simulations are calibrated against real robotic and sensor data distributions. This revision strengthens the presentation without altering the underlying methodology.
Revision: yes

Referee: [Abstract] Abstract: The accuracy improvement from 72.2% to 83.3% is presented as evidence for VIR, but the abstract supplies no details on experimental setup, number of samples, baselines, statistical significance, or error bars. This information is required to assess whether the 15.3% relative gain is robust and transferable.

Authors: We agree that the abstract would benefit from more context on the VIR evaluation to allow readers to assess the reported improvement. We have revised the abstract to reference the experimental setup, including the number of samples and confirmation of statistical significance. The full details on baselines, error bars, and methodology are provided in Section 5 of the manuscript.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark and framework presented as independent empirical contributions

The paper introduces the DIQ-H benchmark and VIR framework as new artifacts for evaluating VLM robustness under simulated degradations in sequences. The central claims rest on the construction of the benchmark (simulating motion blur, noise, compression) and the reported accuracy lift from 72.2% to 83.3% via the refinement process. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce these claims to their own inputs by construction. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work to force its choices. The skeptic's concern about simulation fidelity is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no mathematical derivations, fitted parameters, or new entities are described. The work relies on standard assumptions about VLM behavior under degradation without explicit axioms or free parameters listed.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: DIQ-H applies physics-based corruptions (motion blur, sensor noise, compression artifacts) and measures hallucination persistence, error recovery, and temporal consistency through multi-turn Q&A tasks.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Paper passage: Uncertainty-Guided Iterative Refinement (UIR) ... Jensen-Shannon divergence and Hodges-Lehmann estimation quantify output uncertainty.
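
The quoted passage names two standard statistics without showing them. The sketch below gives their textbook definitions: Jensen-Shannon divergence between two discrete distributions, and the one-sample Hodges-Lehmann estimator as the median of Walsh averages. How the paper applies them to model outputs is not described in this excerpt, so the example inputs are purely hypothetical.

```python
import math
from statistics import median


def kl_divergence(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits; symmetric and bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def hodges_lehmann(xs: list[float]) -> float:
    """One-sample Hodges-Lehmann estimator: median of all pairwise (Walsh) averages."""
    walsh = [(xs[i] + xs[j]) / 2 for i in range(len(xs)) for j in range(i, len(xs))]
    return median(walsh)


# Hypothetical answer distributions from two model passes over the same options.
pass_a = [0.7, 0.2, 0.1]
pass_b = [0.1, 0.2, 0.7]
disagreement = js_divergence(pass_a, pass_b)

# Hypothetical per-sample scores with one outlier.
scores = [1.0, 2.0, 3.0, 100.0]
robust_center = hodges_lehmann(scores)
```

On the scores above the Hodges-Lehmann estimate is 2.75 while the mean is 26.5, which shows why a robust location estimate is attractive when a few outputs are wildly off.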

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
