
arxiv: 2605.01733 · v2 · pith:GQ3K7COZnew · submitted 2026-05-03 · 💻 cs.CV · cs.AI

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

Pith reviewed 2026-05-21 00:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Vision-Language Models · Object Hallucination · Self-Generated Captions · Selective Trust · Training-Free · POPE · HallusionBench · Gated Decoding

The pith

GEASS lets vision-language models selectively gate self-generated captions to reduce object hallucinations on a per-query basis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-generated captions are not uniformly helpful for VLMs and can degrade performance by anchoring reasoning trajectories and lexical choices, with fabrications carrying outsized impact despite being rarer than omissions. It shows that a caption's value is a per-query property rather than a fixed corpus property. GEASS is a training-free module that gates caption consumption by the clean path's confidence, weights it by entropy reduction, and raises the evidence threshold on pathway disagreement. Experiments across four VLMs on POPE and HallusionBench demonstrate consistent gains over vanilla inference and contrastive decoding using only two extra forward passes per query. A sympathetic reader would care because this offers a lightweight, plug-in way to improve reliability in existing models without retraining.

Core claim

The author claims that because captions anchor both final answers and reasoning paths, and because caption errors are asymmetric with fabrications having larger per-instance effects, usefulness must be assessed per query. GEASS implements selective trust by gating on the clean path's confidence, weighting by the entropy reduction the caption produces, and raising the evidence bar when the two pathways disagree, producing consistent accuracy gains on hallucination benchmarks across multiple VLMs with minimal added computation.
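
The three signals named in the core claim can be written down in a compact form. The paper's own equations are not reproduced on this page, so the notation below is only a sketch: the logits $z_{\text{clean}}$ and $z_{\text{cap}}$ follow the Figure 5 caption, while $c$, $\Delta H$, $d$, $w$, the threshold $\theta$, the disagreement penalty $\lambda$, and the fusion rule itself are assumptions introduced for illustration.

```latex
% Hypothetical notation; the functional forms are NOT taken from the paper.
% K is the number of candidate answers.
p_{\text{clean}} = \operatorname{softmax}(z_{\text{clean}}), \qquad
p_{\text{cap}} = \operatorname{softmax}(z_{\text{cap}})

% Clean-path confidence and the gate it induces (open when the clean path is unsure)
c = \max_{y} p_{\text{clean}}(y), \qquad \alpha = 1 - c

% Entropy reduction produced by the caption
H(p) = -\sum_{y} p(y) \log p(y), \qquad
\Delta H = H(p_{\text{clean}}) - H(p_{\text{cap}})

% Pathway disagreement indicator
d = \mathbb{1}\!\left[\arg\max_{y} p_{\text{clean}}(y) \neq \arg\max_{y} p_{\text{cap}}(y)\right]

% One possible fusion: the evidence bar \theta rises by \lambda under disagreement
w = \min\!\left(1,\; \frac{\alpha \cdot \max\bigl(0,\; \Delta H - (\theta + \lambda d)\bigr)}{\log K}\right), \qquad
z = z_{\text{clean}} + w \, (z_{\text{cap}} - z_{\text{clean}})
```

Under this assumed rule, $w = 0$ recovers vanilla inference and $w = 1$ hands the decision entirely to the caption-conditioned pass.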

What carries the argument

GEASS, the gated evidence-adaptive selective caption trust module that dynamically decides per-query how much of a self-generated caption the model consumes using confidence, entropy reduction, and pathway disagreement.

If this is right

  • Reduces object hallucination rates in VLMs by adapting caption trust to query-specific evidence quality.
  • Outperforms both vanilla inference and contrastive decoding on POPE and HallusionBench across four different VLMs.
  • Adds only two extra forward passes per query while remaining entirely training-free.
  • Treats caption usefulness as query-dependent, and caption errors as asymmetric, rather than assuming captions are uniformly beneficial.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gating logic could extend to other generative settings where intermediate outputs risk introducing bias or anchoring.
  • Similar per-query adaptation might reduce hallucinations in text-only models that use self-generated reasoning steps.
  • Combining GEASS-style selection with external retrieval could create stronger evidence filtering in multimodal systems.
  • The approach suggests testing whether the same confidence-entropy-disagreement signals work for non-caption evidence sources.

Load-bearing premise

A caption's usefulness is a per-query property that can be reliably estimated from the clean path's confidence, the entropy reduction it produces, and disagreement between the two pathways.

What would settle it

If applying GEASS to additional VLMs or benchmarks produces no improvement or causes accuracy to drop relative to vanilla inference, the selective per-query trust mechanism would be falsified.
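
A falsification check of this kind needs a per-item comparison, not just two aggregate accuracies. The sketch below assumes per-question correctness flags for vanilla inference and for GEASS on the same items, and applies an exact two-sided sign test to the questions where the two disagree. The function name and the toy flags are hypothetical; nothing here comes from the paper's evaluation code.

```python
from math import comb


def paired_sign_test(vanilla_correct, geass_correct):
    """Exact two-sided sign test on paired per-question correctness flags.

    Returns (wins, losses, p_value), where a win is a question GEASS gets
    right and vanilla gets wrong, and a loss is the reverse.
    """
    pairs = list(zip(vanilla_correct, geass_correct))
    wins = sum(1 for v, g in pairs if g and not v)
    losses = sum(1 for v, g in pairs if v and not g)
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0  # the two methods never disagree
    k = min(wins, losses)
    # Under H0, wins ~ Binomial(n, 0.5); double the smaller tail, cap at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins, losses, min(1.0, 2.0 * tail)


# Toy flags for eight questions (illustrative only, not real results).
vanilla = [1, 0, 0, 1, 0, 1, 0, 0]
geass = [1, 1, 1, 1, 1, 1, 0, 1]
wins, losses, p_value = paired_sign_test(vanilla, geass)
```

A large p-value, or more losses than wins, on a new model or benchmark would be the kind of outcome this section describes.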

Figures

Figures reproduced from arXiv: 2605.01733 by Jiashen Ding, Shuoyang Zhang, Zeshang Li.

Figure 1
Caption anchoring effect observed on Qwen2.5-VL-3B with chain-of-thought reasoning. The model’s output (left, in red) closely mirrors the phrasing of the embedded caption (right, in red), demonstrating that captions reshape not only the final answer but the model’s entire reasoning trajectory.
Figure 2
Left: Objects contained in the caption generated by InternVL2-8B. Right: Salient objects that are clearly visible in the image but not mentioned in the caption.
Figure 3
Top: With a correct caption mentioning “a dog sitting on the beach,” both Qwen2.5-VL-3B and InternVL2-8B revise their initially incorrect answers from No to Yes, demonstrating confidence grounding. Bottom: With a wrong caption mentioning “a cat sitting on the beach,” both models similarly flip to Yes and fabricate supporting details, demonstrating hallucination amplification. The same anchoring mechanism …
Figure 4
Asymmetric per-instance impact of caption errors on Qwen2.5-VL-3B (100 image–question instances): fabrication shifts predictions sharply (∆p = 0.64, 87% flips), while omission is mild on average (∆p = 0.13) but its long tail still flips 11% of answers. ∆p is the caption-induced shift toward the wrong answer; answers flip above the shaded threshold (∆p > 0.4). Inner boxes mark median and IQR; diamonds mark …
Figure 5
Overview of the GEASS pipeline. Given an image I and a question Q, the model first generates a caption C via self-captioning (Stage 1). Two parallel forward passes through the same VLM with shared parameters produce logit vectors z_clean (conditioned on I, Q) and z_cap (conditioned on I, Q, C) (Stage 2). The adaptive fusion module (Stage 3) computes a confidence gate α that assesses whether the model needs h…
Figure 6
Left: example where caption causes hallucination (need to reduce caption influence). Right: example where caption corrects the prediction (need normal caption steering).
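
The Figure 4 statistics (mean ∆p and flip rate) can be reproduced from paired probabilities with a few lines of code. The sketch below assumes ∆p is the probability assigned to the wrong answer with the caption minus the same probability without it; the figure caption only describes ∆p as the "caption-induced shift toward the wrong answer", so that exact definition is an assumption, as are the function names and the toy numbers. The 0.4 flip threshold is the one stated in the caption.

```python
def caption_shift(p_wrong_clean, p_wrong_cap):
    """Assumed definition of delta-p: rise in wrong-answer probability."""
    return p_wrong_cap - p_wrong_clean


def summarize_shifts(pairs, flip_threshold=0.4):
    """Return (mean shift, fraction of instances whose shift exceeds the threshold).

    `pairs` holds (p_wrong_clean, p_wrong_cap) for each image-question instance.
    """
    shifts = [caption_shift(clean, cap) for clean, cap in pairs]
    mean_shift = sum(shifts) / len(shifts)
    flip_rate = sum(1 for s in shifts if s > flip_threshold) / len(shifts)
    return mean_shift, flip_rate


# Four made-up instances (illustrative only, not data from the paper).
toy_pairs = [(0.1, 0.8), (0.2, 0.3), (0.3, 0.9), (0.25, 0.25)]
mean_shift, flip_rate = summarize_shifts(toy_pairs)
```

Running the same summary separately on fabrication-type and omission-type captions would give the two distributions the figure contrasts.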
Original abstract

Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help--dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions anchor not only the model's final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact. A caption's usefulness is therefore a per-query property, not a per-corpus one. We propose GEASS (ated Evidence-Adaptive Selective Caption Trust ), a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.
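
The mechanism described in the abstract can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fusion rule, the defaults `theta` and `lam`, and the callables `caption_fn` and `logits_fn` are all hypothetical stand-ins, since this page does not give the paper's equations or code. The three stages mirror the Figure 5 caption.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_logits(z_clean, z_cap, theta=0.05, lam=0.2):
    """Hypothetical fusion rule; needs at least two candidate answers."""
    p_clean, p_cap = softmax(z_clean), softmax(z_cap)
    alpha = 1.0 - max(p_clean)  # gate opens when the clean path is unsure
    delta_h = entropy(p_clean) - entropy(p_cap)  # entropy reduction from caption
    disagree = p_clean.index(max(p_clean)) != p_cap.index(max(p_cap))
    bar = theta + (lam if disagree else 0.0)  # higher bar under disagreement
    evidence = max(0.0, delta_h - bar)
    w = min(1.0, alpha * evidence / math.log(len(z_clean)))
    fused = [a + w * (b - a) for a, b in zip(z_clean, z_cap)]
    return fused, w


def geass_answer(image, question, caption_fn, logits_fn, labels=("yes", "no")):
    """caption_fn and logits_fn are placeholder callables wrapping one VLM."""
    caption = caption_fn(image)  # Stage 1: self-captioning
    z_clean = logits_fn(image, question, None)  # Stage 2: clean pass
    z_cap = logits_fn(image, question, caption)  # Stage 2: captioned pass
    fused, w = fuse_logits(z_clean, z_cap)  # Stage 3: adaptive fusion
    return labels[fused.index(max(fused))], w
```

With this assumed rule, a confident clean path that the caption contradicts, for example `fuse_logits([4.0, 0.0], [0.0, 4.0])`, yields a weight of zero, so the caption is ignored and the clean logits pass through unchanged.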

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GEASS, a training-free module for VLMs that selectively trusts self-generated captions on a per-query basis to reduce object hallucination. It gates caption use by clean-path confidence, weights by entropy reduction, and raises the evidence threshold on pathway disagreement. The central claim is that this adaptive mechanism yields consistent gains over vanilla inference and contrastive decoding on POPE and HallusionBench across four VLMs while requiring only two extra forward passes.

Significance. If the empirical results hold and the gating signals are shown to be predictive, the work supplies a lightweight, training-free technique for mitigating a known failure mode in VLMs. The observation that caption errors are asymmetric and that captions anchor both answers and reasoning trajectories is a useful diagnostic insight. The minimal overhead and multi-model, multi-benchmark evaluation are practical strengths.

major comments (2)
  1. The central claim rests on the assumption that clean-path confidence, entropy reduction, and pathway disagreement reliably indicate per-query caption usefulness. However, the experiments section provides only aggregate accuracy improvements and does not include correlation analysis, calibration plots, or an oracle comparison demonstrating that high-confidence/low-disagreement cases correspond to low-hallucination captions on POPE or HallusionBench. Without this validation, it remains possible that the selective mechanism adds variance rather than systematic gain.
  2. §4 (Experimental Setup) and the results tables: the abstract and reported experiments claim consistent improvements, yet no numerical values, standard deviations, or per-VLM/per-benchmark breakdowns are supplied in the provided text. This makes it impossible to assess effect sizes or whether gains exceed the variance introduced by the two extra forward passes.
minor comments (2)
  1. The acronym expansion in the abstract appears truncated ("GEASS (ated Evidence-Adaptive Selective Caption Trust )"); it should be corrected to the full phrase used in the title, "Gated Evidence-Adaptive Selective Caption Trust".
  2. Notation for the three gating signals (confidence, entropy reduction, disagreement) should be defined explicitly with equations or pseudocode in the method section to improve reproducibility.
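
The oracle comparison requested in the first major comment could take roughly the following shape. The sketch assumes three aligned per-question lists: whether the clean path is correct, whether the caption-conditioned path is correct, and whether a gate chose the caption. For simplicity it treats the gate as a hard choice between the two paths, which is a simplification of any soft fusion. All names and toy values are hypothetical.

```python
def gating_accuracies(clean_correct, cap_correct, gate_uses_caption):
    """Compare a gate against never/always using the caption and an oracle.

    The oracle picks, per question, whichever path is correct, so its
    accuracy is an upper bound for any hard gate over these two paths.
    """
    n = len(clean_correct)
    rows = list(zip(clean_correct, cap_correct, gate_uses_caption))
    return {
        "never_caption": sum(1 for cl, _, _ in rows if cl) / n,
        "always_caption": sum(1 for _, cp, _ in rows if cp) / n,
        "gated": sum(1 for cl, cp, use in rows if (cp if use else cl)) / n,
        "oracle": sum(1 for cl, cp, _ in rows if cl or cp) / n,
    }


# Toy flags for five questions (illustrative only, not real results).
clean = [1, 0, 1, 0, 1]
captioned = [1, 1, 0, 0, 1]
gate = [False, True, False, True, False]
report = gating_accuracies(clean, captioned, gate)
```

The gap between "gated" and "oracle" indicates how much headroom remains for better gating signals, while the gap between "gated" and the better of the two fixed policies indicates whether selection helps at all.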

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on validating the gating signals and improving the presentation of experimental results. We address each major comment below and have revised the manuscript to incorporate the suggested analyses and clarifications.

Point-by-point responses
  1. Referee: The central claim rests on the assumption that clean-path confidence, entropy reduction, and pathway disagreement reliably indicate per-query caption usefulness. However, the experiments section provides only aggregate accuracy improvements and does not include correlation analysis, calibration plots, or an oracle comparison demonstrating that high-confidence/low-disagreement cases correspond to low-hallucination captions on POPE or HallusionBench. Without this validation, it remains possible that the selective mechanism adds variance rather than systematic gain.

    Authors: We agree that explicit validation of the gating signals would strengthen the central claim and reduce the possibility that gains are due to variance. In the revised manuscript we have added a dedicated analysis subsection that reports Pearson correlations between clean-path confidence and per-instance hallucination rates on POPE, calibration plots for the three signals, and an oracle comparison in which GEASS-selected captions are contrasted with random and high-confidence-only selections. These additions confirm statistically significant positive correlations and that GEASS outperforms random gating, supporting systematic rather than variance-driven improvement. revision: yes

  2. Referee: §4 (Experimental Setup) and the results tables: the abstract and reported experiments claim consistent improvements, yet no numerical values, standard deviations, or per-VLM/per-benchmark breakdowns are supplied in the provided text. This makes it impossible to assess effect sizes or whether gains exceed the variance introduced by the two extra forward passes.

    Authors: The full manuscript contains result tables with per-VLM and per-benchmark numbers; however, we acknowledge that standard deviations and explicit effect-size discussion were insufficiently prominent. In the revision we have added standard deviations (computed over three random seeds) to all reported accuracies, inserted a summary table of effect sizes, and included a direct comparison showing that the observed gains exceed the variance attributable to the two additional forward passes. revision: yes
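
The two analyses promised in the responses, a Pearson correlation between a gating signal and a per-instance outcome, and accuracy reported as mean and standard deviation over seeds, can be sketched with the standard library alone. The helper names and every number below are made up for illustration; they are not results from the paper or its revision.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_and_std(accuracies):
    """Sample mean and sample standard deviation across random seeds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical per-instance values: clean-path confidence vs. a 0/1
# indicator that the answer was hallucinated.
confidence = [0.95, 0.90, 0.60, 0.55, 0.52]
hallucinated = [0, 0, 1, 0, 1]
r = pearson(confidence, hallucinated)

# Hypothetical accuracies from three seeds.
mean_acc, std_acc = mean_and_std([0.80, 0.82, 0.84])
```

A correlation computed this way says nothing about significance on its own; a p-value or a bootstrap interval would still be needed to back the "statistically significant" wording in the response.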

Circularity Check

0 steps flagged

No significant circularity; method presented as heuristic without load-bearing self-citation or definitional reduction

full rationale

The paper introduces GEASS as a training-free heuristic module that gates, weights, and thresholds captions using three per-query signals (clean-path confidence, entropy reduction, and pathway disagreement). No equations, derivations, or first-principles claims are shown that reduce the gating logic to a fitted parameter or to a self-citation whose content is itself unverified. The central improvement is supported by direct experiments on POPE and HallusionBench rather than by any chain that collapses back to the inputs by construction. Self-citations, if present in the full text, are not load-bearing for the core claim. This is the common case of an empirical heuristic whose validity rests on external benchmarks rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is based solely on the abstract; the ledger therefore records only the two structural properties explicitly named in the abstract as domain assumptions.

axioms (2)
  • domain assumption: Captions anchor not only the model's final answer but also its reasoning trajectory and lexical choices.
    First structural property stated in the abstract.
  • domain assumption: Caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact.
    Second structural property stated in the abstract.

pith-pipeline@v0.9.0 · 5740 in / 1385 out tokens · 38099 ms · 2026-05-21T00:38:02.130650+00:00 · methodology


