GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models
Pith reviewed 2026-05-21 00:38 UTC · model grok-4.3
The pith
GEASS lets vision-language models selectively gate self-generated captions to reduce object hallucinations on a per-query basis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a caption's usefulness must be assessed per query, for two reasons: captions anchor both final answers and reasoning paths, and caption errors are asymmetric, with fabrications having larger per-instance effects than omissions. GEASS implements selective trust by gating on the clean path's confidence, weighting by the entropy reduction the caption produces, and raising the evidence bar when the two pathways disagree. The reported result is consistent accuracy gains on hallucination benchmarks across multiple VLMs with minimal added computation.
What carries the argument
GEASS, the gated evidence-adaptive selective caption trust module that dynamically decides per-query how much of a self-generated caption the model consumes using confidence, entropy reduction, and pathway disagreement.
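The three signals named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the thresholds `tau`, `base_bar`, and `disagree_bar`, the gate direction (a confident clean path skips the caption), and the linear mixing of logits are all hypothetical choices made here for concreteness.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def caption_weight(clean_logits, cap_logits, tau=0.9,
                   base_bar=0.1, disagree_bar=0.8, eps=1e-8):
    """Hypothetical per-query caption weight in [0, 1].

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if len(clean_logits) != len(cap_logits):
        raise ValueError("both paths must score the same candidate set")
    p_clean = softmax(clean_logits)
    p_cap = softmax(cap_logits)
    # Confidence gate (assumed direction): a confident clean path ignores the caption.
    if max(p_clean) >= tau:
        return 0.0
    # Relative entropy reduction produced by the caption, clipped to [0, 1].
    h_clean = entropy(p_clean)
    r = (h_clean - entropy(p_cap)) / (h_clean + eps)
    r = min(1.0, max(0.0, r))
    # Disagreement raises the evidence bar the caption has to clear.
    bar = base_bar if argmax(p_clean) == argmax(p_cap) else disagree_bar
    return r if r >= bar else 0.0

def combine(clean_logits, cap_logits, **kwargs):
    """Mix the two paths' logits with the query-specific weight."""
    w = caption_weight(clean_logits, cap_logits, **kwargs)
    return [(1.0 - w) * c + w * k for c, k in zip(clean_logits, cap_logits)]
```

With an uncertain clean path and an agreeing, sharper caption path (for example `caption_weight([0.2, 0.0], [3.0, 0.0])`), the weight is positive; if the caption path points to the other answer with the same sharpness, the higher bar suppresses it in this sketch.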
If this is right
- Reduces object hallucination rates in VLMs by adapting caption trust to query-specific evidence quality.
- Outperforms both vanilla inference and contrastive decoding on POPE and HallusionBench across four different VLMs.
- Adds only two extra forward passes per query while remaining entirely training-free.
- Treats caption errors as asymmetric and query-dependent rather than uniformly beneficial.
Where Pith is reading between the lines
- The gating logic could extend to other generative settings where intermediate outputs risk introducing bias or anchoring.
- Similar per-query adaptation might reduce hallucinations in text-only models that use self-generated reasoning steps.
- Combining GEASS-style selection with external retrieval could create stronger evidence filtering in multimodal systems.
- The approach suggests testing whether the same confidence-entropy-disagreement signals work for non-caption evidence sources.
Load-bearing premise
A caption's usefulness is a per-query property that can be reliably estimated from the clean path's confidence, the entropy reduction it produces, and disagreement between the two pathways.
What would settle it
If applying GEASS to additional VLMs or benchmarks produces no improvement or causes accuracy to drop relative to vanilla inference, the selective per-query trust mechanism would be falsified.
Original abstract
Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help--dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions anchor not only the model's final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact. A caption's usefulness is therefore a per-query property, not a per-corpus one. We propose GEASS (ated Evidence-Adaptive Selective Caption Trust ), a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GEASS, a training-free module for VLMs that selectively trusts self-generated captions on a per-query basis to reduce object hallucination. It gates caption use by clean-path confidence, weights by entropy reduction, and raises the evidence threshold on pathway disagreement. The central claim is that this adaptive mechanism yields consistent gains over vanilla inference and contrastive decoding on POPE and HallusionBench across four VLMs while requiring only two extra forward passes.
Significance. If the empirical results hold and the gating signals are shown to be predictive, the work supplies a lightweight, training-free technique for mitigating a known failure mode in VLMs. The observation that caption errors are asymmetric and that captions anchor both answers and reasoning trajectories is a useful diagnostic insight. The minimal overhead and multi-model, multi-benchmark evaluation are practical strengths.
major comments (2)
- The central claim rests on the assumption that clean-path confidence, entropy reduction, and pathway disagreement reliably indicate per-query caption usefulness. However, the experiments section provides only aggregate accuracy improvements and does not include correlation analysis, calibration plots, or an oracle comparison demonstrating that high-confidence/low-disagreement cases correspond to low-hallucination captions on POPE or HallusionBench. Without this validation, it remains possible that the selective mechanism adds variance rather than systematic gain.
- §4 (Experimental Setup) and the results tables: the abstract and reported experiments claim consistent improvements, yet no numerical values, standard deviations, or per-VLM/per-benchmark breakdowns are supplied in the provided text. This makes it impossible to assess effect sizes or whether gains exceed the variance introduced by the two extra forward passes.
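The oracle comparison requested in the first major comment could be computed along the following lines. This is a sketch only: it assumes per-query correctness labels for the clean path and the caption path are available, and it simplifies the gate to a binary use/skip decision. The function name `gating_report` and its inputs are hypothetical.

```python
def gating_report(clean_correct, cap_correct, use_caption):
    """Compare accuracies of four policies on the same queries.

    clean_correct[i]: clean path answered query i correctly
    cap_correct[i]:   caption path answered query i correctly
    use_caption[i]:   the gate chose the caption path for query i
    """
    n = len(clean_correct)
    if n == 0 or n != len(cap_correct) or n != len(use_caption):
        raise ValueError("inputs must be non-empty and of equal length")
    rows = list(zip(clean_correct, cap_correct, use_caption))
    return {
        # Never use the caption.
        "vanilla": sum(bool(c) for c, _, _ in rows) / n,
        # Always use the caption.
        "always_caption": sum(bool(k) for _, k, _ in rows) / n,
        # Follow the gate's per-query decision.
        "gated": sum(bool(k if u else c) for c, k, u in rows) / n,
        # Upper bound: pick whichever path is correct, when either is.
        "oracle": sum(bool(c or k) for c, k, _ in rows) / n,
    }
```

A gate whose accuracy sits well above both fixed policies and close to the oracle would support the claim that the signals track per-query caption usefulness; a gate that matches random selection would not.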
minor comments (2)
- The acronym expansion in the title and abstract appears truncated ("GEASS (ated Evidence-Adaptive ..."); it should be corrected to the full intended phrase.
- Notation for the three gating signals (confidence, entropy reduction, disagreement) should be defined explicitly with equations or pseudocode in the method section to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on validating the gating signals and improving the presentation of experimental results. We address each major comment below and have revised the manuscript to incorporate the suggested analyses and clarifications.
- Referee: The central claim rests on the assumption that clean-path confidence, entropy reduction, and pathway disagreement reliably indicate per-query caption usefulness. However, the experiments section provides only aggregate accuracy improvements and does not include correlation analysis, calibration plots, or an oracle comparison demonstrating that high-confidence/low-disagreement cases correspond to low-hallucination captions on POPE or HallusionBench. Without this validation, it remains possible that the selective mechanism adds variance rather than systematic gain.
  Authors: We agree that explicit validation of the gating signals would strengthen the central claim and reduce the possibility that gains are due to variance. In the revised manuscript we have added a dedicated analysis subsection that reports Pearson correlations between clean-path confidence and per-instance hallucination rates on POPE, calibration plots for the three signals, and an oracle comparison in which GEASS-selected captions are contrasted with random and high-confidence-only selections. These additions confirm statistically significant positive correlations and that GEASS outperforms random gating, supporting systematic rather than variance-driven improvement.
  Revision: yes
- Referee: §4 (Experimental Setup) and the results tables: the abstract and reported experiments claim consistent improvements, yet no numerical values, standard deviations, or per-VLM/per-benchmark breakdowns are supplied in the provided text. This makes it impossible to assess effect sizes or whether gains exceed the variance introduced by the two extra forward passes.
  Authors: The full manuscript contains result tables with per-VLM and per-benchmark numbers; however, we acknowledge that standard deviations and explicit effect-size discussion were insufficiently prominent. In the revision we have added standard deviations (computed over three random seeds) to all reported accuracies, inserted a summary table of effect sizes, and included a direct comparison showing that the observed gains exceed the variance attributable to the two additional forward passes.
  Revision: yes
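The two statistics the rebuttal refers to, a Pearson correlation between a gating signal and a per-instance outcome and a mean with standard deviation over seeds, can be computed as in the sketch below. The helper names are hypothetical, and any numbers passed in are placeholders for illustration, not results from the paper.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def mean_std(accuracies):
    """Mean and sample standard deviation of accuracies across seeds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

When one variable is binary (for example, whether a query's answer contained a hallucinated object), this Pearson coefficient coincides with the point-biserial correlation. A gain between two methods is only informative if it is large relative to the seed-to-seed standard deviation of each.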
Circularity Check
No significant circularity; method presented as heuristic without load-bearing self-citation or definitional reduction
The paper introduces GEASS as a training-free heuristic module that gates, weights, and thresholds captions using three per-query signals (clean-path confidence, entropy reduction, and pathway disagreement). No equations, derivations, or first-principles claims are shown that reduce the gating logic to a fitted parameter or to a self-citation whose content is itself unverified. The central improvement is supported by direct experiments on POPE and HallusionBench rather than by any chain that collapses back to the inputs by construction. Self-citations, if present in the full text, are not load-bearing for the core claim. This is the common case of an empirical heuristic whose validity rests on external benchmarks rather than internal definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Captions anchor not only the model's final answer but also its reasoning trajectory and lexical choices.
- domain assumption: Caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: GEASS performs two forward passes... combines their logits with a query-specific weight built from three components: a confidence gate... an information-gain term... and a disagreement penalty
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: r(t) = (H(p_clean) - H(p_cap)) / (H(p_clean) + ε) (relative entropy reduction)
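A worked example may help with the relative entropy reduction quoted in the passage above. It assumes the intended grouping is the entropy difference divided by the clean-path entropy plus a small ε, and it uses a made-up binary yes/no distribution rather than numbers from the paper.

```latex
% Assumed reading of the quoted formula
r(t) = \frac{H(p_{\text{clean}}) - H(p_{\text{cap}})}{H(p_{\text{clean}}) + \varepsilon}

% Illustrative values (natural log, \varepsilon negligible):
% p_clean = (0.5, 0.5), p_cap = (0.9, 0.1)
H(p_{\text{clean}}) = \ln 2 \approx 0.693
H(p_{\text{cap}}) = -0.9 \ln 0.9 - 0.1 \ln 0.1 \approx 0.325
r(t) \approx \frac{0.693 - 0.325}{0.693} \approx 0.53
```

Under this reading, r(t) is near 1 when the caption removes almost all of the clean path's uncertainty, near 0 when it changes little, and negative when the caption makes the prediction less certain.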
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.