arxiv: 2310.00754 · v2 · pith:GNLFWKJ6new · submitted 2023-10-01 · 💻 cs.LG · cs.CL · cs.CV

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, Huaxiu Yao

Pith reviewed 2026-05-17 22:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV

keywords object hallucination · large vision-language models · post-hoc revision · co-occurrence statistics · decoding uncertainty · text position · LURE · vision-language tasks

The pith

A post-hoc algorithm called LURE reduces object hallucinations in large vision-language models by reconstructing descriptions based on co-occurrence, uncertainty, and text position.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets object hallucination in large vision-language models, where generated descriptions include objects absent from the input image and thereby undermine downstream tasks such as visual summarization and reasoning. It introduces LURE, a lightweight revisor that corrects outputs after generation without retraining the underlying model. The approach rests on a statistical breakdown showing that hallucinations correlate with frequent object co-occurrences, higher uncertainty at decoding time, and later positions in the generated sentence. A sympathetic reader would value the method because it promises immediate gains on existing open-source models while remaining simple to deploy.

Core claim

LURE post-hoc rectifies object hallucination by reconstructing less hallucinatory descriptions, drawing on three statistical factors: object co-occurrence in training data, uncertainty during LVLM decoding, and the tendency for hallucinations to appear later in generated text; when applied to six open-source LVLMs it yields a 23 percent improvement in standard hallucination metrics and tops both GPT-based and human rankings.

What carries the argument

LURE (LVLM Hallucination Revisor), an algorithm that uses detected statistical patterns in co-occurrence, decoding uncertainty, and sentence position to guide reconstruction of image descriptions.
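
To make two of these patterns concrete, here is a minimal sketch of how co-occurrence and position signals might be computed for a single object mention. It is an illustration under stated assumptions, not the paper's implementation: the function names, the pair-count statistic, and the character-offset notion of position are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotations):
    """Count how often each unordered object pair appears in the same image."""
    counts = Counter()
    for objects in annotations:
        for a, b in combinations(sorted(set(objects)), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence_score(obj, others, counts):
    """Sum of pair counts between `obj` and the other objects in a description.

    Assumption: a plain sum of counts; the paper may use a different statistic.
    """
    return sum(counts[tuple(sorted((obj, o)))] for o in others if o != obj)


def relative_position(description, obj):
    """Character offset of the first mention divided by description length.

    0.0 means the very start; values near 1.0 mean late in the text.
    Returns None when the object is not mentioned.
    """
    idx = description.lower().find(obj.lower())
    return None if idx < 0 else idx / len(description)


# Toy annotation corpus: one list of object labels per image (made-up data).
annotations = [["dog", "leash", "person"], ["dog", "leash"], ["person", "umbrella"]]
counts = cooccurrence_counts(annotations)
description = "A dog runs. A leash trails behind."
leash_score = cooccurrence_score("leash", ["dog", "person", "leash"], counts)
leash_pos = relative_position(description, "leash")
dog_pos = relative_position(description, "dog")
```

In this toy corpus "leash" co-occurs often with "dog" and is mentioned later than "dog", which is the kind of mention the summary above says is more likely to be hallucinated.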

If this is right

LURE improves object hallucination metrics by 23 percent over prior methods when evaluated on six open-source LVLMs.
The revisor integrates directly with any existing LVLM without weight changes.
Both automated GPT judgments and human raters rank LURE-corrected outputs highest.
Corrected descriptions support more trustworthy visual summarization and reasoning.
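
This page does not say which metrics the 23 percent figure refers to. A paper fragment quoted in the reference graph mentions CHAIR (Rohrbach et al., 2018), so the sketch below assumes a CHAIR-style computation: the share of mentioned objects that are absent from the image (instance level) and the share of captions containing at least one such object (sentence level). Object extraction from text is assumed to be done already, and the function name and input shapes are hypothetical.

```python
def chair_scores(captions, ground_truth):
    """Compute simplified CHAIR-style scores.

    captions: list of sets, the objects each caption mentions.
    ground_truth: list of sets, the objects actually present in each image.
    Returns (instance-level rate, sentence-level rate).
    """
    mentioned = hallucinated = bad_captions = 0
    for said, present in zip(captions, ground_truth):
        wrong = said - present          # mentioned but not in the image
        mentioned += len(said)
        hallucinated += len(wrong)
        bad_captions += bool(wrong)     # caption has at least one hallucination
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = bad_captions / len(captions) if captions else 0.0
    return chair_i, chair_s


# Made-up example: the first caption hallucinates a leash.
captions = [{"dog", "leash", "frisbee"}, {"person", "umbrella"}]
ground_truth = [{"dog", "frisbee"}, {"person", "umbrella", "building"}]
chair_i, chair_s = chair_scores(captions, ground_truth)
```

Lower is better for both scores, so an "improvement" in this setting means a reduction.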

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the three factors generalize, similar lightweight revisors could target attribute or relation hallucinations in the same models.
Deployment pipelines could insert LURE as an automatic post-processing step to raise user trust without retraining costs.
Combining LURE with targeted fine-tuning on domain-specific images might produce further reductions in hallucinations.
Pretraining objectives that penalize the identified statistical patterns could lower hallucination rates at the source.
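
The post-processing idea in this list can be sketched as a thin wrapper around an existing captioner. This is a hypothetical interface, not the API of the released LURE code: the type aliases, the `with_revision` helper, and the stub models are assumptions made for illustration, including the assumption that the revisor sees both the image and the draft.

```python
from typing import Callable

# Hypothetical interfaces: a captioner maps image bytes to text; a revisor
# maps (image bytes, draft text) to revised text.
Captioner = Callable[[bytes], str]
Revisor = Callable[[bytes, str], str]


def with_revision(captioner: Captioner, revisor: Revisor) -> Captioner:
    """Wrap a captioner so every draft passes through the revisor."""
    def describe(image: bytes) -> str:
        draft = captioner(image)
        return revisor(image, draft)
    return describe


# Stubs standing in for real models, so the sketch runs on its own.
def stub_captioner(image: bytes) -> str:
    return "A dog runs on the grass with a leash."


def stub_revisor(image: bytes, draft: str) -> str:
    return draft.replace(" with a leash", "")


describe = with_revision(stub_captioner, stub_revisor)
revised = describe(b"")
```

Because the wrapper returns another captioner, downstream code does not need to know whether revision is switched on.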

Load-bearing premise

That the three identified statistical factors suffice to direct reliable reconstruction of accurate descriptions across varied models and image domains.

What would settle it

A test set of images containing rare object combinations or a new LVLM whose decoding uncertainty does not align with hallucinated objects, where LURE produces no metric gains or introduces new errors.

Original abstract

Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LURE offers a practical post-hoc revisor for object hallucination in LVLMs with reported 23% metric gains across six models, though the three statistical factors may be correlational rather than causal.

Hey colleague, the main thing here is that the authors give a straightforward post-hoc algorithm called LURE to revise LVLM outputs and reduce object hallucinations. They ground the revision in three factors from their analysis: object co-occurrence in images, uncertainty during decoding, and the tendency for errors to appear later in the generated text. On six open-source models this yields a 23% improvement over the prior best method, and the revised descriptions rank highest in both GPT and human evaluations. The code release is a clear plus for anyone who wants to try it out or check the details.

What stands out is how simple the integration is: no retraining required, just a reconstruction step on top of existing models. This makes it immediately usable for tasks like image captioning or visual reasoning where made-up objects cause problems.

The soft spots are around the strength of the causal story. The factors come from statistical correlations in their data, but the paper does not appear to include ablations that isolate each one or test whether removing any collapses the gains. If these are downstream symptoms of training distributions or decoding choices rather than root causes, the improvements could shrink on new models, different image domains, or varied revision prompts. The central results look reproducible from the reported setup, but more controls would make the claims tighter.

This is for people working on reliable vision-language systems who need something they can apply right away without big changes to their pipeline. It has enough concrete evaluations and a clear algorithm to deserve a serious referee, even if the experiments need some extra robustness checks. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The paper analyzes object hallucination in large vision-language models (LVLMs), identifying three statistical factors—object co-occurrence in images, decoding uncertainty, and later position in generated text—as key contributors. It proposes LURE, a post-hoc revision algorithm that uses these factors to reconstruct less hallucinatory descriptions, which can be integrated with any LVLM. Evaluations on six open-source LVLMs report a 23% improvement in hallucination metrics over the prior best method, with LURE ranking highest in both GPT-based and human evaluations; code and data are released.

Significance. If the central empirical findings hold under broader testing, LURE provides a practical, training-free mitigation strategy for a pervasive issue in vision-language systems, potentially improving reliability in downstream tasks like visual reasoning and summarization. The multi-model evaluation, inclusion of GPT and human judgments, and public release of code/data at the cited GitHub repository are clear strengths that support reproducibility and adoption.

major comments (2)

[Section 4] Section 4 (LURE algorithm description): the claim that the three identified factors are sufficient to drive reliable post-hoc reconstruction is load-bearing for the 23% metric improvement, yet the manuscript provides no ablation experiments that remove or isolate individual factors (co-occurrence, uncertainty, or position) while holding the revision prompt fixed. Without such controls, it remains possible that gains arise primarily from the generic revision step rather than the specific statistical guidance.
[Table 1] Table 1 or equivalent results table (six-LVLM evaluation): the reported 23% average improvement over the previous best approach aggregates across models and metrics, but the paper does not report per-factor contribution or per-model variance that would confirm the factors' causal role versus correlational association. This weakens the sufficiency argument raised in the skeptic note.
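
The ablation the referee asks for can be laid out as a small grid of prompt configurations. The sketch below only enumerates the conditions; it does not run any model. The factor labels and the `ablation_grid` helper are hypothetical names, and the choice of a leave-one-out design plus a no-factor baseline is one reasonable reading of the comment, not a design taken from the paper.

```python
FACTORS = ("co-occurrence", "uncertainty", "position")


def ablation_grid(factors=FACTORS):
    """List the factor subsets to evaluate with the revision prompt held fixed.

    Returns the full set, each leave-one-out subset, and the empty set
    (a generic revision step with no statistical guidance).
    """
    full = tuple(factors)
    leave_one_out = [
        tuple(f for f in factors if f != dropped) for dropped in factors
    ]
    return [full, *leave_one_out, ()]


grid = ablation_grid()
for config in grid:
    label = ", ".join(config) if config else "generic revision only"
    print(label)
```

Comparing the empty configuration against the full one separates the effect of revising at all from the effect of the three signals, which is the confound the referee raises.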

minor comments (2)

[Section 3] The notation for uncertainty (e.g., how token-level entropy or probability is aggregated into object-level uncertainty) should be defined more explicitly with an equation or pseudocode to aid replication.
[Figure 2] Figure 2 (or the factor visualization): axis labels and legend entries could be enlarged for clarity when the paper is viewed in print or on smaller screens.
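
As an illustration of the kind of definition the first minor comment requests, the sketch below aggregates token-level probabilities into one object-level uncertainty score. The use of negative log-probability, the mean and max aggregations, and the function name are assumptions made here; this page does not state how the paper defines the quantity.

```python
import math


def object_uncertainty(token_probs, aggregate="mean"):
    """Aggregate per-token probabilities for one object into a single score.

    token_probs: probabilities the decoder assigned to each token of the
    object's name. Each is converted to negative log-probability (surprisal),
    so higher scores mean the model was less certain.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    surprisals = [-math.log(p) for p in token_probs]
    if aggregate == "mean":
        return sum(surprisals) / len(surprisals)
    if aggregate == "max":
        return max(surprisals)
    raise ValueError(f"unknown aggregate: {aggregate!r}")


# Made-up example: an object split into two tokens with p = 0.5 and p = 0.25.
mean_score = object_uncertainty([0.5, 0.25])
max_score = object_uncertainty([0.5, 0.25], aggregate="max")
```

Mean aggregation treats multi-token object names evenly, while max aggregation flags an object as uncertain if any single token was; which one matches the paper is exactly what the referee wants spelled out.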

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below. Both comments correctly identify gaps in our current empirical validation of the three factors' specific contributions. We will incorporate the requested analyses in the revised manuscript.

Referee: [Section 4] Section 4 (LURE algorithm description): the claim that the three identified factors are sufficient to drive reliable post-hoc reconstruction is load-bearing for the 23% metric improvement, yet the manuscript provides no ablation experiments that remove or isolate individual factors (co-occurrence, uncertainty, or position) while holding the revision prompt fixed. Without such controls, it remains possible that gains arise primarily from the generic revision step rather than the specific statistical guidance.

Authors: We agree that the absence of controlled ablations isolating each factor (while holding the revision prompt structure fixed) leaves open the possibility that improvements arise from generic revision rather than the specific statistical guidance. Section 3 presents statistical evidence linking the three factors to hallucination, and LURE explicitly encodes them in the prompt, but this does not substitute for the requested ablations. We will add these experiments to the revised manuscript, reporting performance when each factor is removed individually from the prompt. revision: yes
Referee: [Table 1] Table 1 or equivalent results table (six-LVLM evaluation): the reported 23% average improvement over the previous best approach aggregates across models and metrics, but the paper does not report per-factor contribution or per-model variance that would confirm the factors' causal role versus correlational association. This weakens the sufficiency argument raised in the skeptic note.

Authors: We acknowledge that the aggregate 23% figure does not yet include per-factor contribution breakdowns or per-model variance statistics that would more directly support a causal interpretation. In the revision we will expand the results section with additional tables or supplementary figures that decompose performance by factor (via the ablations described above) and report per-model means and variances across the six LVLMs. revision: yes
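
A per-model breakdown of the kind promised here takes only a few lines. The sketch below uses the METEOR values from the Table 11 fragment quoted in the reference graph of this page, which covers four of the six models and is a general captioning metric, not the hallucination metric behind the 23% figure. It is meant only to show the shape of the computation; the dictionary layout and helper name are hypothetical.

```python
from statistics import mean, stdev

# METEOR scores (Original, LURE) as quoted in the Table 11 fragment.
METEOR = {
    "mPLUG-Owl": (28.7, 36.7),
    "LLaVa": (37.7, 43.9),
    "LLaMA-Adapter": (27.6, 33.4),
    "MiniGPT-4": (22.0, 25.6),
}


def relative_gain(before, after):
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


gains = {model: relative_gain(*scores) for model, scores in METEOR.items()}
average_gain = mean(list(gains.values()))
spread = stdev(list(gains.values()))   # sample standard deviation across models

for model, gain in gains.items():
    print(f"{model}: {gain:.1%}")
print(f"mean {average_gain:.1%}, stdev {spread:.1%}")
```

Reporting the spread next to the mean shows how much of an aggregate figure is carried by individual models.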

Circularity Check

0 steps flagged

No circularity: empirical analysis and post-hoc revision form self-contained method

full rationale

The paper conducts a statistical analysis of hallucination factors (co-occurrence, decoding uncertainty, text position) and uses the resulting observations to design the LURE post-hoc reconstruction algorithm. No derivation chain, equations, or first-principles result is presented that reduces by construction to fitted inputs, self-referential predictions, or load-bearing self-citations. The reported 23% metric improvement is an empirical outcome from evaluation on six LVLMs rather than a quantity forced by the paper's own definitions or prior author work. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical identification of three hallucination factors and the assumption that a revision procedure can exploit them; no explicit free parameters, new axioms, or invented entities are stated in the abstract.

axioms (1)

domain assumption: Object hallucination arises primarily from co-occurrence statistics, decoding uncertainty, and generation position, and can be mitigated by post-hoc reconstruction guided by these factors.
This premise underpins the design of LURE as described in the abstract.

pith-pipeline@v0.9.0 · 5549 in / 1238 out tokens · 44286 ms · 2026-05-17T22:42:30.899600+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence ... uncertainty ... and object position
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Theorem 2.1 ... $\mathrm{Err}(\hat{f}^{(2)}_2) \le \mathrm{Err}(\hat{f}^{(1)}_2)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights
cs.AI 2026-05 conditional novelty 7.0

TRIVIA+ is a new long-context RAG hallucination benchmark with four noisy label variants that shows current detectors have substantial room for improvement and are hindered by label noise.
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
q-bio.NC 2026-05 unverdicted novelty 6.0

Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.
Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent
cs.CV 2026-05 unverdicted novelty 6.0

CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
cs.CV 2026-05 unverdicted novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
cs.LG 2026-05 unverdicted novelty 6.0

LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
ReflectCAP: Detailed Image Captioning with Reflective Memory
cs.AI 2026-04 unverdicted novelty 6.0

ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
cs.LG 2026-05 unverdicted novelty 5.0

UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
cs.CV 2026-04 unverdicted novelty 5.0

MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing
cs.CV 2026-04 unverdicted novelty 5.0

VCE mitigates object hallucination in LVLMs by decomposing activation patterns from contrastive visual inputs via SVD to suppress hallucination subspaces through targeted parameter edits.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
cs.LG 2024-02 unverdicted novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
cs.CV 2024-02 unverdicted novelty 2.0

The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 16 Pith papers · 11 internal anchors

[1]

Spice: Semantic propositional image caption evaluation

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pp. 382–398. Springer,

work page 2016
[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 1901
[3]

PaLM: Scaling Language Modeling with Pathways

URL https://lmsys.org/blog/2023-03-30-vicuna/. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Imagenet: A large-scale hierarchical image database

10 Published as a conference paper at ICLR 2024 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE,

work page 2024
[5]

Beam Search Strategies for Neural Machine Translation

Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Detecting and preventing hallucinations in large vision language models

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394,

work page arXiv
[8]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[9]

Advancing medical imaging with language models: A journey from n-grams to chatgpt

Mingzhe Hu, Shaoyan Pan, Yuheng Li, and Xiaofeng Yang. Advancing medical imaging with language models: A journey from n-grams to chatgpt. arXiv preprint arXiv:2304.04920,

work page arXiv
[10]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a. Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with m...

work page internal anchor Pith review Pith/arXiv arXiv
[11]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b. Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards ...

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer,

work page 2014
[13]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a. Haokun Liu, Yaonan Zhu, Kenji Kato, Izumi Kondo, Tadayoshi Aoyama, and Yasuhisa Hasegawa. Llm-based human-robot collaboratio...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Llm as a robotic brain: Unifying egocentric memory and control

Jinjie Mai, Jun Chen, Bing Li, Guocheng Qian, Mohamed Elhoseiny, and Bernard Ghanem. Llm as a robotic brain: Unifying egocentric memory and control. arXiv preprint arXiv:2304.09349,

work page arXiv
[15]

Object hallucination in image captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045,

work page 2018
[16]

Can language models teach weaker agents? teacher explanations improve students via theory of mind

Swarnadeep Saha, Peter Hase, and Mohit Bansal. Can language models teach weaker agents? teacher explanations improve students via theory of mind. arXiv preprint arXiv:2306.09299,

work page arXiv
[17]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Evaluation and analysis of hallucination in large vision-language models

Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023a. Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. ...

work page arXiv 2024
[19]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Multi-grained vision language pre-training: Aligning texts with visual concepts

Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276,

work page arXiv
[21]

How language model hallucinations can snowball

Linjun Zhang, Zhun Deng, Kenji Kawaguchi, and James Zou. When and how mixup improves calibration. In International Conference on Machine Learning, pp. 26135–26160. PMLR, 2022a. Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023a. Renrui Zhang, Jiaming ...

work page arXiv 1904
[22]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Experimental Setting for the Uncertainty Analysis

from the COCO dataset, and the image descriptions are generated by MiniGPT-4 based on inference results from 5000 images in the COCO 2014 train dataset. Experimental Setting for the Uncertainty Analysis. Because uncertainty and position analysis are relatively independent from co-occurrence, in order to avoid conducting statistical analysis on the trainin...

work page 2014
[25]

Greedy-Decoding

and aims to guide the model in generating accurate descriptions by focusing on object recognition. • Greedy-Decoding: The difference between the “Greedy-Decoding” strategy and the “Original” strategy is that in the “Greedy-Decoding” strategy, the model uses greedy decoding instead of sampling during the generation of image descriptions to produce the most...

work page 2024
[26]

Table 8: Prompts for baselines

We can then write
$$(\hat{\beta}^{(1)}_1, \hat{\beta}^{(1)}_2) = \left(\rho_0\mu^*_1 + \frac{1}{N}\sum_{i=1}^{\rho_0 \cdot N}\epsilon_{i,1},\ \rho_0\mu^*_2 + \frac{1}{N}\sum_{i=1}^{\rho_0 \cdot N}\epsilon_{i,2}\right).$$
Table 8: Prompts for baselines.
Teacher: Reference caption: {blip2 caption} Please refer to reference caption and describe this picture:
CoT: Human: Please list the main objects in the picture and strictly f...

work page 2024
[27]

Figure 4: Human evaluation annotation interface

$$+ \frac{1}{2} P\left(\langle\phi_1(s_{<i}, x), \hat{\beta}^{(1)}_1\rangle + \langle\phi_2(s_{<i}, x), \hat{\beta}^{(1)}_2\rangle > 0 \mid y = -1\right) = \Phi\left(-\frac{\langle\mu^*_1, \hat{\beta}^{(1)}_1\rangle + \langle\mu^*_2, \hat{\beta}^{(1)}_2\rangle}{\sqrt{\|\hat{\beta}^{(1)}_1\|^2 + \|\hat{\beta}^{(1)}_2\|^2}}\right) = \Phi\left(-\frac{\rho_0\|\mu^*_1\|^2 + \rho_0\|\mu^*_2\|^2}{\sqrt{\rho_0^2\|\mu^*_1\|^2 + \rho_0^2\|\mu^*_2\|^2 + \frac{\rho_0 \cdot d}{N} + \frac{\rho_0 \cdot d}{N}}}\right) + o(1).$$
Figure 4: Human evaluation annotation interface. Table 9: The prompt for ChatGPT3.5 evaluation. Instru...

work page 2024
[28]

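As $\hat{\beta}_k = \mu^*_k + \frac{1}{n_k}\sum_{i=1}^{n_k}\epsilon_i := \mu^*_k + \frac{1}{\sqrt{n_k}} Z$, we have $\frac{\langle\mu^*_k, \hat{\beta}_k\rangle}{\|\hat{\beta}_k\|} = \frac{\|\mu^*_k\|^2 + \frac{1}{\sqrt{n_k}}\langle\mu^*_k, Z\rangle}{\sqrt{\|\mu^*_k\|^2 + \frac{2}{\sqrt{n_k}}\langle\mu^*_k, Z\rangle + \frac{1}{n_k}\|Z\|^2}}$

$$= P\left(\langle\phi(s_{<i}, x), \hat{\beta}_k\rangle > 0 \mid y = -1\right) = \Phi\left(-\frac{\langle\mu^*_k, \hat{\beta}_k\rangle}{\|\hat{\beta}_k\|}\right).$$
As $\hat{\beta}_k = \mu^*_k + \frac{1}{n_k}\sum_{i=1}^{n_k}\epsilon_i := \mu^*_k + \frac{1}{\sqrt{n_k}} Z$, we have
$$\frac{\langle\mu^*_k, \hat{\beta}_k\rangle}{\|\hat{\beta}_k\|} = \frac{\|\mu^*_k\|^2 + \frac{1}{\sqrt{n_k}}\langle\mu^*_k, Z\rangle}{\sqrt{\|\mu^*_k\|^2 + \frac{2}{\sqrt{n_k}}\langle\mu^*_k, Z\rangle + \frac{1}{n_k}\|Z\|^2}}.$$
As we assume $\|\mu^*_k\|^2 \ll d$, we have
$$\frac{\langle\mu^*_k, \hat{\beta}_k\rangle}{\|\hat{\beta}_k\|} = \frac{\|\mu^*_k\|^2}{\sqrt{\|\mu^*_k\|^2 + \frac{d}{n_k}}} + o(1).$$
As a result, if the total sample size is fixed, choosing large ...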

work page 2002
[29]

BERTScore measures the similarity between a reference text and a generated text by computing contextualized embeddings using BERT

is a method for evaluating the quality of natural language generation or summarization systems. BERTScore measures the similarity between a reference text and a generated text by computing contextualized embeddings using BERT. ROUGE-L ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence (Lin, 2004)) is an evaluation metr...

work page 2004
[30]

Currently, the CHAIR metric can only be applied to the COCO dataset, which limits its usability beyond that dataset

and CC (Conceptual Captions) (Changpinyo et al., 2021). Currently, the CHAIR metric can only be applied to the COCO dataset, which limits its usability beyond that dataset. To overcome this limitation, we manually annotate ImageNet and CC datasets to investigate object hallucination. Specifically, we randomly select 200 images from each dataset to be anno...

work page 2021
[31]

Ori + Cap

For a fair comparison, we conducted additional experiments in Table 14 on these datasets by providing input in the form of the question along with

Table 10: Performance of different models and baselines on general metrics.
Columns: Models, BLEU-1, BLEU-2, BLEU-3, BLEU-4, BERTS, ROUGE-L, CLIPS
mPLUG-Owl, Original: 30.37, 14.59, ...

work page 2024
[32]

Table 11: Performance on additional metrics (METEOR, CIDER, SPICE).

Models         Method    METEOR  CIDER  SPICE
mPLUG-Owl      Original  28.7    0.53   17.5
mPLUG-Owl      LURE      36.7    0.66   18.9
LLaVa          Original  37.7    0.61   22.6
LLaVa          LURE      43.9    0.67   31.4
LLaMA-Adapter  Original  27.6    0.59   21.8
LLaMA-Adapter  LURE      33.4    0.63   29.2
MiniGPT-4      Original  22.0    0.51   17.9
MiniGPT-4      LURE      25.6    0.55   26.4...

work page 2024
[33]

Our findings reveal that the incorporation of LURE leads to a significant reduction in hallucinatory objects, averaging around 56%, while only slightly affecting the presence of correctly identified objects, with an average decrease of approximately 1.6%. This noteworthy outcome can be attributed to the fact that LURE doesn’t merely eliminate potentiall...

work page 2024
[34]

Original caption

“Original caption” represents the original standard description, while the “Hallucination caption”

Original Caption: The image shows a man walking down a rainy sidewalk while holding a bright red umbrella to stay dry. The man walks next to a building as rain pours down, making the umbrella a necessary acces...

work page 2024
[35]

column represents the hallucinated description constructed by GPT-3.5

Table 19: Cases of generating hallucinatory descriptions. column represents the hallucinated description constructed by GPT-3.5. The red portions in the hallucination captions indicate the hallucinations added by GPT-3.5 based on co-occurring object lists and uncertain object lists. D.3 CASES OF REWRITING C...

work page 2024
[36]

Original

Upon comparing the descriptions generated by Revisor with those from the other methods, it becomes evident that Revisor surpasses the others in terms of accuracy and level of detail in describing the image. The description produced by Revisor effectively captures the key elements of the image, such as the presence of a man wearing a white shirt walking...

work page 2024