pith. machine review for the scientific record.

arxiv: 2310.00754 · v2 · pith:GNLFWKJ6new · submitted 2023-10-01 · 💻 cs.LG · cs.CL · cs.CV

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Pith reviewed 2026-05-17 22:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords object hallucination · large vision-language models · post-hoc revision · co-occurrence statistics · decoding uncertainty · text position · LURE · vision-language tasks

The pith

A post-hoc algorithm called LURE reduces object hallucinations in large vision-language models by reconstructing descriptions based on co-occurrence, uncertainty, and text position.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets object hallucination in large vision-language models, where generated descriptions include objects absent from the input image and thereby undermine downstream tasks such as visual summarization and reasoning. It introduces LURE, a lightweight revisor that corrects outputs after generation without retraining the underlying model. The approach rests on a statistical breakdown showing that hallucinations correlate with frequent object co-occurrences, higher uncertainty at decoding time, and later positions in the generated sentence. A sympathetic reader would value the method because it promises immediate gains on existing open-source models while remaining simple to deploy.

Core claim

LURE rectifies object hallucination post hoc by reconstructing less hallucinatory descriptions. It draws on three statistical factors: object co-occurrence in training data, uncertainty during LVLM decoding, and the tendency for hallucinations to appear later in generated text. Applied to six open-source LVLMs, it yields a 23 percent improvement in general object hallucination evaluation metrics over the previous best approach and ranks at the top in both GPT-based and human evaluations.

What carries the argument

LURE (LVLM Hallucination Revisor), an algorithm that uses detected statistical patterns in co-occurrence, decoding uncertainty, and sentence position to guide reconstruction of image descriptions.
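The three signals can be pictured as independent flags on each object mention. The following is a minimal sketch under stated assumptions, not the paper's implementation: the `ObjectMention` type, the `flag_suspicious` function, the "two of three" voting rule, and every threshold are hypothetical and exist only to illustrate how co-occurrence, uncertainty, and position might be combined.

```python
from dataclasses import dataclass


@dataclass
class ObjectMention:
    name: str          # object word as it appears in the description
    token_prob: float  # decoder probability of the object token (assumed available)
    position: int      # index of the token in the generated description


def flag_suspicious(mentions, n_tokens, cooccur,
                    prob_threshold=0.5, position_threshold=0.6,
                    cooccur_threshold=0.5):
    """Return names of mentions that trip at least two of the three signals.

    cooccur maps (a, b) to a hypothetical co-occurrence score in [0, 1].
    All thresholds are illustrative assumptions.
    """
    names = [m.name for m in mentions]
    flagged = []
    for m in mentions:
        # Signal 1: the decoder was unsure about this object token.
        uncertain = m.token_prob < prob_threshold
        # Signal 2: the object appears late in the description.
        late = (m.position / max(n_tokens - 1, 1)) > position_threshold
        # Signal 3: another mentioned object strongly co-occurs with this one.
        strongest = max(
            (cooccur.get((other, m.name), 0.0) for other in names if other != m.name),
            default=0.0,
        )
        prior_driven = strongest > cooccur_threshold
        if uncertain + late + prior_driven >= 2:
            flagged.append(m.name)
    return flagged


mentions = [ObjectMention("table", 0.92, 4), ObjectMention("fork", 0.31, 27)]
cooccur = {("table", "fork"): 0.8, ("fork", "table"): 0.8}
print(flag_suspicious(mentions, n_tokens=30, cooccur=cooccur))
```

In this toy input, "fork" is low-probability, late, and strongly associated with "table", so it is the only mention flagged; "table" trips only the co-occurrence signal.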

If this is right

  • LURE improves object hallucination metrics by 23 percent over prior methods when evaluated on six open-source LVLMs.
  • The revisor integrates directly with any existing LVLM without weight changes.
  • Both automated GPT judgments and human raters rank LURE-corrected outputs highest.
  • Corrected descriptions support more trustworthy visual summarization and reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the three factors generalize, similar lightweight revisors could target attribute or relation hallucinations in the same models.
  • Deployment pipelines could insert LURE as an automatic post-processing step to raise user trust without retraining costs.
  • Combining LURE with targeted fine-tuning on domain-specific images might produce further reductions in hallucinations.
  • Pretraining objectives that penalize the identified statistical patterns could lower hallucination rates at the source.

Load-bearing premise

That the three identified statistical factors suffice to direct reliable reconstruction of accurate descriptions across varied models and image domains.

What would settle it

A test set of images containing rare object combinations or a new LVLM whose decoding uncertainty does not align with hallucinated objects, where LURE produces no metric gains or introduces new errors.

Original abstract

Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes object hallucination in large vision-language models (LVLMs), identifying three statistical factors—object co-occurrence in images, decoding uncertainty, and later position in generated text—as key contributors. It proposes LURE, a post-hoc revision algorithm that uses these factors to reconstruct less hallucinatory descriptions, which can be integrated with any LVLM. Evaluations on six open-source LVLMs report a 23% improvement in hallucination metrics over the prior best method, with LURE ranking highest in both GPT-based and human evaluations; code and data are released.

Significance. If the central empirical findings hold under broader testing, LURE provides a practical, training-free mitigation strategy for a pervasive issue in vision-language systems, potentially improving reliability in downstream tasks like visual reasoning and summarization. The multi-model evaluation, inclusion of GPT and human judgments, and public release of code/data at the cited GitHub repository are clear strengths that support reproducibility and adoption.

major comments (2)
  1. [Section 4] Section 4 (LURE algorithm description): the claim that the three identified factors are sufficient to drive reliable post-hoc reconstruction is load-bearing for the 23% metric improvement, yet the manuscript provides no ablation experiments that remove or isolate individual factors (co-occurrence, uncertainty, or position) while holding the revision prompt fixed. Without such controls, it remains possible that gains arise primarily from the generic revision step rather than the specific statistical guidance.
  2. [Table 1] Table 1 or equivalent results table (six-LVLM evaluation): the reported 23% average improvement over the previous best approach aggregates across models and metrics, but the paper does not report per-factor contribution or per-model variance that would confirm the factors' causal role versus correlational association. This weakens the sufficiency argument raised in the skeptic note.
minor comments (2)
  1. [Section 3] The notation for uncertainty (e.g., how token-level entropy or probability is aggregated into object-level uncertainty) should be defined more explicitly with an equation or pseudocode to aid replication.
  2. [Figure 2] Figure 2 (or the factor visualization): axis labels and legend entries could be enlarged for clarity when the paper is viewed in print or on smaller screens.
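The first minor comment asks how token-level quantities become an object-level uncertainty. One possible definition, sketched here as an assumption and not as the paper's notation, scores each token by its negative log-probability and aggregates over the tokens that spell the object. The function names and the choice of mean or max aggregation are hypothetical.

```python
import math


def token_surprisal(prob):
    """Negative log-probability of a single decoded token."""
    return -math.log(prob)


def object_uncertainty(token_probs, reduce="mean"):
    """Aggregate per-token surprisal over the tokens that spell one object.

    token_probs: decoder probabilities in (0, 1] for each sub-token of the object.
    reduce: "mean" averages the surprisals, "max" keeps the least confident token.
    """
    if not token_probs:
        raise ValueError("an object must span at least one token")
    scores = [token_surprisal(p) for p in token_probs]
    if reduce == "mean":
        return sum(scores) / len(scores)
    if reduce == "max":
        return max(scores)
    raise ValueError(f"unknown reduce mode: {reduce}")
```

Mean aggregation treats a multi-token object as a whole, while max aggregation lets a single doubtful sub-token dominate. Which of the two (or something else entirely) matches the paper is exactly what the referee asks the authors to state.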

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below. Both comments correctly identify gaps in our current empirical validation of the three factors' specific contributions. We will incorporate the requested analyses in the revised manuscript.

Point-by-point responses
  1. Referee: [Section 4] Section 4 (LURE algorithm description): the claim that the three identified factors are sufficient to drive reliable post-hoc reconstruction is load-bearing for the 23% metric improvement, yet the manuscript provides no ablation experiments that remove or isolate individual factors (co-occurrence, uncertainty, or position) while holding the revision prompt fixed. Without such controls, it remains possible that gains arise primarily from the generic revision step rather than the specific statistical guidance.

    Authors: We agree that the absence of controlled ablations isolating each factor (while holding the revision prompt structure fixed) leaves open the possibility that improvements arise from generic revision rather than the specific statistical guidance. Section 3 presents statistical evidence linking the three factors to hallucination, and LURE explicitly encodes them in the prompt, but this does not substitute for the requested ablations. We will add these experiments to the revised manuscript, reporting performance when each factor is removed individually from the prompt.

    revision: yes
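The promised ablation amounts to a small grid of prompt configurations. As a hedged sketch of that design (the configuration names and the `ablation_configs` helper are hypothetical, and the "generic revision only" arm is the control the referee's concern implies), the grid could be enumerated like this:

```python
from itertools import combinations

FACTORS = ("co-occurrence", "uncertainty", "position")


def ablation_configs(factors=FACTORS):
    """Enumerate prompt configurations: full, leave-one-out, and no factors."""
    configs = [{"name": "full", "enabled": set(factors)}]
    # Leave-one-out: keep every factor except one.
    for kept in combinations(factors, len(factors) - 1):
        removed = (set(factors) - set(kept)).pop()
        configs.append({"name": f"without {removed}", "enabled": set(kept)})
    # Control arm: same revision step, none of the statistical guidance.
    configs.append({"name": "generic revision only", "enabled": set()})
    return configs


for config in ablation_configs():
    print(config["name"], sorted(config["enabled"]))
```

Comparing the full configuration against the control arm would separate the effect of the statistical guidance from the effect of revising at all.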

  2. Referee: [Table 1] Table 1 or equivalent results table (six-LVLM evaluation): the reported 23% average improvement over the previous best approach aggregates across models and metrics, but the paper does not report per-factor contribution or per-model variance that would confirm the factors' causal role versus correlational association. This weakens the sufficiency argument raised in the skeptic note.

    Authors: We acknowledge that the aggregate 23% figure does not yet include per-factor contribution breakdowns or per-model variance statistics that would more directly support a causal interpretation. In the revision we will expand the results section with additional tables or supplementary figures that decompose performance by factor (via the ablations described above) and report per-model means and variances across the six LVLMs.

    revision: yes
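A per-model breakdown of this kind is simple to compute once per-model scores are tabulated. The sketch below is only an illustration of the bookkeeping: it uses the METEOR values that survive in the extracted Table 11 fragment further down this page (four models only, because the fragment is truncated). METEOR is a general captioning metric, so these numbers do not reproduce or decompose the 23% hallucination figure. The helper names are hypothetical.

```python
from statistics import mean, pvariance

# METEOR scores (Original, LURE) copied from the extracted Table 11 fragment
# on this page; the fragment is truncated, so only four models appear.
METEOR = {
    "mPLUG-Owl": (28.7, 36.7),
    "LLaVa": (37.7, 43.9),
    "LLaMA-Adapter": (27.6, 33.4),
    "MiniGPT-4": (22.0, 25.6),
}


def per_model_gains(scores):
    """Absolute gain (LURE minus Original) for each model."""
    return {model: after - before for model, (before, after) in scores.items()}


def summarize(gains):
    """Mean and population variance of the per-model gains."""
    values = list(gains.values())
    return mean(values), pvariance(values)


gains = per_model_gains(METEOR)
avg, var = summarize(gains)
for model, gain in gains.items():
    print(f"{model}: {gain:+.1f}")
print(f"mean gain {avg:.2f}, variance {var:.2f}")
```

Reporting the spread alongside the mean makes it visible when an aggregate improvement is driven by one or two models.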

Circularity Check

0 steps flagged

No circularity: empirical analysis and post-hoc revision form self-contained method

Full rationale

The paper conducts a statistical analysis of hallucination factors (co-occurrence, decoding uncertainty, text position) and uses the resulting observations to design the LURE post-hoc reconstruction algorithm. No derivation chain, equations, or first-principles result is presented that reduces by construction to fitted inputs, self-referential predictions, or load-bearing self-citations. The reported 23% metric improvement is an empirical outcome from evaluation on six LVLMs rather than a quantity forced by the paper's own definitions or prior author work. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical identification of three hallucination factors and the assumption that a revision procedure can exploit them; no explicit free parameters, new axioms, or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Object hallucination arises primarily from co-occurrence statistics, decoding uncertainty, and generation position, and can be mitigated by post-hoc reconstruction guided by these factors.
    This premise underpins the design of LURE as described in the abstract.
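To make "co-occurrence statistics" concrete, here is a minimal sketch that estimates how often one object appears given another from per-image object sets. It assumes annotations are available as sets of object names; the `cooccurrence_scores` function and the conditional-frequency definition are illustrative choices, not the statistic the paper reports.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(image_objects):
    """Estimate P(b present | a present) from per-image object sets.

    image_objects: iterable of collections of object names, one per image.
    Returns a dict mapping (a, b) to the fraction of images containing a
    that also contain b.
    """
    single = Counter()
    pair = Counter()
    for objects in image_objects:
        unique = set(objects)
        single.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    return {(a, b): count / single[a] for (a, b), count in pair.items()}


images = [{"fork", "table"}, {"table"}, {"fork", "table", "cup"}]
scores = cooccurrence_scores(images)
print(scores[("fork", "table")], scores[("table", "fork")])
```

The score is asymmetric by design: in the toy data every image with a fork also has a table, but not every image with a table has a fork.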

pith-pipeline@v0.9.0 · 5549 in / 1238 out tokens · 44286 ms · 2026-05-17T22:42:30.899600+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

    cs.AI 2026-05 conditional novelty 7.0

    TRIVIA+ is a new long-context RAG hallucination benchmark with four noisy label variants that shows current detectors have substantial room for improvement and are hindered by label noise.

  2. DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.

  3. Letting the neural code speak: Automated characterization of monkey visual neurons through human language

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.

  4. Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

    cs.CV 2026-05 unverdicted novelty 6.0

    CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.

  5. CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    cs.CV 2026-05 unverdicted novelty 6.0

    CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...

  6. Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

    cs.LG 2026-05 unverdicted novelty 6.0

    LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.

  7. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  8. Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.

  9. ReflectCAP: Detailed Image Captioning with Reflective Memory

    cs.AI 2026-04 unverdicted novelty 6.0

    ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...

  10. Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.

  11. Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

    cs.CV 2026-04 unverdicted novelty 5.0

    MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

  12. VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing

    cs.CV 2026-04 unverdicted novelty 5.0

    VCE mitigates object hallucination in LVLMs by decomposing activation patterns from contrastive visual inputs via SVD to suppress hallucination subspaces through targeted parameter edits.

  13. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  14. Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

    cs.LG 2024-02 unverdicted novelty 5.0

    POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

  15. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

  16. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 16 Pith papers · 11 internal anchors

  1. [1]

    Spice: Semantic propositional image caption evaluation

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pp. 382–398. Springer,

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  3. [3]

    PaLM: Scaling Language Modeling with Pathways

    URL https://lmsys.org/blog/2023-03-30-vicuna/. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,

  4. [4]

    ImageNet: A large-scale hierarchical image database

    Published as a conference paper at ICLR 2024. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE,

  5. [5]

    Beam Search Strategies for Neural Machine Translation

    Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806,

  6. [6]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394,

  7. [7]

    Detecting and preventing hallucinations in large vision language models

    Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394,

  8. [8]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751,

  9. [9]

    Advancing medical imaging with language models: A journey from n-grams to chatgpt

    Mingzhe Hu, Shaoyan Pan, Yuheng Li, and Xiaofeng Yang. Advancing medical imaging with language models: A journey from n-grams to chatgpt. arXiv preprint arXiv:2304.04920,

  10. [10]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a. Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with m...

  11. [11]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b. Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards ...

  12. [12]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer,

  13. [13]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a. Haokun Liu, Yaonan Zhu, Kenji Kato, Izumi Kondo, Tadayoshi Aoyama, and Yasuhisa Hasegawa. Llm-based human-robot collaboratio...

  14. [14]

    Llm as a robotic brain: Unifying egocentric memory and control

    Jinjie Mai, Jun Chen, Bing Li, Guocheng Qian, Mohamed Elhoseiny, and Bernard Ghanem. Llm as a robotic brain: Unifying egocentric memory and control. arXiv preprint arXiv:2304.09349,

  15. [15]

    Object hallucination in image captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045,

  16. [16]

    Can language models teach weaker agents? teacher explanations improve students via theory of mind

    Swarnadeep Saha, Peter Hase, and Mohit Bansal. Can language models teach weaker agents? teacher explanations improve students via theory of mind. arXiv preprint arXiv:2306.09299,

  17. [17]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  18. [18]

    Evaluation and analysis of hallucination in large vision-language models

    Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023a. Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. ...

  19. [19]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178,

  20. [20]

    Multi-grained vision language pre-training: Aligning texts with visual concepts

    Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276,

  21. [21]

    How language model hallucinations can snowball

    Linjun Zhang, Zhun Deng, Kenji Kawaguchi, and James Zou. When and how mixup improves calibration. In International Conference on Machine Learning, pp. 26135–26160. PMLR, 2022a. Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023a. Renrui Zhang, Jiaming ...

  22. [22]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,

  23. [23]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,

  24. [24]

    Experimental Setting for the Uncertainty Analysis

    from the COCO dataset, and the image descriptions are generated by MiniGPT-4 based on inference results from 5000 images in the COCO 2014 train dataset. Experimental Setting for the Uncertainty Analysis. Because uncertainty and position analysis are relatively independent from co-occurrence, in order to avoid conducting statistical analysis on the trainin...

  25. [25]

    Greedy-Decoding

    and aims to guide the model in generating accurate descriptions by focusing on object recognition.

    • Greedy-Decoding: The difference between the “Greedy-Decoding” strategy and the “Original” strategy is that in the “Greedy-Decoding” strategy, the model uses greedy decoding instead of sampling during the generation of image descriptions to produce the most...

  26. [26]

    Table 8: Prompts for baselines

    We can then write $(\hat\beta_1^{(1)}, \hat\beta_2^{(1)}) = \left(\rho_0\mu_1^* + \frac{1}{N}\sum_{i=1}^{\rho_0 N}\epsilon_{i,1},\ \rho_0\mu_2^* + \frac{1}{N}\sum_{i=1}^{\rho_0 N}\epsilon_{i,2}\right)$.

    Table 8: Prompts for baselines.
    Teacher: Reference caption: {blip2 caption} Please refer to reference caption and describe this picture:
    CoT: Human: Please list the main objects in the picture and strictly f...

  27. [27]

    Figure 4: Human evaluation annotation interface

    $+ \frac{1}{2} P\left(\langle \phi_1(s_{<i}, x), \hat\beta_1^{(1)} \rangle + \langle \phi_2(s_{<i}, x), \hat\beta_2^{(1)} \rangle > 0 \mid y = -1\right) = \Phi\left(-\frac{\langle \mu_1^*, \hat\beta_1 \rangle + \langle \mu_2^*, \hat\beta_2 \rangle}{\sqrt{\|\hat\beta_1\|^2 + \|\hat\beta_2\|^2}}\right) = \Phi\left(-\frac{\rho_0\|\mu_1^*\|^2 + \rho_0\|\mu_2^*\|^2}{\sqrt{\rho_0^2\|\mu_1^*\|^2 + \rho_0^2\|\mu_2^*\|^2 + \frac{\rho_0 d}{N} + \frac{\rho_0 d}{N}}}\right) + o(1).$

    Figure 4: Human evaluation annotation interface.
    Table 9: The prompt for ChatGPT3.5 evaluation. Instru...

  28. [28]

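    As $\hat\beta_k = \mu_k^* + \frac{1}{n_k}\sum_{i=1}^{n_k}\epsilon_i := \mu_k^* + \frac{1}{\sqrt{n_k}} Z$, we have $\frac{\langle \mu_k^*, \hat\beta_k \rangle}{\|\hat\beta_k\|} = \frac{\|\mu_k^*\|^2 + \frac{1}{\sqrt{n_k}}\langle \mu_k^*, Z \rangle}{\sqrt{\|\mu_k^*\|^2 + \frac{2}{\sqrt{n_k}}\langle \mu_k^*, Z \rangle + \frac{1}{n_k}\|Z\|^2}}$

    $= P\left(\langle \phi(s_{<i}, x), \hat\beta_k \rangle > 0 \mid y = -1\right) = \Phi\left(-\frac{\langle \mu_k^*, \hat\beta_k \rangle}{\|\hat\beta_k\|}\right)$. As $\hat\beta_k = \mu_k^* + \frac{1}{n_k}\sum_{i=1}^{n_k}\epsilon_i := \mu_k^* + \frac{1}{\sqrt{n_k}} Z$, we have $\frac{\langle \mu_k^*, \hat\beta_k \rangle}{\|\hat\beta_k\|} = \frac{\|\mu_k^*\|^2 + \frac{1}{\sqrt{n_k}}\langle \mu_k^*, Z \rangle}{\sqrt{\|\mu_k^*\|^2 + \frac{2}{\sqrt{n_k}}\langle \mu_k^*, Z \rangle + \frac{1}{n_k}\|Z\|^2}}$. As we assume $\|\mu_k^*\|^2 \ll d$, we have $\frac{\langle \mu_k^*, \hat\beta_k \rangle}{\|\hat\beta_k\|} = \frac{\|\mu_k^*\|^2}{\sqrt{\|\mu_k^*\|^2 + \frac{d}{n_k}}} + o(1)$. As a result, if the total sample size is fixed, choosing large ...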

  29. [29]

    BERTScore measures the similarity between a reference text and a generated text by computing contextualized embeddings using BERT

    is a method for evaluating the quality of natural language generation or summarization systems. BERTScore measures the similarity between a reference text and a generated text by computing contextualized embeddings using BERT.

    ROUGE-L: ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence (Lin, 2004)) is an evaluation metr...

  30. [30]

    Currently, the CHAIR metric can only be applied to the COCO dataset, which limits its usability beyond that dataset

    and CC (Conceptual Captions) (Changpinyo et al., 2021). Currently, the CHAIR metric can only be applied to the COCO dataset, which limits its usability beyond that dataset. To overcome this limitation, we manually annotate ImageNet and CC datasets to investigate object hallucination. Specifically, we randomly select 200 images from each dataset to be anno...
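For readers unfamiliar with CHAIR (Rohrbach et al., reference 15 above), the metric has an instance-level and a sentence-level variant: CHAIR_I is the fraction of mentioned objects that are not in the image, and CHAIR_S is the fraction of captions containing at least one such object. The sketch below assumes object mentions have already been extracted and mapped to category names, which is the step that ties the original metric to COCO annotations; the `chair_scores` function and its input format are hypothetical.

```python
def chair_scores(captions):
    """Compute (CHAIR_I, CHAIR_S) from pre-extracted object lists.

    captions: iterable of (mentioned_objects, ground_truth_objects) pairs,
    where mentioned_objects is a list of object names found in one caption
    and ground_truth_objects is the set of objects annotated for its image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    total_captions = 0
    hallucinated_captions = 0
    for mentioned, truth in captions:
        wrong = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        total_captions += 1
        hallucinated_captions += bool(wrong)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / total_captions if total_captions else 0.0
    return chair_i, chair_s


example = [
    (["man", "umbrella", "car"], {"man", "umbrella"}),
    (["dog"], {"dog"}),
]
print(chair_scores(example))
```

Because both scores depend on a ground-truth object set per image, extending the metric to ImageNet or CC requires the kind of manual annotation the passage describes.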

  31. [31]

    Ori + Cap

    For a fair comparison, we conducted additional experiments in Table 14 on these datasets by providing input in the form of the question along with

    Table 10: Performance of different models and baselines on general metrics.

    Models      Method     BLEU-1   BLEU-2   BLEU-3   BLEU-4   BERTS   ROUGE-L   CLIPS
    mPLUG-Owl   Original   30.37    14.59    ...

  32. [32]

    Table 11: Performance on additional metrics – METEOR, CIDER, SPICE.

    Models          Method     METEOR   CIDER   SPICE
    mPLUG-Owl       Original   28.7     0.53    17.5
    mPLUG-Owl       LURE       36.7     0.66    18.9
    LLaVa           Original   37.7     0.61    22.6
    LLaVa           LURE       43.9     0.67    31.4
    LLaMA-Adapter   Original   27.6     0.59    21.8
    LLaMA-Adapter   LURE       33.4     0.63    29.2
    MiniGPT-4       Original   22.0     0.51    17.9
    MiniGPT-4       LURE       25.6     0.55    26.4...

  33. [33]

    Our findings reveal that the incorporation of LURE leads to a significant reduction in hallucinatory objects, averaging around 56%, while only slightly affecting the presence of correctly identified objects, with an average decrease of approximately 1.6%. This noteworthy outcome can be attributed to the fact that LURE doesn’t merely eliminate potentiall...

  34. [34]

    Original caption

    “Original caption” represents the original standard description, while the “Hallucination caption”

    Original Caption: The image shows a man walking down a rainy sidewalk while holding a bright red umbrella to stay dry. The man walks next to a building as rain pours down, making the umbrella a necessary acces...

  35. [35]

    column represents the hallucinated description constructed by GPT-3.5

    Table 19: Cases of generating hallucinatory descriptions. column represents the hallucinated description constructed by GPT-3.5. The red portions in the hallucination captions indicate the hallucinations added by GPT-3.5 based on co-occurring object lists and uncertain object lists.

    D.3 CASES OF REWRITING C...

  36. [36]

    Original

    Upon comparing the descriptions generated by Revisor with those from the other methods, it becomes evident that Revisor surpasses the others in terms of accuracy and level of detail in describing the image. The description produced by Revisor effectively captures the key elements of the image, such as the presence of a man wearing a white shirt walking...