Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
Pith reviewed 2026-05-17 22:42 UTC · model grok-4.3
The pith
A post-hoc algorithm called LURE reduces object hallucinations in large vision-language models by reconstructing descriptions based on co-occurrence, uncertainty, and text position.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LURE rectifies object hallucination post hoc by reconstructing less hallucinatory descriptions. It draws on three statistical factors: object co-occurrence in training data, uncertainty during LVLM decoding, and the tendency for hallucinations to appear later in generated text. Applied to six open-source LVLMs, it yields a 23 percent improvement in general object hallucination evaluation metrics over the previous best approach and ranks at the top in both GPT and human evaluations.
What carries the argument
LURE (LVLM Hallucination Revisor), an algorithm that uses detected statistical patterns in co-occurrence, decoding uncertainty, and sentence position to guide reconstruction of image descriptions.
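The three signals named above can be made concrete with a small sketch. It is illustrative only and not the authors' implementation: the function names, the toy object lists, and the use of negative log-probability as the uncertainty score are assumptions introduced here.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(image_objects):
    """Count how often each unordered object pair appears in the same image."""
    pairs = Counter()
    for objs in image_objects:
        # Sort so that each pair has one canonical key, e.g. ("dog", "frisbee").
        for a, b in combinations(sorted(set(objs)), 2):
            pairs[(a, b)] += 1
    return pairs


def token_uncertainty(prob):
    """Negative log-probability of a decoded token; higher means less confident."""
    return -math.log(prob)


def relative_position(token_index, num_tokens):
    """Position of a token in the description, scaled to the range [0, 1]."""
    return token_index / max(num_tokens - 1, 1)


# Hypothetical toy annotations: the objects present in three images.
images = [["dog", "frisbee", "grass"], ["dog", "frisbee"], ["cat", "sofa"]]
pairs = cooccurrence_counts(images)
```

Under these assumptions, an object that frequently co-occurs with an object already mentioned, was decoded with high uncertainty, and appears late in the text would be a natural candidate for revision.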
If this is right
- LURE improves general object hallucination evaluation metrics by 23 percent over the previous best approach when evaluated on six open-source LVLMs.
- The revisor integrates directly with any existing LVLM without weight changes.
- Both automated GPT judgments and human raters rank LURE-corrected outputs highest.
- Corrected descriptions support more trustworthy visual summarization and reasoning.
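This page does not name the metrics behind the 23 percent figure. Assuming they are CHAIR-style scores (the paper excerpts further down mention CHAIR), a minimal sketch of the two usual variants follows. The function name and the toy captions are hypothetical, and object extraction from text is assumed to have happened already.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Return (CHAIR_I, CHAIR_S) for parallel lists of captions and images.

    CHAIR_I: hallucinated object mentions / all object mentions.
    CHAIR_S: captions with at least one hallucinated object / all captions.
    """
    mentioned = hallucinated = bad_captions = 0
    for objs, truth in zip(mentioned_per_caption, truth_per_image):
        wrong = [o for o in objs if o not in truth]
        mentioned += len(objs)
        hallucinated += len(wrong)
        bad_captions += bool(wrong)
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = bad_captions / len(mentioned_per_caption) if mentioned_per_caption else 0.0
    return chair_i, chair_s


# Hypothetical example: "bench" is not in the first image.
captions = [["dog", "frisbee", "bench"], ["cat"]]
truth = [{"dog", "frisbee"}, {"cat"}]
chair_i, chair_s = chair_scores(captions, truth)
```

Lower is better for both scores, so an "improvement" in this setting means a reduction.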
Where Pith is reading between the lines
- If the three factors generalize, similar lightweight revisors could target attribute or relation hallucinations in the same models.
- Deployment pipelines could insert LURE as an automatic post-processing step to raise user trust without retraining costs.
- Combining LURE with targeted fine-tuning on domain-specific images might produce further reductions in hallucinations.
- Pretraining objectives that penalize the identified statistical patterns could lower hallucination rates at the source.
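The post-processing idea in the second bullet can be sketched as a thin wrapper around any captioning function. This is a hedged illustration, not LURE's API: `with_revisor`, `toy_describe`, and `toy_revise` are hypothetical names, and the stub revisor only mimics the removal of a suspect phrase.

```python
from typing import Any, Callable


def with_revisor(
    describe: Callable[[Any], str],
    revise: Callable[[Any, str], str],
) -> Callable[[Any], str]:
    """Wrap a captioner so every draft passes through a revisor before returning."""
    def pipeline(image: Any) -> str:
        draft = describe(image)
        return revise(image, draft)
    return pipeline


# Stubs standing in for a real LVLM and a real revisor (both hypothetical).
def toy_describe(image):
    return "A dog catches a frisbee near a bench."


def toy_revise(image, draft):
    return draft.replace(" near a bench", "")


caption = with_revisor(toy_describe, toy_revise)("image.jpg")
```

Because the wrapper only needs the draft text and the image, the underlying model's weights stay untouched, which is the property the bullet relies on.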
Load-bearing premise
That the three identified statistical factors suffice to direct reliable reconstruction of accurate descriptions across varied models and image domains.
What would settle it
A test set of images containing rare object combinations or a new LVLM whose decoding uncertainty does not align with hallucinated objects, where LURE produces no metric gains or introduces new errors.
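One way to probe the second scenario is to check whether decoding uncertainty actually separates hallucinated from real objects for a given LVLM. The sketch below computes a rank-based AUROC; the function name and the toy scores are assumptions, and this check is not described in the source.

```python
def auroc(scores, labels):
    """Probability that a hallucinated object has higher uncertainty than a real one.

    scores: per-object uncertainty values.
    labels: truthy where the object is hallucinated.
    Ties count as half a win; 0.5 means uncertainty carries no signal.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one hallucinated and one real object")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical uncertainties for four objects; the first two are hallucinated.
separation = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

A value near 0.5 on a new model would indicate the misalignment described above, where the uncertainty factor offers the revisor little guidance.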
Original abstract
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes object hallucination in large vision-language models (LVLMs), identifying three statistical factors—object co-occurrence in images, decoding uncertainty, and later position in generated text—as key contributors. It proposes LURE, a post-hoc revision algorithm that uses these factors to reconstruct less hallucinatory descriptions, which can be integrated with any LVLM. Evaluations on six open-source LVLMs report a 23% improvement in hallucination metrics over the prior best method, with LURE ranking highest in both GPT-based and human evaluations; code and data are released.
Significance. If the central empirical findings hold under broader testing, LURE provides a practical, training-free mitigation strategy for a pervasive issue in vision-language systems, potentially improving reliability in downstream tasks like visual reasoning and summarization. The multi-model evaluation, inclusion of GPT and human judgments, and public release of code/data at the cited GitHub repository are clear strengths that support reproducibility and adoption.
major comments (2)
- [Section 4] Section 4 (LURE algorithm description): the claim that the three identified factors are sufficient to drive reliable post-hoc reconstruction is load-bearing for the 23% metric improvement, yet the manuscript provides no ablation experiments that remove or isolate individual factors (co-occurrence, uncertainty, or position) while holding the revision prompt fixed. Without such controls, it remains possible that gains arise primarily from the generic revision step rather than the specific statistical guidance.
- [Table 1] Table 1 or equivalent results table (six-LVLM evaluation): the reported 23% average improvement over the previous best approach aggregates across models and metrics, but the paper does not report per-factor contribution or per-model variance that would confirm the factors' causal role versus correlational association. This weakens the sufficiency argument raised in the skeptic note.
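The per-model reporting requested in the second comment could look like the following sketch. The model labels and all numbers are placeholders invented for illustration; they are not results from the paper.

```python
from statistics import mean, variance


def summarize_by_model(scores_by_model):
    """Return the mean and sample variance of a metric for each model."""
    summary = {}
    for model, scores in scores_by_model.items():
        summary[model] = {
            "mean": mean(scores),
            # Sample variance needs at least two observations.
            "variance": variance(scores) if len(scores) > 1 else 0.0,
        }
    return summary


# Placeholder values only, e.g. one hallucination score per evaluation run.
placeholder_scores = {"model_a": [10.0, 20.0], "model_b": [5.0, 5.0, 5.0]}
summary = summarize_by_model(placeholder_scores)
```

Reporting these alongside the aggregate would show whether the average gain is driven by a few models or holds across all six.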
minor comments (2)
- [Section 3] The notation for uncertainty (e.g., how token-level entropy or probability is aggregated into object-level uncertainty) should be defined more explicitly with an equation or pseudocode to aid replication.
- [Figure 2] Figure 2 (or the factor visualization): axis labels and legend entries could be enlarged for clarity when the paper is viewed in print or on smaller screens.
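To illustrate the kind of definition the first minor comment asks for, one candidate way to aggregate token-level uncertainty into an object-level score is written out below. This is an assumption for illustration, not the paper's stated definition: $T_o$ (the token positions spanning object $o$) and the choice between mean and max are introduced here, while $s_{<t}$ and $x$ follow the notation in the paper excerpts.

```latex
u_t = -\log p\left(s_t \mid s_{<t}, x\right),
\qquad
u(o) = \frac{1}{|T_o|} \sum_{t \in T_o} u_t
\quad \text{or} \quad
u(o) = \max_{t \in T_o} u_t
```

The mean treats a multi-token object name evenly, while the max flags an object as soon as any of its tokens was decoded with low confidence; stating which one is used would settle the replication question.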
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below. Both comments correctly identify gaps in our current empirical validation of the three factors' specific contributions. We will incorporate the requested analyses in the revised manuscript.
Point-by-point responses
- Referee: [Section 4] Section 4 (LURE algorithm description): the claim that the three identified factors are sufficient to drive reliable post-hoc reconstruction is load-bearing for the 23% metric improvement, yet the manuscript provides no ablation experiments that remove or isolate individual factors (co-occurrence, uncertainty, or position) while holding the revision prompt fixed. Without such controls, it remains possible that gains arise primarily from the generic revision step rather than the specific statistical guidance.
  Authors: We agree that the absence of controlled ablations isolating each factor (while holding the revision prompt structure fixed) leaves open the possibility that improvements arise from generic revision rather than the specific statistical guidance. Section 3 presents statistical evidence linking the three factors to hallucination, and LURE explicitly encodes them in the prompt, but this does not substitute for the requested ablations. We will add these experiments to the revised manuscript, reporting performance when each factor is removed individually from the prompt.
  Revision: yes
- Referee: [Table 1] Table 1 or equivalent results table (six-LVLM evaluation): the reported 23% average improvement over the previous best approach aggregates across models and metrics, but the paper does not report per-factor contribution or per-model variance that would confirm the factors' causal role versus correlational association. This weakens the sufficiency argument raised in the skeptic note.
  Authors: We acknowledge that the aggregate 23% figure does not yet include per-factor contribution breakdowns or per-model variance statistics that would more directly support a causal interpretation. In the revision we will expand the results section with additional tables or supplementary figures that decompose performance by factor (via the ablations described above) and report per-model means and variances across the six LVLMs.
  Revision: yes
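The ablation design discussed in the responses can be enumerated mechanically. The sketch below is one hypothetical layout, not the authors' protocol: the configuration names and the extra "generic-revision" control (no factors at all, same revision step) are assumptions added here to separate factor guidance from the effect of revising at all.

```python
FACTORS = ("co-occurrence", "uncertainty", "position")


def ablation_configs(factors=FACTORS):
    """List (name, active_factors) pairs for a leave-one-out ablation."""
    configs = [("full", tuple(factors))]
    for dropped in factors:
        kept = tuple(f for f in factors if f != dropped)
        configs.append((f"no-{dropped}", kept))
    # Control: same revision prompt structure with no factor guidance.
    configs.append(("generic-revision", ()))
    return configs


configs = ablation_configs()
```

Running every configuration on each of the six LVLMs with the revision prompt otherwise fixed would yield the per-factor breakdown both referee comments request.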
Circularity Check
No circularity: empirical analysis and post-hoc revision form self-contained method
full rationale
The paper conducts a statistical analysis of hallucination factors (co-occurrence, decoding uncertainty, text position) and uses the resulting observations to design the LURE post-hoc reconstruction algorithm. No derivation chain, equations, or first-principles result is presented that reduces by construction to fitted inputs, self-referential predictions, or load-bearing self-citations. The reported 23% metric improvement is an empirical outcome from evaluation on six LVLMs rather than a quantity forced by the paper's own definitions or prior author work. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Object hallucination arises primarily from co-occurrence statistics, decoding uncertainty, and generation position, and can be mitigated by post-hoc reconstruction guided by these factors.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence ... uncertainty ... and object position
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Theorem 2.1 ... $\mathrm{Err}(\hat{f}^{(2)}_2) \le \mathrm{Err}(\hat{f}^{(1)}_2)$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 16 Pith papers
- Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights
  TRIVIA+ is a new long-context RAG hallucination benchmark with four noisy label variants that shows current detectors have substantial room for improvement and are hindered by label noise.
- DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
  DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
- Letting the neural code speak: Automated characterization of monkey visual neurons through human language
  Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.
- Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent
  CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.
- CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
  CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
- Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
  LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
  HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
  Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
- ReflectCAP: Detailed Image Captioning with Reflective Memory
  ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...
- Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
  UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
- Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
  MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
- VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing
  VCE mitigates object hallucination in LVLMs by decomposing activation patterns from contrastive visual inputs via SVD to suppress hallucination subspaces through targeted parameter edits.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
  POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
  The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
-
[1]
Spice: Semantic propositional image caption evaluation
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pp. 382–398. Springer,
-
[2]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901,
-
[3]
PaLM: Scaling Language Modeling with Pathways
URL https://lmsys.org/blog/2023-03-30-vicuna/. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,
-
[4]
Imagenet: A large-scale hierarchical image database
Published as a conference paper at ICLR 2024. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee,
-
[5]
Beam Search Strategies for Neural Machine Translation
Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806,
-
[6]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394,
-
[7]
Detecting and preventing hallucinations in large vision language models
Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394,
-
[8]
The Curious Case of Neural Text Degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751,
-
[9]
Advancing medical imaging with language models: A journey from n-grams to chatgpt
Mingzhe Hu, Shaoyan Pan, Yuheng Li, and Xiaofeng Yang. Advancing medical imaging with language models: A journey from n-grams to chatgpt. arXiv preprint arXiv:2304.04920,
-
[10]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a. Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with m...
-
[11]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b. Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards ...
-
[12]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer,
-
[13]
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a. Haokun Liu, Yaonan Zhu, Kenji Kato, Izumi Kondo, Tadayoshi Aoyama, and Yasuhisa Hasegawa. Llm-based human-robot collaboratio...
-
[14]
Llm as a robotic brain: Unifying egocentric memory and control
Jinjie Mai, Jun Chen, Bing Li, Guocheng Qian, Mohamed Elhoseiny, and Bernard Ghanem. Llm as a robotic brain: Unifying egocentric memory and control. arXiv preprint arXiv:2304.09349 ,
-
[15]
Object hallucination in image captioning
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045,
-
[16]
Can language models teach weaker agents? teacher explanations improve students via theory of mind
Swarnadeep Saha, Peter Hase, and Mohit Bansal. Can language models teach weaker agents? teacher explanations improve students via theory of mind. arXiv preprint arXiv:2306.09299 ,
-
[17]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
-
[18]
Evaluation and analysis of hallucination in large vision-language models
Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023a. Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. ...
-
[19]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178,
-
[20]
Multi-grained vision language pre-training: Aligning texts with visual concepts
Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276,
-
[21]
arXiv preprint arXiv:2305.13534 , year=
Linjun Zhang, Zhun Deng, Kenji Kawaguchi, and James Zou. When and how mixup improves calibration. In International Conference on Machine Learning, pp. 26135–26160. PMLR, 2022a. Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023a. Renrui Zhang, Jiaming ...
-
[22]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,
-
[23]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,
-
[24]
Experimental Setting for the Uncertainty Analysis
from the COCO dataset, and the image descriptions are generated by MiniGPT-4 based on inference results from 5000 images in the COCO 2014 train dataset. Experimental Setting for the Uncertainty Analysis. Because uncertainty and position analysis are relatively independent from co-occurrence, in order to avoid conducting statistical analysis on the trainin...
-
[25]
and aims to guide the model in generating accurate descriptions by focusing on object recognition. • Greedy-Decoding: The difference between the “Greedy-Decoding” strategy and the “Original” strategy is that in the “Greedy-Decoding” strategy, the model uses greedy decoding instead of sampling during the generation of image descriptions to produce the most...
-
[26]
Table 8: Prompts for baselines
We can then write $(\hat{\beta}^{(1)}_1, \hat{\beta}^{(1)}_2) = \left(\rho_0 \mu^*_1 + \frac{1}{N} \sum_{i=1}^{\rho_0 \cdot N} \epsilon_{i,1},\ \rho_0 \mu^*_2 + \frac{1}{N} \sum_{i=1}^{\rho_0 \cdot N} \epsilon_{i,2}\right)$. Table 8: Prompts for baselines. Teacher: Reference caption: {blip2 caption} Please refer to reference caption and describe this picture: CoT: Human: Please list the main objects in the picture and strictly f...
-
[27]
Figure 4: Human evaluation annotation interface
$+ \frac{1}{2} P\left(\langle \phi_1(s_{<i}, x), \hat{\beta}^{(1)}_1 \rangle + \langle \phi_2(s_{<i}, x), \hat{\beta}^{(1)}_2 \rangle > 0 \mid y = -1\right) = \Phi\left(-\frac{\langle \mu^*_1, \hat{\beta}^{(1)}_1 \rangle + \langle \mu^*_2, \hat{\beta}^{(1)}_2 \rangle}{\sqrt{\|\hat{\beta}^{(1)}_1\|^2 + \|\hat{\beta}^{(1)}_2\|^2}}\right) = \Phi\left(-\frac{\rho_0 \|\mu^*_1\|^2 + \rho_0 \|\mu^*_2\|^2}{\sqrt{\rho_0^2 \|\mu^*_1\|^2 + \rho_0^2 \|\mu^*_2\|^2 + \frac{\rho_0 d}{N} + \frac{\rho_0 d}{N}}}\right) + o(1)$. Figure 4: Human evaluation annotation interface. Table 9: The prompt for ChatGPT3.5 evaluation. Instru...
-
[28]
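$= P\left(\langle \phi(s_{<i}, x), \hat{\beta}_k \rangle > 0 \mid y = -1\right) = \Phi\left(-\frac{\langle \mu^*_k, \hat{\beta}_k \rangle}{\|\hat{\beta}_k\|}\right)$. As $\hat{\beta}_k = \mu^*_k + \frac{1}{n_k} \sum_{i=1}^{n_k} \epsilon_i := \mu^*_k + \frac{1}{\sqrt{n_k}} Z$, we have $\frac{\langle \mu^*_k, \hat{\beta}_k \rangle}{\|\hat{\beta}_k\|} = \frac{\|\mu^*_k\|^2 + \frac{1}{\sqrt{n_k}} \langle \mu^*_k, Z \rangle}{\sqrt{\|\mu^*_k\|^2 + \frac{2}{\sqrt{n_k}} \langle \mu^*_k, Z \rangle + \frac{1}{n_k} \|Z\|^2}}$. As we assume $\|\mu^*_k\|^2 \ll d$, we have $\frac{\langle \mu^*_k, \hat{\beta}_k \rangle}{\|\hat{\beta}_k\|} = \frac{\|\mu^*_k\|^2}{\sqrt{\|\mu^*_k\|^2 + \frac{d}{n_k}}} + o(1)$. As a result, if the total sample size is fixed, choosing large ...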
-
[29]
is a method for evaluating the quality of natural language generation or summarization systems. BERTScore measures the similarity between a reference text and a generated text by computing contextualized embeddings using BERT. ROUGE-L ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence (Lin, 2004)) is an evaluation metr...
-
[30]
and CC (Conceptual Captions) (Changpinyo et al., 2021). Currently, the CHAIR metric can only be applied to the COCO dataset, which limits its usability beyond that dataset. To overcome this limitation, we manually annotate ImageNet and CC datasets to investigate object hallucination. Specifically, we randomly select 200 images from each dataset to be anno...
-
[31]
For a fair comparison, we conducted additional experiments in Table 14 on these datasets by providing input in the form of the question along with
Table 10: Performance of different models and baselines on general metrics. Models BLEU-1 BLEU-2 BLEU-3 BLEU-4 BERTS ROUGE-L CLIPS mPLUG-Owl Original 30.37 14.59 ...
-
[32]
Table 11: Performance on additional metrics (METEOR, CIDER, SPICE).
Model          Method    METEOR  CIDER  SPICE
mPLUG-Owl      Original  28.7    0.53   17.5
mPLUG-Owl      LURE      36.7    0.66   18.9
LLaVa          Original  37.7    0.61   22.6
LLaVa          LURE      43.9    0.67   31.4
LLaMA-Adapter  Original  27.6    0.59   21.8
LLaMA-Adapter  LURE      33.4    0.63   29.2
MiniGPT-4      Original  22.0    0.51   17.9
MiniGPT-4      LURE      25.6    0.55   26.4...
-
[33]
Our findings reveal that the incorporation of LURE leads to a significant reduction in hallucinatory objects, averaging around 56%, while only slightly affecting the presence of correctly identified objects, with an average decrease of approximately 1.6%. This noteworthy outcome can be attributed to the fact that LURE doesn’t merely eliminate potentiall...
-
[34]
“Original caption” represents the original standard description, while the “Hallucination caption”
Original Caption: The image shows a man walking down a rainy sidewalk while holding a bright red umbrella to stay dry. The man walks next to a building as rain pours down, making the umbrella a necessary acces...
-
[35]
column represents the hallucinated description constructed by GPT-3.5
Table 19: Cases of generating hallucinatory descriptions. column represents the hallucinated description constructed by GPT-3.5. The red portions in the hallucination captions indicate the hallucinations added by GPT-3.5 based on co-occurring object lists and uncertain object lists. D.3 CASES OF REWRITING C...
-
[36]
Upon comparing the descriptions generated by Revisor with those from the other methods, it becomes evident that Revisor surpasses the others in terms of accuracy and level of detail in describing the image. The description produced by Revisor effectively captures the key elements of the image, such as the presence of a man wearing a white shirt walking...