hub Tool reference

arXiv preprint arXiv:1809.02156 , year=

Object hallucination in image captioning , author= · 2018 · cs.CL · arXiv 1809.02156

Tool reference. 80% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

28 Pith papers citing it

Method reference 80% of classified citations

open full Pith review browse 28 citing papers arXiv PDF

abstract

Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always have lower hallucination and that models which hallucinate more tend to make errors driven by language priors.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 4 background 1

citation-polarity summary

use dataset 4 background 1

representative citing papers

ZINA: Multimodal Fine-grained Hallucination Detection and Editing

cs.CV · 2025-06-16 · unverdicted · novelty 7.0

ZINA detects fine-grained hallucinations in MLLM outputs, classifies errors into six types, and proposes edits, outperforming GPT-4o and Llama-3.2 on the new VisionHall dataset of annotated and synthetic samples.

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

cs.CV · 2024-06-14 · unverdicted · novelty 7.0

Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

cs.CV · 2026-04-28 · conditional · novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

cs.CV · 2026-05-20 · conditional · novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.

Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models

cs.AI · 2026-02-12 · unverdicted · novelty 6.0

REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.

Contextualized Visual Personalization in Vision-Language Models

cs.CV · 2026-02-03 · unverdicted · novelty 6.0 · 2 refs

CoViP is a unified framework for contextualized visual personalization in VLMs that treats personalized image captioning as the core task, applies RL-based post-training and caption-augmented generation, and shows gains on diagnostic evaluations that rule out textual shortcuts plus downstream tasks.

Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

cs.CV · 2025-11-13 · unverdicted · novelty 6.0

RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefics2, InstructBLIP and Qwen2.5-VL with >96% throughput.

Mitigating Object Hallucinations via Sentence-Level Early Intervention

cs.CV · 2025-07-16 · conditional · novelty 6.0

SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

cs.CL · 2024-11-15 · conditional · novelty 6.0

Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

cs.CL · 2023-11-13 · unverdicted · novelty 6.0

AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.

Aligning Large Multimodal Models with Factually Augmented RLHF

cs.CV · 2023-09-25 · conditional · novelty 6.0

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

cs.CV · 2023-06-26 · accept · novelty 6.0

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.

Online Self-Calibration Against Hallucination in Vision-Language Models

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.

Mitigating Multimodal Hallucination via Phase-wise Self-reward

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

cs.CV · 2023-04-20 · conditional · novelty 6.0

MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, creative writing, and instruction following.

Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

cs.CV · 2025-05-27 · unverdicted · novelty 5.0

CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

cs.CV · 2023-10-14 · unverdicted · novelty 5.0

MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.

Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

cs.CV · 2026-04-22 · unverdicted · novelty 5.0

MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation

cs.CV · 2026-04-11 · unverdicted · novelty 5.0

DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

cs.CV · 2024-08-03 · conditional · novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

citing papers explorer

Showing 28 of 28 citing papers.

ZINA: Multimodal Fine-grained Hallucination Detection and Editing cs.CV · 2025-06-16 · unverdicted · none · ref 3 · internal anchor
ZINA detects fine-grained hallucinations in MLLM outputs, classifies errors into six types, and proposes edits, outperforming GPT-4o and Llama-3.2 on the new VisionHall dataset of annotated and synthetic samples.
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models cs.CV · 2024-06-14 · unverdicted · none · ref 19 · internal anchor
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models cs.CV · 2026-04-28 · conditional · none · ref 33
Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models cs.CV · 2026-05-20 · conditional · none · ref 42 · internal anchor
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding cs.CV · 2026-05-15 · unverdicted · none · ref 67 · internal anchor
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models cs.AI · 2026-02-12 · unverdicted · none · ref 6 · internal anchor
REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.
Contextualized Visual Personalization in Vision-Language Models cs.CV · 2026-02-03 · unverdicted · none · ref 1 · 2 links · internal anchor
CoViP is a unified framework for contextualized visual personalization in VLMs that treats personalized image captioning as the core task, applies RL-based post-training and caption-augmented generation, and shows gains on diagnostic evaluations that rule out textual shortcuts plus downstream tasks.
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models cs.CV · 2025-11-13 · unverdicted · none · ref 17 · internal anchor
RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefics2, InstructBLIP and Qwen2.5-VL with >96% throughput.
Mitigating Object Hallucinations via Sentence-Level Early Intervention cs.CV · 2025-07-16 · conditional · none · ref 55 · internal anchor
SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 82 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation cs.CL · 2023-11-13 · unverdicted · none · ref 9 · internal anchor
AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.
Aligning Large Multimodal Models with Factually Augmented RLHF cs.CV · 2023-09-25 · conditional · none · ref 25 · internal anchor
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning cs.CV · 2023-06-26 · accept · none · ref 17 · internal anchor
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement cs.CV · 2026-05-12 · unverdicted · none · ref 42
A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering cs.CV · 2026-05-06 · unverdicted · none · ref 40
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.
Online Self-Calibration Against Hallucination in Vision-Language Models cs.CV · 2026-05-01 · unverdicted · none · ref 23
OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.
Mitigating Multimodal Hallucination via Phase-wise Self-reward cs.CV · 2026-04-20 · unverdicted · none · ref 39
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models cs.CV · 2023-04-20 · conditional · none · ref 14
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, creative writing, and instruction following.
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration cs.CV · 2025-05-27 · unverdicted · none · ref 15 · internal anchor
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning cs.LG · 2024-02-18 · unverdicted · none · ref 170 · internal anchor
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning cs.CV · 2023-10-14 · unverdicted · none · ref 40 · internal anchor
MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation cs.CV · 2026-04-22 · unverdicted · none · ref 137
MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation cs.CV · 2026-04-11 · unverdicted · none · ref 26
DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone cs.CV · 2024-08-03 · conditional · none · ref 85
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 141
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
Agent AI: Surveying the Horizons of Multimodal Interaction cs.AI · 2024-01-07 · unverdicted · none · ref 190 · internal anchor
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
A Survey of Hallucination in Large Foundation Models cs.AI · 2023-09-12 · accept · none · ref 141 · internal anchor
A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.
How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study cs.AI · 2026-05-16 · unreviewed · ref 8 · internal anchor

arXiv preprint arXiv:1809.02156 , year=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer