arXiv preprint arXiv:1809.02156

Tool reference. 80% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

28 Pith papers citing it
Method reference: 80% of classified citations
abstract

Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always have lower hallucination and that models which hallucinate more tend to make errors driven by language priors.

hub tools

citation-role summary

dataset: 4 · background: 1

representative citing papers

ZINA: Multimodal Fine-grained Hallucination Detection and Editing

cs.CV · 2025-06-16 · unverdicted · novelty 7.0

ZINA detects fine-grained hallucinations in MLLM outputs, classifies errors into six types, and proposes edits, outperforming GPT-4o and Llama-3.2 on the new VisionHall dataset of annotated and synthetic samples.

Contextualized Visual Personalization in Vision-Language Models

cs.CV · 2026-02-03 · unverdicted · novelty 6.0 · 2 refs

CoViP is a unified framework for contextualized visual personalization in VLMs that treats personalized image captioning as the core task, applies RL-based post-training and caption-augmented generation, and shows gains on diagnostic evaluations that rule out textual shortcuts plus downstream tasks.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

cs.CV · 2024-08-03 · conditional · novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
