Evaluating Object Hallucination in Large Vision-Language Models

Jinpeng Wang; Ji-Rong Wen; Kun Zhou; Wayne Xin Zhao; Yifan Du; Yifan Li

arxiv: 2305.10355 · v3 · submitted 2023-05-17 · 💻 cs.CV · cs.CL· cs.MM

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li , Yifan Du , Kun Zhou , Jinpeng Wang , Wayne Xin Zhao , Ji-Rong Wen This is my paper

Pith reviewed 2026-05-11 13:38 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.MM

keywords object hallucinationlarge vision-language modelsevaluation methodPOPEvisual instructionsmultimodal generationimage captioning

0 comments

The pith

Large vision-language models often describe objects absent from the given image, especially those frequent in instructions or co-occurring with visible items.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines object hallucination in large vision-language models created by pairing strong language models with vision encoders. Experiments across representative models reveal that these systems generate image descriptions containing objects not present in the actual images. The authors observe that visual instructions bias the output toward objects that appear often in training data or alongside objects that are truly visible. Existing evaluation approaches vary with input phrasing and output style, so the paper introduces POPE, a polling-based query technique that measures hallucination more consistently across models.

Core claim

Large vision-language models suffer from severe object hallucination by generating objects inconsistent with the target images. Objects that frequently occur in the visual instructions or co-occur with the image objects are obviously prone to be hallucinated. Existing evaluation methods might be affected by the input instructions and generation styles of LVLMs, therefore a polling-based query method called POPE evaluates the object hallucination in a more stable and flexible way.

What carries the argument

POPE, a polling-based query method that asks the model yes/no questions about the presence of candidate objects in a fixed polling format to measure hallucination rates.

If this is right

Visual instructions should be designed to minimize exposure to frequent or co-occurring objects to lower hallucination rates.
POPE allows consistent ranking of different LVLMs on hallucination without dependence on their particular generation styles.
Models will continue to favor hallucinated objects that match patterns in their training instructions unless those patterns are altered.
Improved evaluation reveals specific objects most likely to be invented, guiding targeted fixes in training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hallucination may arise when language-model priors about common object co-occurrences override the actual visual signal.
POPE could be extended to probe other hallucination types such as attributes or relations beyond objects.
Widespread use of POPE on new models would let researchers track whether scaling or new training techniques actually reduce the problem.
If polling reveals systematic over-generation of certain object classes, retraining with balanced negative examples might help.

Load-bearing premise

The selected representative LVLMs and visual instruction datasets are sufficiently typical of the broader class of models, and the polling queries in POPE do not introduce new systematic biases in measuring hallucination.

What would settle it

Run POPE and prior evaluation methods on the same set of model outputs, then compare both against human judgments of object presence in the images; if POPE scores remain stable while prior scores shift with instruction wording, the claim holds.

read the original abstract

Inspired by the superior language abilities of large language models (LLM), large vision-language models (LVLM) have been recently explored by integrating powerful LLMs for improving the performance on complex multimodal tasks. Despite the promising progress on LVLMs, we find that LVLMs suffer from the hallucination problem, i.e. they tend to generate objects that are inconsistent with the target images in the descriptions. To investigate it, this work presents the first systematic study on object hallucination of LVLMs. We conduct the evaluation experiments on several representative LVLMs, and show that they mostly suffer from severe object hallucination issue. We further discuss that the visual instructions may influence the hallucination, and find that: objects that frequently occur in the visual instructions or co-occur with the image objects, are obviously prone to be hallucinated by LVLMs. Besides, we find that existing evaluation methods might be affected by the input instructions and generation styles of LVLMs. Thus, we further design an improved evaluation method for object hallucination by proposing a polling-based query method called POPE. Experiment results demonstrate that our POPE can evaluate the object hallucination in a more stable and flexible way. Our codes and data are publicly available at https://github.com/RUCAIBox/POPE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LVLMs hallucinate objects often, especially frequent or co-occurring ones, and POPE polling looks like a practical fix but may not fully match free-form generation behavior.

read the letter

LVLMs hallucinate objects more than we want, and this paper documents that across a few models while linking it to how often things appear in training instructions or alongside real objects in images. They also introduce POPE, which polls the model with yes/no questions about specific objects to get a cleaner read on the issue. The work is new in being the first to focus systematically on object hallucination in these models. It does a decent job showing the problem is widespread and that frequency and co-occurrence matter. Releasing code and data is helpful too, and the idea of using polling to avoid generation-style biases makes sense on the surface. Where it could be stronger is on the claim that POPE is clearly better. The stress test raises a good point: direct questions might trigger different behavior than open-ended descriptions, like yes-bias from instruction tuning. If the paper doesn't have side-by-side comparisons that control for that, the superiority might be overstated. Also, the abstract is light on exact sample sizes, how they chose the objects for polling, and any stats on the differences they report. That leaves the central findings plausible but not fully locked down yet. This paper is for people building or evaluating vision-language models who need practical ways to measure reliability. A reader interested in benchmarks or multimodal safety will find it worth reading. It has enough substance and a clear contribution that it should go to peer review rather than get desk rejected. I'd recommend sending it out for review, with the expectation that reviewers will ask for more details on the POPE validation.

Referee Report

3 major / 3 minor

Summary. The paper conducts the first systematic study of object hallucination in large vision-language models (LVLMs). Experiments on representative LVLMs show severe hallucination, with objects frequent in visual instructions or co-occurring with image objects being especially prone to generation. Existing evaluation methods are critiqued for sensitivity to instructions and generation style, leading to the proposal of POPE, a polling-based yes/no query method claimed to offer more stable and flexible evaluation. Code and data are released publicly.

Significance. If the central empirical patterns and POPE evaluation hold, the work is significant for multimodal AI research: it quantifies a reliability issue in LVLMs that affects downstream tasks such as captioning and VQA, identifies actionable instruction-related biases, and supplies a practical polling protocol plus public resources for reproducible benchmarking. The explicit release of code and data is a clear strength that supports follow-on mitigation studies.

major comments (3)

[§3] §3 (Experiments): The manuscript does not report exact sample sizes (number of images and queries per model/dataset), controls for prompt variation, or statistical tests (e.g., confidence intervals or significance tests) used to establish the 'severe' hallucination rates and frequency/co-occurrence patterns; without these, the quantitative claims cannot be fully verified.
[§4] §4 (POPE): The claim that POPE evaluates the same underlying hallucination phenomenon as the free-form generation experiments rests on an untested premise; no side-by-side correlation analysis or ablation is presented showing that yes/no polling rates reproduce the frequency and co-occurrence effects observed in open-ended outputs, raising the possibility that POPE instead measures query compliance or yes-bias.
[§4.2] §4.2 (Comparison to prior methods): The superiority of POPE over existing metrics is asserted via stability and flexibility, yet the paper provides no quantitative metric (e.g., variance across prompt styles or inter-rater agreement) demonstrating reduced sensitivity; this is load-bearing for the central methodological contribution.

minor comments (3)

[Abstract] Abstract: Key quantitative findings (e.g., hallucination percentages per model) are omitted; adding one or two headline numbers would improve clarity.
Figure captions and tables: Ensure all axes and legends explicitly label hallucination rate versus object frequency or co-occurrence to avoid ambiguity in interpreting the reported patterns.
Notation: Define 'visual instructions' and 'co-occurrence' operationally in the main text on first use, as these terms are central to the claimed patterns.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our empirical findings and the justification for POPE. We address each major comment below and commit to revisions that strengthen the quantitative rigor and validation of our claims without altering the core contributions.

read point-by-point responses

Referee: [§3] §3 (Experiments): The manuscript does not report exact sample sizes (number of images and queries per model/dataset), controls for prompt variation, or statistical tests (e.g., confidence intervals or significance tests) used to establish the 'severe' hallucination rates and frequency/co-occurrence patterns; without these, the quantitative claims cannot be fully verified.

Authors: We agree that explicit reporting of sample sizes, prompt controls, and uncertainty estimates is necessary for verifiability. The experiments used the full COCO val2014 set (approximately 40k images) for the primary frequency and co-occurrence analyses across models, with 5k-image subsets for efficiency in some LVLM evaluations and 1k random samples for instruction-variation ablations; all queries per image were fixed to the same template to control prompt variation. In the revision we will add a dedicated table listing exact image counts and query counts per model and dataset, state that a single fixed prompt template was used across all models for the main results, and report bootstrap 95% confidence intervals on the reported hallucination percentages and co-occurrence correlations to substantiate the severity claims. revision: yes
Referee: [§4] §4 (POPE): The claim that POPE evaluates the same underlying hallucination phenomenon as the free-form generation experiments rests on an untested premise; no side-by-side correlation analysis or ablation is presented showing that yes/no polling rates reproduce the frequency and co-occurrence effects observed in open-ended outputs, raising the possibility that POPE instead measures query compliance or yes-bias.

Authors: The design of POPE deliberately decouples object hallucination measurement from open-ended generation style by using balanced yes/no queries, but we acknowledge that we did not present a direct quantitative link showing that the same frequency and co-occurrence biases appear under POPE. In the revised manuscript we will add a new analysis that computes per-object hallucination rates under both free-form captioning and POPE on the same set of images and models, reports Pearson correlations between the two, and includes an ablation that balances positive/negative object queries to quantify any yes-bias. This will either confirm that POPE reproduces the key patterns or allow us to clarify the precise relationship between the two evaluation regimes. revision: yes
Referee: [§4.2] §4.2 (Comparison to prior methods): The superiority of POPE over existing metrics is asserted via stability and flexibility, yet the paper provides no quantitative metric (e.g., variance across prompt styles or inter-rater agreement) demonstrating reduced sensitivity; this is load-bearing for the central methodological contribution.

Authors: We presented qualitative evidence that prior metrics vary with instruction phrasing and generation length while POPE remains consistent, but we did not supply a quantitative stability metric such as variance across prompt variants. We will revise §4.2 to include a controlled ablation that applies five distinct prompt phrasings to the same images and models, computes the standard deviation of hallucination rates for POPE versus CHAIR and other baselines, and reports these variance numbers together with the original stability claims. This will provide the requested quantitative support for reduced sensitivity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper performs direct empirical evaluations of object hallucination on representative LVLMs using public datasets and existing methods, then proposes POPE as a polling-based alternative. No derivations, fitted parameters, or predictions reduce by construction to the paper's own inputs or self-citations. Claims about hallucination severity, frequency effects, and POPE's stability rest on experimental comparisons rather than self-referential definitions or load-bearing self-citations. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that object hallucination can be reliably detected via object presence checks and that the tested models and instructions represent the general behavior of LVLMs.

axioms (1)

domain assumption Object hallucination is defined as generating objects inconsistent with the target images in the descriptions.
This definition underpins all evaluation experiments and the design of POPE.

pith-pipeline@v0.9.0 · 5546 in / 1208 out tokens · 52159 ms · 2026-05-11T13:38:09.947713+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals
cs.CV 2026-05 unverdicted novelty 7.0

SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
cs.CL 2026-05 conditional novelty 7.0

AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
cs.AI 2026-05 unverdicted novelty 7.0

Evidence-carrying multimodal agents decompose tool calls into predicates verified by constrained DOM/OCR/AX checkers to block hallucination-enabled unsafe actions.
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
cs.CV 2026-05 unverdicted novelty 7.0

UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
cs.CV 2026-05 unverdicted novelty 7.0

CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
cs.CV 2026-05 unverdicted novelty 7.0

Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
cs.CV 2026-04 conditional novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
Improving Vision-language Models with Perception-centric Process Reward Models
cs.CV 2026-04 unverdicted novelty 7.0

Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
cs.CR 2026-04 unverdicted novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
cs.CV 2026-04 conditional novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
cs.CV 2026-04 unverdicted novelty 7.0

ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
cs.CV 2025-11 unverdicted novelty 7.0

AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
cs.CL 2025-09 conditional novelty 7.0

V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM perf...
MMSearch-R1: Incentivizing LMMs to Search
cs.CV 2025-06 unverdicted novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
VGR: Visual Grounded Reasoning
cs.CV 2025-06 unverdicted novelty 7.0

VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
cs.CV 2025-03 unverdicted novelty 7.0

DualToken disentangles semantics and appearance via separate codebooks in one tokenizer, reporting 0.25 rFID, 82% ImageNet zero-shot accuracy, and gains over VILA-U on understanding and generation benchmarks.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
cs.CV 2024-10 unverdicted novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
cs.CV 2024-06 unverdicted novelty 7.0

Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
cs.CV 2023-10 unverdicted novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
cs.CV 2026-05 conditional novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
cs.AI 2026-05 unverdicted novelty 6.0

Evidence-carrying multimodal agents decompose tool calls into predicates, obtain certificates from DOM/OCR/AX verifiers, and use a deterministic gate to authorize actions only when certificates support them, achieving...
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.
Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
cs.CV 2026-05 unverdicted novelty 6.0

A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs
cs.CV 2026-05 unverdicted novelty 6.0

LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% ac...
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
cs.MM 2026-05 unverdicted novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
cs.CV 2026-05 unverdicted novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
Online Self-Calibration Against Hallucination in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...
X2SAM: Any Segmentation in Images and Videos
cs.CV 2026-04 unverdicted novelty 6.0

X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.
Mitigating Multimodal Hallucination via Phase-wise Self-reward
cs.CV 2026-04 unverdicted novelty 6.0

PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
cs.CV 2026-04 unverdicted novelty 6.0

CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
cs.AI 2026-04 unverdicted novelty 6.0

EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

HaloProbe introduces a Bayesian method to estimate token-level hallucination probabilities in VLMs by factorizing external and internal signals, enabling more effective mitigation than intervention-based techniques wh...
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
cs.CV 2026-04 unverdicted novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
cs.CV 2026-03 unverdicted novelty 6.0

69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.
Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation
cs.CV 2026-03 unverdicted novelty 6.0

LTS-FS locates hallucination-relevant layers in LVLMs via causal attribution on a constructed dataset and applies sparse layerwise feature steering to mitigate hallucinations while preserving general task performance.
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
cs.AI 2026-02 unverdicted novelty 6.0

REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
cs.CV 2026-02 unverdicted novelty 6.0

ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.
When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents
cs.MA 2026-01 unverdicted novelty 6.0

LLM agents exhibit emergent covert numerical coordination in canonical game settings under restricted or absent communication, shaping strategic outcomes.
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
cs.CV 2025-12 unverdicted novelty 6.0

Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
cs.CV 2025-11 unverdicted novelty 6.0

RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefic...
ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
cs.CV 2025-09 unverdicted novelty 6.0

ORCA is an inference-time agentic framework that boosts LVLM accuracy on hallucination benchmarks by 3.64-40.67% and adds adversarial robustness via cross-model validation with small vision tools.
ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
cs.CV 2025-09 unverdicted novelty 6.0

ORCA is an agentic reasoning framework that enhances factual accuracy and adversarial robustness of pretrained LVLMs via an Observe-Reason-Critique-Act loop with small vision models, reporting accuracy gains of up to ...
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
cs.CV 2025-07 unverdicted novelty 6.0

Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
cs.CV 2025-05 unverdicted novelty 6.0

Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
cs.CV 2025-05 unverdicted novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...
Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models
cs.CV 2024-11 unverdicted novelty 6.0

Presents LLaVA-AlignedVQ, an edge-cloud VQA system with AlignedVQ that delivers 1365x feature compression, 96.8% lower transmission than JPEG90, 2-15x speedup, and accuracy within -2.23% to +1.6% of the baseline acros...
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
cs.CV 2024-04 unverdicted novelty 6.0

SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV 2024-03 unverdicted novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
cs.CV 2023-11 accept novelty 6.0

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 92 Pith papers · 12 internal anchors

[1]

Harsh Agrawal, Peter Anderson, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. https://doi.org/10.1109/ICCV.2019.00904 nocaps: novel object captioning at scale . In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019 , pa...

work page doi:10.1109/iccv.2019.00904 2019
[2]

Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Kar \' e n Simonyan

Jean - Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Bin...

work page 2022
[3]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015 a . VQA: visual question answering. In ICCV , pages 2425--2433. IEEE Computer Society

work page 2015
[4]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015 b . https://doi.org/10.1109/ICCV.2015.279 VQA: visual question answering . In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015 , pages 2425--2433. IEEE Computer Society

work page doi:10.1109/iccv.2015.279 2015
[5]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. CoRR, abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. https://doi.org/10.48550/arXiv.2302.04023 A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity . CoRR, abs/2302.04023

work page internal anchor Pith review doi:10.48550/arxiv.2302.04023 2023
[7]

Ali Furkan Biten, Llu \' s G \' o mez, and Dimosthenis Karatzas. 2022. https://doi.org/10.1109/WACV51458.2022.00253 Let there be a clock on the beach: Reducing object hallucination in image captioning . In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022 , pages 2473--2482. IEEE

work page doi:10.1109/wacv51458.2022.00253 2022
[8]

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023. Shikra: Unleashing multimodal llm's referential dialogue magic. CoRR, abs/2306.15195

work page internal anchor Pith review arXiv 2023
[9]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. https://lmsys.org/blog/2023-03-30-vicuna/ Vicuna: An open-source chatbot impressing gpt-4 with 90\

work page 2023
[10]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023 a . Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500

work page internal anchor Pith review arXiv 2023
[11]

Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, and Pascale Fung. 2023 b . https://aclanthology.org/2023.eacl-main.156 Plausible may not be faithful: Probing object hallucination in vision-language pre-training . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-...

work page 2023
[12]

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. 2022. https://doi.org/10.1561/0600000105 Vision-language pre-training: Basics, recent advances, and future trends . Found. Trends Comput. Graph. Vis., 14(3-4):163--352

work page doi:10.1561/0600000105 2022
[13]

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. 2023. https://doi.org/10.48550/arXiv.2304.15010 Llama-adapter V2: parameter-efficient visual instruction model . CoRR, abs/2304.15010

work page internal anchor Pith review doi:10.48550/arxiv.2304.15010 2023
[14]

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790

work page arXiv 2023
[15]

Yash Goyal, Tejas Khot, Douglas Summers - Stay, Dhruv Batra, and Devi Parikh. 2017. https://doi.org/10.1109/CVPR.2017.670 Making the V in VQA matter: Elevating the role of image understanding in visual question answering . In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 , pages 6325--6334....

work page doi:10.1109/cvpr.2017.670 2017
[16]

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2015. Framing image description as a ranking task: Data, models and evaluation metrics (extended abstract). In IJCAI , pages 4188--4192. AAAI Press

work page 2015
[17]

Yi - Chong Huang, Xia - Chong Feng, Xiao - Cheng Feng, and Bing Qin. 2021. http://arxiv.org/abs/2104.14839 The factual inconsistency problem in abstractive text summarization: A survey . CoRR, abs/2104.14839

work page arXiv 2021
[18]

In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Drew A. Hudson and Christopher D. Manning. 2019. https://doi.org/10.1109/CVPR.2019.00686 GQA: A new dataset for real-world visual reasoning and compositional question answering . In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 , pages 6700--6709. Computer Vision Foundation / IEEE

work page doi:10.1109/cvpr.2019.00686 2019
[19]

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. http://arxiv.org/abs/2202.03629 Survey of hallucination in natural language generation . CoRR, abs/2202.03629

work page arXiv 2022
[20]

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023 a . https://doi.org/10.48550/arXiv.2305.03726 Otter: A multi-modal model with in-context instruction tuning . CoRR, abs/2305.03726

work page internal anchor Pith review doi:10.48550/arxiv.2305.03726 2023
[21]

Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. 2022 a . https://doi.org/10.48550/arXiv.2209.09019 LAVIS: A library for language-vision intelligence . CoRR, abs/2209.09019

work page doi:10.48550/arxiv.2209.09019 2022
[22]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023 b . https://doi.org/10.48550/arXiv.2301.12597 BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models . CoRR, abs/2301.12597

work page internal anchor Pith review doi:10.48550/arxiv.2301.12597 2023
[23]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022 b . https://proceedings.mlr.press/v162/li22n.html BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation . In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA , volume 162 of Proceedings of Mac...

work page 2022
[24]

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. https://doi.org/10.1007/978-3-030-58577-8\_8 Oscar: Object-semantics aligned pre-training for vision-language tasks . In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2...

work page doi:10.1007/978-3-030-58577-8 2020
[25]

In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T

Tsung - Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \' a r, and C. Lawrence Zitnick. 2014. https://doi.org/10.1007/978-3-319-10602-1\_48 Microsoft COCO: common objects in context . In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V , vo...

work page doi:10.1007/978-3-319-10602-1 2014
[26]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. https://doi.org/10.48550/arXiv.2304.08485 Visual instruction tuning . CoRR, abs/2304.08485

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.08485 2023
[27]

Bennett, Meredith Ringel Morris, and Edward Cutrell

Haley MacLeod, Cynthia L. Bennett, Meredith Ringel Morris, and Edward Cutrell. 2017. https://doi.org/10.1145/3025453.3025814 Understanding blind people's experiences with computer-generated captions of social media images . In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, May 06-11, 2017 , pages 5988--5999. ACM

work page doi:10.1145/3025453.3025814 2017
[28]

OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report . CoRR, abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[29]

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In NIPS , pages 1143--1151

work page 2011
[30]

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. https://doi.org/10.18653/v1/d18-1437 Object hallucination in image captioning . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4035--4045. Association for Computational...

work page doi:10.18653/v1/d18-1437 2018
[31]

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. https://doi.org/10.1007/978-3-031-20074-8\_9 A-OKVQA: A benchmark for visual question answering using world knowledge . In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part VIII , volume 13668 of ...

work page doi:10.1007/978-3-031-20074-8 2022
[32]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL (1) , pages 2556--2565. Association for Computational Linguistics

work page 2018
[33]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie - Anne Lachaux, Timoth \' e e Lacroix, Baptiste Rozi \` e re, Naman Goyal, Eric Hambro, Faisal Azhar, Aur \' e lien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. https://proceedings.mlr.press/v162/wang22al.html OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework . In International Conference on Machine Learning, ICML 2022, 17-23 July ...

work page 2022
[35]

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. 2023. https://doi.org/10.48550/arXiv.2304.14178 mplug-owl: Modularization empowers large language models with multimodality . CoRR, abs/2304.14178

work page Pith review doi:10.48550/arxiv.2304.14178 2023
[36]

Peng Zhang, Yash Goyal, Douglas Summers - Stay, Dhruv Batra, and Devi Parikh. 2016. Yin and yang: Balancing and answering binary visual questions. In CVPR , pages 5014--5022. IEEE Computer Society

work page 2016
[37]

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. https://doi.org/10.1109/CVPR46437.2021.00553 Vinvl: Revisiting visual representations in vision-language models . In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021 , pages 5579--5588. Computer ...

work page doi:10.1109/cvpr46437.2021.00553 2021
[38]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian - Yun Nie, and Ji - Rong Wen. 2023. https://doi.org/10.48550/arXiv.2303.18223 A survey of large langua...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.18223 2023
[39]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. https://doi.org/10.48550/arXiv.2304.10592 Minigpt-4: Enhancing vision-language understanding with advanced large language models . CoRR, abs/2304.10592

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.10592 2023
[40]

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. 2023. https://doi.org/10.48550/arXiv.2304.06718 Segment everything everywhere all at once . CoRR, abs/2304.06718

work page doi:10.48550/arxiv.2304.06718 2023

[1] [1]

Harsh Agrawal, Peter Anderson, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. https://doi.org/10.1109/ICCV.2019.00904 nocaps: novel object captioning at scale . In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019 , pa...

work page doi:10.1109/iccv.2019.00904 2019

[2] [2]

Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Kar \' e n Simonyan

Jean - Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Bin...

work page 2022

[3] [3]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015 a . VQA: visual question answering. In ICCV , pages 2425--2433. IEEE Computer Society

work page 2015

[4] [4]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015 b . https://doi.org/10.1109/ICCV.2015.279 VQA: visual question answering . In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015 , pages 2425--2433. IEEE Computer Society

work page doi:10.1109/iccv.2015.279 2015

[5] [5]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. CoRR, abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. https://doi.org/10.48550/arXiv.2302.04023 A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity . CoRR, abs/2302.04023

work page internal anchor Pith review doi:10.48550/arxiv.2302.04023 2023

[7] [7]

Ali Furkan Biten, Llu \' s G \' o mez, and Dimosthenis Karatzas. 2022. https://doi.org/10.1109/WACV51458.2022.00253 Let there be a clock on the beach: Reducing object hallucination in image captioning . In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022 , pages 2473--2482. IEEE

work page doi:10.1109/wacv51458.2022.00253 2022

[8] [8]

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023. Shikra: Unleashing multimodal llm's referential dialogue magic. CoRR, abs/2306.15195

work page internal anchor Pith review arXiv 2023

[9] [9]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. https://lmsys.org/blog/2023-03-30-vicuna/ Vicuna: An open-source chatbot impressing gpt-4 with 90\

work page 2023

[10] [10]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023 a . Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500

work page internal anchor Pith review arXiv 2023

[11] [11]

Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, and Pascale Fung. 2023 b . https://aclanthology.org/2023.eacl-main.156 Plausible may not be faithful: Probing object hallucination in vision-language pre-training . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-...

work page 2023

[12] [12]

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. 2022. https://doi.org/10.1561/0600000105 Vision-language pre-training: Basics, recent advances, and future trends . Found. Trends Comput. Graph. Vis., 14(3-4):163--352

work page doi:10.1561/0600000105 2022

[13] [13]

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. 2023. https://doi.org/10.48550/arXiv.2304.15010 Llama-adapter V2: parameter-efficient visual instruction model . CoRR, abs/2304.15010

work page internal anchor Pith review doi:10.48550/arxiv.2304.15010 2023

[14] [14]

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790

work page arXiv 2023

[15] [15]

Yash Goyal, Tejas Khot, Douglas Summers - Stay, Dhruv Batra, and Devi Parikh. 2017. https://doi.org/10.1109/CVPR.2017.670 Making the V in VQA matter: Elevating the role of image understanding in visual question answering . In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 , pages 6325--6334....

work page doi:10.1109/cvpr.2017.670 2017

[16] [16]

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2015. Framing image description as a ranking task: Data, models and evaluation metrics (extended abstract). In IJCAI , pages 4188--4192. AAAI Press

work page 2015

[17] [17]

Yi - Chong Huang, Xia - Chong Feng, Xiao - Cheng Feng, and Bing Qin. 2021. http://arxiv.org/abs/2104.14839 The factual inconsistency problem in abstractive text summarization: A survey . CoRR, abs/2104.14839

work page arXiv 2021

[18] [18]

In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Drew A. Hudson and Christopher D. Manning. 2019. https://doi.org/10.1109/CVPR.2019.00686 GQA: A new dataset for real-world visual reasoning and compositional question answering . In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 , pages 6700--6709. Computer Vision Foundation / IEEE

work page doi:10.1109/cvpr.2019.00686 2019

[19] [19]

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. http://arxiv.org/abs/2202.03629 Survey of hallucination in natural language generation . CoRR, abs/2202.03629

work page arXiv 2022

[20] [20]

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023 a . https://doi.org/10.48550/arXiv.2305.03726 Otter: A multi-modal model with in-context instruction tuning . CoRR, abs/2305.03726

work page internal anchor Pith review doi:10.48550/arxiv.2305.03726 2023

[21] [21]

Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. 2022 a . https://doi.org/10.48550/arXiv.2209.09019 LAVIS: A library for language-vision intelligence . CoRR, abs/2209.09019

work page doi:10.48550/arxiv.2209.09019 2022

[22] [22]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023 b . https://doi.org/10.48550/arXiv.2301.12597 BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models . CoRR, abs/2301.12597

work page internal anchor Pith review doi:10.48550/arxiv.2301.12597 2023

[23] [23]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022 b . https://proceedings.mlr.press/v162/li22n.html BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation . In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA , volume 162 of Proceedings of Mac...

work page 2022

[24] [24]

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. https://doi.org/10.1007/978-3-030-58577-8\_8 Oscar: Object-semantics aligned pre-training for vision-language tasks . In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2...

work page doi:10.1007/978-3-030-58577-8 2020

[25] [25]

In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T

Tsung - Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \' a r, and C. Lawrence Zitnick. 2014. https://doi.org/10.1007/978-3-319-10602-1\_48 Microsoft COCO: common objects in context . In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V , vo...

work page doi:10.1007/978-3-319-10602-1 2014

[26] [26]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. https://doi.org/10.48550/arXiv.2304.08485 Visual instruction tuning . CoRR, abs/2304.08485

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.08485 2023

[27] [27]

Bennett, Meredith Ringel Morris, and Edward Cutrell

Haley MacLeod, Cynthia L. Bennett, Meredith Ringel Morris, and Edward Cutrell. 2017. https://doi.org/10.1145/3025453.3025814 Understanding blind people's experiences with computer-generated captions of social media images . In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, May 06-11, 2017 , pages 5988--5999. ACM

work page doi:10.1145/3025453.3025814 2017

[28] [28]

OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report . CoRR, abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023

[29] [29]

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In NIPS , pages 1143--1151

work page 2011

[30] [30]

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. https://doi.org/10.18653/v1/d18-1437 Object hallucination in image captioning . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4035--4045. Association for Computational...

work page doi:10.18653/v1/d18-1437 2018

[31] [31]

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. https://doi.org/10.1007/978-3-031-20074-8\_9 A-OKVQA: A benchmark for visual question answering using world knowledge . In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part VIII , volume 13668 of ...

work page doi:10.1007/978-3-031-20074-8 2022

[32] [32]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL (1) , pages 2556--2565. Association for Computational Linguistics

work page 2018

[33] [33]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie - Anne Lachaux, Timoth \' e e Lacroix, Baptiste Rozi \` e re, Naman Goyal, Eric Hambro, Faisal Azhar, Aur \' e lien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. https://proceedings.mlr.press/v162/wang22al.html OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework . In International Conference on Machine Learning, ICML 2022, 17-23 July ...

work page 2022

[35] [35]

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. 2023. https://doi.org/10.48550/arXiv.2304.14178 mplug-owl: Modularization empowers large language models with multimodality . CoRR, abs/2304.14178

work page Pith review doi:10.48550/arxiv.2304.14178 2023

[36] [36]

Peng Zhang, Yash Goyal, Douglas Summers - Stay, Dhruv Batra, and Devi Parikh. 2016. Yin and yang: Balancing and answering binary visual questions. In CVPR , pages 5014--5022. IEEE Computer Society

work page 2016

[37] [37]

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. https://doi.org/10.1109/CVPR46437.2021.00553 Vinvl: Revisiting visual representations in vision-language models . In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021 , pages 5579--5588. Computer ...

work page doi:10.1109/cvpr46437.2021.00553 2021

[38] [38]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian - Yun Nie, and Ji - Rong Wen. 2023. https://doi.org/10.48550/arXiv.2303.18223 A survey of large langua...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.18223 2023

[39] [39]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. https://doi.org/10.48550/arXiv.2304.10592 Minigpt-4: Enhancing vision-language understanding with advanced large language models . CoRR, abs/2304.10592

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.10592 2023

[40] [40]

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. 2023. https://doi.org/10.48550/arXiv.2304.06718 Segment everything everywhere all at once . CoRR, abs/2304.06718

work page doi:10.48550/arxiv.2304.06718 2023