pith. machine review for the scientific record.

arxiv: 2503.10183 · v4 · submitted 2025-03-13 · 💻 cs.CV · cs.AI

Recognition: unknown

Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

Authors on Pith: no claims yet
classification: 💻 cs.CV · cs.AI
keywords: visual · decoding · hallucination · details · during · fine-grained · input · language
Original abstract

Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. Efforts to address this issue without model finetuning primarily mitigate hallucination by contrastively reducing language biases or amplifying the weights of visual embeddings during decoding. However, these approaches remain limited in their ability to capture fine-grained visual details. In this work, we propose the Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens based on attention and magnifies the corresponding regions, spurring the model to concentrate on fine-grained visual details during decoding. By magnifying critical regions while preserving the structural and contextual information at each decoding step, PM allows the VLM to enhance its scrutiny of the visual input, hence producing more accurate and faithful responses. Extensive experimental results demonstrate that PM not only achieves superior hallucination mitigation but also enhances language generation while preserving strong reasoning capabilities. Code can be found at https://github.com/ShunqiM/PM.
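The mechanism the abstract describes — isolating the image patches that receive the most attention and magnifying that region before the next decoding step — can be illustrated with a short sketch. The code below is a minimal, self-contained PyTorch illustration under assumed interfaces: the function name `magnify_attended_region` and its parameters (`grid`, `keep_frac`, `scale`) are hypothetical, the per-patch attention vector is taken as given, and the actual method (including how the magnified view is fed back into the VLM at each decoding step) is in the authors' repository linked above.

```python
# Minimal sketch (hypothetical names, not the PM codebase): given per-patch
# attention mass, crop the most-attended region and upsample ("magnify") it.
import torch
import torch.nn.functional as F

def magnify_attended_region(image, patch_attn, grid=24, keep_frac=0.25, scale=2.0):
    """image: (3, H, W) tensor; patch_attn: (grid*grid,) attention per patch."""
    _, H, W = image.shape
    attn_map = patch_attn.reshape(grid, grid)          # 2D map over the patch grid
    k = max(1, int(keep_frac * grid * grid))           # how many patches to keep
    top = torch.topk(attn_map.flatten(), k).indices    # most-attended patches
    rows, cols = top // grid, top % grid
    # Bounding box (in patch units) covering the top-attended patches.
    r0, r1 = rows.min().item(), rows.max().item() + 1
    c0, c1 = cols.min().item(), cols.max().item() + 1
    ph, pw = H // grid, W // grid                      # pixels per patch
    crop = image[:, r0 * ph:r1 * ph, c0 * pw:c1 * pw]
    # Magnify the crop; PM would re-encode this view (alongside the original
    # structural context) at the next decoding step.
    return F.interpolate(crop.unsqueeze(0), scale_factor=scale,
                         mode="bicubic", align_corners=False).squeeze(0)

# Toy usage with random inputs (336x336 image, 24x24 patch grid):
img = torch.rand(3, 336, 336)
attn = torch.rand(24 * 24).softmax(dim=0)
print(magnify_attended_region(img, attn).shape)
```

With a 336×336 input and a 24×24 patch grid (typical of CLIP-style encoders), the crop covering the top quarter of attention mass is upscaled 2×, trading field of view for detail in the attended region — one plausible reading of the abstract's "magnifying critical regions".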

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV · 2026-03 · unverdicted · novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.

  2. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV · 2024-04 · accept · novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.