Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding

Beomsik Cho; Jaehyung Kim

arxiv: 2506.09522 · v3 · pith:BEF2TMCEnew · submitted 2025-06-11 · 💻 cs.CV · cs.AI · cs.CL


Pith reviewed 2026-05-19 09:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL

keywords vision tokens, LVLMs, decoding, visual semantics, hallucinations, training-free method, multimodal generation


The pith

Vision tokens encode usable semantics that project into text space to steer LVLM decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that vision tokens in large vision-language models retain meaningful visual details even when the model hallucinates. These details sit in the same representational space as text and can be surfaced by limiting the vocabulary during projection. The authors build a decoding procedure that chooses the single most relevant vision token at each generation step and uses its projection to adjust the next-token probabilities. The procedure needs no training and keeps the output more faithful to the image. If correct, the approach would let existing models produce more accurate answers across standard multimodal tasks at up to half the compute cost of state-of-the-art decoding baselines.

Core claim

Vision tokens provide meaningful visual information even when hallucinations occur, and their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. ReVisiT exploits this fact by projecting the selected vision token into the text token distribution and using the resulting distribution to refine the model's output at every decoding step.

What carries the argument

ReVisiT, the training-free procedure that selects the most relevant vision token at each step through context-aware constrained divergence minimization and projects it to adjust the language-model output distribution.
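
The selection-and-refinement loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the reviewed text names Jensen-Shannon divergence as the selection criterion, but the top-k vocabulary constraint, the product-and-renormalise refinement rule, and every name used here (`revisit_step`, `base_logits`, `vision_logits`, `top_k`) are assumptions made purely for illustration. `vision_logits` is assumed to hold each vision token's hidden state already passed through the language-model head.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def revisit_step(base_logits, vision_logits, top_k=3):
    """One illustrative decoding step; returns (token ids, chosen vision index, refined probs)."""
    # (1) Constrain the vocabulary. ASSUMPTION: keep the top_k tokens of the base output.
    order = sorted(range(len(base_logits)), key=lambda v: base_logits[v], reverse=True)
    cons = order[:top_k]
    base_cons = softmax([base_logits[v] for v in cons])

    # (2) Project every vision token onto the constrained vocabulary and
    #     select the one whose distribution has the smallest JSD to the base.
    vision_cons = [softmax([row[v] for v in cons]) for row in vision_logits]
    best = min(range(len(vision_cons)), key=lambda i: jsd(base_cons, vision_cons[i]))

    # (3) Refine the output. ASSUMPTION: elementwise product, then renormalise.
    prod = [p * q for p, q in zip(base_cons, vision_cons[best])]
    z = sum(prod)
    refined = [x / z for x in prod]
    return cons, best, refined


# Toy logits over a 4-token vocabulary and two vision tokens (made-up numbers).
base_logits = [2.0, 1.9, 0.0, -1.0]
vision_logits = [[0.0, 3.0, 0.0, 5.0], [2.0, 1.8, 0.1, 9.0]]
cons, best, refined = revisit_step(base_logits, vision_logits, top_k=2)
print(cons, best, refined)
```

With these toy logits the constrained vocabulary is tokens 0 and 1, and the second vision token is selected because its constrained distribution is the closer of the two to the base distribution. Restricting to a small candidate set before comparing distributions is what keeps a large logit on an unrelated token (index 3 here) from dominating the comparison.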

If this is right

Text generated by the model aligns more closely with the visual input on standard multimodal benchmarks.
Decoding costs as little as half the compute of current state-of-the-art decoding methods while matching or exceeding their accuracy.
No additional training is required, so the method can be applied directly to already-deployed LVLMs.
Hallucinations decrease because the output distribution is explicitly pulled toward the visual evidence at each step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same projection idea could be tested on other multimodal generators that mix discrete tokens from different modalities.
If vision tokens already carry the needed semantics, future model designs might reduce the number of vision tokens without losing performance.
The approach raises the question of whether similar constrained projections would help in purely language models that have access to external knowledge tokens.

Load-bearing premise

Vision token semantics are already encoded in textual space and become explicit enough under vocabulary constraints that their projection can guide decoding without introducing new errors.

What would settle it

A controlled run on the same five benchmarks in which the constrained projection step is added yet accuracy does not rise, or hallucination rates stay flat or worsen, relative to the unmodified baseline.

Figures

Figures reproduced from arXiv: 2506.09522 by Beomsik Cho, Jaehyung Kim.

**Figure 1.** An overview of ReVisiT. Given an input image and text prompt, the LVLM first encodes the image into vision tokens through a vision encoder and a cross-modal projector. ReVisiT re-purposes these vision tokens as reference informers to guide the text generation process. At each decoding step, ReVisiT (1) constrains the vocabulary $V$ to $V^t_{\text{cons}}$, (2) projects vision token embeddings into $V^t_{\text{cons}}$ and selects m…

**Figure 2.** Motivation of ReVisiT. We qualitatively analyzed various vision tokens. Dotted arrows represent vision token projection over the specified vocabulary set. For each box, representing a text token distribution, we annotated the top-5 most probable text tokens. The left part illustrates the effectiveness of the vocabulary constraint, whereas the right part shows the distribution shift during ReVisiT. See Appendix C.1 for a detailed dis…

**Figure 3.** Inference speed. Comparison of per-token inference latency across different decoding strategies for LLaVA1.5-7B (left y-axis) and Qwen2.5-VL-7B (right y-axis), with standard deviations visualized as error bars. Inference speed improvement. To evaluate the inference efficiency of ReVisiT compared to baseline decoding strategies, we measure the per-token computational time. All measurements are conducted …

**Figure 4.** Qualitative example. The input image is a cartoon-style illustration contrasting classical statistical learning and neural network reasoning via a visual metaphor, emphasizing the shift from theoretical rigor to the heuristic of “stacking more layers.” We compare the generated responses of vanilla greedy decoding, M3ID, and ReVisiT, highlighting how ReVisiT better captures the intended visual analogy compa…

**Figure 5.** Without vocabulary subset case study. Qualitative case study from Qwen2.5-VL-7B. “w/o subset” refers to the ablation result without the vocabulary subset constraint, whereas “w/ subset” refers to our proposed ReVisiT.

**Figure 6.** Additional qualitative example. The input image is an illustration showing a bear, a cat, and a rabbit seated around a table with a plate of donuts. We compare the responses of vanilla greedy decoding and ReVisiT to the question, “What are the animals in the painting and what are they doing?” While the greedy output introduces a hallucinated detail (“cookie”) and assigns actions not visually supported (e.g.…

Original abstract

Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that guides text generation in LVLMs by Referencing Vision Tokens. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization. Then, ReVisiT uses its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ReVisiT, a training-free decoding method for Large Vision-Language Models (LVLMs). It rests on two observations: vision tokens retain meaningful visual information even when hallucinations occur, and their semantics are encoded in textual space, becoming explicit under appropriate vocabulary constraints. The method dynamically selects the most relevant vision token at each step via context-aware constrained divergence minimization and projects it to refine the output text distribution. Across five benchmarks on recent LVLMs, ReVisiT is reported to match or exceed state-of-the-art decoding baselines while reducing computational cost by up to 2×.

Significance. If the experimental support holds, the work offers a lightweight, immediately deployable technique for better exploiting existing vision tokens during LVLM inference. The training-free design and reported efficiency gains address practical concerns around hallucination and compute in multimodal systems, and the underlying observations about vision-token semantics could stimulate further analysis of internal representations.

major comments (2)

[Method] Method section: the precise formulation of the context-aware constrained divergence minimization (including how vocabulary constraints are defined and applied to the projection) is not provided with equations or pseudocode. This detail is load-bearing for verifying that the projection transfers visual semantics without net error increase or unintended bias in the refined distribution.
[Experiments] Experiments section: the abstract and results claim competitive or superior performance with up to 2× cost reduction, yet the manuscript lacks reported statistical significance tests, exact baseline re-implementations, and ablation studies isolating the projection step. These omissions undermine confidence that the gains are robust rather than dependent on particular post-hoc choices.

minor comments (2)

[Abstract] Abstract: the claim of 'up to 2×' computational cost reduction should specify the exact models, benchmarks, and measurement (e.g., FLOPs vs. wall-clock time) under which this holds.
[Method] Notation: ensure consistent use of symbols for vision-token embeddings versus text-token distributions throughout the method description.
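
One way to make the cost measurement raised in the abstract comment concrete is to report wall-clock latency per generated token together with its spread, which is the quantity Figure 3 plots. The sketch below is a hypothetical harness, not the paper's measurement code: `per_token_latency` and `decode_step` are illustrative names, and `decode_step` stands in for whatever callable produces one token under a given decoding strategy.

```python
import statistics
import time


def per_token_latency(decode_step, num_tokens, warmup=2):
    """Return (mean, standard deviation) of wall-clock seconds per decoding step."""
    if num_tokens < 2:
        raise ValueError("need at least two timed steps to compute a standard deviation")

    # Untimed warm-up calls so one-off setup costs do not skew the samples.
    for _ in range(warmup):
        decode_step()

    samples = []
    for _ in range(num_tokens):
        start = time.perf_counter()
        decode_step()
        samples.append(time.perf_counter() - start)

    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; a real run would call the model's single-token decode here.
mean_s, std_s = per_token_latency(lambda: sum(range(10_000)), num_tokens=20)
print(f"{mean_s * 1e3:.3f} ms/token (std {std_s * 1e3:.3f} ms)")
```

Wall-clock time and FLOPs can diverge, so a report would ideally state which one the "up to 2×" figure refers to. On a GPU the timed callable would also need to block until the device has finished its work, otherwise the timer only captures kernel launch time.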

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements for greater clarity and rigor.


Referee: [Method] Method section: the precise formulation of the context-aware constrained divergence minimization (including how vocabulary constraints are defined and applied to the projection) is not provided with equations or pseudocode. This detail is load-bearing for verifying that the projection transfers visual semantics without net error increase or unintended bias in the refined distribution.

Authors: We appreciate the referee highlighting this gap. While the manuscript describes the high-level approach of context-aware constrained divergence minimization and the subsequent projection, we acknowledge that the detailed equations and pseudocode were not included. In the revised manuscript, we will add the full mathematical formulation, explicitly defining the vocabulary constraints and their application in the projection step. Pseudocode will also be provided to illustrate the dynamic selection and refinement process. This addition will enable verification that visual semantics are incorporated without introducing net error or bias. revision: yes
Referee: [Experiments] Experiments section: the abstract and results claim competitive or superior performance with up to 2× cost reduction, yet the manuscript lacks reported statistical significance tests, exact baseline re-implementations, and ablation studies isolating the projection step. These omissions undermine confidence that the gains are robust rather than dependent on particular post-hoc choices.

Authors: We agree that these additions would strengthen the experimental claims. In the revision, we will include statistical significance tests (such as paired t-tests) for the reported improvements across the five benchmarks. We will also specify the exact baseline re-implementations, including official code sources, versions, and hyperparameter choices used. Furthermore, we will expand the ablation studies to isolate the contribution of the projection step. These updates will be added to the experiments section to demonstrate robustness. revision: yes
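
The paired t-test mentioned in the rebuttal compares two decoding methods on the same evaluation items. A minimal sketch of the test statistic follows, using only the Python standard library. The function name `paired_t_statistic` and the per-item score lists are hypothetical; the numbers in the example are placeholders, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired per-item scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    if len(scores_a) < 2:
        raise ValueError("need at least two pairs")

    # The test operates on per-item differences, not on the two means separately.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder scores for two methods on the same three items.
t, dof = paired_t_statistic([1.0, 2.0, 3.0], [0.0, 0.0, 3.0])
print(t, dof)
```

Turning the statistic into a p-value requires the Student t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` computes both in one call if SciPy is available. Pairing matters here because every method is evaluated on the same images and prompts, so per-item difficulty cancels out in the differences.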

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper reports two empirical observations obtained via separate analyses on vision token behavior in LVLMs, then constructs a training-free algorithmic procedure (context-aware constrained divergence projection followed by distribution refinement) that operates on those observations. No equations, fitted parameters, or self-citations are shown that reduce the claimed results to the inputs by construction. The method is presented as a direct procedural application rather than a tautological renaming or self-referential definition, satisfying the default expectation of an independent derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on two domain assumptions extracted from the reported analyses; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: Vision tokens provide meaningful visual information even when hallucinations occur.
Stated as finding (i), which underpins the decision to reference vision tokens during decoding.
domain assumption: Vision token semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints.
Stated as finding (ii), which justifies the projection step into the text token distribution.

pith-pipeline@v0.9.0 · 5729 in / 1258 out tokens · 28571 ms · 2026-05-19T09:39:13.360925+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: ReVisiT ... projects vision token embeddings into V^t_cons and selects most relevant token ... by minimizing Jensen-Shannon Divergence ... then refines the output distribution

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[2] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[4] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[5] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[6] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[7] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

[8] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In International Conference on Learning Representations (ICLR), 2024.

[9] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.

[10] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.

[11] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[12] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[13] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[14] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In European Conference on Computer Vision (ECCV), 2015.

[15] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning (ICML), 2023.

[16] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In European Conference on Computer Vision (ECCV), 2017.

[17] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[18] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

[19] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[20] Anirudh Sundar and Larry Heck. Multimodal conversational ai: A survey of datasets and approaches. arXiv preprint arXiv:2205.06907, 2022.

[21] Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023.

[22] Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. In Science China Information Sciences (SCIS), 2024.

[23] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[24] Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, and Stefano Soatto. Multi-modal hallucination control by visual information grounding. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[25] Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, and Peilin Zhao. Self-introspective decoding: Alleviating hallucinations for large vision-language models. In International Conference on Learning Representations (ICLR), 2025.

[26] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. In International Conference on Learning Representations (ICLR), 2025.

[27] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.

[28] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

[29] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.

[30] Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint arXiv:2409.11402, 2024.

[31] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In International Conference on Learning Representations (ICLR), 2024.

[32] Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797, 2024.

[33] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. In International Conference on Learning Representations (ICLR), 2024.

[34] Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In International Conference on Pattern Recognition (ICPR), 2016.

[35] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. In International Conference on Learning Representations (ICLR), 2020.

[36] Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[37] nostalgebraist. interpreting gpt: the logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020.

[38] Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.

[39] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In International Conference on Computer Vision (ICCV), 2014.

[40] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In International Conference on Computer Vision (ICCV), 2022.

[41] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[42] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.

[43] Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715, 2024.

[44] Xuan Gong, Tianshi Ming, Xinpeng Wang, and Zhihua Wei. Damro: Dive into the attention mechanism of lvlm to reduce object hallucination. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
