Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding
Pith reviewed 2026-05-19 09:39 UTC · model grok-4.3
The pith
Vision tokens encode usable semantics that project into text space to steer LVLM decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision tokens provide meaningful visual information even when hallucinations occur, and their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. ReVisiT exploits this fact by projecting the selected vision token into the text token distribution and using the resulting distribution to refine the model's output at every decoding step.
What carries the argument
ReVisiT, the training-free procedure that selects the most relevant vision token at each step through context-aware constrained divergence minimization and projects it to adjust the language-model output distribution.
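The selection-then-refinement step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the constrained vocabulary is taken to be the top-k tokens of the text distribution, each vision token is assumed to arrive already projected to vocabulary logits, the divergence is Jensen-Shannon, and refinement is an element-wise product followed by renormalization. The names revisit_step, softmax_over, js_divergence, and top_k are hypothetical.

```python
import math


def softmax_over(logits, vocab_ids):
    """Softmax restricted to a subset of vocabulary ids (returns id -> prob)."""
    peak = max(logits[i] for i in vocab_ids)
    exps = {i: math.exp(logits[i] - peak) for i in vocab_ids}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence between two dicts sharing the same keys."""
    def kl(a, b):
        # Terms with a[i] == 0 contribute nothing to the sum.
        return sum(a[i] * math.log(a[i] / b[i]) for i in a if a[i] > 0.0)

    mix = {i: 0.5 * (p[i] + q[i]) for i in p}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def revisit_step(text_logits, vision_logits, top_k=3):
    """One decoding step of the sketch.

    text_logits: next-token logits from the language model.
    vision_logits: one row of vocabulary logits per vision token
        (assumed to be precomputed by projecting each vision token).
    Returns (index of selected vision token, refined id -> prob mapping).
    """
    # Assumption: the constrained vocabulary is the top-k text candidates.
    ranked = sorted(range(len(text_logits)),
                    key=lambda i: text_logits[i], reverse=True)
    cons = ranked[:top_k]
    base = softmax_over(text_logits, cons)

    # Select the vision token whose constrained projection is closest
    # to the text distribution in Jensen-Shannon divergence.
    projected = [softmax_over(row, cons) for row in vision_logits]
    best = min(range(len(projected)),
               key=lambda j: js_divergence(base, projected[j]))

    # Assumption: refine by element-wise product, then renormalize.
    unnorm = {i: base[i] * projected[best][i] for i in cons}
    total = sum(unnorm.values())
    if total == 0.0:
        return best, base  # fall back if the product underflows
    return best, {i: v / total for i, v in unnorm.items()}
```

Restricting both distributions to the same small candidate set keeps the divergence well defined and cheap to compute. Whether the paper combines the two distributions by product, interpolation, or another rule is not stated in the text above.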
If this is right
- Text generated by the model aligns more closely with the visual input on standard multimodal benchmarks.
- Decoding can cost as little as half the compute of current state-of-the-art decoding baselines while matching or exceeding their accuracy.
- No additional training is required, so the method can be applied directly to already-deployed LVLMs.
- Hallucinations decrease because the output distribution is explicitly pulled toward the visual evidence at each step.
Where Pith is reading between the lines
- The same projection idea could be tested on other multimodal generators that mix discrete tokens from different modalities.
- If vision tokens already carry the needed semantics, future model designs might reduce the number of vision tokens without losing performance.
- The approach raises the question of whether similar constrained projections would help language-only models that have access to external knowledge tokens.
Load-bearing premise
Vision token semantics are already encoded in textual space and become explicit enough under vocabulary constraints that their projection can guide decoding without introducing new errors.
What would settle it
A controlled run on the same five benchmarks in which the constrained projection step is added but accuracy does not rise or hallucination rates stay the same or worsen relative to the unmodified baseline.
Original abstract
Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that guides text generation in LVLMs by Referencing Vision Tokens. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization. Then, ReVisiT uses its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ReVisiT, a training-free decoding method for Large Vision-Language Models (LVLMs). It rests on two observations: vision tokens retain meaningful visual information even when hallucinations occur, and their semantics are encoded in textual space, becoming explicit under appropriate vocabulary constraints. The method dynamically selects the most relevant vision token at each step via context-aware constrained divergence minimization and projects it to refine the output text distribution. Across five benchmarks on recent LVLMs, ReVisiT is reported to match or exceed state-of-the-art decoding baselines while reducing computational cost by up to 2×.
Significance. If the experimental support holds, the work offers a lightweight, immediately deployable technique for better exploiting existing vision tokens during LVLM inference. The training-free design and reported efficiency gains address practical concerns around hallucination and compute in multimodal systems, and the underlying observations about vision-token semantics could stimulate further analysis of internal representations.
major comments (2)
- [Method] Method section: the precise formulation of the context-aware constrained divergence minimization (including how vocabulary constraints are defined and applied to the projection) is not provided with equations or pseudocode. This detail is load-bearing for verifying that the projection transfers visual semantics without net error increase or unintended bias in the refined distribution.
- [Experiments] Experiments section: the abstract and results claim competitive or superior performance with up to 2× cost reduction, yet the manuscript lacks reported statistical significance tests, exact baseline re-implementations, and ablation studies isolating the projection step. These omissions undermine confidence that the gains are robust rather than dependent on particular post-hoc choices.
minor comments (2)
- [Abstract] Abstract: the claim of 'up to 2×' computational cost reduction should specify the exact models, benchmarks, and measurement (e.g., FLOPs vs. wall-clock time) under which this holds.
- [Method] Notation: ensure consistent use of symbols for vision-token embeddings versus text-token distributions throughout the method description.
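The request to state how the cost reduction is measured can be made concrete with a small helper. The sketch below is an assumption-laden illustration, not the paper's protocol: mean_wall_clock and cost_ratio are hypothetical names, and the callable passed in stands for whatever decoding routine is being timed.

```python
import time


def mean_wall_clock(fn, repeats=5):
    """Average wall-clock seconds per call of fn(), over `repeats` calls."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def cost_ratio(baseline_cost, method_cost):
    """How many times cheaper the method is than the baseline.

    Works for any cost unit (seconds, FLOPs), as long as both
    arguments use the same unit.
    """
    if method_cost <= 0:
        raise ValueError("method_cost must be positive")
    return baseline_cost / method_cost
```

Reporting the ratio separately for FLOPs and for wall-clock time, per model and benchmark, would show which setting the "up to 2×" figure refers to.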
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements for greater clarity and rigor.
-
Referee: [Method] Method section: the precise formulation of the context-aware constrained divergence minimization (including how vocabulary constraints are defined and applied to the projection) is not provided with equations or pseudocode. This detail is load-bearing for verifying that the projection transfers visual semantics without net error increase or unintended bias in the refined distribution.
Authors: We appreciate the referee highlighting this gap. While the manuscript describes the high-level approach of context-aware constrained divergence minimization and the subsequent projection, we acknowledge that the detailed equations and pseudocode were not included. In the revised manuscript, we will add the full mathematical formulation, explicitly defining the vocabulary constraints and their application in the projection step. Pseudocode will also be provided to illustrate the dynamic selection and refinement process. This addition will enable verification that visual semantics are incorporated without introducing net error or bias.
Revision: yes
-
Referee: [Experiments] Experiments section: the abstract and results claim competitive or superior performance with up to 2× cost reduction, yet the manuscript lacks reported statistical significance tests, exact baseline re-implementations, and ablation studies isolating the projection step. These omissions undermine confidence that the gains are robust rather than dependent on particular post-hoc choices.
Authors: We agree that these additions would strengthen the experimental claims. In the revision, we will include statistical significance tests (such as paired t-tests) for the reported improvements across the five benchmarks. We will also specify the exact baseline re-implementations, including official code sources, versions, and hyperparameter choices used. Furthermore, we will expand the ablation studies to isolate the contribution of the projection step. These updates will be added to the experiments section to demonstrate robustness.
Revision: yes
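For readers unfamiliar with the paired t-test mentioned in the response, here is a minimal sketch of the statistic, using only the Python standard library. It assumes each benchmark item (or run) is scored once by the baseline and once by the method, so scores are matched pairwise. The function name paired_t_statistic is hypothetical, and the paper does not say which pairing or test variant will be used.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline_scores, method_scores):
    """Return (t statistic, degrees of freedom) for matched score pairs.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-pair
    differences (method minus baseline) and n is the number of pairs.
    """
    if len(baseline_scores) != len(method_scores):
        raise ValueError("score lists must have the same length")
    diffs = [m - b for b, m in zip(baseline_scores, method_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / math.sqrt(n)), n - 1
```

The resulting statistic is compared against a Student t distribution with n - 1 degrees of freedom to obtain a p-value. A positive t means the method scored higher on average.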
Circularity Check
No significant circularity; derivation is self-contained
The paper reports two empirical observations obtained via separate analyses on vision token behavior in LVLMs, then constructs a training-free algorithmic procedure (context-aware constrained divergence projection followed by distribution refinement) that operates on those observations. No equations, fitted parameters, or self-citations are shown that reduce the claimed results to the inputs by construction. The method is presented as a direct procedural application rather than a tautological renaming or self-referential definition, satisfying the default expectation of an independent derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Vision tokens provide meaningful visual information even when hallucinations occur.
- domain assumption: Vision token semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: ReVisiT ... projects vision token embeddings into V^t_cons and selects most relevant token ... by minimizing Jensen-Shannon Divergence ... then refines the output distribution
-
IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization
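The quoted passages refer to a constrained vocabulary V^t_cons and to Jensen-Shannon Divergence. For reference, the standard definition of that divergence, written over a constrained vocabulary, is given below. The symbols p_t (the text distribution at step t restricted to V^t_cons), q_i (the constrained projection of vision token i), and i^* (the selected token) are illustrative assumptions and may not match the paper's notation.

```latex
\begin{align}
  m(v) &= \tfrac{1}{2}\bigl(p_t(v) + q_i(v)\bigr), \qquad v \in V^{t}_{\mathrm{cons}} \\
  \mathrm{KL}(p \,\|\, m) &= \sum_{v \in V^{t}_{\mathrm{cons}}} p(v)\,\log\frac{p(v)}{m(v)} \\
  \mathrm{JSD}(p_t \,\|\, q_i) &= \tfrac{1}{2}\,\mathrm{KL}(p_t \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q_i \,\|\, m) \\
  i^{*} &= \arg\min_{i}\; \mathrm{JSD}(p_t \,\|\, q_i)
\end{align}
```

Because m is strictly positive wherever p_t or q_i is, the divergence is always finite, and it is symmetric in its two arguments.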
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.