pith. machine review for the scientific record.

arxiv: 2605.09443 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.CL

Recognition: no theorem link

Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:46 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords Modality-Role Interference · Multimodal Role-Playing Agents · Character-Aware Visual Intervention · CAVI · Token Pruning · Feature Modulation · Role Consistency · Multimodal Large Language Models

The pith

A training-free framework lets role-playing agents see visuals through their character's perspective, resolving the interference between visual input and consistent character identity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard multimodal models extract objective visual features that drown out the subjective traits needed for believable role-playing, creating modality-role interference. It introduces the Character-Aware Visual Intervention method that prunes irrelevant visual tokens, projects remaining features into a character-specific subspace, and adjusts generation steering on the fly. If this holds, existing models can deliver character-consistent responses to images without any retraining. The work matters because it removes a barrier to deploying visually grounded role-play in games, education, and interactive systems. Sympathetic readers would see this as a practical way to make AI agents maintain identity across modalities.

Core claim

The central claim is that Modality-Role Interference arises when objective visual processing overpowers fragile character traits, and that three coordinated interventions—character-guided token pruning to limit the receptive field, orthogonal feature modulation to extract aligned facts, and modality-adaptive role steering to optimize intensity—resolve it in a training-free manner, allowing multimodal role-playing agents to integrate visual grounding while preserving consistency.

What carries the argument

The Character-Aware Visual Intervention (CAVI) framework, which applies character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to align visual processing with role identity.

If this is right

  • Existing multimodal models gain character-consistent visual responses without retraining.
  • Visual attention is restricted to entities relevant to the assigned role.
  • Feature representations are projected into subspaces that preserve character context.
  • Generation steering strength varies automatically with how much the response depends on the image.
  • Role-playing agents maintain identity across both text and visual inputs in the same conversation.
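The fourth bullet describes steering strength that scales with visual reliance. A minimal sketch of one way such a rule could work; note this is an editorial guess, not the paper's MARS algorithm: the choice of attention mass on image tokens as the reliance proxy and the linear scaling are both assumptions.

```python
import math

def adaptive_steering_strength(attn_weights, image_token_idx, base_strength=1.0):
    """Scale role-steering strength by the share of attention mass that the
    current decoding step places on image tokens (a proxy for visual reliance).
    Hypothetical rule: reliance proxy and linear scaling are assumptions."""
    total = sum(attn_weights)
    visual = sum(attn_weights[i] for i in image_token_idx)
    return base_strength * visual / total

def steer_hidden_state(h, role_direction, strength):
    """Nudge a hidden state along the unit-normalized role direction."""
    norm = math.sqrt(sum(x * x for x in role_direction)) or 1.0
    return [hi + strength * di / norm for hi, di in zip(h, role_direction)]
```

Under this rule, a purely textual turn (no attention on image tokens) leaves the steering vector at zero strength, while an image-grounded turn steers at up to the base strength.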

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interference pattern may appear in any multimodal system where one modality carries identity-specific information, suggesting CAVI-style interventions could apply beyond role-play to areas like personalized captioning.
  • Because the method is training-free, it could be combined with other lightweight alignment techniques to test whether visual consistency compounds with other constraints such as safety or style.
  • If the three components prove additive, future work could isolate which one contributes most on different character types to guide minimal implementations.

Load-bearing premise

That the three character-specific rules for pruning, projection, and steering can be applied to any existing multimodal model without creating fresh inconsistencies or requiring per-character redesign.

What would settle it

A controlled test on a standard role-playing benchmark where CAVI is added to an off-the-shelf MLLM and character-consistency scores show no statistically significant gain over the unmodified baseline.
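Such a test implies a significance procedure on per-example consistency scores. As a hedged illustration of what "statistically significant" could mean operationally, here is a paired bootstrap on score differences; the benchmark, metric, and choice of test are editorial assumptions, not the paper's protocol.

```python
import numpy as np

def paired_bootstrap_pvalue(scores_cavi, scores_base, n_boot=10000, seed=0):
    """One-sided paired bootstrap: how often a mean gain at least as large as
    the observed one arises when per-example differences are centered to
    enforce the null of no CAVI-vs-baseline difference."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_cavi, float) - np.asarray(scores_base, float)
    observed = diffs.mean()
    centered = diffs - observed  # resample under the null hypothesis
    boots = rng.choice(centered, size=(n_boot, diffs.size), replace=True).mean(axis=1)
    return float((boots >= observed).mean())
```

A p-value above the chosen threshold on a matched benchmark run would be the "no significant gain" outcome the test describes.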

Figures

Figures reproduced from arXiv: 2605.09443 by Kehai Chen, Min Zhang, Xuefeng Bai, Yihong Tang.

Figure 1: Illustration of Modality-Role Interference (MRI).

Figure 2: Overview of Character-Aware Visual Intervention (CAVI). CAVI resolves Modality-Role Interference via a training-free pipeline: (1) Character Anchor extracts a semantic role reference from textual profiles. (2) Token Pruning (CTP) macroscopically discards role-irrelevant visual tokens. (3) Feature Modulation (OFM) microscopically purifies retained tokens by projecting out orthogonal out-of-character noise. …

Figure 4: Role-specific visual gain of token selectors. The displayed score is per-example normalized. …

Figure 5: OFM diagnostics. Left: aligned-versus-orthogonal energy before and after modulation. …

Figure 6: Controlled role-swap t-SNE for hidden states of Base, VCD, and CAVI across 6 roles.

Figure 7: Hyperparameter sensitivity evaluated on 100 sampled in-domain test examples. …
read the original abstract

The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Multimodal Large Language Models (MLLMs) used for Role-Playing Agents (RPAs) suffer from Modality-Role Interference (MRI), where objective visual feature extraction overpowers fragile character traits. It introduces the training-free Character-Aware Visual Intervention (CAVI) framework with three components—Character-Guided Token Pruning (CTP) to restrict the visual field to role-relevant entities, Orthogonal Feature Modulation (OFM) to project tokens onto a character-context subspace, and Modality-Adaptive Role Steering (MARS) to dynamically adjust steering intensity during decoding—and asserts that extensive experiments show CAVI effectively alleviates MRI and enhances character-consistent multimodal interactions.

Significance. If the interventions prove generalizable and the empirical results are robust, CAVI could provide a practical, plug-and-play method for improving role consistency in visually grounded RPAs without model retraining. The decomposition of MRI into macro-level pruning, micro-level feature alignment, and decoding-level steering offers a structured lens for addressing the tension between generic visual perception and identity-driven interpretation in MLLMs.

major comments (2)
  1. Abstract: The assertion that 'Extensive experiments show CAVI effectively alleviates MRI' is presented without any quantitative results, baselines, ablation studies, or error analysis. This absence leaves the central empirical claim without visible supporting evidence and undermines assessment of whether the three interventions deliver the claimed improvements.
  2. Methods (OFM and CTP descriptions): The construction of the 'character-context subspace' for Orthogonal Feature Modulation and the relevance scoring mechanism for Character-Guided Token Pruning are described only at a conceptual level, with no explicit general algorithm (e.g., derivation from prompt embeddings, PCA, or a fixed scoring function). This detail is load-bearing for the training-free and plug-and-play claims; without it, the interventions risk requiring hidden character-specific engineering that could introduce new feature distortions rather than cleanly resolving MRI.
minor comments (2)
  1. Abstract: The term Modality-Role Interference (MRI) is introduced without a concise formal definition or a short illustrative example of how generic visual noise overpowers character traits.
  2. Overall: Adding a schematic diagram showing the flow of CTP, OFM, and MARS and their interaction with an MLLM would improve clarity of the framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We have addressed each major comment point by point below and revised the manuscript to strengthen the presentation of our contributions.

read point-by-point responses
  1. Referee: Abstract: The assertion that 'Extensive experiments show CAVI effectively alleviates MRI' is presented without any quantitative results, baselines, ablation studies, or error analysis. This absence leaves the central empirical claim without visible supporting evidence and undermines assessment of whether the three interventions deliver the claimed improvements.

    Authors: We agree that the abstract would be strengthened by including key quantitative highlights. In the revised manuscript, we have updated the abstract to summarize the primary experimental outcomes, including average gains in character consistency metrics over baselines and ablation results demonstrating the contribution of each CAVI component. revision: yes

  2. Referee: Methods (OFM and CTP descriptions): The construction of the 'character-context subspace' for Orthogonal Feature Modulation and the relevance scoring mechanism for Character-Guided Token Pruning are described only at a conceptual level, with no explicit general algorithm (e.g., derivation from prompt embeddings, PCA, or a fixed scoring function). This detail is load-bearing for the training-free and plug-and-play claims; without it, the interventions risk requiring hidden character-specific engineering that could introduce new feature distortions rather than cleanly resolving MRI.

    Authors: We acknowledge that greater algorithmic specificity is needed to fully support the training-free claims. The revised manuscript now includes explicit step-by-step derivations, pseudocode, and formulas: the character-context subspace is constructed via orthogonal projection of visual tokens onto the span of prompt-derived character embeddings, and the CTP relevance score is computed as a normalized cosine similarity between token features and character role embeddings extracted from the input prompt. revision: yes
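The rebuttal's algorithmic description (cosine-similarity relevance scoring for CTP, projection onto a character-embedding subspace for OFM) can be sketched in a few lines. This is an editorial reconstruction of that description, not the authors' code; the keep ratio, the epsilon guard, and the QR-based projection are assumptions.

```python
import numpy as np

def ctp_prune(V, c, keep_ratio=0.5):
    """Character-Guided Token Pruning (sketch): keep the visual tokens whose
    cosine similarity to the character embedding c is highest.
    V: (n_tokens, dim) visual token features; c: (dim,) character embedding."""
    sims = (V @ c) / (np.linalg.norm(V, axis=1) * np.linalg.norm(c) + 1e-8)
    k = max(1, int(len(V) * keep_ratio))          # keep_ratio is an assumption
    keep = np.argsort(-sims)[:k]
    return V[np.sort(keep)]                       # preserve original token order

def ofm_modulate(V, C):
    """Orthogonal Feature Modulation (sketch): project each retained token onto
    the span of the character-context vectors C (rows), discarding the
    component orthogonal to that subspace."""
    Q, _ = np.linalg.qr(C.T)                      # orthonormal basis of span(C)
    return (V @ Q) @ Q.T                          # orthogonal projection
```

Because `ofm_modulate` is an orthogonal projection, applying it twice changes nothing, which matches the rebuttal's framing of OFM as removing the out-of-character component rather than transforming the in-character one.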

Circularity Check

0 steps flagged

No circularity: CAVI is a heuristic intervention framework without derivations or self-referential reductions.

full rationale

The paper presents CAVI as a training-free framework with three interventions (Character-Guided Token Pruning, Orthogonal Feature Modulation, and Modality-Adaptive Role Steering) to mitigate Modality-Role Interference. No equations, parameter fitting, uniqueness theorems, or derivation chains appear in the provided text. Claims rest on experimental validation rather than any step that reduces by construction to its inputs or prior self-citations. The approach is self-contained as a set of proposed rules applied to existing MLLMs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

No mathematical derivations or fitted constants; the paper is an applied engineering contribution that defines a new problem term and a heuristic intervention framework.

invented entities (1)
  • Modality-Role Interference (MRI) · no independent evidence
    purpose: Names the conflict between objective visual features and character-specific consistency in RPAs
    Newly coined term used to frame the motivation and target of CAVI

pith-pipeline@v0.9.0 · 5499 in / 1052 out tokens · 28742 ms · 2026-05-12T04:46:35.979475+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 2 internal anchors

  1. [1]

    The oscars of AI theater: A survey on role-playing with language models

    Nuo Chen, Yan Wang, Yang Deng, and Jia Li. The oscars of AI theater: A survey on role-playing with language models. arXiv preprint arXiv:2407.11484, 2024

  2. [2]

    MMRole: A comprehensive framework for developing and evaluating multimodal role-playing agents

    Yanqi Dai, Huanran Hu, Lei Wang, Shengjie Jin, Xu Chen, and Zhiwu Lu. MMRole: A comprehensive framework for developing and evaluating multimodal role-playing agents. In The Thirteenth International Conference on Learning Representations, 2025

  3. [3]

    Video2Roleplay: A multimodal dataset and framework for video-guided role-playing agents

    Xueqiao Zhang, Chao Zhang, Jingtao Xu, Yifan Zhu, Xin Shi, Yi Yang, and Yawei Luo. Video2Roleplay: A multimodal dataset and framework for video-guided role-playing agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23677–23703, Suzhou, China, November 2025

  4. [4]

    Enhancing personalized dialogue generation with contrastive latent variables: Combining sparse and dense persona

    Yihong Tang, Bo Wang, Miao Fang, Dongming Zhao, Kun Huang, Ruifang He, and Yuexian Hou. Enhancing personalized dialogue generation with contrastive latent variables: Combining sparse and dense persona. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5456–5468, Toronto, Canada, July 2...

  5. [5]

    Multi-party chat: Conversational agents in group settings with humans and models

    Jimmy Wei, Kurt Shuster, Arthur Szlam, Jason Weston, Jack Urbanek, and Mojtaba Komeili. Multi-party chat: Conversational agents in group settings with humans and models. arXiv preprint arXiv:2304.13835, 2023

  6. [6]

    Editing personality for large language models

    Shengyu Mao, Xiaohan Wang, Mengru Wang, Yong Jiang, Pengjun Xie, Fei Huang, and Ningyu Zhang. Editing personality for large language models. In Natural Language Processing and Chinese Computing: 13th National CCF Conference, NLPCC 2024, Hangzhou, China, November 1–3, 2024, Proceedings, Part II, pages 241–254, Berlin, Heidelberg, 2024. Springer-Verlag

  7. [7]

    InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews

    Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, and Yanghua Xiao. InCharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume ...

  8. [8]

    Character-LLM: A trainable agent for role-playing

    Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, Singapore, December 2023. Association for Computational Linguistics

  9. [10]

    CharacterGLM: Customizing social characters with large language models

    Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Pei Ke, Guanqun Bi, Libiao Peng, JiaMing Yang, Xiyao Xiao, Sahand Sabour, Xiaohan Zhang, Wenjing Hou, Yijia Zhang, Yuxiao Dong, Hongning Wang, Jie Tang, and Minlie Huang. CharacterGLM: Customizing social characters with large language models. In Proceedings of the 2024 Conf...

  10. [11]

    RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models

    Noah Wang, Z.y. Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhao Huang, Jie Fu, and Junran Peng. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational ...

  11. [12]

    Neeko: Leveraging dynamic LoRA for efficient multi-character role-playing agent

    Xiaoyan Yu, Tongxu Luo, Yifan Wei, Fangyu Lei, Yiming Huang, Hao Peng, and Liehuang Zhu. Neeko: Leveraging dynamic LoRA for efficient multi-character role-playing agent. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12540–12557, Miami, Florida, USA, November 2024. Association for Computational Linguistics....

  12. [13]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023

  13. [14]

    Identity-driven hierarchical role-playing agents

    Libo Sun, Siyuan Wang, Xuanjing Huang, and Zhongyu Wei. Identity-driven hierarchical role-playing agents. arXiv preprint arXiv:2407.19412, 2024

  14. [15]

    Character is destiny: Can role-playing language agents make persona-driven decisions?

    Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao. Character is destiny: Can role-playing language agents make persona-driven decisions? arXiv preprint arXiv:2404.12138, 2024

  15. [16]

    CharacterGPT: A persona reconstruction framework for role-playing agents

    Jeiyoon Park, Chanjun Park, and Heuiseok Lim. CharacterGPT: A persona reconstruction framework for role-playing agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), 2025

  16. [17]

    RoleRAG: Enhancing LLM role-playing via graph guided retrieval

    Yongjie Wang, Jonathan Leung, and Zhiqi Shen. RoleRAG: Enhancing LLM role-playing via graph guided retrieval. arXiv preprint arXiv:2505.18541, 2025

  17. [18]

    CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

    Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, and Shuchang Zhou. CoSER: Coordinating LLM-based persona simulation of established roles. arXiv preprint arXiv:2502.09082, 2025

  18. [19]

    Thinking in character: Advancing role-playing agents with role-aware reasoning

    Yihong Tang, Kehai Chen, Muyun Yang, Zheng-Yu Niu, Jing Li, Tiejun Zhao, and Min Zhang. Thinking in character: Advancing role-playing agents with role-aware reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  19. [20]

    Character-R1: Enhancing role-aware reasoning in role-playing agents via RLVR

    Yihong Tang, Kehai Chen, Xuefeng Bai, Benyou Wang, Zeming Liu, Haifeng Wang, and Min Zhang. Character-R1: Enhancing role-aware reasoning in role-playing agents via RLVR. arXiv preprint arXiv:2601.04611, 2026

  20. [21]

    A survey on multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 2024

  21. [22]

    Visual-RolePlay: Universal jailbreak attack on multimodal large language models via role-playing image character

    Siyuan Ma, Weidi Luo, Yu Wang, and Xiaogeng Liu. Visual-RolePlay: Universal jailbreak attack on multimodal large language models via role-playing image character. arXiv preprint arXiv:2405.20773, 2024

  22. [23]

    Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

  23. [24]

    A Survey on Hallucination in Large Vision-Language Models

    Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024

  24. [25]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore, December 2023. Association for Computational Linguistics

  25. [26]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  26. [27]

    AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023

  27. [28]

    Multi-object hallucination in vision language models

    Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Jianing Yang, David F Fouhey, Joyce Chai, and Shengyi Qian. Multi-object hallucination in vision language models. Advances in Neural Information Processing Systems, 2024

  28. [29]

    V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization

    Yuxi Xie, Guanzhen Li, Xu Xiao, and Min-Yen Kan. V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

  29. [30]

    Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback

    Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artifi...

  30. [31]

    Investigating and mitigating object hallucinations in pretrained vision-language (CLIP) models

    Yufang Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, and Aimin Zhou. Investigating and mitigating object hallucinations in pretrained vision-language (CLIP) models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  31. [32]

    Analyzing and mitigating object hallucination in large vision-language models

    Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023

  32. [33]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 2024

  33. [34]

    Logical closed loop: Uncovering object hallucinations in large vision-language models

    Junfei Wu, Qiang Liu, Ding Wang, Jinghao Zhang, Shu Wu, Liang Wang, and Tieniu Tan. Logical closed loop: Uncovering object hallucinations in large vision-language models. In Findings of the Association for Computational Linguistics: ACL 2024, 2024

  34. [35]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  35. [36]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  36. [37]

    HALC: Object hallucination reduction via adaptive focal-contrast decoding

    Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. HALC: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425, 2024

  37. [38]

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention

    Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, Qianying Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025

  38. [39]

    Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens

    Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  39. [40]

    DAMRO: Dive into the attention mechanism of LVLM to reduce object hallucination

    Xuan Gong, Tianshi Ming, Xinpeng Wang, and Zhihua Wei. DAMRO: Dive into the attention mechanism of LVLM to reduce object hallucination. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  40. [41]

    Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models

    Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Kening Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, and Xuming Hu. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. In Forty-second International Conference on Machine Learning, 2025

  41. [42]

    ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models

    Hao Yin, Guangzong Si, and Zilei Wang. ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14625–14634, Los Alamitos, CA, USA, June 2025. IEEE Computer Society

  42. [43]

    Look carefully: Adaptive visual reinforcements in multimodal large language models for hallucination mitigation

    Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang, Beier Zhu, Hanwang Zhang, and Xiangnan He. Look carefully: Adaptive visual reinforcements in multimodal large language models for hallucination mitigation. In The Fourteenth International Conference on Learning Representations, 2026

  43. [44]

    OpenAI GPT-5 system card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, et al. OpenAI GPT-5 system card, 2026

  44. [45]

    Gemini 3

    Google DeepMind. Gemini 3. Google DeepMind’s Blog, 2025. URL https://deepmind.google/technologies/gemini/

  45. [46]

    Introducing Claude 4

    Anthropic. Introducing Claude 4. Anthropic’s Blog, 2025. URL https://www.anthropic.com/news/claude-4