pith. machine review for the scientific record.

arxiv: 2605.09443 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.CL

Recognition: no theorem link

Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:46 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords Modality-Role Interference · Multimodal Role-Playing Agents · Character-Aware Visual Intervention · CAVI · Token Pruning · Feature Modulation · Role Consistency · Multimodal Large Language Models

The pith

A training-free framework lets role-playing agents see visuals through their character's perspective, resolving the interference between visual input and consistent character identity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard multimodal models extract objective visual features that drown out the subjective traits needed for believable role-playing, creating modality-role interference. It introduces the Character-Aware Visual Intervention method that prunes irrelevant visual tokens, projects remaining features into a character-specific subspace, and adjusts generation steering on the fly. If this holds, existing models can deliver character-consistent responses to images without any retraining. The work matters because it removes a barrier to deploying visually grounded role-play in games, education, and interactive systems. Sympathetic readers would see this as a practical way to make AI agents maintain identity across modalities.

Core claim

The central claim is that Modality-Role Interference arises when objective visual processing overpowers fragile character traits, and that three coordinated interventions—character-guided token pruning to limit the receptive field, orthogonal feature modulation to extract aligned facts, and modality-adaptive role steering to optimize intensity—resolve it in a training-free manner, allowing multimodal role-playing agents to integrate visual grounding while preserving consistency.

What carries the argument

The Character-Aware Visual Intervention (CAVI) framework, which applies character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to align visual processing with role identity.

If this is right

  • Existing multimodal models gain character-consistent visual responses without retraining.
  • Visual attention is restricted to entities relevant to the assigned role.
  • Feature representations are projected into subspaces that preserve character context.
  • Generation steering strength varies automatically with how much the response depends on the image.
  • Role-playing agents maintain identity across both text and visual inputs in the same conversation.
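The fourth bullet describes steering strength that scales with visual reliance. A minimal sketch of one way such a rule could work; note this is an editorial guess, not the paper's MARS algorithm: the choice of attention mass on image tokens as the reliance proxy and the linear scaling are both assumptions.

```python
import math

def adaptive_steering_strength(attn_weights, image_token_idx, base_strength=1.0):
    """Scale role-steering strength by the share of attention mass that the
    current decoding step places on image tokens (a proxy for visual reliance).
    Hypothetical rule: reliance proxy and linear scaling are assumptions."""
    total = sum(attn_weights)
    visual = sum(attn_weights[i] for i in image_token_idx)
    return base_strength * visual / total

def steer_hidden_state(h, role_direction, strength):
    """Nudge a hidden state along the unit-normalized role direction."""
    norm = math.sqrt(sum(x * x for x in role_direction)) or 1.0
    return [hi + strength * di / norm for hi, di in zip(h, role_direction)]
```

Under this rule, a purely textual turn (no attention on image tokens) leaves the steering vector at zero strength, while an image-grounded turn steers at up to the base strength.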

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interference pattern may appear in any multimodal system where one modality carries identity-specific information, suggesting CAVI-style interventions could apply beyond role-play to areas like personalized captioning.
  • Because the method is training-free, it could be combined with other lightweight alignment techniques to test whether visual consistency compounds with other constraints such as safety or style.
  • If the three components prove additive, future work could isolate which one contributes most on different character types to guide minimal implementations.

Load-bearing premise

That the three character-specific rules for pruning, projection, and steering can be applied to any existing multimodal model without creating fresh inconsistencies or requiring per-character redesign.

What would settle it

A controlled test on a standard role-playing benchmark where CAVI is added to an off-the-shelf MLLM and character-consistency scores show no statistically significant gain over the unmodified baseline.
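Such a test implies a significance procedure on per-example consistency scores. As a hedged illustration of what "statistically significant" could mean operationally, here is a paired bootstrap on score differences; the benchmark, metric, and choice of test are editorial assumptions, not the paper's protocol.

```python
import numpy as np

def paired_bootstrap_pvalue(scores_cavi, scores_base, n_boot=10000, seed=0):
    """One-sided paired bootstrap: how often a mean gain at least as large as
    the observed one arises when per-example differences are centered to
    enforce the null of no CAVI-vs-baseline difference."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_cavi, float) - np.asarray(scores_base, float)
    observed = diffs.mean()
    centered = diffs - observed  # resample under the null hypothesis
    boots = rng.choice(centered, size=(n_boot, diffs.size), replace=True).mean(axis=1)
    return float((boots >= observed).mean())
```

A p-value above the chosen threshold on a matched benchmark run would be the "no significant gain" outcome the test describes.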

Figures

Figures reproduced from arXiv: 2605.09443 by Kehai Chen, Min Zhang, Xuefeng Bai, Yihong Tang.

Figure 1: Illustration of Modality-Role Interference (MRI).

Figure 2: Overview of Character-Aware Visual Intervention (CAVI). CAVI resolves Modality-Role Interference via a training-free pipeline: (1) Character Anchor extracts a semantic role reference from textual profiles. (2) Token Pruning (CTP) macroscopically discards role-irrelevant visual tokens. (3) Feature Modulation (OFM) microscopically purifies retained tokens by projecting out orthogonal out-of-character noise. …

Figure 4: Role-specific visual gain of token selectors. The displayed score is per-example normalized. …

Figure 5: OFM diagnostics. Left: aligned-versus-orthogonal energy before and after modulation. …

Figure 6: Controlled role-swap t-SNE for hidden states of Base, VCD, and CAVI across 6 roles.

Figure 7: Hyperparameter sensitivity evaluated on 100 sampled in-domain test examples. …
read the original abstract

The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Multimodal Large Language Models (MLLMs) used for Role-Playing Agents (RPAs) suffer from Modality-Role Interference (MRI), where objective visual feature extraction overpowers fragile character traits. It introduces the training-free Character-Aware Visual Intervention (CAVI) framework with three components—Character-Guided Token Pruning (CTP) to restrict the visual field to role-relevant entities, Orthogonal Feature Modulation (OFM) to project tokens onto a character-context subspace, and Modality-Adaptive Role Steering (MARS) to dynamically adjust steering intensity during decoding—and asserts that extensive experiments show CAVI effectively alleviates MRI and enhances character-consistent multimodal interactions.

Significance. If the interventions prove generalizable and the empirical results are robust, CAVI could provide a practical, plug-and-play method for improving role consistency in visually grounded RPAs without model retraining. The decomposition of MRI into macro-level pruning, micro-level feature alignment, and decoding-level steering offers a structured lens for addressing the tension between generic visual perception and identity-driven interpretation in MLLMs.

major comments (2)
  1. Abstract: The assertion that 'Extensive experiments show CAVI effectively alleviates MRI' is presented without any quantitative results, baselines, ablation studies, or error analysis. This absence leaves the central empirical claim without visible supporting evidence and undermines assessment of whether the three interventions deliver the claimed improvements.
  2. Methods (OFM and CTP descriptions): The construction of the 'character-context subspace' for Orthogonal Feature Modulation and the relevance scoring mechanism for Character-Guided Token Pruning are described only at a conceptual level, with no explicit general algorithm (e.g., derivation from prompt embeddings, PCA, or a fixed scoring function). This detail is load-bearing for the training-free and plug-and-play claims; without it, the interventions risk requiring hidden character-specific engineering that could introduce new feature distortions rather than cleanly resolving MRI.
minor comments (2)
  1. Abstract: The term Modality-Role Interference (MRI) is introduced without a concise formal definition or a short illustrative example of how generic visual noise overpowers character traits.
  2. Overall: Adding a schematic diagram showing the flow of CTP, OFM, and MARS and their interaction with an MLLM would improve clarity of the framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We have addressed each major comment point by point below and revised the manuscript to strengthen the presentation of our contributions.

read point-by-point responses
  1. Referee: Abstract: The assertion that 'Extensive experiments show CAVI effectively alleviates MRI' is presented without any quantitative results, baselines, ablation studies, or error analysis. This absence leaves the central empirical claim without visible supporting evidence and undermines assessment of whether the three interventions deliver the claimed improvements.

    Authors: We agree that the abstract would be strengthened by including key quantitative highlights. In the revised manuscript, we have updated the abstract to summarize the primary experimental outcomes, including average gains in character consistency metrics over baselines and ablation results demonstrating the contribution of each CAVI component. revision: yes

  2. Referee: Methods (OFM and CTP descriptions): The construction of the 'character-context subspace' for Orthogonal Feature Modulation and the relevance scoring mechanism for Character-Guided Token Pruning are described only at a conceptual level, with no explicit general algorithm (e.g., derivation from prompt embeddings, PCA, or a fixed scoring function). This detail is load-bearing for the training-free and plug-and-play claims; without it, the interventions risk requiring hidden character-specific engineering that could introduce new feature distortions rather than cleanly resolving MRI.

    Authors: We acknowledge that greater algorithmic specificity is needed to fully support the training-free claims. The revised manuscript now includes explicit step-by-step derivations, pseudocode, and formulas: the character-context subspace is constructed via orthogonal projection of visual tokens onto the span of prompt-derived character embeddings, and the CTP relevance score is computed as a normalized cosine similarity between token features and character role embeddings extracted from the input prompt. revision: yes
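The rebuttal's algorithmic description (cosine-similarity relevance scoring for CTP, projection onto a character-embedding subspace for OFM) can be sketched in a few lines. This is an editorial reconstruction of that description, not the authors' code; the keep ratio, the epsilon guard, and the QR-based projection are assumptions.

```python
import numpy as np

def ctp_prune(V, c, keep_ratio=0.5):
    """Character-Guided Token Pruning (sketch): keep the visual tokens whose
    cosine similarity to the character embedding c is highest.
    V: (n_tokens, dim) visual token features; c: (dim,) character embedding."""
    sims = (V @ c) / (np.linalg.norm(V, axis=1) * np.linalg.norm(c) + 1e-8)
    k = max(1, int(len(V) * keep_ratio))          # keep_ratio is an assumption
    keep = np.argsort(-sims)[:k]
    return V[np.sort(keep)]                       # preserve original token order

def ofm_modulate(V, C):
    """Orthogonal Feature Modulation (sketch): project each retained token onto
    the span of the character-context vectors C (rows), discarding the
    component orthogonal to that subspace."""
    Q, _ = np.linalg.qr(C.T)                      # orthonormal basis of span(C)
    return (V @ Q) @ Q.T                          # orthogonal projection
```

Because `ofm_modulate` is an orthogonal projection, applying it twice changes nothing, which matches the rebuttal's framing of OFM as removing the out-of-character component rather than transforming the in-character one.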

Circularity Check

0 steps flagged

No circularity: CAVI is a heuristic intervention framework without derivations or self-referential reductions.

full rationale

The paper presents CAVI as a training-free framework with three interventions (Character-Guided Token Pruning, Orthogonal Feature Modulation, and Modality-Adaptive Role Steering) to mitigate Modality-Role Interference. No equations, parameter fitting, uniqueness theorems, or derivation chains appear in the provided text. Claims rest on experimental validation rather than any step that reduces by construction to its inputs or prior self-citations. The approach is self-contained as a set of proposed rules applied to existing MLLMs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

No mathematical derivations or fitted constants; the paper is an applied engineering contribution that defines a new problem term and a heuristic intervention framework.

invented entities (1)
  • Modality-Role Interference (MRI) · no independent evidence
    purpose: Names the conflict between objective visual features and character-specific consistency in RPAs
    Newly coined term used to frame the motivation and target of CAVI

pith-pipeline@v0.9.0 · 5499 in / 1052 out tokens · 28742 ms · 2026-05-12T04:46:35.979475+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 2 internal anchors

  1. [1]

    The oscars of AI theater: A survey on role-playing with language models

    Nuo Chen, Yan Wang, Yang Deng, and Jia Li. The oscars of AI theater: A survey on role-playing with language models. arXiv preprint arXiv:2407.11484, 2024

  2. [2]

    MMRole: A comprehensive framework for developing and evaluating multimodal role-playing agents

    Yanqi Dai, Huanran Hu, Lei Wang, Shengjie Jin, Xu Chen, and Zhiwu Lu. MMRole: A comprehensive framework for developing and evaluating multimodal role-playing agents. In The Thirteenth International Conference on Learning Representations, 2025

  3. [3]

    Video2Roleplay: A multimodal dataset and framework for video-guided role-playing agents

    Xueqiao Zhang, Chao Zhang, Jingtao Xu, Yifan Zhu, Xin Shi, Yi Yang, and Yawei Luo. Video2Roleplay: A multimodal dataset and framework for video-guided role-playing agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23677–23703, Suzhou, China, November 2025

  4. [4]

    Enhancing personalized dialogue generation with contrastive latent variables: Combining sparse and dense persona

    Yihong Tang, Bo Wang, Miao Fang, Dongming Zhao, Kun Huang, Ruifang He, and Yuexian Hou. Enhancing personalized dialogue generation with contrastive latent variables: Combining sparse and dense persona. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5456–5468, Toronto, Canada, July 2...

  5. [5]

    Multi-party chat: Conversational agents in group settings with humans and models

    Jimmy Wei, Kurt Shuster, Arthur Szlam, Jason Weston, Jack Urbanek, and Mojtaba Komeili. Multi-party chat: Conversational agents in group settings with humans and models. arXiv preprint arXiv:2304.13835, 2023

  6. [6]

    Editing personality for large language models

    Shengyu Mao, Xiaohan Wang, Mengru Wang, Yong Jiang, Pengjun Xie, Fei Huang, and Ningyu Zhang. Editing personality for large language models. In Natural Language Processing and Chinese Computing: 13th National CCF Conference, NLPCC 2024, Hangzhou, China, November 1–3, 2024, Proceedings, Part II, pages 241–254, Berlin, Heidelberg, 2024. Springer-Verlag

  7. [7]

    InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews

    Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, and Yanghua Xiao. InCharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume ...

  8. [8]

    Character-LLM: A trainable agent for role-playing

    Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, Singapore, December 2023. Association for Computational Linguistics

  9. [10]

    CharacterGLM: Customizing social characters with large language models

    Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Pei Ke, Guanqun Bi, Libiao Peng, JiaMing Yang, Xiyao Xiao, Sahand Sabour, Xiaohan Zhang, Wenjing Hou, Yijia Zhang, Yuxiao Dong, Hongning Wang, Jie Tang, and Minlie Huang. CharacterGLM: Customizing social characters with large language models. In Proceedings of the 2024 Conf...

  10. [11]

    RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models

    Noah Wang, Z.y. Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhao Huang, Jie Fu, and Junran Peng. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational ...

  11. [12]

    Neeko: Leveraging dynamic LoRA for efficient multi-character role-playing agent

    Xiaoyan Yu, Tongxu Luo, Yifan Wei, Fangyu Lei, Yiming Huang, Hao Peng, and Liehuang Zhu. Neeko: Leveraging dynamic LoRA for efficient multi-character role-playing agent. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12540–12557, Miami, Florida, USA, November 2024. Association for Computational Linguistics....

  12. [13]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023

  13. [14]

    Identity-driven hierarchical role-playing agents

    Libo Sun, Siyuan Wang, Xuanjing Huang, and Zhongyu Wei. Identity-driven hierarchical role-playing agents. arXiv preprint arXiv:2407.19412, 2024

  14. [15]

    Character is destiny: Can role-playing language agents make persona-driven decisions?

    Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao. Character is destiny: Can role-playing language agents make persona-driven decisions? arXiv preprint arXiv:2404.12138, 2024

  15. [16]

    CharacterGPT: A persona reconstruction framework for role-playing agents

    Jeiyoon Park, Chanjun Park, and Heuiseok Lim. CharacterGPT: A persona reconstruction framework for role-playing agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), 2025

  16. [17]

    RoleRAG: Enhancing LLM role-playing via graph guided retrieval

    Yongjie Wang, Jonathan Leung, and Zhiqi Shen. RoleRAG: Enhancing LLM role-playing via graph guided retrieval. arXiv preprint arXiv:2505.18541, 2025

  17. [18]

    CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

    Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, and Shuchang Zhou. CoSER: Coordinating LLM-based persona simulation of established roles. arXiv preprint arXiv:2502.09082, 2025

  18. [19]

    Thinking in character: Advancing role-playing agents with role-aware reasoning

    Yihong Tang, Kehai Chen, Muyun Yang, Zheng-Yu Niu, Jing Li, Tiejun Zhao, and Min Zhang. Thinking in character: Advancing role-playing agents with role-aware reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  19. [20]

    Character-R1: Enhancing role-aware reasoning in role-playing agents via RLVR

    Yihong Tang, Kehai Chen, Xuefeng Bai, Benyou Wang, Zeming Liu, Haifeng Wang, and Min Zhang. Character-R1: Enhancing role-aware reasoning in role-playing agents via RLVR. arXiv preprint arXiv:2601.04611, 2026

  20. [21]

    A survey on multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 2024

  21. [22]

    Visual-RolePlay: Universal jailbreak attack on multimodal large language models via role-playing image character

    Siyuan Ma, Weidi Luo, Yu Wang, and Xiaogeng Liu. Visual-RolePlay: Universal jailbreak attack on multimodal large language models via role-playing image character. arXiv preprint arXiv:2405.20773, 2024

  22. [23]

    Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

  23. [24]

    A Survey on Hallucination in Large Vision-Language Models

    Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024

  24. [25]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore, December 2023. Association for Computational Linguistics

  25. [26]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  26. [27]

    AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023

  27. [28]

    Multi-object hallucination in vision language models

    Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Jianing Yang, David F Fouhey, Joyce Chai, and Shengyi Qian. Multi-object hallucination in vision language models. Advances in Neural Information Processing Systems, 2024

  28. [29]

    V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization

    Yuxi Xie, Guanzhen Li, Xu Xiao, and Min-Yen Kan. V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

  29. [30]

    Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback

    Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artifi...

  30. [31]

    Investigating and mitigating object hallucinations in pretrained vision-language (CLIP) models

    Yufang Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, and Aimin Zhou. Investigating and mitigating object hallucinations in pretrained vision-language (CLIP) models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  31. [32]

    Analyzing and mitigating object hallucination in large vision-language models

    Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023

  32. [33]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 2024

  33. [34]

    Logical closed loop: Uncovering object hallucinations in large vision-language models

    Junfei Wu, Qiang Liu, Ding Wang, Jinghao Zhang, Shu Wu, Liang Wang, and Tieniu Tan. Logical closed loop: Uncovering object hallucinations in large vision-language models. In Findings of the Association for Computational Linguistics: ACL 2024, 2024

  34. [35]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  35. [36]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  36. [37]

    HALC: Object hallucination reduction via adaptive focal-contrast decoding

    Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. HALC: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425, 2024

  37. [38]

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention

    Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, Qianying Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025

  38. [39]

    Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens

    Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  39. [40]

    DAMRO: Dive into the attention mechanism of LVLM to reduce object hallucination

    Xuan Gong, Tianshi Ming, Xinpeng Wang, and Zhihua Wei. DAMRO: Dive into the attention mechanism of LVLM to reduce object hallucination. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  40. [41]

    Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models

    Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Kening Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, and Xuming Hu. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. In Forty-second International Conference on Machine Learning, 2025

  41. [42]

    ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models

    Hao Yin, Guangzong Si, and Zilei Wang. ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14625–14634, Los Alamitos, CA, USA, June 2025. IEEE Computer Society

  42. [43]

    Look carefully: Adaptive visual reinforcements in multimodal large language models for hallucination mitigation

    Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang, Beier Zhu, Hanwang Zhang, and Xiangnan He. Look carefully: Adaptive visual reinforcements in multimodal large language models for hallucination mitigation. In The Fourteenth International Conference on Learning Representations, 2026

  43. [44]

    OpenAI GPT-5 system card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, et al. OpenAI GPT-5 system card, 2026

  44. [45]

    Gemini 3

    Google DeepMind. Gemini 3. Google DeepMind’s Blog, 2025. URL https://deepmind.google/technologies/gemini/

  45. [46]

    Introducing Claude 4

    Anthropic. Introducing Claude 4. Anthropic’s Blog, 2025. URL https://www.anthropic.com/news/claude-4