pith. machine review for the scientific record. sign in

arxiv: 2604.10071 · v1 · submitted 2026-04-11 · 💻 cs.CV

Recognition: unknown

Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords hallucinationMLLMcontrastive decodingvisual attentionintrospective decodingmultimodal large language modelsdecoding framework
0
0 comments X

The pith

Dual-Anchor Introspective Decoding selects spotlight and shadow layers using attention to reduce hallucinations in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to mitigate hallucinations in multimodal large language models by adjusting the decoding process for each generated token. It mines the model's internal visual attention to choose a spotlight layer that strengthens factual visual signals and a shadow layer that weakens reliance on prior text patterns. This contrastive approach calibrates outputs without changing the model itself. A reader would care because hallucinations limit the reliability of these models in real applications like image description or visual question answering, and this offers a training-free fix that also boosts reasoning.

Core claim

DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia, guided by visual attention distributions for precise, token-specific adaptation during generation.

What carries the argument

The dual-anchor selection process where visual attention distributions guide the choice of spotlight and shadow layers for contrastive decoding.

If this is right

  • Significantly reduces hallucination rates on multiple benchmarks and across various MLLMs.
  • Enhances general reasoning capabilities in addition to factual accuracy.
  • Enables dynamic, token-by-token adaptation without retraining or external modules.
  • The method applies broadly to different multimodal models without model-specific changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the attention-based selection works, similar internal discrepancy mining could help in unimodal language models for other error types.
  • This suggests a general principle that models can self-correct by contrasting layers with different strengths.
  • Testing on more complex visual tasks like detailed scene understanding could reveal further benefits.

Load-bearing premise

That mining perceptual discrepancies via attention distributions reliably identifies and corrects visual hallucinations without creating new errors.

What would settle it

Running the method on an MLLM using a benchmark such as CHAIR or POPE and finding that hallucination metrics do not decrease or even increase compared to standard decoding.

Figures

Figures reproduced from arXiv: 2604.10071 by Han Jin, Li Li, Yebo Wu, Zhijiang Guo.

Figure 1
Figure 1. Figure 1: A failure case of Visual Contrastive Decoding [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Layer-wise diagnosis of MLLM decoding dynamics on the POPE benchmark, revealing the evolution of [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Framework of DAID. DAID dynamically identifies dual-anchor layers guided by the visual attention and calibrates the final output by leveraging both the Spotlight and Shadow anchors to ensure visual grounding. we identify the positive anchor (Spotlight layer) as the layer exhibiting maximum visual grounding: L t spot. = argmax l∈{1,...,L} VASt(l). (2) By anchoring to L t spot., we retrieve authentic visual … view at source ↗
Figure 4
Figure 4. Figure 4: Performance evaluation of LLaVA-1.5-7B/13B across five general vision-language benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Impact of visual perception reinforcement [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Impact of language prior suppression intensity [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Generalization analysis of DAID across diverse MLLM architectures on CHAIR and POPE benchmarks. and then to 0.8, performance improves signifi￾cantly (e.g., LLaVA-1.5 reaches 85.92% F1 and 85.08% accuracy), demonstrating that reintroduc￾ing visual signals effectively counteracts the seeing￾then-forgetting phenomenon. However, perfor￾mance degrades when α exceeds 0.8: at α = 1.0, F1 scores drop by 0.71% and … view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative case study of DAID on a representative instance (input image and query detailed in [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Analysis of the topological constraint ( [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Inference latency comparison (ms/token) on NVIDIA RTX 5070 Ti. We evaluate the decoding speed of [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visualization of token-specific dynamic anchor selection on LLaVA-1.5. We track the layer indices [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities yet continue to suffer from hallucination, where generated text contradicts visual content. In this paper, we introduce Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that dynamically calibrates each token generation by mining the model's internal perceptual discrepancies. Specifically, DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia. By leveraging visual attention distributions to guide this dual-anchor selection process, our method ensures precise, token-specific adaptation. Experimental results across multiple benchmarks and MLLMs demonstrate that DaID significantly mitigates hallucination while enhancing general reasoning capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Dual-Anchor Introspective Decoding (DaID), a contrastive decoding framework for Multimodal Large Language Models (MLLMs) that dynamically selects a Spotlight layer (to amplify visual factual signals) and a Shadow layer (to suppress textual inertia) at each token generation step, guided by visual attention distributions to mine internal perceptual discrepancies and thereby mitigate hallucinations while improving reasoning.

Significance. If the reported gains hold under rigorous controls, DaID offers a training-free, parameter-free approach that leverages existing internal model states for more reliable multimodal generation; this would be a meaningful contribution to hallucination mitigation in MLLMs, especially given the emphasis on token-specific adaptation.

major comments (2)
  1. [Method (dual-anchor selection and attention guidance)] The central assumption that attention distributions reliably surface perceptual discrepancies corresponding to visual hallucinations (and that Spotlight/Shadow selection can be performed dynamically without new inconsistencies) is load-bearing for the entire framework, yet the manuscript provides no explicit validation, ablation on layer selection criteria, or external grounding beyond internal states; this leaves the method vulnerable to circularity.
  2. [Experiments] Table 1 and Figure 3 (benchmark results): the reported improvements over baselines lack error bars, statistical significance tests, or details on prompt variations and model scales; without these, the claim of 'significant mitigation across multiple benchmarks and MLLMs' cannot be fully evaluated.
minor comments (2)
  1. [3.1] Notation for 'Spotlight' and 'Shadow' layers is introduced without a clear equation defining the contrastive logit combination (e.g., how the two anchors are weighted per token); a single equation would improve clarity.
  2. [Abstract] The abstract states quantitative improvements but supplies none of the actual numbers, baselines, or dataset names; moving a concise results summary into the abstract would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we provide point-by-point responses to the major comments and indicate the revisions we will make to address them.

read point-by-point responses
  1. Referee: [Method (dual-anchor selection and attention guidance)] The central assumption that attention distributions reliably surface perceptual discrepancies corresponding to visual hallucinations (and that Spotlight/Shadow selection can be performed dynamically without new inconsistencies) is load-bearing for the entire framework, yet the manuscript provides no explicit validation, ablation on layer selection criteria, or external grounding beyond internal states; this leaves the method vulnerable to circularity.

    Authors: We acknowledge that the assumption is central and that the current manuscript relies primarily on the design rationale and downstream performance for support. To address the request for explicit validation, we will add a dedicated ablation subsection examining alternative layer-selection criteria (e.g., fixed layers, random selection, and attention-threshold variants) together with an analysis that correlates selected Spotlight/Shadow activations against external visual grounding signals where available. Regarding potential circularity, the attention maps are computed from the model's visual encoder and used only to choose anchors; effectiveness is measured on independent hallucination benchmarks whose metrics do not depend on internal states, providing external grounding. revision: yes

  2. Referee: [Experiments] Table 1 and Figure 3 (benchmark results): the reported improvements over baselines lack error bars, statistical significance tests, or details on prompt variations and model scales; without these, the claim of 'significant mitigation across multiple benchmarks and MLLMs' cannot be fully evaluated.

    Authors: We agree that these elements are required for a rigorous evaluation. In the revised version we will (i) add error bars to Table 1 and Figure 3 computed over at least three independent runs with different random seeds, (ii) report p-values from paired statistical tests comparing DaID against each baseline, and (iii) expand the experimental setup section with the exact prompt templates, input formatting variations, and the full range of model scales tested. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces DaID as an explicit algorithmic construction: visual attention distributions are computed from the MLLM and used to select per-token Spotlight and Shadow layers for contrastive decoding. No equation or result is shown to be equivalent to its own inputs by construction, no parameter is fitted on a subset and then renamed as a prediction, and no load-bearing premise rests on a self-citation chain. The method is presented as falsifiable through benchmark experiments on multiple models and datasets, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.0 · 5421 in / 1149 out tokens · 78242 ms · 2026-05-10T17:02:14.451695+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 24 canonical work pages · 7 internal anchors

  1. [1]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Rawan AlSaad, Alaa Abd-Alrazaq, Sabri Boughorbel, Arfan Ahmed, Max-Antoine Renault, Rafat Damseh, and Javaid Sheikh. 2024. Multimodal large language models in health care: applications, challenges, and future outlook. Journal of medical Internet research, 26:e59505

  4. [4]

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930

  5. [5]

    Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Personalizing multimodal large language models for image captioning: an experimental analysis. In European Conference on Computer Vision, pages 351--368. Springer

  6. [6]

    Liwei Che, Tony Qingze Liu, Jing Jia, Weiyi Qin, Ruixiang Tang, and Vladimir Pavlovic. 2025. Hallucinatory image tokens: A training-free eazy approach to detecting and mitigating object hallucinations in lvlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21635--21644

  7. [7]

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. 2023. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883

  8. [8]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems, 36:49250--49267

  9. [9]

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. 2025. Mme: A comprehensive evaluation benchmark for multimodal large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  10. [10]

    Ziliang Gan, Yu Lu, Dong Zhang, Haohan Li, Che Liu, Jian Liu, Ji Liu, Haipang Wu, Chaoyou Fu, Zenglin Xu, et al. 2024. Mme-finance: A multimodal finance benchmark for expert-level understanding and reasoning. arXiv preprint arXiv:2411.03314

  11. [11]

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904--6913

  12. [12]

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608--3617

  13. [13]

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. 2024. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418--13427

  14. [14]

    Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709

  15. [15]

    Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, and Peilin Zhao. 2024. Self-introspective decoding: Alleviating hallucinations for large vision-language models. arXiv preprint arXiv:2408.02032

  16. [16]

    Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023. Towards mitigating llm hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827--1843

  17. [17]

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2024. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872--13882

  18. [18]

    Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. 2024 a . https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/ Llava-next: Stronger llms supercharge multimodal capabilities in the wild

  19. [19]

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023 a . Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125

  20. [20]

    Ming Li, Keyu Chen, Ziqian Bi, Ming Liu, Benji Peng, Qian Niu, Junyu Liu, Jinlang Wang, Sen Zhang, Xuanhe Pan, et al. 2024 b . Surveying the mllm landscape: A meta-review of current surveys. arXiv preprint arXiv:2409.18991

  21. [21]

    Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori B Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2023 b . Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pages 12286--12312

  22. [22]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023 c . Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355

  23. [23]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

  24. [24]

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2024. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216--233. Springer

  25. [25]

    Jianing Qiu, Wu Yuan, and Kyle Lam. 2024. The application of multimodal large language models in medicine. The Lancet Regional Health--Western Pacific, 45

  26. [26]

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156

  27. [27]

    Sara Sarto, Marcella Cornia, and Rita Cucchiara. 2025. Image captioning evaluation in the age of multimodal llms: Challenges and future perspectives. arXiv preprint arXiv:2503.14604

  28. [28]

    Wei Suo, Lijun Zhang, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, and Yanning Zhang. 2025. Octopus: Alleviating hallucination via dynamic contrastive decoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29904--29914

  29. [29]

    Chenxi Wang, Xiang Chen, Ningyu Zhang, Bozhong Tian, Haoming Xu, Shumin Deng, and Huajun Chen. 2024 a . Mllm can see? dynamic correction decoding for hallucination mitigation. arXiv preprint arXiv:2410.11779

  30. [30]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024 b . Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191

  31. [31]

    Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. 2024 c . Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715

  32. [32]

    Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, and Tat-Seng Chua. 2025 a . Combating multimodal llm hallucination via bottom-up holistic reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 8460--8468

  33. [33]

    Yebo Wu, Jingguang Li, Zhijiang Guo, and Li Li. 2025 b . Elastic mixture of rank-wise experts for knowledge reuse in federated fine-tuning. arXiv preprint arXiv:2512.00902

  34. [34]

    Yebo Wu, Jingguang Li, Zhijiang Guo, and Li Li. 2026 a . Developmental federated tuning: A cognitive-inspired paradigm for efficient llm adaptation. In The Fourteenth International Conference on Learning Representations

  35. [35]

    Yebo Wu, Jingguang Li, Chunlin Tian, Zhijiang Guo, and Li Li. 2025 c . Memory-efficient federated fine-tuning of large language models via layer pruning. arXiv preprint arXiv:2508.17209

  36. [36]

    Yebo Wu, Jingguang Li, Chunlin Tian, Kahou Tam, Zhijiang Guo, and Li Li. 2026 b . https://arxiv.org/abs/2604.06819 Beyond end-to-end: Dynamic chain optimization for private llm adaptation on the edge . Preprint, arXiv:2604.06819

  37. [37]

    Yebo Wu, Li Li, and Cheng-zhong Xu. 2025 d . Breaking the memory wall for heterogeneous federated learning via progressive training. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, pages 1623--1632

  38. [38]

    Yebo Wu, Feng Liu, Ziwei Xie, Zhiyuan Liu, Changwang Zhang, Jun Wang, and Li Li. 2026 c . Tsembed: Unlocking task scaling in universal multimodal embeddings. arXiv preprint arXiv:2603.04772

  39. [39]

    misremember

    Naen Xu, Hengyu An, Shuo Shi, Jinghuai Zhang, Chunyi Zhou, Changjiang Li, Tianyu Du, Zhihui Fu, Jun Wang, and Shouling Ji. 2026 a . When agents" misremember" collectively: Exploring the mandela effect in llm-based multi-agent systems. arXiv preprint arXiv:2602.00428

  40. [40]

    Naen Xu, Changjiang Li, Tianyu Du, Minxi Li, Wenjie Luo, Jiacheng Liang, Yuyuan Li, Xuhong Zhang, Meng Han, Jianwei Yin, et al. 2024. Copyrightmeter: Revisiting copyright protection in text-to-image models. arXiv preprint arXiv:2411.13144

  41. [41]

    "I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

    Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li, Tianyu Du, Jun Wang, Zhihui Fu, Jinbao Li, and Shouling Ji. 2026 b . https://arxiv.org/abs/2604.05930 "i see what you did there": Can large vision-language models understand multimodal puns? Preprint, arXiv:2604.05930

  42. [42]

    Naen Xu, Jinghuai Zhang, Ping He, Chunyi Zhou, Jun Wang, Zhihui Fu, Tianyu Du, Zhaoxiang Wang, and Shouling Ji. 2026 c . Fraudshield: Knowledge graph empowered defense for llms against fraud attacks. arXiv preprint arXiv:2601.22485

  43. [43]

    Naen Xu, Jinghuai Zhang, Changjiang Li, Hengyu An, Chunyi Zhou, Jun Wang, Boyu Xu, Yuyuan Li, Tianyu Du, and Shouling Ji. 2026 d . Bridging the copyright gap: Do large vision-language models recognize and respect copyrighted content? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35949--35957

  44. [44]

    Naen Xu, Jinghuai Zhang, Changjiang Li, Zhi Chen, Chunyi Zhou, Qingming Li, Tianyu Du, and Shouling Ji. 2025. Videoeraser: Concept erasure in text-to-video diffusion models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5965--5994

  45. [45]

    Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. 2025. Look-back: Implicit visual re-focusing in mllm reasoning. arXiv preprint arXiv:2507.03019

  46. [46]

    Xiangtao Zhang, Eleftherios Kofidis, Ruituo Wu, Ce Zhu, Le Zhang, and Yipeng Liu. 2026. Coupled tensor train decomposition in federated learning. Pattern Recognition, 170:112067

  47. [47]

    Xiangtao Zhang, Sheng Li, Ao Li, Yipeng Liu, Fan Zhang, Ce Zhu, and Le Zhang. 2025 a . Subspace constraint and contribution estimation for heterogeneous federated learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 20632--20642

  48. [48]

    Xiaofeng Zhang, Yihao Quan, Chen Shen, Chaochen Gu, Xiaosong Yuan, Shaotian Yan, Jiawei Cao, Hao Cheng, Kaijie Wu, and Jieping Ye. 2025 b . Shallow focus, deep fixes: Enhancing shallow layers vision attention sinks to alleviate hallucination in lvlms. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3512--3534

  49. [49]

    Pengfei Zhao, Rongbo Luan, Wei Zhang, Peng Wu, and Sifeng He. 2025. Guiding cross-modal representations with mllm priors via preference alignment. arXiv preprint arXiv:2506.06970

  50. [50]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592

  51. [51]

    Younan Zhu, Linwei Tao, Minjing Dong, and Chang Xu. 2025. Mitigating object hallucinations in large vision-language models via attention calibration. arXiv preprint arXiv:2502.01969