CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models
Pith reviewed 2026-05-20 12:16 UTC · model grok-4.3
The pith
Vision-language models count objects using learned priors over the actual visual evidence when the two conflict.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that recent VLMs maintain high accuracy on factual counting images yet show consistent drops when count-relevant attributes are edited to produce contradictory visual evidence. Localized evidence annotations demonstrate that these errors arise from under-attention to the precise visual tokens that determine the count, rather than from missing or ambiguous evidence. A unified inference-time attention modulation method that reweights the selected tokens improves counterfactual counting performance by up to 8 percent across tested models.
What carries the argument
CounterCount paired factual-counterfactual image set with localized evidence annotations that mark count-relevant visual tokens, plus an inference-time attention reweighting operation that increases the weight on those tokens.
If this is right
- Future VLMs will require explicit mechanisms to override object-level priors when visual evidence is presented.
- Diagnostic datasets built on minimal attribute edits can expose similar grounding failures in other reasoning tasks.
- Attention modulation at inference time offers a lightweight way to improve visual grounding without retraining.
- Counting performance on real images may still be inflated by prior knowledge unless counterfactual checks are applied.
Where Pith is reading between the lines
- The same prior-over-evidence pattern is likely to appear in other attribute judgments such as size or color when those conflict with canonical expectations.
- Training objectives that explicitly penalize mismatches between attention maps and human-provided evidence annotations could reduce the need for post-hoc fixes.
- Applications that depend on accurate object counts in unusual scenes, such as inventory or safety monitoring, would benefit from running both factual and counterfactual versions of each query.
Load-bearing premise
The counterfactual edits change only the count-relevant attributes and leave all other scene properties unchanged, and the provided annotations correctly identify the visual tokens that should drive the count.
What would settle it
Models would achieve nearly identical accuracy on both factual and counterfactual versions of the same scenes, or the attention reweighting step would produce no measurable gain in counterfactual accuracy.
Figures
read the original abstract
Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CounterCount, a diagnostic framework consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. It evaluates recent VLMs and reports strong performance on factual images but consistent degradation on counterfactual versions, which the authors attribute to reliance on object-level priors and underweighting of count-relevant visual tokens in attention. An inference-time attention modulation strategy is proposed that improves counterfactual counting accuracy by up to 8%.
Significance. If the counterfactual edits prove free of non-count artifacts, the framework supplies a precise testbed for visual grounding failures in VLMs and a lightweight mitigation technique. The localized annotations help separate missing evidence from attention misallocation, which could inform future model design.
major comments (1)
- [Methods section on counterfactual image construction and annotation] The central claim that performance degradation on counterfactual images demonstrates prior-driven underweighting of visual evidence (rather than edit artifacts) requires that edits alter only count-relevant attributes while exactly preserving all other scene properties. The manuscript describes 'edited count-relevant attributes' with 'verified answers' and 'localized evidence annotations' but does not report quantitative edit-fidelity metrics such as human similarity ratings on non-target attributes or automated checks for introduced inconsistencies (lighting, shadows, spatial relations). This validation is load-bearing for the interpretation of results.
minor comments (1)
- Specify the exact number of image pairs, the full list of evaluated VLMs, and the precise conditions under which the 8% gain is measured so that the improvement claim can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to incorporate additional validation.
read point-by-point responses
-
Referee: [Methods section on counterfactual image construction and annotation] The central claim that performance degradation on counterfactual images demonstrates prior-driven underweighting of visual evidence (rather than edit artifacts) requires that edits alter only count-relevant attributes while exactly preserving all other scene properties. The manuscript describes 'edited count-relevant attributes' with 'verified answers' and 'localized evidence annotations' but does not report quantitative edit-fidelity metrics such as human similarity ratings on non-target attributes or automated checks for introduced inconsistencies (lighting, shadows, spatial relations). This validation is load-bearing for the interpretation of results.
Authors: We agree that demonstrating the fidelity of the counterfactual edits—specifically that only count-relevant attributes are altered while preserving all other scene properties—is critical for attributing performance degradation to object priors rather than edit artifacts. The current manuscript uses verified answers and localized evidence annotations to support the edits, but we acknowledge that quantitative metrics would provide stronger, more explicit evidence. In the revised manuscript, we will add a new subsection to the Methods section reporting edit-fidelity evaluation. This will include human similarity ratings on non-target attributes (e.g., lighting, shadows, spatial relations, and background elements) collected from multiple annotators with inter-rater agreement statistics, as well as automated consistency checks using perceptual similarity metrics. These additions will directly address the load-bearing concern and reinforce the validity of our conclusions. revision: yes
Circularity Check
No significant circularity; empirical diagnostic framework on new image pairs
full rationale
The paper constructs a new dataset of factual/counterfactual image pairs with verified answers and localized evidence annotations, then reports empirical VLM performance degradation and an attention-reweighting intervention. No derivation reduces by construction to fitted parameters, self-defined quantities, or load-bearing self-citations. The central claim (degradation indicates prior reliance) follows directly from the experimental contrast between factual and edited images rather than from any internal fit or renaming. The attention modulation is presented as an external inference-time strategy, not a tautological re-expression of the annotations themselves. This is a standard empirical diagnostic setup with independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Counterfactual images can be generated by editing only count-relevant attributes while leaving all other visual properties unchanged and verifiable.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8%
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. a. Bi ´nkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual langu...
work page 2022
-
[2]
S. Alghisi, G. Roccabruna, M. Rizzoli, S. M. Mousavi, and G. Riccardi. [de|re]constructing vlms’ reasoning in counting, 2025
work page 2025
- [3]
-
[4]
M. Atabuzzaman, A. Asgarov, and C. Thomas. Benchmarking and mitigating mcqa selection bias of large vision-language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2025
work page 2025
-
[5]
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023
work page 2023
-
[6]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...
work page 2025
-
[7]
D. I. Campbell, S. Rane, T. Giallanza, C. N. D. Sabbata, K. Ghods, A. Joshi, A. Ku, S. M. Fran- kland, T. L. Griffiths, J. D. Cohen, and T. W. Webb. Understanding the limits of vision language models through the lens of the binding problem. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[8]
T. T. Chan, V . Suresh, A. Saha, M. Hahn, and V . Demberg. System-mediated attention imbal- ances make vision-language models say yes, 2026
work page 2026
-
[9]
L. Che, Z. Xue, Y . Quan, B. Liu, Z. Shi, M. Hurst, J. Feldman, R. Tang, R. Krishna, and V . Pavlovic. Counting circuits: Mechanistic interpretability of visual reasoning in large vision- language models, 2026
work page 2026
-
[10]
A. Deng, T. Cao, Z. Chen, and B. Hooi. Words or vision: Do vision-language models have blind faith in text? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3867–3876, June 2025
work page 2025
-
[11]
A. Favero, L. Zancato, M. Trager, S. Choudhary, P. Perera, A. Achille, A. Swaminathan, and S. Soatto. Multi-modal hallucination control by visual information grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14303–14312, June 2024. 10
work page 2024
-
[12]
S. Fu, T. Bonnen, D. Guillory, and T. Darrell. Hidden in plain sight: Vlms overlook their visual representations, 2025
work page 2025
-
[13]
I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed. Bias and fairness in large language models: A survey.Computational Linguistics, 50(3):1097–1179, Sept. 2024
work page 2024
-
[14]
M. Golovanevsky, W. Rudman, M. A. Lepori, A. Bar, R. Singh, and C. Eickhoff. Pixels versus priors: Controlling knowledge priors in vision-language models through visual counterfacts. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24837–2...
work page 2025
-
[15]
S. M. Hall, F. G. Abrantes, H. Zhu, G. Sodunke, A. Shtedritski, and H. R. Kirk. Visogender: A dataset for benchmarking gender bias in image-text pronoun resolution. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023
work page 2023
-
[16]
Understanding Counting Mechanisms in Large Language and Vision-Language Models
H. Hasani, A. Izadi, F. Askari, M. Bagherian, S. Mohammadian, M. Izadi, and M. S. Baghshah. Understanding counting mechanisms in large language and vision-language models.arXiv preprint arXiv:2511.17699, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
P. Howard, A. Madasu, T. Le, G. L. Moreno, A. Bhiwandiwalla, and V . Lal. Socialcounter- factuals: Probing and mitigating intersectional social biases in vision-language models with counterfactual examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11975–11985, June 2024
work page 2024
-
[18]
D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
work page 2019
-
[19]
M. I. Ismithdeen, M. U. Khattak, and S. Khan. Promptception: How sensitive are large multimodal models to prompts? In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23950–23985, Suzhou, China, Nov. 2025. Association for Computational Linguistics
work page 2025
-
[20]
Z. Jiang, J. Chen, B. Zhu, T. Luo, Y . Shen, and X. Yang. Devils in middle layers of large vision- language models: Interpreting, detecting and mitigating object hallucinations via attention lens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25004–25014, June 2025
work page 2025
-
[21]
A. Kirillov, E. Mintun, N. Ravi, H. Mao, and et al. Segment anything. InICCV, 2023
work page 2023
-
[22]
K.-i. Lee, M. Kim, S. Yoon, M. Kim, D. Lee, H. Koh, and K. Jung. VLind-bench: Measuring language priors in large vision-language models. In L. Chiruzzo, A. Ritter, and L. Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 4129–4144, Albuquerque, New Mexico, Apr. 2025. Association for Computational Linguistics
work page 2025
-
[23]
J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages ...
work page 2022
-
[24]
H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023
work page 2023
-
[25]
S. Liu, K. Zheng, and W. Chen. Paying more attention to image: A training-free method for alleviating hallucination in lvlms, 2024
work page 2024
-
[26]
Y . Liu, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin. Mmbench: Is your multi-modal model an all-around player? InThe European Conference on Computer Vision (ECCV), 2024. 11
work page 2024
-
[27]
Z. Liu, Z. Chen, H. Liu, C. Luo, X. Tang, S. Wang, J. Zeng, Z. Dai, Z. Shi, T. Wei, B. Dumoulin, and H. Tong. Seeing but not believing: Probing the disconnect between visual attention and answer correctness in vlms, 2025
work page 2025
-
[28]
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[29]
T. Luo, A. Cao, G. Lee, J. Johnson, and H. Lee. Probing visual language priors in VLMs. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 41120–41156. PMLR, 13–19 Jul 2025
work page 2025
-
[30]
T. Naous, M. J. Ryan, A. Ritter, and W. Xu. Having beer after prayer? measuring cultural bias in large language models. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16366–16393, Bangkok, Thailand, Aug. 2024. Association for Computat...
work page 2024
-
[31]
F. Ortu, Z. Jin, D. Doimo, and A. Cazzaniga. When seeing overrides knowing: Disentangling knowledge conflicts in vision-language models. InMechanistic Interpretability Workshop at NeurIPS 2025, 2025
work page 2025
- [32]
-
[33]
A. Parrish, A. Chen, N. Nangia, V . Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. Bowman. BBQ: A hand-built bias benchmark for question answering. In S. Muresan, P. Nakov, and A. Villavicencio, editors,Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105, Dublin, Ireland, May 2022. Association for Computational Linguistics
work page 2022
-
[34]
M. F. Qharabagh, M. Ghofrani, and K. Fountoulakis. LVLM-COUNT: Enhancing the counting ability of large vision-language models, 2025
work page 2025
-
[35]
P. Rahmanzadehgervi, L. Bolton, M. R. Taesiri, and A. T. Nguyen. Vision language models are blind. InProceedings of the Asian Conference on Computer Vision (ACCV), pages 18–34, December 2024
work page 2024
-
[36]
C. Raj, A. Mukherjee, A. Caliskan, A. Anastasopoulos, and Z. Zhu. BiasDora: Exploring hidden biased associations in vision-language models. In Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10439–10455, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics
work page 2024
-
[37]
W. Rudman, M. Golovanevsky, A. Bar, V . Palit, Y . LeCun, C. Eickhoff, and R. Singh. For- gotten polygons: Multimodal large language models are shape-blind. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Findings of the Association for Computational Linguis- tics: ACL 2025, pages 11983–11998, Vienna, Austria, July 2025. Association for Com...
work page 2025
-
[38]
S. Sengupta, N. Moradinasab, J. Liu, and D. E. Brown. Can vision-language models count? a synthetic benchmark and analysis of attention-based interventions.arXiv preprint arXiv:2511.17722, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng. The woman worked as a babysitter: On biases in language generation. In K. Inui, J. Jiang, V . Ng, and X. Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3...
work page 2019
-
[40]
J. Shin, H. Song, H. Lee, S. Jeong, and J. Park. Ask LLMs directly, “what shapes your bias?”: Measuring social bias in large language models. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 16122– 16143, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics
work page 2024
- [41]
-
[42]
G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J. bastien Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y . Gao, B. Mustafa,...
work page 2025
-
[43]
S. Tong, Z. Liu, Y . Zhai, Y . Ma, Y . LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9568–9578, June 2024
work page 2024
-
[44]
C. Tu, P. Ye, D. Zhou, L. Bai, G. Yu, T. Chen, and W. Ouyang. Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms.Int. J. Comput. Vision, 134(1), Dec. 2025
work page 2025
-
[45]
T. Ullman. The illusion-illusion: Vision language models see illusions where there are none, 2024
work page 2024
- [46]
- [47]
-
[48]
A. V o, M. R. Taesiri, D. Kim, and A. T. Nguyen. B-score: Detecting biases in large language models using response history. InForty-second International Conference on Machine Learning, 2025
work page 2025
-
[49]
W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, Z. Wang, Z. Chen, H. Zhang, G. Yang, H. Wang, Q. Wei, J. Yin, W. Li, E. Cui, G. Chen, Z. Ding, C. Tian, 13 Z. Wu, J. Xie, Z. Li, B. Yang, Y . Duan, X. Wang, Z. Hou, H. Hao, T. Zhang, S. Li, X. Zhao, H. Duan, N. Deng, B. Fu, Y . He, Y . Wang, C. He, B. Shi, J. He, Y . Xiong, H....
work page 2025
-
[50]
W. Wang, W. Jiao, J. Huang, R. Dai, J.-t. Huang, Z. Tu, and M. Lyu. Not all countries celebrate thanksgiving: On the cultural dominance in large language models. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6349–6384, Bangkok, Thail...
work page 2024
-
[51]
Z. Wang, B. Xu, Y . Xia, and P. Li. Vegas: Mitigating hallucinations in large vision-language models via vision-encoder attention guided adaptive steering, 2025
work page 2025
-
[52]
Z. Wu, Y . Niu, H. Gao, M. Lin, Z. Zhang, Z. Zhang, Q. Shi, Y . Wang, S. Fu, J. Xu, J. Ao, E. Dai, L. Feng, X. Zhang, and S. Wang. Lanp: Rethinking the impact of language priors in large vision-language models, 2025
work page 2025
-
[53]
T. Yang, Z. Li, J. Cao, and C. Xu. Understanding and mitigating hallucination in large vision- language models via modular attribution and intervention. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations, volume 2025, pages 51546–51568, 2025
work page 2025
-
[54]
T. Yang, L. Zhang, J. Lin, G. Hu, D. Wang, and L. Hu. Tracing and mitigating hallucinations in multimodal llms via dynamic attention localization, 2025
work page 2025
-
[55]
Y . Yang, C. P. Lee, S. Feng, D. Zhao, B. Wen, A. Z. Liu, Y . Tsvetkov, and B. Howe. Escaping the spuriverse: Can large vision-language models generalize beyond seen spurious correlations? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025
work page 2025
-
[56]
L. Yu, Z. Chen, P. Kuang, Z. Feng, F. Zhou, L. Wang, and G. Dobbie. Causally-grounded dual-path attention intervention for object hallucination mitigation in lvlms, 2025
work page 2025
-
[57]
W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: evaluating large multimodal models for integrated capabilities. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024
work page 2024
-
[58]
X. Yue, Y . Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y . Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y . Liu, W. Huang, H. Sun, Y . Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF Conference on Computer Vision...
work page 2024
-
[59]
G. Zeno, N. Jedidi, and S. R. Gomez. Choosing ‘right’ from wrong: A closer look at selection bias in spatial multiple-choice questions in large multimodal models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 535–544, 2025
work page 2025
- [60]
-
[61]
H. Zhao, S. Si, L. Chen, Y . Zhang, M. Sun, B. Chang, and M. Zhang. Looking beyond text: Reducing language bias in large vision-language models via multimodal dual-attention and soft-image guidance. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processi...
work page 2025
-
[62]
J. Zhao, F. Zhang, X. Sun, C. Feng, and Z. Tan. Tell model where to look: Mitigating hallucinations in mllms by vision-guided attention, 2025. 14 Contents 1 Introduction 1 2 Related Work 2 2.1 Language Priors and Bias in VLMs . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2 Counterfactual Images for Diagnosing Bias in VLMs . . . . . . . . . . . ....
work page 2025
-
[63]
parrot”) with its category label (e.g., “bird
in Figure 11. Attention maps are computed by averaging over the late 50% of transformer layers following [27], with 95th-percentile contrast enhancement applied prior to normalization for visualization clarity. As observed with Qwen3-VL-8B [6] in the main paper, attention is concentrated within the relevant regions in both failure cases, suggesting that l...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.