pith. machine review for the scientific record.

arxiv: 2604.14363 · v1 · submitted 2026-04-15 · 💻 cs.CL · cs.AI · cs.CV

Recognition: unknown

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords centroid replacement · multimodal language models · modal competition · text centroid erasure · contrastive decoding · visual reasoning · K-means centroids · inference-time intervention

The pith

Centroid replacement shows that language representations overshadow vision in multimodal models: erasing text structure costs four times more accuracy than erasing visual structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces centroid replacement as a probe that collapses tokens to their nearest K-means centroids to isolate how much each modality contributes. Across seven models it finds that erasing text centroid structure drops accuracy four times more than erasing visual structure, even on tasks that require visual reasoning. This reveals a structural imbalance where language dominates the shared representation. The authors then turn the finding into an inference-time fix by contrastively decoding against a text-centroid-erased reference, lifting accuracy by up to 16.9 percent on individual tasks. Gains are larger for standard fine-tuned models than for preference-optimized ones, suggesting the competition is both measurable and correctable without retraining.

Core claim

Replacing every token with its nearest K-means centroid erases modal-specific structure in a controlled way; doing so for text reduces accuracy four times more than doing so for vision across architectures, exposing a universal language-over-vision imbalance. This asymmetry can be exploited at inference time through text-centroid contrastive decoding, which recovers up to 16.9 percent accuracy and yields larger gains on standard fine-tuned models than on preference-optimized ones.
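The exploitation half of the claim reduces to simple logit arithmetic. A minimal sketch, assuming the standard contrastive-decoding combination rule from the literature (the exact rule and the αcd = 1.0 default are assumptions drawn from the paper's reported settings, not its published code):

```python
import numpy as np

def contrastive_logits(logits_full, logits_text_erased, alpha_cd=1.0):
    """Combine a normal forward pass with a text-centroid-erased reference.

    alpha_cd > 0 decodes away from the erased reference (removing text
    competition); alpha_cd = 0 recovers ordinary decoding; negative values
    amplify the competition, as in the dose-response sweep.
    """
    full = np.asarray(logits_full, dtype=float)
    erased = np.asarray(logits_text_erased, dtype=float)
    return (1 + alpha_cd) * full - alpha_cd * erased
```

Greedy choice among answer options is then the argmax of the combined logits; this requires two forward passes per step, one normal and one with text tokens collapsed to centroids.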

What carries the argument

Centroid replacement: collapsing each token to its nearest K-means centroid to remove modal structure while preserving token identity and model architecture.
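A minimal sketch of how such a probe could be implemented, assuming per-modality K-means in embedding space (the appendix excerpts suggest K=256 fit independently per modality with seed 42; function names, shapes, and layer choice here are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_modal_centroids(token_embs, k=256, seed=42):
    """Fit K-means centroids on one modality's token embeddings.

    token_embs: (n_tokens, d) embeddings pooled over a fitting corpus.
    Centroids are fit separately for text and visual tokens.
    """
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(token_embs)
    return km.cluster_centers_  # (k, d)

def centroid_replace(token_embs, centroids, alpha_interp=1.0):
    """Collapse each token toward its nearest centroid.

    alpha_interp=1.0 is full replacement (erasure); intermediate values
    interpolate, matching the erasure-strength sweeps in the figures.
    """
    # Squared distance from every token to every centroid, nearest assignment.
    d2 = ((token_embs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    nearest = centroids[d2.argmin(axis=1)]
    return (1 - alpha_interp) * token_embs + alpha_interp * nearest
```

Running the model once with one modality's tokens passed through `centroid_replace` and once untouched, then comparing task accuracy, yields the per-modality "centroid replacement cost" the figures report.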

If this is right

  • The language-vision imbalance appears across three architecture families and seven models.
  • Text-centroid contrastive decoding improves accuracy by up to 16.9 percent on individual tasks without any retraining.
  • Standard fine-tuned models gain more from the intervention (+5.6 percent on average) than preference-optimized models (+1.5 percent on average).
  • Modal competition is localized enough to be diagnosed and mitigated at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same centroid probe could be applied to other input modalities such as audio to test whether language similarly overshadows them.
  • Training objectives could be redesigned to penalize the observed language dominance before inference-time fixes are needed.
  • The magnitude of the asymmetry might serve as a running diagnostic to compare how different multimodal training regimes balance modalities.

Load-bearing premise

Replacing tokens with their nearest K-means centroids removes only the intended modal dependence without introducing other uncontrolled changes to the model's computation or representation space.
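One direct way to stress this premise is to measure how far the collapse actually moves tokens in each modality; if mean displacement differs greatly between text and vision, the two erasures are not perturbations of comparable size. A hypothetical check, not something the paper reports:

```python
import numpy as np

def mean_l2_displacement(token_embs, centroids):
    """Mean L2 distance tokens move when snapped to their nearest centroid."""
    d2 = ((token_embs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    nearest = centroids[d2.argmin(axis=1)]
    return float(np.linalg.norm(token_embs - nearest, axis=1).mean())
```

Comparing this quantity for text versus visual tokens, or against a norm-matched isotropic-noise baseline, would indicate whether the accuracy asymmetry reflects modal dependence rather than unequal perturbation magnitude.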

What would settle it

A controlled test in which text-centroid erasure produces accuracy drops no larger than visual-centroid erasure, or in which contrastive decoding against the text-centroid reference fails to improve performance on visual tasks.

Figures

Figures reproduced from arXiv: 2604.14363 by Akshay Paruchuri, Ehsan Adeli, Henry Fuchs, Ishan Chatterjee, Piotr Didyk.

Figure 1
Figure 1. Figure 1: Centroid erasure exposes and exploits modal competition in multimodal language models. (a) Text centroid erasure more than doubles visual attention (4% → 10%) and flips the prediction from incorrect to correct. (b) Across seven models, replacing text tokens with their nearest centroids costs 4× more accuracy than replacing visual tokens (25.9% vs. 6.5%) on visual perception tasks, exposing a universal lan… view at source ↗
Figure 2
Figure 2. Figure 2: The text–visual asymmetry is universal. Per-task centroid replacement cost (visual in teal, text in red) across seven models. In every model, erasing text centroid structure costs substantially more accuracy than erasing visual centroid structure, on tasks that demand visual reasoning. … view at source ↗
Figure 3
Figure 3. Figure 3: Dose–response confirms task-selective causality. CD delta versus contrastive decoding strength αcd on Qwen2.5-VL-7B. Amplifying text competition (negative αcd) monotonically hurts all tasks; erasing it (positive αcd) selectively benefits TEXT-COMPETES tasks while leaving TEXT-NEEDED tasks near baseline. … view at source ↗
Figure 4
Figure 4. Figure 4: Competition is distributed across text segments. CD effect when selectively erasing centroids on different text segments (Qwen2.5-VL-7B, αinterp = 0.4, αcd = 1.0). All-text erasure produces the strongest gains on TEXT-COMPETES tasks (Forensic +7.6%, Vis. Sim. +8.9%, Art Style +6.8%). System-prompt erasure yields a surprisingly large gain on Forensic Detection (+12.1%). Options-only erasure at this dose is … view at source ↗
Figure 5
Figure 5. Figure 5: The task taxonomy is stable across model depth. Mean CD delta for TEXT-COMPETES and TEXT-NEEDED task groups across 16 layers of Qwen2.5-VL-7B. The two groups diverge consistently from L4 through L22 (shaded), confirming that the competition structure is a stable model property. … view at source ↗
Figure 6
Figure 6. Figure 6: Task examples from BLINK. One representative sample per task showing the input image(s), prompt with four answer choices, and the correct answer. Top row: TEXT-COMPETES tasks (Forensic Detection, Visual Similarity, Art Style) where text centroid CD improves accuracy. Bottom row: TEXT-NEEDED tasks (Counting, Relative Depth, Spatial Relation) where text structure is required for the task. view at source ↗
Figure 7
Figure 7. Figure 7: The complete centroid interpolation sweep (αinterp from 0.0 to 0.8) for all seven models under the Phase 2 protocol (N=2,000 COCO images, K=256, αcd=1.0). Each panel displays per-task CD delta as a function of erasure strength, with the fixed reporting protocol (αinterp=0.4) marked by a dashed vertical line. SFT and DPO models exhibit clear task-selective peaks: Forensic Detection and Visual Similari… view at source ↗
Figure 8
Figure 8. Figure 8: Per-task layer sweep (Qwen2.5-VL-7B). Individual task CD deltas across 16 layers, complementing the group means. view at source ↗
Figure 9
Figure 9. Figure 9: Dose–response for real text centroid CD vs. three null controls, options-only erasure, Qwen2.5-VL-7B. Left: TEXT-COMPETES mean across Forensic Detection, Visual Similarity, and Art Style. Right: mean across all 6 BLINK tasks. Real CD (blue) traces an inverted-U peaking at αinterp=0.3; all three controls remain near zero or flat-negative across αinterp ∈ {0.2, 0.3, 0.4, 0.6}, confirming that the gains requi… view at source ↗
read the original abstract

Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces centroid replacement—collapsing each token to its nearest K-means centroid—as a controlled probe for modal dependence in multimodal language models. Across seven models from three architecture families, it finds that erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, revealing a universal language-over-vision imbalance even on visual-reasoning tasks. The authors exploit this asymmetry via text centroid contrastive decoding, recovering up to +16.9% accuracy, with larger gains for standard fine-tuned models (+5.6% avg.) than preference-optimized ones (+1.5% avg.).

Significance. If the central empirical claims hold after controls, the work offers a practical, inference-time diagnostic and correction for modal competition in multimodal LLMs, without retraining. The cross-model consistency (seven models, three families) and the contrastive decoding intervention provide a falsifiable, quantifiable signal that could guide future training objectives. It strengthens the case that language dominance is structurally localized and correctable at inference.

major comments (2)
  1. [§3 (Experimental Setup)] The claim that centroid replacement isolates modal dependence rests on the assumption that the perturbation is comparable across modalities. Because language and visual tokens typically occupy differently scaled regions, collapsing to nearest centroids necessarily alters second-moment statistics (variance, norms) unequally; without an ablation that matches mean L2 displacement or applies modality-scaled isotropic noise, the reported 4× accuracy gap could arise from differential sensitivity to compression rather than intrinsic language-over-vision bias.
  2. [Results] The abstract and main claims assert consistent directional effects with precise quantities (4× cost, +16.9% gain) across tasks and models, yet report no error bars, statistical tests, or run-to-run variance. This leaves the central quantitative assertions under-supported and weakens the universality conclusion.
minor comments (2)
  1. Specify whether K-means centroids are fit jointly across modalities or separately, and report sensitivity of the 4× ratio to the choice of K.
  2. Clarify the exact visual-reasoning tasks and confirm that performance drops are not driven by overall token-count reduction rather than modal erasure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the robustness of our experimental design and the strength of our quantitative claims. We respond point by point below, providing substantive clarifications and indicating where we will revise the manuscript to address the concerns.

read point-by-point responses
  1. Referee: [§3 (Experimental Setup)] The claim that centroid replacement isolates modal dependence rests on the assumption that the perturbation is comparable across modalities. Because language and visual tokens typically occupy differently scaled regions, collapsing to nearest centroids necessarily alters second-moment statistics (variance, norms) unequally; without an ablation that matches mean L2 displacement or applies modality-scaled isotropic noise, the reported 4× accuracy gap could arise from differential sensitivity to compression rather than intrinsic language-over-vision bias.

    Authors: We agree that language and visual embeddings can occupy regions with different scales, which might lead to unequal effects from centroid collapse on second-moment statistics. Our method applies K-means independently to text tokens and visual tokens within each model's embedding space, with the number of centroids chosen to capture the dominant structure in each modality. The 4× disparity is replicated across seven models from three distinct architecture families, which would be unlikely if the gap were solely an artifact of differential compression sensitivity. To strengthen the isolation claim, we will add an ablation in the revised manuscript that normalizes both modalities to unit variance prior to clustering and reports the resulting accuracy drops, enabling a direct comparison of perturbation magnitude. revision: yes

  2. Referee: The abstract and main claims assert consistent directional effects with precise quantities (4× cost, +16.9% gain) across tasks and models, yet report no error bars, statistical tests, or run-to-run variance. This leaves the central quantitative assertions under-supported and weakens the universality conclusion.

    Authors: We concur that explicit reporting of variance and statistical tests would better support the reported quantities and the universality conclusion. The main results use fixed random seeds for K-means and decoding to ensure exact reproducibility, but K-means initialization can introduce variability. In the revised manuscript we will rerun the centroid-erasure experiments with multiple seeds (reporting means and standard deviations), and we will include error bars on the contrastive-decoding gains. We will also add paired statistical tests comparing baseline versus contrastive-decoding accuracies to substantiate the +16.9% peak and the average differences between fine-tuned and preference-optimized models. revision: yes
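The paired test the rebuttal promises could take the form of McNemar's test on per-question correctness under baseline versus contrastive decoding, since both methods are scored on the same items. A stdlib-only sketch, illustrative rather than the authors' actual analysis code:

```python
from math import comb

def mcnemar_p(base_correct, cd_correct):
    """Exact two-sided McNemar test on paired per-question correctness.

    base_correct / cd_correct: parallel lists of booleans, one entry per
    evaluated question. Only discordant pairs carry evidence.
    """
    b = sum(1 for x, y in zip(base_correct, cd_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, cd_correct) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Two-sided exact binomial tail at p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With greedy decoding the predictions are deterministic, so this captures finite-sample uncertainty; seed-to-seed K-means variability would still need the multi-seed reruns the authors propose.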

Circularity Check

0 steps flagged

No circularity: purely empirical probe with direct measurements

full rationale

The paper introduces centroid replacement as an experimental intervention and reports accuracy deltas from applying it to text vs. visual tokens across seven models. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential steps appear in the abstract or described method. Claims rest on controlled empirical comparisons rather than any reduction to inputs by construction. Self-citations (if present) are not invoked as load-bearing uniqueness theorems or ansatzes. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The probe depends on K-means clustering whose cluster count K is a free hyperparameter; no other free parameters, axioms, or invented entities are introduced in the abstract.

free parameters (1)
  • K (number of K-means centroids)
    The granularity of the collapsed representation is controlled by the choice of K, which is not uniquely determined by the data and must be selected.
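The granularity K controls can be inspected directly through the quantization distortion (mean squared token displacement) left at each K. An illustrative sketch of such a sweep, not the paper's own N×K sensitivity grid (which reports a flat landscape over K ∈ {128, …, 2048}):

```python
import numpy as np
from sklearn.cluster import KMeans

def distortion_by_k(token_embs, ks=(4, 16, 64)):
    """Mean squared token displacement under K-means collapse, per K.

    Larger K preserves more within-modality structure (lower distortion),
    so the probe erases less; smaller K erases more.
    """
    out = {}
    for k in ks:
        km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(token_embs)
        out[k] = float(km.inertia_ / len(token_embs))
    return out
```

Plotting the replacement cost against this distortion curve would separate the effect of K itself from the modal asymmetry the probe is meant to measure.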

pith-pipeline@v0.9.0 · 5488 in / 1035 out tokens · 61097 ms · 2026-05-10T13:21:05.184907+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 21 canonical work pages · 5 internal anchors

  1. [1]

    Mirage: The Illusion of Visual Understanding

    Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage: The illusion of visual understanding. arXiv preprint arXiv:2603.21687, 2026.

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.

  3. [3]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pp. 19–35. Springer, 2024a. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Du...

  4. [4]

    Hidden in Plain Sight: VLMs Overlook Their Visual Representations

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pp. 148–166. Springer. URL: https://arxiv.org/abs/2506.08008.

  5. [5]

    The Anatomy of a Personal Health Agent

    A. Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, et al. The anatomy of a personal health agent. arXiv preprint arXiv:2508.20148, 2025.

  6. [6]

    The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987.

  7. [7]

    Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion

    QingYuan Jiang, Longfei Huang, and Yang Yang. Rethinking multimodal learning from the perspective of mitigating classification ability disproportion. arXiv preprint arXiv:2502.20120.

  8. [8]

    See What You Are Told: Visual Attention Sink in Large Multimodal Models

    Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321, 2025.

  9. [9]

    Shared Global and Local Geometry of Language Model Embeddings

    Andrew Lee, Melanie Weber, Fernanda Viégas, and Martin Wattenberg. Shared global and local geometry of language model embeddings. arXiv preprint arXiv:2503.21073.

  10. [10]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.

  11. [11]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305.

  12. [12]

    To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

    Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, and Leonid Sigal. To sink or not to sink: Visual information pathways in large vision-language models. arXiv preprint arXiv:2510.08510.

  13. [13]

    Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

    Avshalom Manevich and Reut Tsarfaty. Mitigating hallucinations in large vision-language models (LVLMs) via language-contrastive decoding (LCD). In Findings of the Association for Computational Linguistics: ACL 2024, pp. 6008–6022.

  14. [14]

    Interpreting the Linear Structure of Vision-Language Model Embedding Spaces

    Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, and Stephanie Gil. Interpreting the linear structure of vision-language model embedding spaces. arXiv preprint arXiv:2504.11695, 2025.

  15. [15]

    The Geometry of Categorical and Hierarchical Concepts in Large Language Models

    Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. The geometry of categorical and hierarchical concepts in large language models. arXiv preprint arXiv:2406.01506.

  16. [16]

    Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

    Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Beyond semantics: Rediscovering spatial awareness in vision-language models. arXiv preprint arXiv:2503.17349, 2025.

  17. [17]

    VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

    Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, and Yue Dong. VLMs need words: Vision language models ignore visual detail in favor of semantic anchors. arXiv preprint arXiv:2604.02486.

  18. [18]

    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems, 37:87310–87356, 2024a. Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and ...

  19. [19]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

  20. [20]

    Mitigating Modal Imbalance in Multimodal Reasoning

    Chen Henry Wu, Neil Kale, and Aditi Raghunathan. Mitigating modal imbalance in multimodal reasoning. arXiv preprint arXiv:2510.02608.

  21. [21]

    SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models

    Yuxuan Xia, Siheng Wang, and Peng Li. SDCD: Structure-disrupted contrastive decoding for mitigating hallucinations in large vision-language models. arXiv preprint arXiv:2601.03500.

  22. [22]

    The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

    Hao Yin, Guangzong Si, and Zilei Wang. The mirage of performance gains: Why contrastive decoding fails to mitigate object hallucinations in MLLMs? arXiv preprint arXiv:2504.10020.

  23. [23]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479.

  24–31. [24]–[31]

    Internal anchors: fragments of the reviewed paper itself (appendix evaluation protocol, model and seed details, N×K scaling grid, seed-sensitivity and confidence-interval tables) captured by the reference extractor rather than external works.