Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

Antoine Bosselut; Mrinmaya Sachan; Shaobo Cui; Yifan Hou; Yijie Tong

arxiv: 2605.30713 · v1 · pith:REN7Y46Snew · submitted 2026-05-29 · 💻 cs.LG · cs.CV · cs.MM

Pith reviewed 2026-06-28 23:39 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.MM

keywords test-time compute · vision-language models · ensemble methods · predictive entropy · majority voting · model diversity · confidence-based selection

The pith

Entropy-based selection from model ensembles outperforms majority voting and single best models for vision-language tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that feature-based scoring fails and majority voting yields only modest gains for VLMs because their outputs lack diversity and tend to correlate. It introduces Entropy-based TTC (ETTC), which picks the prediction with the lowest predictive entropy, reducing to voting for one model but favoring higher-confidence models in groups. Experiments across seven VLMs and six benchmarks show ETTC consistently beats both voting and the strongest individual model, with smaller models providing synergistic lifts. A sympathetic reader would care because the approach turns existing model variety into a practical performance advantage at inference time without extra training.

Core claim

The central claim is that limited prediction diversity restricts single-model TTC methods, whereas multi-model ensembles combined with ETTC, which selects the lowest-entropy output, leverage capability differences to exceed both voting and the best single model. The method is proved superior to voting under mild assumptions on predictions and entropies, and smaller models are shown to enhance larger ones in this setting.

What carries the argument

Entropy-based Test-Time Compute (ETTC), the mechanism that selects the prediction with lowest predictive entropy across an ensemble.
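
The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each model's predictive distribution is the empirical distribution of its sampled answers and that entropy is plain Shannon entropy in nats. The function names, model names, and sample data are hypothetical.

```python
import math
from collections import Counter

def predictive_entropy(samples):
    """Shannon entropy (nats) of one model's empirical answer distribution."""
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in Counter(samples).values())

def ettc_select(samples_by_model):
    """Return (model, answer): the modal answer of the lowest-entropy model."""
    best = min(samples_by_model, key=lambda m: predictive_entropy(samples_by_model[m]))
    answer = Counter(samples_by_model[best]).most_common(1)[0][0]
    return best, answer

def pooled_majority(samples_by_model):
    """Baseline: majority vote over all samples from all models."""
    pooled = [a for samples in samples_by_model.values() for a in samples]
    return Counter(pooled).most_common(1)[0][0]

# Hypothetical sampled answers for one multiple-choice question (illustrative only).
samples = {
    "vlm_a": ["A", "A", "A", "C", "A", "D", "A", "C"],  # leans A, but noisy
    "vlm_b": ["A", "B", "A", "A", "C", "A", "D", "A"],  # leans A, but noisy
    "vlm_c": ["B"] * 8,                                  # unanimous, entropy 0
}

print("pooled majority:", pooled_majority(samples))
print("ETTC selection: ", ettc_select(samples))
```

With a single model in `samples_by_model`, `ettc_select` returns that model's modal answer, which is the single-model majority vote. The toy data only shows that the two rules can disagree; it says nothing about which answer is correct.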

If this is right

ETTC reduces exactly to majority voting when applied to a single model.
In ensembles ETTC uses entropy disparities to prioritize stronger models over weaker ones.
Smaller models produce measurable synergistic gains when combined with larger models under ETTC.
The gains appear consistently across the seven VLMs and six benchmarks examined.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may transfer to language-only or other multimodal settings whenever prediction diversity exists.
Developers could benefit from releasing model families of mixed sizes rather than scaling a single architecture.
Pairing ETTC with other lightweight TTC heuristics might compound gains if the signals are complementary.
Expanding the evaluation to out-of-distribution tasks would test whether the mild assumptions remain valid.

Load-bearing premise

The theoretical superiority of ETTC over voting rests on unspecified mild assumptions about model predictions and entropies holding in practice.

What would settle it

A controlled test on new VLMs and benchmarks where ETTC shows no gain over majority voting despite measured prediction diversity would falsify the core empirical and theoretical claims.

Figures

Figures reproduced from arXiv: 2605.30713 by Antoine Bosselut, Mrinmaya Sachan, Shaobo Cui, Yifan Hou, Yijie Tong.

**Figure 1.** Comparison of test-time compute (TTC) strategies under two prompting styles. In Direct Answer (left), models are instructed to output only the final answer without reasoning; feature-based methods are inapplicable, and majority voting shows no improvement. In CoT (right), models are prompted to reason step by step. While feature-based methods yield no gains, voting offers modest but consistent improvement …

**Figure 3.** Majority voting improvement decreases with higher prediction dependency. Across models, we can find that voting improvement ∆AMV(16) is negatively correlated with both NMI and ρ, confirming theoretical predictions. both dependency metrics. Smaller models (e.g., Qwen-3B, LLaMA) inherently produce more diverse outputs and therefore reap larger benefits from voting. In contrast, larger or more heavily optimi…

**Figure 4.** Example of a direct QA prompt used for evaluating model predictions without reasoning. Feature-All. We also define a feature set that combines lexical and stylistic signals for each CoT response. Specifically, we consider four interpretable features: response length (token count), lexical diversity (unique token count), number of pivot words, and number of vague words. See Tab. 6 for detailed definitions. …

**Figure 5.** Example of a chain-of-thought (CoT) prompt used to elicit intermediate reasoning steps. This format is used when analyzing consistency or measuring correctness under step-by-step reasoning. C.2. Empirical Evidence to Support Assumption

**Figure 6.** Majority voting improvement ∆AMV(16) plotted against average pairwise normalized mutual information (NMI) for each model on each dataset. A negative trend suggests that higher prediction dependency reduces the benefit of majority voting. Findings. As shown in Tab. 8, the benefits of ETTC generalize robustly to the text domain. Across all 8 ensemble configurations, ETTC consistently outperforms majority vot…

**Figure 7.** Majority voting improvement ∆AMV(16) versus average pairwise accuracy correlation (ρ). Consistent with theory, stronger dependency (i.e., higher ρ) corresponds to smaller gains from majority voting. Problem setting. Given Q questions and M models, each model u produces a predictive distribution pqu(·) over K options for question q, aggregated over U=16 stochastic decoding samples (see § 4). The goal is to …

**Figure 8.** Correlation between normalized entropy Heu and accuracy across models on six benchmarks, supporting the Entropy–Accuracy Monotonicity assumption (Assumption 1). Training protocol. To simulate low-resource conditions, we use two-fold cross-validation across questions: each dataset is split into halves, one for training and one for testing, with roles reversed in a second run. This prevents test leakage and …
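
The captions for Figures 3, 6, and 7 refer to two dependency metrics, pairwise NMI between predictions and pairwise accuracy correlation ρ. The sketch below shows one way such quantities could be computed for a pair of prediction sequences. It rests on assumptions: NMI is normalized by the arithmetic mean of the two entropies, ρ is taken to be the Pearson correlation of 0/1 correctness indicators, and the example data is made up. The paper's exact definitions are not given on this page.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(x, y):
    """Mutual information divided by the mean of H(x) and H(y) (assumed normalization)."""
    n = len(x)
    px, py, joint = Counter(x), Counter(y), Counter(zip(x, y))
    mi = sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )
    denom = 0.5 * (entropy(x) + entropy(y))
    return mi / denom if denom > 0 else 0.0

def pearson(u, v):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

# Hypothetical predictions of two decoding runs on six questions, plus gold answers.
run_1 = ["A", "B", "B", "C", "A", "D"]
run_2 = ["A", "B", "C", "C", "A", "B"]
gold = ["A", "B", "B", "C", "D", "D"]

correct_1 = [int(p == g) for p, g in zip(run_1, gold)]
correct_2 = [int(p == g) for p, g in zip(run_2, gold)]
print("NMI:", nmi(run_1, run_2))
print("rho:", pearson(correct_1, correct_2))
```

Averaging these values over all pairs of runs would give per-model dependency scores of the kind the figures plot against voting improvement.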

Original abstract

Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods. We find that feature heuristics fail and voting yields only modest gains in single-model settings. We theoretically show that this limitation stems from a lack of prediction diversity: when outputs are highly correlated, voting provides little benefit. In contrast, multi-model ensembles offer richer diversity, yet standard majority voting fails to account for varying model capabilities. To address this, we propose Entropy-based TTC (ETTC), which selects the most confident prediction based on predictive entropy. Our method reduces to majority voting in the single-model case, but in model ensembles, it leverages confidence disparities to prioritize stronger models. We prove that ETTC outperforms majority voting under mild assumptions and empirically demonstrate that it consistently surpasses both voting and the best individual model. Crucially, our results show that smaller models can synergistically enhance larger ones, unlocking ensembling gains not achievable with standard strategies.
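
The abstract's point that correlated outputs leave voting little to gain can be illustrated with a toy calculation. This is not the paper's theorem. It assumes a binary correct/incorrect view of each sample and a simple mixture model: with probability `c` all `n` samples copy one shared draw, otherwise they are independent. The parameter values are arbitrary.

```python
from math import comb

def majority_accuracy_independent(p, n):
    """P(more than half of n independent votes are correct); n should be odd."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

def majority_accuracy_mixture(p, n, c):
    """Toy dependency model: fully shared draw with probability c, else independent."""
    return c * p + (1 - c) * majority_accuracy_independent(p, n)

p, n = 0.6, 15  # per-sample accuracy and number of samples (illustrative values)
for c in (0.0, 0.5, 0.9, 1.0):
    gain = majority_accuracy_mixture(p, n, c) - p
    print(f"c={c:.1f}  voting gain over a single sample = {gain:.3f}")
```

Under this toy model the gain shrinks linearly as `c` grows and vanishes when every sample is a copy of the same draw, which matches the qualitative claim about low diversity.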

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ETTC is a clean entropy-based selector for multi-VLM ensembles that beats voting by favoring confident predictions, but the proof's mild assumptions are unspecified and unverified in the provided details.

The main takeaway is that ETTC picks the lowest-entropy output from an ensemble of VLMs and claims this beats majority voting by exploiting confidence differences across models of varying strength. It reduces to voting when only one model is used, which keeps the method consistent.

The paper does a reasonable job showing why single-model voting adds little: predictions are too correlated, so diversity is low. It then moves to multi-model cases where diversity exists but standard voting ignores that some models are stronger. The empirical claim that smaller models can still improve larger ones is the part that could matter for practical deployment.

The soft spot is the theory. Outperformance is shown only under mild assumptions, but those assumptions are not listed and there is no sign they were checked against the seven VLMs on the six benchmarks. If the assumptions require entropy to track accuracy or capability ordering, the guarantee does not automatically transfer to the reported results. The abstract also gives no error bars, exclusion rules, or full experimental protocol, so the consistency claim is hard to evaluate.

This is for researchers focused on test-time methods for vision-language models. The combination of a simple new rule, the diversity analysis, and the multi-model results is enough to warrant a serious referee, mainly to pin down the assumptions and review the experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper studies test-time compute (TTC) for vision-language models across seven VLMs and six benchmarks. It reports that feature-based heuristics fail and majority voting yields only modest gains in single-model settings due to low prediction diversity. It proposes Entropy-based TTC (ETTC), which selects the lowest-entropy prediction and reduces to voting for single models; the method is shown to outperform voting in multi-model ensembles by exploiting capability differences. The central claims are a proof of ETTC superiority over voting under mild assumptions and consistent empirical gains, including synergistic benefits from including smaller models.

Significance. If the proof holds and the mild assumptions are satisfied by the tested VLMs, the result supplies a simple, theoretically motivated alternative to voting that can extract additional performance from heterogeneous model collections. The empirical demonstration that smaller models can improve larger ones in ensembles would be a useful practical finding for multimodal systems.

major comments (2)

[Abstract / theoretical analysis] Abstract and theoretical section: the claim that ETTC 'outperforms majority voting under mild assumptions' is load-bearing for the central theoretical contribution, yet the assumptions (e.g., any required correlation between predictive entropy and model accuracy or ordering of model capabilities) are never enumerated. Without an explicit theorem statement listing them, it is impossible to verify whether the seven VLMs satisfy the conditions on the six benchmarks.
[Experiments] Empirical evaluation section: the manuscript asserts that ETTC 'consistently surpasses both voting and the best individual model' and that smaller models 'synergistically enhance larger ones,' but provides no direct check (e.g., a table or figure) confirming that the mild assumptions hold for the evaluated models and tasks. If the assumptions fail, the observed gains cannot be attributed to the ETTC construction rather than coincidence.

minor comments (2)

[Experiments] The experimental protocol should report error bars, number of runs, and any data-exclusion rules to support the 'consistent' superiority claim.
[Method] The precise definition of predictive entropy (including temperature or normalization choices) should be given as an equation before the ETTC algorithm is introduced.
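
For orientation, one plausible form of the definition requested in the last minor comment is written out below. It is an assumption, not the paper's equation: it reuses the symbols from the Figure 7 excerpt (question q, model u, K options, U stochastic samples), introduces hypothetical notation y_{qu}^{(i)} for the i-th sampled answer, uses the convention 0 log 0 = 0, and normalizes by log K. Temperature and any other choices would still need to be stated by the authors.

```latex
\hat{p}_{qu}(k) = \frac{1}{U}\sum_{i=1}^{U} \mathbf{1}\!\left[\, y_{qu}^{(i)} = k \,\right],
\qquad
H_{qu} = -\sum_{k=1}^{K} \hat{p}_{qu}(k)\,\log \hat{p}_{qu}(k),
\qquad
\tilde{H}_{qu} = \frac{H_{qu}}{\log K} \in [0, 1]

u^{*}(q) = \arg\min_{u}\, \tilde{H}_{qu},
\qquad
\hat{y}_{q} = \arg\max_{k}\, \hat{p}_{q u^{*}(q)}(k)
```

With a single model the minimization over u is trivial, so the second line reduces to taking the most frequent sampled answer, consistent with the stated reduction to majority voting.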

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on the theoretical assumptions and their empirical verification.

Referee: [Abstract / theoretical analysis] Abstract and theoretical section: the claim that ETTC 'outperforms majority voting under mild assumptions' is load-bearing for the central theoretical contribution, yet the assumptions (e.g., any required correlation between predictive entropy and model accuracy or ordering of model capabilities) are never enumerated. Without an explicit theorem statement listing them, it is impossible to verify whether the seven VLMs satisfy the conditions on the six benchmarks.

Authors: We agree that an explicit enumeration of the assumptions is necessary for verifiability. The revised manuscript will include a dedicated theorem statement that lists all mild assumptions, including the correlation between predictive entropy and model accuracy as well as the ordering of model capabilities. This will allow readers to check whether the seven VLMs satisfy the conditions on the six benchmarks. revision: yes
Referee: [Experiments] Empirical evaluation section: the manuscript asserts that ETTC 'consistently surpasses both voting and the best individual model' and that smaller models 'synergistically enhance larger ones,' but provides no direct check (e.g., a table or figure) confirming that the mild assumptions hold for the evaluated models and tasks. If the assumptions fail, the observed gains cannot be attributed to the ETTC construction rather than coincidence.

Authors: We concur that a direct empirical check of the assumptions would strengthen the link between theory and results. The revised version will add a table (or figure) reporting the relevant statistics—such as entropy-accuracy correlations and capability orderings—for all seven VLMs across the six benchmarks. This will confirm that the assumptions hold and support attribution of the gains to the ETTC method. revision: yes
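
A check of the kind promised in this response could look like the following sketch, which computes a Spearman rank correlation between per-model mean entropy and accuracy on one benchmark. The numbers are hypothetical placeholders, not results from the paper, and the choice of Spearman correlation is an assumption about what "entropy-accuracy correlation" would mean in practice.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-model statistics on one benchmark (placeholders, not paper data).
mean_normalized_entropy = [0.62, 0.48, 0.35, 0.21]
accuracy = [0.51, 0.58, 0.66, 0.74]

rho = spearman(mean_normalized_entropy, accuracy)
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative value on every benchmark would be consistent with lower entropy tracking higher accuracy; values near zero or positive would indicate that the assumption is strained for that benchmark.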

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained

The paper motivates ETTC from an analysis of prediction diversity, reduces to majority voting in the single-model case by construction, and claims a proof of outperformance under mild assumptions plus empirical gains. No quoted equations or steps reduce a claimed prediction or uniqueness result to a fitted input, self-citation chain, or ansatz smuggled from prior work by the same authors. The central theoretical claim is presented as conditional on external assumptions rather than tautological, and the empirical comparisons are to external baselines (voting, best single model). This is the normal case of an independent derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is based solely on the abstract; no free parameters, invented entities, or detailed axioms are specified beyond the reference to 'mild assumptions' for the proof.

axioms (1)

Mild assumptions under which ETTC outperforms majority voting (ad hoc to paper)
Invoked in the abstract as the basis for the theoretical result.

pith-pipeline@v0.9.1-grok · 5755 in / 1297 out tokens · 29551 ms · 2026-06-28T23:39:23.081272+00:00 · methodology
