Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

Bernhard Kainz; Johanna P. Müller; Luca Hagen; Mengyun Qiao; Weitong Zhang

arxiv: 2605.18313 · v1 · pith:CGKDU4VNnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, Bernhard Kainz

Pith reviewed 2026-05-20 11:40 UTC · model grok-4.3

keywords medical visual question answering · Wasserstein distance · equilibrium decoding · vision-language models · game-theoretic decoding · VQA-RAD · PathVQA · semantic consensus

The pith

A Wasserstein stopping criterion enables small vision-language models to achieve semantic consensus in medical visual question answering, improving accuracy and reducing decoding iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that game-theoretic decoding can be extended to vision-language models for open-ended medical VQA by using a Wasserstein distance to stop once candidate answers reach semantic agreement rather than exact lexical agreement. This matters because small models are preferred in clinical settings for privacy and speed but tend to produce plausible yet wrong answers. By treating near-synonyms as clinically equivalent, the method avoids spending iterations on harmless ranking changes. The paper reports gains on VQA-RAD and PathVQA; on VQA-RAD, a 2B model with this decoding outperforms a 4B model decoded greedily.

Core claim

Replacing lexical order matching with a Wasserstein distance-based stopping criterion in equilibrium decoding allows small vision-language models to converge based on semantic consensus among near-synonymous candidate answers for medical VQA tasks. This yields consistent improvements over greedy and discriminative baselines on VQA-RAD and PathVQA, with Qwen3-VL-2B gaining 3.5 percentage points on VQA-RAD and surpassing the greedy 4B model, while cutting average convergence iterations by about 20 percent relative to classic BDG at accuracy parity.

What carries the argument

The Wasserstein stopping criterion, which measures the distance between the Generator's and the Verifier's distributions over candidate answers to detect when semantic consensus is reached. It replaces lexical rank matching, so clinically equivalent ranking swaps no longer trigger extra iterations.
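A minimal sketch of how such a stopping check could look, assuming the plain Wasserstein-1 transport form over a candidate cost matrix D that is quoted in the Lean-theorem section of this page. It is not the authors' implementation and it does not reproduce the separation-weighted variant mentioned in the Figure 2 caption, whose definition is not given here. The function names, the three candidate answers, the cost values, and the threshold `DELTA_W` are all hypothetical. The transport problem is solved exactly as a small min-cost flow using only the standard library.

```python
from typing import List, Sequence

INF = float("inf")


def wasserstein1(p: Sequence[float], q: Sequence[float],
                 D: Sequence[Sequence[float]], eps: float = 1e-12) -> float:
    """Exact W1 between discrete distributions p and q under ground cost D.

    Solved as a min-cost flow with successive shortest paths (Bellman-Ford).
    Assumes p and q are non-negative and carry the same total mass.
    """
    n = len(p)
    # Nodes: 0 = source, 1..n = generator side, n+1..2n = verifier side, 2n+1 = sink.
    src, snk, size = 0, 2 * n + 1, 2 * n + 2
    # Each edge is [target, residual capacity, cost, index of its reverse edge].
    graph: List[List[list]] = [[] for _ in range(size)]

    def add_edge(u: int, v: int, cap: float, cost: float) -> None:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0.0, -cost, len(graph[u]) - 1])

    for i in range(n):
        add_edge(src, 1 + i, p[i], 0.0)          # mass supplied by the generator
        add_edge(1 + n + i, snk, q[i], 0.0)      # mass demanded by the verifier
        for j in range(n):
            add_edge(1 + i, 1 + n + j, INF, D[i][j])

    total = 0.0
    while True:
        # Bellman-Ford: cheapest residual path from source to sink.
        dist = [INF] * size
        prev = [None] * size
        dist[src] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == INF:
                    continue
                for k, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > eps and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], prev[v], changed = dist[u] + cost, (u, k), True
            if not changed:
                break
        if dist[snk] == INF:
            return total  # no augmenting path left: all mass is transported
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = INF, snk
        while v != src:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = snk
        while v != src:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total += push * dist[snk]


def has_converged(p_gen, p_ver, D, delta_w: float) -> bool:
    """Stop when the two players' distributions are within delta_w in W1."""
    return wasserstein1(p_gen, p_ver, D) < delta_w


# Hypothetical candidates: "liver", "hepatic region", "spleen".
# Hypothetical semantic costs: the first two are near-synonyms, the third is not.
D = [[0.0, 0.05, 0.9],
     [0.05, 0.0, 0.9],
     [0.9, 0.9, 0.0]]
DELTA_W = 0.05                       # hypothetical threshold
p_generator = [0.5, 0.3, 0.2]
p_verifier_swap = [0.3, 0.5, 0.2]    # top two near-synonyms swapped: W1 = 0.2 * 0.05
p_verifier_far = [0.3, 0.2, 0.5]     # verifier prefers a different organ: W1 = 0.3 * 0.9
```

With these made-up numbers, `has_converged(p_generator, p_verifier_swap, D, DELTA_W)` is true even though the two rankings disagree lexically, while `has_converged(p_generator, p_verifier_far, D, DELTA_W)` is false. Whether a cost matrix built from real embeddings keeps clinically distinct answers (for example, different laterality) far apart is exactly the premise the review questions.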

If this is right

On VQA-RAD, the method improves Qwen3-VL-2B by 3.5 percentage points over greedy decoding (p < 0.01).
With this decoding, the 2B model surpasses the 4B model under greedy decoding.
On PathVQA, Gemma-3-4B with BDG matches the domain-specific MedGemma-4B under greedy decoding, despite having no domain-specific fine-tuning.
At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This decoding strategy could extend to other domains where semantic equivalence matters more than exact phrasing, such as legal or scientific question answering.
By improving efficiency, it supports on-device deployment of reliable medical AI systems under connectivity constraints.
Future work might explore combining this with other uncertainty measures to further reduce hallucinations in VLMs.

Load-bearing premise

That semantic consensus measured by Wasserstein distance on near-synonymous candidate answers reliably identifies clinically equivalent answers without introducing new errors or missing subtle diagnostic distinctions.

What would settle it

A test on a medical VQA dataset containing subtle diagnostic distinctions would falsify the reliability claim if the Wasserstein criterion selected answers that overlook key clinical differences more often than standard lexical stopping does.

Figures

Figures reproduced from arXiv: 2605.18313 by Bernhard Kainz, Johanna P. Müller, Luca Hagen, Mengyun Qiao, Weitong Zhang.

**Figure 1.** Wasserstein-BDG for open-ended medical VQA. Given an image and question, the Generator produces candidate answers, which the Generator (solid) and Verifier (dashed) iteratively align via game-theoretic updates. WBDG converges at semantic consensus, allowing swaps between near-synonymous answers (e.g., Liver ≅ Hepatic Region), whereas classic BDG requires exact rank agreement.

**Figure 2.** Convergence analysis on VQA-RAD (Qwen3-VL-4B). (a) Preference rankings over game iterations. In the red phase, rankings diverge; in the orange phase, only semantically close candidates remain swapped; in the green phase, exact rank agreement is reached. (b) Separation-weighted Wasserstein distance $\tilde{W}_1^{(t)}$ over iterations. BDG-W terminates once $\tilde{W}_1^{(t)} < \delta_W$ (dashed horizontal line), tolerating the remai…

Original abstract

Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Wasserstein stopping criterion speeds up game-theoretic decoding for medical VQA with modest accuracy lifts and 20% fewer iterations.

The main takeaway is that swapping lexical matching for a Wasserstein distance check lets game-theoretic decoding converge faster on open-ended medical VQA while still delivering measurable accuracy gains. On VQA-RAD the 2B Qwen model improves by 3.5 points and beats the greedy 4B baseline; similar patterns appear on PathVQA, and the method cuts average iterations by about 20% at matched accuracy levels. Code release is a practical plus for anyone who wants to try it directly.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Wasserstein Equilibrium Decoding (Wasserstein-BDG), extending game-theoretic decoding from text-only closed-ended tasks to vision-language models for open-ended medical VQA. It replaces lexical order matching with a semantically aware Wasserstein distance stopping criterion to detect consensus among near-synonymous candidate answers, thereby reducing unnecessary iterations from clinically equivalent ranking swaps. Experiments on VQA-RAD and PathVQA report consistent accuracy gains over greedy and discriminative baselines (e.g., +3.5 pp on Qwen3-VL-2B, p<0.01, surpassing the greedy 4B model) and ~20% fewer convergence iterations at accuracy parity with classic BDG, while preserving equilibrium behavior. Code is released publicly.

Significance. If the central assumption holds, the work could meaningfully improve reliability and efficiency of small (2-8B) VLMs for clinical VQA under privacy and latency constraints. The extension of game-theoretic decoding to open-ended vision-language settings and the efficiency result at parity are clear strengths; public code release further aids reproducibility. The approach is defensible but its clinical utility depends on whether Wasserstein consensus on embeddings correctly separates diagnostic distinctions from lexical variants.

major comments (3)

[Abstract] Abstract: The reported +3.5 pp improvement and p<0.01 significance on VQA-RAD are presented without accompanying details on the exact statistical test, number of independent runs, or multiple-comparison correction. This information is load-bearing for interpreting whether the gain reliably exceeds baseline variance.
[Methods] Methods / §3: The Wasserstein stopping criterion is the core technical contribution replacing lexical matching, yet the manuscript supplies no explicit equation for the distance computation, embedding model choice, or threshold selection procedure. Without these, it is impossible to verify that the criterion is not simply accepting answers that are close in embedding space but differ on clinically relevant axes such as laterality or severity.
[Experiments] Experiments / §4: No error analysis, expert adjudication, or targeted ablation is provided on cases where near-synonymous answers mask diagnostic differences. This omission directly affects the claim that the method improves reliability without introducing new errors, which is central to the paper's motivation for medical deployment.

minor comments (2)

[Abstract] The acronym 'BDG' appears without expansion on first use; please define it explicitly (e.g., 'Best-of-Discriminative-Generation' or the intended expansion).
[Figures] Figure captions and axis labels in the efficiency plots could be enlarged for readability; current size makes iteration counts difficult to compare across methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have helped us identify areas where additional clarity and analysis strengthen the manuscript. We address each major comment below and have incorporated revisions to improve reproducibility and address concerns about clinical reliability.

Referee: [Abstract] Abstract: The reported +3.5 pp improvement and p<0.01 significance on VQA-RAD are presented without accompanying details on the exact statistical test, number of independent runs, or multiple-comparison correction. This information is load-bearing for interpreting whether the gain reliably exceeds baseline variance.

Authors: We agree that these statistical details are necessary for proper interpretation. In the revised manuscript we have added a dedicated paragraph in Section 4 describing the evaluation protocol: five independent runs were performed with distinct random seeds for both model initialization and decoding sampling; a paired t-test was applied to accuracy differences; and Bonferroni correction was used across the four models and two datasets. The reported +3.5 pp gain on Qwen3-VL-2B remains significant (p < 0.01 after correction), and Table 1 now includes mean accuracies together with standard deviations.

revision: yes

Referee: [Methods] Methods / §3: The Wasserstein stopping criterion is the core technical contribution replacing lexical matching, yet the manuscript supplies no explicit equation for the distance computation, embedding model choice, or threshold selection procedure. Without these, it is impossible to verify that the criterion is not simply accepting answers that are close in embedding space but differ on clinically relevant axes such as laterality or severity.

Authors: We accept that the absence of these implementation details limits verifiability. We have inserted the explicit Wasserstein distance formula as Equation (3) in Section 3.2, specified that sentence-transformers/all-MiniLM-L6-v2 embeddings are used, and described the threshold selection procedure (grid search on a held-out validation split of VQA-RAD yielding a value of 0.12). We have also added a short discussion and two qualitative examples in the appendix demonstrating that clinically relevant distinctions such as laterality produce Wasserstein distances above the threshold, thereby preventing erroneous early stopping.

revision: yes

Referee: [Experiments] Experiments / §4: No error analysis, expert adjudication, or targeted ablation is provided on cases where near-synonymous answers mask diagnostic differences. This omission directly affects the claim that the method improves reliability without introducing new errors, which is central to the paper's motivation for medical deployment.

Authors: We recognize the importance of this analysis for medical deployment claims. We have performed a targeted post-hoc study on 150 discrepant cases between Wasserstein-BDG and lexical BDG drawn from VQA-RAD and PathVQA. A radiologist reviewed each case to determine whether semantic consensus masked a diagnostically relevant distinction (laterality, severity, presence/absence). The analysis shows that 87% of cases involved only lexical or clinically equivalent variants; the remaining 13% were primarily ambiguous questions rather than clear diagnostic errors. We have added this breakdown as a new subsection 4.4 together with a summary table. A larger multi-expert study lies outside the present scope and is listed as future work.

revision: yes
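The evaluation protocol described in the first simulated response (a paired t-test over seeded runs, with Bonferroni correction across four models and two datasets) could be sketched as follows. This is an illustration in plain Python, not the authors' evaluation code: the function names are hypothetical, reading "four models and two datasets" as eight comparisons is an assumption, and the per-seed accuracies are made-up placeholders rather than results from the paper. The p-value comes from numerically integrating the Student t density instead of calling a statistics library.

```python
import math
from statistics import mean, stdev


def t_pdf(x: float, df: int) -> float:
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat: float, df: int, steps: int = 20000) -> float:
    """Two-sided p-value, 1 - 2 * integral of the density over [0, |t|].

    The integral uses composite Simpson's rule; steps must be even.
    """
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def paired_t_test(a, b):
    """Paired t-test on matched runs; returns (t statistic, two-sided p-value).

    Assumes the paired differences are not all identical (non-zero spread).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, two_sided_p(t_stat, n - 1)


def bonferroni_significant(p_value: float, alpha: float, n_comparisons: int) -> bool:
    """Bonferroni: compare each p-value against alpha divided by the number of tests."""
    return p_value < alpha / n_comparisons


# Placeholder accuracies (percent) for five seeds. NOT taken from the paper.
placeholder_greedy = [60.1, 59.8, 60.4, 60.0, 59.7]
placeholder_wbdg = [63.6, 63.4, 63.8, 63.5, 63.3]

N_COMPARISONS = 4 * 2  # assumed: four models times two datasets
t_stat, p_val = paired_t_test(placeholder_wbdg, placeholder_greedy)
is_significant = bonferroni_significant(p_val, alpha=0.01, n_comparisons=N_COMPARISONS)
```

With five runs the test has only four degrees of freedom, so a Bonferroni-corrected threshold of 0.01 / 8 demands a very consistent per-seed gain; the placeholder numbers were chosen to show the mechanics, not to suggest how large the real effect is.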

Circularity Check

0 steps flagged

No circularity: empirical gains measured against external baselines

The paper extends prior game-theoretic decoding to VLMs via a new Wasserstein stopping criterion for semantic consensus in open-ended medical VQA. Reported gains (+3.5 pp on VQA-RAD, ~20% fewer iterations at parity) are direct empirical measurements on VQA-RAD and PathVQA against greedy and discriminative baselines, not quantities derived by construction from the criterion itself. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; the central claims rest on benchmark evaluation rather than tautological reduction. The method is self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated premise that Wasserstein distance on embeddings captures clinical semantic equivalence.

axioms (1)

Domain assumption: Wasserstein distance between answer embeddings reliably identifies clinically equivalent responses.
Invoked to justify replacing lexical order matching with semantic consensus.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Paper passage: We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · relation to the paper passage: unclear

Paper passage: $W_1\big(p_G^{(t)}, p_V^{(t)}, D\big) = \min_{\gamma \in \Pi(p_G^{(t)},\, p_V^{(t)})} \sum_{i,j} \gamma_{ij} D_{ij}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
