Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering
Pith reviewed 2026-05-20 11:40 UTC · model grok-4.3
The pith
A Wasserstein stopping criterion enables small vision-language models to achieve semantic consensus in medical visual question answering, improving accuracy and reducing decoding iterations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing lexical order matching with a Wasserstein distance-based stopping criterion in equilibrium decoding allows small vision-language models to converge based on semantic consensus among near-synonymous candidate answers for medical VQA tasks. This yields consistent improvements over greedy and discriminative baselines on VQA-RAD and PathVQA, with the 2B model gaining 3.5 percentage points and matching larger models, while cutting convergence iterations by about 20 percent at the same accuracy level.
What carries the argument
the Wasserstein stopping criterion that computes distance between distributions of candidate answers to detect when semantic consensus is reached, replacing lexical matching to prevent unnecessary iterations from clinically equivalent ranking swaps.
If this is right
- On VQA-RAD, the method improves Qwen3-VL-2B by 3.5 percentage points over greedy decoding with statistical significance.
- It allows the 2B model to surpass the greedy performance of a 4B model.
- On PathVQA, a 4B model with this decoding matches a domain-specific MedGemma-4B under greedy decoding without fine-tuning.
- At accuracy parity, it reduces average convergence iterations by approximately 20 percent.
Where Pith is reading between the lines
- This decoding strategy could extend to other domains where semantic equivalence matters more than exact phrasing, such as legal or scientific question answering.
- By improving efficiency, it supports on-device deployment of reliable medical AI systems under connectivity constraints.
- Future work might explore combining this with other uncertainty measures to further reduce hallucinations in VLMs.
Load-bearing premise
That semantic consensus measured by Wasserstein distance on near-synonymous candidate answers reliably identifies clinically equivalent answers without introducing new errors or missing subtle diagnostic distinctions.
What would settle it
A test on a medical VQA dataset containing subtle diagnostic distinctions where the Wasserstein criterion selects answers that overlook key clinical differences more frequently than standard lexical stopping would falsify the reliability claim.
Figures
read the original abstract
Small vision-language models (2-8B) are well-suited for clin- ical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language mod- els for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, en- abling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clini- cally equivalent ranking swaps. On VQA-RAD and PathVQA, we ob- tain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain- specific fine-tuning. At accuracy parity with classic BDG, the Wasser- stein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/ Wasserstein-BDG-medical-VQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Wasserstein Equilibrium Decoding (Wasserstein-BDG), extending game-theoretic decoding from text-only closed-ended tasks to vision-language models for open-ended medical VQA. It replaces lexical order matching with a semantically aware Wasserstein distance stopping criterion to detect consensus among near-synonymous candidate answers, thereby reducing unnecessary iterations from clinically equivalent ranking swaps. Experiments on VQA-RAD and PathVQA report consistent accuracy gains over greedy and discriminative baselines (e.g., +3.5 pp on Qwen3-VL-2B, p<0.01, surpassing the greedy 4B model) and ~20% fewer convergence iterations at accuracy parity with classic BDG, while preserving equilibrium behavior. Code is released publicly.
Significance. If the central assumption holds, the work could meaningfully improve reliability and efficiency of small (2-8B) VLMs for clinical VQA under privacy and latency constraints. The extension of game-theoretic decoding to open-ended vision-language settings and the efficiency result at parity are clear strengths; public code release further aids reproducibility. The approach is defensible but its clinical utility depends on whether Wasserstein consensus on embeddings correctly separates diagnostic distinctions from lexical variants.
major comments (3)
- [Abstract] Abstract: The reported +3.5 pp improvement and p<0.01 significance on VQA-RAD are presented without accompanying details on the exact statistical test, number of independent runs, or multiple-comparison correction. This information is load-bearing for interpreting whether the gain reliably exceeds baseline variance.
- [Methods] Methods / §3: The Wasserstein stopping criterion is the core technical contribution replacing lexical matching, yet the manuscript supplies no explicit equation for the distance computation, embedding model choice, or threshold selection procedure. Without these, it is impossible to verify that the criterion is not simply accepting answers that are close in embedding space but differ on clinically relevant axes such as laterality or severity.
- [Experiments] Experiments / §4: No error analysis, expert adjudication, or targeted ablation is provided on cases where near-synonymous answers mask diagnostic differences. This omission directly affects the claim that the method improves reliability without introducing new errors, which is central to the paper's motivation for medical deployment.
minor comments (2)
- [Abstract] The acronym 'BDG' appears without expansion on first use; please define it explicitly (e.g., 'Best-of-Discriminative-Generation' or the intended expansion).
- [Figures] Figure captions and axis labels in the efficiency plots could be enlarged for readability; current size makes iteration counts difficult to compare across methods.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments have helped us identify areas where additional clarity and analysis strengthen the manuscript. We address each major comment below and have incorporated revisions to improve reproducibility and address concerns about clinical reliability.
read point-by-point responses
-
Referee: [Abstract] Abstract: The reported +3.5 pp improvement and p<0.01 significance on VQA-RAD are presented without accompanying details on the exact statistical test, number of independent runs, or multiple-comparison correction. This information is load-bearing for interpreting whether the gain reliably exceeds baseline variance.
Authors: We agree that these statistical details are necessary for proper interpretation. In the revised manuscript we have added a dedicated paragraph in Section 4 describing the evaluation protocol: five independent runs were performed with distinct random seeds for both model initialization and decoding sampling; a paired t-test was applied to accuracy differences; and Bonferroni correction was used across the four models and two datasets. The reported +3.5 pp gain on Qwen3-VL-2B remains significant (p < 0.01 after correction), and Table 1 now includes mean accuracies together with standard deviations. revision: yes
-
Referee: [Methods] Methods / §3: The Wasserstein stopping criterion is the core technical contribution replacing lexical matching, yet the manuscript supplies no explicit equation for the distance computation, embedding model choice, or threshold selection procedure. Without these, it is impossible to verify that the criterion is not simply accepting answers that are close in embedding space but differ on clinically relevant axes such as laterality or severity.
Authors: We accept that the absence of these implementation details limits verifiability. We have inserted the explicit Wasserstein distance formula as Equation (3) in Section 3.2, specified that sentence-transformers/all-MiniLM-L6-v2 embeddings are used, and described the threshold selection procedure (grid search on a held-out validation split of VQA-RAD yielding a value of 0.12). We have also added a short discussion and two qualitative examples in the appendix demonstrating that clinically relevant distinctions such as laterality produce Wasserstein distances above the threshold, thereby preventing erroneous early stopping. revision: yes
-
Referee: [Experiments] Experiments / §4: No error analysis, expert adjudication, or targeted ablation is provided on cases where near-synonymous answers mask diagnostic differences. This omission directly affects the claim that the method improves reliability without introducing new errors, which is central to the paper's motivation for medical deployment.
Authors: We recognize the importance of this analysis for medical deployment claims. We have performed a targeted post-hoc study on 150 discrepant cases between Wasserstein-BDG and lexical BDG drawn from VQA-RAD and PathVQA. A radiologist reviewed each case to determine whether semantic consensus masked a diagnostically relevant distinction (laterality, severity, presence/absence). The analysis shows that 87 % of cases involved only lexical or clinically equivalent variants; the remaining 13 % were primarily ambiguous questions rather than clear diagnostic errors. We have added this breakdown as a new subsection 4.4 together with a summary table. A larger multi-expert study lies outside the present scope and is listed as future work. revision: yes
Circularity Check
No circularity: empirical gains measured against external baselines
full rationale
The paper extends prior game-theoretic decoding to VLMs via a new Wasserstein stopping criterion for semantic consensus in open-ended medical VQA. Reported gains (+3.5 pp on VQA-RAD, ~20% fewer iterations at parity) are direct empirical measurements on VQA-RAD and PathVQA against greedy and discriminative baselines, not quantities derived by construction from the criterion itself. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; the central claims rest on benchmark evaluation rather than tautological reduction. The method is self-contained against external data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Wasserstein distance between answer embeddings reliably identifies clinically equivalent responses
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers
-
IndisputableMonolith/Foundation/BranchSelection.leanbranch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
W1(pG(t), pV(t), D) = min γ∈Π(pG(t),pV(t)) Σ γij Dij
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Abdin, M., Jacobs, S.A., Awan, A.A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., et al.: Phi-3 technical report: A highly capable language model locally on your phone (2024)
work page 2024
-
[2]
In: Uncertainty in Artificial Intelligence
Bonjour, T., Aggarwal, V., Bhargava, B.: Information theoretic approach to detect collusion in multi-agent games. In: Uncertainty in Artificial Intelligence. pp. 223–
- [3]
-
[4]
In: Forty-first international conference on machine learning (2024)
Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factual- ity and reasoning in language models through multiagent debate. In: Forty-first international conference on machine learning (2024)
work page 2024
-
[5]
Gemma Team, Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., et al.: Gemma: Open models based on Gemini research and technology (2024)
work page 2024
-
[6]
Gibbons, R., et al.: A primer in game theory (1992)
work page 1992
-
[7]
Frontiers in Artificial Intelligence7, 1430984 (2024)
Hartsock, I., Rasool, G.: Vision-language models for medical report generation and visual question answering: A review. Frontiers in Artificial Intelligence7, 1430984 (2024)
work page 2024
-
[8]
PathVQA: 30000+ Questions for Medical Visual Question Answering
He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[9]
Huang, J., Chen, X., Mishra, S., Zheng, H.S., Yu, A.W., Song, X., Zhou, D.: Large language models cannot self-correct reasoning yet (2024), https://arxiv.org/abs/ 2310.01798
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
arXiv preprint arXiv:2310.09139 (2023)
Jacob, A.P., Shen, Y., Farina, G., Andreas, J.: The consensus game: Language model generation via equilibrium search. arXiv preprint arXiv:2310.09139 (2023)
-
[11]
ACM Computing Surveys55(12), 1–38 (2023)
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Computing Surveys55(12), 1–38 (2023)
work page 2023
-
[12]
In: Pro- ceedings of the 2004 conference on empirical methods in natural language process- ing
Koehn, P.: Statistical significance tests for machine translation evaluation. In: Pro- ceedings of the 2004 conference on empirical methods in natural language process- ing. pp. 388–395 (2004)
work page 2004
-
[13]
In: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)
Koirala, P., Laine, F.: Algorithmic collusion in a two-sided market: A rideshare ex- ample. In: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC). pp. 3445–3452. IEEE (2024)
work page 2024
-
[14]
Scientific Data 5(1), 1–10 (2018)
Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 1–10 (2018)
work page 2018
-
[15]
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. In: Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track (2023)
work page 2023
-
[16]
Li, X.L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T.B., Zettle- moyer, L., Lewis, M.: Contrastive decoding: Open-ended text generation as opti- mization. In: Proceedings of the 61st annual meeting of the association for com- putational linguistics (volume 1: Long papers). pp. 12286–12312 (2023)
work page 2023
- [17]
-
[18]
Artificial Intelligence in Medicine143, 102611 (2023)
Lin, Z., Zhang, D., Tao, Q., Shi, D., Haffari, G., Wu, Q., He, M., Ge, Z.: Medical visual question answering: A survey. Artificial Intelligence in Medicine143, 102611 (2023)
work page 2023
-
[19]
Liu, F., Shareghi, E., Meng, Z., Basaldella, M., Collier, N.: Self-alignment pretrain- ing for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 4228–4238 (2021)
work page 2021
-
[20]
McIntosh-Smith, S., Alam, S.R., Woods, C.: Isambard-ai: a leadership class super- computer optimised specifically for artificial intelligence. arXiv.2410.11199 (2024) Wasserstein Equilibrium Decoding 11
- [21]
-
[22]
BMC Medical Ethics22(1), 1–5 (2021)
Murdoch, B.: Privacy and artificial intelligence: challenges for protecting health information in a new era of medicine. BMC Medical Ethics22(1), 1–5 (2021)
work page 2021
-
[23]
Proceedings of the National Academy of Sciences36(1), 48–49 (1950)
Nash Jr, J.F.: Equilibrium points in n-person games. Proceedings of the National Academy of Sciences36(1), 48–49 (1950)
work page 1950
-
[24]
arXiv preprint arXiv:2504.17119 (2025)
Popov, N., et al.: The rise of small language models in healthcare: A comprehensive survey. arXiv preprint arXiv:2504.17119 (2025)
-
[25]
International journal of computer vision40(2), 99–121 (2000)
Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. International journal of computer vision40(2), 99–121 (2000)
work page 2000
-
[26]
Applied Sciences15(6), 2983 (2025)
Santos,C.,etal.:Generativemodelsinmedicalvisualquestionanswering:Asurvey. Applied Sciences15(6), 2983 (2025)
work page 2025
-
[27]
Team, Q.: Qwen3 technical report (2025), https://arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Wang, Q., Wang, Z., Su, Y., Tong, H., Song, Y.: Rethinking the bounds of llm reasoning: Are multi-agent discussions the key? In: Proceedings of the 62nd Annual MeetingoftheAssociationforComputationalLinguistics(Volume1:LongPapers). pp. 6106–6131 (2024)
work page 2024
-
[29]
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models (2023), https://arxiv.org/abs/2203.11171
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
https://data.x.ai/2025-11-17-grok-4-1-model-card.pdf (2025), accessed: 2026-02-26
xAI: Grok 4.1 model card. https://data.x.ai/2025-11-17-grok-4-1-model-card.pdf (2025), accessed: 2026-02-26
work page 2025
-
[31]
arXiv preprint arXiv:2410.18125 (2024)
Xu, L., et al.: Towards edge general intelligence via large language models. arXiv preprint arXiv:2410.18125 (2024)
-
[32]
arXiv preprint arXiv:2409.01147 , year =
Xu, Z., Zhao, W.: On mechanism underlying algorithmic collusion. arXiv preprint arXiv:2409.01147 (2024)
-
[33]
Zhang, W., Zang, C., Kainz, B.: From self-check to consensus: Bayesian strategic decoding in large language models. In: The Thirty-ninth Annual Conference on NeuralInformationProcessingSystems(2025),https://openreview.net/forum?id= t49olghJ3w
work page 2025
-
[34]
Communications Medicine4, 261 (2024)
Zhang, X., Wu, C., Zhao, Z., Lin, W., Zhang, Y., Wang, Y., Xie, W.: Develop- ment of a large-scale medical visual question-answering dataset. Communications Medicine4, 261 (2024)
work page 2024
-
[35]
Computational Linguistics51(4), 1373–1418 (2025)
Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al.: Siren’s song in the ai ocean: A survey on hallucination in large language models. Computational Linguistics51(4), 1373–1418 (2025)
work page 2025
-
[36]
In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T
Zhu, Z., Zhang, Y., Zhuang, X., Zhang, F., Wan, Z., Chen, Y., Long, Q., Zheng, Y., Wu, X.: Can we trust ai doctors? a survey of medical hallucination in large language and large vision-language models. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguis- tics: ACL 2025. pp. 6748–6769. Associat...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.