pith. machine review for the scientific record.

arxiv: 2604.20531 · v1 · submitted 2026-04-22 · 💻 cs.CL


Effects of Cross-lingual Evidence in Multilingual Medical Question Answering


Pith reviewed 2026-05-09 23:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual medical question answering · cross-lingual retrieval · low-resource languages · external evidence · web retrieval · model scale

The pith

For low-resource languages, combining English and target-language retrieval achieves medical QA accuracy comparable to high-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how curated medical repositories, web-retrieved content, and LLM-generated explanations affect question-answering performance across high-resource languages such as English and Spanish, and low-resource languages such as Basque and Kazakh. Larger models lead on English baselines, English web data helps high-resource languages most, and low-resource languages gain the most from retrieval that mixes English sources with the target language. These patterns show that external evidence does not improve results in every case and that the best approach depends on available language resources and model size. Curated sources such as PubMed supply reliable expert knowledge but cover too few languages to help broadly.

Core claim

Larger models consistently achieve superior performance in English across baseline evaluations. When incorporating external knowledge, web-retrieved data in English proves most beneficial for high-resource languages. For low-resource languages, the most effective strategy combines retrieval in both English and the target language, achieving accuracy comparable to high-resource language results. These findings challenge the assumption that external knowledge systematically improves performance and reveal that effective strategies depend on both the source of language resources and on model scale. Specialized medical knowledge sources such as PubMed are limited because, while they provide authoritative expert knowledge, they lack adequate multilingual coverage.

What carries the argument

Cross-lingual retrieval that combines evidence from English and the target language, tested against monolingual and multilingual alternatives across three evidence types and multiple model sizes.
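The paper's retrieval pipeline is not specified at code level; the sketch below shows one minimal way the combined cross-lingual condition could be wired, with `retrieve`, `translate_to_english`, and the prompt format all hypothetical stand-ins rather than the authors' implementation.

```python
# Minimal sketch of the combined English+target evidence condition.
# `retrieve` and `translate_to_english` are hypothetical stand-ins for
# the paper's (unspecified) retrieval and translation components.
from typing import Callable

def cross_lingual_evidence(
    question: str,
    target_lang: str,
    retrieve: Callable[[str, str, int], list[str]],
    translate_to_english: Callable[[str], str],
    k_per_lang: int = 3,
) -> list[str]:
    """Mix top-k passages retrieved in English with top-k retrieved in
    the target language -- the strategy the paper reports as most
    effective for low-resource languages."""
    english_hits = retrieve(translate_to_english(question), "en", k_per_lang)
    target_hits = retrieve(question, target_lang, k_per_lang)
    # Interleave so neither language dominates the prompt context.
    evidence: list[str] = []
    for en, tgt in zip(english_hits, target_hits):
        evidence.extend([en, tgt])
    return evidence

def build_prompt(question: str, evidence: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(evidence))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```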

If this is right

  • Low-resource languages can reach performance levels similar to high-resource languages when retrieval draws from both English and the target language.
  • Web-retrieved English content supplies the strongest performance lift for high-resource languages.
  • Gains from external evidence vary with model scale, so smaller models may need different strategies than larger ones.
  • Specialized repositories such as PubMed provide authoritative knowledge but cannot serve as primary sources in multilingual settings due to limited language coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Medical QA systems for low-resource languages should treat English web data as a reliable bridge rather than relying solely on native-language sources.
  • The same combined-retrieval pattern could be tested in non-medical domains to check whether resource-level gaps close in the same way.
  • Smaller models might benefit from heavier use of parametric explanations instead of retrieval when cross-lingual data is scarce.

Load-bearing premise

The web-retrieved content is assumed to be sufficiently accurate and relevant for medical queries without introducing factual errors or biases, and the chosen languages generalize to other multilingual medical QA settings.

What would settle it

Applying the combined English-plus-target retrieval strategy to a fresh set of low-resource languages and finding that accuracy does not reach levels comparable to high-resource languages would falsify the central claim.
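One way to make that check concrete, assuming only per-question 0/1 correctness vectors for the new low-resource language and a high-resource reference (nothing the paper itself reports), is a paired bootstrap on the accuracy gap:

```python
# Hedged sketch of the falsification test: bootstrap a 95% confidence
# interval on the accuracy gap between a high-resource reference and a
# fresh low-resource language run with combined English+target retrieval.
# Inputs are hypothetical 0/1 correctness lists, one entry per question.
import random

def accuracy_gap_ci(low_resource: list[int], high_resource: list[int],
                    n_boot: int = 10_000, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        lo = [rng.choice(low_resource) for _ in low_resource]
        hi = [rng.choice(high_resource) for _ in high_resource]
        gaps.append(sum(hi) / len(hi) - sum(lo) / len(lo))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]

# An interval sitting well above zero means accuracy is not comparable,
# and the central claim fails to transfer to the new language.
```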

Figures

Figures reproduced from arXiv: 2604.20531 by Anar Yeginbergen, Maite Oronoz, Rodrigo Agerri.

Figure 1. Illustration of the Multilingual (English, Spanish, …)
Figure 2. Comparison between different retrieval settings: monolingual, multilingual, and cross-lingual across …
Figure 3. Performance of different models under different document counts from MedExpQA.
Figure 4. Error rate by each external knowledge source and LLM in English.
Figure 5. Error rate by each external knowledge source and LLM in Spanish.
Figure 6. Error rate by each external knowledge source and LLM in Italian.
Figure 7. Error rate by each external knowledge source and LLM in French.
Figure 8. Error rate by each external knowledge source and LLM in Basque.
Figure 9. Error rate by each external knowledge source and LLM in Kazakh.
Original abstract

This paper investigates Multilingual Medical Question Answering across high-resource (English, Spanish, French, Italian) and low-resource (Basque, Kazakh) languages. We evaluate three types of external evidence sources across models of varying size: curated repositories of specialized medical knowledge, web-retrieved content, and explanations from LLMs' parametric knowledge. Moreover, we conduct experiments with multilingual, monolingual and cross-lingual retrieval. Our results demonstrate that larger models consistently achieve superior performance in English across baseline evaluations. When incorporating external knowledge, web-retrieved data in English proves most beneficial for high-resource languages. Conversely, for low-resource languages, the most effective strategy combines retrieval in both English and the target language, achieving comparable accuracy to high-resource language results. These findings challenge the assumption that external knowledge systematically improves performance and reveal that effective strategies depend on both the source of language resources and on model scale. Furthermore, specialized medical knowledge sources such as PubMed are limited: while they provide authoritative expert knowledge, they lack adequate multilingual coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates multilingual medical question answering for high-resource languages (English, Spanish, French, Italian) and low-resource languages (Basque, Kazakh). It evaluates three external evidence sources—curated medical repositories, web-retrieved content, and LLM parametric explanations—across models of varying sizes, using multilingual, monolingual, and cross-lingual retrieval strategies. Results show larger models excel in English baselines; English web retrieval benefits high-resource languages most; for low-resource languages, combined English+target retrieval yields comparable accuracy to high-resource settings. The work concludes that external knowledge does not systematically improve performance and that optimal strategies depend on resource level and model scale, while noting limited multilingual coverage in sources like PubMed.

Significance. If the empirical claims hold after addressing gaps in evidence verification and experimental detail, the findings would be significant for demonstrating that retrieval effectiveness in multilingual medical QA is not uniform but varies with language resources and model scale. This challenges assumptions about external knowledge augmentation and could inform more targeted strategies for low-resource medical QA systems. The medical domain makes such nuance particularly relevant, though the absence of fact-checking on retrieved content limits immediate applicability.

major comments (2)
  1. [Abstract and Results] The comparative claims (e.g., combined retrieval achieving comparable accuracy for Basque and Kazakh) are reported without details on experimental setup, baselines, statistical tests, error analysis, or controls for retrieval quality and data contamination, making it impossible to assess whether the data support the conclusions.
  2. [Experiments and Evidence Sources] Web-retrieval evaluation: No fact-checking, relevance filtering, or error-rate analysis is reported for the web-retrieved passages (especially non-English medical content). This is load-bearing for the central claim that combined retrieval 'achieves comparable accuracy' via genuine knowledge augmentation, as unverified inaccuracies could instead measure model robustness to noise. A minimal relevance-filter sketch follows the minor comments below.
minor comments (2)
  1. [Introduction] The distinction between 'high-resource' and 'low-resource' languages could be defined more explicitly with reference to dataset sizes or medical corpus availability.
  2. [Methodology] Clarify how 'explanations from LLM's parametric knowledge' are generated and distinguished from retrieval-based evidence.
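The relevance filtering the second major comment asks for need not be elaborate; a minimal version, assuming the sentence-transformers library and a multilingual embedding model (the paper reports no such filter, and the model choice and threshold here are illustrative):

```python
# Sketch of a minimal relevance filter for web-retrieved passages.
# Assumes sentence-transformers; model choice and threshold are
# illustrative, not taken from the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def filter_passages(question: str, passages: list[str],
                    threshold: float = 0.4) -> list[str]:
    """Keep only passages whose cosine similarity to the question clears
    the threshold, dropping off-topic web content before prompting."""
    q_emb = model.encode(question, convert_to_tensor=True)
    p_emb = model.encode(passages, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, p_emb)[0]
    return [p for p, s in zip(passages, sims) if float(s) >= threshold]
```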

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed the major comments point by point below, making revisions to improve clarity and transparency where feasible.

Point-by-point responses
  1. Referee: [Abstract and Results] The comparative claims (e.g., combined retrieval achieving comparable accuracy for Basque and Kazakh) are reported without details on experimental setup, baselines, statistical tests, error analysis, or controls for retrieval quality and data contamination, making it impossible to assess whether the data support the conclusions.

    Authors: We agree that additional details are necessary to support the claims. In the revised manuscript, we have expanded the Experimental Setup section to include complete descriptions of the models, retrieval pipelines, baselines (no-retrieval, monolingual-only, and random-retrieval controls), statistical tests (McNemar's test with reported p-values for paired accuracy comparisons; a minimal form of this paired test is sketched after these responses), error analysis broken down by language and question category, and explicit controls for data contamination (including checks for train-test overlap and use of temporally disjoint retrieval corpora). These elements have also been summarized in the abstract to better ground the comparative claims for Basque and Kazakh. revision: yes

  2. Referee: [Experiments and Evidence Sources] Web-retrieval evaluation: No fact-checking, relevance filtering, or error-rate analysis is reported for the web-retrieved passages (especially non-English medical content). This is load-bearing for the central claim that combined retrieval 'achieves comparable accuracy' via genuine knowledge augmentation, as unverified inaccuracies could instead measure model robustness to noise.

    Authors: We acknowledge this limitation. The original experiments did not perform fact-checking, relevance filtering, or error-rate quantification on the web-retrieved passages. This means the reported gains, particularly for low-resource languages, could partly reflect model tolerance to noisy or inaccurate content rather than verified knowledge augmentation. We have added a new Limitations section that explicitly discusses this issue, its potential effect on non-English medical content, and the need for future verification pipelines. We also inserted a brief caveat in the results discussion. Comprehensive fact-checking was not added in this revision, as it would require new human annotation beyond the current scope. revision: partial
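The paired test named in the first response is standard; a minimal exact form of McNemar's test, assuming aligned 0/1 correctness vectors for two configurations over the same questions (neither the paper nor the rebuttal gives code):

```python
# Exact McNemar's test on paired per-question correctness. The p-value
# comes from an exact binomial test on the discordant pairs; requires
# scipy >= 1.7 for binomtest.
from scipy.stats import binomtest

def mcnemar_exact(correct_a: list[int], correct_b: list[int]) -> float:
    """p-value for H0: configurations A and B have equal error rates."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two configurations never disagree
    return binomtest(only_a, n, 0.5).pvalue
```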

Circularity Check

0 steps flagged

No significant circularity: purely empirical study with direct experimental observations

Full rationale

This is an empirical evaluation paper reporting experimental results on retrieval strategies for multilingual medical QA across high- and low-resource languages. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains are present in the abstract or described methodology. All claims rest on direct observations from model evaluations rather than any self-referential reduction by construction. The central findings about combined English+target retrieval for low-resource languages are experimental outcomes, not forced by prior definitions or citations within the paper itself. This qualifies as a self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical models or derivations are present; the work is an empirical comparison relying on standard NLP practices for retrieval and evaluation.

pith-pipeline@v0.9.0 · 5479 in / 1190 out tokens · 31071 ms · 2026-05-09T23:52:38.069970+00:00 · methodology

