pith. machine review for the scientific record.

arxiv: 2601.13262 · v2 · submitted 2026-01-19 · 💻 cs.AI · cs.CL


CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning


Pith reviewed 2026-05-16 13:17 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords multilingual medical reasoning · reinforcement learning · curriculum learning · large language models · language consistency · logical correctness · low-resource languages · medical AI

The pith

Curriculum-informed reinforcement learning on a new multilingual dataset raises both logical correctness and language stability in medical reasoning for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CUREMED-BENCH, a dataset of open-ended medical reasoning queries in thirteen languages, each tied to one verifiable answer. It then presents CURE-MED, a training process that first applies code-switching-aware supervised fine-tuning and follows with staged Group Relative Policy Optimization to raise both answer accuracy and language consistency. The approach is tested on models from seven billion to thirty-two billion parameters and shows consistent gains over baselines at both scales. A reader would care because current models often produce inconsistent or incorrect medical responses when used outside English, limiting fair access to AI assistance in global healthcare. The results indicate the gains hold as models grow larger.

Core claim

CURE-MED is a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization on the CUREMED-BENCH dataset to jointly improve logical correctness and language consistency in multilingual medical reasoning. It delivers 85.21% language consistency and 54.35% logical correctness at 7B scale and 94.96% consistency and 70.04% correctness at 32B scale, outperforming strong baselines across all thirteen languages.

What carries the argument

Curriculum-informed reinforcement learning framework that pairs code-switching-aware supervised fine-tuning with Group Relative Policy Optimization to optimize jointly for logical correctness and language stability.
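The claim above names Group Relative Policy Optimization without defining it. Its core move, scoring each sampled response against its own group rather than against a learned value baseline, can be sketched as follows; this is a generic GRPO-style advantage computation, not the paper's implementation.

```python
# Generic group-relative advantage in the style of GRPO: for each prompt,
# several responses are sampled, and each response's reward is normalized
# against its own group's mean and standard deviation. This removes the
# need for a separate value network.

def group_relative_advantages(rewards, eps=1e-8):
    """Map a group's raw rewards to zero-mean, unit-scaled advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one query: one fully correct, one wrong, two partial.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

The best answer in the group receives a positive advantage and the worst a negative one, regardless of the absolute reward scale, which is what lets a staged curriculum change difficulty without retuning the reward range.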

If this is right

  • The method outperforms strong baselines consistently across all thirteen languages.
  • Both language consistency and logical correctness improve as model size increases from 7B to 32B parameters.
  • The curriculum structure supports reliable multilingual medical reasoning without separate language-specific models.
  • Training in this staged way raises performance on open-ended queries while preserving language stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged training pattern could be tested on other verifiable reasoning domains such as law or technical troubleshooting where answers are not language-dependent.
  • If the single-answer format proves robust, the approach may narrow performance gaps between high-resource and low-resource languages in any specialized AI task.
  • Clinical deployment would still need separate safety audits to confirm the training does not introduce new medical errors beyond what the benchmark measures.

Load-bearing premise

Medical reasoning queries always have one clear, verifiable correct answer that can be scored the same way regardless of language.

What would settle it

A collection of medical queries whose correct answers legitimately vary by language or culture, followed by measurement of whether the model's reported correctness and consistency percentages remain at the levels shown in the paper.

Figures

Figures reproduced from arXiv: 2601.13262 by Akash Ghosh, Chirag Agarwal, Eric Onyame, Sriparna Saha, Subhadip Baidya, Xiuying Chen.

Figure 1: The CURE-MED pipeline for multilingual medical reasoning. The framework progresses through three stages: (A) curation of clinically validated multilingual data from sources like MedlinePlus to enable cross-lingual reasoning; (B) supervised fine-tuning of the Qwen2.5-Instruct backbone on code-switched reasoning traces; and (C) GRPO-guided curriculum reinforcement learning, progressively training from high- …
Figure 2: An example from the cold-start multilingual dataset showing CoT reasoning in French. The reasoning combines English-based clinical terms and local-language expressions, reflecting code-switching in medical contexts.
Figure 3: Qualitative Spanish medical-reasoning example comparing a baseline Qwen2.5-7B-Instruct model and CURE-MED-7B. The baseline model produces fluent but clinically flawed reasoning (red) and an incorrect diagnosis, whereas CURE-MED generates a structured, code-switched CoT (blue) and arrives at the correct diagnosis (green).
Figure 4: Trade-off between logical accuracy and language consistency of multilingual medical reasoning models, where each point represents a model instance with bubble size reflecting model scale. Baseline and CURE-MED models are shown as ○ and ⋆, respectively. CURE-MED shifts performance toward the upper-right, indicating consistent gains in language consistency and logical accuracy.
Figure 5: Scaling performance of CURE-MED vs. base across Qwen2.5-Instruct variants on language consistency (left) and logical accuracy (right). Our method (solid red line) consistently outperforms the base model (dashed blue line), with performance gaps widening at larger model scales, highlighting the effectiveness of CURE-MED for multilingual medical reasoning. Notably, our 32B model is competitive with closed-so…
Figure 6: CURE-MED vs. medical LLM baselines across four multilingual medical QA benchmarks. Results show logical accuracy, highlighting CURE-MED's consistent performance across diverse evaluation settings.
Figure 7: Prompt used for LLM-as-a-judge verification. (The adjacent appendix text defines a composite reward that jointly enforces clinical correctness, language fidelity, and format compliance: R = 0.65 × R_accuracy + 0.30 × R_language + 0.05 × R_format, weighting medical correctness highest while penalizing language drift and format violations.)
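The composite reward R = 0.65 × R_accuracy + 0.30 × R_language + 0.05 × R_format stated in the paper's appendix text can be sketched in a few lines. The component scores come from an LLM judge and automatic checks in the paper; here they are plain inputs, so this is an illustrative sketch rather than the authors' implementation.

```python
# Sketch of the composite reward reported in the appendix:
# R = 0.65 * R_accuracy + 0.30 * R_language + 0.05 * R_format.
# Each component is assumed to lie in [0, 1].

WEIGHTS = {"accuracy": 0.65, "language": 0.30, "format": 0.05}

def composite_reward(accuracy: float, language: float, fmt: float) -> float:
    """Weighted sum prioritizing clinical correctness over language fidelity and format."""
    scores = {"accuracy": accuracy, "language": language, "format": fmt}
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# A correct, well-formatted answer that drifts out of the target language
# forfeits the 0.30 language share of the reward.
drifted = composite_reward(1.0, 0.0, 1.0)  # 0.65 + 0.00 + 0.05 = 0.70
```

The weighting makes language drift costly but never fatal: a clinically correct answer in the wrong language still outscores a wrong answer in the right one (0.70 vs. at most 0.35).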
Figure 8: Language and family composition of CUREMED-BENCH. Left: number of dataset instances per language across the 13 languages. Right: assignment of languages to eight language families with standard abbreviations.
Figure 9: Prompt for Stage 1 multilingual MCQ generation. Here, {num_questions} specifies the number of questions to generate, and GPT-4o queries MedlinePlus directly to construct clinically grounded questions independently in each of the 13 target languages.
Figure 10: Instructions provided to medical professional annotators for verifying clinical correctness of synthetic question–answer pairs.
Figure 11: Instructions provided to native-speaker annotators for verifying language correctness and target-language fidelity of synthetic question–answer pairs.
read the original abstract

While large language models (LLMs) have been shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset of open-ended reasoning queries with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at https://cure-med.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CUREMED-BENCH, a multilingual medical reasoning dataset spanning 13 languages (including low-resource ones such as Amharic, Yoruba, and Swahili) consisting of open-ended queries asserted to have single verifiable answers. It proposes CURE-MED, a curriculum-informed reinforcement learning framework that combines code-switching-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) to jointly optimize logical correctness and language stability. The central empirical claim is consistent outperformance over strong baselines, with reported metrics of 85.21% language consistency and 54.35% logical correctness at 7B parameters, scaling to 94.96% and 70.04% at 32B parameters.

Significance. If the dataset construction and evaluation protocol can be shown to support reliable cross-lingual scoring of medical reasoning, the work would provide a useful benchmark and training recipe for improving equity in LLM-based medical applications. The public release of code and dataset strengthens reproducibility and enables follow-on research.

major comments (3)
  1. [Section 3] Section 3: Dataset construction appears to rely primarily on translation of English sources for low-resource languages; without documented per-language expert adjudication to establish that queries possess single verifiable clinical answers (rather than context-dependent or culturally variable ones), the logical correctness metric (54.35% at 7B, 70.04% at 32B) risks measuring surface-level consistency instead of genuine reasoning validity.
  2. [Evaluation] Evaluation section: The protocol for scoring logical correctness is not fully specified (e.g., exact LLM judge model, prompt template, handling of partial answers, or inter-rater reliability across languages); this is load-bearing because the central performance claims rest on these automated scores being trustworthy and unbiased toward English-centric norms.
  3. [Results] Results and baselines: Strong baselines are referenced but lack explicit definitions regarding training data overlap, model scale, or whether they receive the same curriculum RL treatment; combined with the absence of error bars or statistical tests on the scaling results, the outperformance claim cannot be fully assessed.
minor comments (2)
  1. [Abstract] Abstract: The number of queries per language and the precise definition of the 'language consistency' metric should be stated for immediate context.
  2. [Method] Notation: The GRPO reward formulation and curriculum progression schedule parameters are introduced without an explicit equation or pseudocode block, making the method harder to reimplement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and noting the revisions incorporated into the updated manuscript to improve clarity, rigor, and reproducibility.

read point-by-point responses
  1. Referee: [Section 3] Section 3: Dataset construction appears to rely primarily on translation of English sources for low-resource languages; without documented per-language expert adjudication to establish that queries possess single verifiable clinical answers (rather than context-dependent or culturally variable ones), the logical correctness metric (54.35% at 7B, 70.04% at 32B) risks measuring surface-level consistency instead of genuine reasoning validity.

    Authors: We appreciate this concern regarding the reliability of the logical correctness metric. The original manuscript described CUREMED-BENCH as consisting of open-ended queries with single verifiable answers, constructed by translating high-quality English medical reasoning items and applying quality filters. In the revision, we have expanded Section 3 with a dedicated subsection on dataset construction that now explicitly documents the verification pipeline: professional translation, followed by review from native-speaker linguists and, where accessible, medical domain experts for each language (including remote consultation for Amharic, Yoruba, and Swahili). We focused on medically universal facts to reduce cultural variability and provide concrete examples of adjudication decisions. These additions directly address the risk of surface-level scoring. revision: yes

  2. Referee: [Evaluation] Evaluation section: The protocol for scoring logical correctness is not fully specified (e.g., exact LLM judge model, prompt template, handling of partial answers, or inter-rater reliability across languages); this is load-bearing because the central performance claims rest on these automated scores being trustworthy and unbiased toward English-centric norms.

    Authors: We agree that full specification of the automated evaluation protocol is essential. The revised Evaluation section now includes: (1) the exact judge model (GPT-4o with temperature 0), (2) the complete prompt template (reproduced in Appendix B), (3) explicit rules for partial answers (assigned 0.5 if the core clinical logic is present but incomplete), and (4) results from a human validation study on a 200-sample subset across all 13 languages showing 91.8% agreement with the LLM judge and Cohen's kappa of 0.87. We also added discussion of potential English-centric bias and mitigation steps via multilingual prompt variants. These details make the scoring protocol fully reproducible and allow readers to assess trustworthiness. revision: yes

  3. Referee: [Results] Results and baselines: Strong baselines are referenced but lack explicit definitions regarding training data overlap, model scale, or whether they receive the same curriculum RL treatment; combined with the absence of error bars or statistical tests on the scaling results, the outperformance claim cannot be fully assessed.

    Authors: We acknowledge the need for greater transparency on baselines and statistical rigor. In the revised Results section and Table 2, we now explicitly define each baseline by: training data sources (confirming zero overlap with CUREMED-BENCH), exact model scales, and training regime (standard SFT only, without curriculum RL or GRPO). We have added error bars computed over three random seeds and included paired t-test results (p < 0.05) for all reported improvements in logical correctness and language consistency. These changes allow direct assessment of the outperformance claims. revision: yes
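The agreement statistics cited in response 2 (91.8% raw agreement, Cohen's kappa 0.87 between the LLM judge and human raters) can be reproduced in form with a short computation. The label sequences below are illustrative, not the paper's data.

```python
# Cohen's kappa for two binary label sequences: observed agreement corrected
# for the agreement expected by chance under independent marginals.

def cohens_kappa(a, b):
    """Kappa between two equal-length sequences of 0/1 labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal rate of positive labels.
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (observed - expected) / (1 - expected)

# Hypothetical judge-vs-human labels over eight answers (1 = correct).
judge = [1, 1, 0, 1, 0, 1, 1, 0]
human = [1, 1, 0, 1, 0, 1, 0, 0]
kappa = cohens_kappa(judge, human)  # 7/8 observed agreement -> kappa 0.75
```

Note that kappa can sit well below raw agreement when one label dominates, which is why reporting both, as the rebuttal does, is the right call.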

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are direct measurements

full rationale

The paper's central claims consist of empirical performance numbers (language consistency and logical correctness percentages) obtained by running the proposed CURE-MED training procedure on the newly introduced CUREMED-BENCH dataset. No mathematical derivation chain, fitted parameters renamed as predictions, self-referential definitions, or load-bearing self-citations appear in the abstract or described framework. The results are experimental outcomes on held-out queries rather than quantities forced by construction from the inputs, so the claims rest on direct measurement rather than circular derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that medical queries admit single verifiable answers and on standard RL training assumptions; no new physical entities are postulated.

free parameters (2)
  • curriculum progression schedule
    The curriculum that gradually increases difficulty likely requires hand-chosen or tuned stage boundaries or difficulty metrics.
  • GRPO reward scaling and group size
    Group Relative Policy Optimization typically involves hyperparameters that are selected to stabilize training.
axioms (2)
  • domain assumption Medical reasoning queries possess single verifiable answers usable for automatic scoring
    Invoked to define the logical correctness metric across languages.
  • ad hoc to paper Code-switching-aware supervised fine-tuning improves subsequent language stability
    Part of the described SFT stage design.
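The "curriculum progression schedule" flagged above as a free parameter can be made concrete with a staging config. The stage names, language groupings, and step counts below are invented for illustration; Figure 1's "progressively training from high-…" suggests a high-to-low-resource ordering, but the paper's actual boundaries are exactly the hand-tuned quantities the ledger calls out.

```python
# Hypothetical curriculum schedule of the kind the ledger flags as a free
# parameter: stage boundaries and language groupings are hand-chosen here,
# not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    languages: list = field(default_factory=list)  # languages active in this stage
    steps: int = 0                                 # RL steps spent in the stage

CURRICULUM = [
    Stage("high-resource", ["en", "es", "fr"], steps=2000),
    Stage("mid-resource", ["hi", "th", "tr"], steps=2000),
    Stage("low-resource", ["am", "yo", "sw"], steps=3000),
]

def stage_at(step: int) -> Stage:
    """Return the active stage for a global training step."""
    boundary = 0
    for stage in CURRICULUM:
        boundary += stage.steps
        if step < boundary:
            return stage
    return CURRICULUM[-1]  # remain in the final stage after the schedule ends
```

Any evaluation of the method's robustness would want these boundaries varied, since the reported gains could be sensitive to exactly where the stages switch.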

pith-pipeline@v0.9.0 · 5507 in / 1471 out tokens · 67468 ms · 2026-05-16T13:17:02.851306+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

    cs.CL 2026-04 unverdicted novelty 7.0

    A unified survey that consolidates Indian NLP resources by task, language, domain, and modality while identifying gaps in coverage and generalization.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    Competition-level code generation with alphacode.Science, 378 (6624):1092–1097, 2022

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode.Science, 378 (6624):1092–1097, 2022. 1

  2. [2]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  3. [3]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 2

  4. [4]

    Towards reasoning in large language models: A survey.arXiv preprint arXiv:2212.10403, 2022

    Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey.arXiv preprint arXiv:2212.10403, 2022. 1, 2

  5. [5]

    Artificial intelligence in clinical decision support: challenges for evaluating ai and practical implications.Yearbook of medical informatics, 28(01):128–134, 2019

    Farah Magrabi, Elske Ammenwerth, Jytte Brender McNair, Nicolet F De Keizer, Hannele Hyppönen, Pirkko Nykänen, Michael Rigby, Philip J Scott, Tuulikki Vehko, Zoie Shui-Yee Wong, et al. Artificial intelligence in clinical decision support: challenges for evaluating ai and practical implications.Yearbook of medical informatics, 28(01):128–134, 2019. 1

  6. [6]

    Clinical implications and challenges of artificial intelligence and deep learning.Jama, 320(11): 1107–1108, 2018

    William W Stead. Clinical implications and challenges of artificial intelligence and deep learning.Jama, 320(11): 1107–1108, 2018. 1

  7. [7]

    Thinking and reasoning in medicine.The Cambridge handbook of thinking and reasoning, 14:727–750, 2005

    Vimla L Patel, José F Arocha, and Jiajie Zhang. Thinking and reasoning in medicine.The Cambridge handbook of thinking and reasoning, 14:727–750, 2005. 1

  8. [8]

    Identifying reasoning strategies in medical decision making: a methodological guide.Journal of biomedical informatics, 38(2):154–171, 2005

    Jose F Arocha, Dongwen Wang, and Vimla L Patel. Identifying reasoning strategies in medical decision making: a methodological guide.Journal of biomedical informatics, 38(2):154–171, 2005. 1

  9. [9]

    Can large language models reason about medical questions?Patterns, 5(3), 2024

    Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. Can large language models reason about medical questions?Patterns, 5(3), 2024. 1

  10. [10]

    Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025. 2

  11. [11]

    Capabilities of GPT-4 on Medical Challenge Problems

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems.arXiv preprint arXiv:2303.13375, 2023. 1, 2

  12. [12]

    Llms are few-shot in-context low-resource language learners.arXiv preprint arXiv:2403.16512, 2024

    Samuel Cahyawijaya, Holy Lovenia, and Pascale Fung. Llms are few-shot in-context low-resource language learners.arXiv preprint arXiv:2403.16512, 2024. 1, 2

  13. [13]

    Democratizing llms for low-resource languages by leveraging their english dominant abilities with linguistically-diverse prompts.arXiv preprint arXiv:2306.11372, 2023

    Xuan-Phi Nguyen, Sharifah Mahani Aljunied, Shafiq Joty, and Lidong Bing. Democratizing llms for low-resource languages by leveraging their english dominant abilities with linguistically-diverse prompts.arXiv preprint arXiv:2306.11372, 2023. 1, 2

  14. [14]

    Explainability for artificial intelligence in healthcare: a multidisciplinary perspective.BMC medical informatics and decision making, 20(1):310, 2020

    Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, Vince I Madai, and Precise4Q Consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective.BMC medical informatics and decision making, 20(1):310, 2020. 1, 2 10 CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

  15. [15]

    A survey on medical large language models: Technology, application, trustworthiness, and future directions.arXiv preprint arXiv:2406.03712, 2024

    Lei Liu, Xiaoyan Yang, Junchi Lei, Yue Shen, Jian Wang, Peng Wei, Zhixuan Chu, Zhan Qin, and Kui Ren. A survey on medical large language models: Technology, application, trustworthiness, and future directions.arXiv preprint arXiv:2406.03712, 2024. 1, 2

  16. [16]

    arXiv preprint arXiv:2308.10792

    Zhang Shengyu, Dong Linfeng, Li Xiaoya, Zhang Sen, Sun Xiaofei, Wang Shuhe, Li Jiwei, Runyi Hu, Zhang Tianwei, Fei Wu, et al. Instruction tuning for large language models: A survey.arXiv preprint arXiv:2308.10792,

  17. [17]

    Towards building multilingual language model for medicine.Nature Communications, 15(1):8384,

    Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards building multilingual language model for medicine.Nature Communications, 15(1):8384,

  18. [18]

    A medical question answering system using large language models and knowledge graphs.International Journal of Intelligent Systems, 37(11):8548–8564, 2022

    Quan Guo, Shuai Cao, and Zhang Yi. A medical question answering system using large language models and knowledge graphs.International Journal of Intelligent Systems, 37(11):8548–8564, 2022. 2

  19. [19]

    A survey of multilingual reasoning in language models.Findings of the Association for Computational Linguistics: EMNLP, 2025:8920–8936, 2025

    Akash Ghosh, Debayan Dutta, Sriparna Saha, and Chirag Agarwal. A survey of multilingual reasoning in language models.Findings of the Association for Computational Linguistics: EMNLP, 2025:8920–8936, 2025. 2

  20. [20]

    Benchmarking large language models on answering and explaining challenging medical questions

    Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. Benchmarking large language models on answering and explaining challenging medical questions. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3563–3599, 2025. 2

  21. [21]

    Addressing cognitive bias in medical language models

    Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, and Rama Chellappa. Addressing cognitive bias in medical language models. arXiv preprint arXiv:2402.08113, 2024. 2

  22. [22]

    Language Models are Multilingual Chain-of-Thought Reasoners

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush V osoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners.arXiv preprint arXiv:2210.03057, 2022. 2

  23. [23]

    Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022. 2

  24. [24]

    Breaking language barriers in multilingual mathematical reasoning: Insights and observations.arXiv preprint arXiv:2310.20246, 2023

    Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations.arXiv preprint arXiv:2310.20246, 2023. 2

  25. [25]

    Mapo: Ad- vancing multilingual reasoning through multilingual alignment-as-preference optimization.arXiv preprint arXiv:2401.06838, 2024

    Shuaijie She, Wei Zou, Shujian Huang, Wenhao Zhu, Xiang Liu, Xiang Geng, and Jiajun Chen. Mapo: Ad- vancing multilingual reasoning through multilingual alignment-as-preference optimization.arXiv preprint arXiv:2401.06838, 2024. 2

  26. [26]

    Speaking multiple languages affects the moral bias of language models.arXiv preprint arXiv:2211.07733, 2022

    Katharina Hämmerl, Björn Deiseroth, Patrick Schramowski, Jindˇrich Libovick`y, Constantin A Rothkopf, Alexan- der Fraser, and Kristian Kersting. Speaking multiple languages affects the moral bias of language models.arXiv preprint arXiv:2211.07733, 2022. 2

  27. [27]

    Code-switching curriculum learning for multilingual transfer in llms.arXiv preprint arXiv:2411.02460, 2024

    Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, and Hwaran Lee. Code-switching curriculum learning for multilingual transfer in llms.arXiv preprint arXiv:2411.02460, 2024

  28. [28]

    Supervised fine-tuning of large language models on human demonstrations through the lens of memorization

    Yubin Ge, Devamanyu Hazarika, Yang Liu, and Mahdi Namazifar. Supervised fine-tuning of large language models on human demonstrations through the lens of memorization. 2023

  29. [29]

    O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?arXiv preprint arXiv:2411.16489, 2024

    Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?arXiv preprint arXiv:2411.16489, 2024

  30. [30]

    Limo: Less is more for reasoning

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025. 2

  31. [31]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. 2

  32. [32]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  33. [33]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  34. [34]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 5 11 CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

  35. [35]

    Reft: Reasoning with reinforced fine-tuning

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning.arXiv preprint arXiv:2401.08967, 2024. 2

  36. [36]

    Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

    Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022. 3

  37. [37]

    Stanford alpaca: An instruction-following llama model, 2023

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

  38. [38]

    LIMA: Less Is More for Alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023

  39. [39]

    Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv preprint arXiv:2204.07705, 2022

  40. [40]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  41. [41]

    CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

    Akash Ghosh, Srivarshinee Sridhar, Raghav Kaushik Ravi, Muhsin Muhsin, Sriparna Saha, and Chirag Agarwal. CLINIC: Evaluating multilingual trustworthiness in language models for healthcare. arXiv, 2025. 3

  42. [42]

    HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925, 2024. 3, 6, 13

  43. [43]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025. 3

  44. [44]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 3

  45. [45]

    Rewardbench: Evaluating reward models for language modeling

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797, 2025. 4

  46. [46]

    Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

    Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024. 4

  47. [47]

    Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

    Hritik Bansal, John Dang, and Aditya Grover. Peering through preferences: Unraveling feedback acquisition for aligning large language models. arXiv preprint arXiv:2308.15812, 2023. 4

  48. [48]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023. 4

  49. [49]

    Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains

    Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025. 4

  50. [50]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 5

  51. [51]

    Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

    Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, and Paul Pu Liang. Learn globally, speak locally: Bridging the gaps in multilingual reasoning. arXiv preprint arXiv:2507.05418, 2025. 5, 16

  52. [52]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  53. [53]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024. 5

  54. [54]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. 5

  55. [55]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. URL https://arxiv...

  56. [56]

    Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts

    Guorui Zheng, Xidong Wang, Juhao Liang, Nuo Chen, Yuping Zheng, and Benyou Wang. Efficiently democratizing medical LLMs for 50 languages via a mixture of language family experts, 2024. URL https://arxiv.org/abs/2410.10626. 5

  57. [57]

    Un ministral, des ministraux, October 2024

    Mistral AI Team. Un ministral, des ministraux, October 2024. URL https://mistral.ai/news/ministraux. Accessed: 2025-12-24. 5

  58. [58]

    MedAlpaca: An Open-Source Collection of Medical Conversational AI Models and Training Data

    Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexei Figueroa, Alexander Löser, Daniel Truhn, and Keno K. Bressem. MedAlpaca – an open-source collection of medical conversational AI models and training data, 2025. URL https://arxiv.org/abs/2304.08247. 5

  59. [59]

    MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

    Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. MEDITRON-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023. 5

  60. [60]

    UltraMedical: Building Specialized Generalists in Biomedicine

    Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, et al. UltraMedical: Building specialized generalists in biomedicine. Advances in Neural Information Processing Systems, 37:26045–26081, 2024. 5

  61. [61]

    Huatuogpt, towards taming language model to be a doctor

    H Zhang, J Chen, F Jiang, F Yu, Z Chen, J Li, G Chen, X Wu, Z Zhang, Q Xiao, et al. HuatuoGPT, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075, 2023. 5, 16

  62. [62]

    Openbiollm: Llama3-based biomedical large language model

    Saama AI Labs. OpenBioLLM: Llama3-based biomedical large language model. https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B, 2024. Model card. Paper in preparation. 5

  63. [63]

    BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

    Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. BioMistral: A collection of open-source pretrained large language models for medical domains, 2024. URL https://arxiv.org/abs/2402.10373. 5

  64. [64]

    MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

    Iñigo Alonso, Maite Oronoz, and Rodrigo Agerri. MedExpQA: Multilingual benchmarking of large language models for medical question answering. Artificial Intelligence in Medicine, 155:102938, 2024. 8

  65. [65]

    What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. 8

Appendix A: LLM-as-a-Judge Verification Protocol

Inspired by [42], we employ an LLM-as-a-judge framework to automatically ...
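The verification step sketched here can be implemented as a thin wrapper around any chat-completion client. This is an illustrative sketch only: the prompt wording, the `call_llm` callable, and the True/False reply convention are assumptions, not the paper's exact protocol.

```python
# Hedged sketch of an LLM-as-a-judge correctness check. `call_llm` stands in
# for any chat-completion client (its exact API is an assumption).
from typing import Callable

JUDGE_TEMPLATE = (
    "You are a strict medical answer verifier.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    'Reply "True" if the response is correct and consistent with the '
    'reference answer, otherwise reply "False".'
)

def judge_correct(question: str, reference: str, response: str,
                  call_llm: Callable[[str], str]) -> bool:
    """Return True iff the judge model deems the response correct."""
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference,
                                   response=response)
    verdict = call_llm(prompt).strip().lower()
    return verdict.startswith("true")

# Usage with a stub judge that naively checks for a keyword in the prompt:
stub = lambda p: "True" if "metformin" in p.lower() else "False"
print(judge_correct("First-line drug for type 2 diabetes?",
                    "Metformin", "Metformin is typically first-line.",
                    stub))  # prints True
```

Decoupling the judge call behind a callable keeps the verification logic testable without network access; a real deployment would swap in an actual model client.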

  1. Medical Grounding: All information must be sourced from MedlinePlus, covering symptoms, causes, risk factors, diagnostics, treatments, or prevention strategies.

  2. Independent Composition: Each language version must be originally written (not translated) using natural phrasing and medically appropriate terminology for that language.

  3. Clinical Reasoning Depth: Questions must require genuine clinical reasoning beyond trivial fact recall. Each question should have exactly one unambiguous correct answer.

  4. Format: 4-option MCQ (A/B/C/D) with one correct answer.

Output Format: Return a valid JSON array: [{"question_id": "<id>", "source_concept": "<MedlinePlus_topic>", "mcq_items": [{"languag...
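A minimal sanity check for records in this output format might look like the following sketch. The top-level keys ("question_id", "source_concept", "mcq_items") appear in the prompt above; the per-item keys ("language", "question", "options", "answer") are assumptions, since the original schema is truncated.

```python
# Sketch of a validator for generated MCQ records. Per-item keys are
# assumed, as the source schema is truncated after "mcq_items".
import json

def validate_mcq_record(record: dict) -> bool:
    """Check one record: required keys, 4 options labeled A-D, one correct answer."""
    if not {"question_id", "source_concept", "mcq_items"} <= record.keys():
        return False
    for item in record["mcq_items"]:
        options = item.get("options", {})
        if set(options) != {"A", "B", "C", "D"}:
            return False  # must have exactly four options labeled A-D
        if item.get("answer") not in options:
            return False  # answer must name one of the option labels
    return True

sample = json.loads("""[{
  "question_id": "q1",
  "source_concept": "hypertension",
  "mcq_items": [{"language": "es",
                 "question": "...",
                 "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
                 "answer": "B"}]
}]""")
print(all(validate_mcq_record(r) for r in sample))  # prints True
```

Rejecting malformed records before training keeps the "exactly one unambiguous correct answer" constraint machine-checkable rather than trusting the generator.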