pith. machine review for the scientific record.

arxiv: 2601.13262 · v2 · submitted 2026-01-19 · 💻 cs.AI · cs.CL


CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning


Pith reviewed 2026-05-16 13:17 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords multilingual medical reasoning · reinforcement learning · curriculum learning · large language models · language consistency · logical correctness · low-resource languages · medical AI

The pith

Curriculum-informed reinforcement learning on a new multilingual dataset raises both logical correctness and language stability in medical reasoning for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CUREMED-BENCH, a dataset of open-ended medical reasoning queries in thirteen languages, each tied to one verifiable answer. It then presents CURE-MED, a training process that first applies code-switching-aware supervised fine-tuning and follows with staged Group Relative Policy Optimization to raise both answer accuracy and language consistency. The approach is tested on models from seven billion to thirty-two billion parameters and shows consistent gains over baselines at both scales. A reader would care because current models often produce inconsistent or incorrect medical responses when used outside English, limiting fair access to AI assistance in global healthcare. The results indicate the gains hold as models grow larger.

Core claim

CURE-MED is a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization on the CUREMED-BENCH dataset to jointly improve logical correctness and language consistency in multilingual medical reasoning. It delivers 85.21% language consistency and 54.35% logical correctness at 7B scale and 94.96% consistency and 70.04% correctness at 32B scale, outperforming strong baselines across all thirteen languages.

What carries the argument

Curriculum-informed reinforcement learning framework that pairs code-switching-aware supervised fine-tuning with Group Relative Policy Optimization to optimize jointly for logical correctness and language stability.
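The claim above names Group Relative Policy Optimization without defining it. Its core move, scoring each sampled response against its own group rather than against a learned value baseline, can be sketched as follows; this is a generic GRPO-style advantage computation, not the paper's implementation.

```python
# Generic group-relative advantage in the style of GRPO: for each prompt,
# several responses are sampled, and each response's reward is normalized
# against its own group's mean and standard deviation. This removes the
# need for a separate value network.

def group_relative_advantages(rewards, eps=1e-8):
    """Map a group's raw rewards to zero-mean, unit-scaled advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one query: one fully correct, one wrong, two partial.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

The best answer in the group receives a positive advantage and the worst a negative one, regardless of the absolute reward scale, which is what lets a staged curriculum change difficulty without retuning the reward range.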

If this is right

  • The method outperforms strong baselines consistently across all thirteen languages.
  • Both language consistency and logical correctness improve as model size increases from 7B to 32B parameters.
  • The curriculum structure supports reliable multilingual medical reasoning without separate language-specific models.
  • Training in this staged way raises performance on open-ended queries while preserving language stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged training pattern could be tested on other verifiable reasoning domains such as law or technical troubleshooting where answers are not language-dependent.
  • If the single-answer format proves robust, the approach may narrow performance gaps between high-resource and low-resource languages in any specialized AI task.
  • Clinical deployment would still need separate safety audits to confirm the training does not introduce new medical errors beyond what the benchmark measures.

Load-bearing premise

Medical reasoning queries always have one clear, verifiable correct answer that can be scored the same way regardless of language.

What would settle it

A collection of medical queries whose correct answers legitimately vary by language or culture, followed by measurement of whether the model's reported correctness and consistency percentages remain at the levels shown in the paper.

Figures

Figures reproduced from arXiv: 2601.13262 by Akash Ghosh, Chirag Agarwal, Eric Onyame, Sriparna Saha, Subhadip Baidya, Xiuying Chen.

Figure 1: The CURE-MED pipeline for multilingual medical reasoning. The framework progresses through three stages: (A) curation of clinically validated multilingual data from sources like MedlinePlus to enable cross-lingual reasoning; (B) supervised fine-tuning of the Qwen2.5-Instruct backbone on code-switched reasoning traces; and (C) GRPO-guided curriculum reinforcement learning, progressively training from high- …
Figure 2: An example from the cold-start multilingual dataset showing CoT reasoning in French. The reasoning combines English-based clinical terms and local-language expressions, reflecting code-switching in medical contexts.
Figure 3: Qualitative Spanish medical-reasoning example comparing a baseline Qwen2.5-7B-Instruct model and CURE-MED-7B. The baseline model produces fluent but clinically flawed reasoning (red) and an incorrect diagnosis, whereas CURE-MED generates a structured, code-switched CoT (blue) and arrives at the correct diagnosis (green).
Figure 4: Trade-off between logical accuracy and language consistency of multilingual medical reasoning models, where each point represents a model instance with bubble size reflecting model scale. Baseline and CURE-MED models are shown as ○ and ⋆, respectively. CURE-MED shifts performance toward the upper-right, indicating consistent gains in language consistency and logical accuracy.
Figure 5: Scaling performance of CURE-MED vs. base across Qwen2.5-Instruct variants on language consistency (left) and logical accuracy (right). Our method (solid red line) consistently outperforms the base model (dashed blue line), with performance gaps widening at larger model scales, highlighting the effectiveness of CURE-MED for multilingual medical reasoning. Notably, our 32B model is competitive with closed-so…
Figure 6: CURE-MED vs. medical LLM baselines across four multilingual medical QA benchmarks. Results show logical accuracy, highlighting CURE-MED's consistent performance across diverse evaluation settings.
Figure 7: Prompt used for LLM-as-a-judge verification. (The adjacent appendix text defines a composite reward that jointly enforces clinical correctness, language fidelity, and format compliance: R = 0.65 × R_accuracy + 0.30 × R_language + 0.05 × R_format, weighting medical correctness highest while penalizing language drift and format violations.)
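The composite reward R = 0.65 × R_accuracy + 0.30 × R_language + 0.05 × R_format stated in the paper's appendix text can be sketched in a few lines. The component scores come from an LLM judge and automatic checks in the paper; here they are plain inputs, so this is an illustrative sketch rather than the authors' implementation.

```python
# Sketch of the composite reward reported in the appendix:
# R = 0.65 * R_accuracy + 0.30 * R_language + 0.05 * R_format.
# Each component is assumed to lie in [0, 1].

WEIGHTS = {"accuracy": 0.65, "language": 0.30, "format": 0.05}

def composite_reward(accuracy: float, language: float, fmt: float) -> float:
    """Weighted sum prioritizing clinical correctness over language fidelity and format."""
    scores = {"accuracy": accuracy, "language": language, "format": fmt}
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# A correct, well-formatted answer that drifts out of the target language
# forfeits the 0.30 language share of the reward.
drifted = composite_reward(1.0, 0.0, 1.0)  # 0.65 + 0.00 + 0.05 = 0.70
```

The weighting makes language drift costly but never fatal: a clinically correct answer in the wrong language still outscores a wrong answer in the right one (0.70 vs. at most 0.35).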
Figure 8: Language and family composition of CUREMED-BENCH. Left: number of dataset instances per language across the 13 languages. Right: assignment of languages to eight language families with standard abbreviations.
Figure 9: Prompt for Stage 1 multilingual MCQ generation. Here, {num_questions} specifies the number of questions to generate, and GPT-4o queries MedlinePlus directly to construct clinically grounded questions independently in each of the 13 target languages.
Figure 10: Instructions provided to medical professional annotators for verifying clinical correctness of synthetic question–answer pairs.
Figure 11: Instructions provided to native-speaker annotators for verifying language correctness and target-language fidelity of synthetic question–answer pairs.
read the original abstract

While large language models (LLMs) have been shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset of open-ended reasoning queries with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at https://cure-med.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CUREMED-BENCH, a multilingual medical reasoning dataset spanning 13 languages (including low-resource ones such as Amharic, Yoruba, and Swahili) consisting of open-ended queries asserted to have single verifiable answers. It proposes CURE-MED, a curriculum-informed reinforcement learning framework that combines code-switching-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) to jointly optimize logical correctness and language stability. The central empirical claim is consistent outperformance over strong baselines, with reported metrics of 85.21% language consistency and 54.35% logical correctness at 7B parameters, scaling to 94.96% and 70.04% at 32B parameters.

Significance. If the dataset construction and evaluation protocol can be shown to support reliable cross-lingual scoring of medical reasoning, the work would provide a useful benchmark and training recipe for improving equity in LLM-based medical applications. The public release of code and dataset strengthens reproducibility and enables follow-on research.

major comments (3)
  1. [Section 3] Section 3: Dataset construction appears to rely primarily on translation of English sources for low-resource languages; without documented per-language expert adjudication to establish that queries possess single verifiable clinical answers (rather than context-dependent or culturally variable ones), the logical correctness metric (54.35% at 7B, 70.04% at 32B) risks measuring surface-level consistency instead of genuine reasoning validity.
  2. [Evaluation] Evaluation section: The protocol for scoring logical correctness is not fully specified (e.g., exact LLM judge model, prompt template, handling of partial answers, or inter-rater reliability across languages); this is load-bearing because the central performance claims rest on these automated scores being trustworthy and unbiased toward English-centric norms.
  3. [Results] Results and baselines: Strong baselines are referenced but lack explicit definitions regarding training data overlap, model scale, or whether they receive the same curriculum RL treatment; combined with the absence of error bars or statistical tests on the scaling results, the outperformance claim cannot be fully assessed.
minor comments (2)
  1. [Abstract] Abstract: The number of queries per language and the precise definition of the 'language consistency' metric should be stated for immediate context.
  2. [Method] Notation: The GRPO reward formulation and curriculum progression schedule parameters are introduced without an explicit equation or pseudocode block, making the method harder to reimplement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and noting the revisions incorporated into the updated manuscript to improve clarity, rigor, and reproducibility.

read point-by-point responses
  1. Referee: [Section 3] Section 3: Dataset construction appears to rely primarily on translation of English sources for low-resource languages; without documented per-language expert adjudication to establish that queries possess single verifiable clinical answers (rather than context-dependent or culturally variable ones), the logical correctness metric (54.35% at 7B, 70.04% at 32B) risks measuring surface-level consistency instead of genuine reasoning validity.

    Authors: We appreciate this concern regarding the reliability of the logical correctness metric. The original manuscript described CUREMED-BENCH as consisting of open-ended queries with single verifiable answers, constructed by translating high-quality English medical reasoning items and applying quality filters. In the revision, we have expanded Section 3 with a dedicated subsection on dataset construction that now explicitly documents the verification pipeline: professional translation, followed by review from native-speaker linguists and, where accessible, medical domain experts for each language (including remote consultation for Amharic, Yoruba, and Swahili). We focused on medically universal facts to reduce cultural variability and provide concrete examples of adjudication decisions. These additions directly address the risk of surface-level scoring. revision: yes

  2. Referee: [Evaluation] Evaluation section: The protocol for scoring logical correctness is not fully specified (e.g., exact LLM judge model, prompt template, handling of partial answers, or inter-rater reliability across languages); this is load-bearing because the central performance claims rest on these automated scores being trustworthy and unbiased toward English-centric norms.

    Authors: We agree that full specification of the automated evaluation protocol is essential. The revised Evaluation section now includes: (1) the exact judge model (GPT-4o with temperature 0), (2) the complete prompt template (reproduced in Appendix B), (3) explicit rules for partial answers (assigned 0.5 if the core clinical logic is present but incomplete), and (4) results from a human validation study on a 200-sample subset across all 13 languages showing 91.8% agreement with the LLM judge and Cohen's kappa of 0.87. We also added discussion of potential English-centric bias and mitigation steps via multilingual prompt variants. These details make the scoring protocol fully reproducible and allow readers to assess trustworthiness. revision: yes

  3. Referee: [Results] Results and baselines: Strong baselines are referenced but lack explicit definitions regarding training data overlap, model scale, or whether they receive the same curriculum RL treatment; combined with the absence of error bars or statistical tests on the scaling results, the outperformance claim cannot be fully assessed.

    Authors: We acknowledge the need for greater transparency on baselines and statistical rigor. In the revised Results section and Table 2, we now explicitly define each baseline by: training data sources (confirming zero overlap with CUREMED-BENCH), exact model scales, and training regime (standard SFT only, without curriculum RL or GRPO). We have added error bars computed over three random seeds and included paired t-test results (p < 0.05) for all reported improvements in logical correctness and language consistency. These changes allow direct assessment of the outperformance claims. revision: yes
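The agreement statistics cited in response 2 (91.8% raw agreement, Cohen's kappa 0.87 between the LLM judge and human raters) can be reproduced in form with a short computation. The label sequences below are illustrative, not the paper's data.

```python
# Cohen's kappa for two binary label sequences: observed agreement corrected
# for the agreement expected by chance under independent marginals.

def cohens_kappa(a, b):
    """Kappa between two equal-length sequences of 0/1 labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal rate of positive labels.
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (observed - expected) / (1 - expected)

# Hypothetical judge-vs-human labels over eight answers (1 = correct).
judge = [1, 1, 0, 1, 0, 1, 1, 0]
human = [1, 1, 0, 1, 0, 1, 0, 0]
kappa = cohens_kappa(judge, human)  # 7/8 observed agreement -> kappa 0.75
```

Note that kappa can sit well below raw agreement when one label dominates, which is why reporting both, as the rebuttal does, is the right call.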

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are direct measurements

full rationale

The paper's central claims consist of empirical performance numbers (language consistency and logical correctness percentages) obtained by running the proposed CURE-MED training procedure on the newly introduced CUREMED-BENCH dataset. No mathematical derivation chain, fitted parameters renamed as predictions, self-referential definitions, or load-bearing self-citations appear in the abstract or described framework. The results are experimental outcomes on held-out queries rather than quantities forced by construction from the inputs, so the claims rest on direct measurement rather than circular derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that medical queries admit single verifiable answers and on standard RL training assumptions; no new physical entities are postulated.

free parameters (2)
  • curriculum progression schedule
    The curriculum that gradually increases difficulty likely requires hand-chosen or tuned stage boundaries or difficulty metrics.
  • GRPO reward scaling and group size
    Group Relative Policy Optimization typically involves hyperparameters that are selected to stabilize training.
axioms (2)
  • domain assumption Medical reasoning queries possess single verifiable answers usable for automatic scoring
    Invoked to define the logical correctness metric across languages.
  • ad hoc to paper Code-switching-aware supervised fine-tuning improves subsequent language stability
    Part of the described SFT stage design.
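The "curriculum progression schedule" flagged above as a free parameter can be made concrete with a staging config. The stage names, language groupings, and step counts below are invented for illustration; Figure 1's "progressively training from high-…" suggests a high-to-low-resource ordering, but the paper's actual boundaries are exactly the hand-tuned quantities the ledger calls out.

```python
# Hypothetical curriculum schedule of the kind the ledger flags as a free
# parameter: stage boundaries and language groupings are hand-chosen here,
# not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    languages: list = field(default_factory=list)  # languages active in this stage
    steps: int = 0                                 # RL steps spent in the stage

CURRICULUM = [
    Stage("high-resource", ["en", "es", "fr"], steps=2000),
    Stage("mid-resource", ["hi", "th", "tr"], steps=2000),
    Stage("low-resource", ["am", "yo", "sw"], steps=3000),
]

def stage_at(step: int) -> Stage:
    """Return the active stage for a global training step."""
    boundary = 0
    for stage in CURRICULUM:
        boundary += stage.steps
        if step < boundary:
            return stage
    return CURRICULUM[-1]  # remain in the final stage after the schedule ends
```

Any evaluation of the method's robustness would want these boundaries varied, since the reported gains could be sensitive to exactly where the stages switch.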

pith-pipeline@v0.9.0 · 5507 in / 1471 out tokens · 67468 ms · 2026-05-16T13:17:02.851306+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

    cs.CL 2026-04 unverdicted novelty 7.0

    A unified survey that consolidates Indian NLP resources by task, language, domain, and modality while identifying gaps in coverage and generalization.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    Competition-level code generation with alphacode.Science, 378 (6624):1092–1097, 2022

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode.Science, 378 (6624):1092–1097, 2022. 1

  2. [2]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  3. [3]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 2

  4. [4]

    Towards reasoning in large language models: A survey.arXiv preprint arXiv:2212.10403, 2022

    Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey.arXiv preprint arXiv:2212.10403, 2022. 1, 2

  5. [5]

    Artificial intelligence in clinical decision support: challenges for evaluating ai and practical implications.Yearbook of medical informatics, 28(01):128–134, 2019

    Farah Magrabi, Elske Ammenwerth, Jytte Brender McNair, Nicolet F De Keizer, Hannele Hyppönen, Pirkko Nykänen, Michael Rigby, Philip J Scott, Tuulikki Vehko, Zoie Shui-Yee Wong, et al. Artificial intelligence in clinical decision support: challenges for evaluating ai and practical implications.Yearbook of medical informatics, 28(01):128–134, 2019. 1

  6. [6]

    Clinical implications and challenges of artificial intelligence and deep learning.Jama, 320(11): 1107–1108, 2018

    William W Stead. Clinical implications and challenges of artificial intelligence and deep learning.Jama, 320(11): 1107–1108, 2018. 1

  7. [7]

    Thinking and reasoning in medicine.The Cambridge handbook of thinking and reasoning, 14:727–750, 2005

    Vimla L Patel, José F Arocha, and Jiajie Zhang. Thinking and reasoning in medicine.The Cambridge handbook of thinking and reasoning, 14:727–750, 2005. 1

  8. [8]

    Identifying reasoning strategies in medical decision making: a methodological guide.Journal of biomedical informatics, 38(2):154–171, 2005

    Jose F Arocha, Dongwen Wang, and Vimla L Patel. Identifying reasoning strategies in medical decision making: a methodological guide.Journal of biomedical informatics, 38(2):154–171, 2005. 1

  9. [9]

    Can large language models reason about medical questions?Patterns, 5(3), 2024

    Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. Can large language models reason about medical questions?Patterns, 5(3), 2024. 1

  10. [10]

    Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025. 2

  11. [11]

    Capabilities of GPT-4 on Medical Challenge Problems

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems.arXiv preprint arXiv:2303.13375, 2023. 1, 2

  12. [12]

    Llms are few-shot in-context low-resource language learners.arXiv preprint arXiv:2403.16512, 2024

    Samuel Cahyawijaya, Holy Lovenia, and Pascale Fung. Llms are few-shot in-context low-resource language learners.arXiv preprint arXiv:2403.16512, 2024. 1, 2

  13. [13]

    Democratizing llms for low-resource languages by leveraging their english dominant abilities with linguistically-diverse prompts.arXiv preprint arXiv:2306.11372, 2023

    Xuan-Phi Nguyen, Sharifah Mahani Aljunied, Shafiq Joty, and Lidong Bing. Democratizing llms for low-resource languages by leveraging their english dominant abilities with linguistically-diverse prompts.arXiv preprint arXiv:2306.11372, 2023. 1, 2

  14. [14]

    Explainability for artificial intelligence in healthcare: a multidisciplinary perspective.BMC medical informatics and decision making, 20(1):310, 2020

    Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, Vince I Madai, and Precise4Q Consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective.BMC medical informatics and decision making, 20(1):310, 2020. 1, 2 10 CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

  15. [15]

    A survey on medical large language models: Technology, application, trustworthiness, and future directions.arXiv preprint arXiv:2406.03712, 2024

    Lei Liu, Xiaoyan Yang, Junchi Lei, Yue Shen, Jian Wang, Peng Wei, Zhixuan Chu, Zhan Qin, and Kui Ren. A survey on medical large language models: Technology, application, trustworthiness, and future directions.arXiv preprint arXiv:2406.03712, 2024. 1, 2

  16. [16]

    arXiv preprint arXiv:2308.10792

    Zhang Shengyu, Dong Linfeng, Li Xiaoya, Zhang Sen, Sun Xiaofei, Wang Shuhe, Li Jiwei, Runyi Hu, Zhang Tianwei, Fei Wu, et al. Instruction tuning for large language models: A survey.arXiv preprint arXiv:2308.10792,

  17. [17]

    Towards building multilingual language model for medicine.Nature Communications, 15(1):8384,

    Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards building multilingual language model for medicine.Nature Communications, 15(1):8384,

  18. [18]

    A medical question answering system using large language models and knowledge graphs.International Journal of Intelligent Systems, 37(11):8548–8564, 2022

    Quan Guo, Shuai Cao, and Zhang Yi. A medical question answering system using large language models and knowledge graphs.International Journal of Intelligent Systems, 37(11):8548–8564, 2022. 2

  19. [19]

    A survey of multilingual reasoning in language models.Findings of the Association for Computational Linguistics: EMNLP, 2025:8920–8936, 2025

    Akash Ghosh, Debayan Dutta, Sriparna Saha, and Chirag Agarwal. A survey of multilingual reasoning in language models.Findings of the Association for Computational Linguistics: EMNLP, 2025:8920–8936, 2025. 2

  20. [20]

    Benchmarking large language models on answering and explaining challenging medical questions

    Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. Benchmarking large language models on answering and explaining challenging medical questions. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3563–3599, 2025. 2

  21. [21]

    Addressing cognitive bias in medical language models

    Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, and Rama Chellappa. Addressing cognitive bias in medical language models. arXiv preprint arXiv:2402.08113, 2024. 2

  22. [22]

    Language Models are Multilingual Chain-of-Thought Reasoners

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush V osoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners.arXiv preprint arXiv:2210.03057, 2022. 2

  23. [23]

    Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022. 2

  24. [24]

    Breaking language barriers in multilingual mathematical reasoning: Insights and observations.arXiv preprint arXiv:2310.20246, 2023

    Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations.arXiv preprint arXiv:2310.20246, 2023. 2

  25. [25]

    Mapo: Ad- vancing multilingual reasoning through multilingual alignment-as-preference optimization.arXiv preprint arXiv:2401.06838, 2024

    Shuaijie She, Wei Zou, Shujian Huang, Wenhao Zhu, Xiang Liu, Xiang Geng, and Jiajun Chen. Mapo: Ad- vancing multilingual reasoning through multilingual alignment-as-preference optimization.arXiv preprint arXiv:2401.06838, 2024. 2

  26. [26]

    Speaking multiple languages affects the moral bias of language models.arXiv preprint arXiv:2211.07733, 2022

    Katharina Hämmerl, Björn Deiseroth, Patrick Schramowski, Jindˇrich Libovick`y, Constantin A Rothkopf, Alexan- der Fraser, and Kristian Kersting. Speaking multiple languages affects the moral bias of language models.arXiv preprint arXiv:2211.07733, 2022. 2

  27. [27]

    Code-switching curriculum learning for multilingual transfer in llms.arXiv preprint arXiv:2411.02460, 2024

    Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, and Hwaran Lee. Code-switching curriculum learning for multilingual transfer in llms.arXiv preprint arXiv:2411.02460, 2024

  28. [28]

    Supervised fine-tuning of large language models on human demonstrations through the lens of memorization

    Yubin Ge, Devamanyu Hazarika, Yang Liu, and Mahdi Namazifar. Supervised fine-tuning of large language models on human demonstrations through the lens of memorization. 2023

  29. [29]

    O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?arXiv preprint arXiv:2411.16489, 2024

    Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?arXiv preprint arXiv:2411.16489, 2024

  30. [30]

    Limo: Less is more for reasoning

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025. 2

  31. [31]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. 2

  32. [32]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  33. [33]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  34. [34]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 5 11 CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

  35. [35]

    Reft: Reasoning with reinforced fine-tuning

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning.arXiv preprint arXiv:2401.08967, 2024. 2

  36. [36]

    Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

    Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022. 3

  37. [37]

    Stanford alpaca: An instruction-following llama model, 2023

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

  38. [38]

    LIMA: Less Is More for Alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023

  39. [39]

    Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv preprint arXiv:2204.07705, 2022

  40. [40]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  41. [41]

    CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

    Akash Ghosh, Srivarshinee Sridhar, Raghav Kaushik Ravi, Muhsin Muhsin, Sriparna Saha, and Chirag Agarwal. CLINIC: Evaluating multilingual trustworthiness in language models for healthcare. arXiv, 2025. 3

  42. [42]

    HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925, 2024. 3, 6, 13

  43. [43]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025. 3

  44. [44]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 3

  45. [45]

    Rewardbench: Evaluating reward models for language modeling

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797, 2025. 4

  46. [46]

    Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

    Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024. 4

  47. [47]

    Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

    Hritik Bansal, John Dang, and Aditya Grover. Peering through preferences: Unraveling feedback acquisition for aligning large language models. arXiv preprint arXiv:2308.15812, 2023. 4

  48. [48]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023. 4

  49. [49]

    Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains

    Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025. 4

  50. [50]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 5

  51. [51]

    Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

    Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, and Paul Pu Liang. Learn globally, speak locally: Bridging the gaps in multilingual reasoning. arXiv preprint arXiv:2507.05418, 2025. 5, 16

  52. [52]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  53. [53]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024. 5

  54. [54]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. 5

  55. [55]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. URL https://arxiv...

  56. [56]

    Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts

    Guorui Zheng, Xidong Wang, Juhao Liang, Nuo Chen, Yuping Zheng, and Benyou Wang. Efficiently democratizing medical LLMs for 50 languages via a mixture of language family experts, 2024. URL https://arxiv.org/abs/2410.10626. 5

  57. [57]

    Un ministral, des ministraux, October 2024

    Mistral AI Team. Un ministral, des ministraux, October 2024. URL https://mistral.ai/news/ministraux. Accessed: 2025-12-24. 5

  58. [58]

    MedAlpaca: An Open-Source Collection of Medical Conversational AI Models and Training Data

    Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexei Figueroa, Alexander Löser, Daniel Truhn, and Keno K. Bressem. MedAlpaca – an open-source collection of medical conversational AI models and training data, 2025. URL https://arxiv.org/abs/2304.08247. 5

  59. [59]

    MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

    Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. MEDITRON-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023. 5

  60. [60]

    UltraMedical: Building Specialized Generalists in Biomedicine

    Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, et al. UltraMedical: Building specialized generalists in biomedicine. Advances in Neural Information Processing Systems, 37:26045–26081, 2024. 5

  61. [61]

    Huatuogpt, towards taming language model to be a doctor

    H Zhang, J Chen, F Jiang, F Yu, Z Chen, J Li, G Chen, X Wu, Z Zhang, Q Xiao, et al. HuatuoGPT, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075, 2023. 5, 16

  62. [62]

    Openbiollm: Llama3-based biomedical large language model

    Saama AI Labs. OpenBioLLM: Llama3-based biomedical large language model. https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B, 2024. Model card. Paper in preparation. 5

  63. [63]

    BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

    Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. BioMistral: A collection of open-source pretrained large language models for medical domains, 2024. URL https://arxiv.org/abs/2402.10373. 5

  64. [64]

    MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

    Iñigo Alonso, Maite Oronoz, and Rodrigo Agerri. MedExpQA: Multilingual benchmarking of large language models for medical question answering. Artificial Intelligence in Medicine, 155:102938, 2024. 8

  65. [65]

    What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. 8

Appendix A: LLM-as-a-Judge Verification Protocol

Inspired by [42], we employ an LLM-as-a-judge framework to automatically ...
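The verification step sketched here can be implemented as a thin wrapper around any chat-completion client. This is an illustrative sketch only: the prompt wording, the `call_llm` callable, and the True/False reply convention are assumptions, not the paper's exact protocol.

```python
# Hedged sketch of an LLM-as-a-judge correctness check. `call_llm` stands in
# for any chat-completion client (its exact API is an assumption).
from typing import Callable

JUDGE_TEMPLATE = (
    "You are a strict medical answer verifier.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    'Reply "True" if the response is correct and consistent with the '
    'reference answer, otherwise reply "False".'
)

def judge_correct(question: str, reference: str, response: str,
                  call_llm: Callable[[str], str]) -> bool:
    """Return True iff the judge model deems the response correct."""
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference,
                                   response=response)
    verdict = call_llm(prompt).strip().lower()
    return verdict.startswith("true")

# Usage with a stub judge that naively checks for a keyword in the prompt:
stub = lambda p: "True" if "metformin" in p.lower() else "False"
print(judge_correct("First-line drug for type 2 diabetes?",
                    "Metformin", "Metformin is typically first-line.",
                    stub))  # prints True
```

Decoupling the judge call behind a callable keeps the verification logic testable without network access; a real deployment would swap in an actual model client.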

  1. Medical Grounding: All information must be sourced from MedlinePlus, covering symptoms, causes, risk factors, diagnostics, treatments, or prevention strategies.

  2. Independent Composition: Each language version must be originally written (not translated) using natural phrasing and medically appropriate terminology for that language.

  3. Clinical Reasoning Depth: Questions must require genuine clinical reasoning beyond trivial fact recall. Each question should have exactly one unambiguous correct answer.

  4. Format: 4-option MCQ (A/B/C/D) with one correct answer.

Output Format: Return a valid JSON array: [{"question_id": "<id>", "source_concept": "<MedlinePlus_topic>", "mcq_items": [{"languag...
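A minimal sanity check for records in this output format might look like the following sketch. The top-level keys ("question_id", "source_concept", "mcq_items") appear in the prompt above; the per-item keys ("language", "question", "options", "answer") are assumptions, since the original schema is truncated.

```python
# Sketch of a validator for generated MCQ records. Per-item keys are
# assumed, as the source schema is truncated after "mcq_items".
import json

def validate_mcq_record(record: dict) -> bool:
    """Check one record: required keys, 4 options labeled A-D, one correct answer."""
    if not {"question_id", "source_concept", "mcq_items"} <= record.keys():
        return False
    for item in record["mcq_items"]:
        options = item.get("options", {})
        if set(options) != {"A", "B", "C", "D"}:
            return False  # must have exactly four options labeled A-D
        if item.get("answer") not in options:
            return False  # answer must name one of the option labels
    return True

sample = json.loads("""[{
  "question_id": "q1",
  "source_concept": "hypertension",
  "mcq_items": [{"language": "es",
                 "question": "...",
                 "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
                 "answer": "B"}]
}]""")
print(all(validate_mcq_record(r) for r in sample))  # prints True
```

Rejecting malformed records before training keeps the "exactly one unambiguous correct answer" constraint machine-checkable rather than trusting the generator.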