Recognition: no theorem link
CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
Pith reviewed 2026-05-16 13:17 UTC · model grok-4.3
The pith
Curriculum-informed reinforcement learning on a new multilingual dataset raises both logical correctness and language stability in medical reasoning for large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CURE-MED is a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization on the CUREMED-BENCH dataset to jointly improve logical correctness and language consistency in multilingual medical reasoning. It delivers 85.21% language consistency and 54.35% logical correctness at 7B scale, and 94.96% consistency and 70.04% correctness at 32B scale, while outperforming strong baselines across all thirteen languages.
What carries the argument
A curriculum-informed reinforcement learning framework that pairs code-switching-aware supervised fine-tuning with Group Relative Policy Optimization to optimize jointly for logical correctness and language stability.
If this is right
- The method outperforms strong baselines consistently across all thirteen languages.
- Both language consistency and logical correctness improve as model size increases from 7B to 32B parameters.
- The curriculum structure supports reliable multilingual medical reasoning without separate language-specific models.
- Training in this staged way raises performance on open-ended queries while preserving language stability.
Where Pith is reading between the lines
- The same staged training pattern could be tested on other verifiable reasoning domains such as law or technical troubleshooting where answers are not language-dependent.
- If the single-answer format proves robust, the approach may narrow performance gaps between high-resource and low-resource languages in any specialized AI task.
- Clinical deployment would still need separate safety audits to confirm the training does not introduce new medical errors beyond what the benchmark measures.
Load-bearing premise
Medical reasoning queries always have one clear, verifiable correct answer that can be scored the same way regardless of language.
What would settle it
A collection of medical queries whose correct answers legitimately vary by language or culture, followed by measurement of whether the model's reported correctness and consistency percentages remain at the levels shown in the paper.
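The consistency half of that measurement can be sketched mechanically. Below is a hedged toy implementation of one plausible reading of "language consistency" (the fraction of responses written in the query's language); the script-range detector and the example pairs are illustrative assumptions, not the paper's protocol.

```python
# Hedged sketch: one plausible reading of "language consistency" --
# the fraction of model responses written in the same language as the query.
# detect_language is a crude script-range heuristic, NOT the paper's method.

def detect_language(text: str) -> str:
    """Toy script-based language ID (illustrative only)."""
    for ch in text:
        if '\u1200' <= ch <= '\u137F':   # Ge'ez script -> Amharic
            return "am"
        if '\u4E00' <= ch <= '\u9FFF':   # CJK ideographs -> Chinese
            return "zh"
    return "latin"                        # fallback bucket

def language_consistency(pairs):
    """pairs: list of (query, response); returns % whose languages match."""
    matches = sum(detect_language(q) == detect_language(r) for q, r in pairs)
    return 100.0 * matches / len(pairs)

pairs = [
    ("ምን ማለት ነው?", "ይህ ማለት ..."),          # Amharic query, Amharic answer
    ("What is sepsis?", "Sepsis is ..."),  # English query, English answer
    ("ምን ማለት ነው?", "It means ..."),         # Amharic query, English answer
]
```

On the three toy pairs above, two of three responses match the query language, giving roughly 66.7% consistency; a real evaluation would substitute a proper language identifier and the benchmark's own queries.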
Original abstract
While large language models (LLMs) have shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset with open-ended reasoning queries with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at https://cure-med.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CUREMED-BENCH, a multilingual medical reasoning dataset spanning 13 languages (including low-resource ones such as Amharic, Yoruba, and Swahili) consisting of open-ended queries asserted to have single verifiable answers. It proposes CURE-MED, a curriculum-informed reinforcement learning framework that combines code-switching-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) to jointly optimize logical correctness and language stability. The central empirical claim is consistent outperformance over strong baselines, with reported metrics of 85.21% language consistency and 54.35% logical correctness at 7B parameters, scaling to 94.96% and 70.04% at 32B parameters.
Significance. If the dataset construction and evaluation protocol can be shown to support reliable cross-lingual scoring of medical reasoning, the work would provide a useful benchmark and training recipe for improving equity in LLM-based medical applications. The public release of code and dataset strengthens reproducibility and enables follow-on research.
Major comments (3)
- [Section 3] Dataset construction appears to rely primarily on translation of English sources for the low-resource languages. Without documented per-language expert adjudication establishing that queries possess single verifiable clinical answers (rather than context-dependent or culturally variable ones), the logical correctness metric (54.35% at 7B, 70.04% at 32B) risks measuring surface-level consistency instead of genuine reasoning validity.
- [Evaluation] The protocol for scoring logical correctness is not fully specified (e.g., the exact LLM judge model, prompt template, handling of partial answers, or inter-rater reliability across languages). This is load-bearing: the central performance claims rest on these automated scores being trustworthy and not biased toward English-centric norms.
- [Results] The strong baselines are referenced but never defined with respect to training data overlap, model scale, or whether they receive the same curriculum RL treatment. Combined with the absence of error bars or statistical tests on the scaling results, this leaves the outperformance claim impossible to assess fully.
Minor comments (2)
- [Abstract] The number of queries per language and the precise definition of the 'language consistency' metric should be stated for immediate context.
- [Method] The GRPO reward formulation and the curriculum progression schedule parameters are introduced without an explicit equation or pseudocode block, making the method harder to reimplement.
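To make the reimplementation concern concrete: the group-relative advantage step of GRPO fits in a few lines. This is a hedged sketch of the generic recipe (reward standardization within a group of sampled responses, as in DeepSeekMath), not the paper's specific reward formulation or curriculum schedule, and the reward values are invented.

```python
# Hedged sketch of the group-relative advantage computation in GRPO.
# Rewards are standardized within one group of G responses to the same
# prompt; the resulting advantages weight the policy-gradient update.
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Standardize rewards within a group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: 4 sampled answers scored by a combined correctness +
# language-consistency reward (values made up for illustration).
rewards = [1.0, 0.0, 0.5, 0.5]
adv = grpo_advantages(rewards)
```

By construction the advantages sum to zero within a group, so above-average responses are reinforced and below-average ones suppressed without a separate value network.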
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and noting the revisions incorporated into the updated manuscript to improve clarity, rigor, and reproducibility.
Point-by-point responses
-
Referee: [Section 3] Dataset construction appears to rely primarily on translation of English sources for the low-resource languages. Without documented per-language expert adjudication establishing that queries possess single verifiable clinical answers (rather than context-dependent or culturally variable ones), the logical correctness metric (54.35% at 7B, 70.04% at 32B) risks measuring surface-level consistency instead of genuine reasoning validity.
Authors: We appreciate this concern regarding the reliability of the logical correctness metric. The original manuscript described CUREMED-BENCH as consisting of open-ended queries with single verifiable answers, constructed by translating high-quality English medical reasoning items and applying quality filters. In the revision, we have expanded Section 3 with a dedicated subsection on dataset construction that now explicitly documents the verification pipeline: professional translation, followed by review from native-speaker linguists and, where accessible, medical domain experts for each language (including remote consultation for Amharic, Yoruba, and Swahili). We focused on medically universal facts to reduce cultural variability and provide concrete examples of adjudication decisions. These additions directly address the risk of surface-level scoring. revision: yes
-
Referee: [Evaluation] The protocol for scoring logical correctness is not fully specified (e.g., the exact LLM judge model, prompt template, handling of partial answers, or inter-rater reliability across languages). This is load-bearing: the central performance claims rest on these automated scores being trustworthy and not biased toward English-centric norms.
Authors: We agree that full specification of the automated evaluation protocol is essential. The revised Evaluation section now includes: (1) the exact judge model (GPT-4o with temperature 0), (2) the complete prompt template (reproduced in Appendix B), (3) explicit rules for partial answers (assigned 0.5 if the core clinical logic is present but incomplete), and (4) results from a human validation study on a 200-sample subset across all 13 languages showing 91.8% agreement with the LLM judge and Cohen's kappa of 0.87. We also added discussion of potential English-centric bias and mitigation steps via multilingual prompt variants. These details make the scoring protocol fully reproducible and allow readers to assess trustworthiness. revision: yes
-
Referee: [Results] The strong baselines are referenced but never defined with respect to training data overlap, model scale, or whether they receive the same curriculum RL treatment. Combined with the absence of error bars or statistical tests on the scaling results, this leaves the outperformance claim impossible to assess fully.
Authors: We acknowledge the need for greater transparency on baselines and statistical rigor. In the revised Results section and Table 2, we now explicitly define each baseline by: training data sources (confirming zero overlap with CUREMED-BENCH), exact model scales, and training regime (standard SFT only, without curriculum RL or GRPO). We have added error bars computed over three random seeds and included paired t-test results (p < 0.05) for all reported improvements in logical correctness and language consistency. These changes allow direct assessment of the outperformance claims. revision: yes
Circularity Check
No circularity: empirical benchmark results are direct measurements
Full rationale
The paper's central claims consist of empirical performance numbers (language consistency and logical correctness percentages) obtained by running the proposed CURE-MED training procedure on the newly introduced CUREMED-BENCH dataset. No mathematical derivation chain, fitted parameters renamed as predictions, self-referential definitions, or load-bearing self-citations appear in the abstract or described framework. The results are experimental outcomes on held-out queries rather than quantities forced by construction from the inputs, so the claims rest on direct measurement rather than on a derivation that could loop back on itself.
Axiom & Free-Parameter Ledger
Free parameters (2)
- curriculum progression schedule
- GRPO reward scaling and group size
Axioms (2)
- Domain assumption: Medical reasoning queries possess single verifiable answers usable for automatic scoring.
- Ad hoc to paper: Code-switching-aware supervised fine-tuning improves subsequent language stability.
Forward citations
Cited by 1 Pith paper
-
BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources
A unified survey that consolidates Indian NLP resources by task, language, domain, and modality while identifying gaps in coverage and generalization.
Reference graph
Works this paper leans on
- [1] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- [2] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [3] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [4] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.
- [5] Farah Magrabi, Elske Ammenwerth, Jytte Brender McNair, Nicolet F De Keizer, Hannele Hyppönen, Pirkko Nykänen, Michael Rigby, Philip J Scott, Tuulikki Vehko, Zoie Shui-Yee Wong, et al. Artificial intelligence in clinical decision support: challenges for evaluating AI and practical implications. Yearbook of Medical Informatics, 28(01):128–134, 2019.
- [6] William W Stead. Clinical implications and challenges of artificial intelligence and deep learning. JAMA, 320(11):1107–1108, 2018.
- [7] Vimla L Patel, José F Arocha, and Jiajie Zhang. Thinking and reasoning in medicine. The Cambridge Handbook of Thinking and Reasoning, 14:727–750, 2005.
- [8] Jose F Arocha, Dongwen Wang, and Vimla L Patel. Identifying reasoning strategies in medical decision making: a methodological guide. Journal of Biomedical Informatics, 38(2):154–171, 2005.
- [9] Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. Can large language models reason about medical questions? Patterns, 5(3), 2024.
- [10] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025.
- [11] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
- [12] Samuel Cahyawijaya, Holy Lovenia, and Pascale Fung. LLMs are few-shot in-context low-resource language learners. arXiv preprint arXiv:2403.16512, 2024.
- [13] Xuan-Phi Nguyen, Sharifah Mahani Aljunied, Shafiq Joty, and Lidong Bing. Democratizing LLMs for low-resource languages by leveraging their English dominant abilities with linguistically-diverse prompts. arXiv preprint arXiv:2306.11372, 2023.
- [14] Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, Vince I Madai, and Precise4Q Consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Medical Informatics and Decision Making, 20(1):310, 2020.
- [15] Lei Liu, Xiaoyan Yang, Junchi Lei, Yue Shen, Jian Wang, Peng Wei, Zhixuan Chu, Zhan Qin, and Kui Ren. A survey on medical large language models: Technology, application, trustworthiness, and future directions. arXiv preprint arXiv:2406.03712, 2024.
- [16] Zhang Shengyu, Dong Linfeng, Li Xiaoya, Zhang Sen, Sun Xiaofei, Wang Shuhe, Li Jiwei, Runyi Hu, Zhang Tianwei, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792.
- [17] Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards building multilingual language model for medicine. Nature Communications, 15(1):8384.
- [18] Quan Guo, Shuai Cao, and Zhang Yi. A medical question answering system using large language models and knowledge graphs. International Journal of Intelligent Systems, 37(11):8548–8564, 2022.
- [19] Akash Ghosh, Debayan Dutta, Sriparna Saha, and Chirag Agarwal. A survey of multilingual reasoning in language models. Findings of the Association for Computational Linguistics: EMNLP, 2025:8920–8936, 2025.
- [20] Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. Benchmarking large language models on answering and explaining challenging medical questions. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3563–3599, 2025.
- [21] Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, and Rama Chellappa. Addressing cognitive bias in medical language models. arXiv preprint arXiv:2402.08113, 2024.
- [22] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022.
- [23] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [24] Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. arXiv preprint arXiv:2310.20246, 2023.
- [25] Shuaijie She, Wei Zou, Shujian Huang, Wenhao Zhu, Xiang Liu, Xiang Geng, and Jiajun Chen. MAPO: Advancing multilingual reasoning through multilingual alignment-as-preference optimization. arXiv preprint arXiv:2401.06838, 2024.
- [26] Katharina Hämmerl, Björn Deiseroth, Patrick Schramowski, Jindřich Libovický, Constantin A Rothkopf, Alexander Fraser, and Kristian Kersting. Speaking multiple languages affects the moral bias of language models. arXiv preprint arXiv:2211.07733, 2022.
- [27] Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, and Hwaran Lee. Code-switching curriculum learning for multilingual transfer in LLMs. arXiv preprint arXiv:2411.02460, 2024.
- [28] Yubin Ge, Devamanyu Hazarika, Yang Liu, and Mahdi Namazifar. Supervised fine-tuning of large language models on human demonstrations through the lens of memorization. 2023.
- [29] Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey, part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson? arXiv preprint arXiv:2411.16489, 2024.
- [30] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. LIMO: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025.
- [31] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [32] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [33] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [34] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [35] Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. ReFT: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024.
- [36] Aaron Parisi, Yao Zhao, and Noah Fiedel. TALM: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022.
- [37] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford Alpaca: An instruction-following LLaMA model, 2023.
- [38] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023.
- [39] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv preprint arXiv:2204.07705, 2022.
- [40] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
- [41] Akash Ghosh, Srivarshinee Sridhar, Raghav Kaushik Ravi, Muhsin Muhsin, Sriparna Saha, and Chirag Agarwal. CLINIC: Evaluating multilingual trustworthiness in language models for healthcare. arXiv, 2025.
- [42] Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925, 2024.
- [43] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025.
- [44] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [45] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797, 2025.
- [46] Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024.
- [47] Hritik Bansal, John Dang, and Aditya Grover. Peering through preferences: Unraveling feedback acquisition for aligning large language models. arXiv preprint arXiv:2308.15812, 2023.
- [48] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
- [49] Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025.
- [50] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [51] Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, and Paul Pu Liang. Learn globally, speak locally: Bridging the gaps in multilingual reasoning. arXiv preprint arXiv:2507.05418, 2025.
- [52] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...
- [53] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024.
- [54] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- [55] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. URL https://arxiv...
- [56] Guorui Zheng, Xidong Wang, Juhao Liang, Nuo Chen, Yuping Zheng, and Benyou Wang. Efficiently democratizing medical LLMs for 50 languages via a mixture of language family experts, 2024. URL https://arxiv.org/abs/2410.10626.
- [57] Mistral AI Team. Un ministral, des ministraux, October 2024. URL https://mistral.ai/news/ministraux. Accessed: 2025-12-24.
- [58] Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexei Figueroa, Alexander Löser, Daniel Truhn, and Keno K. Bressem. MedAlpaca, an open-source collection of medical conversational AI models and training data, 2025. URL https://arxiv.org/abs/2304.08247.
- [59] Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023.
- [60] Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, et al. UltraMedical: Building specialized generalists in biomedicine. Advances in Neural Information Processing Systems, 37:26045–26081, 2024.
- [61] H Zhang, J Chen, F Jiang, F Yu, Z Chen, J Li, G Chen, X Wu, Z Zhang, Q Xiao, et al. HuatuoGPT, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075, 2023.
- [62] Saama AI Labs. OpenBioLLM: Llama3-based biomedical large language model. https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B, 2024. Model card; paper in preparation.
- [63] Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. BioMistral: A collection of open-source pretrained large language models for medical domains, 2024. URL https://arxiv.org/abs/2402.10373.
- [64] Iñigo Alonso, Maite Oronoz, and Rodrigo Agerri. MedExpQA: Multilingual benchmarking of large language models for medical question answering. Artificial Intelligence in Medicine, 155:102938, 2024.
- [65] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.