HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Pith reviewed 2026-05-15 12:32 UTC · model grok-4.3
The pith
HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HuatuoGPT-o1 is a medical LLM trained on verifiable medical problems via a two-stage approach: verifier-guided search to produce complex reasoning trajectories for fine-tuning, followed by RL with verifier-based rewards. This training enables complex reasoning and lets the model outperform both general-purpose and medical-specific baselines using only 40K problems.
What carries the argument
Verifiable medical problems paired with a medical verifier that checks the correctness of model outputs. This pairing enables the two-stage training: verifier-guided search for fine-tuning, then reinforcement learning with verifier rewards.
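As a reader's aid, the verifier-gated pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `verifier`, `search_trajectory`, and `verifier_reward` are hypothetical names, and the paper's actual verifier is a model-based checker rather than exact string matching.

```python
def verifier(problem, answer):
    # Stand-in for the medical verifier: checks a proposed answer
    # against the problem's verifiable ground truth.
    return answer == problem["ground_truth"]

def search_trajectory(model_sample, problem, max_tries=8):
    """Stage 1: resample until the verifier accepts a reasoning
    trajectory; accepted trajectories become SFT training data."""
    for _ in range(max_tries):
        trajectory, answer = model_sample(problem)
        if verifier(problem, answer):
            return trajectory
    return None  # no verified trajectory found within budget

def verifier_reward(problem, answer):
    """Stage 2: sparse binary reward for RL with verifier rewards."""
    return 1.0 if verifier(problem, answer) else 0.0
```

The point of the separation is that stage 1 teaches the model the shape of long verified reasoning chains, while stage 2 optimizes directly against the same correctness signal.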
If this is right
- Complex reasoning trajectories improve medical problem-solving performance.
- Reinforcement learning with verifier rewards provides further gains after the initial fine-tuning stage.
- The full method succeeds using only 40K verifiable medical problems.
- The two-stage verifier approach can be applied to build reasoning capabilities in other specialized domains.
Where Pith is reading between the lines
- If the medical verifier holds up, it could reduce the need for massive human-annotated datasets when adapting LLMs to new medical guidelines.
- The separation of trajectory search from reward-based RL might transfer to other high-stakes reasoning tasks such as legal analysis or engineering design.
- Real clinical deployment would still need separate checks against patient outcomes, since the verifier only scores problem answers.
- Scaling the verifier itself becomes the next bottleneck once the 40K-problem regime is exceeded.
Load-bearing premise
A medical verifier can reliably and automatically determine the correctness of complex, multi-step reasoning outputs in medicine.
What would settle it
A human-expert audit of a sample of model outputs the verifier accepts as correct: finding factual medical errors or invalid reasoning steps among verifier-approved outputs would undermine the premise.
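Such an audit boils down to estimating the verifier's precision against expert judgments. A toy sketch, with illustrative labels only (the paper reports no such audit):

```python
def verifier_precision(verifier_accepts, expert_correct):
    """Fraction of verifier-accepted outputs that experts also judge correct.

    verifier_accepts and expert_correct are parallel boolean lists, one
    entry per sampled model output. A low value falsifies the premise
    that the verifier reliably certifies medical reasoning.
    """
    accepted = [expert for accept, expert in zip(verifier_accepts, expert_correct)
                if accept]
    if not accepted:
        return float("nan")  # verifier accepted nothing in this sample
    return sum(accepted) / len(accepted)
```

Precision on verifier-accepted outputs is the quantity that matters for both training stages, since false accepts feed directly into the SFT data and the RL reward.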
read the original abstract
The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike those in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HuatuoGPT-o1, a medical LLM trained via a two-stage process on 40K verifiable medical problems: (1) verifier-guided search to produce complex reasoning trajectories for supervised fine-tuning, followed by (2) reinforcement learning using rewards from the same medical verifier. It claims this yields superior performance over both general-purpose and medical-specific baselines, demonstrates that complex reasoning improves medical problem-solving, and shows particular gains from the RL stage.
Significance. If the verifier proves reliable and the reported gains are not artifacts of verifier error, the work would be significant for extending o1-style reasoning techniques beyond mathematics into high-stakes specialized domains. The data-efficient approach (only 40K problems) and explicit separation of SFT and RL stages could serve as a template for other fields where automatic verification is difficult but partial verifiability exists.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the central claim of outperformance over baselines is stated without any quantitative metrics, baseline names, absolute scores, or statistical significance tests. This makes it impossible to assess whether the gains are meaningful or merely artifacts of the verifier.
- [§3 and §4.2] §3 (Verifier) and §4.2 (RL stage): the paper itself notes that 'verifying medical reasoning is challenging, unlike those in mathematics,' yet provides no quantitative validation of the verifier (precision/recall on held-out expert-annotated chains, inter-rater agreement with physicians, or error analysis on ambiguous cases such as differential diagnoses). Without this, both the search stage and the RL reward signal risk reinforcing spurious patterns.
minor comments (1)
- [Abstract] The abstract would benefit from one or two concrete performance numbers and a brief statement of the verifier's reported accuracy to allow readers to gauge the scale of the claimed improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing where the original presentation was insufficient and describing the changes made to the revised manuscript.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim of outperformance over baselines is stated without any quantitative metrics, baseline names, absolute scores, or statistical significance tests. This makes it impossible to assess whether the gains are meaningful or merely artifacts of the verifier.
Authors: We agree that the original abstract and §4 presented the outperformance claims without sufficient quantitative detail. In the revised manuscript we have expanded §4 with a new results table that lists all baseline names (both general-purpose and medical-specific), reports absolute accuracy scores on the evaluation benchmarks, and includes statistical significance tests (paired t-tests with p-values) comparing HuatuoGPT-o1 against each baseline. The abstract has also been updated to reference these concrete metrics. These additions allow readers to judge the magnitude and reliability of the reported gains directly. revision: yes
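For readers unfamiliar with the test the revision adds: a paired t statistic over per-benchmark accuracy pairs is straightforward to compute. A small stdlib-only sketch with made-up scores (none of the numbers are from the paper):

```python
import math

def paired_t(model_scores, baseline_scores):
    """Paired t statistic over per-benchmark accuracy pairs.

    Positive values favor the model; compare against a t distribution
    with len(scores) - 1 degrees of freedom to obtain a p-value.
    """
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Pairing by benchmark matters here because benchmark difficulty varies far more than the model-vs-baseline gap, so an unpaired test would wash out the signal.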
-
Referee: [§3 and §4.2] §3 (Verifier) and §4.2 (RL stage): the paper itself notes that 'verifying medical reasoning is challenging, unlike those in mathematics,' yet provides no quantitative validation of the verifier (precision/recall on held-out expert-annotated chains, inter-rater agreement with physicians, or error analysis on ambiguous cases such as differential diagnoses). Without this, both the search stage and the RL reward signal risk reinforcing spurious patterns.
Authors: We acknowledge the concern and the inherent difficulty of medical verification noted in the manuscript. The revised §3 now includes a dedicated validation subsection that reports precision and recall of the verifier on a held-out set of expert-annotated reasoning chains. We also add an error analysis that examines failure modes on ambiguous cases such as differential diagnoses. Inter-rater agreement statistics from the annotation process are reported where available. These quantitative results support the reliability of the verifier used in both the search and RL stages. Full-scale multi-physician inter-rater studies remain resource-intensive and were not feasible within the current scope, but the added metrics and analysis directly address the risk of spurious reinforcement. revision: partial
Circularity Check
No circularity: empirical training on externally verifiable problems
full rationale
The paper describes a two-stage empirical pipeline (verifier-guided search for SFT, followed by RL with verifier rewards) applied to 40K verifiable medical problems. No equations, derivations, or first-principles results are presented that reduce reported performance gains to fitted parameters, self-definitions, or self-citations. The verifier is introduced as an external component for checking correctness on problems with verifiable ground truth, and outperformance claims rest on direct comparisons to general and medical baselines rather than any internal reduction. This is a standard supervised + RL training setup with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A medical verifier can correctly judge the validity of complex reasoning outputs.
Forward citations
Cited by 19 Pith papers
- MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning. MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
- RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation. RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
- Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering. MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.
- Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training. VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
- RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology. RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
- CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics. CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
- You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation. NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
- MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model. MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.
- Improving Medical VQA through Trajectory-Aware Process Supervision. A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains. RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
- Search-o1: Agentic Search-Enhanced Large Reasoning Models. Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
- Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning. Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.
- ReMedi: Reasoner for Medical Clinical Prediction. ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
- Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve. Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
- From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments. An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
- Medical Reasoning with Large Language Models: A Survey and MR-Bench. LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- Effects of Cross-lingual Evidence in Multilingual Medical Question Answering. Combining English and target-language web retrieval boosts medical QA for low-resource languages to match high-resource performance, while English web data benefits high-resource languages most and specialized sources...
- From System 1 to System 2: A Survey of Reasoning Large Language Models. The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Reference graph
Works this paper leans on
-
[1]
Melody Y . Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Heylar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative alignment: Reasoning enables safer language models. OpenAI Blog, 2024. 1
work page 2024
-
[2]
Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, and Yuyin Zhou. A preliminary study of o1 in medicine: Are we closer to an ai doctor? arXiv preprint arXiv:2409.15277, 2024. 9
-
[3]
Evaluation of openai o1: Opportunities and challenges of agi
Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, et al. Evaluation of openai o1: Opportunities and challenges of agi. arXiv preprint arXiv:2409.18486, 2024. 1
-
[4]
O1 replication journey: A strategic progress report–part 1
Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. O1 replication journey: A strategic progress report–part 1. arXiv preprint arXiv:2410.18982, 2024. 1, 9
-
[5]
Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, and Xipeng Qiu. Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective. arXiv preprint arXiv:2412.14135, 2024. 9
-
[6]
Openr: An open source framework for advanced reasoning with large language models
Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M Ni, et al. Openr: An open source framework for advanced reasoning with large language models. arXiv preprint arXiv:2410.09671, 2024. 1, 9
-
[7]
Qwq: Reflect deeply on the boundaries of the unknown, November 2024
Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. 1, 9
work page 2024
-
[8]
Reft: Reasoning with reinforced fine-tuning
Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024. 3, 6, 8
-
[9]
Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning
Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, et al. Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. arXiv preprint arXiv:2410.02884, 2024. 1, 9
-
[10]
Capabilities of Gemini Models in Medicine
Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024. 1, 9
work page internal anchor Pith review arXiv 2024
-
[11]
Thinking and reasoning in medicine
Vimla L Patel, José F Arocha, and Jiajie Zhang. Thinking and reasoning in medicine. The Cambridge handbook of thinking and reasoning, 14:727–750, 2005
work page 2005
-
[12]
Cod, towards an interpretable medical agent using chain of diagnosis
Junying Chen, Chi Gui, Anningzhe Gao, Ke Ji, Xidong Wang, Xiang Wan, and Benyou Wang. Cod, towards an interpretable medical agent using chain of diagnosis. arXiv preprint arXiv:2407.13301, 2024. 1, 9
-
[13]
Towards next-generation medical agent: How o1 is reshaping decision-making in medical scenarios
Shaochen Xu, Yifan Zhou, Zhengliang Liu, Zihao Wu, Tianyang Zhong, Huaqin Zhao, Yiwei Li, Hanqi Jiang, Yi Pan, Junhao Chen, et al. Towards next-generation medical agent: How o1 is reshaping decision-making in medical scenarios. arXiv preprint arXiv:2411.14461, 2024. 1
-
[14]
Mohamad-Hani Temsah, Amr Jamal, Khalid Alhasan, Abdulkarim A Temsah, and Khalid H Malki. Openai o1-preview vs. chatgpt in healthcare: A new frontier in medical ai reasoning. Cureus, 16(10):e70640, 2024. 1, 9 11
work page 2024
-
[15]
Huatuogpt-ii, one-stage training for medical adaption of llms
Junying Chen, Xidong Wang, Ke Ji, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, et al. Huatuogpt-ii, one-stage training for medical adaption of llms. arXiv preprint arXiv:2311.09774, 2023. 1, 6, 9, 23
-
[16]
Adapting large language models via reading comprehension
Daixuan Cheng, Shaohan Huang, and Furu Wei. Adapting large language models via reading comprehension. In The Twelfth International Conference on Learning Representations, 2023. 1
work page 2023
-
[17]
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. 2, 6, 23
work page 2021
-
[18]
Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022. 2, 6
work page 2022
-
[19]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Qwen2.5: A party of foundation models, September 2024
Qwen Team. Qwen2.5: A party of foundation models, September 2024. 3
work page 2024
- [22]
-
[23]
Stream of search (sos): Learning to search in language
Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683, 2024. 3, 5, 10
-
[24]
Learning by playing solving sparse reward tasks from scratch
Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning , pages 4344–4353. PMLR, 2018. 6
work page 2018
-
[25]
Keeping your dis- tance: Solving sparse reward tasks using self-balancing shaped rewards
Alexander Trott, Stephan Zheng, Caiming Xiong, and Richard Socher. Keeping your dis- tance: Solving sparse reward tasks using self-balancing shaped rewards. Advances in Neural Information Processing Systems, 32, 2019. 6
work page 2019
-
[26]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 6
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[27]
Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR, 2022. 6
work page 2022
-
[28]
Ultramedical: Building specialized generalists in biomedicine
Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, et al. Ultramedical: Building specialized generalists in biomedicine. arXiv preprint arXiv:2406.03949, 2024. 6, 9
-
[29]
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
arXiv preprint arXiv:2305.09617 , year=
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023. 6, 9
-
[31]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 6 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Yi: Open Foundation Models by 01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024
Malaikannan Sankarasubbu Ankit Pal and Malaikannan Sankarasubbu. Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024. 6, 9
work page 2024
-
[35]
Biomistral: A collection of open-source pretrained large language models for medical domains
Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024. 6, 9
-
[36]
Pubmedqa: A dataset for biomedical research question answering
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019. 6
-
[37]
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024. 8
work page 2024
-
[39]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Llava-o1: Let vision language models reason step-by-step
Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024. 9
-
[41]
o1-coder: an o1 replication for coding
Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, and Jitao Sang. o1-coder: an o1 replication for coding. arXiv preprint arXiv:2412.00154, 2024. 9
-
[42]
Marco-o1: Towards open reasoning models for open-ended solutions
Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. Marco-o1: Towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405, 2024. 9
-
[43]
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, and Eric Horvitz. From medprompt to o1: Exploration of run-time strategies for medical challenge problems and beyond. arXiv preprint arXiv:2411.03590, 2024. 9
-
[44]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024. 9
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Large language models encode clinical knowledge
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023. 9
work page 2023
-
[47]
arXiv preprint arXiv:2311.16452 , year=
Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452,
-
[48]
Agent hospital: A simulacrum of hospital with evolvable medical agents
Junkai Li, Siyu Wang, Meng Zhang, Weitao Li, Yunghwei Lai, Xinhui Kang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024. 9 13
-
[49]
Medicalgpt: Training medical gpt model
Ming Xu. Medicalgpt: Training medical gpt model. https://github.com/shibing624/ MedicalGPT, 2023. 9
work page 2023
-
[50]
Huatuo: Tuning llama model with chinese medical knowledge, 2023
Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. Huatuo: Tuning llama model with chinese medical knowledge, 2023
work page 2023
-
[51]
Medalpaca–an open-source collection of medical conversational ai models and training data
Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247, 2023
-
[52]
Pmc- llama: toward building open-source language models for medicine
Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. Pmc- llama: toward building open-source language models for medicine. Journal of the American Medical Informatics Association, page ocae045, 2024. 9
work page 2024
-
[53]
Disc-medllm: Bridging general large language models and real-world medical consultation, 2023
Zhijie Bao, Wei Chen, Shengze Xiao, Kuang Ren, Jiaao Wu, Cheng Zhong, Jiajie Peng, Xuanjing Huang, and Zhongyu Wei. Disc-medllm: Bridging general large language models and real-world medical consultation, 2023. 9
work page 2023
-
[54]
Kai Zhang, Jun Yu, Eashan Adhikarla, Rong Zhou, Zhiling Yan, Yixin Liu, Zhengliang Liu, Lifang He, Brian Davison, Xiang Li, et al. Biomedgpt: a unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks. arXiv e-prints, pages arXiv–2305, 2023
work page 2023
-
[55]
Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale
Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, et al. Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale. arXiv preprint arXiv:2406.19280, 2024
-
[56]
Apollo: Lightweight multilingual medical llms towards democratizing medical ai to 6b people
Xidong Wang, Nuo Chen, Junyin Chen, Yan Hu, Yidong Wang, Xiangbo Wu, Anningzhe Gao, Xiang Wan, Haizhou Li, and Benyou Wang. Apollo: Lightweight multilingual medical llms towards democratizing medical ai to 6b people. arXiv preprint arXiv:2403.03640, 2024
-
[57]
Efficiently democratizing medical llms for 50 languages via a mixture of language family experts
Guorui Zheng, Xidong Wang, Juhao Liang, Nuo Chen, Yuping Zheng, and Benyou Wang. Efficiently democratizing medical llms for 50 languages via a mixture of language family experts. arXiv preprint arXiv:2410.10626, 2024
-
[58]
Med42–evaluating fine-tuning strategies for medical llms: Full-parameter vs
Clément Christophe, Praveen K Kanithi, Prateek Munjal, Tathagata Raha, Nasir Hayat, Ronnie Rajan, Ahmed Al-Mahrooqi, Avani Gupta, Muhammad Umar Salman, Gurpreet Gosal, et al. Med42–evaluating fine-tuning strategies for medical llms: Full-parameter vs. parameter-efficient approaches. arXiv preprint arXiv:2404.14779, 2024. 9
-
[59]
Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023. 9
- [60] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- [61] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations (ICLR 2023), 2023.
- [62] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), 2022.
- [63] Yisheng Song, Ting Wang, Puyu Cai, Subrota K. Mondal, and Jyoti Prakash Sahoo. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Computing Surveys, 55(13s):271:1–271:40, 2023.
- [64] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- [65] Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), pages 1051–1068, 2023.
- [66]
- [67] Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk. Self-consuming generative models go MAD. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024.
- [68] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024.
- [69] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. CoRR, abs/2308.09583, 2023.
- [70] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- [71] Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, and Ethan Perez. Improving code generation by training with natural language feedback. CoRR, abs/2303.16749, 2023.
- [72] Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations (ICLR 2023), 2023.
- [73] Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Jia Liu, Tao Gui, Qi Zhang, and Xuanjing Huang. Self-Polish: Enhance reasoning in large language models via problem refinement. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
- [74] Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. REFINER: Reasoning feedback on intermediate representations. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024), 2024.
- [75] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- [76] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024.
- [77] Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Wang. Pride and prejudice: LLM amplifies self-bias in self-refinement. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, 2024.
- [78] Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O'Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shepherd: A critic for language model generation. CoRR, abs/2308.04592, 2023.
- [79] Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Junyang Lin, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, and Baobao Chang. LLM critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. CoRR, abs/2406.14024, 2024.
- [80] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024.