HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Pith reviewed 2026-05-15 12:32 UTC · model grok-4.3
The pith
HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HuatuoGPT-o1 is a medical LLM trained on verifiable medical problems via a two-stage approach: verifier-guided search to produce complex reasoning trajectories for fine-tuning, followed by RL with verifier-based rewards. This training enables complex reasoning and lets the model outperform both general-purpose and medical-specific baselines using only 40K problems.
What carries the argument
Verifiable medical problems paired with a medical verifier that checks the correctness of model outputs. This pairing enables the two-stage training: verifier-guided search for fine-tuning, then reinforcement learning with verifier rewards.
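As a reader's aid, the verifier-gated pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `verifier`, `search_trajectory`, and `verifier_reward` are hypothetical names, and the paper's actual verifier is a model-based checker rather than exact string matching.

```python
def verifier(problem, answer):
    # Stand-in for the medical verifier: checks a proposed answer
    # against the problem's verifiable ground truth.
    return answer == problem["ground_truth"]

def search_trajectory(model_sample, problem, max_tries=8):
    """Stage 1: resample until the verifier accepts a reasoning
    trajectory; accepted trajectories become SFT training data."""
    for _ in range(max_tries):
        trajectory, answer = model_sample(problem)
        if verifier(problem, answer):
            return trajectory
    return None  # no verified trajectory found within budget

def verifier_reward(problem, answer):
    """Stage 2: sparse binary reward for RL with verifier rewards."""
    return 1.0 if verifier(problem, answer) else 0.0
```

The point of the separation is that stage 1 teaches the model the shape of long verified reasoning chains, while stage 2 optimizes directly against the same correctness signal.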
If this is right
- Complex reasoning trajectories improve medical problem-solving performance.
- Reinforcement learning with verifier rewards provides further gains after the initial fine-tuning stage.
- The full method succeeds using only 40K verifiable medical problems.
- The two-stage verifier approach can be applied to build reasoning capabilities in other specialized domains.
Where Pith is reading between the lines
- If the medical verifier holds up, it could reduce the need for massive human-annotated datasets when adapting LLMs to new medical guidelines.
- The separation of trajectory search from reward-based RL might transfer to other high-stakes reasoning tasks such as legal analysis or engineering design.
- Real clinical deployment would still need separate checks against patient outcomes, since the verifier only scores problem answers.
- Scaling the verifier itself becomes the next bottleneck once the 40K-problem regime is exceeded.
Load-bearing premise
A medical verifier can reliably and automatically determine the correctness of complex, multi-step reasoning outputs in medicine.
What would settle it
A human-expert audit of a sample of model outputs the verifier accepts as correct: finding factual medical errors or invalid reasoning steps among verifier-approved outputs would undermine the premise.
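Such an audit boils down to estimating the verifier's precision against expert judgments. A toy sketch, with illustrative labels only (the paper reports no such audit):

```python
def verifier_precision(verifier_accepts, expert_correct):
    """Fraction of verifier-accepted outputs that experts also judge correct.

    verifier_accepts and expert_correct are parallel boolean lists, one
    entry per sampled model output. A low value falsifies the premise
    that the verifier reliably certifies medical reasoning.
    """
    accepted = [expert for accept, expert in zip(verifier_accepts, expert_correct)
                if accept]
    if not accepted:
        return float("nan")  # verifier accepted nothing in this sample
    return sum(accepted) / len(accepted)
```

Precision on verifier-accepted outputs is the quantity that matters for both training stages, since false accepts feed directly into the SFT data and the RL reward.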
read the original abstract
The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike those in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HuatuoGPT-o1, a medical LLM trained via a two-stage process on 40K verifiable medical problems: (1) verifier-guided search to produce complex reasoning trajectories for supervised fine-tuning, followed by (2) reinforcement learning using rewards from the same medical verifier. It claims this yields superior performance over both general-purpose and medical-specific baselines, demonstrates that complex reasoning improves medical problem-solving, and shows particular gains from the RL stage.
Significance. If the verifier proves reliable and the reported gains are not artifacts of verifier error, the work would be significant for extending o1-style reasoning techniques beyond mathematics into high-stakes specialized domains. The data-efficient approach (only 40K problems) and explicit separation of SFT and RL stages could serve as a template for other fields where automatic verification is difficult but partial verifiability exists.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the central claim of outperformance over baselines is stated without any quantitative metrics, baseline names, absolute scores, or statistical significance tests. This makes it impossible to assess whether the gains are meaningful or merely artifacts of the verifier.
- [§3 and §4.2] §3 (Verifier) and §4.2 (RL stage): the paper itself notes that 'verifying medical reasoning is challenging, unlike those in mathematics,' yet provides no quantitative validation of the verifier (precision/recall on held-out expert-annotated chains, inter-rater agreement with physicians, or error analysis on ambiguous cases such as differential diagnoses). Without this, both the search stage and the RL reward signal risk reinforcing spurious patterns.
minor comments (1)
- [Abstract] The abstract would benefit from one or two concrete performance numbers and a brief statement of the verifier's reported accuracy to allow readers to gauge the scale of the claimed improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing where the original presentation was insufficient and describing the changes made to the revised manuscript.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim of outperformance over baselines is stated without any quantitative metrics, baseline names, absolute scores, or statistical significance tests. This makes it impossible to assess whether the gains are meaningful or merely artifacts of the verifier.
Authors: We agree that the original abstract and §4 presented the outperformance claims without sufficient quantitative detail. In the revised manuscript we have expanded §4 with a new results table that lists all baseline names (both general-purpose and medical-specific), reports absolute accuracy scores on the evaluation benchmarks, and includes statistical significance tests (paired t-tests with p-values) comparing HuatuoGPT-o1 against each baseline. The abstract has also been updated to reference these concrete metrics. These additions allow readers to judge the magnitude and reliability of the reported gains directly. revision: yes
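For readers unfamiliar with the test the revision adds: a paired t statistic over per-benchmark accuracy pairs is straightforward to compute. A small stdlib-only sketch with made-up scores (none of the numbers are from the paper):

```python
import math

def paired_t(model_scores, baseline_scores):
    """Paired t statistic over per-benchmark accuracy pairs.

    Positive values favor the model; compare against a t distribution
    with len(scores) - 1 degrees of freedom to obtain a p-value.
    """
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Pairing by benchmark matters here because benchmark difficulty varies far more than the model-vs-baseline gap, so an unpaired test would wash out the signal.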
-
Referee: [§3 and §4.2] §3 (Verifier) and §4.2 (RL stage): the paper itself notes that 'verifying medical reasoning is challenging, unlike those in mathematics,' yet provides no quantitative validation of the verifier (precision/recall on held-out expert-annotated chains, inter-rater agreement with physicians, or error analysis on ambiguous cases such as differential diagnoses). Without this, both the search stage and the RL reward signal risk reinforcing spurious patterns.
Authors: We acknowledge the concern and the inherent difficulty of medical verification noted in the manuscript. The revised §3 now includes a dedicated validation subsection that reports precision and recall of the verifier on a held-out set of expert-annotated reasoning chains. We also add an error analysis that examines failure modes on ambiguous cases such as differential diagnoses. Inter-rater agreement statistics from the annotation process are reported where available. These quantitative results support the reliability of the verifier used in both the search and RL stages. Full-scale multi-physician inter-rater studies remain resource-intensive and were not feasible within the current scope, but the added metrics and analysis directly address the risk of spurious reinforcement. revision: partial
Circularity Check
No circularity: empirical training on externally verifiable problems
full rationale
The paper describes a two-stage empirical pipeline (verifier-guided search for SFT, followed by RL with verifier rewards) applied to 40K verifiable medical problems. No equations, derivations, or first-principles results are presented that reduce reported performance gains to fitted parameters, self-definitions, or self-citations. The verifier is introduced as an external component for checking correctness on problems with verifiable ground truth, and outperformance claims rest on direct comparisons to general and medical baselines rather than any internal reduction. This is a standard supervised + RL training setup with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A medical verifier can correctly judge the validity of complex reasoning outputs.
Forward citations
Cited by 19 Pith papers
- MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning. MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
- RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation. RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
- Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering. MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.
- Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training. VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
- RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology. RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
- CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics. CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
- You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation. NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
- MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model. MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.
- Improving Medical VQA through Trajectory-Aware Process Supervision. A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains. RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
- Search-o1: Agentic Search-Enhanced Large Reasoning Models. Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
- Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning. Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.
- ReMedi: Reasoner for Medical Clinical Prediction. ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
- Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve. Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
- From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments. An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
- Medical Reasoning with Large Language Models: A Survey and MR-Bench. LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- Effects of Cross-lingual Evidence in Multilingual Medical Question Answering. Combining English and target-language web retrieval boosts medical QA for low-resource languages to match high-resource performance, while English web data benefits high-resource languages most and specialized sources...
- From System 1 to System 2: A Survey of Reasoning Large Language Models. The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Reference graph
Works this paper leans on
-
[1]
Melody Y . Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Heylar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative alignment: Reasoning enables safer language models. OpenAI Blog, 2024. 1
work page 2024
-
[2]
Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, and Yuyin Zhou. A preliminary study of o1 in medicine: Are we closer to an ai doctor? arXiv preprint arXiv:2409.15277, 2024. 9
-
[3]
Evaluation of openai o1: Opportunities and challenges of agi
Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, et al. Evaluation of openai o1: Opportunities and challenges of agi. arXiv preprint arXiv:2409.18486, 2024. 1
-
[4]
O1 replication journey: A strategic progress report–part 1
Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. O1 replication journey: A strategic progress report–part 1. arXiv preprint arXiv:2410.18982, 2024. 1, 9
-
[5]
Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, and Xipeng Qiu. Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective. arXiv preprint arXiv:2412.14135, 2024. 9
-
[6]
Openr: An open source framework for advanced reasoning with large language models
Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M Ni, et al. Openr: An open source framework for advanced reasoning with large language models. arXiv preprint arXiv:2410.09671, 2024. 1, 9
-
[7]
Qwq: Reflect deeply on the boundaries of the unknown, November 2024
Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. 1, 9
work page 2024
-
[8]
Reft: Reasoning with reinforced fine-tuning
Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024. 3, 6, 8
-
[9]
Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning
Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, et al. Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. arXiv preprint arXiv:2410.02884, 2024. 1, 9
-
[10]
Capabilities of Gemini Models in Medicine
Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024. 1, 9
work page internal anchor Pith review arXiv 2024
-
[11]
Thinking and reasoning in medicine
Vimla L Patel, José F Arocha, and Jiajie Zhang. Thinking and reasoning in medicine. The Cambridge handbook of thinking and reasoning, 14:727–750, 2005
work page 2005
-
[12]
Cod, towards an interpretable medical agent using chain of diagnosis
Junying Chen, Chi Gui, Anningzhe Gao, Ke Ji, Xidong Wang, Xiang Wan, and Benyou Wang. Cod, towards an interpretable medical agent using chain of diagnosis. arXiv preprint arXiv:2407.13301, 2024. 1, 9
-
[13]
Towards next-generation medical agent: How o1 is reshaping decision-making in medical scenarios
Shaochen Xu, Yifan Zhou, Zhengliang Liu, Zihao Wu, Tianyang Zhong, Huaqin Zhao, Yiwei Li, Hanqi Jiang, Yi Pan, Junhao Chen, et al. Towards next-generation medical agent: How o1 is reshaping decision-making in medical scenarios. arXiv preprint arXiv:2411.14461, 2024. 1
-
[14]
Mohamad-Hani Temsah, Amr Jamal, Khalid Alhasan, Abdulkarim A Temsah, and Khalid H Malki. Openai o1-preview vs. chatgpt in healthcare: A new frontier in medical ai reasoning. Cureus, 16(10):e70640, 2024. 1, 9 11
work page 2024
-
[15]
Huatuogpt-ii, one-stage training for medical adaption of llms
Junying Chen, Xidong Wang, Ke Ji, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, et al. Huatuogpt-ii, one-stage training for medical adaption of llms. arXiv preprint arXiv:2311.09774, 2023. 1, 6, 9, 23
-
[16]
Adapting large language models via reading comprehension
Daixuan Cheng, Shaohan Huang, and Furu Wei. Adapting large language models via reading comprehension. In The Twelfth International Conference on Learning Representations, 2023. 1
work page 2023
-
[17]
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. 2, 6, 23
work page 2021
-
[18]
Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022. 2, 6
work page 2022
-
[19]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Qwen2.5: A party of foundation models, September 2024
Qwen Team. Qwen2.5: A party of foundation models, September 2024. 3
work page 2024
- [22]
-
[23]
Stream of search (sos): Learning to search in language
Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683, 2024. 3, 5, 10
-
[24]
Learning by playing solving sparse reward tasks from scratch
Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning , pages 4344–4353. PMLR, 2018. 6
work page 2018
-
[25]
Keeping your dis- tance: Solving sparse reward tasks using self-balancing shaped rewards
Alexander Trott, Stephan Zheng, Caiming Xiong, and Richard Socher. Keeping your dis- tance: Solving sparse reward tasks using self-balancing shaped rewards. Advances in Neural Information Processing Systems, 32, 2019. 6
work page 2019
-
[26]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 6
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[27]
Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering
Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR, 2022. 6
work page 2022
-
[28]
Ultramedical: Building specialized generalists in biomedicine
Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, et al. Ultramedical: Building specialized generalists in biomedicine. arXiv preprint arXiv:2406.03949, 2024. 6, 9
-
[29]
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
arXiv preprint arXiv:2305.09617 , year=
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023. 6, 9
-
[31]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 6 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Yi: Open Foundation Models by 01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024
Malaikannan Sankarasubbu Ankit Pal and Malaikannan Sankarasubbu. Openbiollms: Advancing open-source large language models for healthcare and life sciences, 2024. 6, 9
work page 2024
-
[35]
Biomistral: A collection of open-source pretrained large language models for medical domains
Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024. 6, 9
-
[36]
Pubmedqa: A dataset for biomedical research question answering
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019. 6
-
[37]
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024. 8
work page 2024
-
[39]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Llava-o1: Let vision language models reason step-by-step
Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024. 9
-
[41]
o1-coder: an o1 replication for coding
Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, and Jitao Sang. o1-coder: an o1 replication for coding. arXiv preprint arXiv:2412.00154, 2024. 9
-
[42]
Marco-o1: Towards open reasoning models for open-ended solutions
Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. Marco-o1: Towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405, 2024. 9
-
[43]
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, and Eric Horvitz. From medprompt to o1: Exploration of run-time strategies for medical challenge problems and beyond. arXiv preprint arXiv:2411.03590, 2024. 9
-
[44]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024. 9
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Large language models encode clinical knowledge
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023. 9
work page 2023
-
[47]
arXiv preprint arXiv:2311.16452 , year=
Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452,
-
[48]
Agent hospital: A simulacrum of hospital with evolvable medical agents
Junkai Li, Siyu Wang, Meng Zhang, Weitao Li, Yunghwei Lai, Xinhui Kang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024. 9 13
-
[49]
Medicalgpt: Training medical gpt model
Ming Xu. Medicalgpt: Training medical gpt model. https://github.com/shibing624/ MedicalGPT, 2023. 9
work page 2023
-
[50]
Huatuo: Tuning llama model with chinese medical knowledge, 2023
Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. Huatuo: Tuning llama model with chinese medical knowledge, 2023
work page 2023
-
[51]
Medalpaca–an open-source collection of medical conversational ai models and training data
Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247, 2023
-
[52]
Pmc- llama: toward building open-source language models for medicine
Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. Pmc- llama: toward building open-source language models for medicine. Journal of the American Medical Informatics Association, page ocae045, 2024. 9
work page 2024
-
[53]
Disc-medllm: Bridging general large language models and real-world medical consultation, 2023
Zhijie Bao, Wei Chen, Shengze Xiao, Kuang Ren, Jiaao Wu, Cheng Zhong, Jiajie Peng, Xuanjing Huang, and Zhongyu Wei. Disc-medllm: Bridging general large language models and real-world medical consultation, 2023. 9
work page 2023
-
[54]
Kai Zhang, Jun Yu, Eashan Adhikarla, Rong Zhou, Zhiling Yan, Yixin Liu, Zhengliang Liu, Lifang He, Brian Davison, Xiang Li, et al. Biomedgpt: a unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks. arXiv e-prints, pages arXiv–2305, 2023
work page 2023
-
[55]
Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale
Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, et al. Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale. arXiv preprint arXiv:2406.19280, 2024
-
[56]
Apollo: Lightweight multilingual medical llms towards democratizing medical ai to 6b people
Xidong Wang, Nuo Chen, Junyin Chen, Yan Hu, Yidong Wang, Xiangbo Wu, Anningzhe Gao, Xiang Wan, Haizhou Li, and Benyou Wang. Apollo: Lightweight multilingual medical llms towards democratizing medical ai to 6b people. arXiv preprint arXiv:2403.03640, 2024
-
[57]
Efficiently democratizing medical llms for 50 languages via a mixture of language family experts
Guorui Zheng, Xidong Wang, Juhao Liang, Nuo Chen, Yuping Zheng, and Benyou Wang. Efficiently democratizing medical llms for 50 languages via a mixture of language family experts. arXiv preprint arXiv:2410.10626, 2024
-
[58]
Med42–evaluating fine-tuning strategies for medical llms: Full-parameter vs
Clément Christophe, Praveen K Kanithi, Prateek Munjal, Tathagata Raha, Nasir Hayat, Ronnie Rajan, Ahmed Al-Mahrooqi, Avani Gupta, Muhammad Umar Salman, Gurpreet Gosal, et al. Med42–evaluating fine-tuning strategies for medical llms: Full-parameter vs. parameter-efficient approaches. arXiv preprint arXiv:2404.14779, 2024. 9
-
[59]
Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023. 9
- [60] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- [61] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations (ICLR 2023), 2023.
- [62] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), 2022.
- [63] Yisheng Song, Ting Wang, Puyu Cai, Subrota K. Mondal, and Jyoti Prakash Sahoo. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Computing Surveys, 55(13s):271:1–271:40, 2023.
- [64] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- [65] Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), pages 1051–1068, 2023.
- [66]
- [67] Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk. Self-consuming generative models go MAD. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024.
- [68] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024.
- [69] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. CoRR, abs/2308.09583, 2023.
- [70] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- [71] Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, and Ethan Perez. Improving code generation by training with natural language feedback. CoRR, abs/2303.16749, 2023.
- [72] Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations (ICLR 2023), 2023.
- [73] Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Jia Liu, Tao Gui, Qi Zhang, and Xuanjing Huang. Self-Polish: Enhance reasoning in large language models via problem refinement. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
- [74] Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. REFINER: Reasoning feedback on intermediate representations. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024), 2024.
- [75] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- [76] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024.
- [77] Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Wang. Pride and prejudice: LLM amplifies self-bias in self-refinement. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, 2024.
- [78] Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O'Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shepherd: A critic for language model generation. CoRR, abs/2308.04592, 2023.
- [79] Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Junyang Lin, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, and Baobao Chang. LLM critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. CoRR, abs/2406.14024, 2024.
- [80] Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024.