{"paper":{"title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Benyou Wang, Jianye Hou, Junying Chen, Ke Ji, Rongsheng Wang, Wanlong Liu, Xidong Wang, Zhenyang Cai","submitted_at":"2024-12-25T15:12:34Z","abstract_excerpt":"The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike those in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advan"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"HuatuoGPT-o1, trained with a two-stage approach of verifier-guided search for fine-tuning followed by RL with verifier rewards on only 40K verifiable medical problems, outperforms both general and medical-specific baselines.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"A medical verifier can reliably and automatically determine the correctness of complex, multi-step reasoning outputs in medicine, despite the abstract noting that verifying medical reasoning is inherently challenging unlike in mathematics.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b244664c67bad9dc18a305c747784b113d8cf78e481a11e15e9e6ecba41de3ae"},"source":{"id":"2412.18925","kind":"arxiv","version":1},"verdict":{"id":"ff885d24-18f3-4a30-90c8-104b705cb1ff","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:32:30.760411Z","strongest_claim":"HuatuoGPT-o1, trained with a two-stage approach of verifier-guided search for fine-tuning followed by RL with verifier rewards on only 40K verifiable medical problems, outperforms both general and medical-specific baselines.","one_line_summary":"HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"A medical verifier can reliably and automatically determine the correctness of complex, multi-step reasoning outputs in medicine, despite the abstract noting that verifying medical reasoning is inherently challenging unlike in mathematics.","pith_extraction_headline":"HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems."},"references":{"count":101,"sample":[{"doi":"","year":2024,"title":"Melody Y . Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Heylar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and","work_id":"8b6aeb91-6507-42e4-85f3-172a7c5bc6d3","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"A preliminary study of o1 in medicine: Are we closer to an ai doctor? arXiv preprint arXiv:2409.15277 2024","work_id":"64508eac-de46-4b28-a3a3-ba21ad1d08ca","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Evaluation of openai o1: Opportunities and challenges of agi","work_id":"1d164054-4773-4bb7-a5a0-ea3bce501f25","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al","work_id":"0738313f-d0cc-4548-a3bc-eba0fa2d607b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective","work_id":"6a0e5e21-c5d9-4094-9ee7-58c92f267e2b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":101,"snapshot_sha256":"1ad4702c97f02672c8617a40dc746777f23c2b5fecf8d10e16144b05dfbbdf4a","internal_anchors":17},"formal_canon":{"evidence_count":1,"snapshot_sha256":"183ea54427bbb01cdd70b12e15f6384a69ff71caa7e6d8d9ae65ce6fb5f3da9c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}