{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:3JQ23PTRKKM6QP4AAHQTDCNUMY","short_pith_number":"pith:3JQ23PTR","schema_version":"1.0","canonical_sha256":"da61adbe715299e83f8001e13189b466283e97db6d2daa6af7e1278d03481ae3","source":{"kind":"arxiv","id":"2412.18925","version":1},"attestation_state":"computed","paper":{"title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Benyou Wang, Jianye Hou, Junying Chen, Ke Ji, Rongsheng Wang, Wanlong Liu, Xidong Wang, Zhenyang Cai","submitted_at":"2024-12-25T15:12:34Z","abstract_excerpt":"The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike those in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advan"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2412.18925","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2024-12-25T15:12:34Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"54548b45310d2363c1c0194fb07e47578af818bd72ebe2cb5374f6f1e47c4354","abstract_canon_sha256":"c46f04f60a11f5dcd4b5def012472fa6b68dbba2a111c6289943dd5677855648"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.539720Z","signature_b64":"31HMHHsBQ3dWjJckPB8SjoS3fjKJBKx+zeFWyquDfWDQCHBcoCSNcZDdr0rPhbgABJRPZQvn0Le9DZLg7lAEDA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"da61adbe715299e83f8001e13189b466283e97db6d2daa6af7e1278d03481ae3","last_reissued_at":"2026-05-17T23:38:52.539220Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.539220Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Benyou Wang, Jianye Hou, Junying Chen, Ke Ji, Rongsheng Wang, Wanlong Liu, Xidong Wang, Zhenyang Cai","submitted_at":"2024-12-25T15:12:34Z","abstract_excerpt":"The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike those in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advan"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"HuatuoGPT-o1, trained with a two-stage approach of verifier-guided search for fine-tuning followed by RL with verifier rewards on only 40K verifiable medical problems, outperforms both general and medical-specific baselines.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"A medical verifier can reliably and automatically determine the correctness of complex, multi-step reasoning outputs in medicine, despite the abstract noting that verifying medical reasoning is inherently challenging unlike in mathematics.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b244664c67bad9dc18a305c747784b113d8cf78e481a11e15e9e6ecba41de3ae"},"source":{"id":"2412.18925","kind":"arxiv","version":1},"verdict":{"id":"ff885d24-18f3-4a30-90c8-104b705cb1ff","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:32:30.760411Z","strongest_claim":"HuatuoGPT-o1, trained with a two-stage approach of verifier-guided search for fine-tuning followed by RL with verifier rewards on only 40K verifiable medical problems, outperforms both general and medical-specific baselines.","one_line_summary":"HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"A medical verifier can reliably and automatically determine the correctness of complex, multi-step reasoning outputs in medicine, despite the abstract noting that verifying medical reasoning is inherently challenging unlike in mathematics.","pith_extraction_headline":"HuatuoGPT-o1 reaches complex medical reasoning through verifier-guided training on 40,000 problems."},"references":{"count":101,"sample":[{"doi":"","year":2024,"title":"Melody Y . Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Heylar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and","work_id":"8b6aeb91-6507-42e4-85f3-172a7c5bc6d3","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"A preliminary study of o1 in medicine: Are we closer to an ai doctor? arXiv preprint arXiv:2409.15277 2024","work_id":"64508eac-de46-4b28-a3a3-ba21ad1d08ca","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Evaluation of openai o1: Opportunities and challenges of agi","work_id":"1d164054-4773-4bb7-a5a0-ea3bce501f25","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al","work_id":"0738313f-d0cc-4548-a3bc-eba0fa2d607b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective","work_id":"6a0e5e21-c5d9-4094-9ee7-58c92f267e2b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":101,"snapshot_sha256":"1ad4702c97f02672c8617a40dc746777f23c2b5fecf8d10e16144b05dfbbdf4a","internal_anchors":17},"formal_canon":{"evidence_count":1,"snapshot_sha256":"183ea54427bbb01cdd70b12e15f6384a69ff71caa7e6d8d9ae65ce6fb5f3da9c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2412.18925","created_at":"2026-05-17T23:38:52.539295+00:00"},{"alias_kind":"arxiv_version","alias_value":"2412.18925v1","created_at":"2026-05-17T23:38:52.539295+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2412.18925","created_at":"2026-05-17T23:38:52.539295+00:00"},{"alias_kind":"pith_short_12","alias_value":"3JQ23PTRKKM6","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"3JQ23PTRKKM6QP4A","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"3JQ23PTR","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sample":[{"citing_arxiv_id":"2605.23629","citing_title":"DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2503.17599","citing_title":"Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2504.02181","citing_title":"A Survey of Scaling in Large Language Model Reasoning","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2507.14200","citing_title":"A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2507.20917","citing_title":"MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17368","citing_title":"RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18067","citing_title":"PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08949","citing_title":"Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2505.19630","citing_title":"Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2509.02547","citing_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","ref_index":198,"is_internal_anchor":true},{"citing_arxiv_id":"2509.08827","citing_title":"A Survey of Reinforcement Learning for Large Reasoning Models","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2501.18362","citing_title":"MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2601.13262","citing_title":"CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2601.20375","citing_title":"LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08559","citing_title":"Medical Reasoning with Large Language Models: A Survey and MR-Bench","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14543","citing_title":"RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2603.23964","citing_title":"From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments","ref_index":209,"is_internal_anchor":true},{"citing_arxiv_id":"2501.05366","citing_title":"Search-o1: Agentic Search-Enhanced Large Reasoning Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11931","citing_title":"Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2507.17746","citing_title":"Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2502.17419","citing_title":"From System 1 to System 2: A Survey of Reasoning Large Language Models","ref_index":113,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27724","citing_title":"Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2503.09567","citing_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","ref_index":86,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09584","citing_title":"CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10761","citing_title":"RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology","ref_index":21,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY","json":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY.json","graph_json":"https://pith.science/api/pith-number/3JQ23PTRKKM6QP4AAHQTDCNUMY/graph.json","events_json":"https://pith.science/api/pith-number/3JQ23PTRKKM6QP4AAHQTDCNUMY/events.json","paper":"https://pith.science/paper/3JQ23PTR"},"agent_actions":{"view_html":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY","download_json":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY.json","view_paper":"https://pith.science/paper/3JQ23PTR","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2412.18925&json=true","fetch_graph":"https://pith.science/api/pith-number/3JQ23PTRKKM6QP4AAHQTDCNUMY/graph.json","fetch_events":"https://pith.science/api/pith-number/3JQ23PTRKKM6QP4AAHQTDCNUMY/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY/action/storage_attestation","attest_author":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY/action/author_attestation","sign_citation":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY/action/citation_signature","submit_replication":"https://pith.science/pith/3JQ23PTRKKM6QP4AAHQTDCNUMY/action/replication_record"}},"created_at":"2026-05-17T23:38:52.539295+00:00","updated_at":"2026-05-17T23:38:52.539295+00:00"}