{"paper":{"title":"ToolRL: Reward is All Tool Learning Needs","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Cheng Qian, Dilek Hakkani-Tür, Emre Can Acikgoz, Gokhan Tur, Heng Ji, Hongru Wang, Qi He, Xiusi Chen","submitted_at":"2025-04-16T21:45:32Z","abstract_excerpt":"Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. 
In "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The explored reward strategies and the proposed principled design are assumed to transfer to tool-use scenarios outside the specific benchmarks and tool sets used in the experiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e7806655713c5806b083448c7e35d7fcabdbc7ab0734f85664d0c75665d8e2ee"},"source":{"id":"2504.13958","kind":"arxiv","version":1},"verdict":{"id":"e2d1cb57-4334-4172-9e1b-d0df4bfd74db","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T00:21:51.946869Z","strongest_claim":"Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models.","one_line_summary":"A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across 
benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The explored reward strategies and the proposed principled design are assumed to transfer to tool-use scenarios outside the specific benchmarks and tool sets used in the experiments.","pith_extraction_headline":"A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools."},"references":{"count":46,"sample":[{"doi":"","year":null,"title":"Can a single model master both multi-turn conversations and tool use? coalm: A unified conversational agentic language model. Preprint, arXiv:2502.08820. Jinheon Baek, Sujay Kumar Jauhar, Silviu Cuc","work_id":"0fc5f988-a294-4233-99e6-0d734965f4b5","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Researchagent: Iterative research idea generation over scientific literature with large language models,","work_id":"41213a8f-51aa-4065-b3d5-2f154966db88","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks","work_id":"618aa44c-a6c6-425c-abce-8aa8aa842921","ref_index":3,"cited_arxiv_id":"2211.12588","is_internal_anchor":true},{"doi":"","year":2024,"title":"In Findings of the Association for Computational Linguistics: ACL 2024, pages 9354–9366, Bangkok, Thailand","work_id":"90cd51e7-3c1c-451d-a021-7a7d089d473b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model 
Post-training","work_id":"258dd934-025c-47f5-b4f6-5a0c1c338cc6","ref_index":5,"cited_arxiv_id":"2501.17161","is_internal_anchor":true}],"resolved_work":46,"snapshot_sha256":"b24efdc154cb9fd05b118265ae3687bb9f4eabdcbb50524828d2ae6b46f82a53","internal_anchors":19},"formal_canon":{"evidence_count":2,"snapshot_sha256":"102fb83dfcb9d006b2485fa91c8a330fbcf79fa368aa5600b6839a1d96fbcc89"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}