{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:EYMHDPD5YM3H6XWTQ4LIE3APIW","short_pith_number":"pith:EYMHDPD5","schema_version":"1.0","canonical_sha256":"261871bc7dc3367f5ed38716826c0f459c73573a005c87b75c51d4dcf1edc70c","source":{"kind":"arxiv","id":"2504.13958","version":1},"attestation_state":"computed","paper":{"title":"ToolRL: Reward is All Tool Learning Needs","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Cheng Qian, Dilek Hakkani-T\\\"ur, Emre Can Acikgoz, Gokhan Tur, Heng Ji, Hongru Wang, Qi He, Xiusi Chen","submitted_at":"2025-04-16T21:45:32Z","abstract_excerpt":"Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the finegrained feedback required for effective learning. In "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2504.13958","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2025-04-16T21:45:32Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"a6a3d8cbe619dc8a2acd102e7ff2545163a89ac48f1be9b07f337af429f6db69","abstract_canon_sha256":"554a6c4040adfff956401c0dcf839c06f1adf4c031130927d473224b6450fda5"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T03:22:05.943746Z","signature_b64":"SmZpG9mh8k5OaQrbinTWKTXOzXXfuGXTvHomzUkDu8kOldur5NF+0uJqYxOkPI/invvMXGBTVzB0vHsdRDx7Cw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"261871bc7dc3367f5ed38716826c0f459c73573a005c87b75c51d4dcf1edc70c","last_reissued_at":"2026-05-18T03:22:05.942883Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T03:22:05.942883Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"ToolRL: Reward is All Tool Learning Needs","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Cheng Qian, Dilek Hakkani-T\\\"ur, Emre Can Acikgoz, Gokhan Tur, Heng Ji, Hongru Wang, Qi He, Xiusi Chen","submitted_at":"2025-04-16T21:45:32Z","abstract_excerpt":"Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the finegrained feedback required for effective learning. In "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The explored reward strategies and the proposed principled design are assumed to transfer to tool-use scenarios outside the specific benchmarks and tool sets used in the experiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e7806655713c5806b083448c7e35d7fcabdbc7ab0734f85664d0c75665d8e2ee"},"source":{"id":"2504.13958","kind":"arxiv","version":1},"verdict":{"id":"e2d1cb57-4334-4172-9e1b-d0df4bfd74db","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T00:21:51.946869Z","strongest_claim":"Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models.","one_line_summary":"A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The explored reward strategies and the proposed principled design are assumed to transfer to tool-use scenarios outside the specific benchmarks and tool sets used in the experiments.","pith_extraction_headline":"A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools."},"references":{"count":46,"sample":[{"doi":"","year":null,"title":"Can a single model master both multi-turn conversations and tool use? coalm: A uni- fied conversational agentic language model. Preprint, arXiv:2502.08820. Jinheon Baek, Sujay Kumar Jauhar, Silviu Cuc","work_id":"0fc5f988-a294-4233-99e6-0d734965f4b5","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Researchagent: Iterative research idea generation over scientific literature with large language models,","work_id":"41213a8f-51aa-4065-b3d5-2f154966db88","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks","work_id":"618aa44c-a6c6-425c-abce-8aa8aa842921","ref_index":3,"cited_arxiv_id":"2211.12588","is_internal_anchor":true},{"doi":"","year":2024,"title":"In Findings of the Association for Compu- tational Linguistics: ACL 2024 , pages 9354–9366, Bangkok, Thailand","work_id":"90cd51e7-3c1c-451d-a021-7a7d089d473b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training","work_id":"258dd934-025c-47f5-b4f6-5a0c1c338cc6","ref_index":5,"cited_arxiv_id":"2501.17161","is_internal_anchor":true}],"resolved_work":46,"snapshot_sha256":"b24efdc154cb9fd05b118265ae3687bb9f4eabdcbb50524828d2ae6b46f82a53","internal_anchors":19},"formal_canon":{"evidence_count":2,"snapshot_sha256":"102fb83dfcb9d006b2485fa91c8a330fbcf79fa368aa5600b6839a1d96fbcc89"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2504.13958","created_at":"2026-05-18T03:22:05.943020+00:00"},{"alias_kind":"arxiv_version","alias_value":"2504.13958v1","created_at":"2026-05-18T03:22:05.943020+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.13958","created_at":"2026-05-18T03:22:05.943020+00:00"},{"alias_kind":"pith_short_12","alias_value":"EYMHDPD5YM3H","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"EYMHDPD5YM3H6XWT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"EYMHDPD5","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":39,"internal_anchor_count":39,"sample":[{"citing_arxiv_id":"2605.19447","citing_title":"What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20061","citing_title":"Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2508.08791","citing_title":"Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2509.02547","citing_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","ref_index":99,"is_internal_anchor":true},{"citing_arxiv_id":"2509.18847","citing_title":"Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2510.00568","citing_title":"ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2510.07794","citing_title":"HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2510.22977","citing_title":"The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2601.12538","citing_title":"Agentic Reasoning for Large Language Models","ref_index":208,"is_internal_anchor":true},{"citing_arxiv_id":"2504.21776","citing_title":"WebThinker: Empowering Large Reasoning Models with Deep Research Capability","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21257","citing_title":"MoCo: A One-Stop Shop for Model Collaboration Research","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2508.07407","citing_title":"A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2603.16876","citing_title":"Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11853","citing_title":"GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2601.05242","citing_title":"GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14126","citing_title":"Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2603.24709","citing_title":"Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12943","citing_title":"Reinforced Collaboration in Multi-Agent Flow Networks","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11853","citing_title":"GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09544","citing_title":"TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09931","citing_title":"PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24339","citing_title":"See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05750","citing_title":"RVPO: Risk-Sensitive Alignment via Variance Regularization","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04777","citing_title":"Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00380","citing_title":"ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning","ref_index":49,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW","json":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW.json","graph_json":"https://pith.science/api/pith-number/EYMHDPD5YM3H6XWTQ4LIE3APIW/graph.json","events_json":"https://pith.science/api/pith-number/EYMHDPD5YM3H6XWTQ4LIE3APIW/events.json","paper":"https://pith.science/paper/EYMHDPD5"},"agent_actions":{"view_html":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW","download_json":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW.json","view_paper":"https://pith.science/paper/EYMHDPD5","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2504.13958&json=true","fetch_graph":"https://pith.science/api/pith-number/EYMHDPD5YM3H6XWTQ4LIE3APIW/graph.json","fetch_events":"https://pith.science/api/pith-number/EYMHDPD5YM3H6XWTQ4LIE3APIW/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW/action/timestamp_anchor","attest_storage":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW/action/storage_attestation","attest_author":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW/action/author_attestation","sign_citation":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW/action/citation_signature","submit_replication":"https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW/action/replication_record"}},"created_at":"2026-05-18T03:22:05.943020+00:00","updated_at":"2026-05-18T03:22:05.943020+00:00"}