Pith Number

pith:EYMHDPD5

pith:2025:EYMHDPD5YM3H6XWTQ4LIE3APIW
not attested · not anchored · not stored · refs resolved

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Dilek Hakkani-Tür, Emre Can Acikgoz, Gokhan Tur, Heng Ji, Hongru Wang, Qi He, Xiusi Chen

A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.

arxiv:2504.13958 v1 · 2025-04-16 · cs.LG · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{EYMHDPD5YM3H6XWTQ4LIE3APIW}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
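The page does not spell out the merge algorithm, so the following is only a minimal sketch of what a deterministic merge could look like, not Pith's actual implementation. It assumes each signed event carries a unique `id`, a timestamp `ts`, and a `field`/`value` pair; all four names are hypothetical. Events are deduplicated by `id`, sorted by `(ts, id)`, and folded into a state, so any mirror holding the same events computes the same result regardless of the order in which it received them.

```python
from typing import Dict, Iterable, List

Event = Dict[str, str]  # hypothetical event shape: id, ts, field, value


def merge_events(*bundles: Iterable[Event]) -> Dict[str, str]:
    """Fold events into a state that does not depend on arrival order."""
    # Union by event id. This sketch assumes an id identifies one immutable event.
    unique: Dict[str, Event] = {}
    for bundle in bundles:
        for event in bundle:
            unique[event["id"]] = event

    # A total order over events: timestamp first, id as the tie-breaker.
    ordered: List[Event] = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))

    # Later events overwrite earlier ones for the same field.
    state: Dict[str, str] = {}
    for event in ordered:
        state[event["field"]] = event["value"]
    return state


# Illustrative events only; they are not taken from the real bundle.
mirror_a = [
    {"id": "e1", "ts": "2026-05-18T03:22:05Z", "field": "author_claim", "value": "open"},
    {"id": "e2", "ts": "2026-05-19T10:00:00Z", "field": "citations", "value": "open"},
]
mirror_b = [
    {"id": "e3", "ts": "2026-05-20T08:30:00Z", "field": "author_claim", "value": "claimed"},
    {"id": "e1", "ts": "2026-05-18T03:22:05Z", "field": "author_claim", "value": "open"},
]

print(merge_events(mirror_a, mirror_b))
```

Sorting on a total order before folding is what makes the result reproducible: swapping the two mirrors, or replaying a duplicate event, leaves the merged state unchanged.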

Claims

C1 strongest claim

Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models.

C2 weakest assumption

The explored reward strategies and the proposed principled design are assumed to transfer to tool-use scenarios outside the specific benchmarks and tool sets used in the experiments.

C3 one line summary

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

References

46 extracted · 46 resolved · 19 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

39 papers in Pith

Receipt and verification
First computed 2026-05-18T03:22:05.942883Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

261871bc7dc3367f5ed38716826c0f459c73573a005c87b75c51d4dcf1edc70c

Aliases

arxiv: 2504.13958 · arxiv_version: 2504.13958v1 · doi: 10.48550/arxiv.2504.13958 · pith_short_12: EYMHDPD5YM3H · pith_short_16: EYMHDPD5YM3H6XWT · pith_short_8: EYMHDPD5
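The identifiers on this page are consistent with a simple relationship: the 26-character Pith Number equals the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and each `pith_short_N` alias is its first N characters. This is an observation from this single record, not a documented rule, so treat the sketch below as an assumption; the function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "261871bc7dc3367f5ed38716826c0f459c73573a005c87b75c51d4dcf1edc70c"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this matches the record shown here; the scheme is not
    confirmed by the page itself.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)     # EYMHDPD5YM3H6XWTQ4LIE3APIW
print(aliases)
```

If the relationship holds in general, a short alias can be checked against a full Pith Number with a plain prefix comparison, with no network call.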
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/EYMHDPD5YM3H6XWTQ4LIE3APIW \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 261871bc7dc3367f5ed38716826c0f459c73573a005c87b75c51d4dcf1edc70c
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "554a6c4040adfff956401c0dcf839c06f1adf4c031130927d473224b6450fda5",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-04-16T21:45:32Z",
    "title_canon_sha256": "a6a3d8cbe619dc8a2acd102e7ff2545163a89ac48f1be9b07f337af429f6db69"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2504.13958",
    "kind": "arxiv",
    "version": 1
  }
}
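The shell pipeline above can also be run offline against the record as displayed. The sketch below applies the same canonicalization as the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) and prints the SHA-256 digest for comparison with the published canonical hash. The helper name `canonical_sha256` is hypothetical, and the digest is only expected to match if the JSON shown here is exactly what the API returns under `canonical_record`.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "554a6c4040adfff956401c0dcf839c06f1adf4c031130927d473224b6450fda5",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2025-04-16T21:45:32Z",
        "title_canon_sha256": "a6a3d8cbe619dc8a2acd102e7ff2545163a89ac48f1be9b07f337af429f6db69",
    },
    "schema_version": "1.0",
    "source": {"id": "2504.13958", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the "Canonical hash" value above
```

Because the keys are sorted before hashing, the order in which a client happens to store or receive them does not affect the digest; only a change to a key or value does.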