pith:YJDU6KGM
Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning
Inverse reinforcement learning extracts reusable process rewards from expert reasoning traces that improve language model training and inference beyond imitation.
arxiv:2510.01857 v5 · 2025-10-02 · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YJDU6KGMQWSUFZX5GL4R5YQ3E2}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Through experiments on GSM8K, MMLU-Pro and MedReason, we show that the reasoning reward function learned with R-AIRL can be effectively used throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings, (2) for inference-time reranking, improving pass@1 by up to 17.4 points, and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy.
The approach rests on the assumption that the adversarial inverse RL procedure can recover a generalizable process-level reward from expert demonstrations, one that truly reflects reasoning quality and transfers to states not explicitly present in the training traces.
R-AIRL learns a reasoning reward function from expert demonstrations using adversarial inverse RL and shows it outperforms SFT for training while enabling reranking and failure localization on GSM8K, MMLU-Pro, and MedReason.
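To make the reranking and failure-localisation claims concrete, here is a minimal Python sketch of how a step-level reward could be used at inference time. It is not the paper's implementation: `toy_reward` is a hypothetical stand-in for a learned reward, and aggregating step rewards by their mean is an assumption made only for illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical signature: (previous steps, current step) -> scalar reward.
StepReward = Callable[[Sequence[str], str], float]


def score_trace(steps: Sequence[str], reward: StepReward) -> List[float]:
    """Score every step of a reasoning trace given the steps before it."""
    return [reward(steps[:i], step) for i, step in enumerate(steps)]


def rerank(candidates: Sequence[Sequence[str]], reward: StepReward) -> List[Sequence[str]]:
    """Order candidate traces from best to worst by mean step reward (assumed aggregation)."""
    return sorted(candidates, key=lambda c: mean(score_trace(c, reward)), reverse=True)


def localise_failure(steps: Sequence[str], reward: StepReward) -> int:
    """Return the index of the lowest-scoring step, a guess at where reasoning went wrong."""
    scores = score_trace(steps, reward)
    return scores.index(min(scores))


def toy_reward(previous: Sequence[str], step: str) -> float:
    """Toy stand-in for a learned reward: checks steps of the form 'a + b = c'."""
    left, _, right = step.partition("=")
    try:
        a, b = (int(term) for term in left.split("+"))
        return 1.0 if a + b == int(right) else -1.0
    except ValueError:
        return 0.0  # step is not a simple addition; stay neutral


good = ["2 + 3 = 5", "5 + 4 = 9"]
bad = ["2 + 3 = 6", "6 + 4 = 10"]

best_first = rerank([bad, good], toy_reward)
failing_step = localise_failure(bad, toy_reward)
```

With these toy traces, `best_first[0]` is the fully correct trace and `failing_step` points at the first step of the faulty one. A learned reward would replace `toy_reward` while the surrounding logic stays the same.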
Receipt and verification
| First computed | 2026-05-20T00:02:56.224220Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
c2474f28cc85a542e6fd32f91ee21b26b7a68b8294cf9787d857ef88bef7d048
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YJDU6KGMQWSUFZX5GL4R5YQ3E2 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c2474f28cc85a542e6fd32f91ee21b26b7a68b8294cf9787d857ef88bef7d048
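The same check can be written as a standalone Python script, which may be easier to read than the one-liner. This is a sketch that assumes the canonical form is exactly what the command above produces (keys sorted, compact separators, non-ASCII kept, UTF-8 encoded); `canonical_sha256` is a name introduced here, and the record is copied from the canonical record JSON on this page rather than fetched.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the way the verification one-liner does."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Record copied from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "6d314dbed37570ccc909fbb7b6ee6c4e098077bbb05c45f09759a24f32d8f5ff",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2025-10-02T09:55:26Z",
        "title_canon_sha256": "6a28ac5efefddb9edae47459bf6763473346f440fb319e984c41b9b6400f1388",
    },
    "schema_version": "1.0",
    "source": {"id": "2510.01857", "kind": "arxiv", "version": 5},
}

digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on this page
```

Because the keys are sorted before hashing, the digest does not depend on the order in which the record's fields are written.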
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "6d314dbed37570ccc909fbb7b6ee6c4e098077bbb05c45f09759a24f32d8f5ff",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2025-10-02T09:55:26Z",
    "title_canon_sha256": "6a28ac5efefddb9edae47459bf6763473346f440fb319e984c41b9b6400f1388"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2510.01857",
    "kind": "arxiv",
    "version": 5
  }
}