Pith Number

pith:YJDU6KGM

pith:2025:YJDU6KGMQWSUFZX5GL4R5YQ3E2

not attested · not anchored · not stored · refs pending

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

Claudio Fanconi, Mihaela van der Schaar, Nicolás Astorga

Inverse reinforcement learning extracts reusable process rewards from expert reasoning traces that improve language model training and inference beyond imitation.

arxiv:2510.01857 v5 · 2025-10-02 · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{YJDU6KGMQWSUFZX5GL4R5YQ3E2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
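The page does not spell out the merge algorithm, so the following is only a generic sketch of what "deterministic" means here: any mirror that holds the same set of events should arrive at the same state, regardless of the order in which the events arrived. The event fields (`id`, `ts`, `field`, `value`) and the last-writer-wins rule are hypothetical and are not taken from the Pith bundle format.

```python
from typing import Any, Dict, Iterable

# Hypothetical event shape: {"id": str, "ts": ISO-8601 str, "field": str, "value": Any}
Event = Dict[str, Any]


def merge_events(events: Iterable[Event]) -> Dict[str, Any]:
    """Fold events into a state dict in a canonical order.

    Sorting by (timestamp, id) removes any dependence on arrival order,
    and skipping repeated ids makes duplicated deliveries harmless.
    """
    state: Dict[str, Any] = {}
    seen_ids = set()
    for event in sorted(events, key=lambda e: (e["ts"], e["id"])):
        if event["id"] in seen_ids:
            continue  # duplicate delivery of the same event
        seen_ids.add(event["id"])
        state[event["field"]] = event["value"]  # assumed last-writer-wins rule
    return state


# Illustrative events only; these are not real records from the bundle.
EXAMPLE_EVENTS = [
    {"id": "e1", "ts": "2026-05-20T00:00:00Z", "field": "citations", "value": 1},
    {"id": "e2", "ts": "2026-05-21T00:00:00Z", "field": "citations", "value": 2},
    {"id": "e3", "ts": "2026-05-20T12:00:00Z", "field": "author_claim", "value": "open"},
]
```

Calling `merge_events(EXAMPLE_EVENTS)` and `merge_events(reversed(EXAMPLE_EVENTS))` yields the same dictionary, which is the property a mirror would rely on.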

Claims

C1: strongest claim

Through experiments on GSM8K, MMLU-Pro and MedReason we show that the reasoning reward function learned with R-AIRL can be effectively used throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings, (2) for inference-time reranking, improving pass@1 by up to 17.4 points, and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy.

C2: weakest assumption

That the adversarial inverse RL procedure can recover a generalizable process-level reward from expert demonstrations that truly reflects reasoning quality and transfers to states not explicitly present in the training traces.

C3: one line summary

R-AIRL learns a reasoning reward function from expert demonstrations using adversarial inverse RL and shows it outperforms SFT for training while enabling reranking and failure localization on GSM8K, MMLU-Pro, and MedReason.
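To make the reranking and failure-localisation uses in C1 concrete, here is a minimal sketch of how a step-level reward could be applied at inference time. It is not the paper's implementation: `toy_reward` is a hypothetical stand-in for the learned reward, and averaging step rewards into a trace score is an assumption made for illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence

# A step-level reward maps one reasoning step to a scalar score.
StepReward = Callable[[str], float]


def trace_score(steps: Sequence[str], reward: StepReward) -> float:
    """Aggregate step rewards into one score (mean is an assumed choice)."""
    return mean(reward(step) for step in steps)


def rerank(candidates: Sequence[Sequence[str]], reward: StepReward) -> List[Sequence[str]]:
    """Order candidate traces from highest to lowest score (best-of-N style)."""
    return sorted(candidates, key=lambda steps: trace_score(steps, reward), reverse=True)


def weakest_step(steps: Sequence[str], reward: StepReward) -> int:
    """Return the index of the lowest-reward step, a candidate failure location."""
    return min(range(len(steps)), key=lambda i: reward(steps[i]))


def toy_reward(step: str) -> float:
    """Hypothetical reward: penalises steps carrying a '???' marker."""
    return 0.0 if "???" in step else 1.0
```

With this toy reward, `rerank` places a trace with no flagged steps ahead of one that contains a flagged step, and `weakest_step` returns the position of the flagged step. A learned reward would replace `toy_reward` without changing the surrounding logic.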

Formal links

1 machine-checked theorem link

Receipt and verification

First computed	2026-05-20T00:02:56.224220Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

c2474f28cc85a542e6fd32f91ee21b26b7a68b8294cf9787d857ef88bef7d048

Aliases

arxiv: 2510.01857 · arxiv_version: 2510.01857v5 · doi: 10.48550/arxiv.2510.01857 · pith_short_12: YJDU6KGMQWSU · pith_short_16: YJDU6KGMQWSUFZX5 · pith_short_8: YJDU6KGM
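The identifiers on this page appear to be related in a simple way: the short aliases are prefixes of the full Pith Number, and the full number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. This is an observation from the values shown here, not a documented rule, so the sketch below is a consistency check rather than a specification. The function names are hypothetical.

```python
import base64
from typing import Dict

# Values copied from this page.
CANONICAL_HASH = "c2474f28cc85a542e6fd32f91ee21b26b7a68b8294cf9787d857ef88bef7d048"
PITH_NUMBER = "YJDU6KGMQWSUFZX5GL4R5YQ3E2"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: the 26-character truncation is inferred from this page only.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(pith_number: str) -> Dict[str, str]:
    """Build the pith_short_N aliases as plain prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}
```

For this record, `pith_number_from_hash(CANONICAL_HASH)` reproduces `PITH_NUMBER`, and `short_aliases(PITH_NUMBER)` reproduces the three `pith_short_*` values listed above.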

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/YJDU6KGMQWSUFZX5GL4R5YQ3E2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c2474f28cc85a542e6fd32f91ee21b26b7a68b8294cf9787d857ef88bef7d048

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "6d314dbed37570ccc909fbb7b6ee6c4e098077bbb05c45f09759a24f32d8f5ff",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2025-10-02T09:55:26Z",
    "title_canon_sha256": "6a28ac5efefddb9edae47459bf6763473346f440fb319e984c41b9b6400f1388"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2510.01857",
    "kind": "arxiv",
    "version": 5
  }
}
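The verification command above can also be run offline against the canonical record printed on this page. The sketch below applies the same serialisation as the one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping) and compares the SHA-256 digest with the published canonical hash. The function names are hypothetical, and whether the record as rendered here matches the resolver's output byte for byte is an assumption, so the result of the comparison is not stated.

```python
import hashlib
import json

# Canonical record copied from this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "6d314dbed37570ccc909fbb7b6ee6c4e098077bbb05c45f09759a24f32d8f5ff",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2025-10-02T09:55:26Z",
        "title_canon_sha256": "6a28ac5efefddb9edae47459bf6763473346f440fb319e984c41b9b6400f1388",
    },
    "schema_version": "1.0",
    "source": {"id": "2510.01857", "kind": "arxiv", "version": 5},
}

# Canonical hash as published on this page.
EXPECTED_HASH = "c2474f28cc85a542e6fd32f91ee21b26b7a68b8294cf9787d857ef88bef7d048"


def canonical_bytes(record: dict) -> bytes:
    """Serialise with sorted keys and compact separators, as in the shell one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialisation."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def matches_expected(record: dict, expected: str = EXPECTED_HASH) -> bool:
    """True when the recomputed digest equals the published hash."""
    return canonical_sha256(record) == expected.lower()
```

Because keys are sorted before hashing, the digest does not depend on the key order of the input dictionary; `matches_expected(CANONICAL_RECORD)` reports whether the copy above reproduces the published hash.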