Pith Number

pith:VTEXE334

pith:2026:VTEXE334DGF3QC6EIU5TXOHBJE
not attested · not anchored · not stored · refs resolved

Process Rewards with Learned Reliability

Chengsong Huang, Donghong Cai, Jiaxin Huang, Jinyuan Li, Langlin Huang, Shaoyang Xu, Wenxuan Zhang, Yuyi Yang

A process reward model learns both step success probability and the reliability of that probability to guide more efficient reasoning search.

arxiv:2605.15529 v1 · 2026-05-15 · cs.CL · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VTEXE334DGF3QC6EIU5TXOHBJE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
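For readers unfamiliar with PRM-guided Best-of-N selection, the following is a minimal sketch of the general idea, not the paper's implementation. It assumes each candidate solution already has a list of per-step scores in [0, 1] from some process reward model, and that a candidate's overall score is the minimum of its step scores. That aggregation rule, the function name `best_of_n`, and the toy data are all illustrative assumptions.

```python
from typing import Callable, Sequence


def best_of_n(
    step_scores: Sequence[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> int:
    """Return the index of the candidate with the highest aggregated step score.

    `aggregate` collapses one candidate's per-step scores into a single number.
    Using `min` (weakest step) is an assumption made for this sketch.
    """
    if not step_scores:
        raise ValueError("need at least one candidate")
    totals = [aggregate(scores) for scores in step_scores]
    return max(range(len(totals)), key=totals.__getitem__)


# Hypothetical data: three sampled solutions, three reasoning steps each.
candidates = ["answer A", "answer B", "answer C"]
step_scores = [
    [0.90, 0.40, 0.80],  # one weak middle step
    [0.70, 0.75, 0.72],  # uniformly decent
    [0.95, 0.90, 0.20],  # strong start, weak final step
]

best = best_of_n(step_scores)
print(candidates[best])
```

With the minimum as the aggregate, the second candidate is selected because its weakest step still scores higher than the weakest step of the other two.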

C2 · weakest assumption

The assumption that the Beta posterior learned from finite Monte Carlo continuations provides a calibrated and actionable measure of true prediction reliability rather than merely reflecting sampling noise or model bias in the continuation process itself (implicit in the Beta-Binomial likelihood description).
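To make the Beta-Binomial assumption concrete, here is a small sketch of the standard Beta-Binomial likelihood. It assumes the model predicts Beta parameters `alpha` and `beta` for a step, and that supervision is `k` successful completions out of `n` Monte Carlo continuations. The record does not state the paper's exact parameterization or loss, so the function names and the use of the Beta variance as a reliability proxy are illustrative assumptions.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_log_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """log P(k successes in n continuations) when p ~ Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def beta_mean_and_variance(alpha: float, beta: float) -> tuple[float, float]:
    """Mean (predicted step success probability) and variance (spread) of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance


# Two hypothetical predictions with the same mean but different concentration.
diffuse = beta_mean_and_variance(3.0, 1.0)
sharp = beta_mean_and_variance(30.0, 10.0)
print("diffuse mean/variance:", diffuse)
print("sharp mean/variance:  ", sharp)

# Log-likelihood of observing 6 successes in 8 continuations under each prediction.
print(beta_binomial_log_likelihood(6, 8, 3.0, 1.0))
print(beta_binomial_log_likelihood(6, 8, 30.0, 10.0))
```

Both predictions have the same mean, but the second has a much smaller variance. The concern in this claim is whether that smaller variance reflects real reliability or only the noise and bias of a finite set of continuations.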

C3 · one line summary

BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

References

75 extracted · 75 resolved · 10 Pith anchors

[1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv e-prints, pages arXiv–2502, 2025
[2] What If We Allocate Test-Time Compute Adaptively? 2026 · arXiv:2602.01070
[3] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling 2024 · arXiv:2407.21787
[4] An augmented benchmark dataset for geometric question answering through dual parallel text encoding 2022
[5] Web-shepherd: Advancing PRMs for reinforcing web agents 2026

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-20T00:01:03.617279Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

acc9726f7c198bb80bc4453b3bb8e1491f1854f3753e2a6961fdcdfa100e7c24

Aliases

arxiv: 2605.15529 · arxiv_version: 2605.15529v1 · doi: 10.48550/arxiv.2605.15529 · pith_short_12: VTEXE334DGF3 · pith_short_16: VTEXE334DGF3QC6E · pith_short_8: VTEXE334
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VTEXE334DGF3QC6EIU5TXOHBJE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: acc9726f7c198bb80bc4453b3bb8e1491f1854f3753e2a6961fdcdfa100e7c24
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e83e96a90d1c1027b3412052968f37772dc914a154b20ad7d4f86d123f5ab6db",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-15T01:57:11Z",
    "title_canon_sha256": "23a708fc428a0e28c2bf7fc6b9c21bee42b41a0f16031c2e27dbb82b71c0b6e5"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15529",
    "kind": "arxiv",
    "version": 1
  }
}
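The hashing step of the verification command above can also be written as a standalone Python sketch. It applies the same canonicalization as the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) to the record as printed here. The helper name `canonical_sha256` is an assumption for illustration, and this sketch does not fetch anything from the network. Whether the printed record reproduces the published canonical hash exactly has not been verified here, so compare the output against the hash listed on this page yourself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "e83e96a90d1c1027b3412052968f37772dc914a154b20ad7d4f86d123f5ab6db",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-15T01:57:11Z",
        "title_canon_sha256": "23a708fc428a0e28c2bf7fc6b9c21bee42b41a0f16031c2e27dbb82b71c0b6e5",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15529", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON object's keys were written, which is what lets a mirror recompute the same value from its own copy.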