pith. the verified trust layer for science.
Pith Number

pith:IGPFAYCX

pith:2026:IGPFAYCXFTYJWAYOSLHXOQ4TAP
not attested · not anchored · not stored · refs pending

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Alec Chiu, Avinash Thangali, Chaitanya Kulkarni, Linsey Pang, Prakhar Mehrotra, Shivani Shekhar, Uma Kona, Yirou Ge, Yixi Li, Yun-Shiuan Chuang, Zijie Pan

Proxy state-based evaluation replaces costly deterministic backends with LLM trackers and judges for benchmarking multi-turn tool-calling agents.

arxiv:2602.16246 v3 · 2026-02-18 · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{IGPFAYCXFTYJWAYOSLHXOQ4TAP}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
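
The page does not describe the merge algorithm itself. As a generic illustration of why a deterministic merge lets any mirror arrive at the same state, here is a minimal sketch. The event fields (`id`, `ts`, `field`, `value`) and the last-writer-wins rule are assumptions made for this example, not the Pith bundle format.

```python
from typing import Any, Dict, Iterable


def merge_events(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold events into a state that does not depend on arrival order.

    Hypothetical event shape: {"id": str, "ts": str, "field": str, "value": Any}.
    Assumes an id uniquely identifies an event's content.
    """
    # Deduplicate by id so receiving the same event twice changes nothing.
    unique = {event["id"]: event for event in events}
    # Impose a total order: timestamp first, id as a tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state: Dict[str, Any] = {}
    for event in ordered:
        # Last writer (in the total order) wins for each field.
        state[event["field"]] = event["value"]
    return state


events = [
    {"id": "e1", "ts": "2026-05-18T02:44:31Z", "field": "citations", "value": "open"},
    {"id": "e2", "ts": "2026-05-19T10:00:00Z", "field": "citations", "value": "closed"},
    {"id": "e3", "ts": "2026-05-19T10:00:00Z", "field": "author_claim", "value": "open"},
]

# Two mirrors that receive the events in different orders, one with a duplicate.
mirror_a = merge_events(events)
mirror_b = merge_events(list(reversed(events)) + [events[0]])
print(mirror_a == mirror_b)
```

Sorting on a key that is a property of the events, rather than on the order they arrived in, is what makes the result reproducible on any host.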

Claims

C1 strongest claim

Proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents, producing stable model-differentiating rankings and on-/off-policy supervision that transfers to unseen scenarios.

C2 weakest assumption

That LLM state trackers and judges, when given carefully specified scenarios, can infer accurate proxy states and verify goal completion with near-zero hallucination rates and high reliability.

C3 one-line summary

Introduces Proxy State-Based Evaluation as a scalable LLM-based method for verifiable assessment of multi-turn tool-calling agents using proxy states inferred from interaction traces.

Formal links

2 machine-checked theorem links

Receipt and verification
First computed: 2026-05-18T02:44:31.143396Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

419e5060572cf09b030e92cf77439303ed3a8dde833cd41f12183ab4a2884b89

Aliases

arxiv: 2602.16246 · arxiv_version: 2602.16246v3 · doi: 10.48550/arxiv.2602.16246 · pith_short_12: IGPFAYCXFTYJ · pith_short_16: IGPFAYCXFTYJWAYO · pith_short_8: IGPFAYCX
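
For this record, the identifiers and the canonical hash appear to be related: the 26-character Pith Number matches the first 26 characters of the standard (RFC 4648) base32 encoding of the canonical hash, and the short aliases are prefixes of it. The sketch below checks that observation. Whether the builder actually derives identifiers this way is an assumption; the page does not say so, and the function name `base32_prefix` is made up for this example.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "419e5060572cf09b030e92cf77439303ed3a8dde833cd41f12183ab4a2884b89"
PITH_NUMBER = "IGPFAYCXFTYJWAYOSLHXOQ4TAP"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Return the first `length` characters of the base32 form of a hex digest."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = base32_prefix(CANONICAL_HASH, 26)
print(full)  # IGPFAYCXFTYJWAYOSLHXOQ4TAP

# The short aliases listed above are prefixes of the same string.
aliases = {n: base32_prefix(CANONICAL_HASH, n) for n in (8, 12, 16)}
print(aliases)
```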
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/IGPFAYCXFTYJWAYOSLHXOQ4TAP \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 419e5060572cf09b030e92cf77439303ed3a8dde833cd41f12183ab4a2884b89
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "3122410118920069546efceb4c409f16fc4600b24e7703a8fdcb3fc53e8ff3f0",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-02-18T07:49:47Z",
    "title_canon_sha256": "635d77fbd1eda53c4cd6f4a49799b35c1f57cb644153e4e92193aa1e23a02da9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2602.16246",
    "kind": "arxiv",
    "version": 3
  }
}
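
The hashing step of the verification command can also be run offline against the record shown above. This sketch applies the same canonicalization as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes); the function names are chosen for this example. If the copy of the record matches what the server returns, the printed digest should equal the canonical hash listed on this page. The sketch does not assert that.

```python
import hashlib
import json

# The canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "3122410118920069546efceb4c409f16fc4600b24e7703a8fdcb3fc53e8ff3f0",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-02-18T07:49:47Z",
        "title_canon_sha256": "635d77fbd1eda53c4cd6f4a49799b35c1f57cb644153e4e92193aa1e23a02da9",
    },
    "schema_version": "1.0",
    "source": {"id": "2602.16246", "kind": "arxiv", "version": 3},
}


def canonical_bytes(obj: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(obj: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


digest = canonical_sha256(record)
print(digest)

# Key order in the input does not matter, because keys are sorted on output.
shuffled = {"source": record["source"], "schema_version": "1.0", "metadata": record["metadata"]}
print(canonical_sha256(shuffled) == digest)
```

Sorting keys and fixing the separators is what makes the digest independent of how a given JSON producer happened to order or indent the record.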