pith:WQZ6GTIB
Interactive Benchmarks
Interactive benchmarks using budgeted multi-turn interaction with objective feedback assess AI reasoning more robustly than fixed tests or preference judgments.
arxiv:2603.04737 v4 · 2026-03-05 · cs.AI · cs.CL · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{WQZ6GTIBOZEYPCQ7EABSJOLDRQ}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.
The approach assumes that budgeted multi-turn interaction with objective feedback accurately isolates and measures core reasoning ability without introducing new biases from the interaction protocol or judge design.
Interactive Benchmarks assess AI reasoning via budgeted multi-turn interactions in proof and game settings, offering a more robust alternative to saturated fixed benchmarks and subjective preferences.
Receipt and verification
| First computed | 2026-05-20T00:03:07.791472Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
b433e34d017649878a1f200324b9638c0fe4230eee1cb2c35ec246b9b08169a0
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/WQZ6GTIBOZEYPCQ7EABSJOLDRQ \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b433e34d017649878a1f200324b9638c0fe4230eee1cb2c35ec246b9b08169a0
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "0ee398b76bdb86d4f25b173f42d223f44710e1cc5bc18a3f88e8f313e07343e9",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-03-05T02:18:26Z",
    "title_canon_sha256": "19375cc5ad0fd20976fdc336fd90a3833c683d81ae6c67c68b0c67971f295ab8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2603.04737",
    "kind": "arxiv",
    "version": 4
  }
}
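The Python one-liner in the verification command can be unpacked into a short script for readers who want to see each step. The sketch below is a minimal illustration, not Pith's reference implementation. The helper names `canonical_bytes` and `canonical_hash` are hypothetical. It also assumes that the record listed on this page is exactly what the API returns under `canonical_record`. It serializes the record with sorted keys, compact separators and UTF-8 encoding, then hashes the result with SHA-256.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize as the one-liner does: sorted keys, compact separators, UTF-8."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization (hypothetical helper)."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# Record copied from the "Canonical record JSON" listing on this page
# (assumed to equal the API's `canonical_record` field).
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "0ee398b76bdb86d4f25b173f42d223f44710e1cc5bc18a3f88e8f313e07343e9",
        "cross_cats_sorted": ["cs.CL", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-03-05T02:18:26Z",
        "title_canon_sha256": "19375cc5ad0fd20976fdc336fd90a3833c683d81ae6c67c68b0c67971f295ab8",
    },
    "schema_version": "1.0",
    "source": {"id": "2603.04737", "kind": "arxiv", "version": 4},
}

# Expected digest as published in the "Canonical hash" field above.
EXPECTED = "b433e34d017649878a1f200324b9638c0fe4230eee1cb2c35ec246b9b08169a0"

if __name__ == "__main__":
    digest = canonical_hash(RECORD)
    print(digest)
    print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the order in which the fields appear in the fetched JSON does not affect the digest. Only the values and the serialization rules do.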