Pith Number

pith:WQZ6GTIB

pith:2026:WQZ6GTIBOZEYPCQ7EABSJOLDRQ
not attested · not anchored · not stored · refs pending

Interactive Benchmarks

Baoqing Yue, Brian Fan, Hufei Yang, Jichen Feng, Mengdi Wang, Qian Sun, Yifan Zhang, Yutong Han, Zihan Zhu

Interactive benchmarks using budgeted multi-turn interaction with objective feedback assess AI reasoning more robustly than fixed tests or preference judgments.

arxiv:2603.04737 v4 · 2026-03-05 · cs.AI · cs.CL · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{WQZ6GTIBOZEYPCQ7EABSJOLDRQ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.

C2 weakest assumption

That budgeted multi-turn interaction with objective feedback accurately isolates and measures core reasoning ability without introducing new biases from the interaction protocol or judge design.

C3 one-line summary

Interactive Benchmarks assess AI reasoning via budgeted multi-turn interactions in proof and game settings, offering a more robust alternative to saturated fixed benchmarks and subjective preferences.

Cited by

2 papers in Pith

Receipt and verification
First computed: 2026-05-20T00:03:07.791472Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

b433e34d017649878a1f200324b9638c0fe4230eee1cb2c35ec246b9b08169a0

Aliases

arxiv: 2603.04737 · arxiv_version: 2603.04737v4 · doi: 10.48550/arxiv.2603.04737 · pith_short_12: WQZ6GTIBOZEY · pith_short_16: WQZ6GTIBOZEYPCQ7 · pith_short_8: WQZ6GTIB
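
The short aliases above are prefixes of the full Pith Number. For this record, the full identifier also matches the unpadded Base32 encoding of the first 16 bytes of the canonical SHA-256 hash. That second relation is an observation drawn from the values on this page, not a rule stated by Pith, and the helper name `derive_pith_id` is hypothetical. A minimal Python sketch under that assumption:

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "b433e34d017649878a1f200324b9638c0fe4230eee1cb2c35ec246b9b08169a0"
PITH_NUMBER = "WQZ6GTIBOZEYPCQ7EABSJOLDRQ"


def derive_pith_id(hex_digest: str, n_bytes: int = 16) -> str:
    """Assumed derivation (not documented by Pith): RFC 4648 Base32 of the
    first n_bytes of the digest, with '=' padding stripped."""
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = derive_pith_id(CANONICAL_HASH)
print(derived == PITH_NUMBER)                    # True for this record
print(derived[:8], derived[:12], derived[:16])   # the pith_short_8/12/16 aliases
```

If the assumption holds for other records, the short aliases can be obtained by truncating the derived identifier rather than stored separately.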
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/WQZ6GTIBOZEYPCQ7EABSJOLDRQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b433e34d017649878a1f200324b9638c0fe4230eee1cb2c35ec246b9b08169a0
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "0ee398b76bdb86d4f25b173f42d223f44710e1cc5bc18a3f88e8f313e07343e9",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-03-05T02:18:26Z",
    "title_canon_sha256": "19375cc5ad0fd20976fdc336fd90a3833c683d81ae6c67c68b0c67971f295ab8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2603.04737",
    "kind": "arxiv",
    "version": 4
  }
}
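
The verification one-liner above can also be run offline against a saved copy of the record. The sketch below repeats the same canonicalization (sorted keys, compact separators, non-ASCII preserved, UTF-8 bytes) in plain Python. The function name `canonical_sha256` is hypothetical, and the sketch assumes the JSON shown here is the complete canonical record; if so, the printed digest should equal the canonical hash listed above, which the code does not assert.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "0ee398b76bdb86d4f25b173f42d223f44710e1cc5bc18a3f88e8f313e07343e9",
        "cross_cats_sorted": ["cs.CL", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-03-05T02:18:26Z",
        "title_canon_sha256": "19375cc5ad0fd20976fdc336fd90a3833c683d81ae6c67c68b0c67971f295ab8",
    },
    "schema_version": "1.0",
    "source": {"id": "2603.04737", "kind": "arxiv", "version": 4},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash on this page
```

Sorting keys and fixing the separators is what makes the digest reproducible: two mirrors that load the same record with different key order or whitespace still hash identical bytes.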