Pith Number

pith:PNUMF4I5

pith:2026:PNUMF4I5F4UBAAWJNAIM5YHKXE
not attested · not anchored · not stored · refs resolved

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Alvin Cheung, Dawn Song, Hanchen Li, Hao Wang, Koushik Sen, Qiuyang Mang

BenchJack automatically uncovers reward-hacking exploits that let agents score near-perfect on popular benchmarks without completing tasks.

arxiv:2605.12673 v1 · 2026-05-12 · cs.AI · cs.CR

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{PNUMF4I5F4UBAAWJNAIM5YHKXE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations.

C2 · weakest assumption

The assumption that exploits discovered by BenchJack using its own auditing agents represent genuine, transferable reward hacks that would succeed on standard frontier models rather than being artifacts of the clairvoyant auditing setup or specific model choices.

C3 · one-line summary

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

References

129 extracted · 129 resolved · 21 Pith anchors

[1] Concrete Problems in AI Safety 2016 · arXiv:1606.06565
[2] Alignment risk update: Claude mythos preview 2026
[3] Anthropic / Community Sources. Claude code 2026 · https://www.anthropic.com/product/claude-code
[4] Analyzing and improving chain-of-thought monitorability through information theory 2026
[5] Rewardhackingagents: Benchmarking evaluation integrity for llm ml-engineering agents 2026

Receipt and verification

First computed: 2026-05-18T03:09:50.151516Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

7b68c2f11d2f281002c96810cee0eab925709e41a21c18d9313fdf305699d6e9

Aliases

arxiv: 2605.12673 · arxiv_version: 2605.12673v1 · doi: 10.48550/arxiv.2605.12673 · pith_short_12: PNUMF4I5F4UB · pith_short_16: PNUMF4I5F4UBAAWJ · pith_short_8: PNUMF4I5
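
The values on this page suggest how the identifier relates to the canonical hash: Base32-encoding the first 16 bytes of the SHA-256 digest (RFC 4648 alphabet, padding stripped) reproduces the 26-character Pith Number, and the short aliases are its 8-, 12-, and 16-character prefixes. This is an observation drawn from the values shown here, not a documented rule of the pith-number/v1.0 schema, and the function names below are hypothetical.

```python
import base64

# Canonical hash and identifier copied from this page.
CANONICAL_HASH = "7b68c2f11d2f281002c96810cee0eab925709e41a21c18d9313fdf305699d6e9"
PITH_NUMBER = "PNUMF4I5F4UBAAWJNAIM5YHKXE"


def pith_number_from_hash(hex_digest: str) -> str:
    """Assumed derivation: Base32 of the first 16 digest bytes, padding removed."""
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_number: str) -> dict:
    """Assumed derivation: each short alias is a fixed-length prefix."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


derived = pith_number_from_hash(CANONICAL_HASH)
print(derived)                 # PNUMF4I5F4UBAAWJNAIM5YHKXE
print(short_aliases(derived))
```

If the derivation holds for other records as well, a client could check that an identifier and a canonical hash belong together without contacting the server.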
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/PNUMF4I5F4UBAAWJNAIM5YHKXE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 7b68c2f11d2f281002c96810cee0eab925709e41a21c18d9313fdf305699d6e9
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "1a83aa8499adde69f08fde59df96c5575c4fc34c55b4c84509cb5fc20e9e858c",
    "cross_cats_sorted": [
      "cs.CR"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-12T19:22:45Z",
    "title_canon_sha256": "94e7076a81333e03e389c3bb835a57a5466d36b95c4b2bbf9fba0687072f6530"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12673",
    "kind": "arxiv",
    "version": 1
  }
}
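
The shell one-liner under "Verify this Pith Number yourself" can also be written as a small Python function, which makes the canonicalization rules easier to read: keys sorted, compact separators with no whitespace, non-ASCII characters left unescaped, and UTF-8 encoding before hashing. This is a minimal sketch that takes the record as a dict (copied from the JSON above) instead of fetching it; `canonical_sha256` is a hypothetical helper name, and whether the digest matches the published canonical hash depends on the server returning exactly this record.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with the same json.dumps options as the one-liner, then hash."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no spaces after commas or colons
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "1a83aa8499adde69f08fde59df96c5575c4fc34c55b4c84509cb5fc20e9e858c",
        "cross_cats_sorted": ["cs.CR"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.AI",
        "submitted_at": "2026-05-12T19:22:45Z",
        "title_canon_sha256": "94e7076a81333e03e389c3bb835a57a5466d36b95c4b2bbf9fba0687072f6530",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12673", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash published on this page
```

Because the keys are sorted before hashing, two mirrors that hold the same record with different key order compute the same digest.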