Pith Number

pith:CZZMY6FB

pith:2026:CZZMY6FBQHAXED3CZOMW5EWPGP
not attested · not anchored · not stored · refs resolved

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

Cihang Xie, Fang Wu, Haonian Ji, Huaxiu Yao, Jiaqi Liu, Jike Zhong, Kaide Zeng, Kaiwen Xiong, Peng Xia, Yuxiang Lai, Zeyu Zheng

ClawForge generates executable command-line benchmarks that test agents on pre-existing state conflicts, with top models reaching only 45.3 percent strict accuracy.

arxiv:2605.14133 v1 · 2026-05-13 · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{CZZMY6FBQHAXED3CZOMW5EWPGP}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim open
4. Citations open
5. Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

The best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting.

C2 weakest assumption

That the 17 generated scenarios and their validators faithfully capture the distribution of state conflicts that appear in real command-line workflows.

C3 one line summary

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

References

101 extracted · 101 resolved · 42 Pith anchors

Receipt and verification
First computed: 2026-05-17T23:39:11.771097Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

1672cc78a181c1720f62cb996e92cf33de18527add55b7b3b71745a47fc47ded

Aliases

arxiv: 2605.14133 · arxiv_version: 2605.14133v1 · doi: 10.48550/arxiv.2605.14133 · pith_short_12: CZZMY6FBQHAX · pith_short_16: CZZMY6FBQHAXED3C · pith_short_8: CZZMY6FB
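
The identifiers on this page look related: in this record, the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and the short aliases are prefixes of it. This is an observation from the values shown here, not a documented rule of the `pith-number/v1.0` schema. The sketch below checks it; the function name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "1672cc78a181c1720f62cb996e92cf33de18527add55b7b3b71745a47fc47ded"
PITH_NUMBER = "CZZMY6FBQHAXED3CZOMW5EWPGP"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Return the first `length` characters of the RFC 4648 Base32 encoding.

    Assumption: this derivation is inferred from the values on this page only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = base32_prefix(CANONICAL_HASH, 26)
aliases = {n: derived[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)  # True
print(aliases)  # {8: 'CZZMY6FB', 12: 'CZZMY6FBQHAX', 16: 'CZZMY6FBQHAXED3C'}
```

If the relationship holds for other records as well, a mirror could recompute the number and its aliases from the canonical hash alone, without trusting the displayed identifiers.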
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/CZZMY6FBQHAXED3CZOMW5EWPGP \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 1672cc78a181c1720f62cb996e92cf33de18527add55b7b3b71745a47fc47ded
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "cc686c0882d6287dccdf6f47c8720a1a11370f9ebf93297dbfb3c92e8f376d7e",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-13T21:34:08Z",
    "title_canon_sha256": "ce555f5f97c1a6fef39238bcef7a08a72616148e4e6f90f86baf4d72c056491d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14133",
    "kind": "arxiv",
    "version": 1
  }
}
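
The shell one-liner above fetches the record and hashes it in one step. As an offline alternative, the following minimal sketch applies the same canonicalization (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) to the JSON shown on this page. It assumes the displayed JSON is identical to the served `canonical_record`; if the two differ in any value, the digest will not match. The helper names `canonical_bytes` and `canonical_sha256` are hypothetical.

```python
import hashlib
import json

# Expected digest, copied from the "Canonical hash" field of this record.
EXPECTED = "1672cc78a181c1720f62cb996e92cf33de18527add55b7b3b71745a47fc47ded"

# Canonical record as displayed on this page (assumed equal to the served one).
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "cc686c0882d6287dccdf6f47c8720a1a11370f9ebf93297dbfb3c92e8f376d7e",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-13T21:34:08Z",
    "title_canon_sha256": "ce555f5f97c1a6fef39238bcef7a08a72616148e4e6f90f86baf4d72c056491d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14133",
    "kind": "arxiv",
    "version": 1
  }
}
"""


def canonical_bytes(record: dict) -> bytes:
    """Serialize with the same json.dumps options as the one-liner above."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted and whitespace is stripped before hashing, the digest does not depend on how the JSON is indented or on the order in which keys arrive.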