Pith Number

pith:RCF65HK3

pith:2026:RCF65HK36IPNXMUMKNPK36CCGZ
not attested · not anchored · not stored · refs pending

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Jiaqing Liang, Jinghao Zhang, Keyi Wang, Lipeng Ma, Shisong Chen, Sihang Jiang, Tengfei Wang, Tianjun Pan, Weijia Li, Yanghua Xiao, Zhiyu Lu, Zhonghua Hong

Success rate alone creates a capability illusion for LLM agents, while the sequential convergence of token consumption distinguishes genuine self-evolution from pseudo-evolution.

arxiv:2604.08988 v3 · 2026-04-10 · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{RCF65HK36IPNXMUMKNPK36CCGZ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Under identical success rates, token consumption differs by up to 31.2× across frameworks, and sequential analysis reveals divergent evolutionary trajectories. This demonstrates that success rate alone creates a capability illusion and that the sequential convergence of token consumption (T) is the key criterion for distinguishing genuine evolution from pseudo-evolution.
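
To make the idea of sequential convergence concrete, here is a minimal sketch in Python. It is not the metric defined in the paper: the function name `convergence_ratio`, the window size, and the per-task token counts are all hypothetical, chosen only to illustrate how two agents with the same success rate could still differ in whether their token consumption falls over a task sequence.

```python
from statistics import mean


def convergence_ratio(tokens, window=3):
    """Mean tokens over the last `window` tasks divided by the mean over the first `window`.

    A value well below 1.0 suggests token use is shrinking along the sequence;
    a value near 1.0 suggests no efficiency gain. Hypothetical helper, not from the paper.
    """
    if len(tokens) < 2 * window:
        raise ValueError("need at least 2 * window tasks")
    return mean(tokens[-window:]) / mean(tokens[:window])


# Toy per-task token counts (illustrative only, not results from SEA-Eval).
# Assume both agents solve every task, so their success rates are identical.
agent_a = [9000, 8200, 7000, 5200, 4100, 3600, 3400, 3300]  # consumption falls
agent_b = [9000, 9400, 8800, 9100, 9300, 8900, 9200, 9000]  # consumption stays flat

ratio_a = convergence_ratio(agent_a)
ratio_b = convergence_ratio(agent_b)
print(f"agent_a ratio: {ratio_a:.3f}")
print(f"agent_b ratio: {ratio_b:.3f}")
```

In this toy setup a success-rate comparison cannot separate the two agents, while the ratio does: it drops for `agent_a` and stays close to 1 for `agent_b`.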

C2 weakest assumption

That the sequential task stream design in SEA-Eval enables independent quantification of evolutionary gain, stability, and alignment without confounding from task similarity, agent initialization, or unstated priors in the Flywheel architecture.

C3 one line summary

SEA-Eval is the first benchmark for self-evolving agents that uses sequential tasks to show that success rate alone is misleading, while convergence in token efficiency T distinguishes genuine evolution from pseudo-evolution.

Formal links

2 machine-checked theorem links

Receipt and verification
First computed: 2026-05-26T01:03:29.708097Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

888bee9d5bf21edbb28c535eadf8423679e0bbad79c4da371aa9d912cd88008f

Aliases

arxiv: 2604.08988 · arxiv_version: 2604.08988v3 · doi: 10.48550/arxiv.2604.08988 · pith_short_12: RCF65HK36IPN · pith_short_16: RCF65HK36IPNXMUM · pith_short_8: RCF65HK3
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/RCF65HK36IPNXMUMKNPK36CCGZ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 888bee9d5bf21edbb28c535eadf8423679e0bbad79c4da371aa9d912cd88008f
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "25c4bca7fb3ad2de5949922b4df69798dfc61a08ba3dfd00145975b327d6836f",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-04-10T05:49:50Z",
    "title_canon_sha256": "91bcea8d3ea6453766ce8714ab436ac17368eb54a20d24573fc0b71276e4b085"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.08988",
    "kind": "arxiv",
    "version": 3
  }
}
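
The same canonicalization used in the shell pipeline above can be reproduced offline against the record shown here. The following is a minimal sketch in Python, assuming the JSON printed on this page is exactly what the server returns under `canonical_record`; the helper names `canonical_bytes` and `canonical_hash` are illustrative and not part of any Pith API. Whether the digest matches the published hash has not been checked here, so the script reports the comparison rather than assuming it.

```python
import hashlib
import json

# Canonical record copied from this page (assumed identical to the served record).
RECORD_TEXT = """
{
  "metadata": {
    "abstract_canon_sha256": "25c4bca7fb3ad2de5949922b4df69798dfc61a08ba3dfd00145975b327d6836f",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-04-10T05:49:50Z",
    "title_canon_sha256": "91bcea8d3ea6453766ce8714ab436ac17368eb54a20d24573fc0b71276e4b085"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.08988",
    "kind": "arxiv",
    "version": 3
  }
}
"""

# Hash listed under "Canonical hash" on this page.
PUBLISHED_HASH = "888bee9d5bf21edbb28c535eadf8423679e0bbad79c4da371aa9d912cd88008f"


def canonical_bytes(record):
    """Serialize with sorted keys and compact separators, as in the shell one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record):
    """Return the SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_TEXT)
digest = canonical_hash(record)
print(digest)
print("matches published hash:", digest == PUBLISHED_HASH)
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON, which is what lets a mirror recompute it from any faithful copy of the record.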