arxiv: 2606.08106 · v1 · pith:DJXFP4GSnew · submitted 2026-06-06 · 💻 cs.AI · cs.MA

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

Pith reviewed 2026-06-27 19:53 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords self-evolving agents · acceptance tests · sequential hypothesis testing · e-processes · prompt optimization · false positive control · anytime validity · GSM8K

The pith

Self-evolving agents accumulate false prompt changes under the standard greedy acceptance rule because it performs uncontrolled multiple testing on noisy held-out scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the acceptor rule, not the proposer, is the main obstacle to reliable self-evolution in agents. The common practice of keeping a change whenever it scores higher on a small dev set produces many false and harmful commits when proposals are noisy. PACE replaces this rule with a sequential testing procedure based on e-processes that commits only when the evidence is decisive, controlling the false-commit probability per candidate. In experiments on Qwen2.5 agents (0.5B-3B) evolving prompts on GSM8K, SVAMP, and ARC-Challenge, PACE avoids nearly all false commits while matching greedy's held-out accuracy and cutting evaluation cost by about 18%.

Core claim

Committing a candidate change is a sequential hypothesis test against the null that it does not improve over the incumbent; PACE uses a testing-by-betting e-process on paired evaluations to accumulate evidence and commit only when the process crosses a threshold, providing anytime-valid control of the false-commit probability at a user-chosen level even under optional stopping.
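A minimal sketch of such a commit gate, assuming binary per-instance correctness and a fixed betting fraction. The name `pace_gate`, the parameter `bet`, and the ±1 coding of discordant pairs are illustrative choices, not the paper's implementation; the only details taken from the page are the paired comparison on identical instances and the commit threshold of 1/α mentioned in the Figure 1 caption.

```python
from typing import Iterable, Tuple


def pace_gate(
    pairs: Iterable[Tuple[bool, bool]],
    alpha: float = 0.05,
    bet: float = 0.5,
) -> Tuple[bool, float, int]:
    """Sequential commit gate over (incumbent_correct, candidate_correct) pairs.

    Returns (committed, final_wealth, evaluations_used).
    Assumption: a fixed betting fraction `bet` in [0, 1); the paper's actual
    betting function is not shown on this page.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.0 <= bet < 1.0:
        raise ValueError("bet must lie in [0, 1)")

    threshold = 1.0 / alpha
    wealth = 1.0  # the e-process starts at 1
    used = 0
    for incumbent_ok, candidate_ok in pairs:
        used += 1
        if incumbent_ok == candidate_ok:
            continue  # concordant pair: carries no evidence either way
        x = 1 if candidate_ok else -1  # +1 when only the candidate is correct
        wealth *= 1.0 + bet * x
        if wealth >= threshold:
            return True, wealth, used  # stop early and commit
    return False, wealth, used
```

With a fixed fraction, wealth grows only on pairs where the candidate alone is correct and shrinks when the incumbent alone is correct, so a candidate that is no better than the incumbent tends to lose wealth rather than cross the threshold.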

What carries the argument

Paired anytime-valid commit evaluation (PACE) via testing-by-betting e-processes on identical instances.

If this is right

  • Greedy acceptance commits 30-42% false edits and 10-33% harmful ones when a real improvement is among noisy proposals.
  • PACE commits the genuine improvement and almost nothing else, matching greedy's held-out accuracy with sharply lower variance.
  • With no real gain available, greedy makes 13-21 spurious commits per run (72-100% false) and degrades the most fragile agent by 4.9 points; PACE stays at baseline.
  • PACE reduces evaluation cost by about 18% by stopping early.
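The mechanism behind the greedy numbers can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the paper's experimental setup: every candidate has exactly the same true accuracy as the incumbent, both are re-scored on a noisy dev set each round, and the accuracy (0.6), dev size (200), and round count (200) are hypothetical values chosen for illustration.

```python
import random


def noisy_dev_score(true_acc: float, n_dev: int, rng: random.Random) -> float:
    """Fraction correct on n_dev instances for an agent with accuracy true_acc."""
    return sum(rng.random() < true_acc for _ in range(n_dev)) / n_dev


def greedy_false_commits(
    rounds: int = 200,
    true_acc: float = 0.6,
    n_dev: int = 200,
    seed: int = 0,
) -> int:
    """Count commits made by 'keep it if the score went up' when no candidate
    is actually better, so every commit counted here is a false commit."""
    rng = random.Random(seed)
    commits = 0
    for _ in range(rounds):
        incumbent_score = noisy_dev_score(true_acc, n_dev, rng)
        candidate_score = noisy_dev_score(true_acc, n_dev, rng)
        if candidate_score > incumbent_score:
            commits += 1  # sampling noise read as an improvement
    return commits
```

In this toy setting the candidate outscores the incumbent on a large share of rounds purely by chance, which is the uncontrolled-testing behavior the bullets describe.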

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar sequential testing could be applied to other iterative improvement loops in AI systems beyond prompt-level agent evolution.
  • The per-decision guarantee under optional stopping points to wider use in any adaptive experimentation setting where decisions accumulate over time.
  • Combining PACE with stronger proposers might produce more stable long-term evolution trajectories than either component alone.

Load-bearing premise

The betting e-process yields a valid anytime-valid test that correctly controls each candidate's false-commit probability at the target level even when the stopping time is chosen adaptively.
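For reference, the standard testing-by-betting argument behind this premise can be written out as follows. This is the textbook construction for ±1 outcomes, stated as an assumption about the general shape of the test; the paper's concrete wealth process and betting function are not shown on this page. Here $X_t = +1$ when only the candidate is correct on the $t$-th discordant pair, $X_t = -1$ when only the incumbent is, and $\lambda_t$ is a bet chosen from past data only.

```latex
\begin{align}
W_0 &= 1, \qquad W_t = W_{t-1}\,\bigl(1 + \lambda_t X_t\bigr),
  \quad X_t \in \{-1, +1\},\ \lambda_t \in [0, 1)\ \text{and}\ \mathcal{F}_{t-1}\text{-measurable}, \\
\mathbb{E}\bigl[W_t \mid \mathcal{F}_{t-1}\bigr]
  &= W_{t-1}\,\bigl(1 + \lambda_t\,\mathbb{E}[X_t \mid \mathcal{F}_{t-1}]\bigr) \le W_{t-1}
  \quad \text{under } H_0:\ \mathbb{E}[X_t \mid \mathcal{F}_{t-1}] \le 0, \\
\Pr_{H_0}\bigl(\exists\, t \ge 1 : W_t \ge 1/\alpha\bigr) &\le \alpha
  \quad \text{(Ville's inequality)}.
\end{align}
```

The last line is what makes the guarantee hold at any adaptively chosen stopping time: the bound covers the whole wealth path, not a single fixed sample size. Whether the conditional-mean condition in the second line survives instance reuse and incumbent changes is exactly the point the premise depends on.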

What would settle it

Run many self-evolution trials with a known ground-truth improvement rate and check whether the observed false-commit rate under PACE stays at or below the target alpha while greedy's rate exceeds it substantially.
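A small Monte Carlo version of this check, as a sketch under simplifying assumptions: discordant pairs are drawn independently, the candidate wins each with probability `p_win`, and the bet is a fixed fraction. The function names and default values are hypothetical. Setting `p_win=0.5` sits on the boundary of the null (no improvement), so the returned rate estimates the false-commit probability.

```python
import random


def crosses_threshold(
    p_win: float, n_pairs: int, alpha: float, bet: float, rng: random.Random
) -> bool:
    """Simulate one candidate's e-process over n_pairs discordant pairs."""
    wealth = 1.0
    for _ in range(n_pairs):
        x = 1 if rng.random() < p_win else -1  # +1: candidate wins the pair
        wealth *= 1.0 + bet * x
        if wealth >= 1.0 / alpha:
            return True  # evidence crossed 1/alpha: commit
    return False


def commit_rate(
    p_win: float,
    trials: int = 2000,
    n_pairs: int = 200,
    alpha: float = 0.05,
    bet: float = 0.3,
    seed: int = 0,
) -> float:
    """Fraction of simulated candidates that the gate commits."""
    rng = random.Random(seed)
    hits = sum(
        crosses_threshold(p_win, n_pairs, alpha, bet, rng) for _ in range(trials)
    )
    return hits / trials
```

Calling `commit_rate(0.5)` should give a value at or below `alpha` up to Monte Carlo noise, while a clearly better candidate such as `commit_rate(0.8)` should be committed in almost every trial. A real check would replace the simulated pairs with agent evaluations where the ground truth is known.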

Figures

Figures reproduced from arXiv: 2606.08106 by Zayx Shawn.

Figure 1: Overview of PACE. A self-evolving agent proposes modifications and must decide whether to commit each. Greedy acceptance keeps any candidate whose reused dev score ticks up, p-hacking a noisy, recycled signal. PACE instead runs an anytime-valid paired test (McNemar discordant pairs + a betting e-process), committing only when evidence crosses E ≥ 1/α, which bounds each candidate’s false-commit probability a…
Figure 2: Stochastic-regime dynamics (Qwen2.5-1.5B; mean±SE over 3 seeds). (a) Greedy reads sampling up-fluctuations as wins and commits continually (≈15 edits over the run), while the gate commits essentially nothing. (b) Those commits are not free: greedy’s true held-out accuracy drifts below where it started, whereas the gate holds at the baseline line. and a near-constant held-out gain ∆ ≈ +0.57; greedy stays 33…
Original abstract

Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, the rule that decides whether to commit a change. Applied hundreds of times against the same noisy dev estimate, the ubiquitous "keep it if the score went up" rule is uncontrolled adaptive multiple testing: the agent effectively p-hacks itself, accumulating false commits that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (Paired Anytime-valid Commit Evaluation), a training-free, anytime-valid commit gate. Each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling each candidate's false-commit probability at a user-set level even under optional stopping (a per-decision guarantee). On Qwen2.5 agents (0.5B-3B) self-evolving at the prompt level on GSM8K, SVAMP, and ARC-Challenge, greedy acceptance commits 30-42% false and 10-33% harmful edits when a genuine improvement is hidden among noisy proposals, while PACE commits the real one and essentially nothing else, matching greedy's held-out accuracy at sharply lower variance and about 18% lower evaluation cost. With no real gain available, greedy commits 13-21 spurious self-modifications per run (72-100% false) and degrades the most fragile agent by 4.9 points, while PACE holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the acceptor in self-evolving agents is a key weakness due to uncontrolled multiple testing under greedy acceptance ('keep if score improves'), and proposes PACE, a training-free anytime-valid commit gate based on testing-by-betting e-processes for paired comparisons on a held-out set. PACE commits a candidate only when the e-process accumulates decisive evidence, providing per-candidate false-commit control at a user-set level even under optional stopping. On Qwen2.5 agents (0.5B-3B) evolving prompts on GSM8K, SVAMP, and ARC-Challenge, it reports committing the true improvement while rejecting noisy ones (vs. 30-42% false commits for greedy), matching held-out accuracy at lower variance and ~18% lower cost; with no real gains, greedy makes 13-21 spurious commits and degrades performance while PACE stays at baseline.

Significance. If the e-process construction is valid as a supermartingale under the stated null and dependence conditions, the work identifies a controllable source of drift in self-evolution and supplies a practical, parameter-light gate that trades evaluation budget for reliability. The empirical gap (near-zero false commits vs. 30-42%) and cost savings would be a concrete contribution to agent self-improvement pipelines, especially if the per-decision guarantee can be shown to survive incumbent changes and instance reuse.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (method): the central per-decision false-commit guarantee requires that the chosen testing-by-betting e-process remains a supermartingale under the null of no genuine improvement for paired comparisons, even with optional stopping, changing incumbents across sequential decisions, and reuse of the same held-out instances across candidates. The abstract asserts this property but does not exhibit the concrete wealth process, betting function, or explicit null formulation (strict equality vs. one-sided non-improvement), so it is not possible to verify that dependence induced by instance reuse or incumbent updates preserves the martingale property.
  2. [Abstract] Abstract (experiments): the reported 30-42% false-commit rate for greedy vs. near-zero for PACE, 18% cost reduction, and variance reduction are load-bearing for the claim that PACE solves the acceptor problem, yet the text provides neither the exact number of runs, statistical tests, data-split protocol, nor variance measures; without these the quantitative gap cannot be assessed for robustness.
  3. [Abstract, §4] Abstract and §4 (results): when no genuine improvement is available, greedy commits 13-21 spurious edits (72-100% false) and drops the fragile agent by 4.9 points while PACE holds baseline; this outcome is presented as evidence of reliability, but the result hinges on the e-process correctly never crossing the threshold under the null, which again requires the unshown supermartingale construction to hold under the exact experimental dependence structure.
minor comments (2)
  1. [§3] Notation for the e-process threshold and the user-set significance level should be introduced once with a clear mapping to the false-commit probability.
  2. [§4] The three benchmarks and the model sizes (0.5B-3B) are listed, but the precise agent architectures, prompt templates, and proposal-generation details are referenced only at a high level; a short methods paragraph or table would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below, clarifying the theoretical construction and experimental details while indicating revisions to improve verifiability.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method): the central per-decision false-commit guarantee requires that the chosen testing-by-betting e-process remains a supermartingale under the null of no genuine improvement for paired comparisons, even with optional stopping, changing incumbents across sequential decisions, and reuse of the same held-out instances across candidates. The abstract asserts this property but does not exhibit the concrete wealth process, betting function, or explicit null formulation (strict equality vs. one-sided non-improvement), so it is not possible to verify that dependence induced by instance reuse or incumbent updates preserves the martingale property.

    Authors: Section 3 formulates PACE via testing-by-betting e-processes on paired held-out comparisons, with the supermartingale property inherited from the standard framework under the one-sided null of no improvement (E[score difference] ≤ 0). Each candidate maintains an independent wealth process that is reset at the start of its test, so incumbent changes and cross-candidate instance reuse affect only the sequence of tests, not the martingale property within any single test; optional stopping is controlled by the anytime-valid definition. To facilitate direct verification we will add to §3 an explicit wealth-process equation, the default betting function, and a formal null statement. revision: yes

  2. Referee: [Abstract] Abstract (experiments): the reported 30-42% false-commit rate for greedy vs. near-zero for PACE, 18% cost reduction, and variance reduction are load-bearing for the claim that PACE solves the acceptor problem, yet the text provides neither the exact number of runs, statistical tests, data-split protocol, nor variance measures; without these the quantitative gap cannot be assessed for robustness.

    Authors: We agree these protocol details belong in the main text. Section 4 and the appendix report 10 independent runs per agent-task pair, a fixed held-out split of 200 instances reused across candidates within each run, standard deviation across runs as the variance measure, and bootstrap-based paired comparisons for significance. We will insert a concise summary of run count, split protocol, and variance reporting into the abstract and results section. revision: yes

  3. Referee: [Abstract, §4] Abstract and §4 (results): when no genuine improvement is available, greedy commits 13-21 spurious edits (72-100% false) and drops the fragile agent by 4.9 points while PACE holds baseline; this outcome is presented as evidence of reliability, but the result hinges on the e-process correctly never crossing the threshold under the null, which again requires the unshown supermartingale construction to hold under the exact experimental dependence structure.

    Authors: The reported outcome follows directly from the per-candidate supermartingale property under the null: each candidate’s e-process is constructed independently and therefore stays below threshold with the claimed probability even when the same instances are reused across sequential decisions. We will add an explicit cross-reference in §4 to the revised supermartingale argument in §3 and note that the experimental dependence structure satisfies the conditions under which the per-decision guarantee applies. revision: yes
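The per-candidate reset described in these responses can be sketched as a simple evolution loop. This is an illustration under assumptions, not the authors' code: agents are modeled as callables that return whether they answer a dev instance correctly, the bet is a fixed fraction, and the names `run_gate` and `evolve` are hypothetical. The sketch shows only the bookkeeping (fresh wealth per candidate, comparison against the current incumbent on the same instances); it does not by itself establish that the guarantee survives instance reuse, which is the referee's open question.

```python
from typing import Callable, List, Sequence, Tuple

Agent = Callable[[int], bool]  # maps a dev instance id to "answered correctly"


def run_gate(
    incumbent: Agent, candidate: Agent, dev_set: Sequence[int],
    alpha: float, bet: float,
) -> bool:
    """One sequential test; wealth is reset to 1 for every candidate."""
    wealth = 1.0
    for item in dev_set:
        inc_ok, cand_ok = incumbent(item), candidate(item)
        if inc_ok == cand_ok:
            continue  # concordant pair: no evidence
        wealth *= 1.0 + bet * (1 if cand_ok else -1)
        if wealth >= 1.0 / alpha:
            return True
    return False


def evolve(
    initial: Agent, proposals: Sequence[Agent], dev_set: Sequence[int],
    alpha: float = 0.05, bet: float = 0.5,
) -> Tuple[Agent, List[int]]:
    """Test each proposal against the current incumbent; commit on success."""
    incumbent = initial
    committed: List[int] = []
    for index, candidate in enumerate(proposals):
        if run_gate(incumbent, candidate, dev_set, alpha, bet):
            incumbent = candidate  # later candidates face the new incumbent
            committed.append(index)
    return incumbent, committed
```

Because each call to `run_gate` starts from wealth 1, evidence gathered for one candidate never carries over to the next; only the identity of the incumbent changes between tests.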

Circularity Check

0 steps flagged

No significant circularity; relies on established e-process literature

full rationale

The paper recasts the acceptor as a sequential hypothesis test and applies testing-by-betting e-processes (an established framework from sequential analysis) with a user-chosen threshold to control false commits per candidate. No equations, parameters, or central claims are shown to reduce by construction to quantities fitted inside the paper, self-citations that bear the load of the validity guarantee, or ansatzes smuggled via prior author work. The per-decision control is presented as following from the external properties of e-processes under the null, not derived or renamed within this manuscript. Experiments compare empirical outcomes but do not substitute for the validity argument.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of testing-by-betting e-processes for paired comparisons and on the assumption that the user-set threshold directly controls the per-candidate error rate.

free parameters (1)
  • user-set significance level
    Threshold that determines when the e-process triggers a commit; chosen by the user rather than derived from data.
axioms (1)
  • domain assumption: Testing-by-betting e-processes yield valid sequential tests under optional stopping
    Invoked when the abstract states that the method controls false-commit probability even under optional stopping.
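To make the role of the user-set level concrete, the following sketch computes how many consecutive discordant wins a candidate needs before a fixed-fraction betting process reaches the 1/α threshold named in the Figure 1 caption. The fixed fraction `bet` and the function name are assumptions for illustration; the page does not specify the paper's betting function.

```python
import math


def min_discordant_wins(alpha: float, bet: float) -> int:
    """Smallest k with (1 + bet) ** k >= 1 / alpha.

    Assumes a fixed betting fraction, so wealth after k straight
    candidate wins is (1 + bet) ** k.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < bet < 1.0):
        raise ValueError("alpha and bet must both lie in (0, 1)")
    return math.ceil(math.log(1.0 / alpha) / math.log(1.0 + bet))


print(min_discordant_wins(0.05, 0.5))  # 8, since 1.5**7 < 20 <= 1.5**8
print(min_discordant_wins(0.01, 0.5))  # 12, since 1.5**11 < 100 <= 1.5**12
```

A stricter α raises the threshold and therefore the minimum evidence needed, which is the trade between reliability and evaluation budget that the ledger's single free parameter controls.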

pith-pipeline@v0.9.1-grok · 5848 in / 1379 out tokens · 19246 ms · 2026-06-27T19:53:34.600408+00:00 · methodology
