Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph

Yohei Nakajima

arxiv: 2606.10241 · v1 · pith:YQEH46DLnew · submitted 2026-06-08 · 💻 cs.AI


Pith reviewed 2026-06-27 16:02 UTC · model grok-4.3

classification 💻 cs.AI

keywords: agent improvement loops, event sourcing, held-out validation, prompt repair, failure diagnosis, LongMemEval, auditable runtime


The pith

An event-sourced runtime makes controlled agent improvement a first-class auditable workflow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when an agent's state is a deterministic projection of an append-only event log, failures get recorded automatically, runs replay exactly from the log, candidate patches stay scoped to typed pipeline seams, gates become auditable, and every promotion or discard itself becomes an event. It demonstrates the approach with Regimes, a target-agnostic loop that diagnoses failures, proposes repairs at pipeline points, and promotes them only after static checks, sandbox runs, in-sample tests, and held-out validation. On LongMemEval-S the dominant failure mode is reconciliation rather than retrieval, and the loop finds reader-prompt repairs that raise held-out accuracy. A sympathetic reader would care because the method removes external scaffolding and turns improvement into something that can be replayed and inspected inside the agent's own history.

Core claim

The central claim is that an event-sourced agent runtime removes the friction that usually makes autonomous improvement loops untrustworthy. Failures are recorded in the log, a run replays exactly from its history, candidate patches scope to typed pipeline seams, gates are auditable, and every promote-or-discard decision is itself an event. Regimes implements this loop on ActiveGraph and applies it to LongMemEval-S, where it discovers reader-prompt repairs that improve final held-out accuracy by 0.05 to 0.10 in four of five seeded splits and by 0.01 in one over-promotion split.
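To make the projection idea concrete, here is a minimal sketch. It is not ActiveGraph's actual API: the `Event` type, the event kinds (`eval_failed`, `patch_promoted`, `patch_discarded`), and the `project` function are hypothetical names chosen for illustration. It shows state computed as a pure fold over an append-only log, so replaying the same log (or any prefix of it) always yields the same state.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    kind: str                 # hypothetical kinds; not ActiveGraph's real schema
    payload: dict[str, Any]


def project(log: list[Event]) -> dict[str, list[str]]:
    """Fold the log into state. No clocks, randomness, or I/O: same log, same state."""
    state: dict[str, list[str]] = {"failures": [], "promoted": [], "discarded": []}
    for event in log:
        if event.kind == "eval_failed":
            state["failures"].append(event.payload["question_id"])
        elif event.kind == "patch_promoted":
            state["promoted"].append(event.payload["patch_id"])
        elif event.kind == "patch_discarded":
            state["discarded"].append(event.payload["patch_id"])
    return state


# Illustrative log; identifiers are made up.
log = [
    Event("eval_failed", {"question_id": "q1"}),
    Event("patch_discarded", {"patch_id": "p1"}),
    Event("patch_promoted", {"patch_id": "p2"}),
]

state = project(log)
state_at_first_event = project(log[:1])   # replay a prefix to inspect history
```

Because `project` reads nothing but the log, auditing a decision reduces to replaying the events that preceded it.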

What carries the argument

The Regimes loop on the ActiveGraph runtime: it routes failures through a taxonomy to pipeline locations, proposes typed repairs, and gates promotion behind static checks, sandbox execution, in-sample evaluation, and held-out validation.
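The gate sequence can be sketched as follows. The gate order comes from the text above; everything else is an assumption made for illustration, including the `Gate` type, the candidate fields, the pass criteria (for example, "delta greater than zero"), and the event shapes. The point of the sketch is that each gate result and the final promote-or-discard decision are appended to the log rather than stored elsewhere.

```python
from typing import Callable, NamedTuple


class Gate(NamedTuple):
    name: str
    passes: Callable[[dict], bool]


# Gate order follows the text; the pass criteria are placeholder assumptions.
GATES = [
    Gate("static_check", lambda c: c["seam"] in {"reader_prompt"}),
    Gate("sandbox", lambda c: c["sandbox_ok"]),
    Gate("in_sample", lambda c: c["in_sample_delta"] > 0),
    Gate("held_out", lambda c: c["held_out_delta"] > 0),
]


def evaluate(candidate: dict, log: list[dict]) -> bool:
    """Run gates in order; append every gate result and the final decision to the log."""
    for gate in GATES:
        ok = bool(gate.passes(candidate))
        log.append({"kind": "gate_result", "patch": candidate["id"],
                    "gate": gate.name, "passed": ok})
        if not ok:
            log.append({"kind": "patch_discarded", "patch": candidate["id"],
                        "failed_gate": gate.name})
            return False
    log.append({"kind": "patch_promoted", "patch": candidate["id"]})
    return True


# Made-up candidate that helps in-sample but regresses on held-out data.
log: list[dict] = []
candidate = {"id": "p1", "seam": "reader_prompt", "sandbox_ok": True,
             "in_sample_delta": 0.04, "held_out_delta": -0.01}
promoted = evaluate(candidate, log)
```

In this example the candidate clears the first three gates and is discarded at the held-out gate, and the log records which gate stopped it.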

If this is right

The same control flow runs against different tasks through a common interface.
Every promotion or discard decision is recorded as an event in the agent's history.
Candidate repairs are scoped to typed pipeline seams rather than arbitrary code changes.
The dominant failure on LongMemEval-S is reconciliation of evidence already present in the assembled context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the taxonomy delivers positive marginal value, the same routing approach could be tested on other agent pipelines that combine retrieval and generation steps.
Treating prompts as discovery probes suggests that similar diagnosis-and-repair loops could be applied to non-prompt components such as retrieval rankers or tool-use policies.
Because every decision is an event, downstream analyses could replay entire improvement histories to measure cumulative effects across multiple cycles.
The target-agnostic property implies the loop could be applied to tasks outside LongMemEval without rewriting the control flow.

Load-bearing premise

The failure-regime taxonomy routes each failure to a pipeline location whose marginal value over an unrouted baseline is positive.
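As a rough illustration of what routing means here, the sketch below maps each diagnosed failure to a pipeline seam and builds the regime histogram. The two regime names (`retrieval`, `reconciliation`) come from the text; the seam names, the `ROUTES` table, and the diagnosis record shape are hypothetical, and the paper's actual taxonomy may have more regimes.

```python
from collections import Counter

# Hypothetical routing table: regime names are from the text,
# seam names are placeholders.
ROUTES = {
    "retrieval": "retriever",
    "reconciliation": "reader_prompt",
}


def regime_histogram(diagnoses: list[dict]) -> Counter:
    """Count diagnosed failures per regime."""
    return Counter(d["regime"] for d in diagnoses)


def route(diagnosis: dict) -> str:
    """Return the pipeline seam a single diagnosed failure is routed to."""
    return ROUTES[diagnosis["regime"]]


# Made-up diagnoses for illustration only.
diagnoses = [
    {"question_id": "q1", "regime": "reconciliation"},
    {"question_id": "q2", "regime": "reconciliation"},
    {"question_id": "q3", "regime": "retrieval"},
]
histogram = regime_histogram(diagnoses)
targets = [route(d) for d in diagnoses]
```

An unrouted baseline would replace `route` with a rule that ignores the regime, such as always returning the same seam.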

What would settle it

Measure accuracy gains from an unrouted improvement baseline versus the same loop using the failure-regime taxonomy on the same five seeded held-out splits of LongMemEval-S.
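The comparison described above can be expressed as a per-split difference of gains. This is a sketch of the bookkeeping only, under the assumption that each arm reports a baseline and a final held-out accuracy per seed; the function names and input shape are hypothetical, and no numbers from the paper are used.

```python
def marginal_value(routed: dict[int, tuple[float, float]],
                   unrouted: dict[int, tuple[float, float]]) -> dict[int, float]:
    """Per-split (routed gain - unrouted gain).

    Each input maps seed -> (baseline_accuracy, final_accuracy) on held-out data.
    """
    if routed.keys() != unrouted.keys():
        raise ValueError("both arms must be run on the same seeded splits")
    result = {}
    for seed in sorted(routed):
        routed_base, routed_final = routed[seed]
        unrouted_base, unrouted_final = unrouted[seed]
        result[seed] = (routed_final - routed_base) - (unrouted_final - unrouted_base)
    return result


def splits_favoring_routing(marginals: dict[int, float]) -> int:
    """Count splits where the routed loop gained more than the unrouted one."""
    return sum(1 for value in marginals.values() if value > 0)
```

Requiring identical seeds in both arms keeps the comparison paired, so differences between splits do not masquerade as differences between methods.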

Figures

Figures reproduced from arXiv: 2606.10241 by Yohei Nakajima.

**Figure 2.** Loop control flow with gates. The loop diagnoses failures into a regime histogram, [...]

**Figure 3.** Flip barcode across the five fresh stratified splits. Each row shows one split’s final [...]

**Figure 4.** Candidate funnel per split: authored, static-rejected, discarded on OPTIMIZE, dis[...]

**Figure 5.** Promote trajectories for seed 5 and seed 101, held-out delta by promote index. Seed [...]

Original abstract

Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent's own history. We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow. When the agent's state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event. We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation. The loop is target-agnostic: the same control flow runs against different tasks through a common interface. On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant (seed 5 unadjusted for its sequential promotion structure), and the pooled count is descriptive only, since the splits share one 500-question pool. The durable contributions are ActiveGraph as an auditable substrate that makes controlled improvement loops tractable, the held-out-gated loop it supports, the failure-regime taxonomy routing each failure to a pipeline location (whose marginal value over an unrouted baseline is the primary open question), and the prompt-as-discovery-probe hypothesis.
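The caveat about the shared 500-question pool can be illustrated with a small sketch. It assumes a simple seeded, stratified split; the held-out fraction (0.2), the `stratum` field, and the synthetic pool are all assumptions, not details from the paper. Seeds 5 and 101 are used only because they are the seeds named in the figure captions.

```python
import random
from collections import defaultdict


def stratified_split(pool: list[dict], seed: int, held_out_frac: float = 0.2):
    """Return (in_sample_ids, held_out_ids), stratified by the 'stratum' field."""
    rng = random.Random(seed)            # seeded: the same seed reproduces the split
    by_stratum = defaultdict(list)
    for question in pool:
        by_stratum[question["stratum"]].append(question["id"])
    in_sample, held_out = set(), set()
    for stratum in sorted(by_stratum):   # fixed iteration order keeps it deterministic
        ids = sorted(by_stratum[stratum])
        rng.shuffle(ids)
        k = round(len(ids) * held_out_frac)
        held_out.update(ids[:k])
        in_sample.update(ids[k:])
    return in_sample, held_out


# Synthetic 500-item pool with five made-up strata (not LongMemEval data).
pool = [{"id": i, "stratum": i % 5} for i in range(500)]
_, held_a = stratified_split(pool, seed=5)
_, held_b = stratified_split(pool, seed=101)
shared = held_a & held_b   # questions that would be counted twice if splits were pooled
```

Any question in `shared` appears in more than one held-out set, which is why summing results across splits does not give independent observations.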

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a concrete event-sourced runtime and gated repair loop that produces held-out gains on LongMemEval, but it leaves the taxonomy's marginal value over a plain repair loop untested, as the authors themselves flag.


The main takeaway is that this work turns controlled agent improvement into something you can actually audit and replay inside the runtime itself. ActiveGraph logs everything as events so failures, patches, and promotions become part of the agent's own history rather than side scaffolding. The loop runs the same control flow across tasks, proposes repairs at typed pipeline points, and only promotes after static checks, sandbox runs, in-sample tests, and held-out validation.

They show this on LongMemEval-S where the main issue turns out to be reader reconciliation rather than retrieval. Across five seeded splits the reader-prompt repairs lift held-out accuracy by 0.05-0.10 in four cases and 0.01 in one, with two splits individually significant. The paper is clear that the pooled numbers are only descriptive because the splits share the same 500-question pool.

What works is the substrate and the workflow discipline. The event-sourced approach removes the usual friction around logging and replay, and the held-out gate is a straightforward way to keep the loop honest. They also state their primary open question directly: whether the failure-regime taxonomy adds anything over an unrouted baseline.

The soft spots line up with that open question. No unrouted control is described, so the reported gains could come from any systematic patch generator. Error bars and the full statistical protocol are not detailed in the abstract, the scope stays narrow to one runtime and one benchmark family, and the over-promotion split shows only a tiny lift. These are real limits but they are acknowledged rather than hidden.

This is for people building agent systems who need a practical, auditable substrate more than another theoretical equation. It deserves a serious referee because it ships a working loop with numbers and honest caveats even if the taxonomy piece needs tighter controls.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Regimes, an auditable, held-out-gated improvement loop on the ActiveGraph event-sourced runtime. Failures on LongMemEval-S are diagnosed via a failure-regime taxonomy that routes each to a pipeline location; reader-prompt repairs are proposed, statically checked, sandboxed, evaluated in-sample, and promoted only after held-out validation. Across five seeded held-out splits the loop reports accuracy gains of +0.05 to +0.10 in four splits and +0.01 in one over-promotion split, with two splits individually significant; the taxonomy's marginal value over an unrouted baseline is explicitly flagged as the primary open question.

Significance. If the taxonomy's marginal contribution is confirmed by an unrouted control, the work supplies a concrete, replayable substrate that makes controlled agent improvement loops first-class and auditable rather than external scaffolding. The emphasis on deterministic event logs, typed pipeline seams, and promotion-as-event is a clear engineering contribution to reproducible agent workflows.

major comments (2)

[Abstract] The central attribution of the reported +0.05–+0.10 held-out gains to the regime taxonomy lacks any unrouted baseline (always-target-reader, random location, or fixed-location control). Without this comparison the improvements cannot be distinguished from those produced by any systematic repair generator; the manuscript itself identifies this comparison as the primary open question.
[Abstract] Concrete accuracy deltas are stated without error bars, a full statistical protocol, or details on how the failure-regime taxonomy was derived from the same 500-question pool used for the splits. This weakens the claim that two splits are individually significant.
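One protocol that could address the second comment is an exact McNemar test on the paired held-out flips for each split. This is a sketch of a standard technique, not the test the paper used (the abstract does not say which test underlies the per-split significance claims). The function name and argument names are hypothetical.

```python
from math import comb


def mcnemar_exact(gained: int, lost: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    gained: held-out questions wrong before the patch and right after.
    lost:   held-out questions right before the patch and wrong after.
    Questions whose correctness did not change carry no information here.
    """
    n = gained + lost
    if n == 0:
        return 1.0
    k = min(gained, lost)
    # Binomial(n, 0.5) probability of a split at least this lopsided, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The test treats each split in isolation. It does not adjust for sequential promotion within a split or for the overlap between splits drawn from one pool, so both caveats raised in the abstract would still apply.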

minor comments (1)

[Abstract] The manuscript notes that pooled results are descriptive only because splits share the 500-question pool; this limitation should be stated more prominently when presenting the per-split numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, noting where the manuscript already acknowledges the limitation and where we will make revisions.


Referee: [Abstract] The central attribution of the reported +0.05–+0.10 held-out gains to the regime taxonomy lacks any unrouted baseline (always-target-reader, random location, or fixed-location control). Without this comparison the improvements cannot be distinguished from those produced by any systematic repair generator; the manuscript itself identifies this comparison as the primary open question.

Authors: We agree that the manuscript lacks an unrouted baseline comparison. The abstract already explicitly identifies the marginal value of the regime taxonomy over an unrouted baseline as the primary open question and presents the gains with this caveat rather than claiming definitive attribution to the taxonomy. The work is framed as supplying an auditable substrate for such controlled experiments. No revision is required on this point.
Revision: no

Referee: [Abstract] Concrete accuracy deltas are stated without error bars, a full statistical protocol, or details on how the failure-regime taxonomy was derived from the same 500-question pool used for the splits. This weakens the claim that two splits are individually significant.

Authors: We accept the point. The abstract reports deltas without error bars or a complete statistical protocol, and taxonomy derivation details are not expanded. The manuscript already caveats that the pooled count is descriptive only (due to the shared 500-question pool) and that significance for seed 5 is unadjusted. We will revise the abstract to report error bars across the seeded splits and expand the methods section with taxonomy derivation and statistical protocol details.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical demonstration on held-out splits is independent

full rationale

The paper presents an engineering demonstration of an auditable improvement loop with held-out validation as an explicit gate. The reported accuracy gains are measured on seeded held-out splits, and the abstract explicitly flags the taxonomy's marginal value over an unrouted baseline as the primary open question rather than asserting it as demonstrated. No equations, self-citations, fitted parameters renamed as predictions, or ansatzes appear in the text. The derivation chain consists of a runtime substrate and a gated workflow whose outputs are externally checked on held-out data; it does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the system is described as relying on standard event-sourcing and validation practices.



Reference graph

Works this paper leans on

11 extracted references · 7 linked inside Pith

[9] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM Agents Are Experiential Learners. In AAAI, 2024.
[11] The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems. arXiv preprint arXiv:2605.21997.
[12] LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv preprint arXiv:2410.10813.
[13] LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues. arXiv preprint arXiv:2605.12493.
[14] A Self-Improving Coding Agent. arXiv preprint arXiv:2504.15228.
[15] GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents. arXiv preprint arXiv:2605.29668.
[16] Why Do Multi-Agent LLM Systems Fail? arXiv preprint arXiv:2503.13657.
[17] Where LLM Agents Fail and How They Can Learn From Failures. arXiv preprint arXiv:2509.25370.
[18] Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv preprint arXiv:2303.11366.
[19] ExpeL: LLM Agents Are Experiential Learners. AAAI.
[20] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv preprint arXiv:2310.03714.
