Workspace Optimization: How to Train Your Agent
Pith reviewed 2026-05-12 02:23 UTC · model grok-4.3
The pith
Agents can improve on hard interactive tasks by evolving their external workspace instead of updating model weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Workspace optimization evolves the agent's workspace—the structured external substrate it reads, writes, and tests—by mirroring weight-space training: artifacts stand in for parameters, evidence for data, counterexamples for losses, and textual feedback for gradients. The DreamTeam multi-agent harness realizes this idea through roles that construct an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official protocol and averaged over two runs, it improves the state-of-the-art protocol-matched score from 36 percent to 38.4 percent while using 31 percent fewer environment actions per game.
What carries the argument
Workspace optimization, the process of evolving the agent's structured external workspace by substituting artifacts for parameters, evidence for data, counterexamples for losses, and textual feedback for gradients.
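A minimal sketch of this analogy as a loop, to make the substitutions concrete. All helper names here (observe_transitions, find_counterexamples, explain_failures, propose_patch) are hypothetical stand-ins, not the paper's API.

```python
# Sketch of workspace optimization as a training-loop analogue.
# All helper names are hypothetical stand-ins, not the paper's API.

def workspace_optimization(env, model, workspace, rounds=10):
    """Evolve external artifacts while model weights stay frozen.

    workspace: dict of named artifacts (world-model code, logs, notes);
    artifacts play the role of parameters.
    """
    for _ in range(rounds):
        # Evidence plays the role of data: transitions gathered by acting.
        evidence = env.observe_transitions(workspace)

        # Counterexamples play the role of losses: transitions the current
        # world-model artifacts predict incorrectly.
        counterexamples = find_counterexamples(workspace, evidence)
        if not counterexamples:
            break  # workspace is consistent with everything observed

        # Textual feedback plays the role of gradients: the frozen model
        # turns failures into a concrete edit of one artifact.
        feedback = model.explain_failures(workspace, counterexamples)
        name, patched = propose_patch(model, workspace, feedback)
        workspace[name] = patched  # the update step touches artifacts only

    return workspace
```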
If this is right
- Fixed-weight frontier models can reach higher success rates on multi-turn tasks by evolving external artifacts through interaction.
- A multi-agent harness with distinct roles for world-model building, planning, and failure routing can implement workspace optimization on ARC-like environments.
- Gains on ARC-AGI-3 occur together with a reduction in the number of environment actions required.
- The training analogy extends to any interactive setting where the base model holds useful priors yet needs iterative refinement.
Where Pith is reading between the lines
- The same workspace-evolution pattern could be tested on other interactive benchmarks where models start with partial knowledge but require multi-step exploration.
- Focusing optimization on external structures rather than weights may reduce overall compute for agent improvement in settings that forbid fine-tuning.
- Extending the harness to include automated role discovery or dynamic role switching might further lower the number of actions needed.
Load-bearing premise
The multi-agent harness can reliably evolve a useful workspace for tasks where the base model has strong priors but cannot solve the task in a single shot.
What would settle it
Applying the same DreamTeam protocol to a larger or held-out set of ARC-AGI-3 tasks and observing no score gain or an increase in actions per game would falsify the central claim.
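A sketch of that test as code, assuming a hypothetical run_agent(agent, game) helper that returns a (solved, actions_used) pair under the official protocol; the agent objects are placeholders as well.

```python
# Sketch of the falsification check: the central claim fails if the score
# gain or the action savings disappear on held-out ARC-AGI-3 tasks.
# run_agent and the agent objects are hypothetical placeholders.

def falsification_check(dreamteam, baseline, heldout_games):
    def evaluate(agent):
        results = [run_agent(agent, game) for game in heldout_games]
        score = 100.0 * sum(solved for solved, _ in results) / len(results)
        actions = sum(used for _, used in results) / len(results)
        return score, actions

    dt_score, dt_actions = evaluate(dreamteam)
    bl_score, bl_actions = evaluate(baseline)
    return dt_score <= bl_score or dt_actions > bl_actions
```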
Original abstract
Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's workspace, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that since frontier language models cannot adapt their weights, the trainable component for agents in hard multi-turn environments is the external 'workspace' (structured substrate for reading, writing, and testing). It proposes 'workspace optimization' as an analogy to weight-space training, replacing parameters with artifacts, data with evidence, losses with counterexamples, and gradients with textual feedback. The idea is instantiated in DreamTeam, a multi-agent harness for ARC-AGI-3 with specialized roles for world-model building, hypothesizing, probing, and failure routing. On the 25-game public ARC-AGI-3 set under the official protocol (averaged over two runs), DreamTeam raises the SOTA protocol-matched score from 36% to 38.4% while using 31% fewer environment actions per game.
Significance. If the empirical result can be substantiated with controls, ablations, and statistical reporting, the workspace-optimization framing offers a potentially useful paradigm for agent adaptation in settings where weight updates are unavailable or undesirable. The explicit training analogy (artifacts for parameters, counterexamples for losses) is a clear conceptual contribution that could guide future multi-agent designs. The reported efficiency improvement (fewer actions) is a positive signal, but the small absolute gain on a 25-game subset limits immediate impact without further validation.
Major comments (2)
- [Abstract] The central empirical claim (38.4% vs. 36% SOTA, 31% fewer actions) is presented without any description of the baseline SOTA agent's architecture or prompting, the precise definition of the 'official scoring protocol,' per-run scores, standard deviation, or statistical test for the 2.4-point difference. This absence makes it impossible to determine whether the gain is attributable to workspace optimization rather than noise, multi-agent parallelism, or differences in total interaction budget.
- [Abstract] No ablation or controlled comparison is described that isolates the contribution of the proposed workspace-evolution mechanism (artifacts, evidence, counterexamples, textual feedback) from the multi-agent role structure itself or from simply increasing the number of environment interactions. Without such controls, the result cannot falsify the alternative explanation that any multi-agent harness would produce similar gains.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where the presentation of our empirical results can be strengthened for greater transparency and rigor. We address each major comment below and describe the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Abstract] The central empirical claim (38.4% vs. 36% SOTA, 31% fewer actions) is presented without any description of the baseline SOTA agent's architecture or prompting, the precise definition of the 'official scoring protocol,' per-run scores, standard deviation, or statistical test for the 2.4-point difference. This absence makes it impossible to determine whether the gain is attributable to workspace optimization rather than noise, multi-agent parallelism, or differences in total interaction budget.
Authors: We agree that the abstract should be more self-contained to allow readers to evaluate the claims without immediately consulting the main text. In the revised version we will expand the abstract to briefly characterize the baseline (the current protocol-matched SOTA agent from the ARC-AGI-3 public leaderboard, whose architecture and prompting details appear in Section 3.2), state the official scoring protocol (binary success per game on the 25-game public set under the standard per-game action limit), and report the two individual run scores together with their standard deviation. We will also note that the 2.4-point improvement was observed in both runs and that DreamTeam operates under a strictly lower action budget (31% fewer environment actions). While a formal statistical test with only two runs has limited power, we will include the per-run data so readers can assess consistency. These additions will make it clearer that the reported gains are not the result of an expanded interaction budget. Revision: yes.
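A minimal sketch of the reporting the authors promise, under the protocol they state (binary success per game on the 25-game public set, aggregated over two runs); the success vectors would come from the released per-run data, which is not reproduced here.

```python
import statistics

# Sketch of the promised per-run reporting: binary success per game is
# averaged into a per-run score, then mean and standard deviation are
# reported across runs. Input data is a placeholder, not released results.

def protocol_score(successes):
    """successes: 25 booleans, one per public ARC-AGI-3 game."""
    return 100.0 * sum(successes) / len(successes)

def report(runs):
    scores = [protocol_score(run) for run in runs]
    return {
        "per_run": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```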
- Referee: [Abstract] No ablation or controlled comparison is described that isolates the contribution of the proposed workspace-evolution mechanism (artifacts, evidence, counterexamples, textual feedback) from the multi-agent role structure itself or from simply increasing the number of environment interactions. Without such controls, the result cannot falsify the alternative explanation that any multi-agent harness would produce similar gains.
Authors: We acknowledge that an explicit ablation isolating the workspace-evolution components from the multi-agent scaffolding would strengthen the causal claim. The DreamTeam design integrates the workspace-optimization mechanisms (artifact creation, evidence logging, counterexample-driven feedback, and textual gradients) directly into the specialized roles; therefore the multi-agent structure is not an independent variable but the vehicle for realizing the proposed training analogy. Nevertheless, to address the referee's concern we will add a controlled comparison in the revised manuscript. Specifically, we will include results from a stripped-down multi-agent baseline that retains parallel agents but removes the workspace-specific mechanisms (no explicit world-model artifacts, no counterexample routing, and generic textual feedback). This comparison will be run under the same action budget, allowing readers to see whether the performance and efficiency gains persist when the core workspace-optimization primitives are ablated. We believe this addition will help rule out the alternative explanation that any multi-agent harness suffices. Revision: yes.
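A sketch of how the promised controlled comparison might be specified, with only the workspace primitives varying while the multi-agent scaffold and action budget stay fixed; the flag names and budget value are illustrative, not from the paper.

```python
from dataclasses import dataclass

# Illustrative ablation grid for the promised comparison. Only the
# workspace-optimization primitives vary; the parallel multi-agent
# scaffold and the action budget are held fixed. Flag names and the
# budget value are hypothetical.

@dataclass(frozen=True)
class HarnessConfig:
    world_model_artifacts: bool   # explicit executable world-model files
    counterexample_routing: bool  # failures routed back to the owning role
    specific_feedback: bool       # targeted textual feedback vs. generic hints
    action_budget: int            # identical across both conditions

FULL_DREAMTEAM = HarnessConfig(True, True, True, action_budget=1000)
STRIPPED_BASELINE = HarnessConfig(False, False, False, action_budget=1000)
```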
Circularity Check
No circularity: empirical result on external benchmark
Full rationale
The paper's central claim is an empirical performance gain (38.4% vs 36% on 25-game ARC-AGI-3 set) achieved by instantiating a multi-agent harness called DreamTeam. The workspace-optimization framing is presented as a conceptual analogy to weight-space training (artifacts for parameters, counterexamples for losses, etc.), but no equations, fitted parameters, or derivations are given that reduce the reported outcome to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The result is measured against an external protocol-matched SOTA baseline under stated conditions (two runs, official scoring), rendering the derivation chain self-contained.
Axiom & Free-Parameter Ledger
Invented entities (1)
- Workspace: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "DreamTeam... roles build an executable world model, plan, hypothesize, probe, strategize, and route failures."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. ICLR, 2024. Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, and Lianhui Qin. ArcMemo: Abstract reasoning composition with lifelong LLM memory. arXiv:2509.04439, 2025. Frozen-weight test-time adaptati...
- [2] Meta-Harness: End-to-End Optimization of Model Harnesses. Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. arXiv:2603.28052, 2026. Published 2026-05-01. 160-replay failure-mode analysis on the released benchmark. Guillaume Levy, Cédric Colas, Pierre-Yves Oudeyer, Thomas Carta, and Clément Romac. WorldLLM: ...
- [3] OpenAI. OpenAI Codex. https://openai.com/codex, 2025. Accessed 2026-05-06. Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. ReasoningBank: Scaling agent self-evolving ...
- [4] Generative Agents: Interactive Simulacra of Human Behavior. Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. UIST 2023; arXiv:2304.03442, 2023. Julien Pourcel, Cédric Colas, and Pierre-Yves Oudeyer. Self-improving language models for evolutionary program synthesis: ...
- [5] Lionel Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, and Joshua B. Tenenbaum. From word models to world models: Translating from natural language to the probabilistic language of thought. arXiv:2306.12672, 2023. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasim...
- [6] Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models. arXiv:2512.24601, 2025. Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin Gödel Machine: Open-ended evolution of self-improving agents. arXiv:2505.22954, 2025. Self-improving system that iteratively rewrites its own codin...
- [7] Python source validation rejects writes that fail to parse and reports the syntax error back to the caller (see the middleware sketch after this list).
- [8] World-model loader middleware re-imports the patched module into sys.modules so subsequent harness operations see the new functions.
- [9] World-model validation middleware reruns predict and render on the latest transition under the new code and emits a delta block that names which test cases passed or failed. The delta is feedback rather than a hard gate, so the role can keep iterating within the round budget. Patches to text artifacts (step log, level log, helper) and JSONL data files (z...
- [10] 1. the preferred action when one is available from an in-flight multi-action batch (the next planned action that has not yet executed); 2. UNDO when UNDO is in the available-actions set (see the fallback sketch after this list)
- [11] the most recent non-RESET action committed during the current run
- [12] a step-zero scaffold of USE, CLICK 32 32, or UP, whichever is first available. Third, the fallback is executed through the same single-action path as a normal commitment, so all bookkeeping (transition event, ledger update, step-log trim, agent-chat save) fires. Fourth, the harness sets a one-shot crash-recovery banner that is rendered at the top of the next...
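Entries [7] through [9] describe the harness's write-validation and reload middleware. Below is a minimal sketch of that pattern, assuming hypothetical file paths, a hypothetical module name (world_model), and a hypothetical predict signature; none of these identifiers come from the paper.

```python
import ast
import importlib
import sys

# Sketch of the middleware in entries [7]-[9]: reject Python writes that
# fail to parse, reload the patched world model so later harness calls see
# the new functions, and rerun recorded cases to emit a pass/fail delta.
# Identifiers (world_model, predict, case format) are illustrative.

def validated_write(path, source):
    """Reject a write that fails to parse; report the error to the caller."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return False, f"rejected: {err.msg} at line {err.lineno}"
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)
    return True, "ok"

def reload_world_model(module_name="world_model"):
    """Re-import the patched module into sys.modules."""
    if module_name in sys.modules:
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)

def validation_delta(world_model, cases):
    """Rerun predict on recorded transitions; the delta names which cases
    passed or failed. Feedback only, not a hard gate."""
    return {
        name: world_model.predict(state, action) == expected
        for name, (state, action, expected) in cases.items()
    }
```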
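Entries [10] through [12] excerpt the crash-recovery fallback priority. A hedged reconstruction follows; the exact ordering of the later fallbacks is inferred from the excerpts, and all identifiers are illustrative.

```python
# Hedged reconstruction of the fallback priority in entries [10]-[12]:
#   1. the preferred action from an in-flight multi-action batch,
#   2. UNDO when it is in the available-actions set,
#   3. the most recent non-RESET action committed this run,
#   4. at step zero, a scaffold of USE, CLICK 32 32, or UP.
# The chosen fallback then goes through the normal single-action commit
# path so all bookkeeping fires.

def choose_fallback(pending_batch, available, last_committed, step):
    if pending_batch:                       # next planned, not yet executed
        return pending_batch[0]
    if "UNDO" in available:
        return "UNDO"
    if last_committed is not None and last_committed != "RESET":
        return last_committed
    if step == 0:
        for scaffold in ("USE", "CLICK 32 32", "UP"):
            if scaffold in available:
                return scaffold
    return None
```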