Workspace Optimization: How to Train Your Agent
Pith reviewed 2026-05-12 02:23 UTC · model grok-4.3
The pith
Agents can improve on hard interactive tasks by evolving their external workspace instead of updating model weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Workspace optimization evolves the agent's workspace—the structured external substrate it reads, writes, and tests—by mirroring weight-space training: artifacts stand in for parameters, evidence for data, counterexamples for losses, and textual feedback for gradients. The DreamTeam multi-agent harness realizes this idea through roles that construct an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official protocol and averaged over two runs, it improves the state-of-the-art protocol-matched score from 36 percent to 38.4 percent while using 31 percent fewer environment actions per game.
What carries the argument
Workspace optimization, the process of evolving the agent's structured external workspace by substituting artifacts for parameters, evidence for data, counterexamples for losses, and textual feedback for gradients.
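A minimal sketch of this analogy as a loop, to make the substitutions concrete. All helper names here (observe_transitions, find_counterexamples, explain_failures, propose_patch) are hypothetical stand-ins, not the paper's API.

```python
# Sketch of workspace optimization as a training-loop analogue.
# All helper names are hypothetical stand-ins, not the paper's API.

def workspace_optimization(env, model, workspace, rounds=10):
    """Evolve external artifacts while model weights stay frozen.

    workspace: dict of named artifacts (world-model code, logs, notes);
    artifacts play the role of parameters.
    """
    for _ in range(rounds):
        # Evidence plays the role of data: transitions gathered by acting.
        evidence = env.observe_transitions(workspace)

        # Counterexamples play the role of losses: transitions the current
        # world-model artifacts predict incorrectly.
        counterexamples = find_counterexamples(workspace, evidence)
        if not counterexamples:
            break  # workspace is consistent with everything observed

        # Textual feedback plays the role of gradients: the frozen model
        # turns failures into a concrete edit of one artifact.
        feedback = model.explain_failures(workspace, counterexamples)
        name, patched = propose_patch(model, workspace, feedback)
        workspace[name] = patched  # the update step touches artifacts only

    return workspace
```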
If this is right
- Fixed-weight frontier models can reach higher success rates on multi-turn tasks by evolving external artifacts through interaction.
- A multi-agent harness with distinct roles for world-model building, planning, and failure routing can implement workspace optimization on ARC-like environments.
- Gains on ARC-AGI-3 occur together with a reduction in the number of environment actions required.
- The training analogy extends to any interactive setting where the base model holds useful priors yet needs iterative refinement.
Where Pith is reading between the lines
- The same workspace-evolution pattern could be tested on other interactive benchmarks where models start with partial knowledge but require multi-step exploration.
- Focusing optimization on external structures rather than weights may reduce overall compute for agent improvement in settings that forbid fine-tuning.
- Extending the harness to include automated role discovery or dynamic role switching might further lower the number of actions needed.
Load-bearing premise
The multi-agent harness can reliably evolve a useful workspace for tasks where the base model has strong priors but cannot solve the task in a single shot.
What would settle it
Applying the same DreamTeam protocol to a larger or held-out set of ARC-AGI-3 tasks and observing no score gain or an increase in actions per game would falsify the central claim.
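A sketch of that test as code, assuming a hypothetical run_agent(agent, game) helper that returns a (solved, actions_used) pair under the official protocol; the agent objects are placeholders as well.

```python
# Sketch of the falsification check: the central claim fails if the score
# gain or the action savings disappear on held-out ARC-AGI-3 tasks.
# run_agent and the agent objects are hypothetical placeholders.

def falsification_check(dreamteam, baseline, heldout_games):
    def evaluate(agent):
        results = [run_agent(agent, game) for game in heldout_games]
        score = 100.0 * sum(solved for solved, _ in results) / len(results)
        actions = sum(used for _, used in results) / len(results)
        return score, actions

    dt_score, dt_actions = evaluate(dreamteam)
    bl_score, bl_actions = evaluate(baseline)
    return dt_score <= bl_score or dt_actions > bl_actions
```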
Original abstract
Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's workspace, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that since frontier language models cannot adapt their weights, the trainable component for agents in hard multi-turn environments is the external 'workspace' (structured substrate for reading, writing, and testing). It proposes 'workspace optimization' as an analogy to weight-space training, replacing parameters with artifacts, data with evidence, losses with counterexamples, and gradients with textual feedback. The idea is instantiated in DreamTeam, a multi-agent harness for ARC-AGI-3 with specialized roles for world-model building, hypothesizing, probing, and failure routing. On the 25-game public ARC-AGI-3 set under the official protocol (averaged over two runs), DreamTeam raises the SOTA protocol-matched score from 36% to 38.4% while using 31% fewer environment actions per game.
Significance. If the empirical result can be substantiated with controls, ablations, and statistical reporting, the workspace-optimization framing offers a potentially useful paradigm for agent adaptation in settings where weight updates are unavailable or undesirable. The explicit training analogy (artifacts for parameters, counterexamples for losses) is a clear conceptual contribution that could guide future multi-agent designs. The reported efficiency improvement (fewer actions) is a positive signal, but the small absolute gain on a 25-game subset limits immediate impact without further validation.
Major comments (2)
- [Abstract] The central empirical claim (38.4% vs. 36% SOTA, 31% fewer actions) is presented without any description of the baseline SOTA agent's architecture or prompting, the precise definition of the 'official scoring protocol,' per-run scores, standard deviation, or statistical test for the 2.4-point difference. This absence makes it impossible to determine whether the gain is attributable to workspace optimization rather than noise, multi-agent parallelism, or differences in total interaction budget.
- [Abstract] No ablation or controlled comparison is described that isolates the contribution of the proposed workspace-evolution mechanism (artifacts, evidence, counterexamples, textual feedback) from the multi-agent role structure itself or from simply increasing the number of environment interactions. Without such controls, the result cannot falsify the alternative explanation that any multi-agent harness would produce similar gains.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where the presentation of our empirical results can be strengthened for greater transparency and rigor. We address each major comment below and describe the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Abstract] The central empirical claim (38.4% vs. 36% SOTA, 31% fewer actions) is presented without any description of the baseline SOTA agent's architecture or prompting, the precise definition of the 'official scoring protocol,' per-run scores, standard deviation, or statistical test for the 2.4-point difference. This absence makes it impossible to determine whether the gain is attributable to workspace optimization rather than noise, multi-agent parallelism, or differences in total interaction budget.
Authors: We agree that the abstract should be more self-contained to allow readers to evaluate the claims without immediately consulting the main text. In the revised version we will expand the abstract to briefly characterize the baseline (the current protocol-matched SOTA agent from the ARC-AGI-3 public leaderboard, whose architecture and prompting details appear in Section 3.2), state the official scoring protocol (binary success per game on the 25-game public set under the standard per-game action limit), and report the two individual run scores together with their standard deviation. We will also note that the 2.4-point improvement was observed in both runs and that DreamTeam operates under a strictly lower action budget (31% fewer environment actions). While a formal statistical test with only two runs has limited power, we will include the per-run data so readers can assess consistency. These additions will make it clearer that the reported gains are not the result of an expanded interaction budget. Revision: yes.
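A minimal sketch of the reporting the authors promise, under the protocol they state (binary success per game on the 25-game public set, aggregated over two runs); the success vectors would come from the released per-run data, which is not reproduced here.

```python
import statistics

# Sketch of the promised per-run reporting: binary success per game is
# averaged into a per-run score, then mean and standard deviation are
# reported across runs. Input data is a placeholder, not released results.

def protocol_score(successes):
    """successes: 25 booleans, one per public ARC-AGI-3 game."""
    return 100.0 * sum(successes) / len(successes)

def report(runs):
    scores = [protocol_score(run) for run in runs]
    return {
        "per_run": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```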
- Referee: [Abstract] No ablation or controlled comparison is described that isolates the contribution of the proposed workspace-evolution mechanism (artifacts, evidence, counterexamples, textual feedback) from the multi-agent role structure itself or from simply increasing the number of environment interactions. Without such controls, the result cannot falsify the alternative explanation that any multi-agent harness would produce similar gains.
Authors: We acknowledge that an explicit ablation isolating the workspace-evolution components from the multi-agent scaffolding would strengthen the causal claim. The DreamTeam design integrates the workspace-optimization mechanisms (artifact creation, evidence logging, counterexample-driven feedback, and textual gradients) directly into the specialized roles; therefore the multi-agent structure is not an independent variable but the vehicle for realizing the proposed training analogy. Nevertheless, to address the referee's concern we will add a controlled comparison in the revised manuscript. Specifically, we will include results from a stripped-down multi-agent baseline that retains parallel agents but removes the workspace-specific mechanisms (no explicit world-model artifacts, no counterexample routing, and generic textual feedback). This comparison will be run under the same action budget, allowing readers to see whether the performance and efficiency gains persist when the core workspace-optimization primitives are ablated. We believe this addition will help rule out the alternative explanation that any multi-agent harness suffices. Revision: yes.
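A sketch of how the promised controlled comparison might be specified, with only the workspace primitives varying while the multi-agent scaffold and action budget stay fixed; the flag names and budget value are illustrative, not from the paper.

```python
from dataclasses import dataclass

# Illustrative ablation grid for the promised comparison. Only the
# workspace-optimization primitives vary; the parallel multi-agent
# scaffold and the action budget are held fixed. Flag names and the
# budget value are hypothetical.

@dataclass(frozen=True)
class HarnessConfig:
    world_model_artifacts: bool   # explicit executable world-model files
    counterexample_routing: bool  # failures routed back to the owning role
    specific_feedback: bool       # targeted textual feedback vs. generic hints
    action_budget: int            # identical across both conditions

FULL_DREAMTEAM = HarnessConfig(True, True, True, action_budget=1000)
STRIPPED_BASELINE = HarnessConfig(False, False, False, action_budget=1000)
```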
Circularity Check
No circularity: empirical result on external benchmark
Full rationale
The paper's central claim is an empirical performance gain (38.4% vs 36% on 25-game ARC-AGI-3 set) achieved by instantiating a multi-agent harness called DreamTeam. The workspace-optimization framing is presented as a conceptual analogy to weight-space training (artifacts for parameters, counterexamples for losses, etc.), but no equations, fitted parameters, or derivations are given that reduce the reported outcome to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The result is measured against an external protocol-matched SOTA baseline under stated conditions (two runs, official scoring), rendering the derivation chain self-contained.
Axiom & Free-Parameter Ledger
Invented entities (1)
- Workspace: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "DreamTeam... roles build an executable world model, plan, hypothesize, probe, strategize, and route failures."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. ICLR, 2024. Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, and Lianhui Qin. ArcMemo: Abstract reasoning composition with lifelong LLM memory. arXiv:2509.04439, 2025. Frozen-weight test-time adaptati...
- [2] Meta-Harness: End-to-End Optimization of Model Harnesses. Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. arXiv:2603.28052, 2026. Published 2026-05-01. 160-replay failure-mode analysis on the released benchmark. Guillaume Levy, Cédric Colas, Pierre-Yves Oudeyer, Thomas Carta, and Clément Romac. WorldLLM: ...
- [3] OpenAI. OpenAI Codex. https://openai.com/codex, 2025. Accessed 2026-05-06. Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. ReasoningBank: Scaling agent self-evolving ...
- [4] Generative Agents: Interactive Simulacra of Human Behavior. Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. UIST 2023; arXiv:2304.03442, 2023. Julien Pourcel, Cédric Colas, and Pierre-Yves Oudeyer. Self-improving language models for evolutionary program synthesis: ...
- [5] Lionel Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, and Joshua B. Tenenbaum. From word models to world models: Translating from natural language to the probabilistic language of thought. arXiv:2306.12672, 2023. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasim...
- [6] Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models. arXiv:2512.24601, 2025. Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin Gödel Machine: Open-ended evolution of self-improving agents. arXiv:2505.22954, 2025. Self-improving system that iteratively rewrites its own codin...
- [7] Python source validation rejects writes that fail to parse and reports the syntax error back to the caller (see the middleware sketch after this list).
- [8] World-model loader middleware re-imports the patched module into sys.modules so subsequent harness operations see the new functions.
- [9] World-model validation middleware reruns predict and render on the latest transition under the new code and emits a delta block that names which test cases passed or failed. The delta is feedback rather than a hard gate, so the role can keep iterating within the round budget. Patches to text artifacts (step log, level log, helper) and JSONL data files (z...
- [10] 1. the preferred action when one is available from an in-flight multi-action batch (the next planned action that has not yet executed); 2. UNDO when UNDO is in the available-actions set (see the fallback sketch after this list)
- [11] the most recent non-RESET action committed during the current run
- [12] a step-zero scaffold of USE, CLICK 32 32, or UP, whichever is first available. Third, the fallback is executed through the same single-action path as a normal commitment, so all bookkeeping (transition event, ledger update, step-log trim, agent-chat save) fires. Fourth, the harness sets a one-shot crash-recovery banner that is rendered at the top of the next...
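Entries [7] through [9] describe the harness's write-validation and reload middleware. Below is a minimal sketch of that pattern, assuming hypothetical file paths, a hypothetical module name (world_model), and a hypothetical predict signature; none of these identifiers come from the paper.

```python
import ast
import importlib
import sys

# Sketch of the middleware in entries [7]-[9]: reject Python writes that
# fail to parse, reload the patched world model so later harness calls see
# the new functions, and rerun recorded cases to emit a pass/fail delta.
# Identifiers (world_model, predict, case format) are illustrative.

def validated_write(path, source):
    """Reject a write that fails to parse; report the error to the caller."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return False, f"rejected: {err.msg} at line {err.lineno}"
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)
    return True, "ok"

def reload_world_model(module_name="world_model"):
    """Re-import the patched module into sys.modules."""
    if module_name in sys.modules:
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)

def validation_delta(world_model, cases):
    """Rerun predict on recorded transitions; the delta names which cases
    passed or failed. Feedback only, not a hard gate."""
    return {
        name: world_model.predict(state, action) == expected
        for name, (state, action, expected) in cases.items()
    }
```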
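Entries [10] through [12] excerpt the crash-recovery fallback priority. A hedged reconstruction follows; the exact ordering of the later fallbacks is inferred from the excerpts, and all identifiers are illustrative.

```python
# Hedged reconstruction of the fallback priority in entries [10]-[12]:
#   1. the preferred action from an in-flight multi-action batch,
#   2. UNDO when it is in the available-actions set,
#   3. the most recent non-RESET action committed this run,
#   4. at step zero, a scaffold of USE, CLICK 32 32, or UP.
# The chosen fallback then goes through the normal single-action commit
# path so all bookkeeping fires.

def choose_fallback(pending_batch, available, last_committed, step):
    if pending_batch:                       # next planned, not yet executed
        return pending_batch[0]
    if "UNDO" in available:
        return "UNDO"
    if last_committed is not None and last_committed != "RESET":
        return last_committed
    if step == 0:
        for scaffold in ("USE", "CLICK 32 32", "UP"):
            if scaffold in available:
                return scaffold
    return None
```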