Executable World Models for ARC-AGI-3 in the Era of Coding Agents
Pith reviewed 2026-05-08 17:47 UTC · model grok-4.3
The pith
A coding-agent system using executable Python world models solves 7 of 25 public ARC-AGI-3 games with no game-specific code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The agent fully solved 7 games, achieved a Relative Human Action Efficiency greater than 75% on 6 games, and obtained a mean per-game RHAE of 32.58%. Because the system uses no game-specific code, it can serve as a game-general baseline for ARC-AGI-3. Performance on the private validation set remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents.
What carries the argument
Verifier-driven executable Python world models that are refactored toward simpler abstractions as a practical proxy for minimum-description-length simplicity bias, then used for planning before action.
Load-bearing premise
That verifier-driven executable Python world models with refactoring as a proxy for simplicity bias will generalize to the private validation set and other unseen tasks without requiring game-specific logic.
What would settle it
Measuring the same agent's success rate and relative efficiency on the private validation set of ARC-AGI-3 games; low scores there would show the public results do not extend to held-out tasks.
Original abstract
We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. The agent-facing prompts, workspace, and controller contain no game-specific code, game-specific prompts, hand-coded heuristics, hidden solutions, or other game-specific information; the same agent and prompts are used across games. Because the coding agent has broad system access, we audit unintended information channels, describe earlier vulnerable harnesses, and explain how the current harness closes observed leakage channels while reducing benchmark-specific information exposure. We report results on the 25 public ARC-AGI-3 games. Each playthrough starts from a fresh agent instance and clean workspace, with no access to files or conversation state from earlier playthroughs. With GPT-5.5 high reasoning effort, the agent fully solved 15 games and achieved a mean per-game RHAE of 58.12%. With GPT-5.4 high reasoning effort, it fully solved 8 games and achieved a mean per-game RHAE of 41.29%. Performance on the private validation set, which is not yet available to us, remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents. Full run artifacts are released with the code at https://github.com/astroseger/arc-3-agents-baseline1.
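The verification step described in the abstract can be pictured as replaying logged transitions through the candidate model and flagging disagreements. The following is a minimal sketch under stated assumptions, not the authors' harness: the names `verify_model`, `candidate_step`, and the 1-D track environment are hypothetical and chosen only for illustration.

```python
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = str
# (state before, action taken, state actually observed afterwards)
Transition = Tuple[State, Action, State]


def verify_model(step: Callable[[State, Action], State],
                 history: List[Transition]) -> List[int]:
    """Replay logged transitions; return the indices where the model disagrees."""
    mismatches = []
    for i, (before, action, observed_after) in enumerate(history):
        if step(before, action) != observed_after:
            mismatches.append(i)
    return mismatches


# Hypothetical toy environment: an agent on a 1-D track with cells 0..4.
def candidate_step(pos: int, action: str) -> int:
    """Candidate world model that ignores the walls (a deliberate flaw)."""
    return pos + 1 if action == "right" else pos - 1


# Logged observations, including one bump into the right wall at cell 4.
history = [(0, "right", 1), (1, "right", 2), (4, "right", 4), (2, "left", 1)]

failed = verify_model(candidate_step, history)
print(failed)  # indices of transitions the candidate model fails to reproduce
```

A non-empty result tells the agent which observations the current model cannot explain, which is the signal a verifier-driven loop would use to trigger a revision of the model.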
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an initial coding-agent system for ARC-AGI-3 that maintains an executable Python world model, verifies it against prior observations, refactors toward simpler abstractions as a proxy for an MDL-like simplicity bias, and plans through the model before acting. Using a scripted controller and no hand-coded game-specific logic, the system is evaluated on the 25 public ARC-AGI-3 games (with fresh agent instances per playthrough and some run-to-run variability reported). It fully solves 7 games, achieves RHAE >75% on 6 games, and obtains a mean per-game RHAE of 32.58%. The authors position the approach as a preliminary game-general baseline for ARC-AGI-3, while noting that private validation set performance remains to be tested.
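To make "plans through the model before acting" concrete, here is a minimal sketch that searches the executable model instead of the real game. Breadth-first search is an assumption made for illustration; the paper's plan executor may work differently. The names `plan` and `bounded_step` and the 1-D track are hypothetical.

```python
from collections import deque
from typing import Callable, Hashable, List, Optional, Sequence


def plan(step: Callable[[Hashable, str], Hashable],
         is_goal: Callable[[Hashable], bool],
         start: Hashable,
         actions: Sequence[str],
         max_depth: int = 20) -> Optional[List[str]]:
    """Breadth-first search through the model; returns a shortest plan or None."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path
        if len(path) >= max_depth:
            continue  # do not expand beyond the depth limit
        for action in actions:
            nxt = step(state, action)  # simulated, no real game action is spent
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None


def bounded_step(pos: int, action: str) -> int:
    """Hypothetical verified model: a 1-D track with walls at cells 0 and 4."""
    return min(pos + 1, 4) if action == "right" else max(pos - 1, 0)


route = plan(bounded_step, lambda s: s == 4, start=1, actions=("left", "right"))
print(route)
```

Because the search runs entirely inside the model, the number of real actions spent equals the plan length, which is the quantity an action-efficiency metric such as RHAE rewards.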
Significance. If the approach generalizes beyond the public games, this would supply a valuable game-general baseline for ARC-AGI-3 agents that relies on verifier-driven executable world models rather than task-specific engineering. The explicit avoidance of game-specific code and the reporting of variability across fresh instances are concrete strengths that could facilitate future comparisons. The work offers preliminary empirical support for executable world models in this domain, though its impact hinges on demonstrated transfer.
major comments (3)
- [Abstract] Abstract: The assertion that the system 'can serve as a game-general baseline for ARC-AGI-3' because it uses no game-specific code is not yet supported by the evidence. All reported results (7 solved games, 6 with RHAE >75%, mean RHAE 32.58%) are confined to the 25 public games, and the abstract states that private-set performance 'remains to be tested.' This makes the baseline claim aspirational rather than demonstrated and is load-bearing for the paper's positioning.
- [Implementation/Results] Implementation description (throughout, especially Results and Methods sections): The manuscript provides only high-level descriptions of the scripted controller, predefined world-model interfaces, verifier programs, and plan executor, without sufficient code-level or algorithmic detail. This absence hinders reproducibility and makes it difficult to evaluate how the refactoring step functions as a practical MDL proxy or to diagnose sources of the observed run-to-run variability.
- [Results] Results section: While the paper notes multiple independent playthroughs for a few games to illustrate variability, the majority of the 25 games have only a single recorded playthrough. A per-game breakdown of failure modes on the 18 unsolved games, or at least aggregate error analysis, is needed to substantiate the claim of 'preliminary evidence' that the approach is promising.
minor comments (2)
- [Abstract] Abstract: The mean RHAE of 32.58% would be more informative if accompanied by a measure of spread (e.g., standard deviation or per-game range) given the noted run-to-run variability.
- [Results] The paper would benefit from a short table or figure summarizing per-game outcomes (solved/unsolved, RHAE values) to allow readers to assess consistency across the 25 public games.
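The spread statistics and per-game summary requested in the minor comments could be produced along the following lines. This is a sketch only: the game names and RHAE values are placeholders invented for illustration, not results from the paper, and `summarize_rhae` is a hypothetical helper.

```python
import statistics
from typing import Dict


def summarize_rhae(per_game_rhae: Dict[str, float]) -> Dict[str, float]:
    """Mean, sample standard deviation, range, and count above 75% for per-game RHAE."""
    values = list(per_game_rhae.values())
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
        "count_above_75": sum(v > 75.0 for v in values),
    }


# Placeholder values (percent), NOT taken from the paper.
placeholder = {"game_a": 100.0, "game_b": 80.0, "game_c": 0.0, "game_d": 20.0}

summary = summarize_rhae(placeholder)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With a bimodal distribution like the placeholder one (some games solved efficiently, others not at all), the standard deviation is close to the mean itself, which is why a mean reported alone can mislead.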
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each of the major comments point by point below, agreeing to revisions where they strengthen the manuscript's clarity and rigor.
- Referee: The assertion that the system 'can serve as a game-general baseline for ARC-AGI-3' because it uses no game-specific code is not yet supported by the evidence. All reported results are confined to the 25 public games, and the abstract states that private-set performance 'remains to be tested.' This makes the baseline claim aspirational rather than demonstrated.
  Authors: We concur that the baseline claim is preliminary and aspirational at this stage. The manuscript emphasizes the lack of game-specific code to support its potential as a general baseline, but we recognize that private set results are necessary for stronger claims. We will update the abstract to read: 'Because the system uses no game-specific code, it provides a preliminary game-general baseline for ARC-AGI-3. Performance on the private validation set remains to be tested.' This adjustment aligns the positioning more closely with the presented evidence.
  revision: yes
- Referee: The manuscript provides only high-level descriptions of the scripted controller, predefined world-model interfaces, verifier programs, and plan executor, without sufficient code-level or algorithmic detail. This absence hinders reproducibility and makes it difficult to evaluate how the refactoring step functions as a practical MDL proxy or to diagnose sources of the observed run-to-run variability.
  Authors: This observation is accurate and highlights an area for improvement. The original manuscript prioritizes the conceptual overview. In the revision, we will enhance the Methods and Implementation sections with additional algorithmic specifics, including a step-by-step description of the refactoring procedure and its relation to MDL principles, as well as factors contributing to variability such as stochastic LLM outputs and independent agent initializations. Pseudocode for key components will be included to aid reproducibility.
  revision: yes
- Referee: While the paper notes multiple independent playthroughs for a few games to illustrate variability, the majority of the 25 games have only a single recorded playthrough. A per-game breakdown of failure modes on the 18 unsolved games, or at least aggregate error analysis, is needed to substantiate the claim of 'preliminary evidence' that the approach is promising.
  Authors: We accept this point. Due to resource limitations, only a subset of games received multiple runs. We will incorporate an aggregate analysis of failure modes across the unsolved games in the revised Results section, grouping issues such as world-model verification failures, planning errors, and execution discrepancies. This will provide better support for the preliminary evidence claim without requiring extensive new experiments.
  revision: partial
Circularity Check
No significant circularity; results are direct empirical measurements
The paper reports empirical outcomes (7 solved games, RHAE statistics) from running a described coding-agent system on the 25 public ARC-AGI-3 games. No equations, fitted parameters, or derivations are presented that reduce any reported quantity to a prior fit or self-defined input by construction. The claim of serving as a game-general baseline rests on the absence of hand-coded game-specific logic, which is an architectural property rather than a tautological redefinition of the performance numbers. Private-set performance is explicitly noted as untested, but this is a limitation on generalization, not a circular reduction in the reported public results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Executable Python code can accurately model the dynamics of ARC-AGI-3 games from observations
- ad hoc to paper: Refactoring toward simpler abstractions serves as a practical proxy for an MDL-like simplicity bias
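One way to picture the second axiom is to score two behaviorally equivalent models by a crude description-length proxy and prefer the shorter one. The sketch below uses the AST node count of the source as that proxy; this measure is an assumption made for illustration and is not claimed to be what the paper uses. The helpers `description_length` and `load_step` and both toy models are hypothetical.

```python
import ast
from itertools import product


def description_length(source: str) -> int:
    """Crude simplicity proxy: number of AST nodes in the model's source."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def load_step(source: str):
    """Execute model source in a fresh namespace and return its `step` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["step"]


# Model that memorizes each observed case separately.
special_cased = '''
def step(pos, action):
    if action == "right" and pos == 0:
        return 1
    if action == "right" and pos == 1:
        return 2
    if action == "right" and pos == 2:
        return 3
    if action == "right" and pos == 3:
        return 4
    if action == "right" and pos == 4:
        return 4
    return max(pos - 1, 0)
'''

# Refactored model: the special cases are replaced by one rule.
refactored = '''
def step(pos, action):
    if action == "right":
        return min(pos + 1, 4)
    return max(pos - 1, 0)
'''

old_step, new_step = load_step(special_cased), load_step(refactored)
# The refactor is only acceptable if behavior is unchanged on all known cases.
equivalent = all(old_step(p, a) == new_step(p, a)
                 for p, a in product(range(5), ("left", "right")))
shorter = description_length(refactored) < description_length(special_cased)
print(equivalent, shorter)
```

The two checks mirror the trade-off in the axiom: the verifier guards fidelity to observations, while the length proxy pushes toward the more general rule among models that pass.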
Lean theorems connected to this paper
- Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "the agent is repeatedly asked to refactor its executable model, replacing special cases with simpler abstractions ... a practical proxy for an MDL-like simplicity bias"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.