Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Sergey Rodionov

arxiv: 2605.05138 · v2 · pith:2X6LUKKGnew · submitted 2026-05-06 · 💻 cs.AI

Pith reviewed 2026-05-08 17:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords ARC-AGI-3, executable world models, coding agents, verifier-driven planning, refactoring, simplicity bias, game-general baseline, AI agents

The pith

A coding-agent system using executable Python world models solves 7 of 25 public ARC-AGI-3 games with no game-specific code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates a system in which an agent builds and maintains an executable Python world model of each game, checks it against observed outcomes, refactors the model toward simpler abstractions, and then plans actions by executing possible sequences inside that model. The entire setup relies on a scripted controller, fixed interfaces, and a verifier program but never receives hand-written logic tailored to any individual game. On the 25 public games the agent completed 7 fully, reached more than 75 percent relative human action efficiency on 6 others, and recorded an average efficiency of 32.58 percent across all games. A reader would care because ARC-AGI-3 is designed to test whether an agent can handle many different tasks without per-task engineering; a single general method that already reaches these numbers supplies a concrete starting point for further development rather than a collection of isolated solutions.
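The build, verify, and plan cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the 1-D track game, the `step`, `verify`, and `plan` names, and the breadth-first planner are all hypothetical stand-ins chosen for brevity.

```python
from collections import deque
from typing import Callable, Iterable, Optional

# Hypothetical toy game: the state is a position on a 1-D track of length 5.
# None of these names come from the paper; they only illustrate the loop.
State = int
Action = str
Transition = tuple[State, Action, State]  # (state, action, observed next state)


def step(state: State, action: Action) -> State:
    """Candidate world model: predict the next state for an action."""
    delta = {"left": -1, "right": 1}[action]
    return min(4, max(0, state + delta))  # walls clamp the position


def verify(model: Callable[[State, Action], State],
           history: Iterable[Transition]) -> list[Transition]:
    """Replay every observed transition; return the ones the model gets wrong."""
    return [(s, a, s2) for s, a, s2 in history if model(s, a) != s2]


def plan(model: Callable[[State, Action], State], start: State,
         is_goal: Callable[[State], bool],
         actions: tuple[Action, ...] = ("left", "right"),
         max_depth: int = 20) -> Optional[list[Action]]:
    """Breadth-first search inside the model; returns a shortest action list."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if is_goal(state):
            return path
        if len(path) >= max_depth:
            continue
        for action in actions:
            nxt = model(state, action)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None


# Only plan once the model reproduces everything observed so far.
history = [(0, "right", 1), (1, "right", 2), (0, "left", 0)]
mismatches = verify(step, history)
best_plan = plan(step, start=2, is_goal=lambda s: s == 4) if not mismatches else None
```

In this sketch a non-empty `mismatches` list would be the signal to revise the model before acting, which mirrors the "check it against observed outcomes" step in spirit only.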

Core claim

The agent fully solved 7 games, achieved a Relative Human Action Efficiency greater than 75% on 6 games, and obtained a mean per-game RHAE of 32.58%. Because the system uses no game-specific code, it can serve as a game-general baseline for ARC-AGI-3. Performance on the private validation set remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents.

What carries the argument

Verifier-driven executable Python world models that are refactored toward simpler abstractions as a practical proxy for minimum-description-length simplicity bias, then used for planning before action.
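One way to picture refactoring as a simplicity proxy is to accept a rewritten model only when it still passes the verifier and is measurably smaller. The sketch below is an assumption-laden illustration: the paper describes the agent being asked to refactor, not a scoring rule, so the AST node count used here as a stand-in for description length, and every function and constant name, are hypothetical.

```python
import ast
from typing import Callable, Sequence

Transition = tuple[int, str, int]  # (state, action, observed next state)


def description_length(source: str) -> int:
    """Assumed proxy for model complexity: number of AST nodes in the source."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def load_step(source: str) -> Callable[[int, str], int]:
    """Execute model source and return its `step` function (trusted input only)."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["step"]


def passes_verifier(source: str, history: Sequence[Transition]) -> bool:
    """True when the model reproduces every observed transition."""
    step = load_step(source)
    return all(step(s, a) == s2 for s, a, s2 in history)


def accept_refactor(current: str, candidate: str,
                    history: Sequence[Transition]) -> bool:
    """Keep a refactor only if it stays correct and gets simpler."""
    return (passes_verifier(candidate, history)
            and description_length(candidate) < description_length(current))


# Hypothetical model that memorizes individual cases.
SPECIAL_CASED = '''
def step(state, action):
    if state == 0 and action == "right":
        return 1
    if state == 1 and action == "right":
        return 2
    if state == 2 and action == "right":
        return 3
    if state == 3 and action == "right":
        return 4
    if state == 1 and action == "left":
        return 0
    if state == 2 and action == "left":
        return 1
    if state == 3 and action == "left":
        return 2
    if state == 4 and action == "left":
        return 3
    return state
'''

# Hypothetical refactor that replaces the cases with one rule.
REFACTORED = '''
def step(state, action):
    delta = {"left": -1, "right": 1}[action]
    return min(4, max(0, state + delta))
'''

HISTORY = [(0, "right", 1), (1, "right", 2), (0, "left", 0),
           (4, "right", 4), (3, "left", 2)]
accepted = accept_refactor(SPECIAL_CASED, REFACTORED, HISTORY)
```

The verifier check comes first on purpose: a model that is shorter but wrong (for example, one that always returns the current state) is rejected regardless of its size.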

Load-bearing premise

That verifier-driven executable Python world models with refactoring as a proxy for simplicity bias will generalize to the private validation set and other unseen tasks without requiring game-specific logic.

What would settle it

Measuring the same agent's success rate and relative efficiency on the private validation set of ARC-AGI-3 games; low scores there would show the public results do not extend to held-out tasks.

Original abstract

We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. The agent-facing prompts, workspace, and controller contain no game-specific code, game-specific prompts, hand-coded heuristics, hidden solutions, or other game-specific information; the same agent and prompts are used across games. Because the coding agent has broad system access, we audit unintended information channels, describe earlier vulnerable harnesses, and explain how the current harness closes observed leakage channels while reducing benchmark-specific information exposure. We report results on the 25 public ARC-AGI-3 games. Each playthrough starts from a fresh agent instance and clean workspace, with no access to files or conversation state from earlier playthroughs. With GPT-5.5 high reasoning effort, the agent fully solved 15 games and achieved a mean per-game RHAE of 58.12%. With GPT-5.4 high reasoning effort, it fully solved 8 games and achieved a mean per-game RHAE of 41.29%. Performance on the private validation set, which is not yet available to us, remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents. Full run artifacts are released with the code at https://github.com/astroseger/arc-3-agents-baseline1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives concrete public-set numbers for a no-game-specific executable Python world model agent on ARC-AGI-3 but leaves the generalization claim untested.

The paper evaluates a scripted coding agent that builds and maintains an executable Python world model for ARC-AGI-3 games. The agent verifies the model against observations, refactors it for simpler abstractions, and plans actions through the model. It uses no hand-coded game logic and starts each playthrough with a fresh instance. On the 25 public games it solves 7 fully, reaches RHAE above 75% on 6, and averages 32.58% RHAE per game, with some runs repeated to show variability.

Referee Report

3 major / 2 minor

Summary. The paper presents an initial coding-agent system for ARC-AGI-3 that maintains an executable Python world model, verifies it against prior observations, refactors toward simpler abstractions as a proxy for an MDL-like simplicity bias, and plans through the model before acting. Using a scripted controller and no hand-coded game-specific logic, the system is evaluated on the 25 public ARC-AGI-3 games (with fresh agent instances per playthrough and some run-to-run variability reported). It fully solves 7 games, achieves RHAE >75% on 6 games, and obtains a mean per-game RHAE of 32.58%. The authors position the approach as a preliminary game-general baseline for ARC-AGI-3, while noting that private validation set performance remains to be tested.

Significance. If the approach generalizes beyond the public games, this would supply a valuable game-general baseline for ARC-AGI-3 agents that relies on verifier-driven executable world models rather than task-specific engineering. The explicit avoidance of game-specific code and the reporting of variability across fresh instances are concrete strengths that could facilitate future comparisons. The work offers preliminary empirical support for executable world models in this domain, though its impact hinges on demonstrated transfer.

major comments (3)

[Abstract] Abstract: The assertion that the system 'can serve as a game-general baseline for ARC-AGI-3' because it uses no game-specific code is not yet supported by the evidence. All reported results (7 solved games, 6 with RHAE >75%, mean RHAE 32.58%) are confined to the 25 public games, and the abstract states that private-set performance 'remains to be tested.' This makes the baseline claim aspirational rather than demonstrated and is load-bearing for the paper's positioning.
[Implementation/Results] Implementation description (throughout, especially Results and Methods sections): The manuscript provides only high-level descriptions of the scripted controller, predefined world-model interfaces, verifier programs, and plan executor, without sufficient code-level or algorithmic detail. This absence hinders reproducibility and makes it difficult to evaluate how the refactoring step functions as a practical MDL proxy or to diagnose sources of the observed run-to-run variability.
[Results] Results section: While the paper notes multiple independent playthroughs for a few games to illustrate variability, the majority of the 25 games have only a single recorded playthrough. A per-game breakdown of failure modes on the 18 unsolved games, or at least aggregate error analysis, is needed to substantiate the claim of 'preliminary evidence' that the approach is promising.

minor comments (2)

[Abstract] Abstract: The mean RHAE of 32.58% would be more informative if accompanied by a measure of spread (e.g., standard deviation or per-game range) given the noted run-to-run variability.
[Results] The paper would benefit from a short table or figure summarizing per-game outcomes (solved/unsolved, RHAE values) to allow readers to assess consistency across the 25 public games.
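The two minor comments ask for a measure of spread and a per-game summary. A minimal sketch of how such a summary could be computed is shown here. The game ids and RHAE values are placeholders invented for illustration and are not results from the paper; the `GameResult`, `summarize`, and `format_table` names are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import NamedTuple


class GameResult(NamedTuple):
    game_id: str
    solved: bool
    rhae: float  # Relative Human Action Efficiency, in percent


# Placeholder data only: these are NOT the paper's per-game numbers.
RESULTS = [
    GameResult("game_a", True, 92.0),
    GameResult("game_b", False, 80.0),
    GameResult("game_c", False, 12.5),
    GameResult("game_d", False, 0.0),
]


def summarize(results: list[GameResult]) -> dict:
    """Mean per-game RHAE plus the spread the referee asks for."""
    scores = [r.rhae for r in results]
    return {
        "games": len(scores),
        "solved": sum(r.solved for r in results),
        "mean_rhae": mean(scores),
        "std_rhae": pstdev(scores),  # population std dev over the games shown
        "min_rhae": min(scores),
        "max_rhae": max(scores),
    }


def format_table(results: list[GameResult]) -> str:
    """Plain-text per-game table: id, solved flag, RHAE."""
    lines = [f"{'game':<8} {'solved':<6} {'RHAE %':>7}"]
    for r in results:
        flag = "yes" if r.solved else "no"
        lines.append(f"{r.game_id:<8} {flag:<6} {r.rhae:>7.2f}")
    return "\n".join(lines)


summary = summarize(RESULTS)
table = format_table(RESULTS)
```

Reporting the minimum and maximum next to the mean makes it visible when a mean is driven by a few high-scoring games and many near-zero ones.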

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each of the major comments point by point below, agreeing to revisions where they strengthen the manuscript's clarity and rigor.

Referee: The assertion that the system 'can serve as a game-general baseline for ARC-AGI-3' because it uses no game-specific code is not yet supported by the evidence. All reported results are confined to the 25 public games, and the abstract states that private-set performance 'remains to be tested.' This makes the baseline claim aspirational rather than demonstrated.

Authors: We concur that the baseline claim is preliminary and aspirational at this stage. The manuscript emphasizes the lack of game-specific code to support its potential as a general baseline, but we recognize that private set results are necessary for stronger claims. We will update the abstract to read: 'Because the system uses no game-specific code, it provides a preliminary game-general baseline for ARC-AGI-3. Performance on the private validation set remains to be tested.' This adjustment aligns the positioning more closely with the presented evidence.

revision: yes

Referee: The manuscript provides only high-level descriptions of the scripted controller, predefined world-model interfaces, verifier programs, and plan executor, without sufficient code-level or algorithmic detail. This absence hinders reproducibility and makes it difficult to evaluate how the refactoring step functions as a practical MDL proxy or to diagnose sources of the observed run-to-run variability.

Authors: This observation is accurate and highlights an area for improvement. The original manuscript prioritizes the conceptual overview. In the revision, we will enhance the Methods and Implementation sections with additional algorithmic specifics, including a step-by-step description of the refactoring procedure and its relation to MDL principles, as well as factors contributing to variability such as stochastic LLM outputs and independent agent initializations. Pseudocode for key components will be included to aid reproducibility.

revision: yes

Referee: While the paper notes multiple independent playthroughs for a few games to illustrate variability, the majority of the 25 games have only a single recorded playthrough. A per-game breakdown of failure modes on the 18 unsolved games, or at least aggregate error analysis, is needed to substantiate the claim of 'preliminary evidence' that the approach is promising.

Authors: We accept this point. Due to resource limitations, only a subset of games received multiple runs. We will incorporate an aggregate analysis of failure modes across the unsolved games in the revised Results section, grouping issues such as world-model verification failures, planning errors, and execution discrepancies. This will provide better support for the preliminary evidence claim without requiring extensive new experiments.

revision: partial
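The aggregate failure-mode analysis proposed in the last response amounts to a tally over labelled runs. A small sketch follows, assuming each unsolved run has been hand-labelled with one of the three categories named in the response. The game ids and labels are invented placeholders, and `tally_failures` is a hypothetical helper.

```python
from collections import Counter

# Categories taken from the authors' response; the identifiers are assumptions.
FAILURE_MODES = ("verification_failure", "planning_error", "execution_discrepancy")

# Placeholder labels, not data from the paper.
unsolved_runs = [
    ("game_c", "planning_error"),
    ("game_d", "verification_failure"),
    ("game_e", "planning_error"),
]


def tally_failures(runs: list[tuple[str, str]]) -> dict[str, int]:
    """Count unsolved runs per failure mode, keeping zero-count modes visible."""
    counts: Counter = Counter()
    for game_id, mode in runs:
        if mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode for {game_id}: {mode}")
        counts[mode] += 1
    return {mode: counts[mode] for mode in FAILURE_MODES}


failure_counts = tally_failures(unsolved_runs)
```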

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements

The paper reports empirical outcomes (7 solved games, RHAE statistics) from running a described coding-agent system on the 25 public ARC-AGI-3 games. No equations, fitted parameters, or derivations are presented that reduce any reported quantity to a prior fit or self-defined input by construction. The claim of serving as a game-general baseline rests on the absence of hand-coded game-specific logic, which is an architectural property rather than a tautological redefinition of the performance numbers. Private-set performance is explicitly noted as untested, but this is a limitation on generalization, not a circular reduction in the reported public results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that ARC-AGI-3 games are representable and verifiable via executable Python code and that refactoring provides a workable proxy for simplicity bias; these are domain assumptions rather than derived results.

axioms (2)

domain assumption: Executable Python code can accurately model the dynamics of ARC-AGI-3 games from observations.
Invoked when the agent maintains and verifies world models against previous observations.

ad hoc to paper: Refactoring toward simpler abstractions serves as a practical proxy for an MDL-like simplicity bias.
Explicitly stated in the abstract as the mechanism for improving the world model.

pith-pipeline@v0.9.0 · 5523 in / 1435 out tokens · 64257 ms · 2026-05-08T17:47:48.925371+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation · washburn_uniqueness_aczel · unclear

Paper passage: "the agent is repeatedly asked to refactor its executable model, replacing special cases with simpler abstractions ... a practical proxy for an MDL-like simplicity bias"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.