Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Sergey Rodionov

A coding agent maintains executable Python world models, verifies them against observations, and refactors for simplicity to solve ARC-AGI-3 games without any game-specific code or prompts.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 23:27 UTC pith:2X6LUKKG

Load-bearing objection: A coding agent maintains and refactors executable Python world models for ARC-AGI-3, solving 15 of 25 public games with the same prompts and audited harness, but private-set results and model-fidelity checks are still missing.

arxiv 2605.05138 v2 pith:2X6LUKKG submitted 2026-05-06 cs.AI

classification cs.AI

keywords ARC-AGI-3 · executable world models · coding agents · verifier programs · planning · Python models · game solving · AI agents

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates whether a single agent can build and use executable world models to handle diverse unseen games. The agent observes gameplay, checks its Python model against past moves via verifier programs, simplifies the code as a stand-in for preferring compact descriptions, and then plans actions inside that model. The same agent setup runs on all 25 public games, starting fresh each time with no prior game knowledge or leaked information. The agent fully solves 15 games with GPT-5.5 and 8 with GPT-5.4 (both at high reasoning effort), with mean per-game RHAE of 58.12 percent and 41.29 percent respectively. This supplies initial support for verifier-driven executable models as a workable strategy for this benchmark.

Core claim

The agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions, and plans through the model before acting. The system uses a scripted controller and predefined interfaces but contains no game-specific code, prompts, or heuristics, and the same agent configuration and prompts are used across all games, with a fresh instance for each playthrough. On the 25 public ARC-AGI-3 games, it fully solved 15 with GPT-5.5 at high reasoning effort (mean per-game RHAE 58.12 percent) and 8 with GPT-5.4 at high reasoning effort (mean per-game RHAE 41.29 percent). The results supply preliminary evidence that verifier-driven executable world models form a promising approach for ARC-AGI-3 agents.

What carries the argument

The verifier-driven executable world model: a Python program the agent builds from observations, checks with separate verifier code, refactors toward simpler form, and then queries to generate action plans.
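
As a rough illustration of that loop, the sketch below replays recorded transitions through a candidate model and reports the ones it fails to reproduce. It is not taken from the paper's released code: the `Transition` record, the `verify` function, and the one-dimensional corridor game are all hypothetical names and data chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List

State = Hashable
StepFn = Callable[[State, str], State]


@dataclass(frozen=True)
class Transition:
    """One observed (state, action, next_state) triple. Hypothetical record type."""
    state: State
    action: str
    next_state: State


def verify(model_step: StepFn, history: List[Transition]) -> List[int]:
    """Return indices of observed transitions the candidate model gets wrong."""
    return [
        i for i, t in enumerate(history)
        if model_step(t.state, t.action) != t.next_state
    ]


# Candidate world model for a toy corridor with positions 0..4 (assumed game).
def corridor_step(pos: int, action: str) -> int:
    delta = {"left": -1, "right": 1}.get(action, 0)
    return min(4, max(0, pos + delta))


# A competing candidate that forgets the walls.
def unbounded_step(pos: int, action: str) -> int:
    return pos + {"left": -1, "right": 1}.get(action, 0)


history = [
    Transition(0, "right", 1),
    Transition(1, "right", 2),
    Transition(2, "left", 1),
    Transition(0, "left", 0),  # the wall blocks the move
]

good_mismatches = verify(corridor_step, history)
bad_mismatches = verify(unbounded_step, history)
print(good_mismatches, bad_mismatches)
```

In this toy setting the bounded model reproduces every recorded transition, while the unbounded one is caught only because the history happens to contain a wall collision.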

Load-bearing premise

An agent can keep its executable world models accurate across many different unseen games by watching, verifying, and refactoring alone, without building in errors that break later planning.

What would settle it

Running the same agent on the private ARC-AGI-3 validation set and checking whether the solve rate and mean RHAE remain comparable to the public-set numbers.

If this is right

The same agent and prompts can be reused on new games without any per-game modifications.
A stronger underlying language model (GPT-5.5 versus GPT-5.4, both at high reasoning effort) increases the number of fully solved games and the mean score.
Releasing full run artifacts and code allows direct reproduction and extension by others.
Performance on the private validation set will determine whether the approach generalizes beyond the public games.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The refactoring step may serve as a practical way to inject a simplicity preference into model construction without explicit minimum-description-length calculations.
The method could extend to other interactive environments where rules are expressible in executable code, provided observation and verification remain reliable.
Auditing and closing information-leakage channels in the harness shows one concrete way to reduce benchmark contamination when agents have broad system access.
Combining the world-model approach with additional search or ensemble techniques might raise solve rates further on harder instances.
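
The simplicity preference mentioned in the first extension can be pictured with a crude size proxy. The sketch below is an assumption-laden illustration, not the paper's refactoring procedure: it counts AST nodes as a stand-in for description length and picks the smaller of two behaviourally identical candidate models. The function names and both candidate sources are hypothetical.

```python
import ast


def description_length(source: str) -> int:
    """Crude size proxy (an assumption): number of AST nodes in the source."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


VERBOSE = '''
def step(pos, action):
    if action == "right":
        if pos < 4:
            return pos + 1
        return pos
    if action == "left":
        if pos > 0:
            return pos - 1
        return pos
    return pos
'''

COMPACT = '''
def step(pos, action):
    delta = {"left": -1, "right": 1}.get(action, 0)
    return min(4, max(0, pos + delta))
'''


def load_step(source: str):
    """Execute candidate source and return its `step` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["step"]


def behaviourally_equal(a, b) -> bool:
    """Compare two step functions on every state/action pair of the toy game."""
    return all(
        a(pos, act) == b(pos, act)
        for pos in range(5)
        for act in ("left", "right", "noop")
    )


candidates = {"verbose": VERBOSE, "compact": COMPACT}
same_behaviour = behaviourally_equal(load_step(VERBOSE), load_step(COMPACT))
preferred = min(candidates, key=lambda name: description_length(candidates[name]))
print(same_behaviour, preferred)
```

Node count is only one possible proxy; token count or compressed source size would serve the same illustrative purpose.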

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

A coding agent maintains and refactors executable Python world models for ARC-AGI-3, solving 15 of 25 public games with the same prompts and audited harness, but private-set results and model-fidelity checks are still missing.

The core result is a working baseline where the agent builds, verifies against observations, and simplifies its own Python world models before planning. It uses a scripted controller and game-agnostic prompts, runs fresh instances each time, and reports 15 full solves plus 58% mean RHAE with GPT-5.5 on the public set.

The audited harness and full artifact release are useful. They close obvious leakage paths and let others reproduce the exact runs. The decision to keep the controller and verifiers predefined while letting the LLM handle only model construction and refactoring is a clean split that avoids hand-coded game logic.

The main gaps are the lack of ablations, error bars, or failure-mode analysis in the reported numbers. The stress-test concern about correlated LLM errors surviving verification also lands: nothing in the abstract shows that the final models match the true mechanics rather than just the observed traces. Private-set performance is still unknown.

People working on agent loops that combine code synthesis with planning will find the concrete setup and released code worth examining. It is grounded enough and ships enough artifacts to merit a serious referee, even though the current evidence is preliminary and the generalization claim needs more tests.

Referee Report

1 major / 2 minor

Summary. The paper presents a coding-agent system for ARC-AGI-3 that maintains executable Python world models, verifies them against observations, refactors toward simpler abstractions, and plans through the model before acting. The system uses a scripted controller, predefined interfaces, and verifier programs with no game-specific code, prompts, or heuristics; the same agent configuration and prompts are used across all games. On the 25 public ARC-AGI-3 games, with a fresh agent instance per playthrough, GPT-5.5 (high reasoning effort) fully solves 15 games (mean per-game RHAE 58.12%) while GPT-5.4 (high reasoning effort) solves 8 (mean per-game RHAE 41.29%). The manuscript audits information leakage channels, releases full run artifacts and code, and concludes that the results supply preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents. Private-set performance is noted as future work.
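
To make "plans through the model before acting" concrete, here is a minimal sketch that runs breadth-first search over a model's step function and returns an action sequence. The paper's plan executor may work quite differently; the `plan` function, the 3x3 grid, and its wall layout are hypothetical and exist only for this example.

```python
from collections import deque
from typing import Callable, Hashable, List, Optional, Sequence

# Toy 3x3 grid world (assumed for illustration); "up" increases y.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
WALLS = {(1, 0), (1, 1)}
SIZE = 3


def grid_step(state, action):
    """World model: move one cell unless blocked by a wall or the border."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    return nxt if inside and nxt not in WALLS else state


def plan(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    step: Callable[[Hashable, str], Hashable],
    actions: Sequence[str],
    max_depth: int = 50,
) -> Optional[List[str]]:
    """Breadth-first search inside the model; returns a shortest action list."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path
        if len(path) >= max_depth:
            continue
        for action in actions:
            nxt = step(state, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None


route = plan((0, 0), lambda s: s == (2, 0), grid_step, list(ACTIONS))
print(route)
```

The plan is only as good as the model it searches: if `grid_step` misplaced a wall, the returned route would fail when executed in the real game, which is the fidelity concern raised in the major comment.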

Significance. If the reported solve rates and RHAE values hold under the stated conditions, the work supplies concrete empirical support for an approach that lets LLM-based agents synthesize and use executable models for planning without hand-engineered game logic. The explicit release of code, artifacts, and the audit of leakage channels are strengths that enable direct replication and extension. The results are preliminary (public games only) but address a core ARC-AGI challenge of generalization across novel tasks.

major comments (1)

[Abstract and §4 (results)] The central claim of 'preliminary evidence' for verifier-driven models rests on the assumption that LLM-mediated model construction produces faithful executables rather than merely trajectory-consistent ones. The manuscript reports solve counts and RHAE but supplies no quantitative checks (e.g., held-out observation prediction error, model-vs-ground-truth divergence on unseen states, or systematic error analysis) that would rule out correlated misunderstandings surviving the verification step.
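
One of the checks named here, held-out observation prediction error, could look roughly like the sketch below. It is an illustration under stated assumptions rather than a protocol from the paper: the corridor game, both candidate models, and the observed/held-out split are invented for the example.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, str, int]  # (state, action, next_state)


def prediction_error(
    model_step: Callable[[int, str], int], transitions: List[Transition]
) -> float:
    """Fraction of transitions whose next state the model predicts wrongly."""
    if not transitions:
        return 0.0
    wrong = sum(1 for s, a, n in transitions if model_step(s, a) != n)
    return wrong / len(transitions)


def _delta(action: str) -> int:
    return {"left": -1, "right": 1}.get(action, 0)


# Faithful model of the assumed ground truth: a corridor with walls at 0 and 4.
def bounded_step(pos: int, action: str) -> int:
    return min(4, max(0, pos + _delta(action)))


# Trajectory-consistent but unfaithful model: it ignores the walls.
def unbounded_step(pos: int, action: str) -> int:
    return pos + _delta(action)


# Observed play never touched a wall, so both models pass verification.
observed = [(1, "right", 2), (2, "right", 3), (3, "left", 2)]
# Held-out transitions include wall collisions the agent never saw.
held_out = [(4, "right", 4), (0, "left", 0), (2, "right", 3), (3, "left", 2)]

errors = {
    name: (prediction_error(fn, observed), prediction_error(fn, held_out))
    for name, fn in [("bounded", bounded_step), ("unbounded", unbounded_step)]
}
print(errors)
```

Both candidates score zero error on the observed prefix; only the held-out set separates them, which is the gap between trajectory consistency and faithfulness that the comment describes.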

minor comments (2)

[Abstract] The acronym RHAE is used without expansion on first occurrence; a parenthetical definition or reference to its definition in the main text would improve clarity.
[Artifacts] The manuscript states that full run artifacts are released at the GitHub link; confirming that the released logs include per-step model versions and verification outcomes would further support reproducibility claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the distinction between trajectory-consistent and faithful world models. We address the comment below and note that the current results are framed as preliminary evidence based on end-to-end task performance.

Referee: [Abstract and §4 (results)] The central claim of 'preliminary evidence' for verifier-driven models rests on the assumption that LLM-mediated model construction produces faithful executables rather than merely trajectory-consistent ones. The manuscript reports solve counts and RHAE but supplies no quantitative checks (e.g., held-out observation prediction error, model-vs-ground-truth divergence on unseen states, or systematic error analysis) that would rule out correlated misunderstandings surviving the verification step.

Authors: We agree that the manuscript does not provide quantitative checks such as held-out prediction error or divergence on unseen states. The verification procedure ensures consistency with all observed trajectories up to the current timestep, and successful planning through the model is required to solve the games, but this does not formally rule out correlated errors that happen to be consistent with the observed data. We will revise §4 to explicitly acknowledge this limitation, add a brief error analysis of the final models on the solved games (e.g., number of states where the model diverges from ground-truth dynamics when probed with additional synthetic inputs), and clarify that the 'preliminary evidence' claim rests on functional utility rather than exhaustive faithfulness verification. Systematic held-out testing across all games would require a more extensive experimental protocol that is outside the scope of the current work.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical system evaluation on public benchmarks

The paper describes an agent architecture and reports direct empirical measurements (solve counts and RHAE) on the 25 public ARC-AGI-3 games using fixed prompts and game-agnostic harnesses. No equations, fitted parameters, predictions, or derivations appear in the provided text; the performance figures are measured outcomes rather than values derived from model-internal quantities. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify core claims. The evaluation is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical systems evaluation; the abstract introduces no explicit mathematical axioms, free parameters, or new postulated entities beyond standard assumptions that code execution can simulate game dynamics and that verification against observations is feasible.

pith-pipeline@v0.9.1-grok · 5845 in / 1155 out tokens · 20222 ms · 2026-06-30T23:27:49.399615+00:00 · methodology

Original abstract

We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. The agent-facing prompts, workspace, and controller contain no game-specific code, game-specific prompts, hand-coded heuristics, hidden solutions, or other game-specific information; the same agent and prompts are used across games. Because the coding agent has broad system access, we audit unintended information channels, describe earlier vulnerable harnesses, and explain how the current harness closes observed leakage channels while reducing benchmark-specific information exposure. We report results on the 25 public ARC-AGI-3 games. Each playthrough starts from a fresh agent instance and clean workspace, with no access to files or conversation state from earlier playthroughs. With GPT-5.5 high reasoning effort, the agent fully solved 15 games and achieved a mean per-game RHAE of 58.12%. With GPT-5.4 high reasoning effort, it fully solved 8 games and achieved a mean per-game RHAE of 41.29%. Performance on the private validation set, which is not yet available to us, remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents. Full run artifacts are released with the code at https://github.com/astroseger/arc-3-agents-baseline1.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tycho: Active Abstraction with Programmatic World Models for ARC-AGI-3
cs.AI · 2026-07 · conditional · novelty 6.5
Selective programmatic world modeling (actor-requested builder) yields 100 RHAE on all 183 public ARC-AGI-3 levels, while automatic repair is more transition-exact but weaker at play.

PRO-LONG: Programmatic Memory Enables Long-Horizon Reasoning
cs.AI · 2026-07 · conditional · novelty 5.0
An append-only game log searched by a coding agent with grep/Python matches specialized ARC-AGI-3 harnesses while using 4-6x fewer tokens.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 2 Pith papers

[1] ARC Prize Foundation. ARC-AGI-3: A new challenge for frontier agentic intelligence, 2026.

[2] Nicola Dainese, Matteo Merler, Minttu Alakuijala, and Pekka Marttinen. Generating code world models with large language models guided by Monte Carlo tree search, 2024.

[3] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, and Joshua B. Tenenbaum. DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning, 2020.

[4] Gabriel Grand, Lionel Wong, Maddy Bowers, Theo X. Olausson, Muxin Liu, Joshua B. Tenenbaum, and Jacob Andreas. LILO: Learning interpretable libraries by compressing and documenting code, 2023.

[5] Peter Grünwald. A tutorial introduction to the minimum description length principle, 2004.

[6] David Ha and Jürgen Schmidhuber. World models, 2018.

[7] Ziga Kovacic, Justin T. Chiu, Celine Lee, Wenting Zhao, and Kevin Ellis. Refactoring codebases through library design, 2025.

[8] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. ... doi:10.1126/science.abq1158, 2022.

[9] OpenAI. Codex CLI. https://developers.openai.com/codex/cli, 2026. Accessed 2026-04-30.

[10] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625:468–475, 2024. doi:10.1038/s41586-023-06924-6.

[11] Hao Tang, Darren Key, and Kevin Ellis. WorldCoder, a model-based LLM agent: Building world models by writing code and interacting with the environment, 2024.

[12] Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625:476–482, 2024. doi:10.1038/s41586-023-06747-5.
