ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence
Pith reviewed 2026-05-15 00:04 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
ARC-AGI-3 introduces interactive environments where humans solve every task but current frontier AI systems score below 1 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARC-AGI-3 consists of novel, language-free interactive environments that test an agent's ability to explore, infer implicit goals, construct internal models of environment dynamics, and execute effective action sequences. Human test-takers solve 100 percent of the environments after calibration, while frontier AI systems score below 1 percent. The benchmark evaluates fluid adaptive efficiency on tasks that use only core knowledge priors and avoids any reliance on external knowledge or language.
What carries the argument
ARC-AGI-3 interactive benchmark, a collection of turn-based abstract environments whose difficulty is set by human performance baselines and core-knowledge priors.
If this is right
- AI progress can be tracked by measuring how close agents come to human action efficiency on these novel tasks.
- Success requires agents to perform goal inference and dynamic modeling without explicit training signals.
- The efficiency-based scoring allows direct numerical comparison between AI agents and human baselines.
- Passing the benchmark would demonstrate fluid intelligence on tasks that avoid language and memorized knowledge.
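The efficiency-based comparison in the points above can be sketched in code. The paper does not publish its exact scoring formula, so the normalization below (per-environment ratio of the human action baseline to the agent's action count, capped at 1.0, with unsolved environments scoring zero, averaged over the suite) is a hypothetical illustration only:

```python
def efficiency_score(agent_actions, human_baseline_actions):
    """Hypothetical per-environment score: ratio of the human action
    baseline to the agent's action count, capped at 1.0.
    Unsolved environments (agent_actions is None) score 0.0."""
    if agent_actions is None:  # agent never reached the goal
        return 0.0
    return min(1.0, human_baseline_actions / agent_actions)

def benchmark_score(results, baselines):
    """Mean per-environment efficiency across the whole suite."""
    scores = [efficiency_score(results[env], baselines[env]) for env in baselines]
    return sum(scores) / len(scores)

# Example: the agent solves "a" in 40 actions against a 20-action human
# baseline, matches the human baseline on "b", and fails "c" entirely.
baselines = {"a": 20, "b": 15, "c": 30}
results = {"a": 40, "b": 15, "c": None}
print(benchmark_score(results, baselines))  # 0.5 = (0.5 + 1.0 + 0.0) / 3
```

Under this kind of normalization, a score of 1.0 means human-level action efficiency on every environment, which is what makes the direct numerical comparison possible.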
Where Pith is reading between the lines
- Improvements on ARC-AGI-3 could indicate AI systems that handle uncertainty and novelty more robustly than current methods.
- The benchmark may serve as a template for creating further interactive tests that isolate adaptive reasoning from language use.
- Future versions could add multi-step planning requirements or partial observability to increase the challenge.
Load-bearing premise
The environments remain genuinely novel and language-free for AI systems, even though their difficulty was calibrated solely through human test-takers.
What would settle it
A frontier AI system reaching 50 percent or higher success on the full ARC-AGI-3 suite would show that the benchmark no longer separates current AI capabilities from human performance.
Original abstract
We introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions. Like its predecessors ARC-AGI-1 and 2, ARC-AGI-3 focuses entirely on evaluating fluid adaptive efficiency on novel tasks, while avoiding language and external knowledge. ARC-AGI-3 environments only leverage Core Knowledge priors and are difficulty-calibrated via extensive testing with human test-takers. Our testing shows humans can solve 100% of the environments, in contrast to frontier AI systems which, as of March 2026, score below 1%. In this paper, we present the benchmark design, its efficiency-based scoring framework grounded in human action baselines, and the methodology used to construct, validate, and calibrate the environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ARC-AGI-3, an interactive benchmark of novel abstract turn-based environments that test agentic intelligence via exploration, goal inference, internal model construction, and planning without language or external knowledge. Environments are restricted to Core Knowledge priors and calibrated through human testing; the central empirical claim is that humans solve 100% of tasks while frontier AI systems score below 1% as of March 2026. The manuscript describes the benchmark design, an efficiency-based scoring framework grounded in human action baselines, and the methodology for construction, validation, and calibration.
Significance. If the performance gap is shown to arise from matched protocols rather than evaluation artifacts, the benchmark would provide a valuable language-free test of fluid adaptive efficiency, extending the ARC-AGI series and offering a concrete challenge for agentic capabilities in frontier models.
major comments (2)
- [Methodology / Evaluation Protocol] The description of the AI testing setup omits the precise observation/action interface, episode length limits, number of trials per environment, and the prompting regime supplied to the frontier models. Because the <1% claim is load-bearing for the headline result, any deviation from the human calibration protocol (e.g., richer state representations or additional trials) could render the reported gap non-comparable.
- [Abstract and Results] The 100% human / <1% AI figures are stated without reporting the number of environments, the human sample size, the identity of the specific frontier models tested, or any statistical controls for variance. This absence leaves the central empirical claim without sufficient verifiable support.
minor comments (2)
- [Scoring Framework] The efficiency-based scoring framework would benefit from an explicit equation or pseudocode showing how human action baselines are normalized into the final score.
- [Benchmark Design] Figure captions and environment descriptions could be expanded to clarify the exact turn-based interaction loop for readers unfamiliar with prior ARC-AGI versions.
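For readers unfamiliar with the format, the turn-based interaction loop the second comment refers to can be sketched as follows. The interface names (`reset`, `step`, the observation/done return shape) are assumptions for illustration, not the benchmark's actual API:

```python
from typing import Any, Protocol

class Environment(Protocol):
    """Assumed minimal turn-based environment interface."""
    def reset(self) -> Any: ...                           # initial observation
    def step(self, action: int) -> tuple[Any, bool]: ...  # (observation, done)

def run_episode(env: Environment, agent, max_turns: int = 1000) -> int:
    """Play one turn-based episode; return the number of actions taken,
    or -1 if the (implicit, uninstructed) goal was never reached."""
    obs = env.reset()
    for turn in range(max_turns):
        action = agent.act(obs)        # agent must infer the goal itself
        obs, done = env.step(action)   # no reward signal, no instructions
        agent.observe(obs)             # update internal model of dynamics
        if done:
            return turn + 1
    return -1
```

The action count this loop returns is exactly what an efficiency-based score would compare against the human baseline for the same environment.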
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of the evaluation protocol and empirical results.
Point-by-point responses
-
Referee: Methodology section (evaluation protocol): the description of the AI testing setup omits the precise observation/action interface, episode length limits, number of trials per environment, and prompting regime supplied to frontier models. Because the <1% claim is load-bearing for the headline result, any deviation from the human calibration protocol (e.g., richer state representations or additional trials) could render the reported gap non-comparable.
Authors: We agree that the current description of the AI evaluation protocol is insufficiently detailed for full reproducibility and comparability. The manuscript provides a high-level overview but does not enumerate the exact observation and action spaces, maximum episode lengths, trial counts per environment, or the precise prompting format used with frontier models. In the revised manuscript we will expand the Methodology section to specify these parameters explicitly and to document how they were aligned with the human calibration protocol (including identical state representations and trial limits). This will allow readers to verify that the reported performance gap reflects matched conditions rather than protocol differences. revision: yes
-
Referee: Results and abstract: the 100% human / <1% AI figures are stated without reporting the number of environments, human sample size, identity of the specific frontier models tested, or any statistical controls for variance. This absence leaves the central empirical claim without sufficient verifiable support.
Authors: We acknowledge that the abstract and results sections currently present the headline performance figures without the supporting quantitative details the referee requests. The manuscript states the aggregate outcomes but does not list the total number of environments, the size of the human test-taker cohort, the exact frontier models evaluated as of March 2026, or variance statistics. In the revision we will add these elements: the benchmark size, human sample size and recruitment criteria, the specific model versions tested, and appropriate statistical controls (e.g., per-environment success rates and confidence intervals). These additions will provide the verifiable support needed for the central claim while preserving the existing narrative. revision: yes
Circularity Check
No significant circularity in benchmark construction or performance claims
full rationale
The paper presents ARC-AGI-3 as an empirical benchmark whose environments are constructed and calibrated through separate human testing protocols, with reported human (100%) and AI (<1%) scores arising from distinct evaluation runs rather than any derivation or equation that reduces one to the other by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; references to prior ARC versions serve only as background and do not justify the new calibration or scoring results. The efficiency-based scoring framework is described as grounded in observed human action baselines without evidence that the headline performance gap is mathematically forced by the calibration inputs themselves. The derivation chain is therefore self-contained as an independent empirical report.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: environments only leverage Core Knowledge priors.
- Domain assumption: human test-taker results provide a valid difficulty calibration and efficiency baseline.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "ARC-AGI-3 environments only leverage Core Knowledge priors... efficiency-based scoring framework grounded in human action baselines"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "turn-based... 8 levels... action efficiency"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 10 Pith papers
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
-
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.
-
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Agentick is a new unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents across 37 tasks, showing no single approach dominates.
-
Counting as a minimal probe of language model reliability
Language models have limited stable counting capacity well below context limits and rely on a finite set of count-like internal states, collapsing to guessing once exhausted.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
Reference graph
Works this paper leans on
[1] ARC Prize 2024 Competition. https://arcprize.org/competitions/2024, 2024.
[2] ARC Prize 2025 Competition. https://arcprize.org/competitions/2025, 2025.
[3] ARC Prize Foundation. https://arcprize.org/, 2026. Founders: Mike Knoop, François Chollet. Operations: Bryan Landers, Greg Kamradt.
[4] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On Evaluation of Embodied Navigation Agents, 2018.
[5] ARC Prize Foundation. ARC-AGI Community Leaderboard. https://github.com/arcprize/ARC-AGI-Community-Leaderboard, 2026.
[6] ARC Prize Foundation. ARC-AGI Toolkit. https://github.com/arcprize/ARC-AGI, 2026.
[7] ARC Prize Foundation. Gemini 3 Deep Think Preview Verification on ARC-AGI-2. https://huggingface.co/datasets/arcprize/arc_agi_v2_public_eval, 2026.
[8] François Chollet. On the Measure of Intelligence. https://arxiv.org/abs/1911.01547, 2019.
[9] François Chollet. OpenAI o3 Breakthrough High Score on ARC-AGI-Pub. https://arcprize.org/blog/oai-o3-pub-breakthrough, December 2024.
[10] François Chollet, Katherine Tong, Walter Reade, and Julia Elliott. Abstraction and Reasoning Challenge. https://kaggle.com/competitions/abstraction-and-reasoning-challenge, 2020. Kaggle.
[11] Alexis Fox, Junlin Wang, Paul Rosu, and Bhuwan Dhingra. Hill-Climbing ARC-AGI-3, 2026.
[12] Steve Hsu. Post on LRM automation discovering novel results in quantum physics. https://x.com/hsu_steve/status/1996034522308026435, 2025.
[13] Greg Kamradt. ARC-AGI-3 Preview: 30-Day Learnings. https://arcprize.org/blog/arc-agi-3-preview-30-day-learnings, August 2025.
[14] Samuel Knutsen and Victoria Klein. Arcgentica: ARC-AGI-3 Agent Harness Built on the Agentica SDK. https://github.com/symbolica-ai/ARC-AGI-3-Agents, 2026.
[15] Lab42. ARCathon 2022. https://lab42.global/past-challenges/2022-arcathon/, 2022.
[16] Lab42. ARCathon 2023. https://lab42.global/past-challenges/2023-arcathon/, 2023.
[17] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the..., 2016.
[18] Dries Smit. ARC3 Solution. https://github.com/DriesSmit/ARC3-solution, 2025.
[19] I. Sorokin and Jean-Francois Puget. NVARC Solution to ARC-AGI-2 2025. https://drive.google.com/file/d/1vkEluaaJTzaZiJL69TkZovJUkPSDH5Xc/view, 2025.
[20] Elizabeth S. Spelke and Katherine D. Kinzler. Core Knowledge. Developmental Science, pages 89–96, 2007.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. https://arxiv.org/abs/1706.03762, 2017.
[22] wd13ca. ARC-AGI-3 Agents. https://github.com/wd13ca/ARC-AGI-3-Agents, 2025.
[23] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903, 2022.