ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence
Pith reviewed 2026-05-15 00:04 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
ARC-AGI-3 introduces interactive environments where humans solve every task but current frontier AI systems score below 1 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARC-AGI-3 consists of novel, language-free interactive environments that test an agent's ability to explore, infer implicit goals, construct internal models of environment dynamics, and execute effective action sequences. Human test-takers solve 100 percent of the environments after calibration, while frontier AI systems score below 1 percent. The benchmark evaluates fluid adaptive efficiency on tasks that use only core knowledge priors and avoids any reliance on external knowledge or language.
What carries the argument
ARC-AGI-3 interactive benchmark, a collection of turn-based abstract environments whose difficulty is set by human performance baselines and core-knowledge priors.
If this is right
- AI progress can be tracked by measuring how close agents come to human action efficiency on these novel tasks.
- Success requires agents to perform goal inference and dynamic modeling without explicit training signals.
- The efficiency-based scoring allows direct numerical comparison between AI agents and human baselines.
- Passing the benchmark would demonstrate fluid intelligence on tasks that avoid language and memorized knowledge.
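The efficiency-based comparison in the points above can be sketched in code. The paper does not publish its exact scoring formula, so the normalization below (per-environment ratio of the human action baseline to the agent's action count, capped at 1.0, with unsolved environments scoring zero, averaged over the suite) is a hypothetical illustration only:

```python
def efficiency_score(agent_actions, human_baseline_actions):
    """Hypothetical per-environment score: ratio of the human action
    baseline to the agent's action count, capped at 1.0.
    Unsolved environments (agent_actions is None) score 0.0."""
    if agent_actions is None:  # agent never reached the goal
        return 0.0
    return min(1.0, human_baseline_actions / agent_actions)

def benchmark_score(results, baselines):
    """Mean per-environment efficiency across the whole suite."""
    scores = [efficiency_score(results[env], baselines[env]) for env in baselines]
    return sum(scores) / len(scores)

# Example: the agent solves "a" in 40 actions against a 20-action human
# baseline, matches the human baseline on "b", and fails "c" entirely.
baselines = {"a": 20, "b": 15, "c": 30}
results = {"a": 40, "b": 15, "c": None}
print(benchmark_score(results, baselines))  # 0.5 = (0.5 + 1.0 + 0.0) / 3
```

Under this kind of normalization, a score of 1.0 means human-level action efficiency on every environment, which is what makes the direct numerical comparison possible.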
Where Pith is reading between the lines
- Improvements on ARC-AGI-3 could indicate AI systems that handle uncertainty and novelty more robustly than current methods.
- The benchmark may serve as a template for creating further interactive tests that isolate adaptive reasoning from language use.
- Future versions could add multi-step planning requirements or partial observability to increase the challenge.
Load-bearing premise
The environments remain genuinely novel and language-free for AI systems, even though their difficulty was calibrated solely through human test-takers.
What would settle it
A frontier AI system reaching 50 percent or higher success on the full ARC-AGI-3 suite would show that the benchmark no longer separates current AI capabilities from human performance.
Original abstract
We introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions. Like its predecessors ARC-AGI-1 and 2, ARC-AGI-3 focuses entirely on evaluating fluid adaptive efficiency on novel tasks, while avoiding language and external knowledge. ARC-AGI-3 environments only leverage Core Knowledge priors and are difficulty-calibrated via extensive testing with human test-takers. Our testing shows humans can solve 100% of the environments, in contrast to frontier AI systems which, as of March 2026, score below 1%. In this paper, we present the benchmark design, its efficiency-based scoring framework grounded in human action baselines, and the methodology used to construct, validate, and calibrate the environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ARC-AGI-3, an interactive benchmark of novel abstract turn-based environments that test agentic intelligence via exploration, goal inference, internal model construction, and planning without language or external knowledge. Environments are restricted to Core Knowledge priors and calibrated through human testing; the central empirical claim is that humans solve 100% of tasks while frontier AI systems score below 1% as of March 2026. The manuscript describes the benchmark design, an efficiency-based scoring framework grounded in human action baselines, and the methodology for construction, validation, and calibration.
Significance. If the performance gap is shown to arise from matched protocols rather than evaluation artifacts, the benchmark would provide a valuable language-free test of fluid adaptive efficiency, extending the ARC-AGI series and offering a concrete challenge for agentic capabilities in frontier models.
major comments (2)
- [Methodology / Evaluation Protocol] The description of the AI testing setup omits the precise observation/action interface, episode length limits, number of trials per environment, and the prompting regime supplied to the frontier models. Because the <1% claim is load-bearing for the headline result, any deviation from the human calibration protocol (e.g., richer state representations or additional trials) could render the reported gap non-comparable.
- [Abstract and Results] The 100% human / <1% AI figures are stated without reporting the number of environments, the human sample size, the identity of the specific frontier models tested, or any statistical controls for variance. This absence leaves the central empirical claim without sufficient verifiable support.
minor comments (2)
- [Scoring Framework] The efficiency-based scoring framework would benefit from an explicit equation or pseudocode showing how human action baselines are normalized into the final score.
- [Benchmark Design] Figure captions and environment descriptions could be expanded to clarify the exact turn-based interaction loop for readers unfamiliar with prior ARC-AGI versions.
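For readers unfamiliar with the format, the turn-based interaction loop the second comment refers to can be sketched as follows. The interface names (`reset`, `step`, the observation/done return shape) are assumptions for illustration, not the benchmark's actual API:

```python
from typing import Any, Protocol

class Environment(Protocol):
    """Assumed minimal turn-based environment interface."""
    def reset(self) -> Any: ...                           # initial observation
    def step(self, action: int) -> tuple[Any, bool]: ...  # (observation, done)

def run_episode(env: Environment, agent, max_turns: int = 1000) -> int:
    """Play one turn-based episode; return the number of actions taken,
    or -1 if the (implicit, uninstructed) goal was never reached."""
    obs = env.reset()
    for turn in range(max_turns):
        action = agent.act(obs)        # agent must infer the goal itself
        obs, done = env.step(action)   # no reward signal, no instructions
        agent.observe(obs)             # update internal model of dynamics
        if done:
            return turn + 1
    return -1
```

The action count this loop returns is exactly what an efficiency-based score would compare against the human baseline for the same environment.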
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of the evaluation protocol and empirical results.
Point-by-point responses
-
Referee: Methodology section (evaluation protocol): the description of the AI testing setup omits the precise observation/action interface, episode length limits, number of trials per environment, and prompting regime supplied to frontier models. Because the <1% claim is load-bearing for the headline result, any deviation from the human calibration protocol (e.g., richer state representations or additional trials) could render the reported gap non-comparable.
Authors: We agree that the current description of the AI evaluation protocol is insufficiently detailed for full reproducibility and comparability. The manuscript provides a high-level overview but does not enumerate the exact observation and action spaces, maximum episode lengths, trial counts per environment, or the precise prompting format used with frontier models. In the revised manuscript we will expand the Methodology section to specify these parameters explicitly and to document how they were aligned with the human calibration protocol (including identical state representations and trial limits). This will allow readers to verify that the reported performance gap reflects matched conditions rather than protocol differences. revision: yes
-
Referee: Results and abstract: the 100% human / <1% AI figures are stated without reporting the number of environments, human sample size, identity of the specific frontier models tested, or any statistical controls for variance. This absence leaves the central empirical claim without sufficient verifiable support.
Authors: We acknowledge that the abstract and results sections currently present the headline performance figures without the supporting quantitative details the referee requests. The manuscript states the aggregate outcomes but does not list the total number of environments, the size of the human test-taker cohort, the exact frontier models evaluated as of March 2026, or variance statistics. In the revision we will add these elements: the benchmark size, human sample size and recruitment criteria, the specific model versions tested, and appropriate statistical controls (e.g., per-environment success rates and confidence intervals). These additions will provide the verifiable support needed for the central claim while preserving the existing narrative. revision: yes
Circularity Check
No significant circularity in benchmark construction or performance claims
full rationale
The paper presents ARC-AGI-3 as an empirical benchmark whose environments are constructed and calibrated through separate human testing protocols, with reported human (100%) and AI (<1%) scores arising from distinct evaluation runs rather than any derivation or equation that reduces one to the other by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; references to prior ARC versions serve only as background and do not justify the new calibration or scoring results. The efficiency-based scoring framework is described as grounded in observed human action baselines without evidence that the headline performance gap is mathematically forced by the calibration inputs themselves. The derivation chain is therefore self-contained as an independent empirical report.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: environments only leverage Core Knowledge priors.
- Domain assumption: human test-taker results provide a valid difficulty calibration and efficiency baseline.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "ARC-AGI-3 environments only leverage Core Knowledge priors... efficiency-based scoring framework grounded in human action baselines"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "turn-based... 8 levels... action efficiency"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 10 Pith papers
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
-
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.
-
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Agentick is a new unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents across 37 tasks, showing no single approach dominates.
-
Counting as a minimal probe of language model reliability
Language models have limited stable counting capacity well below context limits and rely on a finite set of count-like internal states, collapsing to guessing once exhausted.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
Reference graph
Works this paper leans on
[1] ARC Prize 2024 Competition. https://arcprize.org/competitions/2024, 2024.
[2] ARC Prize 2025 Competition. https://arcprize.org/competitions/2025, 2025.
[3] ARC Prize Foundation. https://arcprize.org/, 2026. Founders: Mike Knoop, François Chollet. Operations: Bryan Landers, Greg Kamradt.
[4] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On Evaluation of Embodied Navigation Agents, 2018.
[5] ARC Prize Foundation. ARC-AGI Community Leaderboard. https://github.com/arcprize/ARC-AGI-Community-Leaderboard, 2026.
[6] ARC Prize Foundation. ARC-AGI Toolkit. https://github.com/arcprize/ARC-AGI, 2026.
[7] ARC Prize Foundation. Gemini 3 Deep Think Preview Verification on ARC-AGI-2. https://huggingface.co/datasets/arcprize/arc_agi_v2_public_eval, 2026.
[8] François Chollet. On the Measure of Intelligence. https://arxiv.org/abs/1911.01547, 2019.
[9] François Chollet. OpenAI o3 Breakthrough High Score on ARC-AGI-Pub. https://arcprize.org/blog/oai-o3-pub-breakthrough, December 2024.
[10] François Chollet, Katherine Tong, Walter Reade, and Julia Elliott. Abstraction and Reasoning Challenge. https://kaggle.com/competitions/abstraction-and-reasoning-challenge, 2020. Kaggle.
[11] Alexis Fox, Junlin Wang, Paul Rosu, and Bhuwan Dhingra. Hill-Climbing ARC-AGI-3, 2026.
[12] Steve Hsu. Post on LRM automation discovering novel results in quantum physics. https://x.com/hsu_steve/status/1996034522308026435, 2025.
[13] Greg Kamradt. ARC-AGI-3 Preview: 30-Day Learnings. https://arcprize.org/blog/arc-agi-3-preview-30-day-learnings, August 2025.
[14] Samuel Knutsen and Victoria Klein. Arcgentica: ARC-AGI-3 Agent Harness Built on the Agentica SDK. https://github.com/symbolica-ai/ARC-AGI-3-Agents, 2026.
[15] Lab42. ARCathon 2022. https://lab42.global/past-challenges/2022-arcathon/, 2022.
[16] Lab42. ARCathon 2023. https://lab42.global/past-challenges/2023-arcathon/, 2023.
[17] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the..., 2016.
[18] Dries Smit. ARC3 Solution. https://github.com/DriesSmit/ARC3-solution, 2025.
[19] I. Sorokin and Jean-Francois Puget. NVARC Solution to ARC-AGI-2 2025. https://drive.google.com/file/d/1vkEluaaJTzaZiJL69TkZovJUkPSDH5Xc/view, 2025.
[20] Elizabeth S. Spelke and Katherine D. Kinzler. Core Knowledge. Developmental Science, pages 89–96, 2007.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. https://arxiv.org/abs/1706.03762, 2017.
[22] wd13ca. ARC-AGI-3 Agents. https://github.com/wd13ca/ARC-AGI-3-Agents, 2025.
[23] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903, 2022.