ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

2026 · cs.AI · arXiv 2603.24621

8 Pith papers cite this work. Polarity classification is still indexing.

abstract

We introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions. Like its predecessors, ARC-AGI-1 and ARC-AGI-2, ARC-AGI-3 focuses entirely on evaluating fluid adaptive efficiency on novel tasks, while avoiding language and external knowledge. ARC-AGI-3 environments leverage only Core Knowledge priors and are difficulty-calibrated via extensive testing with human test-takers. Our testing shows humans can solve 100% of the environments, in contrast to frontier AI systems, which, as of March 2026, score below 1%. In this paper, we present the benchmark design, its efficiency-based scoring framework grounded in human action baselines, and the methodology used to construct, validate, and calibrate the environments.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

cs.AI · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

The paper learns a state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

cs.CL · 2026-05-09 · conditional · novelty 6.0 · 2 refs

Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

cs.CV · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

Language models fail at extended rule following

cs.CL · 2026-05-03 · unverdicted · novelty 5.0

LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

cs.AI · 2026-05-07 · unverdicted · none · ref 10 · internal anchor

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

cs.AI · 2026-05-13 · unverdicted · none · ref 7 · internal anchor

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

cs.AI · 2026-05-11 · unverdicted · none · ref 16 · 3 links · internal anchor

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

cs.AI · 2026-05-08 · unverdicted · none · ref 16 · internal anchor
