hub Canonical reference

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, Henry Pinkard · 2025 · cs.AI · arXiv 2505.11831

Canonical reference. 70% of citing Pith papers cite this work as background.

27 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 27 citing papers arXiv PDF

abstract

The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks only requiring minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal to assess abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that highlights the benchmark's accessibility to human intelligence, yet difficulty for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 dataset 3

citation-polarity summary

background 7 use dataset 2 unclear 1

representative citing papers

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

Don't Pause! Every prediction matters in a streaming video

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.

SASAV: Self-Directed Agent for Scientific Analysis and Visualization

cs.GR · 2026-04-03 · unverdicted · novelty 7.0

SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.

Less is More: Recursive Reasoning with Tiny Networks

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.

VCBench: Benchmarking LLMs in Venture Capital

cs.AI · 2025-09-17 · unverdicted · novelty 7.0

VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.

Neural Cellular Automata: From Cells to Pixels

cs.CV · 2025-06-28 · unverdicted · novelty 7.0

Hybrid coarse-grid NCA plus implicit decoder produces arbitrary-resolution real-time outputs for morphogenesis and texture synthesis on grids and meshes while preserving self-organization.

Open-World Evaluations for Measuring Frontier AI Capabilities

cs.AI · 2026-05-19 · conditional · novelty 6.0

Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.

Generative Recursive Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

Agentic Frameworks for Reasoning Tasks: An Empirical Study

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outperforming HRM on Sudoku-extreme and Maze when paired with the new ItrSA++ model.

VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

VLMs bypass visual comparison by recovering semantic labels for nameable entities and hallucinate on unnamable ones, as shown by performance gaps and Logit Lens analysis.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

cs.LG · 2026-02-20 · conditional · novelty 6.0 · 2 refs

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

cs.AI · 2025-06-07 · unverdicted · novelty 6.0

LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

cs.CL · 2025-06-02 · conditional · novelty 6.0

High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

Probabilistic Tiny Recursive Model

cs.AI · 2026-05-19 · conditional · novelty 5.0

PTRM adds stochastic Gaussian noise to Tiny Recursive Model recursion for parallel trajectory exploration and Q-head selection, raising Sudoku-Extreme accuracy from 87.4% to 98.75% and Pencil Puzzle Bench from 62.6% to 91.2% without retraining.

Language models fail at extended rule following

cs.CL · 2026-05-03 · unverdicted · novelty 5.0 · 2 refs

LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.

Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid

cs.AI · 2026-05-02 · unverdicted · novelty 5.0

A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.

Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency

cs.LG · 2026-04-09 · unverdicted · novelty 5.0

KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.

Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution

cs.AI · 2026-04-09 · unverdicted · novelty 5.0

Squeeze Evolve is a multi-model orchestration framework that improves efficiency and performance in verifier-free evolutionary inference, cutting costs up to 3x while matching verifier-based methods on several benchmarks.

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

cs.CV · 2025-11-06 · unverdicted · novelty 5.0

Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.

citing papers explorer

Showing 27 of 27 citing papers.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs cs.LG · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
Harnessing Agentic Evolution cs.AI · 2026-05-13 · unverdicted · none · ref 5 · internal anchor
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
Don't Pause! Every prediction matters in a streaming video cs.CV · 2026-04-27 · unverdicted · none · ref 15 · internal anchor
SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
SASAV: Self-Directed Agent for Scientific Analysis and Visualization cs.GR · 2026-04-03 · unverdicted · none · ref 10 · internal anchor
SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.
Less is More: Recursive Reasoning with Tiny Networks cs.LG · 2025-10-06 · unverdicted · none · ref 4 · internal anchor
TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
VCBench: Benchmarking LLMs in Venture Capital cs.AI · 2025-09-17 · unverdicted · none · ref 5 · internal anchor
VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.
Neural Cellular Automata: From Cells to Pixels cs.CV · 2025-06-28 · unverdicted · none · ref 1 · internal anchor
Hybrid coarse-grid NCA plus implicit decoder produces arbitrary-resolution real-time outputs for morphogenesis and texture synthesis on grids and meshes while preserving self-organization.
Open-World Evaluations for Measuring Frontier AI Capabilities cs.AI · 2026-05-19 · conditional · none · ref 17 · internal anchor
Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.
Generative Recursive Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 14 · 2 links · internal anchor
GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models cs.LG · 2026-04-20 · unverdicted · none · ref 156 · internal anchor
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI · 2026-04-20 · unverdicted · none · ref 15 · internal anchor
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
Agentic Frameworks for Reasoning Tasks: An Empirical Study cs.AI · 2026-04-17 · unverdicted · none · ref 52 · internal anchor
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions cs.LG · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outperforming HRM on Sudoku-extreme and Maze when paired with the new ItrSA++ model.
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors cs.CV · 2026-04-02 · unverdicted · none · ref 3 · internal anchor
VLMs bypass visual comparison by recovering semantic labels for nameable entities and hallucinate on unnamable ones, as shown by performance gaps and Logit Lens analysis.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs? cs.LG · 2026-02-20 · conditional · none · ref 16 · 2 links · internal anchor
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
SPHINX: A Synthetic Environment for Visual Perception and Reasoning cs.CV · 2025-11-25 · unverdicted · none · ref 12 · internal anchor
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity cs.AI · 2025-06-07 · unverdicted · none · ref 8 · internal anchor
LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning cs.CL · 2025-06-02 · conditional · none · ref 1 · internal anchor
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
Probabilistic Tiny Recursive Model cs.AI · 2026-05-19 · conditional · none · ref 4 · internal anchor
PTRM adds stochastic Gaussian noise to Tiny Recursive Model recursion for parallel trajectory exploration and Q-head selection, raising Sudoku-Extreme accuracy from 87.4% to 98.75% and Pencil Puzzle Bench from 62.6% to 91.2% without retraining.
Language models fail at extended rule following cs.CL · 2026-05-03 · unverdicted · none · ref 17 · 2 links · internal anchor
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.
Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid cs.AI · 2026-05-02 · unverdicted · none · ref 180 · internal anchor
A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.
Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency cs.LG · 2026-04-09 · unverdicted · none · ref 4 · internal anchor
KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution cs.AI · 2026-04-09 · unverdicted · none · ref 10 · internal anchor
Squeeze Evolve is a multi-model orchestration framework that improves efficiency and performance in verifier-free evolutionary inference, cutting costs up to 3x while matching verifier-based methods on several benchmarks.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm cs.CV · 2025-11-06 · unverdicted · none · ref 5 · internal anchor
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
Hierarchical Reasoning Model cs.AI · 2025-06-26 · unverdicted · none · ref 30 · internal anchor
HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 136 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Measuring AI Reasoning: A Guide for Researchers cs.AI · 2026-05-04 · unverdicted · none · ref 160 · internal anchor
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer