NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
hub Canonical reference
On the Measure of Intelligence
Canonical reference. 81% of citing Pith papers cite this work as background.
abstract
To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that h
- background depth transformers with this capability. These works have a similar aim to ours, enabling reasoning in latent space, but approach this goal from separate directions. For additional discussions related to the idea of construct- ing a prior that incentivizes reasoning and algorithm learn- ing at the expense of memorization of simple patterns, we also refer to Chollet (2019), Schwarzschild (2023), Li et al. (2020b) and Moulton (2023). 9. Future Work Aside from work extending and analyzing the scali
- background These techniques can be categorized into two main types based on the source of feedback: process reward models (PRMs) and prompted LLMs. The performance comparison are mainly shown in Table 4. Process Feedback from Process Rewarded Model Recent studies highlight the significance of feedback in developing effective PRMs for complex reasoning tasks, particularly in a step-level view [134, 423, 528]. (1) Process Annotated PRM Training: Earlier, Lightman et al. [449] demon- strate that training proc
co-cited works
representative citing papers
Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.
Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.
GraphARC is a scalable benchmark for few-shot graph transformation learning that exposes a comprehension-execution gap in language models on abstract reasoning tasks.
StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.
Introduces Abstraction Gap metric and CAGE benchmark showing seven of eight VLMs have large gaps between text plausibility and chain-based causal reasoning, with one model succeeding.
DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.
VisAnalog is a new controlled benchmark showing VLMs substantially underperform humans on visual concept transfer under one- to four-step deterministic transformations, with relation inference as the main failure mode.
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.
The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
Humans exhibit abstraction learning consistent with prospective compression of future tasks in non-stationary domains, unlike retrospective compression algorithms or LLM-based approaches.
An 800K-parameter Lattice Deduction Transformer reaches 100% accuracy on Sudoku-Extreme and Snowflake Sudoku and 99.9% on Maze-Hard by using lattice projections and abstract-interpretation supervision, while frontier LLMs score 0%.
Intervention complexity provides a family of canonical rewards indexed by resource bias that completes the Legg-Hutter framework and enables a two-dimensional view of intelligence as competence plus learning efficiency.
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
A domain-independent analogy engine transfers Lean tactic patterns from probability to representation theory, producing four new machine-verified proofs.
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.
PuzzleWorld benchmark reveals state-of-the-art AI models solve only 18% of complex puzzlehunt problems with 40% stepwise accuracy, matching novices but trailing enthusiasts, while fine-tuning on traces yields modest gains.
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
citing papers explorer
-
Why we need an AI-resilient society
Applies forensic psychology profiling to characterize AI risks via nine features and proposes cognitive sovereignty, measurable control, and partial autonomy as a framework for an AI-resilient society.