On the Measure of Intelligence

Canonical reference. 81% of citing Pith papers cite this work as background.

99 Pith papers citing it
Background 81% of classified citations
abstract

To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

citation-role summary

background 14 · dataset 2

claims ledger

  • background depth transformers with this capability. These works have a similar aim to ours, enabling reasoning in latent space, but approach this goal from separate directions. For additional discussions related to the idea of constructing a prior that incentivizes reasoning and algorithm learning at the expense of memorization of simple patterns, we also refer to Chollet (2019), Schwarzschild (2023), Li et al. (2020b) and Moulton (2023). 9. Future Work Aside from work extending and analyzing the scali
  • background These techniques can be categorized into two main types based on the source of feedback: process reward models (PRMs) and prompted LLMs. The performance comparisons are mainly shown in Table 4. Process Feedback from Process Rewarded Model Recent studies highlight the significance of feedback in developing effective PRMs for complex reasoning tasks, particularly in a step-level view [134, 423, 528]. (1) Process Annotated PRM Training: Earlier, Lightman et al. [449] demonstrate that training proc

representative citing papers

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

Are Flat Minima an Illusion?

cs.LG · 2026-03-24 · unverdicted · novelty 8.0

Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.

Show Me Examples: Inferring Visual Concepts from Image Sets

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

$\Omega$: Operator-based Mixture Ensemble for Generative Assimilation

cs.LG · 2026-06-18 · unverdicted · novelty 7.0

Ω is a generative assimilation method that learns residual discrepancies from ensemble data using a conditional Gaussian baseline, then reconstructs full non-Gaussian posteriors via Gaussian mixtures and annealed Langevin sampling.

HLL: Can Agents Cross Humanity's Last Line of Verification?

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.

The Abstraction Gap in Vision-Language Causal Reasoning

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Introduces Abstraction Gap metric and CAGE benchmark showing seven of eight VLMs have large gaps between text plausibility and chain-based causal reasoning, with one model succeeding.

Test-Time Learning with an Evolving Library

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.

Prospective Compression in Human Abstraction Learning

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Humans exhibit abstraction learning consistent with prospective compression of future tasks in non-stationary domains, unlike retrospective compression algorithms or LLM-based approaches.

Lattice Deduction Transformers

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

An 800K-parameter Lattice Deduction Transformer reaches 100% accuracy on Sudoku-Extreme and Snowflake Sudoku and 99.9% on Maze-Hard by using lattice projections and abstract-interpretation supervision, while frontier LLMs score 0%.

citing papers explorer

Showing 50 of 99 citing papers.

  • Gradient-Based Program Synthesis with Neurally Interpreted Languages cs.LG · 2026-04-20 · unverdicted · none · ref 119 · internal anchor

    NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

  • Are Flat Minima an Illusion? cs.LG · 2026-03-24 · unverdicted · none · ref 135 · internal anchor

    Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.

  • Show Me Examples: Inferring Visual Concepts from Image Sets cs.CV · 2026-07-02 · unverdicted · none · ref 13 · internal anchor

    Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.

  • What Drives Interactive Improvement from Feedback? cs.AI · 2026-06-29 · unverdicted · none · ref 1 · internal anchor

    Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

  • Lexical Consensus: Grounded Word Learning and Shared Meaning in Artificial Agents cs.CL · 2026-06-20 · unverdicted · none · ref 6 · internal anchor

    Agents acquire lexical labels for visual concepts following a perceptual coherence gradient where perceptual distance predicts learning accuracy independently of semantic distance in a pre-registered CIFAR-100 experiment.

  • $\Omega$: Operator-based Mixture Ensemble for Generative Assimilation cs.LG · 2026-06-18 · unverdicted · none · ref 74 · internal anchor

    Ω is a generative assimilation method that learns residual discrepancies from ensemble data using a conditional Gaussian baseline, then reconstructs full non-Gaussian posteriors via Gaussian mixtures and annealed Langevin sampling.

  • Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI cs.AI · 2026-06-10 · unverdicted · none · ref 3 · internal anchor

    Introduces DAF-AGI, a second-order conceptual artifact with ordinal criteria for AGI definition fitness and a structured governance audit, demonstrated on five measurement families and tested against a generative-systems arrival claim.

  • HLL: Can Agents Cross Humanity's Last Line of Verification? cs.AI · 2026-06-01 · unverdicted · none · ref 13 · internal anchor

    HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.

  • GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning cs.AI · 2026-05-29 · unverdicted · none · ref 4 · internal anchor

    GraphARC is a scalable benchmark for few-shot graph transformation learning that exposes a comprehension-execution gap in language models on abstract reasoning tasks.

  • StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning cs.CV · 2026-05-29 · unverdicted · none · ref 13 · internal anchor

    StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.

  • The Abstraction Gap in Vision-Language Causal Reasoning cs.CL · 2026-05-27 · unverdicted · none · ref 3 · internal anchor

    Introduces Abstraction Gap metric and CAGE benchmark showing seven of eight VLMs have large gaps between text plausibility and chain-based causal reasoning, with one model succeeding.

  • DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking stat.ML · 2026-05-25 · unverdicted · none · ref 6 · internal anchor

    DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.

  • VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images cs.CV · 2026-05-22 · unverdicted · none · ref 4 · internal anchor

    VisAnalog is a new controlled benchmark showing VLMs substantially underperform humans on visual concept transfer under one- to four-step deterministic transformations, with relation inference as the main failure mode.

  • Test-Time Learning with an Evolving Library cs.LG · 2026-05-14 · unverdicted · none · ref 17 · internal anchor

    EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.

  • Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers cs.AI · 2026-05-13 · conditional · none · ref 3 · internal anchor

    The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.

  • Prospective Compression in Human Abstraction Learning cs.AI · 2026-05-11 · unverdicted · none · ref 51 · internal anchor

    Humans exhibit abstraction learning consistent with prospective compression of future tasks in non-stationary domains, unlike retrospective compression algorithms or LLM-based approaches.

  • Lattice Deduction Transformers cs.LG · 2026-05-09 · unverdicted · none · ref 41 · internal anchor

    An 800K-parameter Lattice Deduction Transformer reaches 100% accuracy on Sudoku-Extreme and Snowflake Sudoku and 99.9% on Maze-Hard by using lattice projections and abstract-interpretation supervision, while frontier LLMs score 0%.

  • Intervention Complexity as a Canonical Reward and a Measure of Intelligence cs.AI · 2026-05-04 · unverdicted · none · ref 4 · 2 links · internal anchor

    Intervention complexity provides a family of canonical rewards indexed by resource bias that completes the Legg-Hutter framework and enables a two-dimensional view of intelligence as competence plus learning efficiency.

  • AI scientists produce results without reasoning scientifically cs.AI · 2026-04-20 · conditional · none · ref 18 · internal anchor

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  • Yanasse: Finding New Proofs from Deep Vision's Analogies, Part 1 cs.AI · 2026-04-19 · unverdicted · none · ref 19 · internal anchor

    A domain-independent analogy engine transfers Lean tactic patterns from probability to representation theory, producing four new machine-verified proofs.

  • Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs cs.AI · 2026-04-09 · accept · none · ref 16 · internal anchor

    The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

  • Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism cs.LO · 2026-04-07 · unverdicted · none · ref 123 · internal anchor

    ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

  • Factorization Regret mediates compositional generalization in latent space cs.LG · 2026-03-28 · unverdicted · none · ref 18 · internal anchor

    Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.

  • Less is More: Recursive Reasoning with Tiny Networks cs.LG · 2025-10-06 · unverdicted · none · ref 3 · internal anchor

    TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.

  • VCBench: Benchmarking LLMs in Venture Capital cs.AI · 2025-09-17 · unverdicted · none · ref 4 · internal anchor

    VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.

  • PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts cs.CL · 2025-06-06 · conditional · none · ref 9 · internal anchor

    PuzzleWorld benchmark reveals state-of-the-art AI models solve only 18% of complex puzzlehunt problems with 40% stepwise accuracy, matching novices but trailing enthusiasts, while fine-tuning on traces yields modest gains.

  • PRIMETIME: Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 24 · internal anchor

    PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 33 · internal anchor

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  • GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models cs.LG · 2024-10-07 · accept · none · ref 64 · internal anchor

    LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.

  • Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 143 · internal anchor

    Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.

  • AGC-Bench: Measuring Artificial General Creativity cs.CL · 2026-07-01 · unverdicted · none · ref 48 · 2 links · internal anchor

    AGC-Bench introduces a multi-domain creativity benchmark for LLMs, recovers a general 'c' factor explaining 81.5% of variance, and finds humans still outperform top models on matched tasks.

  • Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2 cs.AI · 2026-06-30 · unverdicted · none · ref 10 · internal anchor

    A modality-driven search system with holistic trace judging for ARC-AGI-2 reaches 72.9% on the semi-private set and 76.1% on the public set, outperforming GPT-5.2 Pro and Gemini 3 Pro by 18.7 points while releasing full code.

  • COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives cs.LG · 2026-06-26 · unverdicted · none · ref 3 · internal anchor

    COCOLogic-V2 is a new object-centric dataset for visual inductive reasoning that splits samples into positives, near-boundary negatives, and far-from-boundary negatives to expose model failures on logical inconsistencies.

  • DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models cs.CL · 2026-06-25 · unverdicted · none · ref 9 · internal anchor

    DiARC improves LLM performance on ARC-like benchmarks by constructing and training on preference pairs from three types of negative samples while keeping demonstrations fixed.

  • You Don't Need to Run Every Eval cs.LG · 2026-06-22 · conditional · none · ref 55 · internal anchor

    The benchmark score matrix of 84 models on 133 tasks is approximately rank-2; BenchPress recovers held-out scores to within 4.6 points and identifies 5-benchmark subsets that predict the full scorecard to within 3.93-4.55 points.

  • The Metanym Game: A Self-Contained, Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence cs.CL · 2026-06-19 · unverdicted · none · ref 34 · internal anchor

    Proposes the Metanym Game as a self-contained LLM benchmark using peer ratings and SVD to extract generator and judge competence, with skills dissociating and correlation to GPQA Diamond at r=0.92.

  • MÖVE: A Holistic LLM Benchmark for the German Public Sector cs.CL · 2026-06-11 · unverdicted · none · ref 39 · internal anchor

    MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.

  • Slots, Transitions, Loops: Learning Composable World Models for ARC cs.CV · 2026-06-10 · unverdicted · none · ref 2 · internal anchor

    Loop-OWM uses color-prototype slots, demonstration-conditioned task summaries, and looped transitions to model ARC rules as visual-symbolic state changes and outperforms baselines on ARC-1 and ARC-2.

  • HERO'S JOURNEY: Testing Complex Rule Induction with Text Games cs.CL · 2026-06-01 · unverdicted · none · ref 40 · internal anchor

    HERO'S JOURNEY benchmark evaluates LLMs on attribute and procedural rule induction across four structural forms, finding limited uneven performance with execution as the main bottleneck and steering helping only attribute tasks.

  • TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL cs.AI · 2026-06-01 · unverdicted · none · ref 7 · internal anchor

    TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.

  • Open-World Evaluations for Measuring Frontier AI Capabilities cs.AI · 2026-05-19 · conditional · none · ref 13 · internal anchor

    Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.

  • optimize_anything: A Universal API for Optimizing any Text Parameter cs.CL · 2026-05-19 · unverdicted · none · ref 8 · internal anchor

    A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.

  • Generative Recursive Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 13 · 2 links · internal anchor

    GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.

  • LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design cs.LG · 2026-05-14 · unverdicted · none · ref 3 · internal anchor

    LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.

  • The Evaluation Trap: Benchmark Design as Theoretical Commitment cs.AI · 2026-05-13 · unverdicted · none · ref 4 · internal anchor

    AI benchmarks trap progress by operationalizing assumptions that redefine capabilities around the benchmarks themselves, and Epistematics provides an audit procedure to detect when evaluations cannot discriminate claimed capabilities from proxy behaviors.

  • The Generalized Turing Test: A Foundation for Comparing Intelligence cs.AI · 2026-05-11 · unverdicted · none · ref 24 · internal anchor

    The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.

  • When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 11 · 3 links · internal anchor

    Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.

  • Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs cs.AI · 2026-05-09 · unverdicted · none · ref 12 · internal anchor

    OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

  • Query-efficient model evaluation using cached responses cs.LG · 2026-05-08 · unverdicted · none · ref 7 · internal anchor

    DKPS-based methods predict new model benchmark scores using cached responses, matching baseline mean absolute error with substantially fewer queries and an offline query selection approach.

  • Continuous Latent Diffusion Language Model cs.CL · 2026-05-07 · unverdicted · none · ref 16 · internal anchor

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model