On the Measure of Intelligence

François Chollet

arxiv: 1911.01547 · v2 · submitted 2019-11-05 · 💻 cs.AI

Pith reviewed 2026-05-12 12:59 UTC · model grok-4.3

keywords: intelligence measurement, skill acquisition efficiency, algorithmic information theory, generalization, abstraction and reasoning corpus, priors, AI benchmarks, fluid intelligence

The pith

Intelligence is the efficiency of acquiring skills from limited experience, not performance on fixed tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing ways of assessing intelligence in AI focus on how well systems do at specific tasks such as games, yet this measure can be artificially raised by supplying unlimited prior knowledge or training data. A more accurate approach defines intelligence as skill-acquisition efficiency, which accounts for the range of tasks a system can address, the difficulty of generalizing to new ones, and the starting priors plus experience required. This definition draws from algorithmic information theory to separate the system's own generalization power from external advantages. If correct, it would allow direct, fair comparisons of intelligence between AI systems and humans without masking differences through data volume. The paper supplies concrete guidelines for such benchmarks and introduces the Abstraction and Reasoning Corpus built on priors intended to match human innate knowledge.

Core claim

Intelligence is formalized as skill-acquisition efficiency: the rate at which a system develops new capabilities given a defined scope of tasks, a level of generalization difficulty, and a quantity of experience, while incorporating its priors. This formulation, rooted in algorithmic information theory, treats skill at any single task as an insufficient proxy because priors and experience heavily modulate observed performance. The definition therefore directs evaluation toward how economically a system converts limited experience into broad competence.
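
As a rough illustration of the claim above, the sketch below scores two hypothetical systems by skill gained per unit of priors plus experience. The `Run` record, the shared arbitrary units, and the plain ratio are all assumptions made for illustration; this is not the paper's formal definition, only a toy showing why high skill bought with large amounts of data need not mean high efficiency.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Run:
    """One system's result on one held-out task (all fields are hypothetical)."""
    skill: float       # final score on the task, in [0, 1]
    priors: float      # built-in knowledge, in arbitrary units
    experience: float  # task-specific data consumed, in the same units


def acquisition_efficiency(runs: Sequence[Run]) -> float:
    """Average skill per unit of (priors + experience) across tasks.

    Assumes priors + experience is positive for every run.
    """
    if not runs:
        raise ValueError("need at least one run")
    ratios = [r.skill / (r.priors + r.experience) for r in runs]
    return sum(ratios) / len(ratios)


# Placeholder numbers, not measurements.
data_hungry = [Run(skill=0.95, priors=10.0, experience=990.0)]
frugal = [Run(skill=0.80, priors=10.0, experience=30.0)]

print("data-hungry:", acquisition_efficiency(data_hungry))
print("frugal:     ", acquisition_efficiency(frugal))
```

Under these made-up numbers the data-hungry system has the higher raw skill but the lower efficiency, which is the distinction the definition is meant to expose.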

What carries the argument

The formal definition of intelligence as skill-acquisition efficiency from algorithmic information theory, which isolates generalization power by controlling for priors and experience across tasks of varying difficulty.

If this is right

Benchmarks must limit the priors and experience supplied to systems so that measured performance reflects acquisition efficiency rather than external resources.
Comparisons between AI and humans become possible once both operate under comparable innate priors on the same task scope.
AI progress can be tracked by improvements in skill-acquisition efficiency rather than by gains on any single fixed task.
The Abstraction and Reasoning Corpus provides one concrete realization of these guidelines for measuring fluid, human-like intelligence.
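
A minimal sketch of how an evaluation loop might enforce the experience limit described above. It assumes the task layout of the publicly released ARC data (each task holds a few "train" demonstration pairs and one or more "test" pairs, each pair being an input grid and an output grid of integers). The `Solver` signature, the toy task, and the `flip_solver` baseline are hypothetical, and exact-match scoring with a single attempt is a simplification.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
Pair = Dict[str, Grid]        # {"input": grid, "output": grid}
Task = Dict[str, List[Pair]]  # {"train": [...], "test": [...]}
# Hypothetical solver interface: sees only the demonstrations and a test input.
Solver = Callable[[List[Pair], Grid], Grid]


def evaluate(solver: Solver, tasks: List[Task]) -> float:
    """Fraction of test inputs solved by exact match."""
    solved = total = 0
    for task in tasks:
        for pair in task["test"]:
            # The solver never sees pair["output"]; its only task-specific
            # experience is the handful of pairs in task["train"].
            prediction = solver(task["train"], pair["input"])
            solved += int(prediction == pair["output"])
            total += 1
    return solved / total if total else 0.0


def mirror(grid: Grid) -> Grid:
    """Flip a grid left to right."""
    return [row[::-1] for row in grid]


def flip_solver(train: List[Pair], grid: Grid) -> Grid:
    """Toy baseline: mirror the input if that explains every demonstration."""
    if all(mirror(p["input"]) == p["output"] for p in train):
        return mirror(grid)
    return grid  # otherwise fall back to the identity


toy_task: Task = {
    "train": [
        {"input": [[1, 0]], "output": [[0, 1]]},
        {"input": [[2, 0, 0]], "output": [[0, 0, 2]]},
    ],
    "test": [{"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]}],
}

print(evaluate(flip_solver, [toy_task]))
```

Keeping the demonstrations as the solver's only task-specific input is what makes the score reflect acquisition from limited experience rather than memorized task data; controlling the solver's built-in priors is a separate problem this sketch does not address.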

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems optimized under this measure may generalize more readily to open-ended real-world problems than those trained on narrow, data-heavy tasks.
The definition could be applied to non-benchmark settings by defining new task scopes and measuring acquisition rates under controlled priors.
If the approach holds, large-scale pretraining on fixed datasets would be revealed as a limited path to general intelligence.

Load-bearing premise

The explicit priors chosen for the Abstraction and Reasoning Corpus are sufficiently close to innate human priors that performance differences on the benchmark reflect genuine differences in generalization power.

What would settle it

An AI system that reaches high scores on the Abstraction and Reasoning Corpus yet fails to acquire skills efficiently when tested on a fresh set of tasks with matched priors and experience would show that the benchmark does not isolate the intended form of intelligence.

Original abstract

To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Chollet defines intelligence as skill-acquisition efficiency via AIT and proposes ARC with author-chosen priors, but the lack of validation for those priors matching human ones is the central weakness.

The key thing here is that Chollet defines intelligence as the efficiency with which a system acquires new skills from limited experience, grounded in algorithmic information theory, and he proposes the ARC benchmark to test that. This shifts focus from raw performance on fixed tasks to how well systems generalize with minimal priors and data.

The paper does well at laying out the problems with existing benchmarks. It explains clearly how unlimited training data or priors can inflate skill without reflecting true intelligence. The historical summary of psychology and AI definitions is useful, and the guidelines for a general benchmark (emphasizing scope, generalization difficulty, and explicit priors) are straightforward.

Where it gets soft is in the claim that ARC's priors are close to innate human priors. The paper lists things like objectness, basic geometry, and counting, and says they are designed to be as close as possible, but there's no derivation, no human data comparison, and no test of what happens if you tweak them. Without that, it's hard to know if ARC scores really isolate generalization efficiency or just prior alignment. The whole argument for fair human-AI comparisons rests on this, and it's presented informally.

This paper is aimed at researchers thinking about how to evaluate progress toward general AI. Anyone building or critiquing benchmarks will find the framework helpful. It deserves peer review because the ideas are coherent and the critique of current practices is on point, even if the benchmark itself would benefit from more validation work.

Referee Report

2 major / 2 minor

Summary. The paper summarizes and critically assesses historical definitions of intelligence from psychology and AI, identifies two implicit conceptions guiding them, argues that task-specific skill benchmarks fail to measure intelligence because skill depends on priors and experience, articulates a new formal definition of intelligence grounded in Algorithmic Information Theory as skill-acquisition efficiency (incorporating scope, generalization difficulty, priors, and experience), proposes guidelines for general AI benchmarks, and introduces the Abstraction and Reasoning Corpus (ARC) built on an explicit set of priors designed to approximate innate human priors for measuring human-like fluid intelligence and enabling fair AI-human comparisons.

Significance. If the definition is sound and the ARC priors sufficiently match human innate priors, the work could meaningfully shift AI evaluation toward measuring generalization efficiency rather than acquired skill, providing a principled alternative to current task-specific benchmarks and influencing the design of future intelligence tests.

major comments (2)

[Section on the new formal definition] The section articulating the new formal definition: intelligence is defined as skill-acquisition efficiency drawing on AIT, but the manuscript provides only a conceptual description without a precise mathematical formulation, derivation from core AIT quantities (such as Kolmogorov complexity), or operationalization that would allow quantitative computation or direct falsification of the definition.
[ARC benchmark description] The ARC benchmark description and guidelines section: the central claim that ARC enables fair human-AI comparisons rests on the assumption that its enumerated priors (objectness, basic geometry, counting, etc.) are close enough to innate human priors that performance differences isolate generalization efficiency; however, no derivation, empirical calibration against human data, or sensitivity analysis is supplied to show that modifying any listed prior would not materially alter relative scores.

minor comments (2)

[Abstract and introduction] The abstract and introduction reference 'two historical conceptions of intelligence' without naming or briefly characterizing them, which reduces clarity for readers unfamiliar with the cited psychology and AI literature.
[Conclusions] The manuscript would benefit from an explicit statement of the scope of the proposed definition (e.g., whether it applies only to fluid intelligence or extends to other forms) to avoid overgeneralization in the conclusions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on the manuscript. We address each major comment below, indicating where revisions will be incorporated.

Referee: [Section on the new formal definition] The section articulating the new formal definition: intelligence is defined as skill-acquisition efficiency drawing on AIT, but the manuscript provides only a conceptual description without a precise mathematical formulation, derivation from core AIT quantities (such as Kolmogorov complexity), or operationalization that would allow quantitative computation or direct falsification of the definition.

Authors: We acknowledge that the definition is presented conceptually, drawing on AIT to frame intelligence as skill-acquisition efficiency without supplying a closed-form mathematical expression or explicit derivation from Kolmogorov complexity. This choice was made to emphasize the definition's implications for evaluation and to keep it accessible across psychology and AI. In revision, we will expand the section with a more explicit mapping to AIT notions, such as relating efficiency to the incremental reduction in description length for novel tasks, and add discussion of possible operationalizations along with their limitations for direct falsification.
Revision: partial
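
One way to picture the "incremental reduction in description length" mentioned in this response is sketched below. Kolmogorov complexity is uncomputable, so the sketch substitutes compressed size from `zlib` as a crude, computable upper-bound proxy. That substitution, the function names, and the toy byte strings are assumptions for illustration; nothing here is the operationalization the authors have in mind.

```python
import zlib


def description_length(data: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def conditional_length(task: bytes, prior: bytes) -> int:
    """Approximate extra bytes needed to describe `task` once `prior` is known."""
    return description_length(prior + task) - description_length(prior)


# Hypothetical "prior knowledge": a repeated vocabulary of grid operations.
prior = b"rotate the grid; mirror the grid; count the objects; " * 20
# A task built from material already covered by the prior.
familiar_task = b"mirror the grid; count the objects; rotate the grid; "
# A task with nothing in common with the prior.
novel_task = b"xq7#Lp0!zR4@vK9$mT2&wB6*"

for name, task in [("familiar", familiar_task), ("novel", novel_task)]:
    alone = description_length(task)
    given = conditional_length(task, prior)
    print(f"{name}: alone={alone} bytes, given prior={given} bytes")
```

The familiar task costs only a few extra bytes once the prior is known, while the novel one is barely helped. A general-purpose compressor only captures literal repetition, so this is a cartoon of the idea rather than a usable measure.
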
Referee: [ARC benchmark description] The ARC benchmark description and guidelines section: the central claim that ARC enables fair human-AI comparisons rests on the assumption that its enumerated priors (objectness, basic geometry, counting, etc.) are close enough to innate human priors that performance differences isolate generalization efficiency; however, no derivation, empirical calibration against human data, or sensitivity analysis is supplied to show that modifying any listed prior would not materially alter relative scores.

Authors: The referee is correct that the fairness claim for human-AI comparisons depends on the priors approximating innate human ones, and that the manuscript supplies neither empirical calibration nor sensitivity analysis. We will revise the relevant section to elaborate the rationale for each prior with additional citations from cognitive science literature on core knowledge systems. We will also add an explicit limitations paragraph acknowledging the absence of sensitivity analysis and noting that full empirical calibration is an important avenue for subsequent work.
Revision: partial
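
The sensitivity analysis discussed in this exchange could take roughly the following shape: remove one prior at a time and check whether the ranking of systems changes. The sketch assumes some scoring function is available for a system under a given prior set; `toy_score`, the system names, and all numbers are made-up placeholders, not measurements.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical interface: score of a named system under a given set of priors.
ScoreFn = Callable[[str, FrozenSet[str]], float]


def ranking(systems: List[str], priors: FrozenSet[str], score: ScoreFn) -> List[str]:
    """Systems ordered from best to worst under a given prior set."""
    return sorted(systems, key=lambda s: score(s, priors), reverse=True)


def ablation_report(
    systems: List[str], priors: FrozenSet[str], score: ScoreFn
) -> Dict[str, bool]:
    """For each prior, report whether removing it changes the ranking."""
    baseline = ranking(systems, priors, score)
    return {
        p: ranking(systems, priors - {p}, score) != baseline
        for p in sorted(priors)
    }


def toy_score(system: str, priors: FrozenSet[str]) -> float:
    """Placeholder scores: system "A" leans heavily on the objectness prior."""
    base = {"A": 0.2, "B": 0.5}[system]
    bonus = {
        "A": {"objectness": 0.6, "counting": 0.1},
        "B": {"objectness": 0.1, "counting": 0.1},
    }[system]
    return base + sum(v for p, v in bonus.items() if p in priors)


priors = frozenset({"objectness", "counting"})
print(ablation_report(["A", "B"], priors, toy_score))
```

A prior flagged `True` is one whose presence decides which system looks better, which is the situation the referee is concerned about: relative scores would then reflect prior alignment as much as generalization efficiency.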

Circularity Check

0 steps flagged

No circularity: formal AIT-based definition and benchmark guidelines are independent of fitted inputs or self-referential loops.

The paper articulates a definition of intelligence as skill-acquisition efficiency drawing directly from established Algorithmic Information Theory concepts (scope, generalization difficulty, priors, experience) without any equations or derivations that reduce back to the paper's own data or assumptions by construction. Guidelines for benchmarks follow from this definition. ARC is presented as one implementation using an explicitly enumerated prior set chosen by the authors; while the claim of closeness to human priors is an unvalidated assumption rather than a derived result, it does not create a self-definitional loop, fitted-parameter prediction, or load-bearing self-citation chain. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the applicability of AIT to intelligence and the design of human-like priors, with no free parameters or invented entities explicitly fitted or postulated in the abstract.

axioms (2)

Domain assumption: Intelligence can be formalized as skill-acquisition efficiency using concepts from Algorithmic Information Theory.
The paper bases its new definition directly on AIT without deriving it from more fundamental principles in the abstract.

Domain assumption: The priors in ARC are close to innate human priors.
This assumption is required for the claim that ARC enables fair human-AI comparisons.

pith-pipeline@v0.9.0 · 5589 in / 1455 out tokens · 63432 ms · 2026-05-12T12:59:33.747900+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Foundation/LawOfExistence law_of_existence (echoes)
Passage: We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience.

Foundation/HierarchyEmergence hierarchy_emergence_forces_phi (echoes)
Passage: built upon an explicit set of priors designed to be as close as possible to innate human priors

Foundation/DiscretenessForcing discreteness_forcing_principle (echoes)
Passage: skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to buy arbitrary levels of skills

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Gradient-Based Program Synthesis with Neurally Interpreted Languages
cs.LG 2026-04 unverdicted novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...
Are Flat Minima an Illusion?
cs.LG 2026-03 unverdicted novelty 8.0

Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.
VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images
cs.CV 2026-05 unverdicted novelty 7.0

VisAnalog is a new controlled benchmark showing VLMs substantially underperform humans on visual concept transfer under one- to four-step deterministic transformations, with relation inference as the main failure mode.
Test-Time Learning with an Evolving Library
cs.LG 2026-05 unverdicted novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
cs.AI 2026-05 conditional novelty 7.0

The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
Prospective Compression in Human Abstraction Learning
cs.AI 2026-05 unverdicted novelty 7.0

Humans exhibit abstraction learning consistent with prospective compression of future tasks in non-stationary domains, unlike retrospective compression algorithms or LLM-based approaches.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
cs.AI 2026-05 conditional novelty 7.0

A vision-language policy learns state-conditioned commitment depth to Pareto-dominate fixed-depth baselines on long-horizon puzzles, achieving up to 12.5 pp higher solve rate with 25% fewer actions.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
cs.AI 2026-05 conditional novelty 7.0

State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
Lattice Deduction Transformers
cs.LG 2026-05 unverdicted novelty 7.0

An 800K-parameter Lattice Deduction Transformer reaches 100% accuracy on Sudoku-Extreme and Snowflake Sudoku and 99.9% on Maze-Hard by using lattice projections and abstract-interpretation supervision, while frontier ...
Intervention Complexity as a Canonical Reward and a Measure of Intelligence
cs.AI 2026-05 unverdicted novelty 7.0

Intervention complexity provides a family of canonical rewards indexed by resource bias that completes the Legg-Hutter framework and enables a two-dimensional view of intelligence as competence plus learning efficiency.
AI scientists produce results without reasoning scientifically
cs.AI 2026-04 conditional novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Yanasse: Finding New Proofs from Deep Vision's Analogies, Part 1
cs.AI 2026-04 unverdicted novelty 7.0

A domain-independent analogy engine transfers Lean tactic patterns from probability to representation theory, producing four new machine-verified proofs.
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
cs.AI 2026-04 accept novelty 7.0

The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism
cs.LO 2026-04 unverdicted novelty 7.0

ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
Factorization Regret mediates compositional generalization in latent space
cs.LG 2026-03 unverdicted novelty 7.0

Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
cs.AI 2025-11 unverdicted novelty 7.0

DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spati...
Less is More: Recursive Reasoning with Tiny Networks
cs.LG 2025-10 unverdicted novelty 7.0

TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
VCBench: Benchmarking LLMs in Venture Capital
cs.AI 2025-09 unverdicted novelty 7.0

VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.
PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
cs.CL 2025-06 conditional novelty 7.0

PuzzleWorld benchmark reveals state-of-the-art AI models solve only 18% of complex puzzlehunt problems with 40% stepwise accuracy, matching novices but trailing enthusiasts, while fine-tuning on traces yields modest gains.
PRIMETIME : Limits of LLMs in Temporal Primitives
cs.NE 2025-04 unverdicted novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
cs.LG 2024-10 accept novelty 7.0

LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
Automated Design of Agentic Systems
cs.AI 2024-08 conditional novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
Open-World Evaluations for Measuring Frontier AI Capabilities
cs.AI 2026-05 conditional novelty 6.0

Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one ...
optimize_anything: A Universal API for Optimizing any Text Parameter
cs.CL 2026-05 unverdicted novelty 6.0

A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
Generative Recursive Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

GRAM turns recursive latent reasoning into a generative probabilistic model via stochastic trajectories and amortized variational inference, claiming better performance on structured reasoning tasks than deterministic...
Generative Recursive Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
cs.LG 2026-05 unverdicted novelty 6.0

LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published ...
The Evaluation Trap: Benchmark Design as Theoretical Commitment
cs.AI 2026-05 unverdicted novelty 6.0

AI benchmarks trap progress by operationalizing assumptions that redefine capabilities around the benchmarks themselves, and Epistematics provides an audit procedure to detect when evaluations cannot discriminate clai...
The Generalized Turing Test: A Foundation for Comparing Intelligence
cs.AI 2026-05 unverdicted novelty 6.0

The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while pr...
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
Continuous Latent Diffusion Language Model
cs.CL 2026-05 unverdicted novelty 6.0

Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
Intervention Complexity as a Canonical Reward and a Measure of Intelligence
cs.AI 2026-05 unverdicted novelty 6.0

Intervention complexity provides a family of environment-derived universal rewards indexed by resource bias that completes the Legg-Hutter framework without external normative input.
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
cs.LG 2026-04 unverdicted novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
cs.CV 2026-04 unverdicted novelty 6.0

Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
Representation-Guided Parameter-Efficient LLM Unlearning
cs.CL 2026-04 unverdicted novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
cs.LG 2026-04 unverdicted novelty 6.0

C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...
ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence
cs.AI 2026-03 unverdicted novelty 6.0

ARC-AGI-3 is a benchmark where humans solve 100% of tasks but frontier AI systems score below 1% as of March 2026, using efficiency-based scoring grounded in human baselines.
ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
cs.LG 2025-10 unverdicted novelty 6.0

ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B...
Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
cs.AI 2025-10 unverdicted novelty 6.0

Introduces group matching score for better evaluation of compositional reasoning and Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving SOTA gains including surpassing...
AInstein: Can LLMs Solve Research Problems From Parametric Memory Alone?
cs.AI 2025-10 unverdicted novelty 6.0

LLMs generate valid solutions to over 70% of AI research problems from parametric memory alone but rediscover the exact published approach less than 19% of the time, with performance limited by cross-domain analogical...
Video models are zero-shot learners and reasoners
cs.LG 2025-09 unverdicted novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
cs.CL 2025-06 conditional novelty 6.0

High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
cs.LG 2024-07 unverdicted novelty 6.0

Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
Probabilistic Tiny Recursive Model
cs.AI 2026-05 conditional novelty 5.0

PTRM adds stochastic Gaussian noise to Tiny Recursive Model recursion for parallel trajectory exploration and Q-head selection, raising Sudoku-Extreme accuracy from 87.4% to 98.75% and Pencil Puzzle Bench from 62.6% t...
Predicting Performance of Symbolic and Prompt Programs with Examples
cs.LG 2026-05 unverdicted novelty 5.0

Proposes RAP, a retrieval-based approximate prior method, to predict performance of symbolic programs and LLM prompts on new tasks using a Bernoulli model and corpus-derived performance distributions.
Deep Vision: A Formal Proof of Wolstenholmes Theorem in Lean 4
cs.LO 2026-04 accept novelty 5.0

Wolstenholme's theorem is formally verified in Lean 4 via expansion of a shifted factorial product and vanishing power sums modulo p.
The Rise and Fall of $G$ in AGI
q-bio.NC 2026-04 unverdicted novelty 5.0

PCA on AI model benchmarks reveals a general intelligence factor that rises then falls as specialized reasoning models appear, inverting the expected move toward parsimonious mechanisms.
Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
cs.LG 2026-04 unverdicted novelty 5.0

KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
cs.AI 2026-03 unverdicted novelty 5.0

An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
Intelligence Inertia: Physical Isomorphism and Applications
cs.AI 2026-03 unverdicted novelty 5.0

Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.
How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence
cs.CL 2026-03 unverdicted novelty 5.0

AI's compositional reasoning failures originate in psychological learning paradigms that shaped its architectures, and the ReSynth trimodular framework is proposed to embed systematicity structurally.
Position: AI Evaluations Should be Grounded on a Theory of Capability
cs.AI 2025-09 conditional novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
The Serial Scaling Hypothesis
cs.LG 2025-07 unverdicted novelty 5.0

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.
Hierarchical Reasoning Model
cs.AI 2025-06 unverdicted novelty 5.0

HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
cs.AI 2026-05 unverdicted novelty 4.0

Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
Measuring AI Reasoning: A Guide for Researchers
cs.AI 2026-05 unverdicted novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 62 Pith papers

[1]

I-athlon: Towards a multidimensional Turing test

Sam S Adams, Guruduth Banavar, and Murray Campbell. I-athlon: Towards a multidimensional Turing test. AI Magazine, (1):78–84, 2016

work page 2016
[2]

The Newell test for a theory of cognition

John R. Anderson and Christian Lebiere. The Newell test for a theory of cognition. Behavioral and Brain Sciences, pages 587–601, 2003

work page 2003
[3]

De Anima

Aristotle. De Anima. c. 350 BC

work page
[4]

Cognitive developmental robotics: A survey

Minoru Asada et al. Cognitive developmental robotics: A survey. IEEE Transactions on Autonomous Mental Development, pages 12–34, 2009

work page 2009
[5]

ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018

work page Pith review arXiv 2018
[6]

The arcade learning environment: An evaluation platform for general agents

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Int. Res., (1):253–279, May 2013

work page 2013
[7]

The Animal-AI environment: Training and testing animal-like artificial cognition

Benjamin Beyret, José Hernández-Orallo, Lucy Cheke, Marta Halina, Murray Shanahan, and Matthew Crosby. The Animal-AI environment: Training and testing animal-like artificial cognition, 2019

work page 2019
[8]

Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux

Alfred Binet and Théodore Simon. Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux. L’année psychologique, pages 191–244, 1904

work page 1904
[9]

What is artificial intelligence? Psychometric AI as an answer

Selmer Bringsjord and Bettina Schimanski. What is artificial intelligence? Psychometric AI as an answer. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI’03, pages 887–893, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc

work page 2003
[10]

Sample-efficient reinforcement learning with stochastic ensemble value expansion

Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion, 2018

work page 2018
[11]

The 2005 DARPA Grand Challenge: The Great Robot Race

Martin Buehler, Karl Iagnemma, and Sanjiv Singh. The 2005 DARPA Grand Challenge: The Great Robot Race. Springer Publishing Company, Incorporated, 1st edition, 2007

work page 2005
[12]

Deep Blue

Murray Campbell, A. Joseph Hoane, Jr., and Feng-hsiung Hsu. Deep Blue. Artif. Intell., (1-2):57–83, 2002

work page 2002
[13]

Raymond B. Cattell. Abilities: Their structure, growth, and action. 1971

work page 1971
[14]

G. Chaitin. Algorithmic Information Theory. Cambridge University Press, 1987

work page 1987
[15]

A theory of program size formally identical to information theory

Gregory J Chaitin. A theory of program size formally identical to information theory. Journal of the ACM (JACM), (3):329–340, 1975

work page 1975
[16]

Deep Learning with Python

François Chollet. Deep Learning with Python. Manning Publications, 2017

work page 2017
[17]

Quantifying generalization in reinforcement learning

Karl Cobbe, Oleg Klimov, Christopher Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. CoRR, 2018

work page 2018
[18]

Cultural perceptions of human intelligence

Ebinepre A Cocodia. Cultural perceptions of human intelligence. Journal of Intelligence, 2(4):180–196, 2014

work page 2014
[19]

Origins of domain specificity: the evolution of functional organization

L. Cosmides and J. Tooby. Origins of domain specificity: the evolution of functional organization. pages 85–116, 1994

work page 1994
[20]

Introduction to classical and modern test theory

Linda Crocker and James Algina. Introduction to classical and modern test theory. ERIC, 1986

work page 1986
[21]

The Origin of Species

Charles Darwin. The Origin of Species. 1859

work page
[22]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009

work page 2009
[23]

D. K. Detterman. A challenge to Watson. Intelligence, pages 77–78, 2011

work page 2011
[24]

T.G. Evans. A program for the solution of a class of geometric-analogy intelligence-test questions. pages 271–353, 1968

work page 1968
[25]

What is intelligence?: Beyond the Flynn effect

James R Flynn. What is intelligence?: Beyond the Flynn effect. Cambridge University Press, 2007

work page 2007
[26]

A learning machine: Part i

Richard M Friedberg. A learning machine: Part i. IBM Journal of Research and Development, 2(1):2–13, 1958

work page 1958
[27]

Beyond the Turing Test (workshop), 2014

Gary Marcus, Francesca Rossi, and Manuela Veloso. Beyond the Turing Test (workshop), 2014

work page 2014
[28]

Artificial general intelligence

B. Goertzel and C. Pennachin, editors. Artificial general intelligence. Springer, New York, 2007

work page 2007
[29]

Intelligence and computer simulation

Bert F Green Jr. Intelligence and computer simulation. Transactions of the New York Academy of Sciences, 1964

work page 1964
[30]

Algorithmic information theory

Peter D. Grünwald and Paul M. B. Vitányi. Algorithmic information theory. 2008

work page 2008
[31]

Inductive programming meets the real world

Sumit Gulwani, José Hernández-Orallo, Emanuel Kitzelmann, Stephen H Muggleton, Ute Schmid, and Benjamin Zorn. Inductive programming meets the real world. Communications of the ACM, 58(11):90–99, 2015

work page 2015
[32]

Program Synthesis

Sumit Gulwani, Alex Polozov, and Rishabh Singh. Program Synthesis. 2017

work page 2017
[33]

William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noburu Kuno, Stephanie Milani, Sharada Prasanna Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, Manuela Veloso, and Phillip Wang. The MineRL competition on sample efficient reinforcement learning using human priors. CoRR, 2019

work page 2019
[34]

Fundamentals of Item Response Theory

R. Hambleton, H. Swaminathan, and H. Rogers. Fundamentals of Item Response Theory. Sage Publications, Inc., 1991

work page 1991
[35]

Deep reinforcement learning that matters

P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. 2018

work page 2018
[36]

Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement

José Hernández-Orallo. Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement. Artificial Intelligence Review, pages 397–447, 2017

work page 2017
[37]

The Measure of All Minds: Evaluating Natural and Artificial Intelligence

José Hernández-Orallo. The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press, 2017

work page 2017
[38]

Measuring universal intelligence: Towards an anytime intelligence test

José Hernández-Orallo and David L Dowe. Measuring universal intelligence: Towards an anytime intelligence test. Artificial Intelligence, 174(18):1508–1539, 2010

work page 2010
[39]

Universal psychometrics

José Hernández-Orallo, David L. Dowe, and M. Victoria Hernández-Lloreda. Universal psychometrics. Cogn. Syst. Res., (C):50–74, March 2014

work page 2014
[40]

A formal definition of intelligence based on an intensional variant of algorithmic complexity

José Hernández-Orallo and Neus Minaya-Collado. A formal definition of intelligence based on an intensional variant of algorithmic complexity. 1998

work page 1998
[41]

G.E. Hinton. How neural networks learn from experience. Mind and brain: Readings from the Scientific American magazine, pages 113–124, 1993

work page 1993
[42]

Human Nature: or The fundamental Elements of Policie

Thomas Hobbes. Human Nature: or The fundamental Elements of Policie. 1650

work page
[43]

Universal artificial intelligence: Sequential decisions based on algorithmic probability

Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004

work page 2004
[44]

D. L. Dowe and J. Hernández-Orallo. IQ tests are not for machines, yet. Intelligence, pages 77–81, 2012

work page 2012
[45]

Predicting the generalization gap in deep networks with margin distributions

Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. ArXiv, 2018

work page 2018
[46]

Measuring the tendency of CNNs to learn surface statistical regularities

Jason Jo and Yoshua Bengio. Measuring the tendency of CNNs to learn surface statistical regularities. ArXiv, 2017

work page 2017
[47]

John Raven. Raven Progressive Matrices. Springer, Boston, MA, 2003

work page 2003
[48]

The structure of human intelligence: It is verbal, perceptual, and image rotation (VPR), not fluid and crystallized

Wendy Johnson and Thomas J. Bouchard Jr. The structure of human intelligence: It is verbal, perceptual, and image rotation (VPR), not fluid and crystallized. Intelligence, pages 393–416, 2005

work page 2005
[49]

Obstacle tower: A generalization challenge in vision, control, and planning

Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Aug 2019

work page 2019
[50]

Illuminating Generalization in Deep Reinforcement Learning through Procedural Level Generation

Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv preprint arXiv:1806.10729, 2018

work page Pith review arXiv 2018
[51]

Building machines that learn and think like people

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. CoRR, 2016

work page 2016
[52]

Deep learning

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, (7553):436, 2015

work page 2015
[53]

A collection of definitions of intelligence

Shane Legg and Marcus Hutter. A collection of definitions of intelligence. 2007

work page 2007
[54]

Universal intelligence: A definition of machine intelligence

Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and machines, 17(4):391–444, 2007

work page 2007
[55]

An introduction to Kolmogorov complexity and its applications, volume 3

Ming Li, Paul Vitányi, et al. An introduction to Kolmogorov complexity and its applications, volume 3. Springer

work page
[56]

An Essay Concerning Human Understanding

John Locke. An Essay Concerning Human Understanding. 1689

work page
[57]

Human performance on the traveling salesman and related problems: A review

James Macgregor and Yun Chu. Human performance on the traveling salesman and related problems: A review. The Journal of Problem Solving, 3, 02 2011

work page 2011
[58]

Human performance on the traveling salesman problem

James Macgregor and Thomas Ormerod. Human performance on the traveling salesman problem. Perception & psychophysics, 58:527–39, 06 1996

work page 1996
[59]

Deep Learning: A Critical Appraisal

Gary Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018

work page Pith review arXiv 2018
[60]

Generality in artificial intelligence

John McCarthy. Generality in artificial intelligence. Communications of the ACM, 30(12):1030–1035, 1987

work page 1987
[61]

Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence

Pamela McCorduck. Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. AK Peters Ltd, 2004

work page 2004
[62]

The Cattell-Horn-Carroll theory of cognitive abilities: Past, present, and future

Kevin McGrew. The Cattell-Horn-Carroll theory of cognitive abilities: Past, present, and future. Contemporary Intellectual Assessment: Theories, Tests, and Issues, 01 2005

work page 2005
[63]

Society of mind

Marvin Minsky. Society of mind. Simon and Schuster, 1988

work page 1988
[64]

Place cells, grid cells, and memory

May-Britt Moser, David C Rowland, and Edvard I Moser. Place cells, grid cells, and memory. Cold Spring Harbor perspectives in biology, 7(2):a021808, 2015

work page 2015
[65]

Shane Mueller, Matt Jones, Brandon Minnery, Ph Julia, and M Hiland. The BICA cognitive decathlon: A test suite for biologically-inspired cognitive agents. Proceedings of the 16th Conference on Behavior Representation in Modeling and Simulation, 2007

work page 2007
[66]

A. Newell. You can't play 20 questions with nature and win: Projective comments on the papers of this symposium. 1973

work page 1973
[67]

Exploring generalization in deep learning

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017

work page 2017
[68]

Behaviour suite for reinforcement learning

Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvári, Satinder Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019

work page arXiv 1908
[69]

P. R. Cohen and A. E. Howe. How evaluation guides AI research: the message still counts more than the medium. AI Mag, page 35, 1988

work page 1988
[70]

Assessing generalization in deep reinforcement learning

Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Xiaodong Song. Assessing generalization in deep reinforcement learning. ArXiv, 2018

work page 2018
[71]

The multi-agent reinforcement learning in Malmö (MARLÖ) competition

Diego Perez-Liebana, Katja Hofmann, Sharada Prasanna Mohanty, Noboru Sean Kuno, Andre Kramer, Sam Devlin, Raluca D. Gaina, and Daniel Ionita. The multi-agent reinforcement learning in Malmö (MARLÖ) competition. Technical report, 2019

work page 2019
[72]

General video game ai: a multi-track framework for evaluating agents, games and content generation algorithms

Diego Perez-Liebana, Jialin Liu, Ahmed Khalifa, Raluca D Gaina, Julian Togelius, and Simon M Lucas. General video game ai: a multi-track framework for evaluating agents, games and content generation algorithms. arXiv preprint arXiv:1802.10363, 2018

work page arXiv 2018
[73]

Reproducible, Reusable, and Robust Reinforcement Learning, 2018

Joelle Pineau. Reproducible, Reusable, and Robust Reinforcement Learning, 2018. Neural Information Processing Systems

work page 2018
[74]

S. Pinker. The blank slate: The modern denial of human nature. Viking, New York, 2002

work page 2002
[75]

David M. W. Powers. The total Turing test and the Loebner prize. In New Methods in Language Processing and Computational Natural Language Learning, 1998

work page 1998
[76]

Towards generalization and simplicity in continuous control

A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade. Towards generalization and simplicity in continuous control. 2017

work page 2017
[77]

Promise of AI not so bright, 2006

Fred Reed. Promise of AI not so bright, 2006

work page 2006
[78]

Emile, or On Education

Jean-Jacques Rousseau. Emile, or On Education. 1762

work page
[79]

Distributed memory and the representation of general and specific information

D. E. Rumelhart and J. L. McClelland. Distributed memory and the representation of general and specific information. Journal of Experimental Psychology, pages 159–188, 1985

work page 1985
[80]

A computer program capable of passing IQ tests

P. Sanghi and D. L. Dowe. A computer program capable of passing IQ tests. pages 570–575, 2003

work page 2003

Showing first 80 references.

[1] [1]

I-athlon: Towards a mul- tidimensional turing test

Sam S Adams, Guruduth Banavar, and Murray Campbell. I-athlon: Towards a mul- tidimensional turing test. AI Magazine, (1):78–84, 2016

work page 2016

[2] [2]

Anderson and Christian Lebiere

John R. Anderson and Christian Lebiere. The newell test for a theory of cognition. Behavioral and Brain Sciences, pages 587–601, 2003

work page 2003

[3] [3]

De Anima

Aristotle. De Anima. c. 350 BC

work page

[4] [4]

Cognitive developmental robotics: A survey.IEEE Transactions on Autonomous Mental Development, pages 12–34, 2009

Minoru Asada et al. Cognitive developmental robotics: A survey.IEEE Transactions on Autonomous Mental Development, pages 12–34, 2009

work page 2009

[5] [5]

ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learn- ing to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018

work page Pith review arXiv 2018

[6] [6]

Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Int. Res., (1):253–279, May 2013

work page 2013

[7] [7]

The animal-ai environment: Training and testing animal- like artiﬁcial cognition, 2019

Benjamin Beyret, Jos Hernndez-Orallo, Lucy Cheke, Marta Halina, Murray Shana- han, and Matthew Crosby. The animal-ai environment: Training and testing animal- like artiﬁcial cognition, 2019

work page 2019

[8] [8]

Mthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux

Alfred Binet and Thodore Simon. Mthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux. L’anne psychologique, pages 191–244, 1904

work page 1904

[9] [9]

What is artiﬁcial intelligence? psycho- metric ai as an answer

Selmer Bringsjord and Bettina Schimanski. What is artiﬁcial intelligence? psycho- metric ai as an answer. In Proceedings of the 18th International Joint Conference on Artiﬁcial Intelligence, IJCAI’03, pages 887–893, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc

work page 2003

[10] [10]

Sample-efﬁcient reinforcement learning with stochastic ensemble value expansion, 2018

Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efﬁcient reinforcement learning with stochastic ensemble value expansion, 2018

work page 2018

[11] [11]

The 2005 DARPA Grand Chal- lenge: The Great Robot Race

Martin Buehler, Karl Iagnemma, and Sanjiv Singh. The 2005 DARPA Grand Chal- lenge: The Great Robot Race . Springer Publishing Company, Incorporated, 1st edition, 2007

work page 2005

[12] [12]

Joseph Hoane, Jr., and Feng-hsiung Hsu

Murray Campbell, A. Joseph Hoane, Jr., and Feng-hsiung Hsu. Deep blue. Artif. Intell., (1-2):57–83, 2002

work page 2002

[13] [13]

Raymond B. Cattell. Abilities: Their structure, growth, and action. 1971

work page 1971

[14] [14]

G. Chaitin. Algorithmic Information Theory. Cambridge University Press, 1987. 58

work page 1987

[15] [15]

A theory of program size formally identical to information theory

Gregory J Chaitin. A theory of program size formally identical to information theory. Journal of the ACM (JACM), (3):329–340, 1975

work page 1975

[16] [16]

Deep Learning with Python

Francois Chollet. Deep Learning with Python. Manning Publications, 2017

work page 2017

[17] [17]

Quantifying generalization in reinforcement learning

Karl Cobbe, Oleg Klimov, Christopher Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. CoRR, 2018

work page 2018

[18] [18]

Cultural perceptions of human intelligence

Ebinepre A Cocodia. Cultural perceptions of human intelligence. Journal of Intelli- gence, 2(4):180–196, 2014

work page 2014

[19] [19]

Cosmides and J

L. Cosmides and J. Tooby. Origins of domain speciﬁcity: the evolution of functional organization. page 85116, 1994

work page 1994

[20] [20]

Introduction to classical and modern test theory

Linda Crocker and James Algina. Introduction to classical and modern test theory. ERIC, 1986

work page 1986

[21] [21]

The Origin of Species

Charles Darwin. The Origin of Species. 1859

work page

[22] [22]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large- Scale Hierarchical Image Database. In CVPR09, 2009

work page 2009

[23] [23]

D. K. Detterman. A challenge to watson. Intelligence, page 7778, 2011

work page 2011

[24] [24]

T.G. Evans. A program for the solution of a class of geometric-analogy intelligence- test questions. pages 271–353, 1968

work page 1968

[25] [25]

What is intelligence?: Beyond the Flynn effect

James R Flynn. What is intelligence?: Beyond the Flynn effect. Cambridge Univer- sity Press, 2007

work page 2007

[26] [26]

A learning machine: Part i

Richard M Friedberg. A learning machine: Part i. IBM Journal of Research and Development, 2(1):2–13, 1958

work page 1958

[27] [27]

Beyond the Turing Test (workshop), 2014

Manuela Veloso Gary Marcus, Francesca Rossi. Beyond the Turing Test (workshop), 2014

work page 2014

[28] [28]

Goertzel and C

B. Goertzel and C. Pennachin, editors. Artiﬁcial general intelligence. Springer, New York, 2007

work page 2007

[29] [29]

Intelligence and computer simulation

Bert F Green Jr. Intelligence and computer simulation. Transactions of the New York Academy of Sciences, 1964

work page 1964

[30] [30]

Gr ¨unwald and Paul M

Peter D. Gr ¨unwald and Paul M. B. Vit´anyi. Algorithmic information theory. 2008

work page 2008

[31] [31]

Inductive programming meets the real world

Sumit Gulwani, Jos ´e Hern´andez-Orallo, Emanuel Kitzelmann, Stephen H Muggle- ton, Ute Schmid, and Benjamin Zorn. Inductive programming meets the real world. Communications of the ACM, 58(11):90–99, 2015

work page 2015

[32] [32]

Program Synthesis

Sumit Gulwani, Alex Polozov, and Rishabh Singh. Program Synthesis. 2017

work page 2017

[33] [33]

William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noburu Kuno, Stephanie Milani, Sharada Prasanna Mohanty, Diego Perez Liebana, Rus- lan Salakhutdinov, Nicholay Topin, Manuela Veloso, and Phillip Wang. The minerl competition on sample efﬁcient reinforcement learning using human priors. CoRR, 2019. 59

work page 2019

[34] [34]

Hambleton, H

R. Hambleton, H. Swaminathan, and H. Rogers. Fundamentals of Item Response Theory. Sage Publications, Inc., 1991

work page 1991

[35] [35]

Bachman P

Islam R. Bachman P. Pineau J. Precup D. Henderson, P. and D. Meger. Deep rein- forcement learning that matters. 2018

work page 2018

[36] [36]

Evaluation in artiﬁcial intelligence: from task-oriented to ability-oriented measurement

Jos ´e Hern ´andez-Orallo. Evaluation in artiﬁcial intelligence: from task-oriented to ability-oriented measurement. Artiﬁcial Intelligence Review, pages 397–447, 2017

work page 2017

[37] [37]

The Measure of All Minds: Evaluating Natural and Artiﬁcial Intelligence

Jos ´e Hern´andez-Orallo. The Measure of All Minds: Evaluating Natural and Artiﬁcial Intelligence. Cambridge University Press, 2017

work page 2017

[38] [38]

Measuring universal intelligence: To- wards an anytime intelligence test.Artiﬁcial Intelligence, 174(18):1508–1539, 2010

Jos ´e Hern´andez-Orallo and David L Dowe. Measuring universal intelligence: To- wards an anytime intelligence test.Artiﬁcial Intelligence, 174(18):1508–1539, 2010

work page 2010

[39] [39]

Dowe, and M.Victoria Hern ´andez-Lloreda

Jos ´e Hern´andez-Orallo, David L. Dowe, and M.Victoria Hern ´andez-Lloreda. Uni- versal psychometrics. Cogn. Syst. Res., (C):50–74, March 2014

work page 2014

[40] [40]

A formal deﬁnition of intelli- gence based on an intensional variant of algorithmic complexity

Jos ´e Hern ´andez-Orallo and Neus Minaya-Collado. A formal deﬁnition of intelli- gence based on an intensional variant of algorithmic complexity. 1998

work page 1998

[41] [41]

G.E. Hinton. How neural networks learn from experience. Mind and brain: Read- ings from the Scientiﬁc American magazine, page 113124, 1993

work page 1993

[42] [42]

Human Nature: or The fundamental Elements of Policie

Thomas Hobbes. Human Nature: or The fundamental Elements of Policie. 1650

work page

[43] [43]

Universal artiﬁcial intelligence: Sequential decisions based on al- gorithmic probability

Marcus Hutter. Universal artiﬁcial intelligence: Sequential decisions based on al- gorithmic probability. Springer Science & Business Media, 2004

work page 2004

[44] [44]

D.L. Dowe J. Hernndez-Orallo. Iq tests are not for machines, yet. Intelligence, page 7781, 2012

work page 2012

[45] [45]

Predicting the generalization gap in deep networks with margin distributions

Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. ArXiv, 2018

work page 2018

[46] [46]

Measuring the tendency of cnns to learn surface sta- tistical regularities

Jason Jo and Yoshua Bengio. Measuring the tendency of cnns to learn surface sta- tistical regularities. ArXiv, 2017

work page 2017

[47] [47]

Raven J. John. Raven Progressive Matrices. Springer, Boston, MA, 2003

work page 2003

[48] [48]

The structure of human intelligence: It is verbal, perceptual, and image rotation (vpr), not ﬂuid and crystallized

Wendy Johnson and Thomas J.Bouchard Jr. The structure of human intelligence: It is verbal, perceptual, and image rotation (vpr), not ﬂuid and crystallized. Intelligence, pages 393–416, 2005

work page 2005

[49] [49]

Obstacle tower: A generalization challenge in vision, control, and planning.Proceedings of the Twenty- Eighth International Joint Conference on Artiﬁcial Intelligence, Aug 2019

Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning.Proceedings of the Twenty- Eighth International Joint Conference on Artiﬁcial Intelligence, Aug 2019

work page 2019

[50] [50]

Illuminating Generalization in Deep Reinforcement Learning through Procedural Level Generation

Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Ju- lian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv preprint arXiv:1806.10729 , 2018. 60

work page Pith review arXiv 2018

[51] [51]

Lake, Tomer D

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gersh- man. Building machines that learn and think like people. CoRR, 2016

work page 2016

[52] [52]

Deep learning

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, (7553):436, 2015

work page 2015

[53] [53]

A collection of deﬁnitions of intelligence

Shane Legg and Marcus Hutter. A collection of deﬁnitions of intelligence. 2007

work page 2007

[54] [54]

Universal intelligence: A deﬁnition of machine intelligence

Shane Legg and Marcus Hutter. Universal intelligence: A deﬁnition of machine intelligence. Minds and machines, 17(4):391–444, 2007

work page 2007

[55] [55]

An introduction to Kolmogorov complexity and its applications, volume 3

Ming Li, Paul Vit ´anyi, et al. An introduction to Kolmogorov complexity and its applications, volume 3. Springer

work page

[56] [56]

An Essay Concerning Human Understanding

John Locke. An Essay Concerning Human Understanding. 1689

work page

[57] [57]

Human performance on the traveling salesman and related problems: A review

James Macgregor and Yun Chu. Human performance on the traveling salesman and related problems: A review. The Journal of Problem Solving, 3, 02 2011

work page 2011

[58] [58]

Human performance on the traveling sales- man problem

James Macgregor and Thomas Ormerod. Human performance on the traveling sales- man problem. Perception & psychophysics, 58:527–39, 06 1996

work page 1996

[59] [59]

Deep Learning: A Critical Appraisal

Gary Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018

work page Pith review arXiv 2018

[60] [60]

Generality in artiﬁcial intelligence

John McCarthy. Generality in artiﬁcial intelligence. Communications of the ACM, 30(12):1030–1035, 1987

work page 1987

[61] [61]

Machines Who Think: A Personal Inquiry into the History and Prospects of Artiﬁcial Intelligence

Pamela McCorduck. Machines Who Think: A Personal Inquiry into the History and Prospects of Artiﬁcial Intelligence. AK Peters Ltd, 2004

work page 2004

[62] [62]

The Cattell-Horn-Carroll theory of cognitive abilities: Past, present, and future

Kevin McGrew. The Cattell-Horn-Carroll theory of cognitive abilities: Past, present, and future. Contemporary Intellectual Assessment: Theories, Tests, and Issues, 01 2005

work page 2005

[63] [63]

Society of mind

Marvin Minsky. Society of mind. Simon and Schuster, 1988

work page 1988

[64] [64]

Place cells, grid cells, and memory

May-Britt Moser, David C Rowland, and Edvard I Moser. Place cells, grid cells, and memory. Cold Spring Harbor perspectives in biology, 7(2):a021808, 2015

work page 2015

[65] [65]

Shane Mueller, Matt Jones, Brandon Minnery, Ph Julia, and M Hiland. The BICA cognitive decathlon: A test suite for biologically-inspired cognitive agents. Proceedings of the 16th Conference on Behavior Representation in Modeling and Simulation, 2007

work page 2007

[66] [66]

A. Newell. You can't play 20 questions with nature and win: Projective comments on the papers of this symposium. 1973

work page 1973

[67] [67]

Exploring generalization in deep learning

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017

work page 2017

[68] [68]

Behaviour suite for reinforcement learning

Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvári, Satinder Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019

work page arXiv 2019

[69] [69]

P. R. Cohen and A. E. Howe. How evaluation guides AI research: the message still counts more than the medium. AI Mag, page 35, 1988

work page 1988

[70] [70]

Assessing generalization in deep reinforcement learning

Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Xiaodong Song. Assessing generalization in deep reinforcement learning. ArXiv, 2018

work page 2018

[71] [71]

The multi-agent reinforcement learning in Malmö (MARLÖ) competition

Diego Perez-Liebana, Katja Hofmann, Sharada Prasanna Mohanty, Noboru Sean Kuno, Andre Kramer, Sam Devlin, Raluca D. Gaina, and Daniel Ionita. The multi-agent reinforcement learning in Malmö (MARLÖ) competition. Technical report, 2019

work page 2019

[72] [72]

General video game AI: a multi-track framework for evaluating agents, games and content generation algorithms

Diego Perez-Liebana, Jialin Liu, Ahmed Khalifa, Raluca D Gaina, Julian Togelius, and Simon M Lucas. General video game AI: a multi-track framework for evaluating agents, games and content generation algorithms. arXiv preprint arXiv:1802.10363, 2018

work page arXiv 2018

[73] [73]

Reproducible, Reusable, and Robust Reinforcement Learning, 2018

Joelle Pineau. Reproducible, Reusable, and Robust Reinforcement Learning, 2018. Neural Information Processing Systems

work page 2018

[74] [74]

S. Pinker. The blank slate: The modern denial of human nature. Viking, New York, 2002

work page 2002

[75] [75]

David M. W. Powers. The total Turing test and the Loebner Prize. In New Methods in Language Processing and Computational Natural Language Learning, 1998

work page 1998

[76] [76]

Towards generalization and simplicity in continuous control

A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade. Towards generalization and simplicity in continuous control. 2017

work page 2017

[77] [77]

Promise of AI not so bright, 2006

Fred Reed. Promise of AI not so bright, 2006

work page 2006

[78] [78]

Emile, or On Education

Jean-Jacques Rousseau. Emile, or On Education. 1762

work page

[79] [79]

Distributed memory and the representation of general and specific information

D. E. Rumelhart and J. L. McClelland. Distributed memory and the representation of general and specific information. Journal of Experimental Psychology, pages 159–188, 1985

work page 1985

[80] [80]

A computer program capable of passing IQ tests

P. Sanghi and D. L. Dowe. A computer program capable of passing IQ tests. pages 570–575, 2003

work page 2003