On the Measure of Intelligence
Pith reviewed 2026-05-12 12:59 UTC · model grok-4.3
The pith
Intelligence is the efficiency of acquiring skills from limited experience, not performance on fixed tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Intelligence is formalized as skill-acquisition efficiency: the rate at which a system develops new capabilities given a defined scope of tasks, a level of generalization difficulty, and a quantity of experience, while incorporating its priors. This formulation, rooted in algorithmic information theory, treats skill at any single task as an insufficient proxy because priors and experience heavily modulate observed performance. The definition therefore directs evaluation toward how economically a system converts limited experience into broad competence.
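To make the contrast between skill and skill-acquisition efficiency concrete, here is a minimal sketch under loudly stated assumptions. The ratio used in `toy_efficiency` is an illustrative stand-in, not the paper's formal definition, and the `Run` fields, units, and numbers are all hypothetical. It shows only that two systems with identical final skill can differ sharply once priors and experience are accounted for.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One system's result on a task scope (all units are illustrative)."""
    name: str
    skill: float        # final skill reached, in [0, 1]
    priors: float       # built-in knowledge supplied, arbitrary units
    experience: float   # training experience consumed, arbitrary units
    difficulty: float   # generalization difficulty of the scope, arbitrary units


def toy_efficiency(run: Run) -> float:
    """Difficulty-weighted skill divided by the resources spent to reach it.

    This ratio is an assumption made for illustration, not the paper's formula.
    """
    resources = run.priors + run.experience
    if resources <= 0:
        raise ValueError("priors + experience must be positive")
    return run.skill * run.difficulty / resources


# Made-up numbers: same final skill, very different experience budgets.
data_hungry = Run("data_hungry", skill=0.9, priors=1.0, experience=99.0, difficulty=2.0)
frugal = Run("frugal", skill=0.9, priors=1.0, experience=9.0, difficulty=2.0)

for run in (data_hungry, frugal):
    print(f"{run.name}: skill={run.skill:.2f} efficiency={toy_efficiency(run):.3f}")
```

Both hypothetical systems reach the same skill, so a fixed-task leaderboard would rank them equally, while the toy ratio rates the frugal one about ten times higher.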
What carries the argument
The formal definition of intelligence as skill-acquisition efficiency, grounded in algorithmic information theory, which isolates generalization power by controlling for priors and experience across tasks of varying difficulty.
If this is right
- Benchmarks must limit the priors and experience supplied to systems so that measured performance reflects acquisition efficiency rather than external resources.
- Comparisons between AI and humans become possible once both operate under comparable innate priors on the same task scope.
- AI progress can be tracked by improvements in skill-acquisition efficiency rather than by gains on any single fixed task.
- The Abstraction and Reasoning Corpus provides one concrete realization of these guidelines for measuring fluid, human-like intelligence.
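The guidelines above can be illustrated with the shape of an ARC-style task. In the public ARC data, each task is a JSON object with "train" and "test" lists of input/output grid pairs, where cells are integers from 0 to 9, and a prediction counts only if the output grid matches exactly. The sketch below follows that shape; the toy task, the two candidate programs (a crude stand-in for explicit priors), and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]
Pair = Dict[str, Grid]
Task = Dict[str, List[Pair]]

# Hypothetical task: mirror every row left to right. Only two demonstrations
# are given, which is the "limited experience" part of the setup.
toy_task: Task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 4], [5, 0]], "output": [[4, 0], [0, 5]]},
    ],
}


def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def mirror_rows(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]


def fits_demonstrations(program: Callable[[Grid], Grid], task: Task) -> bool:
    """True if the program reproduces every demonstration output exactly."""
    return all(program(p["input"]) == p["output"] for p in task["train"])


def solve(task: Task, candidates: List[Callable[[Grid], Grid]]) -> Optional[List[Grid]]:
    """Predict test outputs with the first candidate consistent with all demos."""
    for program in candidates:
        if fits_demonstrations(program, task):
            return [program(p["input"]) for p in task["test"]]
    return None


def score(task: Task, predictions: Optional[List[Grid]]) -> float:
    """Fraction of test outputs reproduced exactly (no partial credit)."""
    if predictions is None:
        return 0.0
    hits = sum(pred == pair["output"] for pred, pair in zip(predictions, task["test"]))
    return hits / len(task["test"])


predictions = solve(toy_task, [identity, mirror_rows])
print(score(toy_task, predictions))
```

Keeping the candidate list explicit is the point of the design: whatever the solver knows before seeing the demonstrations is visible and can be held fixed across systems.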
Where Pith is reading between the lines
- Systems optimized under this measure may generalize more readily to open-ended real-world problems than those trained on narrow, data-heavy tasks.
- The definition could be applied to non-benchmark settings by defining new task scopes and measuring acquisition rates under controlled priors.
- If the approach holds, large-scale pretraining on fixed datasets would be revealed as a limited path to general intelligence.
Load-bearing premise
The explicit priors chosen for the Abstraction and Reasoning Corpus are sufficiently close to innate human priors that performance differences on the benchmark reflect genuine differences in generalization power.
What would settle it
An AI system that reaches high scores on the Abstraction and Reasoning Corpus yet fails to acquire skills efficiently when tested on a fresh set of tasks with matched priors and experience would show that the benchmark does not isolate the intended form of intelligence.
Original abstract
To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper critically assesses historical definitions of intelligence from psychology and AI and identifies two implicit conceptions guiding them. It argues that task-specific skill benchmarks fail to measure intelligence because skill depends on priors and experience. It then articulates a formal definition, grounded in Algorithmic Information Theory, of intelligence as skill-acquisition efficiency (incorporating scope, generalization difficulty, priors, and experience) and proposes guidelines for general AI benchmarks. Finally, it introduces the Abstraction and Reasoning Corpus (ARC), built on an explicit set of priors designed to approximate innate human priors, for measuring human-like fluid intelligence and enabling fair AI-human comparisons.
Significance. If the definition is sound and the ARC priors sufficiently match human innate priors, the work could meaningfully shift AI evaluation toward measuring generalization efficiency rather than acquired skill, providing a principled alternative to current task-specific benchmarks and influencing the design of future intelligence tests.
major comments (2)
- [Section on the new formal definition] The section articulating the new formal definition: intelligence is defined as skill-acquisition efficiency drawing on AIT, but the manuscript provides only a conceptual description without a precise mathematical formulation, derivation from core AIT quantities (such as Kolmogorov complexity), or operationalization that would allow quantitative computation or direct falsification of the definition.
- [ARC benchmark description] The ARC benchmark description and guidelines section: the central claim that ARC enables fair human-AI comparisons rests on the assumption that its enumerated priors (objectness, basic geometry, counting, etc.) are close enough to innate human priors that performance differences isolate generalization efficiency; however, no derivation, empirical calibration against human data, or sensitivity analysis is supplied to show that modifying any listed prior would not materially alter relative scores.
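As a rough illustration of the kind of operationalization the first major comment asks for, the sketch below uses compressed size as a computable stand-in for Kolmogorov complexity, which is itself uncomputable. This is a common proxy, not something taken from the manuscript. The approximation of conditional complexity as C(given + target) - C(given), the byte strings, and the function names are all assumptions made for illustration.

```python
import zlib


def description_length(data: bytes) -> int:
    """Compressed size in bytes, a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def conditional_length(target: bytes, given: bytes) -> int:
    """Approximate K(target | given) as C(given + target) - C(given)."""
    return description_length(given + target) - description_length(given)


# Hypothetical "experience": descriptions of rules the system already knows.
experience = b"mirror each row; recolor the largest object; " * 4

# One novel task that reuses known pieces, and one that does not.
related_task = b"mirror each row; recolor the largest object; "
unrelated_task = b"count enclosed cells, then draw that many squares below. "

for name, task in [("related", related_task), ("unrelated", unrelated_task)]:
    alone = description_length(task)
    given = conditional_length(task, experience)
    print(f"{name}: alone={alone} bytes, given experience={given} bytes")
```

A task that is cheap to describe given prior experience demands little generalization, while one that stays expensive demands more. A general-purpose compressor captures only surface regularities, so this should be read as a crude bound rather than a measurement.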
minor comments (2)
- [Abstract and introduction] The abstract and introduction reference 'two historical conceptions of intelligence' without naming or briefly characterizing them, which reduces clarity for readers unfamiliar with the cited psychology and AI literature.
- [Conclusions] The manuscript would benefit from an explicit statement of the scope of the proposed definition (e.g., whether it applies only to fluid intelligence or extends to other forms) to avoid overgeneralization in the conclusions.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on the manuscript. We address each major comment below, indicating where revisions will be incorporated.
Point-by-point responses
- Referee: [Section on the new formal definition] The section articulating the new formal definition: intelligence is defined as skill-acquisition efficiency drawing on AIT, but the manuscript provides only a conceptual description without a precise mathematical formulation, derivation from core AIT quantities (such as Kolmogorov complexity), or operationalization that would allow quantitative computation or direct falsification of the definition.
  Authors: We acknowledge that the definition is presented conceptually, drawing on AIT to frame intelligence as skill-acquisition efficiency without supplying a closed-form mathematical expression or explicit derivation from Kolmogorov complexity. This choice was made to emphasize the definition's implications for evaluation and to keep it accessible across psychology and AI. In revision, we will expand the section with a more explicit mapping to AIT notions, such as relating efficiency to the incremental reduction in description length for novel tasks, and add discussion of possible operationalizations along with their limitations for direct falsification.
  Revision: partial
- Referee: [ARC benchmark description] The ARC benchmark description and guidelines section: the central claim that ARC enables fair human-AI comparisons rests on the assumption that its enumerated priors (objectness, basic geometry, counting, etc.) are close enough to innate human priors that performance differences isolate generalization efficiency; however, no derivation, empirical calibration against human data, or sensitivity analysis is supplied to show that modifying any listed prior would not materially alter relative scores.
  Authors: The referee is correct that the fairness claim for human-AI comparisons depends on the priors approximating innate human ones, and that the manuscript supplies neither empirical calibration nor sensitivity analysis. We will revise the relevant section to elaborate the rationale for each prior with additional citations from cognitive science literature on core knowledge systems. We will also add an explicit limitations paragraph acknowledging the absence of sensitivity analysis and noting that full empirical calibration is an important avenue for subsequent work.
  Revision: partial
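The sensitivity analysis discussed in the second exchange could take a form like the following minimal sketch: score several systems on the benchmark and on variants in which one listed prior is modified, then check whether any pair of systems swaps order. The condition names, system names, and every score below are made up for illustration. No such measurements appear in the manuscript or in this review.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Made-up scores: benchmark condition -> {system: score}.
scores: Dict[str, Dict[str, float]] = {
    "all_priors": {"system_a": 0.85, "system_b": 0.30, "system_c": 0.20},
    "objectness_modified": {"system_a": 0.60, "system_b": 0.28, "system_c": 0.05},
    "counting_modified": {"system_a": 0.80, "system_b": 0.10, "system_c": 0.18},
}


def pairwise_flips(
    baseline: Dict[str, float], variant: Dict[str, float]
) -> List[Tuple[str, str]]:
    """Return pairs of systems whose relative order differs between conditions."""
    flips = []
    for a, b in combinations(sorted(baseline), 2):
        before = baseline[a] - baseline[b]
        after = variant[a] - variant[b]
        if before * after < 0:  # sign change means the pair swapped order
            flips.append((a, b))
    return flips


baseline = scores["all_priors"]
for condition, variant in scores.items():
    if condition == "all_priors":
        continue
    print(condition, pairwise_flips(baseline, variant))
```

In this invented data, modifying the counting prior swaps two systems while modifying objectness changes scores without reordering anyone. The first outcome is the kind that would undermine a fairness claim.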
Circularity Check
No circularity: formal AIT-based definition and benchmark guidelines are independent of fitted inputs or self-referential loops.
Full rationale
The paper articulates a definition of intelligence as skill-acquisition efficiency drawing directly from established Algorithmic Information Theory concepts (scope, generalization difficulty, priors, experience) without any equations or derivations that reduce back to the paper's own data or assumptions by construction. Guidelines for benchmarks follow from this definition. ARC is presented as one implementation using an explicitly enumerated prior set chosen by the authors; while the claim of closeness to human priors is an unvalidated assumption rather than a derived result, it does not create a self-definitional loop, fitted-parameter prediction, or load-bearing self-citation chain. No steps match the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Intelligence can be formalized as skill-acquisition efficiency using concepts from Algorithmic Information Theory
- domain assumption: The priors in ARC are close to innate human priors
Lean theorems connected to this paper
All three connections are tagged echoes: the paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
- Foundation/LawOfExistence, law_of_existence (echoes)
  Passage: "We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience."
- Foundation/HierarchyEmergence, hierarchy_emergence_forces_phi (echoes)
  Passage: "built upon an explicit set of priors designed to be as close as possible to innate human priors"
- Foundation/DiscretenessForcing, discreteness_forcing_principle (echoes)
  Passage: "skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to buy arbitrary levels of skills"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
- Gradient-Based Program Synthesis with Neurally Interpreted Languages
  NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...
- Are Flat Minima an Illusion?
  Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.
- VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images
  VisAnalog is a new controlled benchmark showing VLMs substantially underperform humans on visual concept transfer under one- to four-step deterministic transformations, with relation inference as the main failure mode.
- Test-Time Learning with an Evolving Library
  EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
- Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
  The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
- Prospective Compression in Human Abstraction Learning
  Humans exhibit abstraction learning consistent with prospective compression of future tasks in non-stationary domains, unlike retrospective compression algorithms or LLM-based approaches.
- When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
  A vision-language policy learns state-conditioned commitment depth to Pareto-dominate fixed-depth baselines on long-horizon puzzles, achieving up to 12.5 pp higher solve rate with 25% fewer actions.
- When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
  State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
- Lattice Deduction Transformers
  An 800K-parameter Lattice Deduction Transformer reaches 100% accuracy on Sudoku-Extreme and Snowflake Sudoku and 99.9% on Maze-Hard by using lattice projections and abstract-interpretation supervision, while frontier ...
- Intervention Complexity as a Canonical Reward and a Measure of Intelligence
  Intervention complexity provides a family of canonical rewards indexed by resource bias that completes the Legg-Hutter framework and enables a two-dimensional view of intelligence as competence plus learning efficiency.
- AI scientists produce results without reasoning scientifically
  LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
- Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
  CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
- Yanasse: Finding New Proofs from Deep Vision's Analogies, Part 1
  A domain-independent analogy engine transfers Lean tactic patterns from probability to representation theory, producing four new machine-verified proofs.
- Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
  The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
- Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism
  ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
- Factorization Regret mediates compositional generalization in latent space
  Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
- DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
  DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spati...
- Less is More: Recursive Reasoning with Tiny Networks
  TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
- VCBench: Benchmarking LLMs in Venture Capital
  VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.
- PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
  PuzzleWorld benchmark reveals state-of-the-art AI models solve only 18% of complex puzzlehunt problems with 40% stepwise accuracy, matching novices but trailing enthusiasts, while fine-tuning on traces yields modest gains.
- PRIMETIME: Limits of LLMs in Temporal Primitives
  PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
  LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
- Automated Design of Agentic Systems
  Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
- Open-World Evaluations for Measuring Frontier AI Capabilities
  Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one ...
- optimize_anything: A Universal API for Optimizing any Text Parameter
  A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
- Generative Recursive Reasoning
  GRAM turns recursive latent reasoning into a generative probabilistic model via stochastic trajectories and amortized variational inference, claiming better performance on structured reasoning tasks than deterministic...
- Generative Recursive Reasoning
  GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
- LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
  LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published ...
- The Evaluation Trap: Benchmark Design as Theoretical Commitment
  AI benchmarks trap progress by operationalizing assumptions that redefine capabilities around the benchmarks themselves, and Epistematics provides an audit procedure to detect when evaluations cannot discriminate clai...
- The Generalized Turing Test: A Foundation for Comparing Intelligence
  The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
- When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
  Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while pr...
- Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
  OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
- Continuous Latent Diffusion Language Model
  Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
- Intervention Complexity as a Canonical Reward and a Measure of Intelligence
  Intervention complexity provides a family of environment-derived universal rewards indexed by resource bias that completes the Legg-Hutter framework without external normative input.
- One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
  Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
- Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
  Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
- C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
  C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...
- ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence
  ARC-AGI-3 is a benchmark where humans solve 100% of tasks but frontier AI systems score below 1% as of March 2026, using efficiency-based scoring grounded in human baselines.
- ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
  ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B...
- Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
  Introduces group matching score for better evaluation of compositional reasoning and Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving SOTA gains including surpassing...
- AInstein: Can LLMs Solve Research Problems From Parametric Memory Alone?
  LLMs generate valid solutions to over 70% of AI research problems from parametric memory alone but rediscover the exact published approach less than 19% of the time, with performance limited by cross-domain analogical...
- Video models are zero-shot learners and reasoners
  Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
  High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
- Probabilistic Tiny Recursive Model
  PTRM adds stochastic Gaussian noise to Tiny Recursive Model recursion for parallel trajectory exploration and Q-head selection, raising Sudoku-Extreme accuracy from 87.4% to 98.75% and Pencil Puzzle Bench from 62.6% t...
- Predicting Performance of Symbolic and Prompt Programs with Examples
  Proposes RAP, a retrieval-based approximate prior method, to predict performance of symbolic programs and LLM prompts on new tasks using a Bernoulli model and corpus-derived performance distributions.
- Deep Vision: A Formal Proof of Wolstenholmes Theorem in Lean 4
  Wolstenholme's theorem is formally verified in Lean 4 via expansion of a shifted factorial product and vanishing power sums modulo p.
- The Rise and Fall of $G$ in AGI
  PCA on AI model benchmarks reveals a general intelligence factor that rises then falls as specialized reasoning models appear, inverting the expected move toward parsimonious mechanisms.
- Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
  KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.
- From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
  An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
- Intelligence Inertia: Physical Isomorphism and Applications
  Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.
- How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence
  AI's compositional reasoning failures originate in psychological learning paradigms that shaped its architectures, and the ReSynth trimodular framework is proposed to embed systematicity structurally.
- Position: AI Evaluations Should be Grounded on a Theory of Capability
  AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
- The Serial Scaling Hypothesis
  The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.
- Hierarchical Reasoning Model
  HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
  Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
- Measuring AI Reasoning: A Guide for Researchers
  Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
Reference graph
Works this paper leans on
-
[1]
I-athlon: Towards a mul- tidimensional turing test
Sam S Adams, Guruduth Banavar, and Murray Campbell. I-athlon: Towards a mul- tidimensional turing test. AI Magazine, (1):78–84, 2016
work page 2016
-
[2]
Anderson and Christian Lebiere
John R. Anderson and Christian Lebiere. The newell test for a theory of cognition. Behavioral and Brain Sciences, pages 587–601, 2003
work page 2003
- [3]
-
[4]
Minoru Asada et al. Cognitive developmental robotics: A survey.IEEE Transactions on Autonomous Mental Development, pages 12–34, 2009
work page 2009
-
[5]
ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learn- ing to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018
work page Pith review arXiv 2018
-
[6]
Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling
Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Int. Res., (1):253–279, May 2013
work page 2013
-
[7]
The animal-ai environment: Training and testing animal- like artificial cognition, 2019
Benjamin Beyret, Jos Hernndez-Orallo, Lucy Cheke, Marta Halina, Murray Shana- han, and Matthew Crosby. The animal-ai environment: Training and testing animal- like artificial cognition, 2019
work page 2019
-
[8]
Mthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux
Alfred Binet and Thodore Simon. Mthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux. L’anne psychologique, pages 191–244, 1904
work page 1904
-
[9]
What is artificial intelligence? psycho- metric ai as an answer
Selmer Bringsjord and Bettina Schimanski. What is artificial intelligence? psycho- metric ai as an answer. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI’03, pages 887–893, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc
work page 2003
-
[10]
Sample-efficient reinforcement learning with stochastic ensemble value expansion, 2018
Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion, 2018
work page 2018
-
[11]
The 2005 DARPA Grand Chal- lenge: The Great Robot Race
Martin Buehler, Karl Iagnemma, and Sanjiv Singh. The 2005 DARPA Grand Chal- lenge: The Great Robot Race . Springer Publishing Company, Incorporated, 1st edition, 2007
work page 2005
-
[12]
Joseph Hoane, Jr., and Feng-hsiung Hsu
Murray Campbell, A. Joseph Hoane, Jr., and Feng-hsiung Hsu. Deep blue. Artif. Intell., (1-2):57–83, 2002
work page 2002
-
[13]
Raymond B. Cattell. Abilities: Their structure, growth, and action. 1971
work page 1971
-
[14]
G. Chaitin. Algorithmic Information Theory. Cambridge University Press, 1987. 58
work page 1987
-
[15]
A theory of program size formally identical to information theory
Gregory J Chaitin. A theory of program size formally identical to information theory. Journal of the ACM (JACM), (3):329–340, 1975
work page 1975
-
[16]
Francois Chollet. Deep Learning with Python. Manning Publications, 2017
work page 2017
-
[17]
Karl Cobbe, Oleg Klimov, Christopher Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. CoRR, 2018
-
[18]
Ebinepre A Cocodia. Cultural perceptions of human intelligence. Journal of Intelligence, 2(4):180–196, 2014
-
[19]
L. Cosmides and J. Tooby. Origins of domain specificity: the evolution of functional organization. pages 85–116, 1994
-
[20]
Linda Crocker and James Algina. Introduction to classical and modern test theory. ERIC, 1986
- [21]
-
[22]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009
-
[23]
D. K. Detterman. A challenge to watson. Intelligence, pages 77–78, 2011
-
[24]
T.G. Evans. A program for the solution of a class of geometric-analogy intelligence-test questions. pages 271–353, 1968
-
[25]
James R Flynn. What is intelligence?: Beyond the Flynn effect. Cambridge University Press, 2007
-
[26]
Richard M Friedberg. A learning machine: Part i. IBM Journal of Research and Development, 2(1):2–13, 1958
-
[27]
Manuela Veloso Gary Marcus, Francesca Rossi. Beyond the Turing Test (workshop), 2014
-
[28]
B. Goertzel and C. Pennachin, editors. Artificial general intelligence. Springer, New York, 2007
-
[29]
Bert F Green Jr. Intelligence and computer simulation. Transactions of the New York Academy of Sciences, 1964
-
[30]
Peter D. Grünwald and Paul M. B. Vitányi. Algorithmic information theory. 2008
-
[31]
Sumit Gulwani, José Hernández-Orallo, Emanuel Kitzelmann, Stephen H Muggleton, Ute Schmid, and Benjamin Zorn. Inductive programming meets the real world. Communications of the ACM, 58(11):90–99, 2015
-
[32]
Sumit Gulwani, Alex Polozov, and Rishabh Singh. Program Synthesis. 2017
-
[33]
William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noburu Kuno, Stephanie Milani, Sharada Prasanna Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, Manuela Veloso, and Phillip Wang. The minerl competition on sample efficient reinforcement learning using human priors. CoRR, 2019
-
[34]
R. Hambleton, H. Swaminathan, and H. Rogers. Fundamentals of Item Response Theory. Sage Publications, Inc., 1991
- [35]
-
[36]
José Hernández-Orallo. Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement. Artificial Intelligence Review, pages 397–447, 2017
-
[37]
José Hernández-Orallo. The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press, 2017
-
[38]
José Hernández-Orallo and David L Dowe. Measuring universal intelligence: Towards an anytime intelligence test. Artificial Intelligence, 174(18):1508–1539, 2010
-
[39]
José Hernández-Orallo, David L. Dowe, and M. Victoria Hernández-Lloreda. Universal psychometrics. Cogn. Syst. Res., (C):50–74, March 2014
-
[40]
José Hernández-Orallo and Neus Minaya-Collado. A formal definition of intelligence based on an intensional variant of algorithmic complexity. 1998
-
[41]
G.E. Hinton. How neural networks learn from experience. Mind and brain: Readings from the Scientific American magazine, pages 113–124, 1993
-
[42]
Thomas Hobbes. Human Nature: or The fundamental Elements of Policie. 1650
-
[43]
Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004
-
[44]
D.L. Dowe J. Hernández-Orallo. Iq tests are not for machines, yet. Intelligence, pages 77–81, 2012
-
[45]
Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. ArXiv, 2018
-
[46]
Jason Jo and Yoshua Bengio. Measuring the tendency of cnns to learn surface statistical regularities. ArXiv, 2017
-
[47]
Raven J. John. Raven Progressive Matrices. Springer, Boston, MA, 2003
-
[48]
Wendy Johnson and Thomas J. Bouchard Jr. The structure of human intelligence: It is verbal, perceptual, and image rotation (vpr), not fluid and crystallized. Intelligence, pages 393–416, 2005
-
[49]
Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Aug 2019
-
[50]
Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv preprint arXiv:1806.10729, 2018
-
[51]
Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gersh- man. Building machines that learn and think like people. CoRR, 2016
-
[52]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, (7553):436, 2015
-
[53]
Shane Legg and Marcus Hutter. A collection of definitions of intelligence. 2007
-
[54]
Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and machines, 17(4):391–444, 2007
-
[55]
Ming Li, Paul Vitányi, et al. An introduction to Kolmogorov complexity and its applications, volume 3. Springer
-
[56]
John Locke. An Essay Concerning Human Understanding. 1689
-
[57]
James Macgregor and Yun Chu. Human performance on the traveling salesman and related problems: A review. The Journal of Problem Solving, 3, 02 2011
-
[58]
James Macgregor and Thomas Ormerod. Human performance on the traveling salesman problem. Perception & psychophysics, 58:527–39, 06 1996
-
[59]
Gary Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018
-
[60]
John McCarthy. Generality in artificial intelligence. Communications of the ACM, 30(12):1030–1035, 1987
-
[61]
Pamela McCorduck. Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. AK Peters Ltd, 2004
-
[62]
Kevin McGrew. The cattell-horn-carroll theory of cognitive abilities: Past, present, and future. Contemporary Intellectual Assessment: Theories, Tests, and Issues, 01 2005
- [63]
-
[64]
May-Britt Moser, David C Rowland, and Edvard I Moser. Place cells, grid cells, and memory. Cold Spring Harbor perspectives in biology, 7(2):a021808, 2015
-
[65]
Shane Mueller, Matt Jones, Brandon Minnery, Ph Julia, and M Hiland. The bica cognitive decathlon: A test suite for biologically-inspired cognitive agents. Proceedings of the 16th Conference on Behavior Representation in Modeling and Simulation, 2007
-
[66]
A. Newell. You can't play 20 questions with nature and win: Projective comments on the papers of this symposium. 1973
-
[67]
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017
-
[68]
Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepezvari, Satinder Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019
-
[69]
A. E. Howe P. R. Cohen. How evaluation guides ai research: the message still counts more than the medium. AI Mag, page 35, 1988
-
[70]
Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Xiaodong Song. Assessing generalization in deep reinforcement learning. ArXiv, 2018
-
[71]
Diego Perez-Liebana, Katja Hofmann, Sharada Prasanna Mohanty, Noboru Sean Kuno, Andre Kramer, Sam Devlin, Raluca D. Gaina, and Daniel Ionita. The multi-agent reinforcement learning in malmö (marlö) competition. Technical report, 2019
-
[72]
Diego Perez-Liebana, Jialin Liu, Ahmed Khalifa, Raluca D Gaina, Julian Togelius, and Simon M Lucas. General video game ai: a multi-track framework for evaluating agents, games and content generation algorithms. arXiv preprint arXiv:1802.10363, 2018
-
[73]
Joelle Pineau. Reproducible, Reusable, and Robust Reinforcement Learning, 2018. Neural Information Processing Systems
-
[74]
S. Pinker. The blank slate: The modern denial of human nature. Viking, New York, 2002
-
[75]
David M. W. Powers. The total Turing test and the loebner prize. In New Methods in Language Processing and Computational Natural Language Learning, 1998
- [76]
- [77]
- [78]
-
[79]
D.E. Rumelhart and J.L. McClelland. Distributed memory and the representation of general and specific information. Journal of Experimental Psychology, pages 159–188, 1985
-
[80]
P. Sanghi and D. L. Dowe. A computer program capable of passing iq tests. pages 570–575, 2003