{"work":{"id":"d8980a59-aa48-447b-8852-b7aca2b41b2c","openalex_id":null,"doi":null,"arxiv_id":"1911.01547","raw_key":null,"title":"On the Measure of Intelligence","authors":null,"authors_text":"François Chollet","year":2019,"venue":"cs.AI","abstract":"To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to \"buy\" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. 
We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.","external_url":"https://arxiv.org/abs/1911.01547","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T05:20:24.967190+00:00","pith_arxiv_id":"1911.01547","created_at":"2026-05-09T06:05:36.786746+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":false,"display_title":"On the Measure of Intelligence","render_title":"On the Measure of Intelligence"},"hub":{"state":{"work_id":"d8980a59-aa48-447b-8852-b7aca2b41b2c","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":62,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2019-12-18T18:36:20+00:00","last_pith_cited_at":"2026-05-22T01:43:32+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-07T13:31:52.531516+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":14},{"context_role":"dataset","n":2}],"polarity_counts":[{"context_polarity":"background","n":13},{"context_polarity":"use_dataset","n":2},{"context_polarity":"unclear","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-15T04:57:41.361623+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":10},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"10 
Preprint","work_id":"2957f2ae-a92d-479e-a1ff-b0b522445d0b","shared_citers":7},{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","shared_citers":7},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":7},{"title":"Arc prize 2024: Technical report","work_id":"49bd7afe-e99e-4c34-8417-36562ab6cd8d","shared_citers":5},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":5},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":4},{"title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","work_id":"9e2a976b-f5ad-4aee-af5c-243fe0fe75d2","shared_citers":4},{"title":"Hierarchical reasoning model","work_id":"81fb438f-27d0-4a73-8fa3-5e65f127ca94","shared_citers":4},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":4},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":4},{"title":"2 OLMo 2 Furious","work_id":"9ef0dc2b-fdfe-4f14-b235-ef7556dc709a","shared_citers":3},{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":3},{"title":"arXiv preprint arXiv:2501.04519 (2025)","work_id":"49792b83-569e-4f5f-ae80-e96cbd3b7a43","shared_citers":3},{"title":"A Survey on In-context Learning","work_id":"864701ca-cb36-4a91-9be8-e2b9b20679aa","shared_citers":3},{"title":"A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How 
Well?","work_id":"d4eaadf8-a3c6-4eee-98fb-1b337dd42e2d","shared_citers":3},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":3},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":3},{"title":"Gemma 2: Improving Open Language Models at a Practical Size","work_id":"4dd94e2f-2b27-4cbf-88a0-4910f0772a57","shared_citers":3},{"title":"Gemma: Open Models Based on Gemini Research and Technology","work_id":"a9ea2870-df28-40b8-a9e0-a7e9a116f793","shared_citers":3},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":3}],"time_series":[{"n":2,"year":2024},{"n":6,"year":2025},{"n":30,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-15T04:57:39.361659+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-15T04:57:46.022572+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"On the Measure of Intelligence","claims":[{"claim_text":"To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. 
We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that h","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"depth transformers with this capability. These works have a similar aim to ours, enabling reasoning in latent space, but approach this goal from separate directions. For additional discussions related to the idea of constructing a prior that incentivizes reasoning and algorithm learning at the expense of memorization of simple patterns, we also refer to Chollet (2019), Schwarzschild (2023), Li et al. (2020b) and Moulton (2023). 9. Future Work Aside from work extending and analyzing the scali","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"These techniques can be categorized into two main types based on the source of feedback: process reward models (PRMs) and prompted LLMs. The performance comparison are mainly shown in Table 4. Process Feedback from Process Rewarded Model Recent studies highlight the significance of feedback in developing effective PRMs for complex reasoning tasks, particularly in a step-level view [134, 423, 528]. (1) Process Annotated PRM Training: Earlier, Lightman et al. [449] demonstrate that training proc","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks On the Measure of Intelligence because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (2 contexts).","role_counts":[{"n":2,"context_role":"background"}]},"error":null,"updated_at":"2026-05-15T04:57:39.370338+00:00"}},"summary":{"title":"On the Measure of Intelligence","claims":[{"claim_text":"To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that h","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"depth transformers with this capability. These works have a similar aim to ours, enabling reasoning in latent space, but approach this goal from separate directions. For additional discussions related to the idea of constructing a prior that incentivizes reasoning and algorithm learning at the expense of memorization of simple patterns, we also refer to Chollet (2019), Schwarzschild (2023), Li et al. (2020b) and Moulton (2023). 9. Future Work Aside from work extending and analyzing the scali","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"These techniques can be categorized into two main types based on the source of feedback: process reward models (PRMs) and prompted LLMs. The performance comparison are mainly shown in Table 4. Process Feedback from Process Rewarded Model Recent studies highlight the significance of feedback in developing effective PRMs for complex reasoning tasks, particularly in a step-level view [134, 423, 528]. 
(1) Process Annotated PRM Training: Earlier, Lightman et al. [449] demonstrate that training proc","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks On the Measure of Intelligence because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (2 contexts).","role_counts":[{"n":2,"context_role":"background"}]},"graph":{"co_cited":[{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":10},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"10 Preprint","work_id":"2957f2ae-a92d-479e-a1ff-b0b522445d0b","shared_citers":7},{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","shared_citers":7},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":7},{"title":"Arc prize 2024: Technical report","work_id":"49bd7afe-e99e-4c34-8417-36562ab6cd8d","shared_citers":5},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":5},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":4},{"title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","work_id":"9e2a976b-f5ad-4aee-af5c-243fe0fe75d2","shared_citers":4},{"title":"Hierarchical reasoning model","work_id":"81fb438f-27d0-4a73-8fa3-5e65f127ca94","shared_citers":4},{"title":"Kimi k1.5: Scaling Reinforcement 
Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":4},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":4},{"title":"2 OLMo 2 Furious","work_id":"9ef0dc2b-fdfe-4f14-b235-ef7556dc709a","shared_citers":3},{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":3},{"title":"arXiv preprint arXiv:2501.04519 (2025)","work_id":"49792b83-569e-4f5f-ae80-e96cbd3b7a43","shared_citers":3},{"title":"A Survey on In-context Learning","work_id":"864701ca-cb36-4a91-9be8-e2b9b20679aa","shared_citers":3},{"title":"A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?","work_id":"d4eaadf8-a3c6-4eee-98fb-1b337dd42e2d","shared_citers":3},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":3},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":3},{"title":"Gemma 2: Improving Open Language Models at a Practical Size","work_id":"4dd94e2f-2b27-4cbf-88a0-4910f0772a57","shared_citers":3},{"title":"Gemma: Open Models Based on Gemini Research and Technology","work_id":"a9ea2870-df28-40b8-a9e0-a7e9a116f793","shared_citers":3},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":3}],"time_series":[{"n":2,"year":2024},{"n":6,"year":2025},{"n":30,"year":2026}],"dependency_candidates":[]},"authors":[]}}