Uneven Evolution of Cognition Across Generations of Generative AI Models

Daniel McDuff; Isaac Galatzer-Levy; Jed McGiffin; Xin Liu

arxiv: 2605.06815 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.CV

Isaac Galatzer-Levy, Daniel McDuff, Xin Liu, Jed McGiffin

Pith reviewed 2026-05-11 00:54 UTC · model grok-4.3

keywords generative AI · cognitive assessment · AIQ benchmark · uneven evolution · multimodal models · perceptual reasoning · verbal comprehension · artificial general intelligence

The pith

Generative AI models display uneven cognitive evolution, advancing verbal skills far faster than visual reasoning across generations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies a psychometric framework to track how generative AI models perform on tasks adapted from human intelligence tests and a new benchmark extending beyond human norms. It reports near-ceiling results in verbal comprehension and working memory contrasted with near-floor results in perceptual reasoning, plus faster gains in language-based abstract reasoning than in visually presented equivalents. Tracking six generations in two model families shows these imbalances persist despite overall progress. A sympathetic reader would care because the pattern implies that simply scaling models and optimizing training may leave fundamental gaps in balanced, human-like capabilities.

Core claim

The cognitive abilities of generative models evolve unevenly. Tasks adapted from the Wechsler Adult Intelligence Scale show near-ceiling performance in verbal comprehension and working memory contrasted with near-floor performance in perceptual reasoning. The Artificial Intelligence Quotient benchmark applied to six generations and two model families reveals significant but asymmetric gains, with abstract quantitative reasoning maturing faster when presented linguistically than in a visually analogous format. Visual-perceptual organization remains largely stagnant, indicating an architectural bias toward language-based symbolic manipulation that scaling and optimization alone appear unable t

What carries the argument

The psychometric framework of adapted Wechsler Adult Intelligence Scale tasks combined with the new Artificial Intelligence Quotient benchmark that measures performance trajectories beyond human-normed limits.

If this is right

Verbal comprehension and working memory stay at or above the 98th percentile while perceptual reasoning stays below the 1st percentile.
Abstract quantitative reasoning improves more rapidly in linguistic format than in visually presented format.
Visual-perceptual organization shows little or no improvement over multiple generations and model families.
The dissociation between modalities points to an architectural preference for symbolic language manipulation over integrated visual reasoning.
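One way to quantify "improves more rapidly" is to compare per-generation trend slopes for the two formats. The sketch below is only an illustration under stated assumptions: `ols_slope` and `slope_gap` are hypothetical helper names, the accuracy values are made-up placeholders rather than results from the paper, and a plain least-squares slope stands in for whatever model the authors actually fitted.

```python
from typing import Sequence


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least-squares slope of ys regressed on xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_gap(generations: Sequence[float],
              linguistic: Sequence[float],
              visual: Sequence[float]) -> float:
    """Linguistic slope minus visual slope; positive means language gains faster."""
    return ols_slope(generations, linguistic) - ols_slope(generations, visual)


# Hypothetical placeholder accuracies for six generations (NOT data from the paper).
generations = [1, 2, 3, 4, 5, 6]
linguistic = [0.30, 0.42, 0.55, 0.68, 0.80, 0.90]
visual = [0.20, 0.22, 0.25, 0.27, 0.30, 0.33]

gap = slope_gap(generations, linguistic, visual)
print(f"slope gap (linguistic - visual): {gap:.3f} accuracy points per generation")
```

A positive gap is only descriptive. Deciding whether it is statistically meaningful would need per-item data and a model that accounts for repeated items and model families.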

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed language-over-vision bias may trace to the sequential processing emphasis in current transformer designs.
Achieving more balanced intelligence could require training regimes or architectures that enforce symmetry across input modalities rather than relying on scale alone.
A direct test would compare performance on identical abstract problems presented as pure images versus textual descriptions of those images.
The pattern raises the question of whether general intelligence requires explicit mechanisms for cross-modal integration that current scaling paths do not supply.
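The direct test suggested above (identical problems shown once as an image and once as text) yields paired right/wrong outcomes per item, which an exact McNemar test can compare. This is a minimal sketch, not a procedure from the paper: the function name `mcnemar_exact` and the example outcomes are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def mcnemar_exact(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired (text_correct, image_correct) outcomes.

    Returns (b, c, p_value), where b counts items solved only in text form
    and c counts items solved only in image form.
    """
    b = c = 0
    for text_ok, image_ok in pairs:
        if text_ok and not image_ok:
            b += 1
        elif image_ok and not text_ok:
            c += 1
    n = b + c
    if n == 0:
        # No discordant items: the data carry no evidence of a modality difference.
        return b, c, 1.0
    # Under the null, each discordant item falls on either side with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes: 8 items solved only as text, 1 only as image, 5 in both.
example = [(True, False)] * 8 + [(False, True)] + [(True, True)] * 5
print(mcnemar_exact(example))
```

Only the discordant items drive the test, so a benchmark in which almost every item is solved (or failed) in both formats has little power to detect a modality gap.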

Load-bearing premise

That tasks adapted from human intelligence scales measure equivalent cognitive constructs in generative AI models as they do in people and that the AIQ benchmark extends them without introducing new biases.

What would settle it

A subsequent model generation that achieves balanced high-percentile performance across both linguistic and visual-perceptual tasks without architectural redesign would falsify the claim of fundamental limitations.

Original abstract

The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (>$98^{\text{th}}$ percentile) contrasted with near-floor performance in perceptual reasoning (<$1^{\text{st}}$ percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tracks uneven verbal vs perceptual gains across AI generations with adapted tests and a new benchmark, but the architectural-limit claim rests on unvalidated task equivalence.

The punchline is that successive multimodal models keep showing near-ceiling verbal and working-memory scores alongside near-floor perceptual-reasoning scores, with faster linguistic than visual quantitative gains, and the authors introduce an AIQ benchmark to measure past human norms. That generational pattern and the modality split are the concrete observations here. The work does a decent job of applying the same adapted Wechsler subtests to six generations and two model families, which gives a simple longitudinal view that prior single-snapshot evaluations lacked. The AIQ extension is a straightforward attempt to keep scoring when models exceed human ceilings. Those pieces are useful raw data for anyone building evaluation suites. The soft spot is exactly the one the stress-test flags: the paper never demonstrates that the adapted tasks engage comparable processes in these models as they do in humans. No ablations of vision encoders, no checks against training-data imbalance, and no non-generative baselines appear, so the dissociation could easily trace to tokenization choices or data skew rather than an intrinsic limit scaling cannot touch. The conclusion that scaling and optimization alone are insufficient therefore does not follow from the reported numbers. This paper is for researchers who run capability benchmarks and want to see how one psychometric framing plays out over time. Readers already skeptical of direct IQ-test transfer to AI will find the results suggestive but not decisive. It deserves peer review because the tracking exercise is reproducible enough to be worth referee scrutiny, even though the methods section will need substantial strengthening on construct validity before the architectural claim can stand.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a psychometric framework for evaluating the cognitive capabilities of generative AI models. It adapts tasks from the Wechsler Adult Intelligence Scale (WAIS) to assess leading multimodal models, reporting near-ceiling performance in verbal comprehension and working memory contrasted with near-floor performance in perceptual reasoning. To extend beyond human norms, the authors develop the Artificial Intelligence Quotient (AIQ) Benchmark and apply it across six generations of two model families, documenting asymmetric gains and a dissociation between linguistic and visual quantitative reasoning. The central claim is that these patterns indicate fundamental architectural limitations that scaling and optimization alone cannot overcome in pursuit of balanced, human-like general intelligence.

Significance. If the adapted WAIS tasks and new AIQ benchmark are shown to measure comparable cognitive constructs in generative models, the work offers a novel longitudinal perspective on cognitive evolution in AI. The tracking of performance across model generations and the identification of modality-specific asymmetries provide empirical grounding for debates on whether current architectures can achieve general intelligence through scaling. The introduction of AIQ as a benchmark extending beyond human norms is a potential contribution to the field of AI evaluation, provided its validity is established.

major comments (3)

[Abstract] Abstract: the inference that observed performance gaps (near-ceiling verbal/working memory vs. near-floor perceptual reasoning) demonstrate 'fundamental architectural limitations' that scaling cannot address is not supported by the reported evidence. No controls are described for construct equivalence between human-normed WAIS subtests and AI models, such as ablation of linguistic vs. visual pathways or comparison to non-generative baselines.
[AIQ Benchmark description] The description of the AIQ benchmark: without reported details on its item construction, statistical validation, sample sizes, or demonstration that low perceptual scores arise from mechanisms analogous to those in humans, the claim of asymmetric evolution across generations cannot be confidently distinguished from training data imbalances, tokenization effects, or task artifacts.
[Results on modality dissociation] The dissociation between linguistic and visual quantitative reasoning: the paper attributes faster maturation in the linguistic format to an 'architectural bias towards language-based symbolic manipulation,' but this conclusion requires evidence that the visual format engages equivalent processes; absent such validation (e.g., no cross-modal ablation or human-AI process comparison), the architectural-limitation interpretation does not follow.

minor comments (2)

[Abstract and Results] The percentile rankings (e.g., >98th and <1st) are presented without accompanying confidence intervals, details on the human normative sample, or statistical tests for the significance of generational changes.
[Methods] The manuscript would benefit from explicit discussion of how the AIQ benchmark avoids circularity with existing LLM evaluation suites and from clearer notation distinguishing raw scores from normed percentiles.
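To illustrate the distinction between raw scores and normed percentiles raised in the minor comments, the sketch below converts between a standard score and a percentile, and attaches a Wilson interval to a raw proportion correct. It assumes composite scores normed to mean 100 and SD 15 under a normal distribution, which is the usual Wechsler convention; whether the paper derives its percentiles this way is an assumption, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple

# Assumption: index scores are normed to mean 100, SD 15 in the human sample.
NORM = NormalDist(mu=100, sigma=15)


def percentile_from_index(score: float) -> float:
    """Percentile rank (0-100) of a standard score under the assumed norm."""
    return 100 * NORM.cdf(score)


def index_from_percentile(pct: float) -> float:
    """Standard score at a given percentile (0 < pct < 100)."""
    return NORM.inv_cdf(pct / 100)


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a raw proportion correct (z=1.96 gives about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


print(round(index_from_percentile(98), 1))  # score at the 98th percentile
print(round(index_from_percentile(1), 1))   # score at the 1st percentile
print(wilson_interval(5, 10))               # interval for 5 of 10 items correct
```

Under this assumed norm, the 98th and 1st percentiles sit roughly two standard deviations above and below the human mean, so small raw-score differences near either tail can move the percentile very little.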

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has prompted us to strengthen the manuscript by clarifying interpretive claims and expanding methodological details. We have revised the abstract and relevant sections to avoid overstatement while preserving the core empirical contributions. Our point-by-point responses follow.

Referee: [Abstract] Abstract: the inference that observed performance gaps (near-ceiling verbal/working memory vs. near-floor perceptual reasoning) demonstrate 'fundamental architectural limitations' that scaling cannot address is not supported by the reported evidence. No controls are described for construct equivalence between human-normed WAIS subtests and AI models, such as ablation of linguistic vs. visual pathways or comparison to non-generative baselines.

Authors: We agree that the original abstract phrasing presented an interpretive leap from observed patterns to claims of fundamental limitations without direct causal evidence such as ablations or non-generative baselines. We have revised the abstract to describe the results as demonstrating uneven cognitive evolution across modalities and to state that the patterns suggest scaling and optimization alone may be insufficient for balanced intelligence, while explicitly noting that stronger causal claims would require targeted validation of construct equivalence. The longitudinal tracking across generations provides correlational grounding for this view but does not substitute for experimental controls.

revision: yes

Referee: [AIQ Benchmark description] The description of the AIQ benchmark: without reported details on its item construction, statistical validation, sample sizes, or demonstration that low perceptual scores arise from mechanisms analogous to those in humans, the claim of asymmetric evolution across generations cannot be confidently distinguished from training data imbalances, tokenization effects, or task artifacts.

Authors: We have added a dedicated subsection in the Methods detailing AIQ item construction (including adaptation procedures and generation protocols), statistical validation (reliability coefficients and factor analyses), and exact sample sizes per model and generation. In the Discussion we now explicitly consider alternative explanations such as training data imbalances and tokenization effects, arguing that the persistence of modality asymmetries across two independent model families and multiple generations reduces the likelihood that artifacts alone account for the results, though we acknowledge this remains inferential.

revision: yes

Referee: [Results on modality dissociation] The dissociation between linguistic and visual quantitative reasoning: the paper attributes faster maturation in the linguistic format to an 'architectural bias towards language-based symbolic manipulation,' but this conclusion requires evidence that the visual format engages equivalent processes; absent such validation (e.g., no cross-modal ablation or human-AI process comparison), the architectural-limitation interpretation does not follow.

Authors: We have revised the language in the Results and Discussion to describe the dissociation as 'consistent with an architectural bias toward language-based symbolic manipulation' rather than a direct demonstration. We have added an explicit limitations paragraph acknowledging the lack of cross-modal ablations or human-AI process-tracing comparisons in the present study. The parallel task formats and replication across model families provide initial empirical support for the interpretation, but we agree that definitive mechanistic evidence would require the additional validations noted by the referee.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from benchmark testing with no reductive derivations.

The paper reports direct empirical measurements of model performance on WAIS-adapted subtests and the newly introduced AIQ benchmark across model generations and families. No mathematical derivations, parameter fittings, or first-principles predictions appear in the text; the observed dissociations (e.g., verbal vs. perceptual scores, linguistic vs. visual quantitative reasoning) are presented as raw test outcomes rather than quantities derived from or equivalent to the input data by construction. The interpretive claim that scaling alone may be insufficient follows as an inference from those independent observations, not as a tautology or self-citation chain. The methodology is self-contained against external human-normed benchmarks without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the validity of human IQ test adaptations for AI and the new benchmark as an unbiased extension; no free parameters are explicitly fitted in the abstract.

axioms (1)

domain assumption: Adapted WAIS tasks measure the same underlying cognitive abilities in AI models as in humans.
Invoked when comparing model performance to human percentiles.

invented entities (1)

Artificial Intelligence Quotient (AIQ) Benchmark · no independent evidence
purpose: Track cognitive development in AI beyond human-normed limits.
Newly introduced to evaluate generational changes.

pith-pipeline@v0.9.0 · 5526 in / 1275 out tokens · 62051 ms · 2026-05-11T00:54:01.926360+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "near-ceiling performance in verbal comprehension and working memory (>98th percentile) contrasted with near-floor performance in perceptual reasoning (<1st percentile)... sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "GLMM... significant but asymmetric performance gains... scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] 636,120 ways to have posttraumatic stress disorder. Perspectives on Psychological Science, 2013.
[2] Vision-Language Models Do Not Understand Negation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv preprint arXiv:2501.09425.
[3] Brysbaert, Marc and New, Boris. Moving beyond. 2009.
[4] Artificial general intelligence: Concept, state of the art, and future prospects. Journal of Artificial General Intelligence.
[5] Building machines that learn and think like people. Behavioral and Brain Sciences, 2017.
[6] A collection of definitions of intelligence. In Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms, 2007.
[7] Wechsler Adult Intelligence Scale, Fourth Edition (WAIS-IV). 2008.
[8] How to build a cognitive map. Nature Neuroscience, 2022.
[9] Evaluating Spatial Understanding of Large Language Models. Transactions on Machine Learning Research.
[10] Burnell, Ryan and Schulz, Julian and Schellaert, Wout and Hern. Measuring Progress Toward. arXiv preprint arXiv:2311.02521.
[11] Adobe Photoshop.
[12] Imagen 3. arXiv preprint arXiv:2408.07009, 2024.
[13] On the Measure of Intelligence. 2019.
[14] Assessment practices of professional psychologists: Results of a national survey. Professional Psychology: Research and Practice, 2017.
