pith. machine review for the scientific record.

arxiv: 2604.09911 · v1 · submitted 2026-04-10 · 🧬 q-bio.NC · cs.AI

Recognition: unknown

The Rise and Fall of G in AGI

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3

classification 🧬 q-bio.NC · cs.AI
keywords general intelligence · Spearman's g · positive manifold · LLM benchmarks · principal component analysis · specialization · AGI · psychometrics

The pith

AI models exhibit a general intelligence factor that rises across early benchmarks then falls as specialized reasoning abilities appear.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats LLM benchmark scores as cognitive test data and applies principal component analysis to a matrix of models over time to extract an equivalent of Spearman's g-factor. It finds that nearly all pairwise correlations among benchmarks are positive, forming a positive manifold, with the first principal component initially explaining around 90 percent of the variance in core sets. This dominance peaks around 2023-2024 before dropping sharply once reasoning-specialized models arrive, coinciding with models outsourcing certain tasks to tools. The result is framed as general intelligence suppressing specialized intelligences in AI, producing a pattern the author calls a Ptolemaic succession of increasingly complex architectures rather than parsimonious ones. A reader would care because this reframes AGI progress as a temporary unification of abilities that later gives way to diversification.
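
To make that pipeline concrete, here is a minimal sketch of the correlation-based PCA it describes, run on synthetic scores; the paper's actual preprocessing, benchmark selection, and missing-data handling are not specified here and are assumed away.

```python
# Minimal sketch of the core computation: treat benchmark scores as test scores,
# models as subjects, and read off the variance share of PC1 ("G").
# Data are synthetic; the paper's preprocessing is assumed, not known.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_benchmarks = 19, 5          # sizes echo the 5-benchmark core battery

# Synthetic scores with a shared "ability" factor plus benchmark-specific noise.
ability = rng.normal(size=(n_models, 1))
scores = 50 + 15 * ability + 5 * rng.normal(size=(n_models, n_benchmarks))

# Correlation-based PCA: standardize columns, eigendecompose the correlation matrix.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
corr = np.corrcoef(z, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]

rho1 = eigvals[0] / eigvals.sum()       # variance share of PC1, the "G-factor"
print(f"PC1 explains {rho1:.1%} of variance")
```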

Core claim

By constructing a models-by-benchmarks-by-time matrix for 39 models from 2019 to 2025 across 14 benchmarks, the analysis confirms a strong positive manifold, with all 28 pairwise correlations positive on an 8-benchmark subset. Principal component analysis on a 5-benchmark core battery shows PC1 explaining 90 percent of variance, falling to 77 percent by 2024; on a 4-benchmark battery PC1 peaks at 92 percent in 2023-2024 and drops to 64 percent with the arrival of reasoning-specialized models. Partial correlation matrices reveal increasing specialization beneath the manifold, supporting the claim that, in psychometric terms, AI models display general intelligence that suppresses specialized intelligences.
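
A hedged sketch of the positive-manifold check on an 8-benchmark battery (28 pairs), using pairwise-complete correlations as the figure captions describe; the scores and missingness pattern below are synthetic, not the paper's data.

```python
# Sketch of the positive-manifold check: pairwise Pearson correlations computed on
# pairwise-complete observations, then a count of positive off-diagonal entries.
import numpy as np
from itertools import combinations

def pairwise_complete_corr(scores, min_n=5):
    """Correlation matrix using only rows where both benchmarks were evaluated."""
    k = scores.shape[1]
    corr = np.full((k, k), np.nan)
    np.fill_diagonal(corr, 1.0)
    for i, j in combinations(range(k), 2):
        mask = ~np.isnan(scores[:, i]) & ~np.isnan(scores[:, j])
        if mask.sum() >= min_n:
            corr[i, j] = corr[j, i] = np.corrcoef(scores[mask, i], scores[mask, j])[0, 1]
    return corr

rng = np.random.default_rng(1)
g = rng.normal(size=(39, 1))
scores = 50 + 12 * g + 6 * rng.normal(size=(39, 8))        # 39 models x 8 benchmarks
scores[rng.random(scores.shape) < 0.2] = np.nan            # missing evaluations

corr = pairwise_complete_corr(scores)
off_diag = corr[np.triu_indices_from(corr, k=1)]
print(f"{np.sum(off_diag > 0)} of {np.sum(~np.isnan(off_diag))} pairwise correlations positive")
```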

What carries the argument

The G-factor, obtained as the first principal component of the time-evolving benchmark correlation matrix, which quantifies the positive manifold and tracks its rotation toward specialization.
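
The expanding-window version of that computation might look like the following sketch; the release order, scores, and the point at which one benchmark decouples are synthetic stand-ins, so it illustrates the machinery rather than the paper's numbers.

```python
# Sketch of the expanding-window analysis behind the rise-and-fall claim: models
# enter in release order and rho_1 (PC1 variance share) is recomputed at each step.
import numpy as np

def pc1_share(scores):
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)
    return eigvals.max() / eigvals.sum()

rng = np.random.default_rng(2)
n_models, n_bench = 30, 4
g = np.linspace(-1.0, 1.0, n_models)[:, None]             # capability rising with release order
scores = 50 + 15 * g + 4 * rng.normal(size=(n_models, n_bench))
late = np.arange(n_models) >= 22                          # stand-in for the reasoning-specialized era
scores[late, -1] += 20 * rng.random(late.sum())           # one benchmark decouples in late models

for t in range(5, n_models + 1):                          # models enter in chronological order
    print(f"window of first {t:2d} models: rho1 = {pc1_share(scores[:t]):.2f}")
```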

If this is right

  • AI progress initially unifies performance across diverse benchmarks under a dominant general factor.
  • Specialized intelligences emerge and reduce the explanatory power of the general factor once reasoning tools are integrated.
  • Current LLM architectures follow a pattern of increasing hierarchical complexity instead of replacing complex mechanisms with simpler ones.
  • The positive manifold of general intelligence encompasses multiple high-dimensional problem-solving systems that later differentiate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark design may need to evolve to isolate the contribution of tool use versus intrinsic model capability.
  • The observed pattern suggests that continued scaling alone will not sustain the same level of cross-benchmark generality.
  • This dynamic could guide the creation of hybrid systems that deliberately preserve specialized modules alongside general ones.

Load-bearing premise

Benchmark performance scores can be treated as equivalent to human cognitive test scores, allowing direct psychometric factor analysis on model releases as subjects.

What would settle it

A failure to observe a decline in the variance explained by the first principal component after 2024, or the appearance of negative pairwise benchmark correlations in later model cohorts, would falsify the falling-G claim.
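
As a sketch, both falsification checks reduce to a few lines. The column names, the DataFrame layout, and the 2024-09 cutoff (taken loosely from the paper's Epoch IV boundary) are illustrative assumptions, not the paper's protocol.

```python
# Sketch of the two falsification checks: (i) did the PC1 variance share decline
# after the cutoff, and (ii) do any pairwise correlations turn negative in the
# later cohort? Assumes a DataFrame with a release_date column plus benchmark columns.
import numpy as np
import pandas as pd

def pc1_share(bench_scores: pd.DataFrame) -> float:
    corr = bench_scores.corr().to_numpy()
    eigvals = np.linalg.eigvalsh(corr)
    return eigvals.max() / eigvals.sum()

def falsification_report(scores: pd.DataFrame, cutoff: str = "2024-09-01") -> dict:
    """Split model cohorts at `cutoff` and compute the two diagnostics named above."""
    bench_cols = scores.columns.drop("release_date")
    early = scores.loc[scores["release_date"] < cutoff, bench_cols]
    late = scores.loc[scores["release_date"] >= cutoff, bench_cols]

    late_corr = late.corr().to_numpy()
    negative_pairs = int((late_corr[np.triu_indices_from(late_corr, k=1)] < 0).sum())

    return {
        "rho1_early": pc1_share(early),
        "rho1_late": pc1_share(late),
        "pc1_declined": pc1_share(late) < pc1_share(early),
        "negative_pairs_late": negative_pairs,
    }
```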

Figures

Figures reproduced from arXiv: 2604.09911 by David C. Krakauer.

Figure 1: The score matrix X. A representative subset of 21 models (rows, ordered by release date) and 14 benchmarks (columns, grouped by cognitive domain). Cell color encodes score intensity (0–100%); dashes mark missing evaluations. The matrix exhibits two key structural features: (i) a gradient from low scores (upper-left) to high scores (lower-right), reflecting the correlated improvement that the PCA must dis… view at source ↗
Figure 2: The phenomenon to be dissected: benchmark performance rising across all tasks simultaneously. Bold markers show the running-maximum score (frontier envelope) for each benchmark; faded markers show all individual model scores. Curves are the best-fit growth model selected by AIC… view at source ↗
Figure 3: The positive manifold in LLM benchmarks. Pairwise Pearson correlations across six major benchmarks, computed using pairwise-complete observations (minimum n = 5 pairs per cell). All 15 off-diagonal entries are positive, satisfying the Spearman criterion for a general factor. Note the near-zero correlation between HumanEval and MMLU-Pro (r = 0.04), suggesting that code generation and hard general knowledge … view at source ↗
Figure 4: Factor loading plot for the 5-benchmark core battery. Arrows show each benchmark's loading on PC1 (G-factor, 90% variance) and PC2 (7% variance). All benchmarks load positively on PC1, confirming a general factor. PC2 separates an execution/fluency pole (GSM8K, HumanEval—positive PC2) from a reasoning pole (MATH, GPQA—negative PC2). MMLU is near the origin on PC2, contributing primarily to G rather than t… view at source ↗
Figure 5: Normalized G: LLM general intelligence factor over time. Each point represents a model's projection onto PC1 of the 5-benchmark battery (MMLU, GSM8K, MATH, HumanEval, GPQA Diamond), rescaled to a 0–100 range where 0 is the lowest-scoring model (Llama 2 70B Chat) and 100 is the highest (o1-preview). PC1 captures 90% of total variance across the battery. The 19 models with complete data on all five benchmark… view at source ↗
Figure 6: Scree plots across algorithmic epochs. Eigenvalue decomposition of the 4-benchmark battery (MMLU, GSM8K, MATH, HumanEval) computed within each epoch. The red dashed line marks the Kaiser criterion (λ = 1). During Epoch II (2023–2024.03), a single dominant factor captures 92% of variance and no second eigenvalue approaches 1.0. In Epoch IV (2024.09+), a second eigenvalue (1.88) increases alongside the creat… view at source ↗
Figure 7: Expanding-window tracking changing G. Models are added in chronological order of release date; at each step, PCA is computed on the 4-benchmark battery. Top: ρ1, the fraction of total variance captured by PC1 (G). The 90% threshold (red dotted line) is exceeded throughout, with a peak at 95.5% around the Claude 3 Opus release (early 2024). Bottom: All four normalized variance fractions ρ1, ρ2, ρ3, ρ4 plott… view at source ↗
Figure 8: Effective dimensionality of the LLM benchmark space through time. Expanding-window analysis on two benchmark batteries. Top: Number of principal components required for 99% cumulative variance. On the 4-benchmark battery (blue circles), dimensionality stabilizes at 3 of 4—near-maximal compression. On the 5-benchmark battery (purple squares, adding GPQA Diamond), dimensionality rises from 3 to 5 as post-20… view at source ↗
Figure 9: Eigenvalue spectrum through time: cumulative variance partitioned by principal component. Each colored band shows one component's marginal contribution to cumulative variance in the expanding-window PCA; the top of each band is the cumulative variance through that component. (a) 4-benchmark battery: PC1 (G, blue) accounts for 92–95% throughout; PC1+PC2 exceed 97% at every time point. The spectrum is eff… view at source ↗
Figure 10: Test (i): CUSUM change-point analysis on eigenvalue diagnostics. (a) Variance explained by PC1 (ρ1) in the expanding-window 4-benchmark PCA, showing a peak around the Claude 3 Opus/Gemini Ultra releases (early 2024) followed by decline. (b) CUSUM statistic on ρ1, with maximum absolute deviation marked; p = 0.004 by permutation test, confirming a significant structural break. (c) Dominance ratio δ = λ1/λ2,… view at source ↗
Figure 11: Test (ii): Eigenvector alignment. (a) Angular displacement θ between consecutive first eigenvectors as each model enters the expanding window. Blue bars: 4-benchmark battery (max θ = 0.57, near-perfect stability). Purple bars: 5-benchmark battery (max θ = 6.4 at DeepSeek V3 entry). The 5-benchmark battery reveals rotations invisible to the 4-benchmark analysis because GPQA Diamond introduces a dimension … view at source ↗
Figure 12: Partial correlation structure after removing G. (a) Raw correlation matrix (5-benchmark battery, n = 19 models with complete data): all 10 off-diagonal correlations positive. (b) Partial correlation matrix after projecting out PC1: 7 of 10 correlations are now negative, with mean r_resid = −0.24. The positive manifold is entirely attributable to G. (c) Revealed group factor structure derived from the resi… view at source ↗
Figure 13: Epoch-specific partial correlations. Partial correlation matrices after removing each epoch's own PC1 from the 4-benchmark battery. (a) Epoch II (2023.03–2024.03, n = 8, ρ1 = 92%), (b) Epoch III (2024.04–2024.09, n = 7, ρ1 = 80%). Additional panels show the 5-benchmark and global analyses. Of 6 benchmark pairs in the 4-benchmark analysis, 4 maintain the same sign across Epochs II and III; only the two wea… view at source ↗
Figure 14: Trajectory through G-space. Four points are plotted in a space defined by mean benchmark performance, ρ1 (variance explained by G), and effective dimensionality d_eff. Point 1: Epoch II (within-epoch, 5-benchmark battery). Point 2: Epoch III (within-epoch). Point 3: all models, raw PCA. Point 4: all models, detrended PCA. Solid arrow: temporal trajectory (1→2). Dashed purple arrow: detrending correction (3… view at source ↗
read the original abstract

In the psychological literature the term `general intelligence' describes correlations between abilities and not simply the number of abilities. This paper connects Spearman's $g$-factor from psychometrics, measuring a positive manifold, to the implicit ``$G$-factor'' in claims about artificial general intelligence (AGI) performance on temporally structured benchmarks. By treating LLM benchmark batteries as cognitive test batteries and model releases as subjects, principal component analysis is applied to a models $\times$ benchmarks $\times$ time matrix spanning 39 models (2019--2025) and 14 benchmarks. Preliminary results confirm a strong positive manifold in which all 28 pairwise correlations are positive across 8 benchmarks. By analyzing the spectrum of the benchmark correlation matrix through time, PC1 explains 90\% of variance on a 5-benchmark core battery ($n=19$), reducing to 77\% by 2024. On a four-benchmark battery, PC1 is found to peak at 92\% of the variance between 2023--2024 and reduce to 64\% with the arrival of reasoning-specialized models in 2024. This is coincident with a rotation in the G-factor as models outsource `reasoning' to tools. The analysis of partial correlation matrices through time provides evidence for the evolution of specialization beneath the positive manifold of general intelligence (AI-hedgehog) encompassing diverse high dimensional problem solving systems (AI-foxes). In strictly psychometric terms, AI models exhibit general intelligence suppressing specialized intelligences. LLMs invert the ideal of substituting complicated models with parsimonious mechanisms, a `Ptolemaic Succession' of theories, with architectures of increasing hierarchical complication and capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that principal component analysis applied to correlations among LLM benchmark scores across 39 models from 2019 to 2025 reveals a strong positive manifold, with the first principal component (interpreted as an AI G-factor analogous to Spearman's g) explaining a high proportion of variance (90% on the 5-benchmark battery, 92% on the 4-benchmark battery) that subsequently declines (to 77% and 64% respectively) as specialized reasoning models emerge in 2024. This decline, analyzed through time-varying correlation matrices, is interpreted as evidence that general intelligence suppresses specialized intelligences in AI systems, inverting traditional parsimony ideals in a 'Ptolemaic Succession' of increasingly complex architectures.

Significance. If validated with appropriate controls for confounding factors like model scale, this analysis provides a valuable empirical framework for applying psychometric methods to track the evolution of AI capabilities. It offers quantitative support for the idea that AI progress involves not just increasing general ability but also diversification and specialization, with potential implications for AGI development and evaluation. The temporal dimension and use of multiple benchmark subsets are strengths that allow observation of dynamic changes in the correlation structure.

major comments (3)
  1. [PCA results for 5-benchmark battery] In the section describing the PCA on the 5-benchmark core battery (n=19 models), the reported decline in PC1 variance from 90% to 77% is presented without any residualization of benchmark scores against model parameter count, training FLOPs, or release date prior to computing the correlation matrix. This is load-bearing for the central claim, as the positive manifold and its erosion could be artifacts of uncontrolled scaling rather than evidence for a latent g-factor independent of scale. (A minimal sketch of such a control appears after these major comments.)
  2. [Four benchmark battery analysis] In the section on the four-benchmark battery analysis, the peak at 92% variance explained (2023-2024) and subsequent drop to 64% with reasoning-specialized models lacks a control PCA or partial correlation analysis that holds scale or compute fixed. Without this, the rotation in the G-factor and attribution to tool-use specialization cannot be distinguished from effects of benchmark diversification or increasing model heterogeneity.
  3. [Interpretation of partial correlation matrices] In the interpretation of partial correlation matrices through time, the conclusion that AI models exhibit 'general intelligence suppressing specialized intelligences' assumes the observed correlation structure is independent of the dominant scaling trend, but no alternative model (e.g., PCA on residuals after regressing out parameter count) or falsification test is reported. This directly affects the 'Ptolemaic Succession' framing.
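
A minimal sketch of the scale control requested in major comment 1, assuming log parameter count and release date as covariates; this is the referee's suggestion rendered in code, not the paper's method, and the data are synthetic.

```python
# Residualize each benchmark score against log(parameter count) and release date,
# then rerun the correlation-based PCA on the residuals and compare PC1 shares.
import numpy as np

def residualize(scores, covariates):
    """OLS-residualize each column of `scores` against `covariates` (with intercept)."""
    X = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return scores - X @ beta

def pc1_share(matrix):
    corr = np.corrcoef(matrix, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)
    return eigvals.max() / eigvals.sum()

rng = np.random.default_rng(3)
n = 19
log_params = rng.uniform(np.log(7e9), np.log(2e12), size=n)   # hypothetical model scales
release_day = rng.uniform(0, 2000, size=n)                    # hypothetical release offsets
scores = 30 + 8 * (log_params[:, None] - 23) + 5 * rng.normal(size=(n, 5))

raw_rho1 = pc1_share(scores)
ctrl_rho1 = pc1_share(residualize(scores, np.column_stack([log_params, release_day])))
print(f"rho1 raw = {raw_rho1:.2f}, after scale control = {ctrl_rho1:.2f}")
```
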
minor comments (3)
  1. [Abstract] The abstract introduces terms such as 'AI-hedgehog' and 'AI-foxes' without definition or reference to the underlying analogy, which reduces clarity for readers outside the immediate subfield.
  2. [Data and methods] The data description lacks explicit details on preprocessing steps, including handling of missing benchmark values, score normalization across heterogeneous benchmarks, and criteria for model and benchmark inclusion.
  3. [Results] The reported variance percentages for PC1 lack accompanying statistical significance tests, bootstrap confidence intervals, or sensitivity analyses across different model subsets. (A minimal sketch of such a bootstrap follows this list.)
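
A minimal sketch of the bootstrap interval requested in minor comment 3, resampling models with replacement; the resampling scheme and synthetic scores are assumptions for illustration, not the paper's procedure.

```python
# Bootstrap a confidence interval for the PC1 variance share by resampling models.
import numpy as np

def pc1_share(scores):
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)
    return eigvals.max() / eigvals.sum()

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    stats = [pc1_share(scores[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Demo on synthetic data with a single shared factor.
rng = np.random.default_rng(5)
scores = 50 + 15 * rng.normal(size=(19, 1)) + 5 * rng.normal(size=(19, 5))
low, high = bootstrap_ci(scores)
print(f"PC1 share 95% CI: [{low:.2f}, {high:.2f}]")
```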

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key methodological gaps in controlling for scaling effects. We address each major comment below and will incorporate additional analyses in the revised manuscript to strengthen the claims.

read point-by-point responses
  1. Referee: [PCA results for 5-benchmark battery] In the section describing the PCA on the 5-benchmark core battery (n=19 models), the reported decline in PC1 variance from 90% to 77% is presented without any residualization of benchmark scores against model parameter count, training FLOPs, or release date prior to computing the correlation matrix. This is load-bearing for the central claim, as the positive manifold and its erosion could be artifacts of uncontrolled scaling rather than evidence for a latent g-factor independent of scale.

    Authors: We agree that residualization against scale metrics is necessary to isolate any latent g-factor from scaling trends. The temporal decline coincides with the 2024 emergence of specialized models, but this does not fully rule out scale confounds. In revision, we will regress benchmark scores on log(parameter count) and release date, then recompute the correlation matrix and PCA on the residuals. Results will be reported alongside the original analyses. revision: yes

  2. Referee: [Four benchmark battery analysis] In the section on the four-benchmark battery analysis, the peak at 92% variance explained (2023-2024) and subsequent drop to 64% with reasoning-specialized models lacks a control PCA or partial correlation analysis that holds scale or compute fixed. Without this, the rotation in the G-factor and attribution to tool-use specialization cannot be distinguished from effects of benchmark diversification or increasing model heterogeneity.

    Authors: We acknowledge this limitation. To address it, the revised manuscript will include a control PCA restricted to models within a narrow parameter-count range (e.g., 10B-100B) and partial correlation analyses controlling for log(parameter count). This will help separate specialization effects from heterogeneity or benchmark changes (a minimal sketch of the partial-correlation step appears after these responses). revision: yes

  3. Referee: [Interpretation of partial correlation matrices] In the interpretation of partial correlation matrices through time, the conclusion that AI models exhibit 'general intelligence suppressing specialized intelligences' assumes the observed correlation structure is independent of the dominant scaling trend, but no alternative model (e.g., PCA on residuals after regressing out parameter count) or falsification test is reported. This directly affects the 'Ptolemaic Succession' framing.

    Authors: The referee correctly notes the absence of a direct falsification test. We will add PCA performed on residuals after regressing out parameter count (and, where available, FLOPs) from the benchmark scores. The revised text will discuss whether the suppression interpretation and Ptolemaic Succession framing remain supported after this control. revision: yes
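
For concreteness, a sketch of the partial-correlation step invoked in these responses: project out a controlling component (PC1 here; log parameter count would slot into the same machinery) and examine the sign structure of what remains. Data are synthetic, and this is an illustration of the technique rather than the paper's exact computation.

```python
# Remove the first principal component from standardized scores and inspect the
# residual ("partial") correlations, as in the paper's partial-correlation analysis.
import numpy as np

rng = np.random.default_rng(4)
g = rng.normal(size=(19, 1))
scores = 50 + 15 * g + 5 * rng.normal(size=(19, 5))        # 19 models x 5 benchmarks

z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
pc1 = eigvecs[:, -1]                                        # loading vector of PC1 (G)

# Project each model's profile onto PC1 and remove that component from the scores.
g_scores = z @ pc1
residual = z - np.outer(g_scores, pc1)

# Residual correlations: the positive manifold should largely vanish once G is removed.
resid_corr = np.corrcoef(residual, rowvar=False)
off_diag = resid_corr[np.triu_indices_from(resid_corr, k=1)]
print(f"negative residual pairs: {(off_diag < 0).sum()} of {off_diag.size}, "
      f"mean = {off_diag.mean():.2f}")
```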

Circularity Check

0 steps flagged

No circularity: direct empirical PCA on external benchmark data

full rationale

The paper applies principal component analysis to a models-by-benchmarks correlation matrix constructed from publicly available LLM benchmark scores across 39 models and 14 benchmarks. PC1 variance shares (90% to 77%, 92% to 64%) and the positive manifold are computed outputs of this matrix; no step renames a fitted parameter as a prediction, defines G in terms of itself, or reduces a claimed result to a self-citation chain. The psychometric analogy to Spearman's g is interpretive framing rather than a load-bearing derivation, and the temporal analysis of partial correlations is a straightforward data-driven observation without self-referential closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on treating benchmarks as equivalent to psychometric tests and models as subjects, with no free parameters explicitly fitted beyond benchmark selection; the positive manifold is observed rather than derived.

axioms (1)
  • domain assumption: Benchmark scores can be treated as direct analogs to cognitive test scores for the purpose of extracting a general factor via PCA.
    Invoked when applying Spearman's g-factor framework to LLM performance data.

pith-pipeline@v0.9.0 · 5598 in / 1306 out tokens · 21399 ms · 2026-05-10T15:50:24.990765+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1] Bartholomew, D. J., Deary, I. J., and Lawn, M. (2009). A new lease of life for Thomson's bonds model of intelligence. Psychological Review, 116:567–579.
  2. [2] Berlin, I. (2013). The hedgehog and the fox: An essay on Tolstoy's view of history. Princeton University Press.
  3. [3] Burkart, J. M., Schubiger, M. N., and Van Schaik, C. P. (2017). The evolution of general intelligence. Behavioral and Brain Sciences, 40:e195.
  4. [4] Carroll, J. B. (1993). Human Cognitive Abilities: A Survey of Factor-Analytic Studies. Cambridge University Press, Cambridge.
  5. [5] Cattell, R. B. (1963). Theory of fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology, 54(1):1.
  6. [6] Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  7. [7] Chollet, F. (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.
  8. [8] Clark, A. (1998). Being There: Putting Brain, Body, and World Together Again. MIT Press.
  9. [9] Hilton, J., Nakano, R., Hesse, C., and Schulman, J. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168; Epoch AI (2024). AI benchmarking hub.
  10. [10] Gardner, H. (1983). Frames of Mind: The Theory of Multiple Intelligences. Basic Books.
  11. [11] Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24:79–132.
  12. [12] Gottfredson, L. S. (2003). Dissecting practical intelligence theory: Its claims and evidence. Intelligence, 31(4):343–397.
  13. [13] Hendrycks, D., Bengio, Y., Song, D., Tegmark, M., Schmidt, E., et al. (2025). A definition of AGI. https://www.agidefinition.ai/paper.pdf
  14. [14] Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30:179–185.
  15. [15] Hutchins, E. (2000). Distributed cognition. International Encyclopedia of the Social and Behavioral Sciences, 138(1):1–10; Ilić, D. and Gignac, G. E. (2024). Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement? Intelligence, 106:101858.
  16. [16] Jensen, A. R. (1998). The g Factor: The Science of Mental Ability. Praeger.
  17. [17] Jensen, A. R. (2002). Psychometric g: Definition and substantiation. In Sternberg, R. J. and
  18. [18] Johnson, W., Bouchard Jr, T. J., Krueger, R. F., McGue, M., and Gottesman, I. I. (2004). Just one g: Consistent results from three test batteries. Intelligence, 32:95–107.
  19. [19] Johnson, W., te Nijenhuis, J., and Bouchard Jr., T. J. (2008). Still just 1 g: Consistent results from five test batteries. Intelligence, 36:81–95.
  20. [20] Jung, R. E. and Haier, R. J. (2007). The Parieto-Frontal Integration Theory (P-FIT) of intelligence: Converging neuroimaging evidence. Behavioral and Brain Sciences, 30:135–154.
  21. [21] Kovacs, K. and Conway, A. R. A. (2016). Process overlap theory: A unified account of the general factor of intelligence. Psychological Inquiry, 27:151–177.
  22. [22] Gandhi, C. C. (2003). Individual differences in the expression of a "general" learning ability in mice. Journal of Neuroscience, 23(16):6423–6433.
  23. [23] Minsky, M. (1986). Society of Mind. Simon and Schuster.
  24. [24] Legg, S. (2023). Levels of AGI: Operationalizing progress on the path to AGI. arXiv preprint arXiv:2311.02462.
  25. [25] Neubauer, S., Hublin, J.-J., and Gunz, P. (2018). The evolution of modern human brain shape. Science Advances, 4(1):eaao5961.
  26. [26] Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. (2024). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.
  27. [27] Savi, A. O., Marsman, M., van der Maas, H. L. J., and Maris, G. K. J. (2019). The wiring of intelligence. Perspectives on Psychological Science, 14:1034–1061.
  28. [28] Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15:201–293.
  29. [29] Sternberg, R. J. (1985). Beyond IQ: A Triarchic Theory of Human Intelligence. Cambridge University Press.
  30. [30] Tattersall, I. (2012). Masters of the Planet: The Search for Our Human Origins. Palgrave Macmillan.
  31. [31] Thomson, G. H. (1916). A hierarchy without a general factor. British Journal of Psychology, 8:271–281.
  32. [32] Thurstone, L. L. (1938). Primary Mental Abilities. Number 1 in Psychometric Monographs. University of Chicago Press, Chicago.
  33. [33] Vallor, S. (2024). The AI Mirror: How to Reclaim Our Humanity in an Age of Machine Thinking. Oxford University Press.
  34. [34] van der Maas, H. L. J., Dolan, C. V., Grasman, R. P. P. P., Wicherts, J. M., Huizenga, H. M., and Raijmakers, M. E. J. (2006). A dynamical model of general intelligence: The positive manifold of intelligence by mutualism. Psychological Review, 113:842–861.