Inferring the Size of Large Language Models From Popular Text Memorization

Ivica Nikolic

arxiv: 2605.29223 · v3 · pith:WL5IMKTVnew · submitted 2026-05-28 · 💻 cs.LG


Pith reviewed 2026-06-29 08:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM size inference · black-box estimation · text memorization · next-token prediction · scaling laws · PCA · closed-weight models · parameter bounds


The pith

LLM parameter counts can be lower-bounded from next-token accuracy on popular texts alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that widely circulated texts appear in virtually every large pretraining corpus, so a model's next-token prediction accuracy on fragments of those texts directly reflects how much it has memorized. Memorization capacity is limited by total parameter count, allowing the accuracies across many texts and fragment lengths to be collected into one accuracy profile vector per model. From this vector the authors construct a pairwise statistical test that ranks two models by size and a PCA-based estimator that extracts a one-dimensional latent index and maps it to a numerical parameter lower bound via scaling laws. Both techniques are validated on open-weight models with known sizes and then applied to closed-weight APIs to expose product-line ordering and differences in how developers choose to scale parameters across generations.

Core claim

An accuracy profile vector formed by measuring next-token prediction accuracy on fragments of popular texts of varying lengths can be reduced by PCA to a latent index; this index, combined with scaling laws, yields a reliable conservative lower bound on the model's total parameter count, and the same vector supports a direct statistical comparison between any two models.

What carries the argument

The accuracy profile vector, which aggregates next-token prediction accuracies across a diverse corpus of popular texts and multiple fragment lengths.
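As a minimal sketch of how such a profile might be assembled: the paper's exact fragment sampling, tokenization, and scoring are not given here, so everything below is an assumption. `predict_next` is a hypothetical callable standing in for a model API, texts are assumed to be pre-split into words, and a prediction counts as correct only on an exact match with the following word.

```python
from typing import Callable, Dict, List, Sequence


def fragment_accuracy(tokens: Sequence[str],
                      predict_next: Callable[[Sequence[str]], str],
                      fragment_len: int,
                      stride: int = 1) -> float:
    """Fraction of fragments of `fragment_len` tokens whose next token is predicted exactly."""
    hits = 0
    total = 0
    # Slide a window over the text; the token right after the window is the target.
    for start in range(0, len(tokens) - fragment_len, stride):
        context = tokens[start:start + fragment_len]
        target = tokens[start + fragment_len]
        hits += int(predict_next(context) == target)
        total += 1
    return hits / total if total else float("nan")


def accuracy_profile(texts: Dict[str, Sequence[str]],
                     predict_next: Callable[[Sequence[str]], str],
                     fragment_lens: Sequence[int]) -> List[float]:
    """One accuracy per (text, fragment length) pair, in a fixed order."""
    return [fragment_accuracy(texts[name], predict_next, k)
            for name in sorted(texts)
            for k in fragment_lens]
```

Sorting the text names keeps the vector layout identical across models, which matters because later steps compare profiles coordinate by coordinate.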

If this is right

A statistical test on the accuracy profiles decides which of any two models is larger.
PCA reduction of the profile followed by scaling-law mapping produces a numerical lower bound on parameter count.
Application to closed APIs recovers the internal ordering of a developer's product line.
The bounds distinguish developers who increase parameter counts across generations from those who maintain fixed ceilings.
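The first consequence can be illustrated with a paired sign-flip permutation test on two accuracy profiles. This is a stand-in chosen for illustration; the paper's own pairwise test is not specified in the material above, and the function name, permutation count, and one-sided formulation are assumptions.

```python
import numpy as np


def paired_sign_flip_test(profile_a, profile_b, n_perm=10_000, seed=0):
    """One-sided test of whether profile A is systematically higher than profile B.

    Both profiles must list accuracies for the same (text, fragment length)
    pairs in the same order. Returns (mean difference, p-value).
    """
    a = np.asarray(profile_a, dtype=float)
    b = np.asarray(profile_b, dtype=float)
    diff = a - b
    observed = diff.mean()

    # Under the null hypothesis the sign of each paired difference is arbitrary,
    # so randomly flipping signs generates the null distribution of the mean.
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null_means = (signs * diff).mean(axis=1)

    # Add-one smoothing keeps the p-value strictly positive.
    p_value = (1 + np.sum(null_means >= observed)) / (n_perm + 1)
    return float(observed), float(p_value)
```

A small p-value is evidence that model A's profile is systematically higher, which under the paper's premise would point to A being the larger model.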

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same profile could be used to test whether a model was deliberately trained to avoid memorizing popular texts.
The approach supplies a practical way for API users to compare effective capacity across providers without weight access.
It could be extended to probe other hidden quantities such as training-data volume or domain emphasis.
Repeated application over time would track whether a provider is quietly increasing model size between API updates.

Load-bearing premise

Next-token accuracy on popular texts is limited primarily by total parameter count rather than by other training or architectural choices.

What would settle it

A later public disclosure of a closed model's exact parameter count that falls below the lower bound produced by the method on that same model.

Figures

Figures reproduced from arXiv: 2605.29223 by Ivica Nikolic.

**Figure 1.** Accuracies of different models on Alice in Wonderland. Our approach is grounded in the key observation illustrated in …

**Figure 2.** Fitted exponential scaling law $\hat{\theta}(z) = 41.2\, e^{0.617 z}$ mapping the PCA latent size index $z$ to true parameter count (log scale). Each point corresponds to one of the 19 open-weight dense models. Cross-validation yields an independent validation score of $R^2 = 0.9387$. Crucially, 100% of the predictions fall within a factor-of-two error band (0.5× ≤ Predicted ≤ 2.0× True Size…
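To make the fitted law in the Figure 2 caption concrete, the sketch below evaluates it and checks the factor-of-two band. The constants 41.2 and 0.617 are taken from the caption; the unit of the resulting parameter count is not stated in this excerpt, the function names are hypothetical, and the band is read here as 0.5 × true size ≤ predicted ≤ 2.0 × true size, which is an interpretation of the truncated caption.

```python
import math

# Constants as printed in the Figure 2 caption (units not stated in this excerpt).
A = 41.2
B = 0.617


def predicted_size(z: float) -> float:
    """Evaluate the fitted exponential law at latent index z."""
    return A * math.exp(B * z)


def within_factor_of_two(predicted: float, true_size: float) -> bool:
    """Assumed reading of the caption's error band."""
    return 0.5 * true_size <= predicted <= 2.0 * true_size
```

Under this law each unit increase in `z` multiplies the estimate by `math.exp(0.617)`, roughly 1.85, so a doubling in estimated size corresponds to a shift of about 1.12 in the latent index.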

**Figure 3.** Precision, recall, and accuracy of the relative size …

Original abstract

The parameter counts of the most widely used large language models (LLMs) are often withheld by their developers, leaving model size -- a primary reference point for interpreting capabilities and costs -- largely undisclosed. We propose a black-box method to infer conservative lower bounds on LLM size from generated text outputs alone, requiring nothing beyond the ability to submit text fragments and observe next-token predictions. Our approach is grounded in a key observation: popular, widely-circulated texts -- such as classical literature, religious texts, and foundational documents -- are present in virtually every large-scale pretraining corpus, and how accurately a model predicts the next word across text fragments of varying length is a reliable signal of how much it has memorized them, which in turn is fundamentally limited by its total parameter count. We aggregate this memorization signal across a diverse corpus of texts and fragment lengths into a single accuracy profile vector per model, and build two complementary inference methods on top of it: a pairwise statistical test that determines which of two models is larger, and a scaling-law estimator that extracts a one-dimensional latent index from these vectors via Principal Component Analysis (PCA) to map the aggregated signal to a parameter count. Validated on a broad set of open-weight models, both methods produce accurate and reliable lower bounds. When applied to popular closed-weight models, our framework recovers internal product hierarchies and reveals a clear divergence in industry scaling strategies: while some developers yield significantly higher bounds indicative of large generational parameter growth, others operate under strict parameter ceilings, demonstrating that hidden design choices can be systematically probed even under strict API limitations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable black-box route to lower-bound closed LLM sizes via memorization on common texts plus PCA, but the signal may track data overlap more than raw capacity.


The core claim is that next-token accuracy on a fixed set of popular texts can be turned into a profile vector, reduced via PCA to a latent index, and mapped through scaling laws to a parameter lower bound. They also supply a pairwise test for relative size. Both are validated on open-weight models before being run on closed APIs, where the results suggest some labs are pushing parameter growth harder than others.

What is new is the specific pipeline that treats the entire accuracy profile as input to PCA rather than relying on single-point memorization metrics or direct scaling fits. The choice of widely circulated texts is sensible because they are likely to appear in most pretraining runs, and the black-box constraint is respected throughout.

The soft spot is the assumption that the profile is driven primarily by parameter count rather than by how much each model’s training data overlapped with the chosen corpus, tokenizer differences, or post-training. The open-model validation set may not span enough variation on those axes to show the mapping is stable. If closed models differ systematically in data emphasis on classical texts, the PCA index will partly reflect that overlap instead of capacity, which would weaken both the pairwise comparisons and the absolute bounds.

The work is aimed at people doing black-box evaluation, capability estimation, or regulatory analysis. A reader who needs practical lower bounds on hidden sizes will get something usable even if the numbers are conservative. The paper is coherent on its own terms and engages the relevant scaling and memorization literature, so it deserves a serious referee who can press on the robustness checks.

Referee Report

2 major / 2 minor

Summary. The paper proposes a black-box method to infer conservative lower bounds on LLM parameter counts using only next-token predictions on popular texts (classical literature, religious texts, foundational documents). It aggregates per-model accuracy across text fragments of varying lengths into accuracy profile vectors, then applies a pairwise statistical test for relative size and a PCA-based scaling-law estimator that extracts a one-dimensional latent index to map the profile to a parameter count. The methods are validated on open-weight models and applied to closed-weight models to recover product hierarchies and identify divergent industry scaling strategies.
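The estimator described in this summary can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: PCA is done via SVD of the centered profile matrix, the latent index is the projection onto the first principal axis, the sign convention is a heuristic, and the log-linear fit uses natural logarithms. All function names are hypothetical.

```python
import numpy as np


def fit_latent_index(profiles):
    """Return (mean, first principal axis) for a matrix of accuracy profiles.

    `profiles` has one row per calibration model and one column per
    (text, fragment length) pair.
    """
    X = np.asarray(profiles, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    axis = vt[0]
    # Heuristic sign fix: make higher accuracies map to a larger index.
    if axis.sum() < 0:
        axis = -axis
    return mean, axis


def latent_index(profile, mean, axis):
    """Project one profile onto the first principal axis."""
    return float((np.asarray(profile, dtype=float) - mean) @ axis)


def fit_log_linear(z, sizes):
    """Least-squares fit of log(size) = slope * z + intercept."""
    slope, intercept = np.polyfit(z, np.log(sizes), deg=1)
    return float(slope), float(intercept)


def estimate_size(profile, mean, axis, slope, intercept):
    """Map a profile to a size estimate using the calibrated pieces."""
    return float(np.exp(intercept + slope * latent_index(profile, mean, axis)))
```

In this sketch `fit_latent_index` and `fit_log_linear` would be run once on models with known sizes, and `estimate_size` would then be applied to a model whose size is hidden, mirroring the calibrate-then-apply order described above.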

Significance. If the central mapping from accuracy profiles to parameter counts is robust, the work provides a practical tool for probing proprietary models under API constraints and could surface otherwise hidden design choices. The approach leverages the near-universal presence of popular texts in pretraining corpora and combines statistical testing with dimensionality reduction, which is a reasonable strategy for black-box inference. However, the significance hinges on whether the signal is dominated by capacity rather than data composition or other factors; the abstract provides no quantitative validation details (error bars, data exclusion rules, or cross-mixture controls) to support this.

major comments (2)

[Abstract] The claim that next-token accuracy 'is fundamentally limited by its total parameter count' and can be aggregated into a profile that 'reliably maps to parameter count via PCA and scaling laws, independent of other training or architectural factors' is load-bearing for both the pairwise test and the estimator. No evidence is supplied that the open-weight validation set spans axes such as data mixture (public-domain books vs. web text), tokenizer, or post-training; if closed models differ systematically in emphasis on the chosen corpus, the latent index will reflect data overlap more than capacity.
[Abstract] The scaling-law estimator 'extracts a one-dimensional latent index from these vectors via Principal Component Analysis (PCA) to map the aggregated signal to a parameter count.' Without the explicit equations for the PCA projection, the scaling-law fit, or the definition of the latent index, it is impossible to determine whether the final parameter estimate is independent of the fitted scaling-law parameters or reduces to a fitted quantity by construction.

minor comments (2)

[Abstract] The abstract states that both methods 'produce accurate and reliable lower bounds' on open-weight models but supplies no quantitative metrics, cross-validation procedure, or comparison against naive baselines (e.g., average accuracy alone).
[Abstract] The phrase 'conservative lower bounds' is used without defining the conservatism criterion or how it is enforced in the estimator.
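The kind of quantitative check requested in these comments could look like the leave-one-out sketch below. It is an assumption about what such a validation might involve, not the paper's procedure: the latent index `z` is taken as given (a stricter version would refit the PCA inside each fold), and the function names are hypothetical.

```python
import numpy as np


def loo_log_predictions(z, sizes):
    """Leave-one-out predictions of log(size) from a log-linear fit on z."""
    z = np.asarray(z, dtype=float)
    y = np.log(np.asarray(sizes, dtype=float))
    preds = np.empty_like(y)
    for i in range(len(z)):
        keep = np.arange(len(z)) != i
        # Fit on all models except model i, then predict model i.
        slope, intercept = np.polyfit(z[keep], y[keep], deg=1)
        preds[i] = intercept + slope * z[i]
    return preds


def r_squared(y_true, y_pred):
    """Coefficient of determination of held-out predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Passing each model's plain mean accuracy as `z` instead of the PCA index would give the naive baseline mentioned in the first minor comment, so the two held-out scores could be compared directly.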

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.


Referee: [Abstract] The claim that next-token accuracy 'is fundamentally limited by its total parameter count' and can be aggregated into a profile that 'reliably maps to parameter count via PCA and scaling laws, independent of other training or architectural factors' is load-bearing for both the pairwise test and the estimator. No evidence is supplied that the open-weight validation set spans axes such as data mixture (public-domain books vs. web text), tokenizer, or post-training; if closed models differ systematically in emphasis on the chosen corpus, the latent index will reflect data overlap more than capacity.

Authors: The manuscript validates the methods on a broad collection of open-weight models that already differ in pretraining data composition (some with heavier public-domain book emphasis, others web-dominant), tokenizers, and post-training regimes. The core premise is that the selected popular texts appear in virtually all large pretraining corpora, so the accuracy profile primarily reflects memorization capacity rather than corpus-specific overlap; the pairwise test further emphasizes relative ordering. We acknowledge that the abstract is concise and will add a sentence referencing the diversity of the validation set plus a new cross-mixture robustness check in the revised manuscript.

Revision: partial

Referee: [Abstract] The scaling-law estimator 'extracts a one-dimensional latent index from these vectors via Principal Component Analysis (PCA) to map the aggregated signal to a parameter count.' Without the explicit equations for the PCA projection, the scaling-law fit, or the definition of the latent index, it is impossible to determine whether the final parameter estimate is independent of the fitted scaling-law parameters or reduces to a fitted quantity by construction.

Authors: The abstract is a high-level summary. The full manuscript (Section 3.2 and Appendix) supplies the explicit definitions: the accuracy profile matrix undergoes PCA, the latent index is the projection onto the first principal component, and the scaling law is a linear regression log(N) = eta · latent_index + eta0 fitted exclusively on the open-weight models before being applied to closed models. Because the fit uses only open models and the PCA is unsupervised, the procedure is not circular. We will add a brief parenthetical pointer to Section 3.2 in the abstract during revision.

Revision: yes
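Reading the rebuttal's regression alongside the fitted law quoted in the Figure 2 caption, the two forms appear to be the same model written in different coordinates. The identification below is an assumption: it takes $N$ to be the quantity the caption calls $\hat{\theta}$, `latent_index` to be $z$, and the logarithm to be natural.

$$
\log N = \eta\, z + \eta_0 \quad\Longleftrightarrow\quad N = e^{\eta_0}\, e^{\eta z}
$$

Under that reading, matching against $\hat{\theta}(z) = 41.2\, e^{0.617 z}$ would give $\eta = 0.617$ and $e^{\eta_0} = 41.2$, i.e. $\eta_0 = \ln 41.2 \approx 3.72$.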

Circularity Check

0 steps flagged

No significant circularity; estimator calibrated externally


The paper calibrates its PCA-derived latent index and scaling-law mapping on open-weight models whose parameter counts are known independently, then applies the fitted mapping to produce lower bounds for closed models. The accuracy profile is constructed directly from observable next-token predictions on a fixed public corpus; the pairwise test compares these profiles without requiring a fitted scalar. No equations, self-citations, or definitional reductions are shown that would make the claimed bounds equivalent to the inputs by construction. The central claim therefore rests on an externally falsifiable empirical correlation rather than a self-referential loop.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Only abstract available, so ledger is minimal and based on stated premises; the central claim rests on the unverified assumption that memorization accuracy is strictly parameter-limited and that popular texts appear uniformly in all large pretraining corpora.

free parameters (1)

scaling-law parameters
The mapping from PCA latent index to parameter count is described as a scaling-law estimator, implying fitted constants whose values are not shown in the abstract.

axioms (2)

Domain assumption: Popular texts are present in virtually every large-scale pretraining corpus.
Stated directly in the abstract as the grounding observation.
Domain assumption: Memorization accuracy on text fragments is fundamentally limited by total parameter count.
Core premise linking the observed signal to model size.



Reference graph

Works this paper leans on

9 extracted references · 4 canonical work pages · 3 internal anchors

[1] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.

[2] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022.

[3] OpenRouter, “OpenRouter API,” https://openrouter.ai, 2026, accessed: 2026-05-22.

[4] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson et al., “Extracting training data from large language models,” in 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 2633–2650.

[5] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying memorization across neural language models,” in The Eleventh International Conference on Learning Representations, 2022.

[6] K. Tirumala, A. Markosyan, L. Zettlemoyer, and A. Aghajanyan, “Memorization without overfitting: Analyzing the training dynamics of large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 38274–38290, 2022.

[7] Epoch AI, “Frontier language models have become much smaller,” https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller, 2024, accessed: 2026-05-22.

[8] W. Cai, T. Shi, X. Zhao, and D. Song, “Are you getting what you pay for? Auditing model substitution in LLM APIs,” 2025. [Online]. Available: https://arxiv.org/abs/2504.04715

[9] B. Li, “Incompressible knowledge probes: Estimating black-box LLM parameter counts via factual capacity,” 2026. [Online]. Available: https://arxiv.org/abs/2604.24827

Appendix A. Texts and Prompts

Table 4: Popular texts used in the evaluation.

Title
Alice in Wonderland
Frankenstein
Grimms Fairy Tales
Hamlet
Julius Caesar
Macbeth
Moby Dick
Moonstone
Oliver ...
