Inferring the Size of Large Language Models From Popular Text Memorization
Pith reviewed 2026-06-29 08:38 UTC · model grok-4.3
The pith
LLM parameter counts can be lower-bounded from next-token accuracy on popular texts alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An accuracy profile vector formed by measuring next-token prediction accuracy on fragments of popular texts of varying lengths can be reduced by PCA to a latent index; this index, combined with scaling laws, yields a reliable conservative lower bound on the model's total parameter count, and the same vector supports a direct statistical comparison between any two models.
What carries the argument
The accuracy profile vector, which aggregates next-token prediction accuracies across a diverse corpus of popular texts and multiple fragment lengths.
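A minimal sketch of how such a profile might be assembled, assuming whitespace-level tokens and a hypothetical `predict_next` callable that wraps whatever API returns the model's top next-token guess. The paper's actual fragment sampling, tokenization, and ordering of entries are not given here, so every name below is illustrative.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: takes a context fragment, returns the predicted next token.
Predictor = Callable[[Sequence[str]], str]


def fragment_accuracy(tokens: Sequence[str], fragment_len: int,
                      predict_next: Predictor) -> float:
    """Share of fragments of the given length whose next token is predicted exactly."""
    hits = 0
    total = 0
    for start in range(len(tokens) - fragment_len):
        context = tokens[start:start + fragment_len]
        target = tokens[start + fragment_len]
        hits += int(predict_next(context) == target)
        total += 1
    return hits / total if total else 0.0


def accuracy_profile(texts: Dict[str, Sequence[str]], fragment_lens: Sequence[int],
                     predict_next: Predictor) -> List[float]:
    """One accuracy per (text, fragment length) pair, in a fixed order."""
    profile = []
    for title in sorted(texts):  # fixed order keeps profiles comparable across models
        for n in fragment_lens:
            profile.append(fragment_accuracy(texts[title], n, predict_next))
    return profile
```

Sorting the titles and iterating fragment lengths in a fixed order means entry k of the vector refers to the same text and length for every model, which any later comparison between profiles relies on.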
If this is right
- A statistical test on the accuracy profiles decides which of any two models is larger.
- PCA reduction of the profile followed by scaling-law mapping produces a numerical lower bound on parameter count.
- Application to closed APIs recovers the internal ordering of a developer's product line.
- The bounds distinguish developers who increase parameter counts across generations from those who maintain fixed ceilings.
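The pairwise comparison in the first bullet can be illustrated with a paired sign-flip permutation test on two accuracy profiles. This is an assumed stand-in: the summary does not say which statistical test the paper uses, and the function names, the 0.05 threshold, and the permutation count are all hypothetical choices.

```python
import random
from statistics import mean
from typing import Optional, Sequence, Tuple


def paired_sign_flip_test(profile_a: Sequence[float], profile_b: Sequence[float],
                          n_perm: int = 5000, seed: int = 0) -> Tuple[float, float]:
    """Return (mean accuracy difference A - B, two-sided permutation p-value)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same (text, length) cells")
    diffs = [a - b for a, b in zip(profile_a, profile_b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


def larger_model(profile_a: Sequence[float], profile_b: Sequence[float],
                 alpha: float = 0.05) -> Optional[str]:
    """Name the model with significantly higher accuracy, or None if undecided."""
    diff, p_value = paired_sign_flip_test(profile_a, profile_b)
    if p_value >= alpha:
        return None
    return "A" if diff > 0 else "B"
```

Returning `None` when the difference is not significant keeps the sketch from forcing an ordering on two models whose profiles are statistically indistinguishable.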
Where Pith is reading between the lines
- The same profile could be used to test whether a model was deliberately trained to avoid memorizing popular texts.
- The approach supplies a practical way for API users to compare effective capacity across providers without weight access.
- It could be extended to probe other hidden quantities such as training-data volume or domain emphasis.
- Repeated application over time would track whether a provider is quietly increasing model size between API updates.
Load-bearing premise
Next-token accuracy on popular texts is limited primarily by total parameter count rather than by other training or architectural choices.
What would settle it
A later public disclosure of a closed model's exact parameter count that falls below the lower bound produced by the method on that same model.
Original abstract
The parameter counts of the most widely used large language models (LLMs) are often withheld by their developers, leaving model size -- a primary reference point for interpreting capabilities and costs -- largely undisclosed. We propose a black-box method to infer conservative lower bounds on LLM size from generated text outputs alone, requiring nothing beyond the ability to submit text fragments and observe next-token predictions. Our approach is grounded in a key observation: popular, widely-circulated texts -- such as classical literature, religious texts, and foundational documents -- are present in virtually every large-scale pretraining corpus, and how accurately a model predicts the next word across text fragments of varying length is a reliable signal of how much it has memorized them, which in turn is fundamentally limited by its total parameter count. We aggregate this memorization signal across a diverse corpus of texts and fragment lengths into a single accuracy profile vector per model, and build two complementary inference methods on top of it: a pairwise statistical test that determines which of two models is larger, and a scaling-law estimator that extracts a one-dimensional latent index from these vectors via Principal Component Analysis (PCA) to map the aggregated signal to a parameter count. Validated on a broad set of open-weight models, both methods produce accurate and reliable lower bounds. When applied to popular closed-weight models, our framework recovers internal product hierarchies and reveals a clear divergence in industry scaling strategies: while some developers yield significantly higher bounds indicative of large generational parameter growth, others operate under strict parameter ceilings, demonstrating that hidden design choices can be systematically probed even under strict API limitations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a black-box method to infer conservative lower bounds on LLM parameter counts using only next-token predictions on popular texts (classical literature, religious texts, foundational documents). It aggregates per-model accuracy across text fragments of varying lengths into accuracy profile vectors, then applies a pairwise statistical test for relative size and a PCA-based scaling-law estimator that extracts a one-dimensional latent index to map the profile to a parameter count. The methods are validated on open-weight models and applied to closed-weight models to recover product hierarchies and identify divergent industry scaling strategies.
Significance. If the central mapping from accuracy profiles to parameter counts is robust, the work provides a practical tool for probing proprietary models under API constraints and could surface otherwise hidden design choices. The approach leverages the near-universal presence of popular texts in pretraining corpora and combines statistical testing with dimensionality reduction, which is a reasonable strategy for black-box inference. However, the significance hinges on whether the signal is dominated by capacity rather than data composition or other factors; the abstract provides no quantitative validation details (error bars, data exclusion rules, or cross-mixture controls) to support this.
major comments (2)
- [Abstract] The claim that next-token accuracy 'is fundamentally limited by its total parameter count' and can be aggregated into a profile that 'reliably maps to parameter count via PCA and scaling laws, independent of other training or architectural factors' is load-bearing for both the pairwise test and the estimator. No evidence is supplied that the open-weight validation set spans axes such as data mixture (public-domain books vs. web text), tokenizer, or post-training; if closed models differ systematically in emphasis on the chosen corpus, the latent index will reflect data overlap more than capacity.
- [Abstract] The scaling-law estimator 'extracts a one-dimensional latent index from these vectors via Principal Component Analysis (PCA) to map the aggregated signal to a parameter count.' Without the explicit equations for the PCA projection, the scaling-law fit, or the definition of the latent index, it is impossible to determine whether the final parameter estimate is independent of the fitted scaling-law parameters or reduces to a fitted quantity by construction.
minor comments (2)
- [Abstract] The abstract states that both methods 'produce accurate and reliable lower bounds' on open-weight models but supplies no quantitative metrics, cross-validation procedure, or comparison against naive baselines (e.g., average accuracy alone).
- [Abstract] The phrase 'conservative lower bounds' is used without defining the conservatism criterion or how it is enforced in the estimator.
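To make the last comment concrete, here is one possible conservatism criterion, offered purely as an assumption and not as the paper's definition: shift the fitted intercept down until no calibration model is overestimated. The symbols `eta` and `eta0` are borrowed from the log-linear fit quoted in the rebuttal; the function name is hypothetical.

```python
import math
from typing import Sequence


def conservative_intercept(latents: Sequence[float], param_counts: Sequence[float],
                           eta: float, eta0: float) -> float:
    """Lower eta0 so that eta * z + eta0 <= log(N) for every calibration model."""
    overshoots = [eta * z + eta0 - math.log(n)
                  for z, n in zip(latents, param_counts)]
    # Only shift downward; a fit that never overshoots is left unchanged.
    return eta0 - max(0.0, max(overshoots))
```

Under this criterion the bound is conservative only relative to the calibration set; it says nothing about models that lie outside the range of the open-weight data.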
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.
- Referee: [Abstract] The claim that next-token accuracy 'is fundamentally limited by its total parameter count' and can be aggregated into a profile that 'reliably maps to parameter count via PCA and scaling laws, independent of other training or architectural factors' is load-bearing for both the pairwise test and the estimator. No evidence is supplied that the open-weight validation set spans axes such as data mixture (public-domain books vs. web text), tokenizer, or post-training; if closed models differ systematically in emphasis on the chosen corpus, the latent index will reflect data overlap more than capacity.
  Authors: The manuscript validates the methods on a broad collection of open-weight models that already differ in pretraining data composition (some with heavier public-domain book emphasis, others web-dominant), tokenizers, and post-training regimes. The core premise is that the selected popular texts appear in virtually all large pretraining corpora, so the accuracy profile primarily reflects memorization capacity rather than corpus-specific overlap; the pairwise test further emphasizes relative ordering. We acknowledge that the abstract is concise and will add a sentence referencing the diversity of the validation set plus a new cross-mixture robustness check in the revised manuscript.
  Revision: partial
- Referee: [Abstract] The scaling-law estimator 'extracts a one-dimensional latent index from these vectors via Principal Component Analysis (PCA) to map the aggregated signal to a parameter count.' Without the explicit equations for the PCA projection, the scaling-law fit, or the definition of the latent index, it is impossible to determine whether the final parameter estimate is independent of the fitted scaling-law parameters or reduces to a fitted quantity by construction.
  Authors: The abstract is a high-level summary. The full manuscript (Section 3.2 and Appendix) supplies the explicit definitions: the accuracy profile matrix undergoes PCA, the latent index is the projection onto the first principal component, and the scaling law is a linear regression log(N) = eta · latent_index + eta0 fitted exclusively on the open-weight models before being applied to closed models. Because the fit uses only open models and the PCA is unsupervised, the procedure is not circular. We will add a brief parenthetical pointer to Section 3.2 in the abstract during revision.
  Revision: yes
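The pipeline described in this response (PCA, projection onto the first principal component, then the fit log(N) = eta · latent_index + eta0) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses natural logarithms, fits the PCA on open-weight profiles only, finds the first component by power iteration, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def first_component(rows: Sequence[Sequence[float]], iters: int = 200):
    """Column means and first principal direction, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data or unlucky start; keep the current direction
            break
        v = [x / norm for x in w]
    return means, v


def latent_index(profile: Sequence[float], means: Sequence[float],
                 direction: Sequence[float]) -> float:
    """Projection of a centered profile onto the first principal direction."""
    return sum((p - m) * c for p, m, c in zip(profile, means, direction))


def fit_size_estimator(open_profiles: Sequence[Sequence[float]],
                       open_param_counts: Sequence[float]) -> Dict[str, object]:
    """Fit log(N) = eta * latent_index + eta0 on open-weight models only."""
    means, direction = first_component(open_profiles)
    zs = [latent_index(p, means, direction) for p in open_profiles]
    ys = [math.log(n) for n in open_param_counts]
    z_bar, y_bar = sum(zs) / len(zs), sum(ys) / len(ys)
    var = sum((z - z_bar) ** 2 for z in zs)
    if var == 0.0:
        raise ValueError("latent index has no variance across calibration models")
    eta = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys)) / var
    return {"means": means, "direction": direction,
            "eta": eta, "eta0": y_bar - eta * z_bar}


def estimate_params(profile: Sequence[float], model: Dict[str, object]) -> float:
    """Map a (possibly closed) model's profile to an estimated parameter count."""
    z = latent_index(profile, model["means"], model["direction"])
    return math.exp(model["eta"] * z + model["eta0"])
```

The sign of a principal direction is arbitrary, but the fitted slope `eta` absorbs it, so the estimate does not depend on which sign the power iteration settles on. The closed model's parameter count never enters the fit, which is the non-circularity the response refers to.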
Circularity Check
No significant circularity; estimator calibrated externally
The paper calibrates its PCA-derived latent index and scaling-law mapping on open-weight models whose parameter counts are known independently, then applies the fitted mapping to produce lower bounds for closed models. The accuracy profile is constructed directly from observable next-token predictions on a fixed public corpus; the pairwise test compares these profiles without requiring a fitted scalar. No equations, self-citations, or definitional reductions are shown that would make the claimed bounds equivalent to the inputs by construction. The central claim therefore rests on an externally falsifiable empirical correlation rather than a self-referential loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- scaling-law parameters
axioms (2)
- domain assumption: Popular texts are present in virtually every large-scale pretraining corpus
- domain assumption: Memorization accuracy on text fragments is fundamentally limited by total parameter count
Appendix A. Texts and Prompts
Table 4: Popular texts used in the evaluation (titles as extracted; the list is truncated in the source): Alice in Wonderland, Frankenstein, Grimms Fairy Tales, Hamlet, Julius Caesar, Macbeth, Moby Dick, Moonstone, Oliver ...