pith. machine review for the scientific record.

arxiv: 2604.25634 · v1 · submitted 2026-04-28 · 💻 cs.CR · cs.CL

Recognition: unknown

The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 15:40 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords LLM outputs · Mandelbrot distribution · rank-frequency · model fingerprinting · verification primitive · statistical regularity · token distribution

The pith

LLM token outputs from different models converge to the same two-parameter Mandelbrot rank-frequency distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that rank-frequency distributions of tokens produced by six frontier LLMs from five vendors consistently fit a shared two-parameter Mandelbrot distribution, with strong statistical support across held-out domains and generation lengths. This convergence does not erase model distinctions, as the fitted parameters separate models by many standard deviations. A reader would care because the pattern supports a CPU-only scoring method that runs in microseconds per token, enabling statistical checks on model provenance and output anomalies without model internals or sampling. The work frames this as a lightweight triage tool that can combine with log-probability access when available.
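
As a concreteness check, here is a minimal sketch of what such a rank-only scoring pass could look like, assuming the common Zipf-Mandelbrot form f(r) ∝ (r + q)^(−s); the paper's exact (q, p) parameterization, reference values, and microsecond-scale implementation are not reproduced here, so every parameter below is a placeholder.

```python
# Minimal rank-only scoring sketch. Assumes the common Zipf-Mandelbrot
# form f(r) ~ (r + q)^(-s); the reference parameters are placeholders,
# not values released with the paper.
from collections import Counter
import math

def mandelbrot_log_probs(n_ranks, q, s):
    # log P(rank = r) under a Zipf-Mandelbrot law truncated to n_ranks
    weights = [(r + q) ** (-s) for r in range(1, n_ranks + 1)]
    log_norm = math.log(sum(weights))
    return [math.log(w) - log_norm for w in weights]

def rank_only_score(tokens, q=2.5, s=1.1):
    """Mean per-token log-likelihood of the empirical ranks under the
    reference curve; unusually low scores flag atypical lexical statistics."""
    counts = Counter(tokens)
    ranked = {tok: r for r, (tok, _) in enumerate(counts.most_common(), 1)}
    log_p = mandelbrot_log_probs(len(ranked), q, s)
    return sum(log_p[ranked[t] - 1] for t in tokens) / len(tokens)

# toy usage on a whitespace-tokenized string
print(rank_only_score("the cat sat on the mat and the dog sat too".split()))
```

A production version would precompute the reference curve once per vocabulary size, which is where the per-token cost could plausibly drop toward the microsecond range the paper reports.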

Core claim

Token rank-frequency distributions in outputs from six models converge to the Mandelbrot ranking distribution, with 34 of 36 model-by-domain fits exceeding R² = 0.94 and 35 of 36 preferring Mandelbrot over Zipf by AIC. The cross-model spread in the parameter q exceeds per-model bootstrap error by more than an order of magnitude, producing tens of standard deviations of separation from a few thousand tokens.
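
For reference, the two families under comparison and the selection criterion, in their standard forms; the paper's exact parameterization of q and the second parameter p is not reproduced here, so the symbols below are illustrative.

```latex
% Standard forms compared (the paper's exact (q, p) parameterization may differ):
f_{\mathrm{Zipf}}(r) \propto r^{-s}, \qquad
f_{\mathrm{ZM}}(r) \propto (r + q)^{-s}, \qquad r = 1, \dots, N.

% AIC for a fit with k parameters and maximized likelihood \hat{L}:
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\Delta\mathrm{AIC} = \mathrm{AIC}_{\mathrm{Zipf}} - \mathrm{AIC}_{\mathrm{ZM}} > 0
\;\Rightarrow\; \text{the offset model is preferred despite its extra parameter.}
```

The rank offset q flattens the head of the curve; a pure power law has no mechanism for that, which is what the AIC comparison detects.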

What carries the argument

The two-parameter Mandelbrot distribution fitted to token rank frequencies, which captures the observed universality while keeping model parameters separable for fingerprinting and scoring.
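
A sketch of that machinery under the same assumed form, fitting both families by maximum likelihood and comparing AIC with scipy. The simulated rebuttal below mentions the powerlaw library; this illustrative version does not use it and is not the authors' pipeline.

```python
# Illustrative maximum-likelihood fits of Zipf and Zipf-Mandelbrot to
# descending token rank counts, compared by AIC. Not the authors' pipeline.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def neg_log_lik(params, counts, mandelbrot):
    ranks = np.arange(1, len(counts) + 1)
    s, q = (params if mandelbrot else (params[0], 0.0))
    if s <= 0 or q < 0:
        return np.inf                      # keep the optimizer in-bounds
    log_p = -s * np.log(ranks + q)
    log_p -= logsumexp(log_p)              # normalize over observed ranks
    return -(counts * log_p).sum()

def fit_and_aic(counts, mandelbrot):
    x0 = [1.1, 2.0] if mandelbrot else [1.1]
    res = minimize(neg_log_lik, x0, args=(counts, mandelbrot),
                   method="Nelder-Mead")
    return res.x, 2 * len(x0) + 2 * res.fun   # AIC = 2k - 2 ln L

# counts[i] = frequency of the (i+1)-th most common token (synthetic numbers)
counts = np.array([1000, 610, 430, 330, 265, 220, 187, 162, 142, 126], float)
(_, aic_zipf) = fit_and_aic(counts, mandelbrot=False)
(params_zm, aic_zm) = fit_and_aic(counts, mandelbrot=True)
print("fitted (s, q):", params_zm, " ΔAIC (Zipf - ZM):", aic_zipf - aic_zm)
```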

If this is right

  • Text can be tested for claimed model family without watermarks or internal access, supporting provenance checks and substitution audits.
  • A model-agnostic reference distribution allows single-pass scoring of black-box outputs, usable in rank-only mode on closed APIs.
  • The score flags lexical anomalies and unsupported entities while remaining insensitive to reasoning errors expressed in appropriate vocabulary.
  • The primitive serves as a fast first-pass layer that composes with existing sampling-based or source-conditioned verifiers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on longer outputs or additional sampling temperatures to check whether the distribution remains stable.
  • Parameter separability might allow silent-substitution detection in production API traffic without rate limits or model access.
  • The rank-only mode opens a path to lightweight auditing of closed-source systems where only generated text is observable.

Load-bearing premise

The Mandelbrot convergence and parameter separability seen in the six tested models and five domains will continue to hold for other models, sampling methods, lengths, and domains.

What would settle it

A new LLM or domain whose token rank frequencies produce a poor Mandelbrot fit (R² below 0.9), or whose fitted parameters overlap with another model's within bootstrap standard deviation.
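
The separability half of that criterion reduces to a one-line check: is the gap between fitted q values large relative to the combined bootstrap uncertainty? A sketch, with placeholder numbers drawn from the ranges the paper reports.

```python
# Separation in bootstrap standard deviations between two fitted q values.
def separation_sigmas(q_a, sd_a, q_b, sd_b):
    return abs(q_a - q_b) / (sd_a**2 + sd_b**2) ** 0.5

print(separation_sigmas(1.63, 0.05, 3.69, 0.10))  # extremes of the reported spread
print(separation_sigmas(2.10, 0.10, 2.15, 0.10))  # overlap like this would falsify
```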

Figures

Figures reproduced from arXiv:2604.25634 by Adrian de Valois-Franklin and Alex Bogdan.

Figure 1. Cross-model overlay of normalized rank-frequency curves. Across the distribution, vendor differences are …
Figure 2. Per-model Mandelbrot fits. The Mandelbrot form consistently captures the head flattening that the simpler …
Figure 3. AIC model selection. Positive ΔAIC values favor Mandelbrot over Zipf. The practical point is not aesthetic fit, but model selection: the richer form is repeatedly justified by the data.
Figure 4. Domain parameter variation. Within-model dispersion across domains exceeds between-model dispersion at a …
Figure 5. Latency-by-accuracy Pareto frontier. The primitive occupies the sub-millisecond region of the frontier at …
Figure 6. Taxonomy validation across benchmarks. The figure summarizes where the scoring layer is useful inside …
read the original abstract

We report a striking statistical regularity in frontier LLM outputs that enables a CPU-only scoring primitive running at 2.6 microseconds per token, with estimated latency up to 100,000$\times$ (five orders of magnitude) below existing sampling-based detectors. Across six contemporary models from five independent vendors, two generation sizes, and five held-out domains, token rank-frequency distributions converge to the same two-parameter Mandelbrot ranking distribution, with 34 of 36 model-by-domain fits exceeding $R^{2} = 0.94$ and 35 of 36 favoring Mandelbrot over Zipf by AIC. The shared family does not collapse the models into statistical duplicates. Fitted Mandelbrot parameters remain cleanly separable between models: the cross-model spread in $q$ (1.63 to 3.69) exceeds its per-model bootstrap standard deviation (0.03 to 0.10) by more than an order of magnitude, yielding tens of standard deviations of separation per few thousand output tokens. Two capabilities follow. First, statistical model fingerprinting: text from a vendor-delivered LLM can be tested against its claimed model family without cryptographic watermarks or access to model internals, supporting provenance verification and silent-substitution audits. Second, a model-agnostic reference distribution for black-box output assessment, from which we derive a single-pass scoring primitive that composes with model log probabilities when available and degrades to a rank-only mode usable on closed APIs. Pilot results on FRANK, TruthfulQA, and HaluEval map where the primitive helps (lexical anomalies, unsupported entities) and where it structurally cannot (reasoning errors in domain-appropriate vocabulary). We position the primitive as a first-pass triage layer in compound evaluation stacks, not as a replacement for sampling-based or source-conditioned verifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that token rank-frequency distributions from six LLMs (five vendors) across five held-out domains and two generation sizes converge to the same two-parameter Mandelbrot distribution, with 34/36 fits exceeding R²=0.94 and 35/36 favoring Mandelbrot over Zipf by AIC. This enables a CPU-only, 2.6 µs/token scoring primitive for model fingerprinting and black-box anomaly detection that is up to 100,000× faster than sampling-based methods, while preserving model separability via the q parameter.

Significance. If the universality and separability hold beyond the tested conditions, the result supplies a lightweight, model-agnostic reference distribution and single-pass primitive that could serve as an efficient first-pass triage layer for provenance verification and lexical anomaly detection in compound evaluation stacks. The reported bootstrap separation (tens of standard deviations) and AIC comparisons constitute concrete empirical strengths if the underlying data and fitting procedures are fully reproducible.

major comments (3)
  1. [Experimental Setup / Generation Details] The generation procedure is never varied, and the sampling hyperparameters (temperature, top-p) are never stated. Because these parameters directly reshape the output token probabilities, and therefore the empirical rank-frequency distributions under test, the observed convergence to a shared Mandelbrot family and the clean cross-model separation in q could be artifacts of a single fixed decoding regime rather than intrinsic model properties. (A toy illustration of the temperature effect follows the minor comments below.)
  2. [Results / Reference Distribution Construction] The model-agnostic reference distribution and anomaly thresholds are constructed from the same fitted Mandelbrot parameters that were estimated on the very data used to demonstrate the pattern. This creates a circularity risk for both the fingerprinting separation claim and the definition of 'normal' outputs against which anomalies are scored.
  3. [Methods] The manuscript provides insufficient detail on data exclusion rules, bootstrap implementation, raw token counts per fit, and exact maximum-likelihood procedures for the Mandelbrot parameters. Without these, it is impossible to confirm that the reported R² values, AIC preferences, and per-model standard deviations (0.03–0.10) are free of post-hoc selection or incorrect error propagation.
minor comments (2)
  1. [Abstract] Clarify whether the 100,000× latency claim is measured against a specific baseline detector and whether it includes the cost of rank extraction on closed APIs.
  2. [Introduction] Introduce the Mandelbrot parameters q and p with their functional form in the introduction or early methods section rather than assuming familiarity.
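
To make major comment 1 concrete, a toy calculation (synthetic distribution, not data from the paper) showing how temperature alone reshapes a sampled rank-frequency curve:

```python
# Temperature rescales token probabilities as p_i ∝ p_i^(1/T): T < 1
# steepens the rank-frequency curve, T > 1 flattens it. Synthetic numbers.
import numpy as np

base = np.arange(1, 1001, dtype=float) ** -1.2   # a toy "model" distribution
base /= base.sum()

for T in (0.7, 1.0, 1.3):
    p = base ** (1.0 / T)
    p /= p.sum()
    # effective Zipf slope between ranks 10 and 100
    slope = (np.log(p[99]) - np.log(p[9])) / (np.log(100.0) - np.log(10.0))
    print(f"T={T}: top-10 mass={p[:10].sum():.3f}, slope={slope:.2f}")
```

Top-p truncation reweights the head in an analogous way, which is why a single fixed decoding regime matters for the universality claim.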

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which identifies key areas for improving experimental transparency and methodological clarity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Experimental Setup / Generation Details] The generation procedure is never varied, and the sampling hyperparameters (temperature, top-p) are never stated. Because these parameters directly reshape the output token probabilities, and therefore the empirical rank-frequency distributions under test, the observed convergence to a shared Mandelbrot family and the clean cross-model separation in q could be artifacts of a single fixed decoding regime rather than intrinsic model properties.

    Authors: We agree that the sampling hyperparameters must be explicitly documented and that testing variations would further support the universality claim. All generations in the study used temperature=1.0 and top_p=1.0 (full-distribution sampling with no truncation) via each model's standard API interface; this choice was intended to reflect typical unconstrained generation. We will revise the Methods section to state these parameters clearly and add a limitations paragraph noting that the results pertain to this standard regime. While the cross-model and cross-domain consistency provides supporting evidence against a pure artifact, we acknowledge the referee's point and will not claim broader invariance without additional experiments. revision: partial

  2. Referee: [Results / Reference Distribution Construction] The model-agnostic reference distribution and anomaly thresholds are constructed from the same fitted Mandelbrot parameters that were estimated on the very data used to demonstrate the pattern. This creates a circularity risk for both the fingerprinting separation claim and the definition of 'normal' outputs against which anomalies are scored.

    Authors: The referee correctly flags a circularity concern in the current presentation. The shared Mandelbrot family is characterized from the collected outputs, and anomaly scoring measures deviation from that form. Model separability, however, is established via bootstrap standard deviations on the fitted q parameters rather than direct use of the reference for classification. We will revise the Results and Methods sections to (a) distinguish the descriptive fitting step from the operational use of a precomputed reference, (b) specify how a fixed reference distribution can be derived from an independent corpus or the study's averaged parameters, and (c) include a small held-out demonstration applying the reference to new text not used in fitting. These changes remove the circularity for practical deployment while preserving the reported statistics. revision: yes

  3. Referee: [Methods] The manuscript provides insufficient detail on data exclusion rules, bootstrap implementation, raw token counts per fit, and exact maximum-likelihood procedures for the Mandelbrot parameters. Without these, it is impossible to confirm that the reported R² values, AIC preferences, and per-model standard deviations (0.03–0.10) are free of post-hoc selection or incorrect error propagation.

    Authors: We accept that the Methods section requires substantial expansion for reproducibility. We will add: explicit data-exclusion criteria (sequences shorter than 500 tokens were discarded to ensure stable rank-frequency tails); bootstrap details (1,000 token-level resamples with replacement, with parameter standard deviations taken from the resulting empirical distribution); raw token counts (approximately 20,000–60,000 tokens per model-domain combination); and the precise MLE procedure (optimization of the two-parameter Mandelbrot likelihood via the powerlaw library with the same convergence tolerances used for all fits). These additions will allow independent verification of all reported R², AIC, and separation figures. revision: yes
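
A minimal sketch of the token-level bootstrap the response describes, reusing the illustrative fit_and_aic fitter sketched earlier; the authors cite the powerlaw library, which is not used here.

```python
# Token-level bootstrap: resample the token stream with replacement,
# refit (s, q) each time, and report the empirical spread of q.
import numpy as np
from collections import Counter

def bootstrap_q(tokens, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    toks = np.asarray(tokens)
    qs = []
    for _ in range(n_boot):
        sample = rng.choice(toks, size=toks.size, replace=True)
        counts = np.array(sorted(Counter(sample.tolist()).values(),
                                 reverse=True), dtype=float)
        (s, q), _ = fit_and_aic(counts, mandelbrot=True)  # fitter from above
        qs.append(q)
    return float(np.mean(qs)), float(np.std(qs))  # estimate and bootstrap SD
```

At the stated 20,000–60,000 tokens per model-domain combination, 1,000 resamples of this kind remain cheap enough to run per fit.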

Circularity Check

0 steps flagged

No significant circularity; empirical observation of distribution family supports independent application

full rationale

The paper reports empirical rank-frequency fits to the Mandelbrot family on held-out generations from six models, documents R² and AIC values, and notes parameter separability via bootstrap. It then defines a scoring primitive and fingerprinting method that apply the observed family as a reference. No equation or step equates a claimed result to its own fitted inputs by construction, renames a fit as a prediction, or relies on self-citation for a uniqueness theorem. The derivation remains self-contained: the universality claim rests on direct data fits, while the primitive is a downstream use of the fitted family rather than a tautological re-derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on empirical observation and parameter fitting rather than a first-principles derivation; the verification primitive is built directly on the fitted Mandelbrot parameters.

free parameters (2)
  • Mandelbrot q parameter = 1.63 to 3.69
    Shape parameter fitted to token rank-frequency data per model-domain pair; values range 1.63–3.69 across models
  • Mandelbrot p parameter
    Second parameter of the two-parameter Mandelbrot distribution fitted to the same rank-frequency data
axioms (1)
  • domain assumption: LLM token generation produces rank-frequency distributions adequately described by the two-parameter Mandelbrot generalization of Zipf's law
    Invoked as the basis for all subsequent fitting, separability claims, and primitive construction

pith-pipeline@v0.9.0 · 5633 in / 1461 out tokens · 65418 ms · 2026-05-07T15:40:09.305364+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

    L. Kuhn, Y. Gal, and S. Farquhar, “Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,” in Proc. Int. Conf. Learn. Representations (ICLR), 2023, arXiv:2302.09664

  2. [2]

    Detecting hallucinations in large language models using semantic entropy

    S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal, “Detecting hallucinations in large language models using semantic entropy,” Nature, vol. 630, pp. 625–630, 2024, doi: 10.1038/s41586-024-07421-0

  3. [3]

    SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models

    P. Manakul, A. Liusie, and M. J. F. Gales, “SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models,” in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), 2023, arXiv:2303.08896

  4. [4]

    Lookback Lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps

    Y.-S. Chuang, L. Qiu, C.-Y. Hsieh, R. Krishna, Y. Kim, and J. Glass, “Lookback Lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps,” arXiv preprint arXiv:2407.07071, 2024

  5. [5]

    Semantic entropy probes: Robust and cheap hallucination detection in LLMs

    J. Kossen, J. Han, M. A. Razzak, L. Schut, S. Malik, and Y. Gal, “Semantic entropy probes: Robust and cheap hallucination detection in LLMs,” arXiv preprint arXiv:2406.15927, 2024

  6. [6]

    An informational theory of the statistical structure of language,

    B. B. Mandelbrot, “An informational theory of the statistical structure of language,” in Communication Theory, W. Jackson, Ed. London, U.K.: Butterworths, 1953, pp. 486–502

  7. [7]

    The Psycho-Biology of Language: An Introduction to Dynamic Philology

    G. K. Zipf, The Psycho-Biology of Language: An Introduction to Dynamic Philology. Boston, MA, USA: Houghton Mifflin, 1935

  8. [8]

    A mathematical theory of communication,

    C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948; vol. 27, no. 4, pp. 623–656, 1948

  9. [9]

    Information theory and psycholinguistics: A theory of word frequencies,

    B. B. Mandelbrot, “Information theory and psycholinguistics: A theory of word frequencies,” in Readings in Mathematical Social Science, P. F. Lazarsfeld and N. W. Henry, Eds. Cambridge, MA, USA: MIT Press, 1966, pp. 151–168

  10. [10]

    The free-energy principle: A unified brain theory?,

    K. Friston, “The free-energy principle: A unified brain theory?,” Nature Reviews Neuroscience, vol. 11, no. 2, pp. 127–138, 2010

  11. [11]

    A tutorial on the free-energy framework for modelling perception and learning,

    R. Bogacz, “A tutorial on the free-energy framework for modelling perception and learning,” Journal of Mathematical Psychology, vol. 76, pp. 198–211, 2017

  12. [12]

    AI models collapse when trained on recursively generated data,

    I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal, “AI models collapse when trained on recursively generated data,” Nature, vol. 631, pp. 755–759, 2024

  13. [13]

    Understanding the effects of RLHF on LLM generalisation and diversity

    R. Kirk, B. Vidgen, P. Röttger, and S. A. Hale, “Understanding the effects of RLHF on LLM generalisation and diversity,” arXiv preprint arXiv:2310.06452, 2023

  14. [14]

    Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics,

    A. Pagnoni, V. Balachandran, and Y. Tsvetkov, “Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics,” in Proc. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021

  15. [15]

    TruthfulQA: Measuring how models mimic human falsehoods,

    S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proc. 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022

  16. [16]

    HaluEval: A large-scale hallucination evaluation benchmark for large language models,

    J. Li, X. Cheng, X. Zhao, J.-Y. Nie, and J.-R. Wen, “HaluEval: A large-scale hallucination evaluation benchmark for large language models,” in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), 2023

  17. [17]

    Zipf’s word frequency law in natural language: A critical review and future directions,

    S. T. Piantadosi, “Zipf’s word frequency law in natural language: A critical review and future directions,” Psychonomic Bulletin & Review, vol. 21, no. 5, pp. 1112–1130, 2014

  18. [18]

    Modeling the unigram distribution,

    I. Nikkarinen, T. Pimentel, A. Williams, and R. Cotterell, “Modeling the unigram distribution,” in Findings of the Association for Computational Linguistics: ACL, 2022, arXiv:2106.02289

  19. [19]

    Contrastive decoding: Open-ended text generation as optimization,

    X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis, “Contrastive decoding: Open-ended text generation as optimization,” in Proc. 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  20. [20]

    ROUGE: A package for automatic evaluation of summaries,

    C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proc. ACL Workshop on Text Summarization Branches Out, 2004, pp. 74–81

  21. [21]

    SummaC: Re-visiting NLI-based models for inconsistency detection in summarization,

    P. Laban, T. Schnabel, P. N. Bennett, and M. A. Hearst, “SummaC: Re-visiting NLI-based models for inconsistency detection in summarization,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 163–177, 2022

  22. [22]

    Survey of hallucination in natural language generation,

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023

  23. [23]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu, “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” arXiv preprint arXiv:2311.05232, 2023

  24. [24]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, and S. Shi, “Siren’s song in the AI ocean: A survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023

  25. [25]

    A watermark for large language models,

    J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” in Proc. 40th International Conference on Machine Learning (ICML), 2023

  26. [26]

    Universal emergence of local Zipf-Mandelbrot law,

    D. Cugini, A. Timpanaro, G. Livan, and G. Guarnieri, “Universal emergence of local Zipf-Mandelbrot law,” arXiv preprint arXiv:2407.15946, 2024

  27. [27]

    LLM-generated natural language meets scaling laws: New explorations and data augmentation methods,

    Z. Wang, G. Xu, and M. Ren, “LLM-generated natural language meets scaling laws: New explorations and data augmentation methods,” arXiv preprint arXiv:2407.00322, 2024

  28. [28]

    Mandelbrot’s model for Zipf’s law: Can Mandelbrot’s model be justified?,

    Y. I. Manin, “Mandelbrot’s model for Zipf’s law: Can Mandelbrot’s model be justified?,” Journal of Quantitative Linguistics, vol. 16, no. 3, pp. 274–285, 2009

  29. [29]

    Your large language models are leaving fingerprints,

    H. McGovern, R. Stureborg, Y . Suhara, and D. Alikaniotis, “Your large language models are leaving fingerprints,” arXiv preprint arXiv:2405.14057, 2024

  30. [30]

    FDLLM: A dedicated detector for black-box LLMs fingerprinting,

    Z. Fu, J. Chen, L. Zhang, T. Yang, J. Niu, H. Sun, R. Li, P. Liu, J. Wang, F. He, Q. Yue, and Y. Zhang, “FDLLM: A dedicated detector for black-box LLMs fingerprinting,” arXiv preprint arXiv:2501.16029, 2025

  31. [31]

    Learned hallucination detection in black-box LLMs using token-level entropy production rate,

    C. Moslonka, H. Randrianarivo, A. Garnier, and E. Malherbe, “Learned hallucination detection in black-box LLMs using token-level entropy production rate,” arXiv preprint arXiv:2509.04492, 2025

  32. [32]

    The geometry of truth: Layer-wise semantic dynamics for hallucination detection in large language models,

    A. H. Mir, “The geometry of truth: Layer-wise semantic dynamics for hallucination detection in large language models,” arXiv preprint arXiv:2510.04933, 2025

  33. [33]

    Power-law distributions in empirical data,

    A. Clauset, C. R. Shalizi, and M. E. J. Newman, “Power-law distributions in empirical data,” SIAM Review, vol. 51, no. 4, pp. 661–703, 2009

  34. [34]

    Analysis of the discrete distribution patterns of AI-generated content based on the Zipf-Mandelbrot law,

    Y. Zhu, L. Cai, Y. Lu, Y. Zhang, and J. Ye, “Analysis of the discrete distribution patterns of AI-generated content based on the Zipf-Mandelbrot law,” Journal of Modern Information, vol. 45, no. 11, pp. 167–177, 2025

  35. [35]

    HalluField: Detecting LLM hallucinations via field-theoretic modeling,

    M. Vu, B. K. Tran, S. A. Shah, G. Zollicoffer, X. N. Hoang, and M. Bhattarai, “HalluField: Detecting LLM hallucinations via field-theoretic modeling,” arXiv preprint arXiv:2509.10753, 2025

  36. [36]

    Zipf’s and Heaps’ laws for tokens and LLM-generated texts,

    N. Mikhaylovskiy, “Zipf’s and Heaps’ laws for tokens and LLM-generated texts,” in Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 15469–15481, 2025

  37. [37]

    Position: The Platonic Representation Hypothesis,

    M. Huh, B. Cheung, T. Wang, and P. Isola, “Position: The Platonic Representation Hypothesis,” in Proc. 41st International Conference on Machine Learning (ICML), vol. 235, pp. 20617–20642, 2024