Evaluating Computational Language Models with Scaling Properties of Natural Language
Pith reviewed 2026-05-25 18:38 UTC · model grok-4.3
The pith
Only gated recurrent neural network models reproduce the long memory behavior of natural language.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By testing several classes of language models against five scaling properties, the paper finds that, among the models tested, only language models based on recurrent neural networks with a gating mechanism reproduce the long memory behavior of natural language.
What carries the argument
The five scaling properties (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation analysis) used as benchmarks for whether models reproduce natural language statistics.
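As a rough illustration of how two of these properties can be measured, the sketch below estimates a Zipf exponent from the rank-frequency curve and a Heaps exponent from vocabulary growth. The function names, the `num_points` parameter, and the plain least-squares fit in log-log space are assumptions made for illustration; the paper's own estimators may differ.

```python
import math
from collections import Counter


def fit_loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


def zipf_exponent(tokens):
    """Magnitude of the rank-frequency slope (frequency ~ rank^-eta)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -fit_loglog_slope(ranks, freqs)


def heaps_exponent(tokens, num_points=20):
    """Slope of vocabulary growth (vocabulary size ~ text length^xi)."""
    seen = set()
    step = max(1, len(tokens) // num_points)  # hypothetical sampling grid
    lengths, vocab_sizes = [], []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            lengths.append(i)
            vocab_sizes.append(len(seen))
    return fit_loglog_slope(lengths, vocab_sizes)
```

Both functions take a list of tokens, so the same call can be applied to a natural-language corpus and to text generated by a model. The fit needs at least two distinct points, and a least-squares fit over the whole curve is only one of several possible estimators.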
If this is right
- Gated recurrent models including LSTM, GRU, and QRNN reproduce the long memory scaling of natural language.
- The exponent of Taylor's law serves as a good indicator of overall model quality.
- Standard n-gram, PCFG, Pitman-Yor process, and GAN models fail to capture long memory.
- The scaling properties provide an evaluation approach distinct from next-word prediction accuracy.
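To make the Taylor's law exponent concrete: the law relates the mean and the standard deviation of each word's counts across fixed-length windows of the text, with the standard deviation growing as a power of the mean. The sketch below estimates that power. The `window` default, the function name, and the least-squares fit in log space are illustrative assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter


def taylor_exponent(tokens, window=100):
    """Fit sigma ~ mu^alpha over words, using counts per window."""
    n_windows = len(tokens) // window
    # Count every word inside each non-overlapping window.
    counts = [
        Counter(tokens[i * window:(i + 1) * window]) for i in range(n_windows)
    ]
    vocab = set(tokens[:n_windows * window])
    log_mu, log_sigma = [], []
    for word in vocab:
        per_window = [c[word] for c in counts]  # Counter returns 0 if absent
        mu = sum(per_window) / n_windows
        var = sum((x - mu) ** 2 for x in per_window) / n_windows
        if var > 0:  # words with constant counts carry no slope information
            log_mu.append(math.log(mu))
            log_sigma.append(0.5 * math.log(var))
    n = len(log_mu)
    mean_x = sum(log_mu) / n
    mean_y = sum(log_sigma) / n
    sxx = sum((x - mean_x) ** 2 for x in log_mu)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_mu, log_sigma))
    return sxy / sxx
```

For a sequence of independent draws, the exponent is expected to sit close to 0.5, so values above that baseline are what indicate clustering of word occurrences. Comparing the value on generated text with the value on the training corpus is the kind of check the paper describes.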
Where Pith is reading between the lines
- Gating mechanisms appear necessary for models to sustain statistical dependencies over long distances.
- These properties could be turned into auxiliary objectives during model training.
- The same scaling checks might apply to sequence models outside language, such as those for music or biological sequences.
Load-bearing premise
That the five scaling properties are appropriate and sufficient benchmarks for evaluating the quality of computational language models.
What would settle it
A demonstration that any non-gated model, such as a basic n-gram or GAN variant, matches the long-range correlation scaling of natural language as closely as gated RNN models.
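One way such a comparison could be set up is to take the gaps between successive rare words and measure how the autocorrelation of that gap sequence behaves as the lag grows. The sketch below is a minimal version under stated assumptions: defining "rare" as words occurring at most `max_count` times, the default lags, and all function names are hypothetical choices, and the paper's own long-range correlation procedure is not reproduced here.

```python
from collections import Counter


def rare_word_intervals(tokens, max_count=2):
    """Gaps between successive positions of rare words (assumed definition)."""
    counts = Counter(tokens)
    positions = [i for i, tok in enumerate(tokens) if counts[tok] <= max_count]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    cov = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return cov / var


def correlation_profile(tokens, lags=(1, 2, 4, 8, 16), max_count=2):
    """Autocorrelation of rare-word gaps at several lags."""
    intervals = rare_word_intervals(tokens, max_count)
    return {
        lag: autocorrelation(intervals, lag)
        for lag in lags
        if lag < len(intervals)
    }
```

A profile that stays positive and decays slowly across lags would point toward long memory, while values that fluctuate around zero would not. The gap sequence must not be constant, since the variance in the denominator would then be zero.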
Original abstract
In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given by Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation analysis) can serve for evaluation of computational models. Specifically, we test $n$-gram language models, a probabilistic context-free grammar (PCFG), language models based on Simon/Pitman-Yor processes, neural language models, and generative adversarial networks (GANs) for text generation. Our analysis reveals that language models based on recurrent neural networks (RNNs) with a gating mechanism (i.e., long short-term memory, LSTM; a gated recurrent unit, GRU; and quasi-recurrent neural networks, QRNNs) are the only computational models that can reproduce the long memory behavior of natural language. Furthermore, through comparison with recently proposed model-based evaluation methods, we find that the exponent of Taylor's law is a good indicator of model quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates computational language models (n-gram, PCFG, Simon/Pitman-Yor, neural LMs, GANs) against five scaling properties of natural language (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, long-range correlation analysis). It concludes that only RNN-based models with gating (LSTM, GRU, QRNN) reproduce the long-memory behavior of natural language, and that the Taylor's law exponent serves as a useful quality indicator compared to other evaluation methods.
Significance. If the model comparisons are controlled for capacity, training regime, and generation length, the result would strengthen the case for gating mechanisms in capturing long-range dependencies and introduce scaling exponents as falsifiable benchmarks beyond perplexity. The work also supplies a concrete test (long-range correlation analysis) that could be applied to newer architectures.
major comments (3)
- [Abstract / long-range correlation results] Abstract and results on long-range correlation: the claim that only gated RNNs reproduce long memory is load-bearing, yet the manuscript provides no evidence that parameter budgets, effective context lengths, training corpus sizes, or generated text lengths were equalized across gated RNNs and the non-gated baselines (n-gram, PCFG, Pitman-Yor, non-gated neural, GAN). Without these controls the architecture-specific conclusion does not follow.
- [Long-range correlation analysis] Long-range correlation analysis section: the paper does not report error bars, number of independent runs, or sensitivity to the choice of window sizes and lag ranges used to estimate the scaling exponent; this leaves open whether the reported failure of non-gated models is statistically robust or an artifact of hyper-parameter settings.
- [Comparison with model-based evaluation methods] Comparison with model-based evaluation methods: the assertion that Taylor's law exponent is 'a good indicator' requires a quantitative correlation table or regression against held-out perplexity or human judgments; the current presentation leaves the strength of this indicator unclear.
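The robustness checks requested in the second comment can be sketched as follows: estimate the exponent on several independently generated samples, report the mean and the sample standard deviation, and repeat for each window size. This is only an outline. The `estimator` argument stands for any function that takes a token list and a window size and returns an exponent, and all names here are hypothetical.

```python
import statistics


def summarize_runs(estimates):
    """Mean and sample standard deviation of per-run exponent estimates."""
    mean = statistics.mean(estimates)
    spread = statistics.stdev(estimates) if len(estimates) > 1 else 0.0
    return mean, spread


def sensitivity_table(samples, estimator, windows):
    """Map each window size to (mean, std) of the exponent across samples."""
    table = {}
    for window in windows:
        estimates = [estimator(sample, window) for sample in samples]
        table[window] = summarize_runs(estimates)
    return table
```

If the gap between gated and non-gated models exceeds the reported spread at every window size, the architecture-level distinction is less likely to be an artifact of one hyper-parameter setting.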
minor comments (3)
- [Methods] Notation for the five scaling exponents is introduced without a consolidated table; a single table listing each exponent, its expected value on natural language, and the estimator used would improve clarity.
- [Methods] The manuscript cites the original scaling-law papers but does not discuss whether the chosen estimators (e.g., for Ebeling's method) match the exact procedures in those references; a short reproducibility note would help.
- [Results figures] Figure captions for the scaling plots do not state the corpus size or number of tokens used for each model; adding this information would allow direct comparison of effective sample sizes.
Simulated Author's Rebuttal
We thank the referee for the constructive comments that identify areas where additional controls and quantitative details would strengthen the manuscript. We respond to each major comment below and will incorporate revisions to address the concerns.
Point-by-point responses
- Referee: [Abstract / long-range correlation results] Abstract and results on long-range correlation: the claim that only gated RNNs reproduce long memory is load-bearing, yet the manuscript provides no evidence that parameter budgets, effective context lengths, training corpus sizes, or generated text lengths were equalized across gated RNNs and the non-gated baselines (n-gram, PCFG, Pitman-Yor, non-gated neural, GAN). Without these controls the architecture-specific conclusion does not follow.
  Authors: We agree that without explicit controls for parameter budgets, context lengths, and generation lengths the architecture-specific claim is weakened. The original experiments used standard configurations and the same training corpus for all models, but capacities were not systematically matched. We will add a table detailing model sizes, effective context windows, training steps, and generated text lengths, plus a discussion section on potential confounds and the limits this places on interpreting the results as purely architecture-driven.
  Revision: yes
- Referee: [Long-range correlation analysis] Long-range correlation analysis section: the paper does not report error bars, number of independent runs, or sensitivity to the choice of window sizes and lag ranges used to estimate the scaling exponent; this leaves open whether the reported failure of non-gated models is statistically robust or an artifact of hyper-parameter settings.
  Authors: Error bars, run counts, and sensitivity analyses were not included in the original submission. We will recompute the long-range correlation exponents over multiple independent generations (reporting at least 5 runs per model), add error bars, and include a sensitivity study varying window sizes and lag ranges in the revised manuscript and supplementary material to demonstrate that the distinction between gated and non-gated models is robust.
  Revision: yes
- Referee: [Comparison with model-based evaluation methods] Comparison with model-based evaluation methods: the assertion that Taylor's law exponent is 'a good indicator' requires a quantitative correlation table or regression against held-out perplexity or human judgments; the current presentation leaves the strength of this indicator unclear.
  Authors: The manuscript offers only a qualitative comparison. We will add a table reporting Pearson or Spearman correlations between the Taylor's law exponent and held-out perplexity across all evaluated models, along with a brief regression analysis. If human judgments are available from related work we will reference them; otherwise we will note the limitation and focus on the perplexity correlation.
  Revision: yes
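The correlation table promised in the last response could be built on a rank correlation such as the one sketched below, which pairs each model's Taylor exponent with its held-out perplexity. This is a standard-library sketch of Spearman's coefficient with average ranks for ties; the function names are illustrative, and no values from the paper are assumed.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Calling `spearman(exponents, perplexities)` on per-model lists gives a single number in the range -1 to 1. With only a handful of models, the coefficient is a coarse summary and should be read alongside the raw table.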
Circularity Check
No circularity: direct empirical comparison of scaling properties
full rationale
The paper computes five scaling properties (Zipf, Heaps, Ebeling, Taylor, long-range correlation) on natural language text and on text generated by multiple model classes, then reports which models match the natural-language exponents. This is a straightforward benchmark comparison with no fitted parameters renamed as predictions, no self-definitional equations, and no load-bearing self-citations invoked to justify uniqueness. The derivation chain consists of measurement followed by side-by-side reporting and therefore remains self-contained against external corpora.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: natural language text exhibits universal scaling properties (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, long-range correlation) that computational models should reproduce.