
arxiv: 1906.09379 · v1 · pith:4NEZQAQInew · submitted 2019-06-22 · 💻 cs.CL

Evaluating Computational Language Models with Scaling Properties of Natural Language

Pith reviewed 2026-05-25 18:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords: language models · scaling properties · long memory · model evaluation · recurrent neural networks · Zipf's law · Taylor's law

The pith

Only gated recurrent neural network models reproduce the long memory behavior of natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies five scaling properties drawn from statistical mechanics to test how well different computational language models capture the global structure of natural language. These properties quantify the vocabulary population and the long memory (long-range statistical dependence) of a text. The tests show that n-gram models, a probabilistic context-free grammar, Simon/Pitman-Yor process models, and generative adversarial networks fall short on long memory; only recurrent models with gating mechanisms match this behavior. The work also identifies the exponent of Taylor's law as a practical signal of model quality.

Core claim

Through testing multiple types of language models on five scaling properties, the analysis shows that language models based on recurrent neural networks with a gating mechanism are the only computational models that can reproduce the long memory behavior of natural language.

What carries the argument

The five scaling properties (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation analysis) used as benchmarks for whether models reproduce natural language statistics.
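
As a rough illustration of how two of these benchmarks can be measured on a token sequence, the sketch below fits the Zipf exponent (slope of frequency against rank) and the Heaps exponent (slope of vocabulary size against text length) by ordinary least squares on log-log axes. The function names and the least-squares estimator are assumptions made for this example. The paper's own estimators may differ, and least squares on log-log data is a known simplification.

```python
import math
from collections import Counter


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


def zipf_exponent(tokens):
    """Exponent a in frequency ~ rank**(-a), from the rank-frequency list."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -loglog_slope(ranks, freqs)


def heaps_exponent(tokens, n_points=20):
    """Exponent b in vocabulary_size ~ text_length**b."""
    seen = set()
    growth = []  # growth[m - 1] = number of distinct words in the first m tokens
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    n = len(tokens)
    # Sample text lengths on a roughly log-spaced grid.
    lengths = sorted({max(1, round(n ** (i / n_points)))
                      for i in range(1, n_points + 1)})
    vocab = [growth[m - 1] for m in lengths]
    return loglog_slope(lengths, vocab)
```

Running both functions on a reference corpus and on text sampled from a model gives a pair of exponents per property that can be compared side by side.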

If this is right

  • Gated recurrent models including LSTM, GRU, and QRNN reproduce the long memory scaling of natural language.
  • The exponent of Taylor's law serves as a good indicator of overall model quality.
  • Standard n-gram, PCFG, Pitman-Yor process, and GAN models fail to capture long memory.
  • The scaling properties provide an evaluation approach distinct from next-word prediction accuracy.
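
The Taylor's law exponent mentioned above can be sketched as follows: split the text into fixed-size windows, compute the mean and standard deviation of each word's per-window count, and fit the slope of log standard deviation against log mean. The window size of 100, the function name, and the least-squares fit are assumptions for illustration, not the paper's settings. For an independent and identically distributed token stream the slope is close to 0.5, so values above 0.5 indicate clustering of word occurrences.

```python
import math
from collections import Counter


def taylor_exponent(tokens, window=100):
    """Fit sigma ~ mu**alpha across words and return alpha."""
    n_win = len(tokens) // window
    # Word counts in consecutive, non-overlapping windows.
    counts = [Counter(tokens[i * window:(i + 1) * window])
              for i in range(n_win)]
    log_mu, log_sigma = [], []
    for word in set(tokens[:n_win * window]):
        xs = [c[word] for c in counts]  # Counter returns 0 for absent words
        mu = sum(xs) / n_win
        var = sum((x - mu) ** 2 for x in xs) / n_win
        if var > 0:  # words with constant counts have no defined log sigma
            log_mu.append(math.log(mu))
            log_sigma.append(0.5 * math.log(var))
    # Least-squares slope of log sigma against log mu.
    mx = sum(log_mu) / len(log_mu)
    my = sum(log_sigma) / len(log_sigma)
    sxx = sum((a - mx) ** 2 for a in log_mu)
    sxy = sum((a - mx) * (b - my) for a, b in zip(log_mu, log_sigma))
    return sxy / sxx
```

A shuffled copy of a corpus is a convenient baseline: shuffling destroys word clustering, so its exponent should fall back toward 0.5.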

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Gating mechanisms appear necessary for models to sustain statistical dependencies over long distances.
  • These properties could be turned into auxiliary objectives during model training.
  • The same scaling checks might apply to sequence models outside language, such as those for music or biological sequences.

Load-bearing premise

That the five scaling properties are appropriate and sufficient benchmarks for evaluating the quality of computational language models.

What would settle it

A demonstration that any non-gated model, such as a basic n-gram or GAN variant, matches the long-range correlation scaling of natural language as closely as gated RNN models.
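
One way such a comparison could be set up is sketched below: convert a text into a numeric series and measure its autocorrelation at increasing lags. Using the intervals between successive occurrences of a chosen word set as the series is an assumption of this example, as are the function names; the paper's exact procedure may differ.

```python
def return_intervals(tokens, targets):
    """Distances between successive occurrences of any word in `targets`."""
    positions = [i for i, tok in enumerate(tokens) if tok in targets]
    return [b - a for a, b in zip(positions, positions[1:])]


def autocorrelation(series, lag):
    """Autocorrelation of a numeric series at a given positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var
```

Plotting `autocorrelation(series, s)` against `s` on log-log axes for a reference corpus and for model output shows whether the correlation stays positive and decays slowly (long memory) or drops to noise after a few lags.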

Original abstract

In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given by Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation analysis) can serve for evaluation of computational models. Specifically, we test $n$-gram language models, a probabilistic context-free grammar (PCFG), language models based on Simon/Pitman-Yor processes, neural language models, and generative adversarial networks (GANs) for text generation. Our analysis reveals that language models based on recurrent neural networks (RNNs) with a gating mechanism (i.e., long short-term memory, LSTM; a gated recurrent unit, GRU; and quasi-recurrent neural networks, QRNNs) are the only computational models that can reproduce the long memory behavior of natural language. Furthermore, through comparison with recently proposed model-based evaluation methods, we find that the exponent of Taylor's law is a good indicator of model quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper evaluates computational language models (n-gram, PCFG, Simon/Pitman-Yor, neural LMs, GANs) against five scaling properties of natural language (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, long-range correlation analysis). It concludes that only RNN-based models with gating (LSTM, GRU, QRNN) reproduce the long-memory behavior of natural language, and that the Taylor's law exponent serves as a useful quality indicator compared to other evaluation methods.

Significance. If the model comparisons are controlled for capacity, training regime, and generation length, the result would strengthen the case for gating mechanisms in capturing long-range dependencies and introduce scaling exponents as falsifiable benchmarks beyond perplexity. The work also supplies a concrete test (long-range correlation analysis) that could be applied to newer architectures.

major comments (3)
  1. [Abstract / long-range correlation results] The claim that only gated RNNs reproduce long memory is load-bearing, yet the manuscript provides no evidence that parameter budgets, effective context lengths, training corpus sizes, or generated text lengths were equalized across gated RNNs and the non-gated baselines (n-gram, PCFG, Pitman-Yor, non-gated neural, GAN). Without these controls, the architecture-specific conclusion does not follow.
  2. [Long-range correlation analysis] The paper does not report error bars, the number of independent runs, or sensitivity to the window sizes and lag ranges used to estimate the scaling exponent. This leaves open whether the reported failure of non-gated models is statistically robust or an artifact of hyperparameter settings.
  3. [Comparison with model-based evaluation methods] The assertion that the Taylor's law exponent is 'a good indicator' requires a quantitative correlation table or regression against held-out perplexity or human judgments; the current presentation leaves the strength of this indicator unclear.
minor comments (3)
  1. [Methods] Notation for the five scaling exponents is introduced without a consolidated table; a single table listing each exponent, its expected value on natural language, and the estimator used would improve clarity.
  2. [Methods] The manuscript cites the original scaling-law papers but does not discuss whether the chosen estimators (e.g., for Ebeling's method) match the exact procedures in those references; a short reproducibility note would help.
  3. [Results figures] Figure captions for the scaling plots do not state the corpus size or number of tokens used for each model; adding this information would allow direct comparison of effective sample sizes.
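
The error bars and sensitivity checks requested in major comment 2 could be produced along the following lines. This is a minimal sketch: `estimator` stands for any hypothetical function that maps a token list and a window size to an exponent, and `runs` is a list of independently generated texts; neither name comes from the paper.

```python
import math
import statistics


def mean_and_stderr(estimates):
    """Mean and standard error of exponent estimates from independent runs."""
    mean = statistics.mean(estimates)
    stderr = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return mean, stderr


def window_sensitivity(estimator, runs, windows):
    """Apply estimator(tokens, window) to every run at every window size.

    Returns {window: (mean, standard error)} so that the stability of the
    estimate across window sizes can be inspected directly.
    """
    return {w: mean_and_stderr([estimator(tokens, w) for tokens in runs])
            for w in windows}
```

If the gap between two model classes is larger than a few standard errors at every window size, the distinction is unlikely to be an artifact of one particular setting.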

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that identify areas where additional controls and quantitative details would strengthen the manuscript. We respond to each major comment below and will incorporate revisions to address the concerns.

Point-by-point responses
  1. Referee: [Abstract / long-range correlation results] The claim that only gated RNNs reproduce long memory is load-bearing, yet the manuscript provides no evidence that parameter budgets, effective context lengths, training corpus sizes, or generated text lengths were equalized across gated RNNs and the non-gated baselines (n-gram, PCFG, Pitman-Yor, non-gated neural, GAN). Without these controls, the architecture-specific conclusion does not follow.

    Authors: We agree that without explicit controls for parameter budgets, context lengths, and generation lengths the architecture-specific claim is weakened. The original experiments used standard configurations and the same training corpus for all models, but capacities were not systematically matched. We will add a table detailing model sizes, effective context windows, training steps, and generated text lengths, plus a discussion section on potential confounds and the limits this places on interpreting the results as purely architecture-driven. revision: yes

  2. Referee: [Long-range correlation analysis] The paper does not report error bars, the number of independent runs, or sensitivity to the window sizes and lag ranges used to estimate the scaling exponent. This leaves open whether the reported failure of non-gated models is statistically robust or an artifact of hyperparameter settings.

    Authors: Error bars, run counts, and sensitivity analyses were not included in the original submission. We will recompute the long-range correlation exponents over multiple independent generations (reporting at least 5 runs per model), add error bars, and include a sensitivity study varying window sizes and lag ranges in the revised manuscript and supplementary material to demonstrate that the distinction between gated and non-gated models is robust. revision: yes

  3. Referee: [Comparison with model-based evaluation methods] The assertion that the Taylor's law exponent is 'a good indicator' requires a quantitative correlation table or regression against held-out perplexity or human judgments; the current presentation leaves the strength of this indicator unclear.

    Authors: The manuscript offers only a qualitative comparison. We will add a table reporting Pearson or Spearman correlations between the Taylor's law exponent and held-out perplexity across all evaluated models, along with a brief regression analysis. If human judgments are available from related work we will reference them; otherwise we will note the limitation and focus on the perplexity correlation. revision: yes
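
The rank correlation proposed in the third response can be computed without external libraries. The sketch below implements Spearman's coefficient as the Pearson correlation of average ranks; the function names are chosen for this example, and the inputs would be one Taylor exponent and one held-out perplexity per evaluated model.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

A coefficient near -1 would mean that models with a higher exponent consistently have lower perplexity. With only a handful of models, the value should be read alongside its uncertainty.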

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of scaling properties

Full rationale

The paper computes five scaling properties (Zipf, Heaps, Ebeling, Taylor, long-range correlation) on natural language text and on text generated by multiple model classes, then reports which models match the natural-language exponents. This is a straightforward benchmark comparison with no fitted parameters renamed as predictions, no self-definitional equations, and no load-bearing self-citations invoked to justify uniqueness. The derivation chain consists of measurement followed by side-by-side reporting and therefore remains self-contained against external corpora.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework rests on the domain assumption that the listed scaling properties are universal features of natural language suitable for model assessment. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: Natural language text exhibits universal scaling properties (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, long-range correlation) that computational models should reproduce.
    Invoked in the abstract as the basis for evaluating all tested models.

pith-pipeline@v0.9.0 · 5738 in / 1125 out tokens · 27903 ms · 2026-05-25T18:38:14.107854+00:00 · methodology

