Evaluating Computational Language Models with Scaling Properties of Natural Language

Kumiko Tanaka-Ishii; Shuntaro Takahashi

arxiv: 1906.09379 · v1 · pith:4NEZQAQInew · submitted 2019-06-22 · 💻 cs.CL

Pith reviewed 2026-05-25 18:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords language models · scaling properties · long memory · model evaluation · recurrent neural networks · Zipf's law · Taylor's law

The pith

Only gated recurrent neural network models reproduce the long memory behavior of natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies five scaling properties drawn from statistical mechanics to test how well different computational language models capture the global structure of natural language. These properties quantify the vocabulary population and the long-range statistical dependencies of a text. Testing shows that n-gram models, a probabilistic context-free grammar, models based on Simon and Pitman-Yor processes, and generative adversarial networks fall short on the long memory aspect. Only recurrent models that incorporate gating mechanisms succeed in matching this behavior. The work also identifies the exponent of Taylor's law as a practical signal of model quality.

Core claim

Through testing multiple types of language models on five scaling properties, the analysis shows that language models based on recurrent neural networks with a gating mechanism are the only computational models that can reproduce the long memory behavior of natural language.

What carries the argument

The five scaling properties (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation analysis) used as benchmarks for whether models reproduce natural language statistics.
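
To make the first two properties concrete, the following is a minimal sketch of how Zipf and Heaps exponents could be estimated from a token sequence. It assumes a plain least-squares fit in log-log coordinates; the function names (`loglog_slope`, `zipf_exponent`, `heaps_exponent`) and the `points` parameter are illustrative choices, not the estimators used in the paper.

```python
import math
from collections import Counter


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


def zipf_exponent(tokens):
    """Exponent of the rank-frequency relation f(r) ~ r^(-exponent)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -loglog_slope(ranks, freqs)


def heaps_exponent(tokens, points=20):
    """Exponent of vocabulary growth V(n) ~ n^exponent.

    Vocabulary size is sampled at roughly `points` evenly spaced
    positions (an assumption made for this sketch).
    """
    seen = set()
    sizes, vocab = [], []
    step = max(1, len(tokens) // points)
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            sizes.append(n)
            vocab.append(len(seen))
    return loglog_slope(sizes, vocab)
```

Running both functions on a natural language corpus and on text generated by a model, then comparing the exponents, is the kind of side-by-side check the summary describes. An unweighted fit over all ranks is sensitive to the sparse low-frequency tail, so a careful study would likely use a more robust estimator.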

If this is right

Gated recurrent models including LSTM, GRU, and QRNN reproduce the long memory scaling of natural language.
The exponent of Taylor's law serves as a good indicator of overall model quality.
Standard n-gram, PCFG, Pitman-Yor process, and GAN models fail to capture long memory.
The scaling properties provide an evaluation approach distinct from next-word prediction accuracy.
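
As an illustration of the Taylor's law exponent mentioned above, here is a minimal sketch under stated assumptions: the text is cut into fixed-size windows, each word's mean and standard deviation of per-window counts are computed, and the exponent is the least-squares slope of log sigma against log mu. The window size and the name `taylor_exponent` are hypothetical, and the fit is not claimed to match the paper's procedure.

```python
import math
from collections import Counter


def taylor_exponent(tokens, window=100):
    """Fit sigma ~ mu^alpha over words, using per-window occurrence counts."""
    n_windows = len(tokens) // window
    windows = [Counter(tokens[i * window:(i + 1) * window])
               for i in range(n_windows)]
    vocabulary = set(tokens[:n_windows * window])

    log_mu, log_sigma = [], []
    for word in vocabulary:
        counts = [w[word] for w in windows]  # Counter gives 0 for absent words
        mu = sum(counts) / n_windows
        var = sum((c - mu) ** 2 for c in counts) / n_windows
        if var > 0:  # a word whose count never varies has no usable log sigma
            log_mu.append(math.log(mu))
            log_sigma.append(0.5 * math.log(var))

    # Least-squares slope of log(sigma) on log(mu).
    mean_x = sum(log_mu) / len(log_mu)
    mean_y = sum(log_sigma) / len(log_sigma)
    sxx = sum((x - mean_x) ** 2 for x in log_mu)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log_mu, log_sigma))
    return sxy / sxx
```

For an independent and identically distributed token sequence this estimate comes out close to 0.5, while a sequence in which every word occurs only in tight clusters pushes it toward 1. An exponent above 0.5 is therefore a sign that words co-occur in bursts, which is the memory-related behavior the summary refers to.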

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Gating mechanisms appear necessary for models to sustain statistical dependencies over long distances.
These properties could be turned into auxiliary objectives during model training.
The same scaling checks might apply to sequence models outside language, such as those for music or biological sequences.

Load-bearing premise

That the five scaling properties are appropriate and sufficient benchmarks for evaluating the quality of computational language models.

What would settle it

A demonstration that any non-gated model, such as a basic n-gram or GAN variant, matches the long-range correlation scaling of natural language as closely as gated RNN models.

Original abstract

In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given by Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation analysis) can serve for evaluation of computational models. Specifically, we test $n$-gram language models, a probabilistic context-free grammar (PCFG), language models based on Simon/Pitman-Yor processes, neural language models, and generative adversarial networks (GANs) for text generation. Our analysis reveals that language models based on recurrent neural networks (RNNs) with a gating mechanism (i.e., long short-term memory, LSTM; a gated recurrent unit, GRU; and quasi-recurrent neural networks, QRNNs) are the only computational models that can reproduce the long memory behavior of natural language. Furthermore, through comparison with recently proposed model-based evaluation methods, we find that the exponent of Taylor's law is a good indicator of model quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Gated RNNs match long memory scaling where others tested do not, but the architecture claim needs checks on whether comparisons were equalized for capacity and length.

Gated RNNs are the only models in this paper that reproduce the long memory behavior of natural language according to the scaling analysis, while n-grams, PCFGs, Pitman-Yor processes, non-gated neural models, and GANs fall short on that metric. Taylor's law exponent also tracks model quality better than some existing model-based evaluations.

The paper applies five scaling properties (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation) to generated text from each model class and compares the outcomes directly to natural language corpora. This produces a clear empirical distinction on the long-range correlation measure. The approach is useful because it evaluates global statistical structure rather than local perplexity or downstream task scores, and the side-by-side testing across model families gives a broader picture than single-architecture studies. The result on gated models and the Taylor exponent comparison are the concrete new pieces. The work engages the statistical mechanics literature on language scaling without obvious circularity.

The main soft spot is the fairness of the model comparisons. The central claim attributes the long memory difference to the presence of gating, but that only follows if the models were matched on parameter budget, training data, optimization, and the length of text generated for the scaling measurements. The abstract gives no indication those factors were controlled, so if the paper does not demonstrate that non-gated models still fail under equalized conditions, the architecture-specific conclusion rests on weaker ground. That concern is not minor for the main result. The assumption that these five properties are sufficient benchmarks is reasonable given their established status for natural language, though the paper does not prove they capture everything needed for model quality.

This paper is aimed at NLP researchers working on generative model evaluation and statistical properties of text. A reader focused on scaling laws or alternatives to perplexity would find the specific findings worth examining. It shows honest engagement with the relevant literature and produces falsifiable comparisons, so it deserves a serious referee even if revisions on methods transparency are needed.

Referee Report

3 major / 3 minor

Summary. The paper evaluates computational language models (n-gram, PCFG, Simon/Pitman-Yor, neural LMs, GANs) against five scaling properties of natural language (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, long-range correlation analysis). It concludes that only RNN-based models with gating (LSTM, GRU, QRNN) reproduce the long-memory behavior of natural language, and that the Taylor's law exponent serves as a useful quality indicator compared to other evaluation methods.

Significance. If the model comparisons are controlled for capacity, training regime, and generation length, the result would strengthen the case for gating mechanisms in capturing long-range dependencies and introduce scaling exponents as falsifiable benchmarks beyond perplexity. The work also supplies a concrete test (long-range correlation analysis) that could be applied to newer architectures.

major comments (3)

[Abstract / long-range correlation results] Abstract and results on long-range correlation: the claim that only gated RNNs reproduce long memory is load-bearing, yet the manuscript provides no evidence that parameter budgets, effective context lengths, training corpus sizes, or generated text lengths were equalized across gated RNNs and the non-gated baselines (n-gram, PCFG, Pitman-Yor, non-gated neural, GAN). Without these controls the architecture-specific conclusion does not follow.
[Long-range correlation analysis] Long-range correlation analysis section: the paper does not report error bars, number of independent runs, or sensitivity to the choice of window sizes and lag ranges used to estimate the scaling exponent; this leaves open whether the reported failure of non-gated models is statistically robust or an artifact of hyper-parameter settings.
[Comparison with model-based evaluation methods] Comparison with model-based evaluation methods: the assertion that Taylor's law exponent is 'a good indicator' requires a quantitative correlation table or regression against held-out perplexity or human judgments; the current presentation leaves the strength of this indicator unclear.
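
The robustness checks requested in the major comments (multiple independent runs, error bars, and sensitivity to the lag range) could be organized along the lines of the sketch below. It assumes each run has already been reduced to a numeric series, for example the intervals between successive rare-word occurrences in one generated text; that preprocessing, the function names, and the choice of lags are assumptions for illustration and do not reproduce the paper's estimator.

```python
import statistics


def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at a single lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var


def acf_with_spread(runs, lags):
    """Mean and standard deviation of the autocorrelation across runs.

    `runs` is a list of numeric series, one per independent generation.
    Returns {lag: (mean, stdev)} so each lag can be plotted with an error bar.
    """
    summary = {}
    for lag in lags:
        values = [autocorrelation(run, lag) for run in runs]
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[lag] = (statistics.mean(values), spread)
    return summary
```

Repeating the call with different lag lists, and with series built from different window or rarity settings, gives a direct view of how much the gated versus non-gated distinction depends on those choices.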

minor comments (3)

[Methods] Notation for the five scaling exponents is introduced without a consolidated table; a single table listing each exponent, its expected value on natural language, and the estimator used would improve clarity.
[Methods] The manuscript cites the original scaling-law papers but does not discuss whether the chosen estimators (e.g., for Ebeling's method) match the exact procedures in those references; a short reproducibility note would help.
[Results figures] Figure captions for the scaling plots do not state the corpus size or number of tokens used for each model; adding this information would allow direct comparison of effective sample sizes.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that identify areas where additional controls and quantitative details would strengthen the manuscript. We respond to each major comment below and will incorporate revisions to address the concerns.

Referee: [Abstract / long-range correlation results] Abstract and results on long-range correlation: the claim that only gated RNNs reproduce long memory is load-bearing, yet the manuscript provides no evidence that parameter budgets, effective context lengths, training corpus sizes, or generated text lengths were equalized across gated RNNs and the non-gated baselines (n-gram, PCFG, Pitman-Yor, non-gated neural, GAN). Without these controls the architecture-specific conclusion does not follow.

Authors: We agree that without explicit controls for parameter budgets, context lengths, and generation lengths the architecture-specific claim is weakened. The original experiments used standard configurations and the same training corpus for all models, but capacities were not systematically matched. We will add a table detailing model sizes, effective context windows, training steps, and generated text lengths, plus a discussion section on potential confounds and the limits this places on interpreting the results as purely architecture-driven. revision: yes
Referee: [Long-range correlation analysis] Long-range correlation analysis section: the paper does not report error bars, number of independent runs, or sensitivity to the choice of window sizes and lag ranges used to estimate the scaling exponent; this leaves open whether the reported failure of non-gated models is statistically robust or an artifact of hyper-parameter settings.

Authors: Error bars, run counts, and sensitivity analyses were not included in the original submission. We will recompute the long-range correlation exponents over multiple independent generations (reporting at least 5 runs per model), add error bars, and include a sensitivity study varying window sizes and lag ranges in the revised manuscript and supplementary material to demonstrate that the distinction between gated and non-gated models is robust. revision: yes
Referee: [Comparison with model-based evaluation methods] Comparison with model-based evaluation methods: the assertion that Taylor's law exponent is 'a good indicator' requires a quantitative correlation table or regression against held-out perplexity or human judgments; the current presentation leaves the strength of this indicator unclear.

Authors: The manuscript offers only a qualitative comparison. We will add a table reporting Pearson or Spearman correlations between the Taylor's law exponent and held-out perplexity across all evaluated models, along with a brief regression analysis. If human judgments are available from related work we will reference them; otherwise we will note the limitation and focus on the perplexity correlation. revision: yes
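
The correlation table proposed in the last response could be computed with a rank correlation such as Spearman's. The sketch below is a standard-library implementation that handles ties by average ranks; the inputs would be one Taylor exponent and one held-out perplexity per model, and no such values are supplied here because the page does not report them.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

A rank correlation is a reasonable default here because perplexity and scaling exponents live on very different scales and their relationship need not be linear. With only a handful of models, the resulting coefficient carries wide uncertainty and should be reported with that caveat.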

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of scaling properties

The paper computes five scaling properties (Zipf, Heaps, Ebeling, Taylor, long-range correlation) on natural language text and on text generated by multiple model classes, then reports which models match the natural-language exponents. This is a straightforward benchmark comparison with no fitted parameters renamed as predictions, no self-definitional equations, and no load-bearing self-citations invoked to justify uniqueness. The derivation chain consists of measurement followed by side-by-side reporting and therefore remains self-contained against external corpora.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework rests on the domain assumption that the listed scaling properties are universal features of natural language suitable for model assessment. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Natural language text exhibits universal scaling properties (Zipf's law, Heaps' law, Ebeling's method, Taylor's law, long-range correlation) that computational models should reproduce.
Invoked as the basis for evaluating all tested models in the abstract.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 5 internal anchors

[1] Altmann, Eduardo G. and Martin Gerlach. 2017. Statistical laws in linguistics. Creativity and Universality in Language, pages 7–26.
[2] Altmann, Eduardo G., Janet B. Pierrehumbert, and Adilson E. Motter. 2009. Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4(11):e7678.
[3] Baeza-Yates, Ricardo and Gonzalo Navarro. 2000. Block addressing indices for approximate text retrieval. Journal of the American Society for Information Science, 51(1):69–82.
[4] Bradbury, James, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In Proceedings of International Conference on Learning Representations, Toulon.
[5] Che, Tong, Yanran Li, Ruixiang Zhang, Devon R Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.
[6] Chen, Stanley F. and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394.
[7] Cho, Kyunghyun, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha.
[8] Clauset, Aaron, Cosma Rohilla Shalizi, and M.E.J. Newman. 2009. Power-law distributions in empirical data. SIAM Review, 51(4):661–703.
[9] Ebeling, Werner and Alexander Neiman. 1995. Long-range correlations between letters and sentences in texts. Physica A, 215(3):233–241.
[10] Ebeling, Werner and Thorsten Pöschel. 1994. Entropy and long-range correlations in literary English. Europhysics Letters, 26(4):241–246.
[11] Eisler, Zoltán, Imre Bartos, and Janos Kertész. 2007. Fluctuation scaling in complex systems: Taylor's law and beyond. Advances in Physics, 57(1):89–142.
[12] Fedus, William, Ian Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the _______. In Proceedings of International Conference on Learning Representations, Vancouver.
[13] Forney, G. David. 1973. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.
[14] Gerlach, Martin and Eduardo G. Altmann. 2013. Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):021006.
[15] Goldwater, Sharon, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12:2335–2382.
[16] Grave, Edouard, Armand Joulin, and Nicolas Usunier. 2017. Improving neural language models with a continuous cache. In Proceedings of International Conference on Learning Representations, Toulon.
[17] Guo, Jiaxian, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In The Thirty-Second AAAI Conference, pages 5141–5148, Louisiana.
[18] Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[19] Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401.
[20] Kingman, J.F.C. 1963. The exponential decay of Markov transition probabilities. Proceedings of the London Mathematical Society, s3-13(1):337–358.
[21] Kneser, Reinhard and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 181–184, Michigan.
[22] Kobayashi, Tatsuru and Kumiko Tanaka-Ishii. 2018. Taylor's law for human linguistic sequences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1138–1148, Melbourne.
[23] van Leijenhorst, Dick and Theo van der Weide. 2005. A formal derivation of Heaps' law. Information Sciences, 170(2-4):263–272.
[24] Lennartz, Sabine and Armin Bunde. 2009. Eliminating finite-size effects and detecting the amount of white noise in short records with long-term memory. Physical Review E, 79(6):066101.
[25] Li, Wentian. 1989. Mutual information functions of natural language texts. Santa Fe Institute Working Paper.
[26] Lin, Chin-Yew. 2004. Rouge: a package for automatic evaluation of summaries. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics Workshop, Barcelona.
[27] Lin, Henry W. and Max Tegmark. 2017. Critical behavior in physics and probabilistic formal languages. Entropy, 19(7):299.
[28] Lin, Kevin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165, California.
[29] Lin, Tsung-Yi, Michael Maire, Serge Belongie, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, Zurich.
[30] Loper, Edward and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics Workshop, pages 63–70, Pennsylvania.
[31] Lü, Linyuan, Zi-Ke Zhang, and Tao Zhou. 2010. Zipf's law leads to Heaps' law: Analyzing their relation in finite-size systems. PLoS One, 5(12):e14139.
[32] Lu, Sidi, Lantao Yu, Weinan Zhang, and Yong Yu. 2018. CoT: Cooperative training for generative modeling. arXiv preprint arXiv:1804.03782.
[33] Manning, Chris and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
[34] Melis, Gábor, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proceedings of International Conference on Learning Representations, Vancouver.
[35] Merity, Stephen, Nitish Keskar, and Richard Socher. 2018a. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.
[36] Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. 2018b. Regularizing and optimizing LSTM language models. In Proceedings of International Conference on Learning Representations, Vancouver.
[37] Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. In Proceedings of International Conference on Learning Representations, San Juan.
[38] Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Honza Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045–1048, Chiba.
[39] Mikolov, Tomáš and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In The IEEE Workshop on Spoken Language Technology, pages 234–239, Florida.
[40] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Pennsylvania.
[41] Rajeswar, Sai, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. 2017. Adversarial generation of natural language. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics Workshop, pages 241–251, Vancouver.
[42] Simon, Herbert A. 1955. On a class of skew distribution functions. Biometrika, 42(3/4):425–440.
[43] Smith, H. Fairfield. 1938. An empirical law describing heterogeneity in the yields of agricultural crops. Journal of Agricultural Science, 28(1):1–23.
[44] Stolcke, Andreas. 2002. SRILM - an extensible language modeling toolkit. In International Conference on Spoken Language Processing, pages 901–904, Colorado.
[45] Takahashi, Shuntaro and Kumiko Tanaka-Ishii. 2017. Do neural nets learn statistical laws behind natural language? PLoS One, 12(12):e0189326.
[46] Tanaka-Ishii, Kumiko and Armin Bunde. 2016. Long-range memory in literary texts: On the universal clustering of the rare words. PLoS One, 11(11):e0164658.
[47] Tanaka-Ishii, Kumiko and Tatsuru Kobayashi. 2018. Taylor's law for linguistic sequences and random walk models. Journal of Physics Communications, 2(11):115024.
[48] Taylor, Lionel Roy. 1961. Aggregation, variance and the mean. Nature, 189(4766):732–735.
[49] Teh, Yee Whye. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, Sydney.
[50] Yang, Zhilin, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In Proceedings of International Conference on Learning Representations, Vancouver.
[51] Yu, Lantao, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In The Thirty-First AAAI Conference, pages 2852–2858, California.
[52] Zhang, Yizhe, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850.
[53] Zhu, Yaoming, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. arXiv preprint arXiv:1802.01886.

[1] [1]

and Martin Gerlach

Altmann, Eduardo G. and Martin Gerlach. 2017. Statistical laws in linguistics. Creativity and Universality in Language, pages 7--26

work page 2017

[2] [2]

Pierrehumbert, and Adilson E

Altmann, Eduardo G., Janet B. Pierrehumbert, and Adilson E. Motter. 2009. Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4(11):e7678

work page 2009

[3] [3]

Baeza-Yates , Ricardo and Gonzalo Navarro. 2000. Block addressing indices for approximate text retrieval. Journal of the American Society for Information Science, 51(1):69--82

work page 2000

[4] [4]

Bradbury, James, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In Proceedings of International Conference on Learning Representations, Toulon

work page 2017

[5] [5]

Che, Tong, Yanran Li, Ruixiang Zhang, Devon R Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

and Joshua Goodman

Chen, Stanley F. and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359--394

work page 1999

[7] [7]

e nboer, C aglar G \

Cho, Kyunghyun, Bart van Merri \" e nboer, C aglar G \" u l c ehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation . In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724--1734, Doha

work page 2014

[8] [8]

Clauset, Aaron, Cosma Rohilla Shalizi, and M.E.J. Newman. 2009. Power-law distributions in empirical data. SIAM review, 51(4):661--703

work page 2009

[9] [9]

Ebeling, Werner and Alexander Neiman. 1995. Long-range correlations between letters and sentences in texts. Physica A, 215(3):233--241

work page 1995

[10] [10]

Ebeling, Werner and Thorsten P\"oschel. 1994. Entropy and long-range correlations in literary english. Europhysics Letters, 26(4):241--246

work page 1994

[11] [11]

Eisler, Zolt \'a n, Imre Bartos, and Janos Kert \'e sz. 2007. Fluctuation scaling in complex systems: Taylor's law and beyond. Advances in Physics, 57(1):89--142

work page 2007

[12] [12]

Fedus, William, Ian Goodfellow, and Andrew M. Dai. 2018. Mask GAN : Better text generation via filling in the \_\_\_\_\_\_\_. In Proceedings of International Conference on Learning Representations, Vancouver

work page 2018

[13] [13]

Forney, G. David. 1973. The viterbi algorithm. Proceedings of the IEEE, 61(3):268--278

work page 1973

[14] [14]

Gerlach, Martin and Eduardo G. Altmann. 2013. Stochastic model for the vocabulary growth in natural languages. Physical Review X, 3(2):021006

work page 2013

[15] [15]

Griffiths, and Mark Johnson

Goldwater, Sharon, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12:2335--2382

work page 2011

[16] [16]

Grave, Edouard, Armand Joulin, and Nicolas Usunier. 2017. Improving neural language models with a continuous cache. In Proceedings of International Conference on Learning Representations, Toulon

work page 2017

[17] [17]

Guo, Jiaxian, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In The Thirty-Second AAAI Conference, pages 5141--5148, Louisiana

work page 2018

[18] [18]

Hochreiter, Sepp and J \"u rgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735--1780

work page 1997

[19] [19]

Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400--401

work page 1987

[20] [20]

Kingman, J.F.C. 1963. The exponential decay of markov transition probabilities. Proceedings of the London Mathematical Society, s3-13(1):337--358

work page 1963

[21] [21]

Kneser, Reinhard and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 181--184, Michigan

work page 1995

[22] [22]

Kobayashi, Tatsuru and Kumiko Tanaka-Ishii. 2018. Taylor 's law for human linguistic sequences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1138--1148, Melbourne

work page 2018

[23] [23]

van Leijenhorst, Dick and Theo van der Weide. 2005. A formal derivation of Heaps ' law. Information Sciences, 170(2-4):263--272

work page 2005

[24] [24]

Lennartz, Sabine and Armin Bunde. 2009. Eliminating finite-size effects and detecting the amount of whitenoise in short records with long-term memory. Physical Review E, 79(6):066101

work page 2009

[25] [25]

Li, Wentian. 1989. Mutual information functions of natural language texts. Santa Fe Institute Working Paper

work page 1989

[26] [26]

Lin, Chin-Yew. 2004. Rouge: a package for automatic evaluation of summaries. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics Workshop, Barcelona

work page 2004

[27] [27]

and Max Tegmark

Lin, Henry W. and Max Tegmark. 2017. Critical behavior in physics and probabilistic formal languages. entropy, 19(7):299

work page 2017

[28] [28]

Lin, Kevin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155--3165, California

work page 2017

[29] [29]

Lin, Tsung-Yi, Michael Maire, Serge Belongie, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740--755, Zurich

work page 2014

[30] [30]

Loper, Edward and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics Workshop, pages 63--70, Pennsylvania

work page 2002

[31] [31]

Lü, Linyuan, Zi-Ke Zhang, and Tao Zhou. 2010. Zipf's law leads to Heaps' law: Analyzing their relation in finite-size systems. PLoS One, 5(12):e14139

work page 2010

[32] [32]

Lu, Sidi, Lantao Yu, Weinan Zhang, and Yong Yu. 2018. Cot: Cooperative training for generative modeling. arXiv preprint arXiv:1804.03782

work page internal anchor Pith review Pith/arXiv arXiv 2018

[33] [33]

Manning, Chris and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press

work page 1999

[34] [34]

Melis, Gábor, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proceedings of International Conference on Learning Representations, Vancouver

work page 2018

[35] [35]

Merity, Stephen, Nitish Keskar, and Richard Socher. 2018a. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240

work page internal anchor Pith review Pith/arXiv arXiv 2018

[36] [36]

Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. 2018b. Regularizing and optimizing LSTM language models. In Proceedings of International Conference on Learning Representations, Vancouver

work page 2018

[37] [37]

Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. In Proceedings of International Conference on Learning Representations, San Juan

work page 2016

[38] [38]

Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Honza Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045--1048, Chiba

work page 2010

[39] [39]

Mikolov, Tomáš and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In The IEEE Workshop on Spoken Language Technology, pages 234--239, Florida

work page 2012

[40] [40]

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311--318, Pennsylvania

work page 2002

[41] [41]

Rajeswar, Sai, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. 2017. Adversarial generation of natural language. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics Workshop, pages 241--251, Vancouver

work page 2017

[42] [42]

Simon, Herbert A. 1955. On a class of skew distribution functions. Biometrika, 42(3/4):425--440

work page 1955

[43] [43]

Smith, H. Fairfield. 1938. An empirical law describing heterogeneity in the yields of agricultural crops. Journal of Agricultural Science, 28(1):1--23

work page 1938

[44] [44]

Stolcke, Andreas. 2002. SRILM - an extensible language modeling toolkit. In International Conference on Spoken Language Processing, pages 901--904, Colorado

work page 2002

[45] [45]

Takahashi, Shuntaro and Kumiko Tanaka-Ishii. 2017. Do neural nets learn statistical laws behind natural language? PLoS One, 12(12):e0189326

work page 2017

[46] [46]

Tanaka-Ishii, Kumiko and Armin Bunde. 2016. Long-range memory in literary texts: On the universal clustering of the rare words. PLoS One, 11(11):e0164658

work page 2016

[47] [47]

Tanaka-Ishii, Kumiko and Tatsuru Kobayashi. 2018. Taylor's law for linguistic sequences and random walk models. Journal of Physics Communications, 2(11):115024

work page 2018

[48] [48]

Taylor, Lionel Roy. 1961. Aggregation, variance and the mean. Nature, 189(4766):732--735

work page 1961

[49] [49]

Teh, Yee Whye. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 985--992, Sydney

work page 2006

[50] [50]

Yang, Zhilin, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In Proceedings of International Conference on Learning Representations, Vancouver

work page 2018

[51] [51]

Yu, Lantao, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In The Thirty-First AAAI Conference, pages 2852--2858, California

work page 2017

[52] [52]

Zhang, Yizhe, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850

work page internal anchor Pith review Pith/arXiv arXiv 2017

[53] [53]

Zhu, Yaoming, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. arXiv preprint arXiv:1802.01886

work page internal anchor Pith review Pith/arXiv arXiv 2018
