The Curious Case of Neural Text Degeneration
Pith reviewed 2026-05-12 06:13 UTC · model grok-4.3
The pith
Nucleus sampling draws from the dynamic high-probability set to generate more diverse and coherent text than beam search or top-k methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that the quality of text generated by a fixed neural language model depends heavily on the decoding procedure. Human text and machine text exhibit different probability distributions, with machine outputs often assigning overly high probability to repetitive tokens. The central contribution is nucleus sampling: at each step the model forms the nucleus, the smallest set of tokens whose probabilities sum to at least p (commonly 0.9), then samples the next token from within that nucleus. This procedure yields text with greater lexical diversity and coherence than greedy decoding, beam search, or fixed top-k sampling, while avoiding the blandness that results from always choosing the most probable token.
What carries the argument
Nucleus sampling: the procedure that, at each generation step, identifies the smallest set of tokens whose cumulative probability meets or exceeds a threshold p and draws the next token from the model's distribution renormalized over that set (not uniformly), thereby truncating the low-probability tail.
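The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `probs` is assumed to be a plain list holding one step's next-token probabilities, and the function names `nucleus` and `sample_nucleus` are hypothetical.

```python
import random

def nucleus(probs, p=0.9):
    """Return (token_ids, renormalized_probs) for the smallest top-p set.

    `probs` is assumed to be a list of next-token probabilities summing to 1.
    """
    # Visit token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:  # cumulative mass has reached the threshold
            break
    # Rescale the kept probabilities so they sum to 1 again.
    return kept, [probs[i] / total for i in kept]

def sample_nucleus(probs, p=0.9, rng=random):
    """Draw one token id in proportion to its renormalized probability."""
    kept, weights = nucleus(probs, p)
    return rng.choices(kept, weights=weights, k=1)[0]

step_probs = [0.5, 0.3, 0.15, 0.04, 0.01]
print(nucleus(step_probs, p=0.9)[0])                      # ids kept in the nucleus
print(sample_nucleus(step_probs, 0.9, random.Random(0)))  # one sampled id
```

With the example distribution, the two lowest-probability tokens fall outside the nucleus and can never be sampled, while the three remaining tokens keep their relative odds.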
Load-bearing premise
The learned probability distribution places lower-quality tokens in the tail, so removing that tail improves rather than harms the generated text.
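The truncation this premise relies on can be written out formally. The notation below is an assumption made for this sketch: V is the vocabulary, x_{1:i-1} the tokens generated so far, and P the model's next-token distribution. The nucleus is built by taking tokens in descending order of probability.

```latex
\begin{align}
V^{(p)} &= \operatorname*{arg\,min}_{V' \subseteq V} |V'|
  \quad \text{s.t.} \quad \sum_{x \in V'} P(x \mid x_{1:i-1}) \ge p, \\
p' &= \sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}), \\
P'(x \mid x_{1:i-1}) &=
  \begin{cases}
    P(x \mid x_{1:i-1}) / p' & \text{if } x \in V^{(p)}, \\
    0 & \text{otherwise.}
  \end{cases}
\end{align}
```

Every token outside V^{(p)} receives probability zero, so the premise amounts to saying that the discarded mass 1 - p' is mostly made up of tokens a human reader would judge to be poor continuations.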
What would settle it
Human raters scoring nucleus-sampled continuations as less diverse or less coherent than continuations produced by ancestral sampling or carefully tuned top-k sampling on the same model and prompts.
Original abstract
Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically effect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes that neural language models produce degenerate text (repetitive and bland) under standard decoding methods such as greedy search and beam search, despite strong performance on likelihood-based training objectives. It documents distributional differences between human-written text and model-generated text, then introduces Nucleus Sampling: at each step, tokens are sampled from the smallest set whose cumulative probability mass exceeds a threshold p, thereby truncating the unreliable tail while preserving diversity. Controlled experiments on the same models across datasets compare this method against greedy, beam, top-k, and other baselines using both automatic diversity metrics and human judgments of fluency, coherence, and quality.
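The contrast between a fixed top-k cutoff and the dynamic nucleus can be made concrete with a small sketch. The two distributions below are invented for illustration only, and the helper names `top_k_ids` and `nucleus_ids` are hypothetical.

```python
def top_k_ids(probs, k):
    """Ids of the k most probable tokens, regardless of their mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def nucleus_ids(probs, p):
    """Ids of the smallest top set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

# Illustrative (made-up) next-token distributions.
peaked = [0.92, 0.03, 0.02, 0.01, 0.01, 0.01]  # one obvious continuation
flat = [0.125] * 8                             # many equally plausible ones

for name, dist in (("peaked", peaked), ("flat", flat)):
    print(name, len(top_k_ids(dist, k=3)), len(nucleus_ids(dist, p=0.9)))
```

Top-k keeps three candidates in both cases, admitting two unlikely tokens for the peaked distribution and discarding five plausible ones for the flat distribution. The nucleus instead shrinks to a single token in the first case and grows to all eight in the second.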
Significance. If the empirical results hold, the work is significant for open-ended neural text generation. It supplies a simple, parameter-light decoding rule that demonstrably improves human-judged output quality and diversity over widely used baselines, without requiring changes to model training. The controlled experimental design (identical models, multiple datasets, both automatic and human evaluation) provides reproducible evidence that decoding strategy alone can substantially affect generation quality.
minor comments (3)
- Abstract and §3: the phrase 'dynamic nucleus of the probability distribution' is introduced without an immediate formal definition or reference to the precise cumulative-probability rule; a one-sentence definition at first use would improve readability.
- Evaluation sections: human judgments are reported on a moderate scale and some automatic metrics are heuristic; adding error bars or statistical significance tests for the human ratings would strengthen the presentation without altering the central claim.
- Figure captions and tables: several plots compare multiple decoding strategies but lack explicit indication of which model size or dataset each panel corresponds to; consistent labeling would aid quick comprehension.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, as well as the recommendation for minor revision. The report correctly identifies the core issues with standard decoding methods and the benefits of nucleus sampling for improving diversity and quality in neural text generation.
Circularity Check
No significant circularity detected
full rationale
The paper's central contribution is an empirical analysis of distributional differences between human and machine text, followed by the definition of nucleus sampling as a direct function of the model's softmax probabilities (the smallest set of tokens whose cumulative probability mass exceeds threshold p). This definition contains no fitted parameters derived from the target evaluation metrics, no self-referential equations, and no load-bearing self-citations. Quality and diversity improvements are measured with independent human judgments and automatic metrics (e.g., distinct-n, self-BLEU) that are not algebraically entailed by the sampling rule itself. The derivation chain is therefore self-contained against external benchmarks.
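To see why such a metric is not entailed by the sampling rule, it helps to look at how a diversity score of the distinct-n kind is computed. The sketch below assumes the common definition (unique n-grams divided by total n-grams) and whitespace tokenization; the paper's exact metric definitions may differ, and `distinct_n` is a hypothetical helper name.

```python
def distinct_n(tokens, n):
    """Fraction of n-grams in `tokens` that are unique (0.0 if there are none)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

repetitive = "the cat sat the cat sat the cat sat".split()
varied = "the cat sat on a warm mat today".split()

print(distinct_n(repetitive, 2))  # 3 unique bigrams out of 8
print(distinct_n(varied, 2))      # all 7 bigrams are unique
```

The score depends only on the generated token sequence, so any decoding method, including nucleus sampling, can score high or low on it.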
Axiom & Free-Parameter Ledger
free parameters (1)
- p (nucleus probability threshold)
axioms (1)
- domain assumption: The neural language model's softmax probabilities meaningfully rank token quality for generation.
Forward citations
Cited by 43 Pith papers
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
PAL: Program-aided Language Models
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
HellaSwag: Can a Machine Really Finish Your Sentence?
HellaSwag dataset shows state-of-the-art models fail commonsense inference tasks that humans solve easily, built via adversarial filtering of distractors.
-
BOOKMARKS: Efficient Active Storyline Memory for Role-playing
BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
Ex Ante Evaluation of AI-Induced Idea Diversity Collapse
Frontier LLMs generate creative ideas with excess population-level crowding below human-relative parity across tasks, but targeted generation protocols can reduce it.
-
CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers
CP-SynC uses coordinated LLM agents to generate, validate via synthesized checkers, and select MiniZinc models from natural language, substantially outperforming baselines on a 100-problem benchmark.
-
Toward a Principled Framework for Agent Safety Measurement
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
-
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Distinct Leaf Enumeration (DLE) replaces stochastic self-consistency sampling with deterministic traversal of a truncated decoding tree to enumerate distinct leaves, increasing coverage and reducing redundant computat...
-
Post-Selection Distributional Model Evaluation
PS-DME is a new framework that controls post-selection false coverage rate for distributional KPI estimates via e-values and is provably more sample-efficient than data splitting under explicit conditions.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection
TextSeal provides a localized, distortion-free LLM watermark that enables provenance tracking and distillation detection while preserving performance and text quality.
-
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
-
Adversarial SQL Injection Generation with LLM-Based Architectures
RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.
-
Annotations Mitigate Post-Training Mode Collapse
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
-
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective
Greedy decoding is optimal for VQA under derived calibration conditions and outperforms stochastic sampling on benchmarks.
-
On the Importance and Evaluation of Narrativity in Natural Language AI Explanations
XAI explanations should be narratives with continuous structure, cause-effect, fluency and diversity, and new metrics are needed to evaluate this better than standard NLP scores.
-
Learning to Control Summaries with Score Ranking
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
-
Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models
Reward-weighted classifier-free guidance approximates Q-function policy improvement in autoregressive models, enabling test-time reward optimization and faster RL convergence via distillation.
-
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard detects attention collapse loops during LLM decoding and prunes repetitive KV cache tail spans under fixed budget, cutting loop incidence by over 90 percentage points on the new LoopBench benchmark.
-
Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2
ABPR uses LLM-generated programs debugged through Prolog SLD proof traces to reach 56.67% Pass@2 with Gemini-3-Flash and 98.33% with GPT-5.5 xHigh on ARC-AGI-2.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
-
Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition
An AST pattern-matching prototype with a custom DSL achieves 0.74 average F1-score on a BigCloneEval subset, outperforming CodeLlama (0.35) and code clone detectors (best recall 0.20).
-
Diversity in Large Language Models under Supervised Fine-Tuning
Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
-
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
-
DORA Explorer: Improving the Exploration Ability of LLMs Without Training
DORA Explorer boosts LLM agent exploration without training by ranking diverse actions using log-probabilities and a tunable parameter, yielding UCB-competitive results on multi-armed bandits and gains on text adventu...
-
Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.
-
Lighting Up or Dimming Down? Exploring Dark Patterns of LLMs in Co-Creativity
Sycophancy appears in 91.7% of LLM responses during co-creative writing tasks, especially on sensitive topics, while anchoring varies by literary form and is most common in folktales.
-
From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages
LLM-based POS tagging outperforms traditional taggers on medieval Occitan, Catalan, and French, with fine-tuning and cross-lingual transfer providing the largest gains for under-resourced varieties.
-
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Zero-shot prompting reaches 59% accuracy at moderate temperatures while chain-of-thought prompting excels at temperature extremes on Olympiad-level math problems, with extended reasoning gains scaling to 14.3x at high...
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Original paper: published as a conference paper at ICLR 2020.