Pith · machine review for the scientific record

arxiv: 2605.11128 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 theorem links


Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

Alfy Samuel, Amin Banayeeanzade, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Meisam Razaviyayn, Qingchuan Yang, Robin Jia, Sai Praneeth Karimireddy

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 03:59 UTC · model grok-4.3

classification 💻 cs.CL
keywords diversity collapse · LLM calibration · order miscalibration · shape miscalibration · decoding · sampling · validity-diversity trade-off · language model generation

The pith

Miscalibration in how LLMs rank and weight valid tokens causes diversity to collapse during generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diversity loss in language models is not fixed by changing sampling methods but stems from two specific flaws in the model's probability distribution at each decoding step. Order miscalibration means valid next tokens are not consistently ranked above invalid ones, forcing any cutoff to admit errors or miss good options. Shape miscalibration means probability mass piles up on a handful of valid continuations while a long tail mixes valid and invalid tokens, so staying valid requires narrowing the output space. These local problems compound over multiple steps to produce low variety in complete sequences. The authors demonstrate this pattern holds across 14 models of different families and sizes using tasks where the set of valid continuations is known exactly.
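The order-miscalibration trade-off described above can be made concrete with a toy single-step sweep (illustrative numbers of ours, not the paper's data): once valid and invalid tokens interleave in the probability ranking, every top-k cutoff trades recall of valid tokens against precision.

```python
# Hypothetical single-step token distribution, sorted by probability.
# Validity labels interleave, as they do under order miscalibration.
probs = [0.40, 0.25, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02]
valid = [True, True, False, True, False, True, False, True]

results = {}
for k in range(1, len(probs) + 1):      # sweep a top-k cutoff
    n_valid_kept = sum(valid[:k])
    precision = n_valid_kept / k        # fraction of kept tokens that are valid
    recall = n_valid_kept / sum(valid)  # fraction of valid tokens recovered
    results[k] = (precision, recall)

# No cutoff reaches precision 1.0 and recall 1.0 at once:
# small k drops valid tokens, large k admits invalid ones.
for k, (p, r) in results.items():
    print(f"top-{k}: precision={p:.2f} recall={r:.2f}")
```

Any rank-based rule on such a distribution sits somewhere on this frontier, which is the local form of the validity-diversity trade-off the review describes.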

Core claim

Diversity collapse arises because LLMs exhibit order miscalibration, where valid tokens are not reliably ranked above invalid ones, and shape miscalibration, where probability concentrates on few valid continuations amid heavy tails containing mixed validity; these failures at individual steps accumulate into large sequence-level diversity losses.

What carries the argument

The validity-diversity framework that decomposes diversity collapse into order calibration (ranking of valid over invalid tokens) and shape calibration (allocation of probability mass across valid continuations).
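The two calibration notions can be stated compactly. Transcribed from the paper's definitions (with $\mathcal{V}$ the set of valid next tokens at prefix $y_{<t}$, and $N(y_{<t} \circ v)$ the number of valid continuations beginning with $v$):

```latex
% Order calibration: every valid token is ranked above every invalid token.
p \text{ is order calibrated if }\quad
  p(v \mid y_{<t}) \;\ge\; p(w \mid y_{<t})
  \quad \text{for all } v \in \mathcal{V},\; w \notin \mathcal{V}.

% Shape calibration: mass on a valid token is proportional to the number
% of valid continuations that start with it. This is strictly stronger
% than order calibration.
p \text{ is shape calibrated if }\quad
  p(v \mid y_{<t}) \;\propto\; N(y_{<t} \circ v)
  \quad \text{for all } v \in \mathcal{V}.
```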

If this is right

  • Sampling rules that only use rank or probability thresholds cannot escape the validity-diversity trade-off created by miscalibration.
  • Fixing the underlying distribution calibration would allow higher diversity at the same validity level.
  • Local ranking and mass-allocation errors multiply across decoding steps to create the observed global collapse.
  • Tasks with fully known valid sets can isolate these distribution flaws from other generation issues.
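The compounding claim in the bullets above can be sketched with a back-of-envelope model (our simplification, assuming constant per-step retention and leakage rates; the paper's formal bound is more general): if every decoding step keeps a fraction of the valid branches and leaks some probability to invalid tokens, both effects decay exponentially in sequence length.

```python
# Back-of-envelope compounding model (our simplification, not the paper's
# formal bound): every step keeps a fraction `keep_valid` of the valid
# branches and leaks `invalid_mass` probability onto invalid tokens.

def sequence_level(keep_valid: float, invalid_mass: float, steps: int):
    validity = (1 - invalid_mass) ** steps   # P(whole sequence stays valid)
    diversity = keep_valid ** steps          # fraction of valid sequences reachable
    return validity, diversity

validity, diversity = sequence_level(keep_valid=0.9, invalid_mass=0.02, steps=50)
print(f"validity ~ {validity:.3f}, reachable diversity ~ {diversity:.5f}")
# Mild per-step losses already shrink the reachable valid space to
# well under 1% after 50 steps.
```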

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives that penalize incorrect ranking or over-concentration on few valid tokens might reduce the bottleneck without changing inference.
  • The same order and shape issues could limit diversity in non-text generation domains where validity is easier to define.
  • Post-training calibration methods targeting token-level ranking and tail behavior could be tested as a direct remedy.
  • Models that appear well-calibrated on standard benchmarks may still fail on diversity because those benchmarks do not separate order from shape errors.

Load-bearing premise

That the set of valid continuations can be defined objectively and exhaustively for the diagnostic tasks so that miscalibration effects can be measured separately from other factors.
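What such an objective, exhaustive valid set looks like can be sketched for a task in the spirit of the paper's constrained random-number diagnostic (hypothetical rule set of ours, not the authors' exact task definition):

```python
# Hypothetical validity oracle for a diagnostic task such as
# "output an integer between 1 and 100". The valid set is enumerable
# a priori and independent of any model's outputs or training data.

VALID_SET = {str(n) for n in range(1, 101)}

def is_valid(output: str) -> bool:
    """Objective, exhaustive membership test for the diagnostic task."""
    return output.strip() in VALID_SET

print(is_valid("42"), is_valid("0"), is_valid("forty-two"))
```

Because membership is decided by a fixed rule, any ranking or mass-allocation failure measured against this set can be attributed to the model's distribution rather than to ambiguity in the labels.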

What would settle it

A model that produces high sequence-level diversity on the controlled diagnostic tasks while keeping validity near the oracle cutoff level, without any post-hoc adjustment to its token probabilities.

Figures

Figures reproduced from arXiv: 2605.11128 by Alfy Samuel, Amin Banayeeanzade, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Meisam Razaviyayn, Qingchuan Yang, Robin Jia, Sai Praneeth Karimireddy.

Figure 1
Figure 1. (Left) The token distribution of a generation step from QWEN3.5-35B-A3B. The distribution is very sharp in the front, followed by a heavy tail with mixed valid and invalid tokens. As a result, (Right) many valid tokens are unlikely to appear in the output under any temperature sampling. †Tokens are subsampled non-uniformly for enhanced visualization. view at source ↗
Figure 2
Figure 2. (Left) We sweep the logits and cutoff thresholds at each conditional distribution to enumerate retained tokens up to a certain depth, followed by greedy decoding from each leaf. A judge model then evaluates the validity of the generated sequences, allowing us to attribute a validity label to each token. We then measure the number of valid/invalid tokens that were retained/dropped by the cutoff strategy to … view at source ↗
Figure 3
Figure 3. Local precision–recall trade-off when sweeping the cutoff in a single generation step. The extracted labels allow us to measure the precision–recall trade-off at a single decoding step to observe how valid tokens are distributed in the conditional. view at source ↗
Figure 4
Figure 4. Precision–recall trade-offs across Qwen-3, Llama-3, and Olmo-3 over 9 sizes and training stages. Evaluations are averaged over 3 random positions and queries. (Top) Average area under the precision–recall frontier. (Bottom) Average recall at precision 0.8. view at source ↗
Figure 5
Figure 5. Effects of temperature scaling in random number generation on … view at source ↗
Figure 7
Figure 7. Robustness of token-validity estimates to the continuation procedure. We compare local precision–recall curves computed from token labels obtained by greedy decoding after each forced candidate token with labels obtained from stochastic sampling. The resulting curves nearly overlap across cutoff indices, indicating that our estimated precision–recall trade-off is not sensitive to using greedy decoding as t… view at source ↗
Figure 8
Figure 8. Detailed view of Figure … view at source ↗
Figure 9
Figure 9. Validity–diversity trade-off frontiers for constrained and unconstrained random number … view at source ↗
Figure 10
Figure 10. Sequence probability for the random state generation task. Sequences are sorted by probability, … view at source ↗
Figure 11
Figure 11. Prompting GPT-5.5 to randomly name a city in the world. The vast majority of answers, with or without user chat history, collapses to “Valparaíso, Chile,” showing a strong collapse in diversity. view at source ↗
Figure 12
Figure 12. Sequence-level probability distribution for a coding task from LiveCodeBench […] view at source ↗
Figure 13
Figure 13. Logit fitting on Llama-3.1-8B-Instruct. view at source ↗
Figure 14
Figure 14. Logit fitting on Qwen3.5-35B-A3B. view at source ↗
Figure 15
Figure 15. Logit fitting on Olmo-3-7B-Instruct. view at source ↗
read the original abstract

Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity–diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, order calibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape calibration: probability mass is overly concentrated only on few valid continuations while having a heavy-tail of mixed valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a validity-diversity framework that decomposes LLM diversity collapse into order miscalibration (valid tokens not reliably ranked above invalid ones, forcing trade-offs in rank-based cutoffs) and shape miscalibration (probability mass overly concentrated on few valid continuations amid heavy tails of mixed tokens). It formalizes these mechanisms, shows local failures compound across decoding steps to produce sequence-level diversity losses, and supports the attribution via controlled diagnostics on tasks with exactly known valid sets plus oracle baselines. Experiments across 14 models from multiple families and scales conclude that the bottleneck is inherent to the LLM distribution's calibration rather than sampling heuristics.

Significance. If the central attribution holds, the work offers a diagnostic lens for why diversity remains limited despite advances in sampling, with potential to redirect efforts toward better-calibrated next-token distributions for applications like creative generation. Credit is due for the multi-model evaluation spanning scales and families, the use of oracle cutoffs as baselines, and the explicit decomposition into order and shape components that enables targeted probing.

major comments (2)
  1. [§4] §4 (Controlled Diagnostics): The claim that diversity collapse is isolated to order and shape miscalibration rests on tasks having 'exactly known valid sets' that are objective and exhaustive. However, the manuscript does not provide explicit verification that these partitions (e.g., syntactic or string-match rules) are defined independently of patterns in the models' training data; any correlation would confound the attribution of ranking failures and heavy tails to calibration rather than task artifacts. This is load-bearing for the conclusion that miscalibration—not heuristics—is the primary bottleneck.
  2. [§5] §5 (Empirical Results): The reported sequence-level diversity losses and cross-model patterns lack error bars, confidence intervals, or details on data-exclusion rules and run-to-run variability. Without these, it is difficult to evaluate the robustness of the claim that local miscalibration effects reliably compound across 14 models.
minor comments (2)
  1. [§3] Notation for the validity-diversity decomposition (around Eq. 3-5) could be clarified with an explicit table mapping symbols to their definitions to aid readers in following the compounding argument.
  2. [§5] Figure 2 (example generation traces) would benefit from annotations highlighting the exact points of order vs. shape miscalibration to make the local-to-global compounding more visually immediate.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below, indicating where revisions will be made and where limitations remain.

read point-by-point responses
  1. Referee: [§4] §4 (Controlled Diagnostics): The claim that diversity collapse is isolated to order and shape miscalibration rests on tasks having 'exactly known valid sets' that are objective and exhaustive. However, the manuscript does not provide explicit verification that these partitions (e.g., syntactic or string-match rules) are defined independently of patterns in the models' training data; any correlation would confound the attribution of ranking failures and heavy tails to calibration rather than task artifacts. This is load-bearing for the conclusion that miscalibration—not heuristics—is the primary bottleneck.

    Authors: The valid sets are constructed from objective, a priori rules that do not depend on model outputs or training patterns. In the syntactic task, validity is defined by formal grammar constraints (e.g., balanced parentheses, valid JSON syntax) specified independently of any corpus. In the string-match task, validity requires exact matching to a manually enumerated set of templates. These definitions are exhaustive by construction and predate any model evaluation. Oracle baselines further isolate miscalibration by demonstrating that perfect order and shape calibration recovers full diversity. We will add an appendix subsection in the revision that explicitly lists the rule definitions for each task and argues their independence from training-data patterns. A quantitative overlap analysis with proprietary training corpora is not feasible for all 14 models. revision: partial

  2. Referee: [§5] §5 (Empirical Results): The reported sequence-level diversity losses and cross-model patterns lack error bars, confidence intervals, or details on data-exclusion rules and run-to-run variability. Without these, it is difficult to evaluate the robustness of the claim that local miscalibration effects reliably compound across 14 models.

    Authors: We agree that statistical details on variability are needed. The revised manuscript will report standard deviations across five independent runs for all sequence-level metrics, add 95% confidence intervals to the cross-model plots, document prompt sampling and exclusion criteria (e.g., discarding prompts with zero valid tokens), and include a variability analysis in the appendix. These additions will directly support the claim that local effects compound reliably. revision: yes

standing simulated objections not resolved
  • Full empirical verification that valid-set partitions have zero correlation with patterns in the proprietary training data of all 14 evaluated models.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a validity-diversity framework defined from first principles that decomposes diversity collapse into order and shape miscalibration, using externally specified valid sets and oracle baselines for controlled diagnostics. Empirical measurements across 14 models are obtained by applying these independent partitions and baselines rather than by fitting parameters whose outputs are then renamed as predictions. No equations or central claims reduce by construction to quantities defined inside the same experiment, and no load-bearing steps invoke self-citations or uniqueness theorems that would force the result. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that valid and invalid continuations can be exhaustively enumerated for diagnostic tasks; no free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Validity of token continuations can be objectively defined for the chosen diagnostic tasks.
    Required to separate valid from invalid tokens when measuring order and shape calibration.

pith-pipeline@v0.9.0 · 5576 in / 1207 out tokens · 77994 ms · 2026-05-13T03:59:14.766433+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

  1. [1]

    A learning algorithm for Boltzmann machines

    David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985

  2. [2]

    Jointly measuring diversity and quality in text generation models

    Danial Alihosseini, Ehsan Montahaei, and Mahdieh Soleymani Baghshah. Jointly measuring diversity and quality in text generation models. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 90–98, 2019

  3. [3]

    Epsvec: Efficient and private synthetic data generation via dataset vectors, 2026

    Amin Banayeeanzade, Qingchuan Yang, Deqing Fu, Spencer Hong, Erin Babinsky, Alfy Samuel, Anoop Kumar, Robin Jia, and Sai Praneeth Karimireddy. Epsvec: Efficient and private synthetic data generation via dataset vectors, 2026

  4. [4]

    Mirostat: A neural text decoding algorithm that directly controls perplexity

    Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, and Lav R. Varshney. Mirostat: A neural text decoding algorithm that directly controls perplexity. In International Conference on Learning Representations, 2021

  5. [5]

    Softmax bottleneck makes language models unable to represent multi-mode word distributions

    Haw-Shiuan Chang and Andrew McCallum. Softmax bottleneck makes language models unable to represent multi-mode word distributions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8048–8073, 2022

  6. [6]

    Dlcrec: A novel approach for managing diversity in llm-based recommender systems

    Jiaju Chen, Chongming Gao, Shuai Yuan, Shuchang Liu, Qingpeng Cai, and Peng Jiang. Dlcrec: A novel approach for managing diversity in llm-based recommender systems. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, pages 857–865, 2025

  7. [7]

    Modifying large language model post-training for diverse creative writing

    John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, and Max Kreminski. Modifying large language model post-training for diverse creative writing. In Second Conference on Language Modeling, 2025

  8. [8]

    Hierarchical neural story generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, 2018

  9. [9]

    Closing the curious case of neural text degeneration

    Matthew Finlayson, John Hewitt, Alexander Koller, Swabha Swayamdipta, and Ashish Sabharwal. Closing the curious case of neural text degeneration. In The Twelfth International Conference on Learning Representations, 2024

  10. [10]

    The variance paradox: How ai reduces diversity but increases novelty, 2026

    Bijean Ghafouri. The variance paradox: How ai reduces diversity but increases novelty, 2026

  11. [11]

    A survey on llm-as-a-judge. The Innovation, page 101253, 2026

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Zhouchi Lin, Bowen Zhang, Lionel Ni, Wen Gao, Yuanzhuo Wang, and Jian Guo. A survey on llm-as-a-judge. The Innovation, page 101253, 2026

  12. [12]

    Benchmarking linguistic diversity of large language models. Transactions of the Association for Computational Linguistics, 13:1507–1526, 2025

    Yanzhu Guo, Guokan Shang, and Chloé Clavel. Benchmarking linguistic diversity of large language models. Transactions of the Association for Computational Linguistics, 13:1507–1526, 2025

  13. [13]

    Truncation sampling as language model desmoothing

    John Hewitt, Christopher Manning, and Percy Liang. Truncation sampling as language model desmoothing. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3414–3427, 2022

  14. [14]

    Diversity and evenness: A unifying notation and its consequences

    M. O. Hill. Diversity and evenness: A unifying notation and its consequences. Ecology, 54(2):427–432, 1973

  15. [15]

    The curious case of neural text degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020

  16. [16]

    Creative preference optimization

    Mete Ismayilzada, Antonio Laverghetta Jr., Simone A. Luchini, Reet Patel, Antoine Bosselut, Lonneke Van Der Plas, and Roger E. Beaty. Creative preference optimization. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9580–9609, 2025

  17. [17]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025

  18. [18]

    Artificial hivemind: The open-ended homogeneity of language models (and beyond)

    Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, and Yejin Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond). In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  19. [19]

    Entropy and diversity. Oikos, 113(2):363–375, 2006

    Lou Jost. Entropy and diversity. Oikos, 113(2):363–375, 2006

  20. [20]

    Reasoning with sampling: Your base model is smarter than you think

    Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. In The Fourteenth International Conference on Learning Representations, 2026

  21. [21]

    Evaluating the quality of randomness and entropy in tasks supported by large language models, 2025

    Rabimba Karanjai, Yang Lu, Ranjith Chodavarapu, Lei Xu, and Weidong Shi. Evaluating the quality of randomness and entropy in tasks supported by large language models, 2025

  22. [22]

    Understanding the effects of RLHF on LLM generalisation and diversity

    Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of RLHF on LLM generalisation and diversity. In The Twelfth International Conference on Learning Representations, 2024

  23. [23]

    Content analysis: An introduction to its methodology

    Klaus Krippendorff. Content analysis: An introduction to its methodology. 1980

  24. [24]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  25. [25]

    Jointly reinforcing diversity and quality in language model generations, 2025

    Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations, 2025

  26. [26]

    Preserving diversity in supervised fine-tuning of large language models

    Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, and Ruoyu Sun. Preserving diversity in supervised fine-tuning of large language models. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    Locally typical sampling

    Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11, 2023

  28. [28]

    Turning up the heat: Min-p sampling for creative and coherent LLM outputs

    Nguyen Nhat Minh, Andrew Baker, Clement Neo, Allen G Roush, Andreas Kirsch, and Ravid Shwartz-Ziv. Turning up the heat: Min-p sampling for creative and coherent LLM outputs. In The Thirteenth International Conference on Learning Representations, 2025

  29. [29]

    String seed of thought: Prompting LLMs for distribution-faithful and diverse generation

    Kou Misaki and Takuya Akiba. String seed of thought: Prompting LLMs for distribution-faithful and diverse generation. In The Fourteenth International Conference on Learning Representations, 2026

  30. [30]

    One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity

    Sonia Krishna Murthy, Tomer Ullman, and Jennifer Hu. One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025

  31. [31]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002

  32. [32]

    Mind the gap: Conformative decoding to improve output diversity of instruction-tuned large language models, 2025

    Max Peeperkorn, Tom Kouwenhoven, Dan Brown, and Anna Jordanous. Mind the gap: Conformative decoding to improve output diversity of instruction-tuned large language models, 2025

  33. [33]

    Top-h decoding: Adapting the creativity and coherence with bounded entropy in text generation

    Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, and Massoud Pedram. Top-h decoding: Adapting the creativity and coherence with bounded entropy in text generation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  34. [34]

    Min-p, max exaggeration: A critical analysis of min-p sampling in language models, 2025

    Rylan Schaeffer, Joshua Kazdan, and Yegor Denisov-Blanch. Min-p, max exaggeration: A critical analysis of min-p sampling in language models, 2025

  35. [35]

    Evaluating the diversity and quality of LLM generated content

    Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, and Osbert Bastani. Evaluating the diversity and quality of LLM generated content. In Second Conference on Language Modeling, 2025

  36. [36]

    The shrinking landscape of linguistic diversity in the age of large language models, 2025

    Zhivar Sourati, Farzan Karimi-Malekabadi, Meltem Ozcan, Colin McDaniel, Alireza Ziabari, Jackson Trager, Ala Tak, Meng Chen, Fred Morstatter, and Morteza Dehghani. The shrinking landscape of linguistic diversity in the age of large language models, 2025

  37. [37]

    Contrastive search is what you need for neural text generation

    Yixuan Su and Nigel Collier. Contrastive search is what you need for neural text generation. Transactions on Machine Learning Research, 2023

  38. [38]

    Control the temperature: Selective sampling for diverse and high-quality LLM outputs

    Sergey Troshin, Wafaa Mohammed, Yan Meng, Christof Monz, Antske Fokkens, and Vlad Niculae. Control the temperature: Selective sampling for diverse and high-quality LLM outputs. In Second Conference on Language Modeling, 2025

  39. [39]

    Shared nature, unique nurture: Prism for pluralistic reasoning via in-context structure modeling, 2026

    Guancheng Tu, Shiyang Zhang, Tianyu Zhang, Yi Zhang, and Diji Yang. Shared nature, unique nurture: Prism for pluralistic reasoning via in-context structure modeling, 2026

  40. [40]

    Diverse beam search for improved description of complex scenes

    Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search for improved description of complex scenes. Proceedings of the AAAI Conference on Artificial Intelligence, Apr. 2018

  41. [41]

    Multilingual prompting for improving LLM generation diversity

    Qihan Wang, Shidong Pan, Tal Linzen, and Emily Black. Multilingual prompting for improving LLM generation diversity. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6367–6389, 2025

  42. [42]

    Optimizing diversity and quality through base-aligned model collaboration, 2025

    Yichen Wang, Chenghao Yang, Tenghao Huang, Muhao Chen, Jonathan May, and Mina Lee. Optimizing diversity and quality through base-aligned model collaboration, 2025

  43. [43]

    Base models beat aligned models at randomness and creativity

    Peter West and Christopher Potts. Base models beat aligned models at randomness and creativity. In Second Conference on Language Modeling, 2025

  44. [44]

    Echoes in ai: Quantifying lack of plot diversity in llm outputs. Proceedings of the National Academy of Sciences, 2025

    Weijia Xu, Nebojsa Jojic, Sudha Rao, Chris Brockett, and Bill Dolan. Echoes in ai: Quantifying lack of plot diversity in llm outputs. Proceedings of the National Academy of Sciences, 2025

  45. [45]

    Llm probability concentration: How alignment shrinks the generative horizon, 2026

    Chenghao Yang, Sida Li, and Ari Holtzman. Llm probability concentration: How alignment shrinks the generative horizon, 2026

  46. [46]

    Generation space size: Understanding and calibrating open-endedness of llm generations, 2025

    Sunny Yu, Ahmad Jabbar, Robert Hawkins, Dan Jurafsky, and Myra Cheng. Generation space size: Understanding and calibrating open-endedness of llm generations, 2025

  47. [47]

    Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  48. [48]

    Verbalized sampling: How to mitigate mode collapse and unlock llm diversity, 2025

    Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, and Weiyan Shi. Verbalized sampling: How to mitigate mode collapse and unlock llm diversity, 2025

  49. [49]

    Noveltybench: Evaluating creativity and diversity in language models

    Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen Liu, Xinyue Liu, Vinay Samuel, Barry Wang, and Daphne Ippolito. Noveltybench: Evaluating creativity and diversity in language models. In Second Conference on Language Modeling, 2025

  50. [50]

    Balancing diversity and risk in LLM sampling: How to select your method and parameter for open-ended text generation

    Yuxuan Zhou, Margret Keuper, and Mario Fritz. Balancing diversity and risk in LLM sampling: How to select your method and parameter for open-ended text generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26352–26365, 2025

  51. [51]

    Improving open-ended text generation via adaptive decoding

    Wenhong Zhu, Hongkun Hao, Zhiwei He, Yiming Ai, and Rui Wang. Improving open-ended text generation via adaptive decoding. In Forty-first International Conference on Machine Learning, 2024
