pith. sign in

arxiv: 2509.02510 · v2 · submitted 2025-09-02 · 💻 cs.CL · cs.AI· stat.ML

Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation

Pith reviewed 2026-05-18 19:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AIstat.ML
keywords top-H decodingentropy constrained samplingLLM text generationcreative writingmin-p samplingcoherencediversitydecoding strategy
0
0 comments X

The pith

Top-H decoding is a greedy algorithm that solves an entropy-constrained mass maximization problem to balance creativity and coherence in LLM text generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the balance between diversity and coherence in open-ended text generation as an entropy-constrained minimum divergence problem. It proves equivalence to an entropy-constrained mass maximization problem and introduces top-H decoding as a computationally efficient greedy algorithm to solve it. This method better uses the full probability distribution compared to heuristics like min-p sampling. A sympathetic reader would care because it shows measurable gains in creative writing tasks while remaining robust for question answering.

Core claim

Top-H decoding, a greedy algorithm for the NP-hard entropy-constrained mass maximization problem, which is equivalent to entropy-constrained minimum divergence, outperforms min-p sampling by up to 25.63% on creative writing benchmarks and maintains performance on datasets like GPQA, GSM8K, and MT-Bench.

What carries the argument

Top-H decoding: a greedy algorithm that iteratively selects the minimal set of tokens achieving a target entropy bound while maximizing total probability mass.

Load-bearing premise

The equivalence between the entropy-constrained minimum divergence problem and the entropy-constrained mass maximization problem holds true, and the greedy top-H algorithm provides a good enough solution to the NP-hard problem.

What would settle it

A controlled experiment on the same creative writing benchmarks where min-p sampling achieves equal or better scores than top-H, or where coherence drops at high temperatures, would disprove the performance advantage.

Figures

Figures reproduced from arXiv: 2509.02510 by Erfan Baghaei Potraghloo, Massoud Pedram, Seyedarmin Azizi, Souvik Kundu.

Figure 1
Figure 1. Figure 1: Probability distribution of two different types with [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a)-(c): Length-controlled win rates (%) comparison of different SoTA sampling with top-H [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of T scaling on genera￾tion coherence in min-p, top-p vs top-H. Impact of α parameter. The only hyperparameter in top￾H sampling is α, which directly controls the maximum allowable entropy for the distribution q. As such, care￾ful tuning of α is essential. To determine an appropriate value, we randomly select 50 development samples from the Alpaca-Eval dataset and use LLaMA3.1–8B–Instruct to generat… view at source ↗
Figure 4
Figure 4. Figure 4: Effect of the parameter α on creativity and coherence. Empirical optimality of the top-H decoding strategy. We now empirically evaluate the competitiveness of the greedy algorithm of the top-H relative to the optimal solution of the ECMM problem, found by exhaustive search. We randomly sample 20 prompts from the Alpaca-Eval dataset and generate responses using the top-H method. At each generation step, the… view at source ↗
Figure 5
Figure 5. Figure 5: Empirical evaluation of top-H performance relative to the optimal solu￾tion of the ECMM problem. For comparison, we also compute the greedy solution S g using Algorithm 1. At each generation step, we calculate the ratio ΓSg /ΓS∗ , and report the mean and variance of this ratio (across different generation steps) for 20 dif￾ferent evaluation prompts, as visualized in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Probability distribution of two different types with associated top-H thresholds. [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
read the original abstract

Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-\$p\$ (nucleus) sampling, and min-\$p\$ sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in the effective incorporation of the confidence of the model into the corresponding sampling strategy. For example, min-\$p\$ sampling relies on a single top token as a heuristic for confidence, eventually underutilizing the information of the probability distribution. Toward effective incorporation of the confidence of the model, in this paper, we present **top-H** decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an **entropy-constrained minimum divergence** problem. We then prove this minimization problem to be equivalent to an **entropy-constrained mass maximization** (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-\$p\$ sampling by up to **25.63%** on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an *LLM-as-judge* evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be *easily integrated* into creative writing applications. The code is available at https://github.com/ErfanBaghaei/Top-H-Decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces top-H decoding, a greedy algorithm for open-ended text generation in LLMs. It formulates an entropy-constrained minimum divergence problem, proves equivalence to the NP-hard entropy-constrained mass maximization (ECMM) problem, and claims the greedy top-H solves ECMM efficiently. Experiments report up to 25.63% gains over min-p sampling on creative writing benchmarks, with maintained robustness on GPQA, GSM8K, and MT-Bench, plus LLM-as-judge confirmation of coherence at higher temperatures.

Significance. If the equivalence holds and the greedy algorithm reliably approximates the ECMM optimum, the work could supply a more principled, distribution-aware truncation method that improves the creativity-coherence tradeoff beyond heuristic approaches like min-p. Code availability and the empirical scale of the reported gains would be clear strengths supporting adoption in creative applications.

major comments (2)
  1. [Abstract and theoretical foundation] Abstract: the asserted proof of equivalence between the entropy-constrained minimum divergence problem and the ECMM problem, together with the NP-hardness claim, is stated without any derivation steps, theorem statements, or equations. This equivalence is load-bearing for the claim that top-H solves the underlying optimization problem rather than being an ad-hoc truncation rule.
  2. [Method] Method section describing the top-H greedy algorithm: no approximation ratio, worst-case guarantee, or comparison against exact ECMM solutions on small instances is supplied. Without such analysis the 25.63% empirical advantage cannot be confidently attributed to better mass maximization under the entropy bound, directly engaging the concern that performance may arise from any other truncation heuristic.
minor comments (2)
  1. [Experiments] Experiments: report the number of runs, variance, and exact hyper-parameter controls used for the creative-writing and QA benchmarks to allow reproduction of the quantitative deltas.
  2. [Method] Notation: clarify whether the entropy bound is enforced exactly or approximately in the greedy selection rule, and provide the precise stopping criterion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we intend to make to strengthen the presentation of the theoretical results and the analysis of the algorithm.

read point-by-point responses
  1. Referee: [Abstract and theoretical foundation] Abstract: the asserted proof of equivalence between the entropy-constrained minimum divergence problem and the ECMM problem, together with the NP-hardness claim, is stated without any derivation steps, theorem statements, or equations. This equivalence is load-bearing for the claim that top-H solves the underlying optimization problem rather than being an ad-hoc truncation rule.

    Authors: We agree that the abstract, constrained by length, summarizes the equivalence and NP-hardness result at a high level without including derivation steps or equations. The full formulation of the entropy-constrained minimum divergence problem, the proof of equivalence to the ECMM problem, and the NP-hardness argument appear in the Method section of the manuscript. To address the concern that the claim appears unsupported in the abstract, we will revise the abstract to include a concise statement of the key equivalence result. This change will make explicit that top-H is derived from the optimization problem. revision: yes

  2. Referee: [Method] Method section describing the top-H greedy algorithm: no approximation ratio, worst-case guarantee, or comparison against exact ECMM solutions on small instances is supplied. Without such analysis the 25.63% empirical advantage cannot be confidently attributed to better mass maximization under the entropy bound, directly engaging the concern that performance may arise from any other truncation heuristic.

    Authors: The referee correctly identifies the absence of approximation guarantees or small-scale exact comparisons in the current manuscript. Because the ECMM problem is NP-hard, exact solutions are only tractable for small vocabularies. We will add a new subsection with experiments on synthetic small instances, comparing the greedy top-H selections against exact ECMM optima obtained via exhaustive search or dynamic programming. We will also include a brief discussion of the greedy algorithm's relation to known approximation techniques for constrained submodular maximization and acknowledge that a formal worst-case ratio is not derived. These additions will help link the observed performance gains more directly to the mass-maximization objective under the entropy constraint. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is a fresh optimization framing

full rationale

The paper introduces an entropy-constrained minimum divergence problem, proves its mathematical equivalence to the ECMM problem, and proposes a greedy top-H algorithm as an efficient heuristic for the NP-hard ECMM. This chain relies on new definitions and a proof rather than reducing any claimed result to fitted parameters, self-citations, or prior equations by construction. Empirical performance claims are evaluated separately on benchmarks and do not feed back into the derivation. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unshown proof that the minimum-divergence problem equals the mass-maximization problem and on the assumption that a greedy procedure suffices for the NP-hard task.

axioms (1)
  • domain assumption Entropy-constrained minimum divergence problem is equivalent to entropy-constrained mass maximization problem
    Stated as proved in the paper and used as the basis for the algorithm.

pith-pipeline@v0.9.0 · 5891 in / 1220 out tokens · 39184 ms · 2026-05-18T19:20:22.073064+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

    cs.CL 2026-02 unverdicted novelty 7.0

    Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

  2. [2]

    A learning algorithm for boltzmann machines

    David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9 0 (1): 0 147--169, 1985

  3. [3]

    Mirostat: A neural text decoding algorithm that directly controls perplexity

    Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, and Lav R Varshney. Mirostat: A neural text decoding algorithm that directly controls perplexity. arXiv preprint arXiv:2007.14966, 2020

  4. [4]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  5. [5]

    Lm evaluation harness, 2023

    EleutherAI. Lm evaluation harness, 2023. URL https://github.com/EleutherAI/lm-evaluation-harness. Version 0.4.5

  6. [6]

    Hierarchical Neural Story Generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833, 2018

  7. [7]

    Garey and David S

    Michael R. Garey and David S. Johnson. Computers and intractability: A guide to the theory of np-completeness. In W.H. Freeman. W. H. Freeman and Company, 1979. Problem SP13: Cardinality-Constrained Subset Sum is NP-complete

  8. [8]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  9. [9]

    Truncation sampling as language model desmoothing

    John Hewitt, Christopher D Manning, and Percy Liang. Truncation sampling as language model desmoothing. arXiv preprint arXiv:2210.15191, 2022

  10. [10]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019

  11. [11]

    Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities

    Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI conference on human factors in computing systems, pp.\ 1--19, 2022

  12. [12]

    Hashimoto

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

  13. [13]

    N., Baker, A., Neo, C., Roush, A., Kirsch, A., and Shwartz-Ziv, R

    Minh Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, and Ravid Shwartz-Ziv. Turning up the heat: Min-p sampling for creative and coherent llm outputs. arXiv preprint arXiv:2407.01082, 2024

  14. [14]

    GPT-4o System Card

    OpenAI. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. URL https://arxiv.org/abs/2410.21276

  15. [15]

    Papadimitriou

    Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, Reading, Massachusetts, 1994. ISBN 978-0-201-53082-7

  16. [16]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  17. [17]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

  18. [18]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36: 0 46595--46623, 2023

  19. [19]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...