pith. machine review for the scientific record.

arxiv: 2505.15134 · v1 · pith:PT6R2GSCnew · submitted 2025-05-21 · 💻 cs.LG · cs.AI

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Pith reviewed 2026-05-18 15:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords entropy minimization · LLM reasoning · unlabeled data · reinforcement learning · inference-time optimization · math reasoning · coding benchmarks · SciCode

The pith

Entropy minimization on a model's own outputs improves LLM reasoning without any labeled data and, in its inference-time form, without any parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that entropy minimization alone can substantially raise large language model performance on math, physics, and coding problems by concentrating probability on the model's most confident generations. This is tested through three concrete methods: token-level fine-tuning on self-generated unlabeled data, reinforcement learning that uses negative entropy as the sole reward, and inference-time logit adjustment with no training at all. On Qwen-7B the reinforcement-learning version matches or surpasses strong RL baselines such as GRPO and RLOO that are trained on 60,000 labeled examples. The inference-time version lets Qwen-32B match or beat GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the SciCode benchmark while being 3x more efficient than self-consistency and sequential refinement. The work therefore claims that substantial reasoning ability already exists inside many pretrained models and can be surfaced simply by reducing predictive uncertainty.

Core claim

Entropy minimization trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' performance on challenging math, physics, and coding tasks. We explore three approaches: EM-FT minimizes token-level entropy similarly to instruction finetuning but on unlabeled outputs drawn from the model; EM-RL uses reinforcement learning with negative entropy as the only reward; and EM-INF performs inference-time logit adjustment to reduce entropy without training or updates. On Qwen-7B, EM-RL without labeled data achieves comparable or better performance than GRP

What carries the argument

Entropy minimization, the objective that concentrates probability mass on the model's own highest-confidence outputs to elicit latent reasoning.
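
As a concrete reading of "concentrating probability mass", the sketch below computes the Shannon entropy of a next-token distribution and shows that a sharper distribution has lower entropy. It is a minimal illustration, not the paper's implementation: the four-token vocabulary, the logit values, and the choice of doubling the logits as the sharpening step are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token logits over a 4-token vocabulary (illustrative only).
logits = [2.0, 1.0, 0.5, 0.1]
base = softmax(logits)

# Assumption: doubling the logits stands in for "concentrating probability mass".
sharp = softmax([2.0 * z for z in logits])

uniform_entropy = entropy([0.25] * 4)  # maximum possible entropy for 4 tokens
base_entropy = entropy(base)
sharp_entropy = entropy(sharp)

print(f"uniform: {uniform_entropy:.3f}  base: {base_entropy:.3f}  sharp: {sharp_entropy:.3f}")
```

The top token's probability rises as the entropy falls, which is the quantity all three methods in the paper are described as driving down.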

If this is right

  • Reinforcement learning for reasoning can succeed with a reward signal consisting solely of negative entropy.
  • Inference-time logit adjustment can deliver large gains with zero parameter updates or extra training data.
  • Pretrained models already encode much of the knowledge needed for hard reasoning tasks once uncertainty is reduced.
  • Self-generated unlabeled data suffices to improve performance on math, physics, and coding benchmarks.
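
The inference-time bullet above can be illustrated with a small sketch that lowers the entropy of a single next-token distribution by gradient descent on the logits alone, leaving model weights untouched. The page does not describe the paper's exact EM-INF procedure, so everything here is an assumption: the function name `sharpen_logits`, the target-entropy stopping rule, the step size, and the example logits are hypothetical. The gradient used is the standard identity dH/dz_i = -p_i (log p_i + H) for H the entropy of softmax(z).

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sharpen_logits(logits, target_entropy=0.3, lr=0.5, max_steps=200):
    """Hypothetical helper: gradient descent on the entropy of softmax(logits).

    Stops once the entropy reaches target_entropy or max_steps is exhausted.
    Only the logits are changed; no model parameters are involved.
    """
    z = list(logits)
    for _ in range(max_steps):
        p = softmax(z)
        h = entropy(p)
        if h <= target_entropy:
            break
        # dH/dz_i = -p_i * (log p_i + H)
        grad = [-pi * (math.log(pi) + h) for pi in p]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

# Illustrative logits, not taken from any model.
original = [2.0, 1.5, 0.5, 0.0]
adjusted = sharpen_logits(original)

h_before = entropy(softmax(original))
h_after = entropy(softmax(adjusted))
print(f"entropy before: {h_before:.3f}  after: {h_after:.3f}")
```

In this sketch the most likely token stays the same and only gains probability, which is why such an adjustment can at best surface what the model already prefers.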

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy-minimization signal might be combined with existing sampling methods to further reduce compute on difficult problems.
  • If high-confidence outputs are mostly correct, the method could shrink the volume of human labels required for future reasoning models.
  • On tasks where models hold strong but incorrect priors, entropy minimization might need an external accuracy check to avoid locking in mistakes.

Load-bearing premise

Concentrating probability mass on the model's most confident outputs will raise actual reasoning accuracy rather than reinforcing pre-existing errors or hallucinations.

What would settle it

A controlled run in which entropy minimization is applied to a model whose high-confidence outputs are known to contain systematic errors, followed by measurement of whether error rate rises or falls on a held-out reasoning benchmark.
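
A toy version of that test can be written in a few lines. The sketch below sharpens two hypothetical answer distributions, one whose most likely answer is correct and one whose most likely answer is wrong, and compares the probability assigned to the correct answer before and after. The distributions, the temperature of 0.5, and the use of temperature scaling as a stand-in for entropy minimization are all assumptions for illustration; this is not an experiment from the paper.

```python
def sharpen(probs, temperature):
    """Raise probabilities to the power 1/temperature and renormalize.

    A temperature below 1 concentrates mass on the most likely answer.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [q / total for q in powered]

def expected_accuracy(probs, correct_index):
    """Probability that a single sample lands on the correct answer."""
    return probs[correct_index]

# Hypothetical distributions over three candidate answers; index 0 is correct.
mode_is_correct = [0.5, 0.3, 0.2]
mode_is_wrong = [0.3, 0.5, 0.2]

helped = expected_accuracy(sharpen(mode_is_correct, 0.5), 0)
hurt = expected_accuracy(sharpen(mode_is_wrong, 0.5), 0)

print(f"mode correct: 0.50 -> {helped:.3f}")
print(f"mode wrong:   0.30 -> {hurt:.3f}")
```

Under these assumptions sharpening helps in the first case and hurts in the second, which is the asymmetry the proposed controlled run would measure on real benchmarks.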

Original abstract

Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
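
To make "negative entropy as the only reward" concrete, the sketch below scores several sampled answers by the negative of their mean per-token entropy and then turns the scores into advantages with a leave-one-out baseline. Both choices are assumptions: the abstract does not say how entropy is aggregated over a trajectory, and the leave-one-out baseline is borrowed from the RLOO family named above as a baseline, not taken from the paper's EM-RL estimator. The distributions are invented for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def negative_entropy_reward(step_distributions):
    """Assumed reward for one trajectory: minus the mean per-token entropy."""
    entropies = [token_entropy(p) for p in step_distributions]
    return -sum(entropies) / len(entropies)

def leave_one_out_advantages(rewards):
    """Subtract from each reward the mean reward of the other samples."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Hypothetical per-token distributions for three sampled answers to one prompt.
trajectories = [
    [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],  # confident sample
    [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]],    # uncertain sample
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],    # in between
]

rewards = [negative_entropy_reward(t) for t in trajectories]
advantages = leave_one_out_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward {r:.3f}  advantage {a:+.3f}")
```

No label or verifier appears anywhere in the computation: the confident sample gets a positive advantage whether or not its answer is right.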

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that entropy minimization (EM), applied via fine-tuning on unlabeled model outputs (EM-FT), reinforcement learning with negative entropy reward (EM-RL), or inference-time logit adjustment (EM-INF), substantially improves LLM performance on math, physics, and coding benchmarks without any labeled data. On Qwen-7B, EM-RL matches or exceeds GRPO and RLOO trained on 60K labeled examples; on Qwen-32B, EM-INF matches or exceeds GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on SciCode while being 3x more efficient than self-consistency.

Significance. If the empirical gains hold after controlling for error amplification, the result would indicate that many pretrained LLMs already encode substantial reasoning capability that can be elicited by sharpening probability mass on the model's own high-confidence generations, reducing reliance on labeled data for reasoning improvement and offering a simple, efficient alternative to RL-based methods.

major comments (2)
  1. [§4 and Table 2] §4 (Experiments) and Table 2: the central claim that EM-RL improves accuracy without labeled data rests on benchmark gains over GRPO/RLOO, but the manuscript provides no error-analysis breakdown (e.g., per-problem accuracy stratified by whether the initial high-probability generation was correct or incorrect). Without this, it is impossible to distinguish genuine reasoning improvement from confidence amplification around pre-existing errors.
  2. [§3.2] §3.2 (EM-RL formulation): the reward is defined solely as negative entropy over sampled trajectories; the paper does not report whether the policy updates increase the probability of correct answers on held-out problems where the base model initially assigns low probability to the correct solution, which is required to substantiate the claim that EM surfaces latent correct reasoning rather than reinforcing confident mistakes.
minor comments (2)
  1. [Figure 3 and §5.1] Figure 3 caption and §5.1: the efficiency comparison with self-consistency should explicitly state the number of samples and temperature settings used for the baseline to allow direct replication.
  2. [§2] §2 (Related Work): the discussion of prior entropy-regularization methods in RLHF omits citation to the original entropy-regularized policy gradient literature (e.g., Haarnoja et al., 2018) that would clarify the novelty of using entropy minimization as the sole objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and commit to revisions that strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [§4 and Table 2] §4 (Experiments) and Table 2: the central claim that EM-RL improves accuracy without labeled data rests on benchmark gains over GRPO/RLOO, but the manuscript provides no error-analysis breakdown (e.g., per-problem accuracy stratified by whether the initial high-probability generation was correct or incorrect). Without this, it is impossible to distinguish genuine reasoning improvement from confidence amplification around pre-existing errors.

    Authors: We agree that a stratified error analysis is necessary to distinguish genuine reasoning gains from error amplification. In the revised manuscript we will add this breakdown for the Qwen-7B experiments, reporting accuracy changes separately for problems where the base model’s initial high-probability output was correct versus incorrect. This will be presented as an additional table in §4. revision: yes

  2. Referee: [§3.2] §3.2 (EM-RL formulation): the reward is defined solely as negative entropy over sampled trajectories; the paper does not report whether the policy updates increase the probability of correct answers on held-out problems where the base model initially assigns low probability to the correct solution, which is required to substantiate the claim that EM surfaces latent correct reasoning rather than reinforcing confident mistakes.

    Authors: We acknowledge the value of directly measuring probability shifts on held-out problems where the base model initially assigns low probability to the correct answer. We will run the requested analysis and report the resulting probability increases for correct solutions after EM-RL training. These results will be added to §3.2 of the revised manuscript. revision: yes
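
The stratified breakdown discussed in the first response could be computed along the following lines. This is a hedged sketch only: the record fields `base_correct` and `after_correct`, the function name, and the seven example records are hypothetical and invented for illustration, not results from the paper.

```python
from collections import defaultdict

def stratified_accuracy(records):
    """Group problems by base-model correctness; report post-training accuracy per group."""
    buckets = defaultdict(list)
    for rec in records:
        key = "base_correct" if rec["base_correct"] else "base_incorrect"
        buckets[key].append(1 if rec["after_correct"] else 0)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Hypothetical evaluation records (made up; not data from the paper).
records = [
    {"id": "p1", "base_correct": True, "after_correct": True},
    {"id": "p2", "base_correct": True, "after_correct": True},
    {"id": "p3", "base_correct": True, "after_correct": False},
    {"id": "p4", "base_correct": False, "after_correct": True},
    {"id": "p5", "base_correct": False, "after_correct": False},
    {"id": "p6", "base_correct": False, "after_correct": False},
    {"id": "p7", "base_correct": False, "after_correct": False},
]

report = stratified_accuracy(records)
print(report)
```

A high accuracy in the `base_incorrect` group would suggest the method recovers answers the base model initially missed, while a low one would point toward confidence amplification.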

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

Full rationale

The paper's central claims consist of empirical performance comparisons (EM-RL matching GRPO/RLOO on math/physics/coding tasks, EM-INF matching proprietary models on SciCode) using public benchmarks and external baselines. No derivation chain, equations, or self-citations are presented that reduce the reported improvements to quantities defined in terms of the method's own fitted parameters or prior author work by construction. The approach is self-contained against external evaluation, with results driven by held-out test performance rather than internal redefinitions or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and relies on standard LLM training and RL practices; no explicit free parameters, axioms, or invented entities are introduced or fitted in the abstract.

pith-pipeline@v0.9.0 · 5772 in / 1287 out tokens · 63551 ms · 2026-05-18T15:52:43.392752+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

    cs.AI 2026-03 conditional novelty 8.0

    SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

  2. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  3. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  4. Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models

    eess.AS 2026-05 unverdicted novelty 7.0

    Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.

  5. The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?

    cs.CL 2026-03 unverdicted novelty 7.0

    The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.

  6. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  7. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  8. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  9. Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  10. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  11. Can LLMs Learn to Reason Robustly under Noisy Supervision?

    cs.LG 2026-04 conditional novelty 6.0

    Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...

  12. Multi-Token Prediction via Self-Distillation

    cs.CL 2026-02 unverdicted novelty 6.0

    Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.

  13. Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

    cs.LG 2025-09 unverdicted novelty 6.0

    Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.

  14. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

  15. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  16. A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning

    cs.LG 2025-10 unverdicted novelty 5.0

    SePT enables LLMs to improve math reasoning on multiple benchmarks by iteratively training on their own low-temperature generated responses using an online data refresh mechanism.

  17. Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

    cs.LG 2025-09 conditional novelty 5.0

    The paper identifies confounds in RLVR evaluations that inflate apparent gains and proposes a minimum standard for budget-matched, contamination-aware assessment with calibration tracking.

  18. Failure Modes of Maximum Entropy RLHF

    cs.LG 2025-09 unverdicted novelty 5.0

    Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

  19. Self-Aligned Reward: Towards Effective and Efficient Reasoners

    cs.LG 2025-09 unverdicted novelty 5.0

    Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

  20. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  21. A Survey of Reinforcement Learning for Large Reasoning Models

    cs.CL 2025-09 accept novelty 3.0

    A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 19 Pith papers · 29 internal anchors

  1. [1]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024

  2. [2]

    Prompt-reverse inconsistency: Llm self-inconsistency beyond generative randomness and prompt paraphrasing

    Jihyun Janice Ahn and Wenpeng Yin. Prompt-reverse inconsistency: Llm self-inconsistency beyond generative randomness and prompt paraphrasing. arXiv preprint arXiv:2504.01282, 2025

  3. [3]

    Ext5: Towards extreme multi-task scaling for transfer learning

    Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952, 2021

  4. [4]

    Llm stability: A detailed analysis with some surprises

    Berk Atil, Alexa Chittams, Liseng Fu, Ferhan Ture, Lixinyu Xu, and Breck Baldwin. Llm stability: A detailed analysis with some surprises. arXiv preprint arXiv:2408.04667, 2024

  5. [5]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  6. [6]

    Inference-time scaling for complex tasks: Where we stand and what lies ahead

    Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, et al. Inference-time scaling for complex tasks: Where we stand and what lies ahead. arXiv preprint arXiv:2504.00294, 2025

  7. [7]

    Theoretical guarantees on the best-of-n alignment policy

    Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D’Amour, Jacob Eisenstein, Chirag Nagpal, and Ananda Theertha Suresh. Theoretical guarantees on the best-of-n alignment policy. arXiv preprint arXiv:2401.01879, 2024

  8. [8]

    Mixmatch: A holistic approach to semi-supervised learning

    David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems, 32, 2019

  9. [9]

    Parameter-free online test-time adaptation

    Malik Boudiaf, Romain Mueller, Ismail Ben Ayed, and Luca Bertinetto. Parameter-free online test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8344–8353, 2022

  10. [10]

    Sparks of artificial general intelligence: Early experiments with gpt-4, 2023

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023

  11. [11]

    Teaching large language models to self-debug

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations. URL https://openreview.net/forum?id=KuPixIqPiq

  13. [13]

    Elements of information theory

    Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999

  14. [14]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

  15. [15]

    Beyond exact gradients: Convergence of stochastic soft-max policy gradient methods with entropy regularization

    Yuhao Ding, Junzi Zhang, Hyunin Lee, and Javad Lavaei. Beyond exact gradients: Convergence of stochastic soft-max policy gradient methods with entropy regularization. arXiv preprint arXiv:2110.10117, 2021

  16. [16]

    Diversity-aware buffer for coping with temporally correlated data streams in online test-time adaptation

    Mario Döbler, Florian Marencke, Robert A Marsden, and Bin Yang. Diversity-aware buffer for coping with temporally correlated data streams in online test-time adaptation. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7665–7669. IEEE, 2024

  17. [17]

    RAFT: Reward ranked finetuning for generative foundation model alignment

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=m7p5O7zblY

  18. [18]

    The entropy of markov trajectories

    Laura Ekroot and Thomas M Cover. The entropy of markov trajectories. IEEE Transactions on Information Theory, 39(4):1418–1421, 1993

  19. [19]

    Maximum entropy rl (provably) solves some robust rl problems

    Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257, 2021

  20. [20]

    Hierarchical neural story generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Annual Meeting of the Association for Computational Linguistics, 2018. URL https://api.semanticscholar.org/CorpusID:44134226

  21. [21]

    Generalizing Skills with Semi-Supervised Reinforcement Learning

    Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429, 2016

  22. [22]

    Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

  23. [23]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

  24. [24]

    Semi-supervised learning by entropy minimization

    Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. Advances in neural information processing systems, 17, 2004

  25. [25]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  26. [26]

    A baseline for few-shot image classification

    Guneet S Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Representations (ICLR), volume 10, 2020

  27. [27]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  28. [28]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

  29. [29]

    OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedi...

  30. [30]

    doi: 10.18653/v1/2024.acl-long.211

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.211. URL https://aclanthology.org/2024.acl-long.211/

  31. [31]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

  32. [32]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019

  33. [33]

    Self-improvement in language models: The sharpening mechanism

    Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T Ash, and Akshay Krishnamurthy. Self-improvement in language models: The sharpening mechanism. arXiv preprint arXiv:2412.01951, 2024

  34. [34]

    Large Language Models Can Self-Improve

    Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022

  35. [35]

    Svqn: Sequential variational soft q-learning networks

    Shiyu Huang, Hang Su, Jun Zhu, and Ting Chen. Svqn: Sequential variational soft q-learning networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1xPh2VtPB

  36. [36]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  37. [38]

    Can language models reason about individualistic human values and preferences?, 2024

    Liwei Jiang, Taylor Sorensen, Sydney Levine, and Yejin Choi. Can language models reason about individualistic human values and preferences?, 2024. URL https://arxiv.org/abs/2410.03868

  38. [39]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  39. [40]

    Optimal control as a graphical model inference problem

    Hilbert J Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine learning, 87:159–182, 2012

  40. [41]

    Infonerf: Ray entropy minimization for few-shot neural volume rendering

    Mijeong Kim, Seonguk Seo, and Bohyung Han. Infonerf: Ray entropy minimization for few-shot neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12912–12921, 2022

  41. [42]

    Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks

    Dong-Hyun Lee. Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks. 2013. URL https://api.semanticscholar.org/CorpusID:18507866

  42. [43]

    Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

    Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018

  43. [44]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun ...

  44. [45]

    Numinamath

    Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/ project-numina/aimo-progress-prize/blob/main/report/nu...

  45. [46]

    Revisiting self-consistency from dynamic distributional alignment perspective on answer aggregation

    Yiwei Li, Ji Zhang, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, et al. Revisiting self-consistency from dynamic distributional alignment perspective on answer aggregation. arXiv preprint arXiv:2502.19830, 2025

  46. [47]

    Inference-time scaling for generalist reward modeling

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025

  47. [48]

    Reft: Reasoning with reinforced fine-tuning

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024

  48. [49]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  49. [50]

    Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction

    Robert A Marsden, Mario Döbler, and Bin Yang. Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2555–2565, 2024

  50. [51]

    The role of baselines in policy gradient optimization

    Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, and Dale Schuurmans. The role of baselines in policy gradient optimization. Advances in Neural Information Processing Systems, 35:17818–17830, 2022

  51. [52]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  52. [53]

    Information-theoretic semi-supervised metric learning via entropy regularization

    Gang Niu, Bo Dai, Makoto Yamada, and Masashi Sugiyama. Information-theoretic semi-supervised metric learning via entropy regularization. Neural Computation, 26:1717–1762. URL https://api.semanticscholar.org/CorpusID:15064396

  54. [55]

    Realistic evaluation of deep semi-supervised learning algorithms

    Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. Advances in neural information processing systems, 31, 2018

  55. [56]

    Predictable reinforcement learning dynamics through entropy rate minimization

    Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, and Javier Alonso-Mora. Predictable reinforcement learning dynamics through entropy rate minimization. arXiv preprint arXiv:2311.18703, 2023

  56. [57]

    Language model self-improvement by reinforcement learning contemplation

    Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, and Yang Yu. Language model self-improvement by reinforcement learning contemplation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=38E4yUbrgr

  57. [58]

    Surf: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning

    Jongjin Park, Younggyo Seo, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. Surf: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning. arXiv preprint arXiv:2203.10050, 2022

  58. [59]

    The entropy enigma: Success and failure of entropy minimization

    Ori Press, Ravid Shwartz-Ziv, Yann LeCun, and Matthias Bethge. The entropy enigma: Success and failure of entropy minimization. arXiv preprint arXiv:2405.05012, 2024

  59. [60]

    Recursive introspection: Teaching language model agents how to self-improve

    Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 37:55249–55285, 2024

  60. [61]

    Improving language understanding by generative pre-training

    Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018. URL https://api.semanticscholar.org/CorpusID:49313245

  61. [62]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  62. [63]

    Unsupervised domain adaptation using feature-whitening and consensus loss

    Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, and Elisa Ricci. Unsupervised domain adaptation using feature-whitening and consensus loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9471–9480, 2019

  63. [64]

    If your data distribution shifts, use self-learning

    Evgenia Rusak, Steffen Schneider, George Pachitariu, Luisa Eck, Peter Gehler, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. If your data distribution shifts, use self-learning. arXiv preprint arXiv:2104.12928, 2021

  64. [65]

    Semi-supervised domain adaptation via minimax entropy

    Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8050–8058, 2019

  65. [66]

    Beyond chinchilla-optimal: Accounting for inference in language model scaling laws

    Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448, 2023

  66. [67]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  67. [68]

    Rewarding progress: Scaling automated process verifiers for llm reasoning

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024

  68. [69]

    Scaling test-time compute without verification or rl is suboptimal

    Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or rl is suboptimal. arXiv preprint arXiv:2502.12118, 2025

  69. [70]

    A mathematical theory of communication

    Claude E Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948

  70. [71]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  71. [72]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  72. [73]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  73. [74]

    Test-time prompt tuning for zero-shot generalization in vision-language models

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022

  74. [75]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314

  75. [76]

    On the self-verification limitations of large language models on reasoning and planning tasks

    Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks. arXiv preprint arXiv:2402.08115, 2024

  76. [77]

    Learning to summarize from human feedback

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. URL https://arxiv.org/abs/2009.01325

  78. [79]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0b...

  79. [80]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

  80. [81]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

Showing first 80 references.