arxiv: 2505.15134 · v1 · pith:PT6R2GSCnew · submitted 2025-05-21 · 💻 cs.LG · cs.AI

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng

Pith reviewed 2026-05-18 15:52 UTC · model grok-4.3

keywords: entropy minimization, LLM reasoning, unlabeled data, reinforcement learning, inference-time optimization, math reasoning, coding benchmarks, SciCode

The pith

Entropy minimization on a model's own outputs improves LLM reasoning without any labeled data and, in its inference-time form, without parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that entropy minimization alone can substantially raise large language model performance on math, physics, and coding problems by concentrating probability on the model's most confident generations. This is tested through three concrete methods: token-level fine-tuning on self-generated unlabeled data, reinforcement learning that uses negative entropy as the sole reward, and inference-time logit adjustment with no training at all. On the Qwen-7B model the reinforcement-learning version reaches or surpasses strong RL baselines trained on 60,000 labeled examples. The inference-time version lets Qwen-32B match or beat GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the SciCode benchmark while using one-third the compute of self-consistency. The work therefore claims that substantial reasoning ability already exists inside many pretrained models and can be surfaced simply by reducing predictive uncertainty.

Core claim

Entropy minimization trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' performance on challenging math, physics, and coding tasks. We explore three approaches: EM-FT minimizes token-level entropy similarly to instruction finetuning but on unlabeled outputs drawn from the model; EM-RL uses reinforcement learning with negative entropy as the only reward; and EM-INF performs inference-time logit adjustment to reduce entropy without training or updates. On Qwen-7B, EM-RL without labeled data achieves comparable or better performance than GRP

What carries the argument

Entropy minimization, the objective that concentrates probability mass on the model's own highest-confidence outputs to elicit latent reasoning.
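For reference, the quantity being minimized can be written as the standard Shannon entropy of the next-token distribution. The notation below ($p_\theta$ for the model, $x$ for the prompt, $y$ for a sampled output of length $T$, $\mathcal{V}$ for the vocabulary) is assumed for illustration and is not taken from the paper. The averaged form is one plausible reading of a token-level objective, not necessarily the authors' exact loss.

```latex
H_t = -\sum_{v \in \mathcal{V}} p_\theta(v \mid x, y_{<t}) \, \log p_\theta(v \mid x, y_{<t}),
\qquad 0 \le H_t \le \log |\mathcal{V}|

\mathcal{L}_{\text{token}}(\theta) = \frac{1}{T} \sum_{t=1}^{T} H_t
```

$H_t$ reaches 0 only when all probability mass sits on a single token, which is why pushing it down concentrates probability on outputs the model already favors.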

If this is right

Reinforcement learning for reasoning can succeed with a reward signal consisting solely of negative entropy.
Inference-time logit adjustment can deliver large gains with zero parameter updates or extra training data.
Pretrained models already encode much of the knowledge needed for hard reasoning tasks once uncertainty is reduced.
Self-generated unlabeled data suffices to improve performance on math, physics, and coding benchmarks.
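The inference-time claim in the list above can be illustrated with a small sketch: gradient descent on a single vector of next-token logits that lowers the entropy of the resulting softmax distribution, with no model weights involved. This is a toy under stated assumptions, not the paper's EM-INF procedure. The function names, the step count, the learning rate, and the example logits are all hypothetical.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def minimize_entropy(logits, steps=10, lr=0.5):
    """Gradient descent on the logits only (assumed settings, no weight updates)."""
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        h = entropy(p)
        # For p = softmax(z): dH/dz_j = -p_j * (log p_j + H)
        grad = [-pj * (math.log(pj) + h) if pj > 0.0 else 0.0 for pj in p]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
adjusted = minimize_entropy(logits)
before = entropy(softmax(logits))
after = entropy(softmax(adjusted))
print(f"entropy before: {before:.3f}, after: {after:.3f}")
```

Under this update the highest-probability token receives the largest logit increase, so the top choice is unchanged while the distribution sharpens around it. The same property is what makes the load-bearing premise matter: a confidently wrong top token is sharpened just as readily.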

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entropy-minimization signal might be combined with existing sampling methods to further reduce compute on difficult problems.
If high-confidence outputs are mostly correct, the method could shrink the volume of human labels required for future reasoning models.
On tasks where models hold strong but incorrect priors, entropy minimization might need an external accuracy check to avoid locking in mistakes.

Load-bearing premise

Concentrating probability mass on the model's most confident outputs will raise actual reasoning accuracy rather than reinforcing pre-existing errors or hallucinations.

What would settle it

A controlled run in which entropy minimization is applied to a model whose high-confidence outputs are known to contain systematic errors, followed by measurement of whether error rate rises or falls on a held-out reasoning benchmark.

Original abstract

Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
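To make "negative entropy as the only reward" concrete, here is a minimal sketch of how such a reward could be computed for one sampled trajectory. It assumes the reward is the negated mean of per-token entropies. Whether the paper aggregates per token or per sequence is not stated on this page, so treat the aggregation, the function names, and the toy distributions as assumptions.

```python
import math

def token_entropy(probs):
    # Shannon entropy (nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def negative_entropy_reward(step_distributions):
    """Reward for one sampled trajectory: minus the mean per-token entropy.

    step_distributions holds one probability vector per generated token.
    Mean aggregation is an assumption made for this sketch.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    return -sum(entropies) / len(entropies)

# Hypothetical distributions over a 3-token vocabulary for two sampled outputs
confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]

r_conf = negative_entropy_reward(confident)
r_unc = negative_entropy_reward(uncertain)
print(r_conf > r_unc)
```

No reference answer appears anywhere in the reward, which is the sense in which the method is label-free. The reward is at most zero and is highest for trajectories generated with the least uncertainty.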

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that entropy minimization (EM), applied via fine-tuning on unlabeled model outputs (EM-FT), reinforcement learning with negative entropy reward (EM-RL), or inference-time logit adjustment (EM-INF), substantially improves LLM performance on math, physics, and coding benchmarks without any labeled data. On Qwen-7B, EM-RL matches or exceeds GRPO and RLOO trained on 60K labeled examples; on Qwen-32B, EM-INF matches or exceeds GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on SciCode while being 3x more efficient than self-consistency.

Significance. If the empirical gains hold after controlling for error amplification, the result would indicate that many pretrained LLMs already encode substantial reasoning capability that can be elicited by sharpening probability mass on the model's own high-confidence generations, reducing reliance on labeled data for reasoning improvement and offering a simple, efficient alternative to RL-based methods.

major comments (2)

[§4 and Table 2] §4 (Experiments) and Table 2: the central claim that EM-RL improves accuracy without labeled data rests on benchmark gains over GRPO/RLOO, but the manuscript provides no error-analysis breakdown (e.g., per-problem accuracy stratified by whether the initial high-probability generation was correct or incorrect). Without this, it is impossible to distinguish genuine reasoning improvement from confidence amplification around pre-existing errors.
[§3.2] §3.2 (EM-RL formulation): the reward is defined solely as negative entropy over sampled trajectories; the paper does not report whether the policy updates increase the probability of correct answers on held-out problems where the base model initially assigns low probability to the correct solution, which is required to substantiate the claim that EM surfaces latent correct reasoning rather than reinforcing confident mistakes.
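The stratified breakdown requested in the first major comment could be computed along the following lines. This is a sketch of the analysis, not code from the paper. The record format, the function name, and the example outcomes are hypothetical and do not represent reported results.

```python
def stratified_accuracy(records):
    """Accuracy after entropy minimization, split by base-model correctness.

    records: iterable of (base_correct, em_correct) booleans, one per problem.
    """
    strata = {True: [], False: []}
    for base_correct, em_correct in records:
        strata[bool(base_correct)].append(bool(em_correct))
    report = {}
    for base_correct, outcomes in strata.items():
        label = "base_correct" if base_correct else "base_incorrect"
        n = len(outcomes)
        report[label] = {
            "n": n,
            "accuracy_after_em": sum(outcomes) / n if n else None,
        }
    return report

# Hypothetical per-problem outcomes, for illustration only (not results from the paper)
records = [(True, True), (True, True), (True, False),
           (False, True), (False, False), (False, False)]
report = stratified_accuracy(records)
print(report)
```

A high value in the "base_incorrect" stratum would suggest that the method recovers correct answers the base model did not prefer. A value near zero there, combined with gains only in the "base_correct" stratum, would point toward confidence amplification.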

minor comments (2)

[Figure 3 and §5.1] Figure 3 caption and §5.1: the efficiency comparison with self-consistency should explicitly state the number of samples and temperature settings used for the baseline to allow direct replication.
[§2] §2 (Related Work): the discussion of prior entropy-regularization methods in RLHF omits a citation to the original entropy-regularized policy gradient literature (e.g., Haarnoja et al., 2018), which would clarify the novelty of using entropy minimization as the sole objective.
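For readers unfamiliar with the self-consistency baseline named in the Figure 3 comment, the following sketch shows the usual majority-vote form and why the sample count matters for any efficiency comparison. The answers and the sample count are hypothetical. Sampling temperature, which the referee also asks for, affects how the answers are generated and is not modeled here.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent final answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers extracted from N = 5 sampled generations
sampled = ["42", "41", "42", "42", "17"]
answer, agreement = self_consistency(sampled)
print(answer, agreement)
```

Each vote costs one full generation, so the compute of the baseline grows linearly with the number of samples. That is why a claimed efficiency ratio is hard to replicate without knowing that number.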

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and commit to revisions that strengthen the empirical support for our claims.

Referee: [§4 and Table 2] §4 (Experiments) and Table 2: the central claim that EM-RL improves accuracy without labeled data rests on benchmark gains over GRPO/RLOO, but the manuscript provides no error-analysis breakdown (e.g., per-problem accuracy stratified by whether the initial high-probability generation was correct or incorrect). Without this, it is impossible to distinguish genuine reasoning improvement from confidence amplification around pre-existing errors.

Authors: We agree that a stratified error analysis is necessary to distinguish genuine reasoning gains from error amplification. In the revised manuscript we will add this breakdown for the Qwen-7B experiments, reporting accuracy changes separately for problems where the base model’s initial high-probability output was correct versus incorrect. This will be presented as an additional table in §4.
revision: yes

Referee: [§3.2] §3.2 (EM-RL formulation): the reward is defined solely as negative entropy over sampled trajectories; the paper does not report whether the policy updates increase the probability of correct answers on held-out problems where the base model initially assigns low probability to the correct solution, which is required to substantiate the claim that EM surfaces latent correct reasoning rather than reinforcing confident mistakes.

Authors: We acknowledge the value of directly measuring probability shifts on held-out problems where the base model initially assigns low probability to the correct answer. We will run the requested analysis and report the resulting probability increases for correct solutions after EM-RL training. These results will be added to §3.2 of the revised manuscript.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

The paper's central claims consist of empirical performance comparisons (EM-RL matching GRPO/RLOO on math/physics/coding tasks, EM-INF matching proprietary models on SciCode) using public benchmarks and external baselines. No derivation chain, equations, or self-citations are presented that reduce the reported improvements to quantities defined in terms of the method's own fitted parameters or prior author work by construction. The approach is self-contained against external evaluation, with results driven by held-out test performance rather than internal redefinitions or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and relies on standard LLM training and RL practices; no explicit free parameters, axioms, or invented entities are introduced or fitted in the abstract.

pith-pipeline@v0.9.0 · 5772 in / 1287 out tokens · 63551 ms · 2026-05-18T15:52:43.392752+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs... EM-FT minimizes token-level entropy... EM-RL: reinforcement learning with negative entropy as the only reward... EM-INF: inference-time logit adjustment to reduce entropy

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
cs.AI · 2026-03 · conditional · novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
cs.AI · 2026-05 · unverdicted · novelty 7.0

GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL · 2026-05 · unverdicted · novelty 7.0

RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models
eess.AS · 2026-05 · unverdicted · novelty 7.0

Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.

The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
cs.CL · 2026-03 · unverdicted · novelty 7.0

The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
cs.LG · 2026-05 · unverdicted · novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL · 2026-05 · unverdicted · novelty 6.0

RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG · 2026-04 · unverdicted · novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
cs.LG · 2026-04 · unverdicted · novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
cs.LG · 2026-04 · unverdicted · novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

Can LLMs Learn to Reason Robustly under Noisy Supervision?
cs.LG · 2026-04 · conditional · novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...

Multi-Token Prediction via Self-Distillation
cs.CL · 2026-02 · unverdicted · novelty 6.0

Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
cs.LG · 2025-09 · unverdicted · novelty 6.0

Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG · 2026-04 · unverdicted · novelty 5.0

Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
cs.CL · 2026-04 · unverdicted · novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning
cs.LG · 2025-10 · unverdicted · novelty 5.0

SePT enables LLMs to improve math reasoning on multiple benchmarks by iteratively training on their own low-temperature generated responses using an online data refresh mechanism.

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards
cs.LG · 2025-09 · conditional · novelty 5.0

The paper identifies confounds in RLVR evaluations that inflate apparent gains and proposes a minimum standard for budget-matched, contamination-aware assessment with calibration tracking.

Failure Modes of Maximum Entropy RLHF
cs.LG · 2025-09 · unverdicted · novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

Self-Aligned Reward: Towards Effective and Efficient Reasoners
cs.LG · 2025-09 · unverdicted · novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI · 2025-03 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL · 2025-09 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 19 Pith papers · 29 internal anchors

[1]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024

[2]

Prompt-reverse inconsistency: Llm self-inconsistency beyond generative randomness and prompt paraphrasing

Jihyun Janice Ahn and Wenpeng Yin. Prompt-reverse inconsistency: Llm self-inconsistency beyond generative randomness and prompt paraphrasing. arXiv preprint arXiv:2504.01282, 2025

[3]

Ext5: Towards extreme multi-task scaling for transfer learning

Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952, 2021

[4]

Llm stability: A detailed analysis with some surprises

Berk Atil, Alexa Chittams, Liseng Fu, Ferhan Ture, Lixinyu Xu, and Breck Baldwin. Llm stability: A detailed analysis with some surprises. arXiv preprint arXiv:2408.04667, 2024

[5]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

[6]

Inference-time scaling for complex tasks: Where we stand and what lies ahead

Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, et al. Inference-time scaling for complex tasks: Where we stand and what lies ahead. arXiv preprint arXiv:2504.00294, 2025

[7]

Theoretical guarantees on the best-of-n alignment policy

Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D’Amour, Jacob Eisenstein, Chirag Nagpal, and Ananda Theertha Suresh. Theoretical guarantees on the best-of-n alignment policy. arXiv preprint arXiv:2401.01879, 2024

[8]

Mixmatch: A holistic approach to semi-supervised learning

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems, 32, 2019

[9]

Parameter-free online test-time adaptation

Malik Boudiaf, Romain Mueller, Ismail Ben Ayed, and Luca Bertinetto. Parameter-free online test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8344–8353, 2022

[10]

Sparks of artificial general intelligence: Early experiments with gpt-4, 2023

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023

[11]

Teaching large language models to self-debug

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KuPixIqPiq

[13]

Elements of information theory

Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999

[14]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

[15]

Beyond exact gradients: Convergence of stochastic soft-max policy gradient methods with entropy regularization

Yuhao Ding, Junzi Zhang, Hyunin Lee, and Javad Lavaei. Beyond exact gradients: Convergence of stochastic soft-max policy gradient methods with entropy regularization. arXiv preprint arXiv:2110.10117, 2021

[16]

Diversity-aware buffer for coping with temporally correlated data streams in online test-time adaptation

Mario Döbler, Florian Marencke, Robert A Marsden, and Bin Yang. Diversity-aware buffer for coping with temporally correlated data streams in online test-time adaptation. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7665–7669. IEEE, 2024

[17]

RAFT: Reward ranked finetuning for generative foundation model alignment

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=m7p5O7zblY

[18]

The entropy of markov trajectories

Laura Ekroot and Thomas M Cover. The entropy of markov trajectories. IEEE Transactions on Information Theory, 39(4):1418–1421, 2002

[19]

Maximum entropy rl (provably) solves some robust rl problems

Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257, 2021

[20]

Hierarchical neural story generation

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Annual Meeting of the Association for Computational Linguistics, 2018. URL https://api.semanticscholar.org/CorpusID:44134226

[21]

Generalizing Skills with Semi-Supervised Reinforcement Learning

Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429, 2016

[22]

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

[23]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

[24]

Semi-supervised learning by entropy minimization

Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. Advances in neural information processing systems, 17, 2004

[25]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[26]

A baseline for few-shot image classification

Guneet S Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Representations (ICLR), volume 10, 2020

[27]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

[28]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

[29]

OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedi...

[30]

doi: 10.18653/v1/2024.acl-long.211

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.211. URL https://aclanthology.org/2024.acl-long.211/

[31]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

[32]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019

[33]

Self-improvement in language models: The sharpening mechanism

Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T Ash, and Akshay Krishnamurthy. Self-improvement in language models: The sharpening mechanism. arXiv preprint arXiv:2412.01951, 2024

[34]

Large Language Models Can Self-Improve

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022

[35]

Svqn: Sequential variational soft q-learning networks

Shiyu Huang, Hang Su, Jun Zhu, and Ting Chen. Svqn: Sequential variational soft q-learning networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1xPh2VtPB

[36]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

[38]

Can language models reason about individualistic human values and preferences?, 2024

Liwei Jiang, Taylor Sorensen, Sydney Levine, and Yejin Choi. Can language models reason about individualistic human values and preferences?, 2024. URL https://arxiv.org/abs/2410.03868

[39]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[40]

Optimal control as a graphical model inference problem

Hilbert J Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine learning, 87:159–182, 2012

[41]

Infonerf: Ray entropy minimization for few-shot neural volume rendering

Mijeong Kim, Seonguk Seo, and Bohyung Han. Infonerf: Ray entropy minimization for few-shot neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12912–12921, 2022

[42]

Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks

Dong-Hyun Lee. Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks. 2013. URL https://api.semanticscholar.org/CorpusID:18507866

[43]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018

[44]

Solving quantitative reasoning problems with language models

Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun ...

[45]

Numinamath

Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/ project-numina/aimo-progress-prize/blob/main/report/nu...

[46]

Revisiting self-consistency from dynamic distributional alignment perspective on answer aggregation

Yiwei Li, Ji Zhang, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, et al. Revisiting self-consistency from dynamic distributional alignment perspective on answer aggregation. arXiv preprint arXiv:2502.19830, 2025

[47]

Inference-time scaling for generalist reward modeling

Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025

[48]

Reft: Reasoning with reinforced fine-tuning

Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 3, 2024

[49]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

[50]

Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction

Robert A Marsden, Mario Döbler, and Bin Yang. Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2555–2565, 2024

[51]

The role of baselines in policy gradient optimization

Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, and Dale Schuurmans. The role of baselines in policy gradient optimization. Advances in Neural Information Processing Systems, 35:17818–17830, 2022

[52]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

[53]

Information-theoretic semi-supervised metric learning via entropy regularization

Gang Niu, Bo Dai, Makoto Yamada, and Masashi Sugiyama. Information-theoretic semi-supervised metric learning via entropy regularization. Neural Computation, 26:1717–1762. URL https://api.semanticscholar.org/CorpusID:15064396

[55]

Realistic evaluation of deep semi-supervised learning algorithms

Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. Advances in neural information processing systems, 31, 2018

[56]

Predictable reinforcement learning dynamics through entropy rate minimization

Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, and Javier Alonso-Mora. Predictable reinforcement learning dynamics through entropy rate minimization. arXiv preprint arXiv:2311.18703, 2023

[57]

Language model self-improvement by reinforcement learning contemplation

Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, and Yang Yu. Language model self-improvement by reinforcement learning contemplation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=38E4yUbrgr

[58]

Surf: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning

Jongjin Park, Younggyo Seo, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. Surf: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning. arXiv preprint arXiv:2203.10050, 2022

[59]

The entropy enigma: Success and failure of entropy minimization

Ori Press, Ravid Shwartz-Ziv, Yann LeCun, and Matthias Bethge. The entropy enigma: Success and failure of entropy minimization. arXiv preprint arXiv:2405.05012, 2024

[60]

Recursive introspection: Teaching language model agents how to self-improve

Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 37:55249–55285, 2024

[61]

Improving language understanding by generative pre-training

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018. URL https://api.semanticscholar.org/CorpusID:49313245

[62]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

[63]

Unsupervised domain adaptation using feature-whitening and consensus loss

Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, and Elisa Ricci. Unsupervised domain adaptation using feature-whitening and consensus loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9471–9480, 2019

[64]

If your data distribution shifts, use self-learning

Evgenia Rusak, Steffen Schneider, George Pachitariu, Luisa Eck, Peter Gehler, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. If your data distribution shifts, use self-learning. arXiv preprint arXiv:2104.12928, 2021

[65]

Semi-supervised domain adaptation via minimax entropy

Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8050–8058, 2019

[66]

Beyond chinchilla-optimal: Accounting for inference in language model scaling laws

Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448, 2023

[67]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[68]

Rewarding progress: Scaling automated process verifiers for llm reasoning

Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024

[69]

Scaling test-time compute without verification or rl is suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or rl is suboptimal. arXiv preprint arXiv:2502.12118, 2025

[70]

A mathematical theory of communication

Claude E Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948

[71]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[72]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

[73]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

[74]

Test-time prompt tuning for zero-shot generalization in vision-language models

Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022

[75]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314, 11, 2024

[76]

On the self-verification limitations of large language models on reasoning and planning tasks

Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks. arXiv preprint arXiv:2402.08115, 2024

[77]

Learning to summarize from human feedback

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. URL https://arxiv.org/abs/2009.01325

[79]

Policy gradient methods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0b...

[80]

Policy gradient methods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

[81]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

